
Instance-aware multi-object self-supervision for monocular depth prediction

This paper proposes a self-supervised monocular image-to-depth prediction framework that is trained with an end-to-end photometric loss handling not only 6-DOF camera motion but also 6-DOF moving object instances. Self-supervision is performed by warping the images across a video sequence using the depth and the scene motion, including object instances. One novelty of the proposed method is the use of the multi-head attention of the transformer network, which matches moving objects across time and models their interaction and dynamics. This enables accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks make the assumption of rigid scenes, which largely degrades their performance on dynamic objects. Only a few SOTA papers have accounted for dynamic objects. The proposed method is shown to largely outperform these methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is also shown to outperform SOTA video-to-depth prediction frameworks.


I Introduction

Fig. 1: Qualitative comparison of the proposed method with SOTA methods [33, 25]. The proposed method produces high-quality depth and is able to account for dynamic and small objects in the scene. Compared to the other methods, this example shows that the proposed method better handles dynamic objects, i.e. the pedestrian and the bicycle. This observation is validated with quantitative results in Sec. IV-B.
Fig. 2: The proposed model architecture consisting of the EfficientNet backbone [27], BiFPN [28], the DPC [2] semantic head, the Mask R-CNN instance segmentation head [12], the novel instance pose head, an ego-pose head and a depth head. During training, the FPN features are extracted for the source and target frames. These features are pooled using the proposals of the RPN and the ROI Align modules. The class, bounding box and instance mask heads use only the features of the target frame. The instance pose head uses both source and target frames as input and outputs 6-DOF axis-angle pose parameters for each instance. Similarly, the ego-pose head uses both the source and target frames' FPN features as input and outputs the 6-DOF axis-angle parameters of the ego-pose. The depth head takes the FPN features of the source frame as input and outputs a multi-scale depth.

Monocular depth prediction is a prominent problem in computer vision, with many applications in robotics, AR/VR, autonomous driving and its related downstream tasks, e.g. automatic emergency braking (AEB). While Lidar scans provide high-accuracy depth measurements, generating high-quality depth from a monocular camera is attractive due to the cost-efficiency and availability of such systems. Recent developments based on deep learning methods [39, 11, 26, 1, 4, 25, 31, 33, 5, 24, 19] have demonstrated competitive depth prediction quality. These methods are formulated via self-supervised learning, making use of monocular video during training. Thus, large and accurate ground-truth datasets are not required. One way to design such a training model is to introduce a depth network alongside a pose network that estimates the pose between two video frames. These networks are optimized using a photometric loss between the target image and a set of warped images obtained using the pose and the depth.

A common assumption for self-supervised training is the rigid scene assumption, i.e. a static scene and a moving camera. However, this assumption is often violated by the motion of objects present in the scene. A possible solution is to mask the pixels of dynamic objects using either a learned mask [39], semantic guidance [17] or auto-masking [10, 33]. However, these solutions are merely a workaround to avoid non-rigid scenes, and they subsequently miss data on moving objects that could otherwise further constrain depth prediction. [24, 38, 30, 18, 35] have addressed moving objects in the scene. [30, 18, 35] learn a per-object semantic segmentation mask and a motion field that accounts for the moving object. [24, 38] rely on optical flow, which does not explicitly model the 6-DOF motion of objects separately. These networks are optimized for local rigidity, and the notion of the class and the possible dynamics of each class is not taken into account.

This paper proposes to alleviate this assumption. Non-rigid scenes are learnt by explicitly factorizing the motion into the dominant ego-pose and a piece-wise rigid pose for each dynamic object. Therefore, for static objects, only the ego-pose is used for the warping, while dynamic objects are subject to two transformations using the motion of the camera and the motion of each moving object. The proposed method explicitly models the motion of each object, allowing accurate warping of the scene elements. In order to model this motion, the proposed method makes use of the multi-head attention of the transformer network, which matches moving objects across time and models their interaction and dynamics. This enables accurate and robust pose estimation for each object instance. The proposed method achieves SOTA results on the KITTI benchmark. In summary, the contributions of this paper are:

  • A novel network architecture based on the transformer's multi-head attention that explicitly models the dynamics of moving objects.

  • An accurate and robust per-object pose is obtained by matching and modeling the interaction of the objects across time.

  • High quality depth prediction achieving state-of-the-art results on the KITTI benchmark [9].

  • The demonstration that the KITTI benchmark has a bias favoring static scenes and a method to test the quality of moving object depth prediction.

II Related work

II-A Self-supervised depth prediction

Depth prediction has been successful with self-supervised learning from videos. The seminal work of Zhou et al. [39] introduced the core idea of jointly optimizing the pose and depth networks using image reconstruction and a photometric loss. To account for the ill-posed nature of this task, several works have addressed different challenges. [30, 35, 19] addressed the rigid scene assumption. [11, 26] proposed more robust image reconstruction losses to reject outliers such as occlusions during training. [5] addressed learning the camera parameters for better generalization. [1, 4, 25, 31] addressed the scale ambiguity problem and propose to enforce depth scale and structure consistency. [20, 33] employed test-time refinement by allowing the model parameters to vary dynamically during inference using a photometric loss. Similarly, this paper jointly optimizes the pose and depth using the photometric loss as supervisory signal and specifically addresses the problem of moving objects in the scene.

II-B Camera and object motion factorization

Supervising the depth with a photometric loss is problematic when moving objects are present in the scene. This challenge has gained attention in the literature. A common solution is to disentangle the dominant ego-motion and the object motion. [5, 24, 38] leverage an optical flow network to detect moving objects by comparing the optical flow with depth-based mapping. [30] learns a per-object semantic segmentation mask and a motion field is obtained by factorization of the motion of each mask and the ego-motion. [35] relaxes the problem using local rigidity within a predefined window. [19] leverages the geometric consistency of depth, ego-pose and optical flow and categorises each pixel as either rigid motion, non-rigid/object motion or occluded/non-visible regions. A recent work that is closest to the proposed method is Insta-DM [18]. In that method, the source and target images are masked with semantic masks and an object PoseNet is used to learn the pose from the masked RGB images. Alternatively, the method proposed in this paper factorizes the motion into ego-motion and object-motion and exploits a transformer attention network to perform instance segmentation and learn a per-object motion.

III Method

III-A Problem formulation

The aim of monocular depth prediction is to learn an accurate depth map through a mapping from the target image to the target depth. In self-supervised learning, this model is trained via novel view synthesis by warping a set of source frames to the target frame using the learned depth and the target-to-source pose. Prior methods assume a static scene observed by a camera undergoing ego-motion. This fundamental assumption is often violated when moving objects are present in the scene. A common solution is to mask the dynamic objects' pixels. These solutions aim to preserve the static scene assumption. In this paper, rather than enforcing the rigid scene restriction, a proposition is made to alleviate it. For each pixel, a global rigid-scene pose and a piece-wise rigid pose for each dynamic object are learned. This is more precise and consistent with non-rigid real-world situations. An instance segmentation network [22] is extended to incorporate pose information so that the network learns an additional 6-DOF pose for each instance. Therefore, each instance is represented by its class, bounding box, mask and the additional pose, as illustrated in Fig. 2. The per-instance warping is defined as:

(1)

where the equation involves the number of dynamic object instances and an identity matrix; for simplicity, the homogeneous pose and projection transformations are omitted in Eq. 1. The mask is used to transform only the dynamic object with its own pose, while rigid-scene points are transformed only with the ego-pose. Using Eq. 1, the warped image is obtained by inverse warping.
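Since the symbols of Eq. 1 were lost in extraction, the LaTeX snippet below sketches the form such a per-instance warping typically takes in this line of work; the notation and exact composition are assumptions following SfMLearner/Insta-DM-style formulations, not the authors' exact equation.

```latex
% Hypothetical sketch of a per-instance warping, not the authors' exact Eq. (1).
% p_t: target pixel, D_t: target depth, K: intrinsics, T_{t->s}: ego-pose,
% T^i_{t->s}: pose of instance i, M_i: binary mask of instance i, I: identity
% applied to rigid-scene points. Homogeneous/projection details are omitted.
p_{s} \sim K \, T_{t \rightarrow s}
      \Big( \textstyle\sum_{i=1}^{N} M_i(p_t)\, T^{i}_{t \rightarrow s}
      + \big(1 - \textstyle\sum_{i=1}^{N} M_i(p_t)\big)\, \mathbf{I} \Big)
      \, D_t(p_t) \, K^{-1} p_t
```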

III-B Architecture

In order to explicitly model the motion of moving objects, an instance pose head is introduced into an instance segmentation network. EfficientPS [22] has demonstrated SOTA results for panoptic and instance segmentation and is therefore adopted in this paper for depth prediction. It consists of the EfficientNet backbone [27], BiFPN [28], the Mask R-CNN instance segmentation head [12] and the DPC [2] semantic head. The EfficientNet backbone has demonstrated its success as a task-agnostic feature extractor for nearly all vision tasks. It is easily scalable, allowing a complexity/FLOPS trade-off. The BiFPN allows low-level and high-level feature aggregation, enabling a rich representation that accounts for fine details and more global abstraction at each feature map. During training, the FPN features are extracted for the source and target frames. The two pose heads use both source and target features, while the instance, semantic and depth heads use only the target features. The model architecture is shown in Fig. 2. The additional heads are detailed in the following.
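As a reading aid, the following PyTorch-style sketch summarizes how the heads described above could be composed; the module names, signatures and exact feature routing are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class InstanceAwareDepthModel(nn.Module):
    """Structural sketch of Fig. 2: shared backbone + BiFPN feeding several heads."""
    def __init__(self, backbone, bifpn, instance_head, semantic_head,
                 instance_pose_head, ego_pose_head, depth_head):
        super().__init__()
        self.backbone, self.bifpn = backbone, bifpn
        self.instance_head = instance_head            # Mask R-CNN style: class, box, mask
        self.semantic_head = semantic_head            # DPC semantic head
        self.instance_pose_head = instance_pose_head  # novel per-instance 6-DOF pose
        self.ego_pose_head = ego_pose_head            # dominant camera 6-DOF pose
        self.depth_head = depth_head                  # multi-scale depth

    def forward(self, target, sources):
        f_t = self.bifpn(self.backbone(target))                  # target FPN features
        f_s = [self.bifpn(self.backbone(s)) for s in sources]    # source FPN features
        instances = self.instance_head(f_t)                      # target-only heads
        semantics = self.semantic_head(f_t)
        depth = self.depth_head(f_t)
        ego_poses = [self.ego_pose_head(f_t, f) for f in f_s]    # pose heads use both frames
        obj_poses = [self.instance_pose_head(f_t, f, instances) for f in f_s]
        return depth, instances, semantics, ego_poses, obj_poses
```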

III-B1 Instance pose head

The key idea of this paper is to factorize the motion by explicitly estimating the 6-DOF pose of each object in addition to the dominant ego-pose. In order to accurately estimate this motion, the objects should be matched and tracked temporally and their interactions should be modeled. Inspired by prior work on object tracking [21, 36], a novel instance pose head that extends the instance segmentation is proposed using a transformer module [29]. This head makes use of multi-head attention to learn the association and interaction of the objects across time.

The RPN yields N proposals. The features of each proposal are pooled using a ROI Align module. These features are extracted for the three frames; the input of the instance pose head is therefore a tensor indexed by the batch size, the number of source images and the N proposals. The first operation is to project these features into the transformer embedding: the linear projection layer flattens the last three dimensions and a linear layer learns an embedding for each proposal.

The input of the encoder-decoder transformer is the resulting sequence of proposal features. The transformer-encoder's multi-head attention enables the matching of target frame proposals with the source proposals across time, while the feed-forward layers learn the matched-motion features. For the transformer-decoder, only the target proposals are used as input. Its multi-head attention aggregates the matched-motion features of the encoder onto the target proposals and further learns the interactions of the objects by learning the attention between proposals. Finally, a linear layer predicts the 6-DOF pose per object using the axis-angle convention. The non-maximum suppression used by the object detection head is employed to filter the N proposal poses, keeping only the relevant objects.
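A minimal PyTorch sketch of such an instance pose head is given below; the embedding dimension, number of layers and exact input packing are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class InstancePoseHead(nn.Module):
    """Sketch of a transformer-based instance pose head (hypothetical hyperparameters)."""
    def __init__(self, roi_feat_dim, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(roi_feat_dim, d_model)  # flattened ROI features -> embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.pose = nn.Linear(d_model, 6)              # 6-DOF pose in axis-angle convention

    def forward(self, target_rois, source_rois):
        # target_rois: (B, N, roi_feat_dim) pooled proposal features of the target frame
        # source_rois: (B, S*N, roi_feat_dim) pooled proposal features of the source frames
        tgt = self.embed(target_rois)
        src = self.embed(source_rois)
        # Encoder matches target and source proposals across time; decoder aggregates the
        # matched-motion features onto the target proposals only.
        enc_in = torch.cat([tgt, src], dim=1)
        matched = self.transformer(src=enc_in, tgt=tgt)
        return self.pose(matched)                      # (B, N, 6) per-proposal pose
```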

III-B2 Ego-pose branch

The ego-pose branch estimates the dominant pose of the camera. The architecture is similar to [10]. Since the low-level features that allow matching are usually extracted in the first layers, low-level FPN features of the source and target frames are used as input. This network outputs the 6-DOF pose parameters using the axis-angle convention.

III-B3 Depth branch

The depth branch consists of convolution layers with skip connections from the FPN module, as in [10]. Similar to prior work [39, 10, 33], a multi-scale depth is estimated in order to resolve the issue of gradient locality. The prediction of depth at each scale consists of a convolution followed by a Sigmoid activation. The output of this activation is re-scaled to obtain the depth, with the re-scaling parameters chosen to constrain the depth between a minimum and a maximum value, similar to [10].
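A minimal sketch of this re-scaling, assuming the Monodepth2 [10] convention of mapping the sigmoid output to a bounded inverse depth; the 0.1/100 bounds are that paper's defaults, not values confirmed here.

```python
def disp_to_depth(sigmoid_out, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in (0, 1) to a depth in [min_depth, max_depth]."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * sigmoid_out  # bounded disparity
    return 1.0 / scaled_disp                                      # corresponding depth
```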

To maintain the self-supervised learning setting, a frozen pretrained EfficientPS that was trained on the Cityscapes benchmark [7] is used. This pretrained model achieves strong panoptic and instance segmentation scores on the Cityscapes test benchmark. Since the representation trained for panoptic segmentation may ignore details that are crucial for depth prediction, a duplicate of the backbone and FPN is used for the depth and pose heads. This allows learning features optimized for depth prediction without degrading the performance of the panoptic segmentation heads.

III-C Objective functions

The self-supervised setting casts the depth learning problem as an image reconstruction problem through the reverse warping. Thus, learning the network parameters amounts to minimizing the objective function:

(2)

where n is the number of training examples. The surrogate losses selected to minimize Eq. 2 are:

  • Photometric loss: Following [39, 10, 25], the photometric loss seeks to reconstruct the target image by warping the source images using the static/dynamic poses and the depth. An L1 loss is defined as follows:

    (3)

    where the reverse-warped target image is obtained by Eq. 1. This simple loss is regularized using SSIM [32], which has the similar objective of reconstructing the image. The final photometric loss is defined as:

    (4)
  • Depth smoothness: An edge-aware gradient smoothness constraint is used to regularize the photometric loss. The disparity map is constrained to be locally smooth through an image-edge-weighted penalty, as depth discontinuities often occur at image gradients. This regularization is defined as [14]:

    (5)

The final objective function is defined as:

(6)
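Since the bodies of Eqs. (3)-(6) were lost in extraction, the block below sketches the standard Monodepth2-style [10] forms these losses usually take; the weights α and λ and the exact notation are assumptions, not the authors' values.

```latex
% Hypothetical reconstruction of Eqs. (3)-(6): standard forms only, not the authors' exact losses.
\mathcal{L}_{1} = \lVert I_t - \hat{I}_t \rVert_1                                              % (3)
\mathcal{L}_{ph} = \alpha \, \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_t)}{2}
                 + (1 - \alpha) \, \lVert I_t - \hat{I}_t \rVert_1                              % (4)
\mathcal{L}_{sm} = \lvert \partial_x d_t \rvert \, e^{-\lvert \partial_x I_t \rvert}
                 + \lvert \partial_y d_t \rvert \, e^{-\lvert \partial_y I_t \rvert}            % (5)
\mathcal{L} = \mathcal{L}_{ph} + \lambda \, \mathcal{L}_{sm}                                    % (6)
```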
Method Supervision Resolution Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
SfMlearner [39] M 640x192 0.183 1.595 6.709 0.270 0.734 0.902 0.959
GeoNet [38] M+F 416x128 0.155 1.296 5.857 0.233 0.793 0.931 0.973
CC [24] M+S+F 832x256 0.140 1.070 5.326 0.217 0.826 0.941 0.975
Chen et al. [3] M+S 512x256 0.118 0.905 5.096 0.211 0.839 0.945 0.977
Monodepth2 [10] M 640x192 0.115 0.903 4.863 0.193 0.877 0.959 0.981
SGDepth [17] M+S 1280x384 0.113 0.835 4.693 0.191 0.879 0.961 0.981
SAFENet [6] M+S 640x192 0.112 0.788 4.582 0.187 0.878 0.963 0.983
Insta-DM [18] M+S 640x192 0.112 0.777 4.772 0.191 0.872 0.959 0.982
PackNetSfm [25] M 640x192 0.111 0.785 4.601 0.189 0.878 0.960 0.982
Johnston et al. [15] M 640x192 0.106 0.861 4.699 0.185 0.889 0.962 0.982
Manydepth [33] M+TS 640x192 0.098 0.770 4.459 0.176 0.900 0.965 0.983
Ours M+S 640x192 0.110 0.719 4.486 0.184 0.878 0.964 0.984
TABLE I: Quantitative performance comparison on the KITTI benchmark with the Eigen split [9]. For Abs Rel, Sq Rel, RMSE and RMSE log, lower is better; for δ<1.25, δ<1.25² and δ<1.25³, higher is better. The Supervision column lists the training modalities: (M) raw images, (S) semantic, (F) optical flow, (TS) teacher-student. At test-time, all monocular methods (M) scale the estimated depths with the median of the ground-truth LiDAR. The best scores are bold and the second best are underlined.

IV Experiments

IV-A Setting

  • KITTI benchmark [9]: Following the prior work [39, 37, 33, 10, 31], the Eigen et al. [8] split is used with Zhou et al. [39] pre-processing to remove static frames. For evaluation, the metrics of previous works [8] are used for the depth.

  • Implementation details: PyTorch [23] is used for all models. The networks are trained for 40 epochs (20 for the ablation) with a batch size of 2. The Adam optimizer [16] is used, and as the training proceeds the learning rate is decayed at epoch 15. An exponential moving average of the model parameters is maintained. The SSIM weight and the smoothing regularization weight are kept fixed throughout training. The depth head outputs 4 depth maps; at each scale, the depth is up-scaled to the target image size. The hyperparameters of EfficientPS are defined in [22]. Two source images are used. The input images are resized to 640x192. Two data augmentations are performed: horizontal flips and color jitter (a hypothetical training configuration in this spirit is sketched after this list).
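The sketch below illustrates one way such a schedule could be wired up in PyTorch; every numeric value (base learning rate, decayed learning rate, EMA decay) is a placeholder, since the paper's exact values did not survive extraction.

```python
import torch

def build_training_schedule(model, base_lr=1e-4, lr_after_decay=1e-5, ema_decay=0.999):
    # Adam optimizer with a single learning-rate decay at epoch 15, as described above.
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[15], gamma=lr_after_decay / base_lr)
    # Exponential moving average of the model parameters (decay value is a placeholder).
    ema = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda avg, new, n: ema_decay * avg + (1.0 - ema_decay) * new)
    return optimizer, scheduler, ema
```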

IV-B Results

During the evaluation, the depth is capped to 80 m. To resolve the scale ambiguity, the predicted depth map is multiplied by the median scaling. The results are reported in Table I. The proposed method achieves state-of-the-art (SOTA) performance and outperforms [33] with respect to Sq Rel (0.719 vs. 0.770). As expected, the proposed method is superior to prior works that factorize the motion using optical flow [24, 38], as their estimated motion is only local and does not account for the class and dynamics of the object. Besides, it outperforms similar methods [18] that factorize the motion using a pose for each object. Fig. 1 illustrates the qualitative comparison. As observed, the proposed method enables high-quality depth prediction. Compared to the SOTA methods, it represents dynamic objects well, i.e. the pedestrians crossing the street and the biker on the left. As the network did not mask the dynamic objects during training, dynamic objects are better learned compared to the methods that mask them [33, 25].

IV-B1 Dynamic and static evaluation

In contrast to training, where points are categorized into moving and static object points, testing is performed on all points that have Lidar ground truth. This does not take into account the relevance of the points or their static/dynamic category. Moving objects are crucial for autonomous driving applications. However, with this testing setup, it is not possible to convey how the model performs on moving objects, especially for methods that mask moving objects during training. This raises the question of whether a model trained with the rigid scene assumption learns to represent the depth of dynamic objects even when it is effectively trained only on static ones.

In order to address this question, the static and dynamic performances are evaluated separately. A dynamic mask is used to segment moving objects so that the assessment can be carried out on each category separately. To avoid biasing the evaluation with the EfficientPS mask, the evaluation mask is obtained using an independent Mask R-CNN [12] trained with detectron2 [34]. The first observation is that static points represent the overwhelming majority of test points, which suggests that taking the mean across all points biases the evaluation towards static objects. A better solution is to consider the per-category (static/dynamic) mean. Table II compares the proposed method with the current SOTA video-to-depth prediction method [33]. The proposed method outperforms [33] on dynamic objects by a large margin, while the gap on static objects is small. The results show that the degradation induced by the rigid scene assumption is significant. This exposes a limitation of the current evaluation: the KITTI benchmark is biased towards static scenes. In order to unbias the evaluation, the mean per category is used to balance the influence, and under this metric the proposed method outperforms the video-to-depth prediction method [33] (e.g. Sq Rel 1.267 vs. 1.611). The analysis of Table II and Fig. 1 suggests that models with the rigid scene assumption are still able to predict a depth for moving objects (probably due to the depth smoothness regularization and stationary cars); however, its quality is severely degraded compared to static objects.
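A compact sketch of this per-category evaluation protocol is given below; the array shapes, mask source and metric subset shown are assumptions, with the median ground-truth scaling and the 80 m cap taken from the text.

```python
import numpy as np

def eval_per_category(pred, gt, dynamic_mask, max_depth=80.0):
    """Evaluate depth separately on dynamic and static points, then average per category."""
    valid = (gt > 0) & (gt < max_depth)
    results = {}
    for name, category in (("dynamic", dynamic_mask), ("static", ~dynamic_mask)):
        m = valid & category
        p, g = pred[m], gt[m]
        p = p * np.median(g) / np.median(p)        # median ground-truth scaling
        p = np.clip(p, 1e-3, max_depth)            # cap depth at 80 m
        results[name] = {
            "abs_rel": np.mean(np.abs(g - p) / g),
            "sq_rel": np.mean((g - p) ** 2 / g),
            "rmse": np.sqrt(np.mean((g - p) ** 2)),
        }
    # Per-category mean balances the static/dynamic influence on the final score.
    results["per_category_mean"] = {
        k: 0.5 * (results["dynamic"][k] + results["static"][k]) for k in results["dynamic"]}
    return results
```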

IV-B2 Ablation study

Evaluation Model Abs Rel Sq Rel RMSE RMSE log
All points mean ManyDepth [33] 0.098 0.770 4.459 0.176
Ours 0.110 0.719 4.486 0.184
Only dynamic ManyDepth [33] 0.192 2.609 7.461 0.288
Ours 0.167 1.911 6.724 0.271
Only static ManyDepth[33] 0.085 0.613 4.128 0.150
Ours 0.101 0.624 4.269 0.163
Per category mean ManyDepth [33] 0.139 1.611 5.794 0.219
Ours 0.140 1.267 5.496 0.217
TABLE II: Quantitative performance comparison for dynamic and static objects. The proposed method outperforms the SOTA [33] on dynamic objects by a large margin.
Ablation Backbone Ego-pose input feature Shared backbone Piece-wise rigid pose Abs Rel Sq Rel RMSE RMSE log
A1 Resnet18 [13] Layer5 - - 0.121 0.914 4.890 0.196
A2 EfficientNet-b5 - - 0.132 0.906 4.981 0.205
A3 EfficientNet-b5 - - 0.127 0.983 5.010 0.201
A4 EfficientNet-b5 - - 0.121 0.894 4.886 0.197
A5 EfficientNet-b5 - 0.120 0.925 4.868 0.194
A6 EfficientNet-b5 0.113 0.795 4.689 0.190
A7 EfficientNet-b6 0.110 0.719 4.486 0.184
TABLE III: An ablation study of the proposed method. The evaluation was done on the KITTI benchmark using the Eigen split [8]. As observed, the effect of the backbone is minimal (A1 vs A5), the choice of the input feature for the ego-pose head is significant (A2 vs A3 vs A4), and the performance of the proposed method is obtained mainly by the introduction of the piece-wise rigid pose (A5 vs A6). Increasing the complexity of the model allows better performance and better training stability (A6 vs A7).

Table III illustrates an ablation study performed to validate the contributions of the proposed method. The results strongly suggest that the performance of the proposed network is mainly obtained by the introduction of the motion factorization through the proposed instance pose head. The first observation is that the effect of the backbone and of sharing the backbone is minor (e.g. A1 vs A5). Introducing the piece-wise rigid pose warping (A5 vs A6) induces a clear improvement; this result suggests that the models learn not only an accurate depth but also an accurate instance pose, and demonstrates that the transformer network is able to match and learn the interaction of the objects across time. The model without the piece-wise rigid pose is in the same setting as the other SOTA methods [33, 25]. Despite starting from this lower-performance baseline, the introduction of the dynamic warping enabled the proposed method to achieve SOTA results. Another observation is the sensitivity to the features used for the static ego-pose (A2, A3 and A4): the ego-pose head is sensitive to the choice of feature level, and the best-performing choice is retained.

An interesting observation during training is that the smaller model underfits the data (i.e., the validation loss is lower than the training loss). Its test performance is not stable, so the best model among the 20 epochs is reported for this backbone. In order to resolve this underfitting, the complexity of the model is increased (A7, EfficientNet-b6). This allows better stability of the training loss and of the test performance, and the best results are obtained with this complexity. The additional instance pose head introduces a run-time overhead during training, reflected in the per-epoch training times of A5 and A7 on an RTX 3090. However, the additional runtime applies only to training. At test-time, the depth network requires only a single pass over the image, running at roughly 38 FPS for the smaller model and 34 FPS for the larger one on a single RTX 3090.

V Conclusion

In this paper, a novel instance pose head is introduced for self-supervised monocular depth prediction. This head enables the factorization of the scene's motion, thus alleviating the rigid scene assumption. It is shown to achieve SOTA results on the KITTI benchmark [9]. The ablation study further validates that the multi-head attention of the transformer network predicts an accurate object pose. Moreover, the impact of dynamic motion on this benchmark is exposed, namely the bias towards static objects, where the vast majority of the test pixels correspond to static objects. A mean per static/dynamic category metric is proposed to unbias the assessment.

References

  • [1] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in neural information processing systems 32, pp. 35–45. Cited by: §I, §II-A.
  • [2] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. Advances in neural information processing systems 31. Cited by: Fig. 2, §III-B.
  • [3] P. Chen, A. H. Liu, Y. Liu, and Y. F. Wang (2019) Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2624–2632. Cited by: TABLE I.
  • [4] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. Proceedings of the IEEE International Conference on Computer Vision 2019-October, pp. 7062–7071. External Links: Document, 1907.05820, ISBN 9781728148038, ISSN 15505499 Cited by: §I, §II-A.
  • [5] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7063–7072. Cited by: §I, §II-A, §II-B.
  • [6] J. Choi, D. Jung, D. Lee, and C. Kim (2020) SAFENet: self-supervised monocular depth estimation with semantic-aware feature extraction. arXiv preprint arXiv:2010.02893. Cited by: TABLE I.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §III-B3.
  • [8] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, Vol. 3, pp. 2366–2374. External Links: 1406.2283, ISSN 10495258 Cited by: 1st item, TABLE III.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 3rd item, TABLE I, 1st item, §V.
  • [10] C. Godard, O. M. Aodha, M. Firman, and G. Brostow (2019) Digging into self-supervised monocular depth estimation. Proceedings of the IEEE International Conference on Computer Vision 2019-October (1), pp. 3827–3837. External Links: Document, 1806.01260, ISBN 9781728148038, ISSN 15505499 Cited by: §I, 1st item, §III-B2, §III-B3, TABLE I, 1st item.
  • [11] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova (2019) Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. Proceedings of the IEEE International Conference on Computer Vision 2019-Octob, pp. 8976–8985. External Links: Document, 1904.04998, ISBN 9781728148038, ISSN 15505499 Cited by: §I, §II-A.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: Fig. 2, §III-B, §IV-B1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: TABLE III.
  • [14] P. Heise, S. Klose, B. Jensen, and A. Knoll (2013) Pm-huber: patchmatch with huber regularization for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2360–2367. Cited by: 2nd item.
  • [15] A. Johnston and G. Carneiro (2020) Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 4756–4765. Cited by: TABLE I.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: 2nd item.
  • [17] M. Klingner, J. A. Termöhlen, J. Mikolajczyk, and T. Fingscheidt (2020) Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12365 LNCS, pp. 582–600. External Links: Document, 2007.06936, ISBN 9783030585648, ISSN 16113349 Cited by: §I, TABLE I.
  • [18] S. Lee, S. Im, S. Lin, and I. S. Kweon (2021) Learning monocular depth in dynamic scenes via instance-aware projection consistency. arXiv preprint arXiv:2102.02629. Cited by: §I, §II-B, TABLE I, §IV-B.
  • [19] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille (2019) Every pixel counts++: joint learning of geometry and motion with 3d holistic understanding. IEEE transactions on pattern analysis and machine intelligence 42 (10), pp. 2624–2641. Cited by: §I, §II-A, §II-B.
  • [20] R. McCraith, L. Neumann, A. Zisserman, and A. Vedaldi (2020) Monocular depth estimation with self-supervised instance adaptation. arXiv preprint arXiv:2004.05821. Cited by: §II-A.
  • [21] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2021) Trackformer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702. Cited by: §III-B1.
  • [22] R. Mohan and A. Valada (2021) Efficientps: efficient panoptic segmentation. International Journal of Computer Vision 129 (5), pp. 1551–1579. Cited by: §III-A, §III-B, 2nd item.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: 2nd item.
  • [24] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2019-June, pp. 12232–12241. External Links: Document, 1805.09806, ISBN 9781728132938, ISSN 10636919 Cited by: §I, §I, §II-B, TABLE I, §IV-B.
  • [25] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3D packing for self-supervised monocular depth estimation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2482–2491. External Links: Document, 1905.02693, ISSN 10636919 Cited by: Fig. 1, §I, §II-A, 1st item, TABLE I, §IV-B2, §IV-B.
  • [26] C. Shu, K. Yu, Z. Duan, and K. Yang (2020) Feature-Metric Loss for Self-supervised Learning of Depth and Egomotion. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12364 LNCS, pp. 572–588. External Links: Document, 2007.10603, ISBN 9783030585280, ISSN 16113349 Cited by: §I, §II-A.
  • [27] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. Cited by: Fig. 2, §III-B.
  • [28] M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790. Cited by: Fig. 2, §III-B.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §III-B1.
  • [30] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki (2017) Sfm-net: learning of structure and motion from video. arXiv preprint arXiv:1704.07804. Cited by: §I, §II-A, §II-B.
  • [31] L. Wang, Y. Wang, L. Wang, Y. Zhan, Y. Wang, and H. Lu (2021) Can Scale-Consistent Monocular Depth Be Learned in a Self-Supervised Scale-Invariant Manner?. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12727–12736. Cited by: §I, §II-A, 1st item.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: 1st item.
  • [33] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman (2021) The temporal opportunist: self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1164–1174. Cited by: Fig. 1, §I, §I, §II-A, §III-B3, TABLE I, 1st item, §IV-B1, §IV-B2, §IV-B, TABLE II.
  • [34] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §IV-B1.
  • [35] D. Xu, A. Vedaldi, and J. F. Henriques (2021) Moving slam: fully unsupervised deep learning in non-rigid scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4611–4617. Cited by: §I, §II-A, §II-B.
  • [36] Y. Xu, Y. Ban, G. Delorme, C. Gan, D. Rus, and X. Alameda-Pineda (2021) Transcenter: transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145. Cited by: §III-B1.
  • [37] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia (2018) Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Thirty-Second AAAI conference on artificial intelligence, Cited by: 1st item.
  • [38] Z. Yin and J. Shi (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. External Links: Document, 1803.02276, ISBN 9781538664209, ISSN 10636919 Cited by: §I, §II-B, TABLE I, §IV-B.
  • [39] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January, pp. 6612–6621. External Links: Document, 1704.07813, ISBN 9781538604571 Cited by: §I, §I, §II-A, 1st item, §III-B3, TABLE I, 1st item.