Instance-aware multi-object self-supervision for monocular depth prediction

by Houssem-eddine Boulahbal, et al.

This paper proposes a self-supervised monocular image-to-depth prediction framework trained with an end-to-end photometric loss that handles not only 6-DOF camera motion but also the 6-DOF motion of individual object instances. Self-supervision is performed by warping images across a video sequence using the predicted depth and scene motion, including that of object instances. A novelty of the proposed method is the use of the multi-head attention of a transformer network to match moving objects across time and model their interaction and dynamics, enabling accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks assume a rigid scene, which largely degrades their performance on dynamic objects, and only a few state-of-the-art methods account for them. The proposed method is shown to largely outperform these methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is also shown to outperform state-of-the-art video-to-depth prediction frameworks.
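The core of this style of self-supervision is view synthesis: back-project each target pixel with the predicted depth, apply a rigid 6-DOF transform (camera motion, or a per-instance motion for dynamic objects), reproject into a source frame, and penalize the photometric difference. The sketch below illustrates that warping-based loss in plain NumPy under simplifying assumptions (a single rigid transform `T`, nearest-neighbour sampling instead of bilinear, no occlusion handling); the names `K`, `T`, and the function itself are illustrative, not the paper's actual API.

```python
import numpy as np

def warp_and_photometric_loss(target, source, depth, K, T):
    """Hypothetical sketch of a self-supervised photometric loss.

    target, source : (H, W) grayscale images
    depth          : (H, W) predicted depth for the target view
    K              : (3, 3) camera intrinsics
    T              : (4, 4) rigid transform from target to source frame
    """
    H, W = depth.shape
    # Pixel grid of the target view in homogeneous coordinates, 3 x (H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Back-project to 3D camera points, scaled by the predicted depth.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Rigidly transform the points into the source camera frame.
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src = (T @ cam_h)[:3]
    # Project into the source image plane.
    proj = K @ src
    x = proj[0] / np.clip(proj[2], 1e-6, None)
    y = proj[1] / np.clip(proj[2], 1e-6, None)
    # Nearest-neighbour sampling (real implementations use
    # differentiable bilinear sampling).
    xi = np.clip(np.round(x).astype(int), 0, W - 1)
    yi = np.clip(np.round(y).astype(int), 0, H - 1)
    warped = source[yi, xi].reshape(H, W)
    # L1 photometric error; practical losses combine L1 with SSIM
    # and mask out occluded or out-of-view pixels.
    return np.abs(warped - target).mean()
```

With an identity transform the warp is the identity mapping, so warping an image onto itself yields zero loss; in the paper's setting, each detected object instance would receive its own predicted transform in place of the single `T` used here.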


3D Object Aided Self-Supervised Monocular Depth Estimation

Monocular depth estimation has been actively studied in fields such as r...

Instance-wise Depth and Motion Learning from Monocular Videos

We present an end-to-end joint training framework that explicitly models...

Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

We present an end-to-end joint training framework that explicitly models...

STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model

In this paper, a self-supervised model that simultaneously predicts a se...

Forecasting of depth and ego-motion with transformers and self-supervision

This paper addresses the problem of end-to-end self-supervised forecasti...

Attentional Separation-and-Aggregation Network for Self-supervised Depth-Pose Learning in Dynamic Scenes

Learning depth and ego-motion from unlabeled videos via self-supervision...

D^2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video

Given a monocular video, segmenting and decoupling dynamic objects while...
