
End-to-end depth from motion with stabilized monocular videos

by Clément Pinard, et al.

We propose a depth map inference system for monocular videos, based on a novel navigation dataset that mimics aerial footage from a gimbal-stabilized monocular camera in rigid scenes. Unlike most navigation datasets, the lack of rotation implies an easier structure from motion problem, which can be leveraged for different kinds of tasks such as depth inference and obstacle avoidance. We also propose an architecture for end-to-end depth inference with a fully convolutional network. Results show that, although tied to the camera's intrinsic parameters, the problem is locally solvable and leads to good quality depth prediction.





Code Repositories


PyTorch DepthNet Training on Still Box dataset


1 Introduction

Scene understanding from vision is a core problem for autonomous vehicles, and for UAVs in particular. In this paper we are specifically interested in computing the depth of each pixel from a pair of consecutive images captured by a camera. We assume that the camera's velocity (and thus its movement between two frames) is known, as most UAV flight systems include a speed estimator; this resolves the scale ambiguity.

Solving this problem would make depth-based sense and avoid algorithms available to lightweight embedded systems that only have a monocular camera and cannot directly provide an RGB-D image. Such devices could then go without heavy or power-hungry dedicated sensors such as ToF cameras, LiDAR or infrared emitters/receivers [hitomi20153d], which would greatly lower their autonomy. In addition, besides some of them being unable to operate under sunlight (e.g. IR and ToF), most RGB-D sensors suffer from range limitations and can be inefficient when long-range trajectory planning is needed [hadsell2009learning]. The faster a UAV flies, the longer the range needed to efficiently avoid obstacles. Unlike RGB-D sensors, depth from motion is robust to high speeds, since it is normalized by the displacement between two frames. Given the difficulty of the task, several learning approaches have been proposed to solve it.

A large number of datasets have been developed to support supervised learning and validation for fundamental vision tasks, such as optical flow [geiger2012we, DFIB15, weinzaepfel:hal-00873592], stereo disparity, and even 3D scene flow [menze2015object, MIFDB16]. These different measures can help figure out scene structure and camera motion, but they remain low-level in terms of abstraction. End-to-end learning of a quantity with high semantic value, such as three-dimensional geometry, may be hard with a totally unrestricted monocular camera movement.

We focus on RGB-D datasets that would allow supervised learning of depth, with RGB pairs (preferably with the corresponding displacement) as the input and D as the desired output. Existing RGB-D datasets for learning depth from motion are either unrestricted w.r.t. ego-motion [firman-cvprw-2016, sturm12iros], or simple stereo vision, equivalent to lateral movement [geiger2012we, scharstein2002taxonomy].

We thus propose a new dataset, described in Part 3, which aims at bridging the two by assuming that rotation is canceled, so that the footage contains only random translations.

Figure 1: Camera stabilization can be done via a) a mechanical gimbal or b) dynamic cropping from a fish-eye camera, for drones, or c) hand-held cameras

This assumption about videos without rotation appears realistic for two reasons:

  1. Hardware rotation compensation is mainly a solved problem, even for consumer products, with IMU-stabilized cameras on consumer drones or hand-held steady-cams (Fig 1).

  2. This movement is somewhat related to human vision and the vestibulo-ocular reflex (VOR) [VOR]. The orientation of our eyes is not induced by head rotation: our inner ear, among other biological sensors, allows us to compensate for parasitic rotation when looking in a particular direction.

This assumption dramatically simplifies the link between optical flow and depth and allows much simpler computation. The main benefit is that the dimensionality of the camera movement is reduced from 6 (translation and rotation) to 3 (translation only). However, as discussed in Part 4, depth is not computed as simply as with stereo vision, and requires computing higher abstractions to avoid a possible indeterminate form, especially for forward movements.

Using the proposed dataset, we then show that depth can be learned as an end-to-end problem, just like other usual Deep Learning problems. With a trained artificial neural network, we achieve much better depth accuracy than flow-based methods, and we are confident this can be efficiently leveraged for sense and avoid algorithms.

2 Related Work

2.1 Monocular vision based sense and avoid

Sense and avoid problems are mostly approached using a dedicated sensor for 3D analysis. However, some work has tried to leverage optical flow from a monocular camera [souhila2007optical, zingg2010mav]. These works highlight the difficulty of estimating depth solely from flow, especially when the camera is pointed in the direction of movement. One can note that rotation compensation was already used with a fish-eye camera in order to have a more direct link between flow and depth. Another work [coombs1998real] also demonstrated that basic obstacle avoidance could be achieved in cluttered environments such as a closed room.

Some interesting work on obstacle avoidance from a monocular camera [lecun2005off, hadsell2009learning, michels2005high] showed that single-frame analysis can be more efficient than depth from stereo for path planning. However, these works were not applied to UAVs, for which depth cannot be directly deduced from the distance to the horizon, because obstacles and paths are now three-dimensional.

More recently, Giusti et al. [giusti2016machine] showed that a monocular system can be trained to follow a hiking path. But once again, only 2D movement is addressed: a UAV going forward is asked to change its yaw based on the likelihood of following a traced path.

2.2 Depth inference

Deep Learning and Convolutional Neural Networks have recently been widely used for numerous kinds of vision problems, such as classification [krizhevsky2012imagenet] and hand-written digit recognition [lecun1998gradient].

Depth from vision is one of the problems studied with neural networks, and has been addressed not only with image pairs, but also with single images [eigen2014depth, saxena2005learning]. Depth inference from stereo has also been widely studied [luo2016efficient, zbontar2015computing], and not necessarily in a supervised way [DBLP:journals/corr/KondaM13, DBLP:journals/corr/GargBR16].

Current state-of-the-art methods for depth from a monocular view tend to use motion, and especially structure from motion, and most algorithms do not rely on deep learning [cadena2016past, mur2016orb, klein2007parallel]. Prior knowledge w.r.t. the scene is used to infer a sparse depth map whose density usually grows over time. These techniques, also called SLAM, are typically used with unstructured movement, produce very sparse point-cloud based 3D maps, and require heavy computation to keep track of the scene structure and align newly detected 3D points with the existing ones. SLAM is not widely used for obstacle avoidance, but rather for off-line 3D scanning.

Our goal is to compute a dense (where every point has a valid depth) quality depth map using only two images, and without prior knowledge on the scene and movement, apart from the lack of rotation and the scale factor.

2.3 Navigation datasets

As discussed earlier, numerous datasets exist with depth groundtruth, but to our knowledge, no dataset proposes only translational movement. Some provide IMU data along with frames [smith2009new], which could be used to compensate for rotation, but their small size only allows us to use them as a validation set.

3 Still Box Dataset

Still Box Dataset
image size   number of scenes   total size (GB)
64x64        80K                19
128x128      16K                12
256x256      3.2K               8.5
512x512      3.2K               33
Table 1: dataset sizes
Scenes parameters
field of view
max render distance
primitives number
texture ratio
size range of meshes (m)
distance range of meshes (m)
length (frames)
nominal shift
speed equivalent (for 30fps)
Table 2: datasets parameters
Figure 2: Some examples of our renderings with associated depth maps (red is close, purple is far)

For our dataset, we used the rendering software Blender to generate an arbitrary number of random rigid scenes, composed of basic 3D primitives (cubes, spheres, cones and tori) randomly textured from an image set scraped from Flickr (see Fig 2).

These objects are randomly placed and sized in the scene, so that they are mostly in front of the camera, with possible variations including objects behind the camera, or even the camera inside an object. Scenes in which the camera goes through objects are discarded. To add difficulty, we also applied uniform textures to a proportion of the primitives. Each primitive thus has the same probability (corresponding to the texture ratio) of being textured from a color ramp rather than from a photograph.

Walls are added at large distances, as if the camera was inside a box (hence the name). The camera moves at a fixed speed, but in a random direction (uniform distribution), which is constant for each scene. The movement can be anything from forward/backward to lateral (which is then equivalent to stereo vision). Tables 1 and 2 show a summary of our scene parameters. They can be changed at will, and are stored in a metadata JSON file to keep track of them. Our dataset is composed of 4 sub-datasets with different resolutions, the 64px dataset being the largest in terms of number of samples, and the 512px one being the heaviest in data.
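As an illustration of the scene setup described above, the following minimal sketch draws a constant translation direction uniformly on the unit sphere and stores it with the scene parameters in a JSON string. The function names and the metadata field names (`speed`, `direction`, `length`, `positions`) are hypothetical; they are not taken from the Still Box generation code, and the numeric values are placeholders.

```python
import json
import math
import random


def random_unit_direction(rng):
    """Draw a direction uniformly on the unit sphere (normalised Gaussian sample)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-9:  # reject the (very unlikely) near-zero vector
            return [c / norm for c in v]


def make_scene_metadata(rng, speed, n_frames):
    """Build the metadata record of one scene (hypothetical field names)."""
    direction = random_unit_direction(rng)
    return {
        "speed": speed,          # displacement norm between consecutive frames
        "direction": direction,  # constant for the whole scene
        "length": n_frames,
        # camera position at each frame: straight line at constant speed
        "positions": [[speed * t * c for c in direction] for t in range(n_frames)],
    }


rng = random.Random(0)
meta = make_scene_metadata(rng, speed=1.0, n_frames=10)  # placeholder values
serialized = json.dumps(meta)
```

Normalising a Gaussian sample is one standard way to obtain a uniform direction; it covers forward, backward and lateral movements with the same density.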

4 End-to-end learning of Depth Inference

4.1 Why not disparity?

Flow estimation and disparity (which is essentially the magnitude of optical flow vectors) are problems for which many very convincing methods exist [ilg2016flownet, 2017arXiv170304309K]. Knowing depth and displacement in our dataset, we could easily compute disparity and train a network for it using existing methods. We consider a picture with pixel coordinates and a known optical center.

Definition 1

Disparity at a point is defined as the norm of the flow vector at that point.

Definition 2

The Focus of Expansion (FOE) is defined as the point from which the flow vector of every point is headed. Note that this property is true only when considering no rotation and a rigid scene. One can note that for a pure translation, the FOE is the projection of the displacement vector onto the image plane.

Theorem 1

For a random rotation-less displacement of a pinhole camera, with a given displacement norm and a given focal length, depth is an explicit function of the disparity, the focus of expansion FOE and the optical center.
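To make the statement concrete, here is one explicit form that follows from the pinhole model. The notation is an assumption made for this sketch and is not taken from the text: $f$ is the focal length in pixels, $P_0$ the optical center, $m = (m_x, m_y, m_z)$ the camera translation of norm $V$ between the two frames, $p$ the pixel of a point in the second frame, $\zeta(p)$ its depth in that frame and $\delta(p)$ its disparity.

```latex
% Assumed notation, see the paragraph above. m_z is the component of the
% translation along the optical axis (m_z > 0 for a forward movement).
\[
  \mathrm{FOE} = P_0 + \frac{f}{m_z}\begin{pmatrix} m_x \\ m_y \end{pmatrix},
  \qquad
  |m_z| = \frac{fV}{\sqrt{f^2 + \lVert \overrightarrow{P_0\,\mathrm{FOE}} \rVert^2}}
\]
\[
  \zeta(p) = |m_z|\,\frac{\lVert \overrightarrow{\mathrm{FOE}\,p} \rVert}{\delta(p)} - m_z
\]
```

Under these assumptions, sending the FOE to infinity (purely lateral movement) makes $m_z$ vanish and $|m_z|\,\lVert \overrightarrow{\mathrm{FOE}\,p} \rVert$ tend to $fV$, which gives back the usual stereo relation $\zeta(p) = fV / \delta(p)$.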

This result is in a useful form for limit values. Lateral movement corresponds to a FOE at infinity, and depth is then simply inversely proportional to disparity, as in stereo vision.

When approaching the FOE, knowing that depth is a bounded positive value, we can deduce that disparity tends to 0, while depth is computed from its inverse. As a consequence, small errors in disparity estimation result in diverging depth values near the focus of expansion, which corresponds to the direction the camera is moving in. This is clearly problematic for depth-based obstacle avoidance.
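The sensitivity near the FOE can be checked numerically. The sketch below assumes a forward-moving pinhole camera and the relation depth = m_z * (distance to FOE / disparity - 1), where m_z is the translation along the optical axis; this convention, the function names and all numeric values are assumptions chosen for illustration, not values from the dataset.

```python
import math


def forward_component(foe, p0, f, V):
    """Translation along the optical axis, from the FOE position (forward motion)."""
    d = math.hypot(foe[0] - p0[0], foe[1] - p0[1])
    return f * V / math.sqrt(f * f + d * d)


def disparity_from_depth(p, depth, foe, p0, f, V):
    """Disparity (pixels) of a point seen at pixel p with the given depth."""
    mz = forward_component(foe, p0, f, V)
    r = math.hypot(p[0] - foe[0], p[1] - foe[1])
    return mz * r / (depth + mz)


def depth_from_disparity(p, delta, foe, p0, f, V):
    """Invert the relation above: depth from disparity at pixel p."""
    mz = forward_component(foe, p0, f, V)
    r = math.hypot(p[0] - foe[0], p[1] - foe[1])
    return mz * (r / delta - 1.0)


# Illustrative setup: pure forward movement, so the FOE is the optical center.
f, V = 100.0, 1.0
p0 = foe = (128.0, 128.0)
true_depth = 20.0
noise = 0.1  # disparity error in pixels

errors = {}
for name, p in (("far_from_foe", (228.0, 128.0)), ("near_foe", (130.0, 128.0))):
    delta = disparity_from_depth(p, true_depth, foe, p0, f, V)
    estimate = depth_from_disparity(p, delta + noise, foe, p0, f, V)
    errors[name] = abs(estimate - true_depth)
```

With the same disparity error, the depth error is far larger for the pixel 2 px away from the FOE than for the one 100 px away, which is the behaviour the paragraph above describes.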

Given the random direction of our camera's displacement, computing depth from disparity is therefore much harder than for a classic stereo rig. To tackle this problem, we decided to set up an end-to-end learning workflow, by training a neural network to explicitly predict the depth of every pixel in the scene from an image pair with a constant displacement value.

4.2 Dataset augmentation

We store data as 10-frame videos, with each frame paired with its ground truth depth. This allows us to set the distribution of distances a posteriori, with a variable temporal shift between two frames. For example, with a baseline shift of 3 frames, two consecutive frames (shift of 1) can be treated as a pair at the baseline displacement whose depth is three times as great. In addition, we can also consider negative shifts, which only reverse the displacement direction without changing the speed value compared to the opposite shift. Given a fixed dataset size, this gives more evenly distributed depth values to learn from, and also de-correlates images from depth. This prevents an over-fitting that would turn the network into a scene recognition algorithm performing poorly on a validation set.
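The shift-based augmentation can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `max_shift` parameter and the choice of clamping the rescaled target are hypothetical, and which frame of the pair carries the depth target is left to the caller.

```python
import random


def sample_pair(n_frames, nominal_shift, max_shift, rng):
    """Pick two frame indices separated by a random non-zero shift.

    Returns (first, second, scale), where scale is the factor to apply to the
    ground truth depth so that the pair looks like a nominal-shift pair.
    """
    shift = rng.choice([s for s in range(-max_shift, max_shift + 1) if s != 0])
    lo = max(0, -shift)                    # keep first + shift >= 0
    hi = min(n_frames, n_frames - shift)   # keep first + shift <= n_frames - 1
    first = rng.randrange(lo, hi)
    second = first + shift
    scale = nominal_shift / abs(shift)     # e.g. nominal 3, shift 1 -> depth x3
    return first, second, scale


def depth_target(depth_map, scale, max_depth=100.0):
    """Rescale a ground truth depth map (list of rows) and clamp it."""
    return [[min(d * scale, max_depth) for d in row] for row in depth_map]


rng = random.Random(0)
first, second, scale = sample_pair(n_frames=10, nominal_shift=3, max_shift=3, rng=rng)
target = depth_target([[5.0, 40.0]], scale)
```

A negative shift simply swaps the temporal order of the two frames, which reverses the apparent displacement direction while keeping the same scale factor as the opposite shift.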

4.3 Depth Inference training

Figure 3: DepthNet structure parameters. A typical Conv module is a 3x3 SpatialConv; a typical ConvTranspose module is made of a 4x4 SpatialConvTranspose and a 3x3 SpatialConv. The input image pair goes through Conv1 to Conv6 (each with stride 2), then through the upsampled outputs Up Depth6 to Up Depth3 up to the final depth output, trained with a multi-scale L1 loss.

Our network, called DepthNet and broadly inspired by FlowNetS [DFIB15], is described in Fig 3. FlowNetS was initially used for flow inference. The main idea behind this network is that upsampled feature maps are concatenated with the corresponding earlier convolution outputs. Higher semantic information is thus associated with information more closely linked to pixels (since it went through fewer strided convolutions), and both are then used for reconstruction.

This has proven very efficient for flow and disparity computation while keeping a very simple supervised learning process. The architecture is admittedly very simple, and one could leverage more advanced work on flow and disparity, such as FlowNetC or GC-Net [2017arXiv170304309K], among many others. The main point of this experiment is to show that direct depth estimation can be beneficial when the translation direction is unknown. Like FlowNetS, we use a multi-scale criterion, with an L1 reconstruction error for each scale.



In this criterion, for each scale:

  • the weight of the scale is chosen arbitrarily in our experiments.

  • the error is averaged over the height and width of the output.

  • the target is the depth groundtruth scaled to the output size using average pooling.
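The multi-scale criterion can be sketched as a weighted sum, over scales, of the mean absolute difference between each output and the average-pooled ground truth. The following pure-Python version is an illustration only: the function names are hypothetical and the weights in the example are placeholders, not the values used in the experiments.

```python
def avg_pool(img, k):
    """Average-pool a 2D list by an integer factor k."""
    h, w = len(img) // k, len(img[0]) // k
    return [
        [
            sum(img[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
            for j in range(w)
        ]
        for i in range(h)
    ]


def l1_error(pred, target):
    """Mean absolute error between two 2D lists of the same size."""
    h, w = len(pred), len(pred[0])
    return sum(abs(pred[i][j] - target[i][j]) for i in range(h) for j in range(w)) / (h * w)


def multiscale_l1(preds, weights, groundtruth):
    """Weighted sum of per-scale L1 errors against the pooled ground truth."""
    total = 0.0
    for pred, weight in zip(preds, weights):
        k = len(groundtruth) // len(pred)  # downscaling factor of this output
        target = avg_pool(groundtruth, k) if k > 1 else groundtruth
        total += weight * l1_error(pred, target)
    return total


groundtruth = [[10.0] * 4 for _ in range(4)]
preds = [[[8.0, 8.0], [8.0, 8.0]], [[13.0]]]  # a 2x2 and a 1x1 output
loss = multiscale_l1(preds, weights=[1.0, 0.5], groundtruth=groundtruth)
```

In this toy case the 2x2 output is off by 2 everywhere and the 1x1 output by 3, so the loss is 1.0 * 2 + 0.5 * 3 = 3.5.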

As said earlier, we apply data augmentation to the dataset using different shifts, along with classic methods such as flips and rotations. We also clamp depth to a maximum of 100m, and provide sample pairs without shift, assuming their depth is 100m everywhere.

Figure 4: Results for 64x64 images. Upper-left: input (before being downscaled to 64x64); lower-left: ground truth depth; lower-right: our network output (16x16); upper-right: error (green is no error, red is overestimated depth, blue is underestimated depth)
Figure 5: Result for forward movement, showing that the network is also doing shape identification

Fig 4 shows results from 64px dataset. Like FlowNetS, results are downsampled by a factor of 4, which gives 16x16 Depth Maps.

One can notice that although the network is still fully convolutional, feature map sizes go down to 1x1, where the layers behave exactly like a fully connected layer. This can serve to implicitly figure out the motion direction and spread this information across the outputs. The second noticeable fact is that near the FOE (see Fig 5 for a centered FOE, i.e. perfect forward movement) the network has no problem inferring depth, which means that it uses neighboring disparity and interpolates when no other information is available.
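The 1x1 bottleneck follows directly from stacking six stride-2 convolutions on a 64x64 input. The small sketch below computes the spatial size after each convolution; it assumes padding that halves the size exactly (rounding up), which is an assumption about the layers and not something stated in the text.

```python
def feature_map_sizes(input_size, n_convs=6, stride=2):
    """Spatial size after each strided convolution, input size first."""
    sizes = [input_size]
    for _ in range(n_convs):
        # ceil division: assumes 'same'-style padding on every convolution
        sizes.append(-(-sizes[-1] // stride))
    return sizes


sizes_64 = feature_map_sizes(64)    # [64, 32, 16, 8, 4, 2, 1]
sizes_512 = feature_map_sizes(512)  # [512, 256, 128, 64, 32, 16, 8]
```

For a 64x64 input the last feature map is 1x1 and sees the whole frame, whereas for a 512x512 input it is 8x8, so the deepest layer no longer acts as a single fully connected unit.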

This can be interpreted as the identification of 3D shapes, along with their magnification: pixels belonging to the same shape are deemed to have close and continuous depth values, resulting in a FOE-independent depth inference.

4.4 From 64px to 512px Depth inference

Figure 6: some results on 512x512 images, same color code as for 64x64 input
Figure 7: some results on real image inputs. Top is from Bebop drone footage, bottom is from a gimbal-stabilized smartphone video
Network L1Error RMSE
train test train test
FlowNetS 2.44 4.77
DepthNet 2.44
DepthNet 4.90
Table 3: quantitative results for depth inference networks. FlowNetS is modified with 1 channel outputs (instead of 2 for flow), trained from scratch for depth with Still Box
Network size 980Ti
DepthNet 7.33
FlowNetS N/A N/A
DepthNet 7.33
DepthNet 7.33
DepthNet 7.33 N/A
Table 4: Size (millions of parameters) and inference speeds (fps) on different devices, for several batch sizes (when applicable). A batch size of n means that n depth maps are computed at the same time

One could think that a fully convolutional network such as ours cannot solve depth extraction for pictures larger than 64x64. The reason is that a fully convolutional network applies the same operation to every pixel. For disparity, this makes sense because the problem is essentially one of similarity between shifted pictures: wherever we are in the picture, the operation is the same. For depth inference when the FOE is not at infinity (forward movement is non-negligible), Theorem 1 shows that, once the FOE is known, the operation to apply depends on the distance from the FOE and from the optical center. The only possible strategy for a fully convolutional network would be to compute the position in the frame as well, and to apply the compensating scaling to the output.

This problem then seems very difficult, if not impossible, for a network as simple as ours. Indeed, if we run the training directly on 512x512 images, the network fails to converge to better results than with 64x64 images (while a higher resolution should help precision). However, if we take the converged network and fine-tune it with 512x512 images, we get much better results. Fig 6 shows training results for mean L1 reconstruction error, and shows that our deemed-impossible problem seems to be easily solved with multi-scale fine-tuning. As Table 3 shows, the best results are obtained with multiple fine-tuning steps, with intermediate scales of 64, 128, 256, and finally 512 pixels. Subscript values indicate fine-tuning processes. FlowNetS performs better than DepthNet, but by a fairly small margin, while being 5 times heavier and most of the time much slower, as shown in Table 4.

Fig 7 shows qualitative results from our validation set and from real-condition drone footage, in which we were careful to avoid camera rotation. These results did not benefit from any fine-tuning on real footage, indicating that our Still Box Dataset, although not realistic in its scene structures and rendering, appears to be sufficient for learning to produce decent depth maps in real conditions.

4.5 Quality measurement

As our network leverages the reduced dimensionality of our dataset due to its lack of rotation, it is hard to compare our method to anything else. Disparity estimation is equivalent to a lateral translation, which our network has been trained on, and could be used for comparison with other algorithms, but such a comparison in a reduced context seems unfair against methods designed especially for it.

Other datasets provide ego-motion with 6 DOF, on which our network has not been trained and is certain to give poor results. On the other hand, we could test some SLAM methods, but they work better when applied to long image sequences rather than to image pairs only. In short, our method sets the state of the art, but for a very particular problem that we hope will gain interest with time.

5 UAV navigation use-case

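The network learned depth inference from a moving camera, assuming its velocity is always the same. When running during flight, such a system can easily deduce the real depth map from the drone speed, knowing the speed used for training: since depth from motion is normalized by the displacement between the two frames, the network output only has to be multiplied by the ratio between the actual speed and the training speed.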
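The rescaling can be sketched in a few lines. The function and parameter names below are hypothetical, and the displacement between the two frames is derived from the drone speed, the frame shift and the frame rate, which is an assumption about how the pair is captured; the numbers in the example are placeholders.

```python
def metric_depth(network_output, drone_speed, frame_shift, fps, training_displacement):
    """Rescale a network depth map (list of rows) to metres.

    training_displacement is the camera displacement (m) between the two
    frames of a training pair (hypothetical parameter name).
    """
    actual_displacement = drone_speed * frame_shift / fps  # metres between the two frames
    ratio = actual_displacement / training_displacement
    return [[d * ratio for d in row] for row in network_output]


# Placeholder numbers: 10 m/s, frames 3 apart at 30 fps -> 1 m between frames.
real_depth = metric_depth([[10.0, 50.0]], drone_speed=10.0, frame_shift=3,
                          fps=30.0, training_displacement=0.5)
```

Here the camera moved twice as far between the two frames as during training, so every depth value is doubled.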


One of the drawbacks of this learning method is that the focal length in pixels (focal length divided by sensor size per pixel) of our camera must be the same as the one used in training. However, our dataset creation framework allows us to change this value very easily for training. One must also make sure to use pinhole-equivalent frames, as during training.

5.1 Multiple shifts inference

Depending on the depth distribution of the groundtruth depth map, it may be useful to adjust the frame shift. For example, when flying high above the ground, detecting and avoiding big structures requires knowing precise distance values that are outside the typical range of any RGB-D sensor. The logical strategy would then be to increase the temporal shift between the frame pairs provided to DepthNet as inputs.

More generally, one must ensure a well distributed depth map from 0 to 100m to get high quality depth inference. This problem can be solved with two (among other) solutions:

  • Deduce the optimal shift from the previous inference distribution, e.g. by choosing the shift for which the mean of the output is expected to be 50m (because our network outputs depths from 0 to 100m), given the mean of the previous output.

  • Use batch inference to compute depth with multiple shifts. As shown in Table 4, a batch size greater than 1 can be used to some extent (especially for low resolutions) to efficiently compute multiple depth maps.

    These multiple depth maps can then either be combined to construct a high quality depth map, or used separately to run two different obstacle avoidance algorithms, e.g. one dedicated to long-range path planning (with a high shift value) and the other to reactive, short-range collision avoidance (with a low shift value). While one depth map will display closer areas at zero distance but further regions with precision, the other will set far regions to infinity (or 100m for DepthNet) but render closer regions with high resolution, since flow is lower than with a high shift, and potentially within the range the network has been trained on.
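The first strategy can be sketched as a simple proportional update. The rule below is an assumption derived from the fact that the network output scales inversely with the displacement between the two frames: multiplying the shift by the ratio of the previous mean output to the 50m target should bring the next mean output close to that target. Function and parameter names are hypothetical.

```python
def next_shift(current_shift, previous_output, target_mean=50.0,
               min_shift=1, max_shift=None):
    """Choose the next frame shift from the previous depth output (list of rows)."""
    values = [d for row in previous_output for d in row]
    mean_depth = sum(values) / len(values)
    # Output depth is inversely proportional to the shift, so scale the shift
    # by mean / target to move the next mean output towards the target.
    shift = round(current_shift * mean_depth / target_mean)
    shift = max(min_shift, shift)
    if max_shift is not None:
        shift = min(max_shift, shift)
    return shift


far_scene = [[100.0, 100.0], [100.0, 100.0]]  # everything saturates at 100m
close_scene = [[25.0, 25.0]]
```

With these inputs, `next_shift(2, far_scene)` doubles the shift to 4, and `next_shift(4, close_scene)` halves it to 2; the result is clamped so that it never drops below one frame.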

6 Conclusion and future work

We propose a novel way of computing dense depth maps from motion, along with a very comprehensive dataset for stabilized footage analysis. This algorithm can then be used for depth-based sense and avoid algorithms in a very flexible way, in order to cover all kinds of path planning, from collision avoidance to long-range obstacle bypassing.

Future work includes the implementation of such a path planning algorithm, and the construction of a real-condition fine-tuning dataset, using UAV footage and a preliminary thorough offline 3D scan. This would allow us to measure the quality of our network on real footage quantitatively, and not only subjectively as for now.

We also believe that our network can be extended to reinforcement learning applications that will potentially result in a complete end-to-end sense and avoid neural network for monocular cameras.


Appendix A: Proof of Theorem 1

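For a random rotation-less displacement of norm $V$, depth is an explicit function of disparity, focus of expansion FOE and optical center $P_0$.

We assume no rotation, which means that the FOE is the projection of the camera center at frame B onto the image of frame A.

Let $m = (m_x, m_y, m_z)$ be the displacement of the camera from A to B, with $m_z \neq 0$, and let $f$ be the focal length in pixels. The FOE is then

$$\mathrm{FOE} = P_0 + \frac{f}{m_z}\begin{pmatrix} m_x \\ m_y \end{pmatrix} \qquad (1)$$

Let $P$ be a point with coordinates $(X, Y, Z)$ in the frame of camera B, with $Z > 0$. For camera B we have

$$p_B = P_0 + \frac{f}{Z}\begin{pmatrix} X \\ Y \end{pmatrix} \qquad (2)$$

The relative movement of $P$ is $-m$, so its coordinates in the frame of camera A are $(X + m_x, Y + m_y, Z + m_z)$, and we have

$$p_A = P_0 + \frac{f}{Z + m_z}\begin{pmatrix} X + m_x \\ Y + m_y \end{pmatrix} \qquad (3)$$

If we compute the flow for $x$:

$$p_{B,x} - p_{A,x} = f\,\frac{X m_z - Z m_x}{Z\,(Z + m_z)} = \frac{m_z}{Z + m_z}\left(p_{B,x} - \mathrm{FOE}_x\right) \qquad (4)$$

Similarly, with $y$, we get:

$$p_{B,y} - p_{A,y} = f\,\frac{Y m_z - Z m_y}{Z\,(Z + m_z)} = \frac{m_z}{Z + m_z}\left(p_{B,y} - \mathrm{FOE}_y\right) \qquad (5)$$

We consider disparity $\delta(p_B)$ as the norm of the flow expressed in frame B (which is correlated to depth at this frame). Since $P$ is in front of camera A, $Z + m_z > 0$ and

$$\delta(p_B) = \frac{|m_z|}{Z + m_z}\,\lVert \overrightarrow{\mathrm{FOE}\,p_B} \rVert \qquad (6)$$

Consequently, we can deduce depth at frame B from disparity:

$$Z = |m_z|\,\frac{\lVert \overrightarrow{\mathrm{FOE}\,p_B} \rVert}{\delta(p_B)} - m_z \qquad (7)$$

From our dataset construction, we know that $\lVert m \rVert = V$. Let us call $d = \lVert \overrightarrow{P_0\,\mathrm{FOE}} \rVert$.

From (1), we get:

$$d^2 = f^2\,\frac{m_x^2 + m_y^2}{m_z^2} = f^2\,\frac{V^2 - m_z^2}{m_z^2}, \qquad \text{hence} \qquad |m_z| = \frac{fV}{\sqrt{f^2 + d^2}} \qquad (8)$$

and then from (7) we get, for a forward movement ($m_z > 0$),

$$Z = \frac{fV}{\sqrt{f^2 + d^2}}\left(\frac{\lVert \overrightarrow{\mathrm{FOE}\,p_B} \rVert}{\delta(p_B)} - 1\right)$$

For a backward movement ($m_z < 0$), the $-1$ becomes $+1$.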