
Multi range Real-time depth inference from a monocular stabilized footage using a Fully Convolutional Neural Network

by   Clément Pinard, et al.

Using a neural network architecture for depth map inference from monocular stabilized videos, with application to UAV videos in rigid scenes, we propose a multi-range architecture for unconstrained UAV flight that leverages flight data from sensors to produce accurate depth maps for uncluttered outdoor environments. We evaluate our algorithm on both synthetic scenes and real UAV flight data. Quantitative results are given for synthetic scenes with a slightly noisy orientation, and show that our multi-range architecture improves depth inference. A video accompanying this article presents our results more thoroughly.





I Introduction

Scene understanding from vision is a core problem for autonomous vehicles and for UAVs in particular. In this paper we are specifically interested in computing the depth of each pixel from image sequences captured by a camera. We assume our camera’s velocity (and thus the displacement between two frames) is known, as most UAV flight systems include a speed estimator, which resolves the scale ambiguity of the depth map.

Solving this problem could benefit several applications, such as environment scanning or depth-based sense-and-avoid algorithms for lightweight embedded systems that only have a monocular camera. Not relying on depth sensors such as stereo vision, ToF cameras, LiDAR or infrared emitters/receivers frees the UAV from their weight, cost and limitations. Specifically, in addition to some RGB-D sensors being unable to operate under sunlight (e.g. IR and ToF), most of them suffer from range limitations and can be inefficient when long-range information is needed, for instance for trajectory planning [hadsell2009learning]. Unlike RGB-D sensors, depth from motion is flexible w.r.t. displacement and thus robust to high speeds or large distances, as choosing among previous frames gives us a wide range of different displacements. To estimate such depth maps, we designed an end-to-end learning architecture, based on a synthetic dataset and a fully convolutional neural network that takes as input an image pair taken at different times. No preprocessing such as optical flow computation or visual odometry is applied to the input, and the depth is directly provided as an output [Pinard_uavg].

We created a dataset of image pairs with random translation movements, with no rotation, and a constant displacement magnitude applied during the whole training.

Fig. 1: Camera stabilization can be done via a) a mechanical gimbal or b) dynamic cropping from a fish-eye camera, for drones, or c) for hand-held cameras

The assumption about videos without rotation appears realistic for two reasons:

  • Hardware rotation compensation is mainly a solved problem, even for consumer products, with IMU-stabilized cameras on consumer drones or hand-held steady-cams (Fig. 1).

  • This movement is somewhat related to human vision and the vestibulo-ocular reflex (VOR) [VOR]. The orientation of our eyes is not induced by head rotation: our inner ear, among other biological sensors, allows us to compensate for parasitic rotation when looking in a particular direction.

Using the trained network, we propose an algorithm for real-condition depth inference from a stabilized UAV. Displacement from sensors is used to compute the real depth map, as it only differs from the synthetic constant-displacement images by a scale factor. Our network output also allows us to optimize the depth inference a posteriori. By adjusting the frame shift to get a displacement that makes the network see the same disparity distribution as during its training, we lower the depth error for the next inference. For example, with large distances, the ideal displacement between two frames is higher, and thus the shift is also higher for a given speed. Moreover, we use batch inference to compute multiple depth maps, each centered around a particular range, and fuse them to get high precision for both close and far objects, given a sufficient displacement of the UAV.

II Related Work

Deep learning and convolutional neural networks have recently been widely used for numerous kinds of vision problems such as classification [krizhevsky2012imagenet] and hand-written digit recognition [lecun1998gradient].

Depth from vision is one of the problems studied with neural networks, and has been addressed with a wide range of training solutions. Some datasets [geiger2012we, Silberman:ECCV12] allow a neural network to learn end-to-end depth or disparity [luo2016efficient, zbontar2015computing, eigen2014depth]. Reprojection error has also been used for unsupervised training for depth from a single image [2017arXiv170407804V, zhou2017unsupervised] or for disparity between two frames of a stereo rig [DBLP:journals/corr/KondaM13, DBLP:journals/corr/GargBR16].

Depth from a single image, although interesting, suffers from a major drawback, which is overfitting. No motion is given to the network during inference, and the resulting depth is inferred from context only, whereas context and depth can be decorrelated. This technique can be sufficient in a road driving context, with an obvious road in front of the camera, but for UAV flight we may have to deal with very heterogeneous scenes. On the other hand, depth from a stereo pair only involves a single lateral movement, and lacks the forward component needed to be realistic for aerial stabilized footage.

For depth from more complex movements of a monocular camera, current state-of-the-art methods tend to use motion, and especially structure from motion, and most algorithms do not rely on deep learning [cadena2016past, mur2016orb, klein2007parallel]. Prior knowledge w.r.t. the scene is used to infer a sparse depth map whose density usually grows over time. These techniques, also called SLAM, are typically used with unstructured movement (translation and rotation with varying magnitudes), produce very sparse point-cloud-based 3D maps, and require heavy computation to keep track of the scene structure and align newly detected 3D points with the existing ones.

Our goal is to compute a dense depth map (where every point has a valid depth) using only two frames from the same camera, at different times, and without prior knowledge on the scene and direction of movement, apart from the lack of rotation and the scale factor.

III End-to-end learning of Depth Inference

Inspired by flow and disparity estimation (disparity being essentially the magnitude of optical flow vectors), problems for which many convincing methods exist [ilg2016flownet, 2017arXiv170304309K], we set up an end-to-end learning workflow by training a neural network to explicitly predict the depth of every pixel in a scene from an image pair with a constant displacement value.

III-A Still Box Dataset

Fig. 2: Some examples of our renderings with associated depth maps (red is close, purple is far)

We design our own synthetic dataset, using the rendering software Blender, to generate an arbitrary number of random rigid scenes composed of basic 3D primitives (cubes, spheres, cones and tori) randomly textured from an image set scraped from Flickr (see Fig. 2).

These objects are randomly placed and sized in the scene, and walls are added at large distances, as if the camera were inside a box (hence the name). The camera moves at a fixed speed, in a uniformly distributed random direction that is constant for each scene. The movement can be anything from forward/backward to lateral (which is then equivalent to stereo vision).
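The camera motion described above can be sketched as follows. This is only an illustration under stated assumptions, not the Blender script used to render the dataset: the function name `random_displacement` and the speed value are hypothetical, and the direction is drawn uniformly on the unit sphere by normalizing a Gaussian sample.

```python
import math
import random


def random_displacement(speed, rng=random):
    """Return a 3-D displacement of magnitude `speed` in a random direction.

    Normalizing three independent Gaussian samples gives a direction that is
    uniformly distributed on the unit sphere.
    """
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a (very unlikely) null vector
            return [speed * c / norm for c in v]


# One direction per scene, reused for every frame of that scene.
rng = random.Random(0)
step = random_displacement(0.5, rng)  # 0.5 is an arbitrary illustrative speed
positions = [[k * c for c in step] for k in range(10)]  # 10 camera positions
```

Because the direction is fixed per scene, consecutive camera positions lie on a straight line, which is what makes the displacement between any two frames proportional to their temporal shift.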

III-B Dataset augmentation

In our dataset, we store data in 10-frame videos, with each frame paired with its ground-truth depth. This allows us to set the distance distribution a posteriori, using a variable temporal shift between two frames. If we use a baseline shift of 3 frames, we can e.g. assume a depth three times as great as for two consecutive frames (shift of 1). In addition, we can also consider negative shifts, which only change the displacement direction without changing the speed value. Given a fixed dataset size, this gives us more evenly distributed depth values to learn, and also decorrelates images from depth, preventing an over-fitting during training that would result in a scene recognition algorithm performing poorly on a validation set.
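A minimal sketch of this shift-based augmentation is given below. It is one possible reading, not the authors' code: the function name, the choice of the first frame's depth map as target, and the convention that the nominal displacement corresponds to `baseline_shift` frames are all assumptions. The 100 m cap follows the clipping mentioned in the training section.

```python
def make_training_pair(frames, depths, index, shift,
                       baseline_shift=3, max_depth=100.0):
    """Build one (image pair, target depth) sample from a short video.

    frames, depths: per-frame images and ground-truth depth maps (lists of rows).
    shift: signed, non-zero temporal shift between the two frames.
    The displacement is |shift| / baseline_shift times the nominal one, so the
    depth seen by a network that assumes the nominal displacement is the true
    depth multiplied by baseline_shift / |shift|.
    """
    if shift == 0:
        raise ValueError("shift must be non-zero")
    other = index + shift
    if not 0 <= other < len(frames):
        raise IndexError("shifted frame falls outside the video")
    scale = baseline_shift / abs(shift)
    target = [[min(d * scale, max_depth) for d in row] for row in depths[index]]
    return (frames[index], frames[other]), target


# Toy 10-frame video: frame names stand in for images, one 1x2 depth map each.
frames = ["f%d" % k for k in range(10)]
depths = [[[10.0, 40.0]] for _ in range(10)]
pair, target = make_training_pair(frames, depths, index=4, shift=1)
back_pair, back_target = make_training_pair(frames, depths, index=4, shift=-3)
```

With a shift of 1 the targets are three times the true depths (and clipped at 100 m), while a shift of -3 keeps the true depths and only reverses the direction of motion.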

III-C Depth Inference training

[Network diagram. Typical Conv module: SpatialConv, 3x3. Typical Deconv module: SpatialConvTranspose, 4x4; SpatialConv, 3x3. Blocks: input image pair; Conv1 to Conv6, each with stride 2; Up Depth6 to Up Depth3; final depth output; multi-scale L1 loss.]

Fig. 3: DepthNet structure parameters, Conv and Deconv modules detailed above

Our network, called DepthNet, is broadly inspired by FlowNetS [DFIB15] (initially used for flow inference). It is described in detail in [Pinard_uavg]; we provide here a summary of its structure (Fig. 3) and performance. Each convolution (apart from depth modules) is followed by a spatial batch normalization and a ReLU activation layer. Batch normalization helps convergence and stability during training by normalizing a convolution’s output (zero mean and standard deviation of 1) over a batch of multiple inputs, and the Rectified Linear Unit (ReLU) is the typical activation layer [DBLP:journals/corr/XuWCL15]. Depth modules are convolution modules that reduce the input to a single feature map, which is expected to be the depth map at a given scale. One should note that FlowNetS initially used LeakyReLU, which has a non-null slope for negative values, but tests showed that ReLU performed better for our problem.

The main idea behind this network is that upsampled feature maps are concatenated with corresponding earlier convolution outputs (e.g. Conv2 output with Deconv5 output). Higher-level semantic information is then associated with information more closely linked to pixels (since it went through fewer downsampling convolutions), which is then used for reconstruction.

This multi-scale architecture has proven very efficient for flow and disparity computation while keeping a very simple supervised learning process.
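To see why concatenation between the two halves of the network is possible, the spatial sizes can be traced with the standard convolution and transposed-convolution output formulas. The sketch below assumes a padding of 1 for both modules and four upsampling stages (one per "Up Depth" block of Fig. 3); neither value is stated in the text, and the function names are illustrative.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def feature_map_sizes(input_size, n_conv=6, n_up=4):
    """Return the sizes after each stride-2 convolution and each upsampling."""
    encoder, size = [], input_size
    for _ in range(n_conv):
        size = conv_out(size)
        encoder.append(size)
    decoder = []
    for _ in range(n_up):
        size = deconv_out(size)
        decoder.append(size)
    return encoder, decoder


encoder, decoder = feature_map_sizes(512)
```

Under these assumptions a 512x512 input is reduced to 8x8 after Conv6, every upsampled map has the same size as an earlier convolution output, and the last one is 128x128, which agrees with the factor-4 downsampling of the output reported for Fig. 4.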

The main point of this experiment is to show that direct depth estimation can be efficient with respect to an unknown translation direction. Like FlowNetS, we use a multi-scale criterion, with an L1 reconstruction error for each scale:

$$\text{Loss} = \sum_{s \in \text{scales}} \gamma_s \frac{1}{H_s W_s} \sum_{i} \sum_{j} \left| \beta_s(i,j) - \zeta_s(i,j) \right|$$

  • $\gamma_s$ is the weight of the scale, arbitrarily chosen.

  • $(H_s, W_s)$ are the height and width of the output.

  • $\zeta_s$ is the scaled depth ground truth, using average pooling.

  • $\beta_s$ is the output of the network at scale $s$.
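The criterion defined by these four quantities can be sketched in plain Python as follows. This is an illustration rather than the training code: depth maps are nested lists, the function names are hypothetical, and each scale is assumed to divide the ground-truth size exactly.

```python
def average_pool(depth, factor):
    """Downscale a 2-D map (list of rows) by averaging factor x factor blocks."""
    h, w = len(depth) // factor, len(depth[0]) // factor
    return [
        [
            sum(depth[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(w)
        ]
        for i in range(h)
    ]


def multiscale_l1_loss(outputs, ground_truth, weights):
    """Weighted sum over scales of the mean absolute depth error.

    outputs: one 2-D prediction per scale, each smaller than or equal to the
    ground truth. weights: one arbitrarily chosen weight per scale.
    """
    total = 0.0
    for out, weight in zip(outputs, weights):
        factor = len(ground_truth) // len(out)
        target = average_pool(ground_truth, factor)
        h, w = len(out), len(out[0])
        err = sum(abs(out[i][j] - target[i][j])
                  for i in range(h) for j in range(w))
        total += weight * err / (h * w)
    return total


ground_truth = [[1.0, 3.0], [5.0, 7.0]]
outputs = [[[1.0, 3.0], [5.0, 7.0]], [[5.0]]]  # full scale is exact, half scale is off by 1
loss = multiscale_l1_loss(outputs, ground_truth, weights=[1.0, 0.5])
```

Dividing by the number of pixels at each scale keeps coarse and fine outputs comparable, so the per-scale weights alone decide how much each resolution contributes.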

As said earlier, we apply data augmentation to the dataset using different shifts, along with classic methods such as flips and rotations. We also clip depth to a maximum of 100 m, and provide sample pairs without shift, assuming their depth is 100 m everywhere. As a consequence, the trained network will only be able to infer depths lower than 100 m.

Fig. 4: Result on 512x512 images from DepthNet. Upper-left: input, lower-left: Ground Truth depth, lower-right: our network output (128x128), upper-right: error, green is no error, red is overestimated depth, blue is underestimated
Network L1Error RMSE
train test train test
FlowNetS 2.44 4.77
DepthNet 2.44
DepthNet 4.90
TABLE I: Quantitative results for depth inference networks. FlowNetS is modified with 1 channel outputs (instead of 2 for flow), trained from scratch for depth with Still Box, subscript indicates fine tuning process.

We trained on several input image sizes, from 64x64 to 512x512. Fig. 4 shows training results for mean L1 reconstruction error. Like FlowNetS, network outputs are downsampled by a factor of 4 with respect to the input size. As Table I shows, best results are obtained with multiple fine-tuning steps, with intermediate scales , , , and finally pixels. Subscript values indicate fine-tuning processes. FlowNetS performs better than DepthNet, but only by a small margin, while being 5 times heavier and most of the time much slower.

IV UAV navigation use-case

Fig. 5: Result on x real images input from a Bebop drone footage

IV-A Optimal frame shift determination

We learned depth inference from a moving camera, assuming its velocity is always the same. Results from real-condition drone footage, in which we were careful to avoid camera rotation, can be seen in Fig. 5. These results did not benefit from any fine-tuning on real footage, indicating that our Still Box dataset, although not realistic in its scene structures and rendering, appears to be sufficiently heterogeneous for learning to produce decent depth maps in real conditions. When running during flight, such a system can deduce the real depth map from the network output and the drone displacement, knowing that the training displacement was (here )


The actual correct interpretation of the output of DepthNet is rather a percentage than a distance. meaning max distance for a given displacement . We can introduce a function and a dimension-less parameter for computing actual depth using the displacement as the only distance related factor.
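The rescaling described here can be sketched as below, under the assumption that the real depth is the network output multiplied by the ratio between the actual displacement and the training displacement (the "scale factor" mentioned in the introduction). The function name is hypothetical and the numeric values are illustrative only, not the ones used for training.

```python
def rescale_depth(network_output, displacement, training_displacement):
    """Convert the raw network output to metric depth.

    The network was trained with a constant displacement between the two
    frames, so for a pair with another displacement the real depth is the
    output scaled by the ratio of the two displacements.
    """
    if training_displacement <= 0:
        raise ValueError("training displacement must be positive")
    ratio = displacement / training_displacement
    return [[value * ratio for value in row] for row in network_output]


# Illustrative values: the pair is twice as far apart as during training.
depth = rescale_depth([[10.0, 50.0]], displacement=1.0, training_displacement=0.5)
```

The output therefore behaves like a fraction of the maximum range reachable for a given displacement: doubling the displacement doubles every inferred distance.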


Depending on the depth distribution of the ground-truth depth map, it may be useful to adjust frame shift . For example, when flying high above the ground at low speed, detecting and avoiding large structures requires knowing precise distance values that are outside the typical range of any RGB-D sensor. The logical strategy is then to increase the temporal shift between the frame pairs provided to DepthNet as inputs. More generally, one must provide inputs to DepthNet that ensure a well-distributed depth output within its typical range. Depth-wise normalized error, which is the essential quality measurement for values that we want to rescale, will diverge when ground truth depth approaches . Indeed, in addition to being equivalent to an infinite optical flow, the depth-wise error cannot tend to , which will make the expression tend to at We thus need to choose the optimal spatial displacement and corresponding temporal shift to minimize the error on the next inference, assuming the same depth distribution, to avoid too low or too high an equivalent ground truth. We chose the space displacement as:


With the mean of depth values and the optimal mean output of , e.g. . is then computed numerically to get the frame shift with the closest corresponding displacement possible.
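The selection of a displacement and of the matching frame shift can be sketched as follows. The formula in `optimal_displacement` is derived here from the displacement-ratio rescaling and should be read as an assumption, as should the function names, the trapezoidal integration, and the simplification that speeds are sampled at the frame timestamps.

```python
import math


def optimal_displacement(depth_map, training_displacement, target_mean_output):
    """Displacement for which the mean network output would hit the target."""
    values = [d for row in depth_map for d in row]
    mean_depth = sum(values) / len(values)
    return training_displacement * mean_depth / target_mean_output


def integrate_positions(timestamps, speeds):
    """Trapezoidal integration of 3-D speed samples into positions."""
    positions = [[0.0, 0.0, 0.0]]
    for k in range(1, len(timestamps)):
        dt = timestamps[k] - timestamps[k - 1]
        positions.append([
            p + 0.5 * (v0 + v1) * dt
            for p, v0, v1 in zip(positions[-1], speeds[k - 1], speeds[k])
        ])
    return positions


def closest_shift(positions, desired_displacement):
    """Frame shift (counted back from the latest frame) whose displacement
    is closest to the desired one, with that displacement."""
    latest = positions[-1]
    best = None
    for shift in range(1, len(positions)):
        disp = math.dist(latest, positions[-1 - shift])
        if best is None or abs(disp - desired_displacement) < abs(best[1] - desired_displacement):
            best = (shift, disp)
    return best


# Illustrative flight: 1 m/s straight ahead, one frame every 0.5 s.
timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]
speeds = [[1.0, 0.0, 0.0]] * 5
desired = optimal_displacement([[80.0, 120.0]], 0.5, 50.0)
shift, displacement = closest_shift(integrate_positions(timestamps, speeds), desired)
```

Returning the displacement actually achieved, and not only the shift, matters because that value (not the desired one) is what the depth rescaling must use.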

IV-B Multiple shifts inference

[Pipeline diagram. Blocks: frame acquisition, frame timestamps, speeds from sensors, speed timestamps, frames buffer, speeds buffer, frame pairs picker, numeric integration, fusion of outputs, final depth.]

Fig. 6: Multiple shifts architecture. We used different planes. Numeric integration, given a desired displacement gives the closest possible displacement between frames , along with corresponding shift . As discussed in part IV, the fusion block computes pixel-wise weights from to make a weighted mean of

As neural networks are traditionally computed on massively parallel architectures such as GPUs, multiple depth maps can be computed efficiently at the same time in a batch, especially at low resolution. Batch inference can then be used to compute depth with multiple shifts . These multiple depth maps can then be combined to construct a higher-quality depth map, with high precision for both long and short range. We propose a dynamic range algorithm, described in Fig. 6, to compute and combine different depth maps.

Instead of only one optimal displacement from , we use the K-means clustering algorithm [macqueen1967] on the depth map to find a list of clusters on which each shift will focus. The clustering outputs a list of centroids and corresponding and . is an arbitrarily chosen value, usually ranging from to .

The final depth map is then computed by fusing these outputs using a weighted mean for each pixel. Each weight is a linear interpolation from to according to the distance of the depth from a target value . That way, fusion will favor values that are closer to this optimal value. An value is added to solve fusion when every depth map is off its wanted range.
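A minimal sketch of the clustering step is shown below, using plain Lloyd iterations on the scalar depth values. The quantile-based initialization and the function name are choices made for this illustration, not details from the text, and the conversion of centroids into displacements reuses the assumed displacement-ratio relation with illustrative values.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar depth values and return the k centroids.

    Centroids start at evenly spaced quantiles of the sorted values, which
    keeps the sketch deterministic.
    """
    ordered = sorted(values)
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        updated = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Flattened depth map with three well-separated depth ranges (in meters).
depth_values = [9.0, 10.0, 11.0, 49.0, 50.0, 51.0, 99.0, 100.0, 101.0]
centroids = kmeans_1d(depth_values, k=3)
# One desired displacement per centroid (0.5 and 50.0 are illustrative).
desired_displacements = [0.5 * c / 50.0 for c in centroids]
```

Each centroid plays the role of the mean depth of a single-shift setup, so far clusters call for larger displacements and close clusters for smaller ones.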

Fig. 7: real condition application of the multi-shift algorithm with Tiny DepthNet Clamped. First image is input. Last two are outputs of the network, for shifts of and with a drone flying forward at and at an altitude of , with corresponding displacements from sensors. Second is fused output, capped to up

For our use-case, we set , , and . is the index of frame shift, are the spatial indices. Fig 7 shows a result of the proposed algorithm for a batch size of . Notice how the high shift detects buildings while low shift detects trees.
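The fusion step can be sketched as a pixel-wise weighted mean. The weight profile used here (1 when the raw output equals the target, falling linearly to 0 at a distance `span`, plus a small `epsilon`) and all default values are assumptions chosen for illustration; the function name is hypothetical as well.

```python
def fuse_depth_maps(outputs, displacements, training_displacement,
                    target=50.0, span=50.0, epsilon=1e-3):
    """Pixel-wise weighted mean of several rescaled depth maps.

    outputs[k][i][j] is the raw network output for frame shift k.
    A weight falls linearly from 1 (output equal to `target`) to 0 (output at
    least `span` away from it); `epsilon` keeps the mean defined when every
    output is off its wanted range.
    """
    height, width = len(outputs[0]), len(outputs[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            num = den = 0.0
            for out, disp in zip(outputs, displacements):
                raw = out[i][j]
                weight = epsilon + max(0.0, 1.0 - abs(raw - target) / span)
                num += weight * raw * disp / training_displacement
                den += weight
            fused[i][j] = num / den
    return fused


# Two shifts disagree on one pixel: the output close to the target dominates.
fused = fuse_depth_maps([[[100.0]], [[50.0]]], [0.5, 0.5], 0.5)
# Both outputs are off range: epsilon makes this a plain average.
fallback = fuse_depth_maps([[[0.0]], [[100.0]]], [0.5, 0.5], 0.5)
```

Weights are computed on the raw outputs, where the wanted range is the same for every shift, while the averaged quantity is the rescaled metric depth.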

IV-C Clamped DepthNet

Our proposed algorithm suffers from a problem on real-condition videos, because we assume perfect stabilization. On very far objects (e.g. the sky), any minor optical flow caused by a stabilization defect will result in a massive depth error. Moreover, since our network is very good at recognizing shapes and giving them the same depth everywhere, this can result in the whole sky being computed as relatively close. We thus propose a network designed for a simpler problem: during training on Still Box, we clamp depth from to , with a shift of images (instead of for DepthNet). These new parameters allow the network to focus only on mid-range objects, dismissing close and far objects, which have respectively too large and too small an optical flow. This training workflow is very well suited for multiple-shift depth inference: every image pair has a dedicated depth range to analyze, so the fusion is not bothered with the redundant data that the large initial range of DepthNet would produce.

Fig. 8: results for synthetic x scenes with noisy orientation. DepthNet has been tested with 1 and 2 planes, DepthNet Clamped with 1 to 3 planes and Tiny DepthNet Clamped with 1 to 4 planes. Y axis is Absolute mean error (m) divided by ground-truth depth, X axis is inference speed, in ms

Figure 8 shows results for multiple synthetic x scenes with ground truth, along with inference speed, with a small noise added to the camera’s initial orientation at each frame. , with being a 3-dimensional random unit vector and a constant fixed to 0.001. We also report the performance of a thin version of our clamped network, which shows better results than DepthNet with 1 plane only in this noisy setup. The thin network has the same depth, but every convolution outputs half the number of feature maps of the original DepthNet. These results were obtained on a laptop powered by a Quadro K2200M.
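The quality measure on the Y axis of Fig. 8 can be read as a mean absolute error normalized by the ground-truth depth. The sketch below implements one possible reading, where the normalization is applied per pixel before averaging; the function name is hypothetical, and ground-truth depths are assumed to be strictly positive.

```python
def mean_relative_error(prediction, ground_truth):
    """Mean over all pixels of |prediction - ground truth| / ground truth."""
    errors = [
        abs(p - g) / g
        for p_row, g_row in zip(prediction, ground_truth)
        for p, g in zip(p_row, g_row)
    ]
    return sum(errors) / len(errors)


# A 1 m error at 10 m and a 2 m error at 20 m both count as 10 %.
error = mean_relative_error([[11.0, 18.0]], [[10.0, 20.0]])
```

Normalizing by depth makes errors on close and far objects comparable, which is also why the measure grows without bound as the ground-truth depth shrinks.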

V Conclusion and future work

We proposed a novel way of computing dense depth maps from motion, along with a very comprehensive dataset for stabilized footage analysis and a technique for dynamic range real flight computing. This algorithm can then be used for depth-based sense and avoid algorithms in a very flexible way, in order to cover all kinds of path planning, from collision avoidance to long range obstacle bypassing.

A more thorough presentation of the results can be viewed in this video.

Future work includes the implementation of such a path planning algorithm, and the construction of a real-condition fine-tuning dataset, using UAV footage and a preliminary thorough offline 3D scan. This would allow us to measure the quality of our network on real footage quantitatively, and not only subjectively as for now. We could also use unsupervised techniques, using re-projection errors as in [zhou2017unsupervised].

We also believe that our network can be extended to reinforcement learning applications that will potentially result in a complete end-to-end sense and avoid neural network for monocular cameras.

The major drawback of our algorithm is, however, the need for the scene to be rigid. This is never strictly the case, and even though UAV footage is less prone to moving objects than autonomous driving footage, we will face this issue whenever a moving target is to be followed. To solve this problem, an explicit movement equation for both the camera and the moving targets may have to be computed, as in [2017arXiv170407804V]. In any case, this problem will be a challenge and may not be solvable with fully convolutional networks only, as we did in this article.