I Introduction
Scene understanding from vision is a core problem for autonomous vehicles, and for UAVs in particular. In this paper we are specifically interested in computing the depth of each pixel from image sequences captured by a camera. We assume that the camera's velocity (and thus the displacement between two frames) is known, since most UAV flight systems include a speed estimator; this allows us to resolve the scale ambiguity of the depth map.
Solving this problem could benefit several applications, such as environment scanning or depth-based sense-and-avoid algorithms for lightweight embedded systems that only have a monocular camera. Not relying on depth sensors such as stereo vision, ToF cameras, LiDAR or infrared emitters/receivers frees the UAV from their weight, cost and limitations. Specifically, in addition to some RGB-D sensors being unable to operate under sunlight (e.g. IR and ToF), most of them suffer from range limitations and can be inefficient when long-range information is needed, as in trajectory planning [hadsell2009learning]. Unlike RGB-D sensors, depth from motion is flexible w.r.t. displacement, and thus robust to high speeds or large distances, since choosing among previous frames gives us a wide range of displacements. To estimate such depth maps, we designed an end-to-end learning architecture, based on a synthetic dataset and a fully convolutional neural network that takes as input an image pair taken at different times. No preprocessing such as optical flow computation or visual odometry is applied to the input, and the depth is directly provided as an output [Pinard_uavg].
We created a dataset of image pairs with random translation movements, with no rotation, and a constant displacement magnitude applied during the whole training.

The assumption about videos without rotation appears realistic for two reasons:

Hardware rotation compensation is largely a solved problem, even for consumer products, with IMU-stabilized cameras on consumer drones or handheld steadicams (Fig 1).

This movement is somewhat related to human vision and the vestibulo-ocular reflex (VOR) [VOR]. The orientation of our eyes is not dictated by head rotation: our inner ear, among other biological sensors, allows us to compensate for parasitic rotation when looking in a particular direction.
Using the trained network, we propose an algorithm for real-condition depth inference from a stabilized UAV. The displacement given by the sensors is used to compute the real depth map, as it only differs from the synthetic constant-displacement images by a scale factor. Our network output also allows us to optimize the depth inference a posteriori. By adjusting the frame shift to obtain a displacement for which the network sees the same disparity distribution as during training, we lower the depth error for the next inference. For example, at large distances the ideal displacement between two frames is higher, and thus the shift is also higher for a given speed. Moreover, we use batch inference to compute multiple depth maps, each centered around a particular range, and fuse them to get high precision for both close and far objects, regardless of distance, provided the UAV displacement is sufficient.
II Related Work
Deep Learning and Convolutional Neural Networks have recently been widely used for numerous kinds of vision problems, such as classification [krizhevsky2012imagenet] and handwritten digit recognition [lecun1998gradient].
Depth from vision is one of the problems studied with neural networks, and has been addressed with a wide range of training solutions. Some datasets [geiger2012we, Silberman:ECCV12] allow a neural network to learn end-to-end depth or disparity [luo2016efficient, zbontar2015computing, eigen2014depth]. Reprojection error has also been used for unsupervised training for depth from a single image [2017arXiv170407804V, zhou2017unsupervised] or for disparity between two frames of a stereo rig [DBLP:journals/corr/KondaM13, DBLP:journals/corr/GargBR16].
Depth from a single image, although interesting, suffers from a major drawback, which is overfitting. No motion is given to the network during inference, and the resulting depth is inferred from context alone, whereas context and depth can be decorrelated. This technique can be sufficient in a road driving context, with an obvious road in front of the camera, but for UAV flight we may have to deal with very heterogeneous scenes. On the other hand, depth from a stereo pair only involves a single lateral movement, and lacks the forward component needed to be realistic for stabilized aerial footage.
For depth from more complex movements of a monocular camera, current state-of-the-art methods tend to use motion, and especially structure from motion, and most algorithms do not rely on deep learning [cadena2016past, mur2016orb, klein2007parallel]. Prior knowledge of the scene is used to infer a sparse depth map whose density usually grows over time. These techniques, also called SLAM, are typically used with unstructured movement (translation and rotation with varying magnitudes), produce very sparse point-cloud-based 3D maps, and require heavy computation to keep track of the scene structure and align newly detected 3D points with the existing ones.
Our goal is to compute a dense depth map (where every point has a valid depth) using only two frames from the same camera, taken at different times, and without prior knowledge of the scene or the direction of movement, apart from the lack of rotation and the scale factor.
III End-to-end learning of Depth Inference
Inspired by flow estimation and disparity (which is essentially the magnitude of optical flow vectors), problems for which many very convincing methods exist [ilg2016flownet, 2017arXiv170304309K], we set up an end-to-end learning workflow, training a neural network to explicitly predict the depth of every pixel in a scene from an image pair with a constant displacement value.
III-A Still Box Dataset
We design our own synthetic dataset, using the rendering software Blender, to generate an arbitrary number of random rigid scenes, composed of basic 3D primitives (cubes, spheres, cones and tori) randomly textured from an image set scraped from Flickr (see Fig 2).
These objects are randomly placed and sized in the scene, and walls are added at large distances, as if the camera were inside a box (hence the name). The camera moves at a fixed speed, but in a uniformly distributed random direction, which is constant for each scene. It can be anything from forward/backward movement to lateral movement (which is then equivalent to stereo vision).
III-B Dataset augmentation
In our dataset, we store data in 10-frame videos, with each frame paired with its ground-truth depth. This allows us to set the distance distribution a posteriori, using a variable temporal shift between two frames. If we use a baseline shift of 3 frames, we can e.g. assume a depth three times as great for two consecutive frames (shift of 1). In addition, we can also consider negative shifts, which only change the displacement direction without changing the speed. Given a fixed dataset size, this allows us to get more evenly distributed depth values to learn from, and also to decorrelate images from depth. This prevents overfitting during training, which would result in a scene recognition algorithm that performs poorly on a validation set.
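The shift-based rescaling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the function names `rescaled_target` and `enumerate_pairs`, the baseline shift of 3 and the maximum shift of 3 are hypothetical choices that simply mirror the example in the text.

```python
import numpy as np


def rescaled_target(depth_gt, shift, baseline_shift=3):
    """Depth the network should output for a pair `shift` frames apart.

    Assumption: the network is trained as if every pair had the displacement
    of `baseline_shift` frames, so a pair with a smaller displacement looks
    like the same scene seen from proportionally farther away.
    """
    if shift == 0:
        raise ValueError("shift must be non-zero")
    # A negative shift only reverses the direction, so only |shift| matters.
    return np.asarray(depth_gt, dtype=float) * (baseline_shift / abs(shift))


def enumerate_pairs(n_frames=10, max_shift=3):
    """List every (reference, other) frame index pair inside one video."""
    pairs = []
    for ref in range(n_frames):
        for shift in range(-max_shift, max_shift + 1):
            if shift != 0 and 0 <= ref + shift < n_frames:
                pairs.append((ref, ref + shift))
    return pairs


depth_gt = np.full((2, 2), 10.0)
print(rescaled_target(depth_gt, shift=1))  # consecutive frames: 3x the depth
print(len(enumerate_pairs()))              # number of usable pairs per video
```

With these hypothetical settings, one 10-frame video yields several dozen training pairs covering different equivalent depth ranges, which is the point of the augmentation.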
III-C Depth Inference training
Typical Conv Module: SpatialConv 3x3 → SpatialBatchNorm → ReLU
Typical Deconv Module: SpatialConvTranspose 4x4 → SpatialConv 3x3 → SpatialBatchNorm → ReLU
Our network, called DepthNet, is broadly inspired by FlowNetS [DFIB15] (initially used for flow inference). It is described in detail in [Pinard_uavg]; we provide here a summary of its structure (Fig 3) and performance. Each convolution (apart from depth modules) is followed by a Spatial Batch Normalization and a ReLU activation layer. Batch normalization helps convergence and stability during training by normalizing a convolution's output (zero mean and unit standard deviation) over a batch of multiple inputs [ioffe2015batch], and the Rectified Linear Unit (ReLU) is the typical activation layer [DBLP:journals/corr/XuWCL15]. Depth modules are convolution modules that reduce the input to a single feature map, which is expected to be the depth map at a given scale. Note that FlowNetS initially used LeakyReLU, which has a non-zero slope for negative values, but tests showed that ReLU performed better for our problem.
The main idea behind this network is that upsampled feature maps are concatenated with the corresponding earlier convolution outputs (e.g. Conv2 output with Deconv5 output). Higher-level semantic information is thus associated with information more closely linked to pixels (since it went through fewer downsampling convolutions), and both are then used for reconstruction.
This multiscale architecture has proven very effective for flow and disparity computation while keeping a very simple supervised learning process.
The main point of this experiment is to show that direct depth estimation can be effective even when the translation direction is unknown. Like FlowNetS, we use a multiscale criterion, with an L1 reconstruction error for each scale:
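(1) $\mathrm{Loss} = \sum_{s \in \mathrm{scales}} \gamma_s \frac{1}{H_s W_s} \sum_{i} \sum_{j} \left| \beta_s(i,j) - \zeta_s(i,j) \right|$
where

$\gamma_s$ is the weight of the scale $s$, arbitrarily chosen.

$(H_s, W_s)$ are the height and width of the output.

$\zeta_s$ is the scaled depth ground truth, using average pooling.

$\beta_s$ is the output of the network at scale $s$.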
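As an illustration of the multiscale criterion, the following NumPy sketch computes a weighted sum of per-scale mean absolute errors, with the ground truth reduced to each output's resolution by average pooling. The function names, the example weights and the assumption that each output size divides the ground-truth size evenly are illustrative choices, not taken from the training code.

```python
import numpy as np


def average_pool(depth, factor):
    """Downscale a 2D depth map by averaging non-overlapping factor x factor blocks."""
    height, width = depth.shape
    if height % factor or width % factor:
        raise ValueError("depth size must be divisible by the pooling factor")
    blocks = depth.reshape(height // factor, factor, width // factor, factor)
    return blocks.mean(axis=(1, 3))


def multiscale_l1(outputs, depth_gt, weights):
    """Weighted sum over scales of the mean absolute error.

    outputs:  list of 2D network outputs, one per scale (hypothetical layout)
    depth_gt: full-resolution 2D ground-truth depth
    weights:  one arbitrary weight per scale
    """
    total = 0.0
    for output, weight in zip(outputs, weights):
        factor = depth_gt.shape[0] // output.shape[0]
        target = average_pool(depth_gt, factor)
        # .mean() divides by height * width of the output at this scale.
        total += weight * np.abs(output - target).mean()
    return total


depth_gt = np.arange(16.0).reshape(4, 4)
outputs = [depth_gt.copy(), average_pool(depth_gt, 2) + 1.0]
print(multiscale_l1(outputs, depth_gt, weights=[1.0, 0.5]))
```

In this toy case the full-resolution output is perfect and the half-resolution output is off by one everywhere, so only the second scale contributes to the loss.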
As said earlier, we apply data augmentation to the dataset using different shifts, along with classic methods such as flips and rotations. We also clip depth to a maximum of 100m, and provide sample pairs without shift, assuming their depth is 100m everywhere. As a consequence, the trained network will only be able to infer depths lower than 100m.
Network  L1Error  RMSE  

train  test  train  test  
FlowNetS  
DepthNet  
FlowNetS  2.44  4.77  
DepthNet  
DepthNet  2.44  
DepthNet  4.90  
DepthNet  
DepthNet 
We trained on several input image sizes, from 64x64 to 512x512. Fig 4 shows training results for the mean L1 reconstruction error. Like FlowNetS, network outputs are downsampled by a factor of 4 with respect to the input size. As Table I shows, the best results are obtained with multiple fine-tuning steps, with intermediate scales , , , and finally pixels. Subscript values indicate fine-tuning processes. FlowNetS performs better than DepthNet, but by a fairly small margin, while being 5 times heavier and most of the time much slower.
IV UAV navigation use case
IV-A Optimal frame shift determination
We learned depth inference from a moving camera, assuming its velocity is always the same. Results from real-condition drone footage, in which we were careful to avoid camera rotation, can be seen in Fig 5. These results did not benefit from any fine-tuning on real footage, indicating that our Still Box Dataset, although not realistic in its scene structures and rendering, appears to be sufficiently heterogeneous for the network to learn to produce decent depth maps in real conditions. When running during flight, such a system can deduce the real depth map from the network output and the drone displacement, knowing that the training displacement was (here )
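(2) $\zeta(t) = \frac{D(t)}{D_0}\,\beta(t)$
where $\zeta(t)$ is the real depth map, $\beta(t)$ the network output, $D(t)$ the drone displacement between the two frames and $D_0$ the training displacement.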
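The rescaling step can be illustrated with a short NumPy sketch. It assumes, as stated in the text, that the real depth differs from the network output only by the ratio between the actual displacement and the training displacement. The function names and the numeric values (speed, frame rate, training displacement) are hypothetical and chosen only for the example.

```python
import numpy as np


def displacement_from_speed(speed_mps, frame_shift, fps):
    """Distance travelled between two frames `frame_shift` frames apart."""
    return speed_mps * frame_shift / fps


def real_depth(network_output, displacement, training_displacement):
    """Rescale the network output by the displacement ratio."""
    if training_displacement <= 0:
        raise ValueError("training_displacement must be positive")
    scale = displacement / training_displacement
    return np.asarray(network_output, dtype=float) * scale


# Hypothetical flight: 10 m/s, 20 fps video, pair taken 2 frames apart.
displacement = displacement_from_speed(speed_mps=10.0, frame_shift=2, fps=20.0)
output = np.array([[10.0, 50.0]])
print(real_depth(output, displacement, training_displacement=0.5))
```

Here the drone moved twice as far as the (assumed) training displacement, so every output value is doubled to obtain the metric depth.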
The correct interpretation of the output of DepthNet is a percentage rather than a distance. meaning max distance for a given displacement . We can introduce a function and a dimensionless parameter for computing the actual depth using the displacement as the only distance-related factor.
(3) 
Depending on the depth distribution of the ground-truth depth map, it may be useful to adjust the frame shift. For example, when flying high above the ground at low speed, detecting and avoiding big structures requires knowing precise distance values that are outside the typical range of any RGB-D sensor. The logical strategy is then to increase the temporal shift between the frame pairs provided to DepthNet as inputs. More generally, the inputs provided to DepthNet must ensure a well-distributed depth output within its typical range. The depthwise normalized error, which is the essential quality measurement for values that we want to rescale, diverges when the ground-truth depth approaches zero. Indeed, in addition to being equivalent to an infinite optical flow, the depthwise error cannot tend to zero, which makes the expression tend to infinity at zero. We thus need to choose the optimal spatial displacement and the corresponding temporal shift to minimize the error on the next inference, assuming the same depth distribution, to avoid too low or too high an equivalent ground truth. We choose the spatial displacement as:
(4) 
With the mean of depth values and the optimal mean output of , e.g. . is then computed numerically to get the frame shift with the closest corresponding displacement possible.
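The selection of the next frame shift can be sketched as below. This is an interpretation rather than the exact rule of the paper: since the real depth is the network output scaled by the ratio of actual to training displacement, we assume the wanted displacement is the one that would bring the mean network output to a chosen target value. The function names, the target mean and the list of past displacements are hypothetical.

```python
import numpy as np


def optimal_displacement(depth_map, training_displacement, target_mean_output):
    """Displacement for which the mean network output would equal the target.

    Assumption: output = depth * training_displacement / displacement.
    """
    mean_depth = float(np.mean(depth_map))
    return training_displacement * mean_depth / target_mean_output


def closest_frame_shift(past_displacements, wanted_displacement):
    """Pick the shift whose displacement is closest to the wanted one.

    past_displacements[k - 1] is the distance between the current frame and
    the frame k steps in the past (hypothetical bookkeeping from the sensors).
    """
    past = np.asarray(past_displacements, dtype=float)
    return int(np.argmin(np.abs(past - wanted_displacement))) + 1


depth_map = np.full((4, 4), 200.0)  # last estimated depth, in metres
wanted = optimal_displacement(depth_map, training_displacement=0.5,
                              target_mean_output=50.0)
shift = closest_frame_shift([0.4, 0.9, 1.5, 2.2, 3.0], wanted)
print(wanted, shift)
```

A simple nearest-value search is used here because the displacement is only available at discrete frame shifts; any other numerical search over the stored frames would serve the same purpose.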
IV-B Multiple shifts inference
As neural networks are typically run on massively parallel architectures such as GPUs, multiple depth maps can be computed efficiently at the same time in a batch, especially at low resolution. Batch inference can then be used to compute depth with multiple shifts. These multiple depth maps can then be combined to construct a higher-quality depth map, with high precision at both long and short range. We propose a dynamic range algorithm, described in Fig 6, to compute and combine different depth maps.
Instead of only one optimal displacement from , we use the K-means clustering algorithm [macqueen1967] on the depth map to find a list of clusters on which each shift will focus. The clustering outputs a list of centroids and corresponding and . is an arbitrarily chosen value, usually ranging from to .
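A minimal sketch of the clustering step is given below, using a plain Lloyd-style K-means on the flattened depth values. It is not the implementation used for the experiments: the function name, the quantile-based initialisation and the fixed number of iterations are assumptions made to keep the example short and deterministic.

```python
import numpy as np


def kmeans_1d(values, k, n_iter=20):
    """Cluster scalar depth values into k groups and return sorted centroids."""
    values = np.asarray(values, dtype=float).ravel()
    # Deterministic start: spread the initial centroids over the quantiles.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each value to its nearest centroid.
        distances = np.abs(values[:, None] - centroids[None, :])
        labels = np.argmin(distances, axis=1)
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = values[labels == c]
            if members.size:
                centroids[c] = members.mean()
    return np.sort(centroids)


# Toy depth map: half of the pixels are near (10 m), half are far (100 m).
depth_map = np.concatenate([np.full(50, 10.0), np.full(50, 100.0)]).reshape(10, 10)
print(kmeans_1d(depth_map, k=2))
```

Each centroid can then be turned into a wanted displacement, and from there into a frame shift, in the same way as for the single-shift case.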
The final depth map is then computed by fusing these outputs using a weighted mean for each pixel. Each weight is actually a linear interpolation from to according to the distance of the depth from a target value . That way, the fusion will favor values that are closer to this optimal value. An value is added to solve the fusion when every depth map is off its wanted range.
(5)
(6)
For our use case, we set , , and . is the index of the frame shift, are the spatial indices. Fig 7 shows a result of the proposed algorithm for a batch size of . Notice how the high shift detects buildings while the low shift detects trees.
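The fusion can be illustrated with the following NumPy sketch. It is only one possible reading of the weighted mean described above: the exact weight formula and parameter values are not reproduced here, so the triangular weight, the `half_width` parameter and the default `eps` are assumptions, as are the function and argument names.

```python
import numpy as np


def fuse_depth_maps(outputs, displacements, training_displacement,
                    target, half_width, eps=1e-3):
    """Per-pixel weighted mean of depth maps obtained with different shifts.

    outputs:       array (n_shifts, H, W) of raw network outputs
    displacements: actual displacement used for each shift
    Assumed weight: 1 when the raw output equals `target`, decreasing
    linearly to 0 at a distance of `half_width`, plus a small `eps` so the
    mean stays defined when every map is off its wanted range.
    """
    outputs = np.asarray(outputs, dtype=float)
    scale = np.asarray(displacements, dtype=float) / training_displacement
    depths = outputs * scale[:, None, None]  # metric depth for each shift
    closeness = 1.0 - np.abs(outputs - target) / half_width
    weights = eps + np.clip(closeness, 0.0, 1.0)
    return (weights * depths).sum(axis=0) / weights.sum(axis=0)


# Two shifts looking at the same single pixel; only the first is in range.
outputs = np.array([[[50.0]], [[100.0]]])
fused = fuse_depth_maps(outputs, displacements=[0.5, 0.5],
                        training_displacement=0.5, target=50.0, half_width=40.0)
print(fused)
```

In the toy call the second map is far from the target value, so its weight is reduced to `eps` and the fused depth stays close to the first map's estimate.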
IV-C Clamped DepthNet
Our proposed algorithm suffers from a problem on real-condition videos, because we assume perfect stabilization. On very far objects (e.g. the sky), any minor optical flow caused by a stabilization defect will result in a massive depth error. Moreover, since our network is very good at recognizing shapes and assigning them the same depth everywhere, this can result in the whole sky being computed as relatively close. We thus propose a network designed for a simpler problem: during training on Still Box, we clamp depth from to , with a shift of images (instead of for DepthNet). These new parameters allow the network to focus only on mid-range objects, dismissing close and far objects, whose optical flow is respectively too large and too small. This training workflow is very well suited to multiple-shift depth inference. Every image pair has a dedicated depth range to analyze, so the fusion is not burdened with the redundant data that the wide initial range of DepthNet would produce.
Figure 8 shows results for multiple synthetic x scenes with ground truth, along with inference speed and a small noise added to the camera's initial orientation at each frame. , with being a 3-dimensional random unit vector and a constant fixed to 0.001. We also report the performance of a thin version of our clamped network, which shows better results than DepthNet with 1 plane only in this noisy setup. The thin network has the same depth, but every convolution outputs half the number of feature maps of the original DepthNet. These results have been obtained on a Quadro K2200m powered laptop.
V Conclusion and future work
We proposed a novel way of computing dense depth maps from motion, along with a very comprehensive dataset for stabilized footage analysis and a technique for dynamic-range depth computation in real flight. This algorithm can then be used for depth-based sense-and-avoid algorithms in a very flexible way, in order to cover all kinds of path planning, from collision avoidance to long-range obstacle bypassing.
A more thorough presentation of the results can be viewed in this video. http://perso.enstaparistech.fr/~manzaner/Download/ECMR2017/DepthNetResults.mp4
Future work includes the implementation of such a path planning algorithm, and the construction of a real-condition fine-tuning dataset, using UAV footage and a preliminary thorough offline 3D scan. This would allow us to measure the quality of our network quantitatively on real footage, and not only subjectively as is the case for now. We could also use unsupervised techniques, using reprojection errors as in [zhou2017unsupervised].
We also believe that our network can be extended to reinforcement learning applications that will potentially result in a complete endtoend sense and avoid neural network for monocular cameras.
The major drawback of our algorithm, however, is the need for the scene to be rigid. This is never strictly the case, and even though UAV footage is less prone to moving objects than autonomous driving footage, we will face this issue whenever a moving target is to be followed. To solve this problem, an explicit movement equation for both the camera and the moving targets may have to be computed, as in [2017arXiv170407804V]. In any case, this problem will be a challenge and may not be solvable with fully convolutional networks only, as we did in this article.