DepthNet
PyTorch DepthNet Training on Still Box dataset
We propose a depth map inference system from monocular videos based on a novel navigation dataset that mimics aerial footage from a gimbal-stabilized monocular camera in rigid scenes. Unlike most navigation datasets, the lack of rotation implies an easier structure-from-motion problem, which can be leveraged for tasks such as depth inference and obstacle avoidance. We also propose an architecture for end-to-end depth inference with a fully convolutional network. Results show that, although tied to the camera's intrinsic parameters, the problem is locally solvable and leads to good-quality depth prediction.
Scene understanding from vision is a core problem for autonomous vehicles and for UAVs in particular. In this paper we are specifically interested in computing the depth of each pixel from a pair of consecutive images captured by a camera. We assume that our camera's velocity (and thus its movement between two frames) is known, as most UAV flight systems include a speed estimator; this resolves the scale ambiguity.
Solving this problem could benefit depth-based sense and avoid algorithms for lightweight embedded systems that only have a monocular camera and cannot directly provide an RGBD image. Such devices could then go without heavy or power-hungry dedicated sensors such as ToF cameras, LiDAR or infrared emitters/receivers [hitomi20153d], which would greatly reduce autonomy. In addition, besides some being unable to operate under sunlight (e.g. IR and ToF), most RGBD sensors suffer from range limitations and can be inefficient when long-range trajectory planning is needed [hadsell2009learning]. The faster a UAV is, the longer the range needed to avoid obstacles efficiently. Unlike RGBD sensors, depth from motion is robust to high speeds since it is normalized by the displacement between two frames. Given the difficulty of the task, several learning approaches have been proposed to solve it.
A large number of datasets have been developed to support supervised learning and validation for fundamental vision tasks, such as optical flow [geiger2012we, DFIB15, weinzaepfel:hal00873592], stereo disparity and even 3D scene flow [menze2015object, MIFDB16]. These different measures can help figure out scene structure and camera motion, but they remain low-level in terms of abstraction. End-to-end learning of a quantity with high semantic value, such as three-dimensional geometry, may be hard for totally unrestricted monocular camera movement.
We focus on RGBD datasets that would allow supervised learning of depth, with RGB pairs (preferably with the corresponding displacement) as the input and D as the desired output. Existing RGBD datasets from which depth from motion could be learned are either unrestricted w.r.t. egomotion [firmancvprw2016, sturm12iros], or amount to simple stereo vision, equivalent to lateral movement [geiger2012we, scharstein2002taxonomy].
We thus propose a new dataset, described in Part 3, which aims at bridging the two by assuming that rotation is canceled in the footage, which therefore contains only random translations.
This assumption about videos without rotation appears realistic for two reasons:
Hardware rotation compensation is mainly a solved problem, even for consumer products, with IMU-stabilized cameras on consumer drones or handheld steadycams (Fig 1).
This movement is somewhat related to human vision and the vestibulo-ocular reflex (VOR) [VOR]. The orientation of our eyes is not induced by head rotation: our inner ear, among other biological sensors, allows us to compensate for parasitic rotation when looking in a particular direction.
This assumption dramatically simplifies the link between optical flow and depth and allows much simpler computation. The main benefit is that the dimensionality of the camera movement is reduced from 6 (translation and rotation) to 3 (translation only). However, as discussed in Part 4, depth is not computed as simply as with stereo vision, and requires computing higher abstractions to avoid a possible indeterminate form, especially for forward movements.
Using the proposed dataset, we then show that depth can be learned as an end-to-end problem, just like other usual Deep Learning problems. With a trained artificial neural network, we achieve much better depth accuracy than flow-based methods, and we are confident this can be efficiently leveraged for sense and avoid algorithms.
Sense and avoid problems are mostly approached using a dedicated sensor for 3D analysis. However, some work has tried to leverage optical flow from a monocular camera [souhila2007optical, zingg2010mav]. These works highlight the difficulty of estimating depth solely from flow, especially when the camera points in the direction of movement. One can note that rotation compensation was already used with a fisheye camera in order to obtain a more direct link between flow and depth. Another work [coombs1998real] also demonstrated that basic obstacle avoidance could be achieved in cluttered environments such as a closed room.
Some interesting work on obstacle avoidance from a monocular camera [lecun2005off, hadsell2009learning, michels2005high] showed that single-frame analysis can be more efficient than depth from stereo for path planning. However, these works were not applied to UAVs, for which depth cannot be directly deduced from the distance to the horizon, because obstacles and paths are now three-dimensional.
More recently, Giusti et al. [giusti2016machine] showed that a monocular system can be trained to follow a hiking path. But once again, only 2D movement is addressed: a UAV moving forward is asked to change its yaw based on how likely it is to be following a traced path.
Deep Learning and Convolutional Neural Networks have recently been widely used for numerous kinds of vision problems such as classification [krizhevsky2012imagenet] and handwritten digit recognition [lecun1998gradient]. Depth from vision is one of the problems studied with neural networks, and has been addressed not only with image pairs but also with single images [eigen2014depth, saxena2005learning]. Depth inference from stereo has also been widely studied [luo2016efficient, zbontar2015computing], and not necessarily in a supervised way [DBLP:journals/corr/KondaM13, DBLP:journals/corr/GargBR16].
Current state-of-the-art methods for depth from a monocular view tend to use motion, and especially structure from motion, and most algorithms do not rely on deep learning [cadena2016past, mur2016orb, klein2007parallel]. Prior knowledge of the scene is used to infer a sparse depth map whose density usually grows over time. These techniques, also called SLAM, are typically used with unstructured movement, produce very sparse point-cloud-based 3D maps, and require heavy computation to keep track of the scene structure and align newly detected 3D points with the existing ones. SLAM is not widely used for obstacle avoidance, but rather for offline 3D scanning.
Our goal is to compute a dense, good-quality depth map (where every point has a valid depth) using only two images, and without prior knowledge of the scene and movement, apart from the lack of rotation and the scale factor.
As discussed earlier, numerous datasets exist with depth ground truth, but to our knowledge no dataset proposes only translational movement. Some provide IMU data along with frames [smith2009new], which could be used to compensate rotation, but their small size only allows us to use them as a validation set.
Table 1: Still Box dataset
image size | number of scenes | total size (GB)
64x64 | 80K | 19
128x128 | 16K | 12
256x256 | 3.2K | 8.5
512x512 | 3.2K | 33
Table 2: Scene parameters
field of view
max render distance
number of primitives
texture ratio
size range of meshes (m)
distance range of meshes (m)
displacement
length (frames)
nominal shift
speed equivalent (for 30fps)
For our dataset we used the rendering software Blender to generate an arbitrary number of random rigid scenes, composed of basic 3D primitives (cubes, spheres, cones and tori) randomly textured from an image set scraped from Flickr (see Fig 2).
These objects are randomly placed and sized in the scene so that they are mostly in front of the camera, with possible variations including objects behind the camera, or even the camera inside an object. Scenes in which the camera goes through objects are discarded. To add difficulty, we also applied uniform textures to a proportion of the primitives. Each primitive thus has a uniform probability (corresponding to the texture ratio) of being textured from a color ramp and not from a photograph.
Walls are added at large distances, as if the camera were inside a box (hence the name). The camera moves at a fixed speed but in a random direction (uniform distribution), which is constant for each scene. It can be anything from forward/backward movement to lateral movement (which is then equivalent to stereo vision). Tables 1 and 2 summarize our scene parameters. They can be changed at will, and are stored in a metadata JSON file to keep track of them. Our dataset is composed of 4 sub-datasets with different resolutions, the 64px dataset being the largest in terms of number of samples and the 512px dataset the heaviest in data.

Flow estimation and disparity (which is essentially the magnitude of optical flow vectors) are problems for which many very convincing methods exist [ilg2016flownet, 2017arXiv170304309K]. Knowing depth and displacement in our dataset, we could easily get disparity and train a network for it using existing methods. We consider a picture with coordinates $(u, v)$ and optical center at $P_0 = (u_0, v_0)$. Disparity $\delta(P)$ is defined as the norm of the flow vector of a point $P$.
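The Focus of Expansion (FOE) is defined as the point $\Phi$ from which the flow vector of every point $P$ points away. Note that this property holds only when considering no rotation and a rigid scene. One can note that for a pure translation, the FOE is the projection of the displacement vector onto the image plane.
Theorem 1. For a random rotationless displacement of norm $V$ of a pinhole camera with focal length $f$, depth $\zeta(P)$ is an explicit function of disparity $\delta(P)$, focus of expansion $\Phi$ and optical center $P_0$:
$$\zeta(P) = \frac{V f}{\sqrt{f^2 + \|\overrightarrow{P_0\Phi}\|^2}} \left( \frac{\|\overrightarrow{\Phi P}\|}{\delta(P)} - 1 \right)$$
This result is in a useful form for limit values. Lateral movement corresponds to $\|\overrightarrow{P_0\Phi}\| \to \infty$, and then
$$\zeta(P) = \frac{V f}{\delta(P)}$$
When approaching the FOE, knowing that depth is a bounded positive value, we can deduce:
$$\delta(P) = \frac{\|\overrightarrow{\Phi P}\|}{1 + \zeta(P)\,\sqrt{f^2 + \|\overrightarrow{P_0\Phi}\|^2} / (V f)} \le \|\overrightarrow{\Phi P}\|$$
The limit of disparity in this case is 0, and depth is computed from its inverse. As a consequence, small errors in disparity estimation result in diverging depth values near the focus of expansion, which corresponds to the direction in which the camera is moving. This is clearly problematic for depth-based obstacle avoidance.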
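To make the sensitivity near the FOE concrete, here is a minimal sketch of the rotation-free depth formula in plain Python. It assumes the relation depth = t_z (|FOE to P| / disparity - 1), where t_z is the forward component of the displacement; the intrinsics, displacement and the 0.05 px disparity error are hypothetical values chosen only for illustration.

```python
import math

def forward_component(foe, center, f, v):
    """Forward part of a displacement of norm v, recovered from the FOE position."""
    return v * f / math.sqrt(f ** 2 + math.dist(center, foe) ** 2)

def depth_from_disparity(p, disparity, foe, center, f, v):
    """Depth of pixel p from its disparity, for a rotation-free displacement."""
    t_z = forward_component(foe, center, f, v)
    return t_z * (math.dist(foe, p) / disparity - 1.0)

def disparity_from_depth(p, depth, foe, center, f, v):
    """Inverse relation, used here only to generate consistent values."""
    t_z = forward_component(foe, center, f, v)
    return t_z * math.dist(foe, p) / (depth + t_z)

# Hypothetical setup: focal length 32 px, pure forward motion of 0.3 m.
f, v = 32.0, 0.3
center = foe = (32.0, 32.0)
true_depth = 10.0

for radius in (30.0, 10.0, 2.0):
    p = (foe[0] + radius, foe[1])
    d = disparity_from_depth(p, true_depth, foe, center, f, v)
    noisy = depth_from_disparity(p, d + 0.05, foe, center, f, v)
    print(f"{radius:5.1f} px from FOE: disparity {d:.3f} px, "
          f"depth with +0.05 px error {noisy:.2f} m")
```

The same absolute disparity error costs far more depth accuracy close to the FOE than far from it, which is the motivation for predicting depth directly.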
Given the random direction of our camera's displacement, computing depth from disparity is therefore much harder than for a classic stereo rig. To tackle this problem, we decided to set up an end-to-end learning workflow, training a neural network to explicitly predict the depth of every pixel in the scene from an image pair with a constant displacement norm.
Our data is stored as 10-frame videos, with each frame paired with its ground truth depth. This allows us to set the distance distribution a posteriori, using a variable temporal shift between the two frames. If we use a baseline shift of 3 frames, we can for example assume a depth three times as great for two consecutive frames (shift of 1). In addition, we can also consider a negative shift, which only reverses the displacement direction without changing the speed value compared to the opposite shift. Given a fixed dataset size, this gives more evenly distributed depth values to learn, and also decorrelates images from depth. This helps prevent overfitting during training, which would otherwise result in a scene recognition algorithm that performs poorly on a validation set.
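The shift-based augmentation described above can be sketched as follows. This is only an illustration under stated assumptions: the ground truth is taken from the later frame of the pair, depth is rescaled by `nominal_shift / |shift|` (following the "three times as great" example), and the 100 m clamp and the zero-shift case follow the training description given later in the text. The function and parameter names are hypothetical.

```python
import random

MAX_DEPTH = 100.0  # clamp applied to training targets, in metres

def make_training_pair(depths, index, shift, nominal_shift=3):
    """Return (reference frame index, target frame index, rescaled target depth).

    `depths` is a list of per-frame depth maps (nested lists, metres).
    A pair separated by `shift` frames covers |shift| / nominal_shift times the
    nominal displacement, so the stored depth is rescaled by the inverse ratio.
    """
    if shift == 0:
        # Identical frames carry no parallax: depth is set to the clamp value.
        return index, index, [[MAX_DEPTH] * len(row) for row in depths[index]]
    scale = nominal_shift / abs(shift)
    target = [[min(z * scale, MAX_DEPTH) for z in row] for row in depths[index]]
    # A negative shift picks a later frame as reference, reversing the motion.
    return index - shift, index, target

def sample_shift(index, n_frames, max_shift=3, rng=random):
    """Draw a shift that keeps the reference frame inside the video."""
    valid = [s for s in range(-max_shift, max_shift + 1)
             if 0 <= index - s < n_frames]
    return rng.choice(valid)
```

For a 10-frame video, `make_training_pair(depths, 5, 1)` pairs frames 4 and 5 and triples the target depth before clamping.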
Typical Conv Module 

SpatialConv, 3x3 
SpatialBatchNorm 
ReLU 
Typical ConvTranspose Module 

SpatialConvTranspose, 4x4 
SpatialConv, 3x3 
SpatialBatchNorm 
ReLU 
Our network, called DepthNet, is broadly inspired by FlowNetS [DFIB15] and is described in Fig 3. FlowNetS was initially used for flow inference. The main idea behind this network is that upsampled feature maps are concatenated with the corresponding earlier convolution outputs. Higher semantic information is thus associated with information more closely linked to pixels (since it went through fewer strided convolutions), and the combination is then used for reconstruction.
This has proven very efficient for flow and disparity computation while keeping a very simple supervised learning process. The architecture is admittedly very simple, and one could leverage more advanced work on flow and disparity, such as FlowNetC or GCNet [2017arXiv170304309K] among many others. The main point of this experiment is to show that direct depth estimation can be beneficial when the translation is unknown. Like FlowNetS, we use a multiscale criterion, with an L1 reconstruction error for each scale.
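$$L = \sum_{s \in \text{scales}} \gamma_s \frac{1}{H_s W_s} \sum_{i=1}^{H_s} \sum_{j=1}^{W_s} \left| \beta_s(i,j) - \zeta_s(i,j) \right| \qquad (1)$$
where
$\gamma_s$ is the weight of scale $s$, chosen arbitrarily in our experiments.
$(H_s, W_s)$ are the height and width of the output.
$\zeta_s$ is the scaled depth ground truth, using average pooling.
$\beta_s$ is the network output at scale $s$.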
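As an illustration of the multiscale criterion, here is a minimal dependency-free sketch: the ground truth is reduced to each output resolution with average pooling, and the per-scale mean absolute errors are combined with per-scale weights. The function names and the weights are hypothetical, and integer scale ratios are assumed. An actual training loop would use tensor operations instead of nested lists.

```python
def average_pool(depth, factor):
    """Downscale a 2D map (nested lists) by averaging factor x factor blocks."""
    h, w = len(depth) // factor, len(depth[0]) // factor
    return [[sum(depth[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor)) / factor ** 2
             for j in range(w)]
            for i in range(h)]

def multiscale_l1(outputs, ground_truth, weights):
    """Weighted sum over scales of the mean absolute depth error."""
    total = 0.0
    for out, gamma in zip(outputs, weights):
        factor = len(ground_truth) // len(out)  # assumes integer scale ratios
        target = average_pool(ground_truth, factor)
        h, w = len(out), len(out[0])
        l1 = sum(abs(out[i][j] - target[i][j])
                 for i in range(h) for j in range(w))
        total += gamma * l1 / (h * w)
    return total

ground_truth = [[1, 1, 3, 3],
                [1, 1, 3, 3],
                [5, 5, 7, 7],
                [5, 5, 7, 7]]
outputs = [[[2.0, 4.0], [6.0, 8.0]],  # half resolution, off by 1 m everywhere
           [[4.0]]]                   # quarter resolution, exact
print(multiscale_l1(outputs, ground_truth, weights=[1.0, 0.5]))
```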
As said earlier, we apply data augmentation to the dataset using different shifts, along with classic methods such as flips and rotations. We also clamp depth to a maximum of 100m, and provide sample pairs without shift, assuming their depth is 100m everywhere.
Fig 4 shows results from the 64px dataset. As with FlowNetS, results are downsampled by a factor of 4, which gives 16x16 depth maps.
One can notice that although the network is still fully convolutional, feature map sizes go down to 1x1, at which point the layer behaves exactly like a fully connected layer. This can serve to implicitly figure out the motion direction and spread this information across the outputs. The second noticeable fact is that near the FOE (see Fig 5 for a centered FOE, i.e. perfect forward movement) the network has no problem inferring depth, which means that it uses neighboring disparity and interpolates when no other information is available.
This can be interpreted as 3D shape identification, along with shape magnification: pixels belonging to the same shape are deemed to have close and continuous depth values, resulting in FOE-independent depth inference.
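The remark on feature maps shrinking to 1x1 can be checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below assumes six stride-2 3x3 convolutions with padding 1 in the encoder and four stride-2 4x4 transposed convolutions with padding 1 in the decoder. These counts and paddings are assumptions borrowed from the FlowNetS layout, not values confirmed by the text.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def encoder_sizes(size, n_strided=6):
    """Feature map sizes through successive strided convolutions."""
    sizes = [size]
    for _ in range(n_strided):
        sizes.append(conv_out(sizes[-1]))
    return sizes

def decoder_sizes(size, n_up=4):
    """Feature map sizes through successive transposed convolutions."""
    sizes = [size]
    for _ in range(n_up):
        sizes.append(conv_transpose_out(sizes[-1]))
    return sizes

print(encoder_sizes(64))   # [64, 32, 16, 8, 4, 2, 1]
print(decoder_sizes(1))    # [1, 2, 4, 8, 16]
print(encoder_sizes(512))  # [512, 256, 128, 64, 32, 16, 8]
```

Under these assumptions a 64x64 input reaches a 1x1 bottleneck and a 16x16 output, whereas a 512x512 input only reaches 8x8, so the bottleneck no longer sees the whole frame at once.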
Network  L1Error  RMSE  

train  test  train  test  
FlowNetS  
DepthNet  
FlowNetS  2.44  4.77  
DepthNet  
DepthNet  2.44  
DepthNet  4.90  
DepthNet  
DepthNet 
Network  size  980Ti 

TX1  
FlowNetS  
DepthNet  7.33  
FlowNetS  N/A  N/A  
DepthNet  7.33  
DepthNet  7.33  
DepthNet  7.33  N/A 
One could think that a fully convolutional network such as ours cannot solve depth extraction for pictures larger than 64x64. The main idea is that in a fully convolutional network, the same operation is applied to each pixel. For disparity, this makes sense because the problem is essentially one of similarity between shifted pictures: wherever we are in the picture, the operation is the same. For depth inference when the FOE is not at infinity (forward movement is non-negligible), the result from Theorem 1 shows that once the FOE is known, the operation to apply depends on the distance from it and from the optical center $P_0$. The only possible strategy for a fully convolutional network would be to compute the position in the frame as well and to apply the compensating scaling to the output.
This problem then seems very difficult, if not impossible, for a network as simple as ours, and if we run the training directly on 512x512 images, the network fails to converge to better results than with 64x64 images (while a higher resolution should help precision). However, if we take the converged network and fine-tune it on 512x512 images, we get much better results. Fig 6 shows training results for mean L1 reconstruction error, and shows that our deemed-impossible problem seems to be easily solved with multiscale fine-tuning. As Table 3 shows, the best results are obtained with multiple fine-tuning steps, with intermediate scales of 64, 128, 256, and finally 512 pixels. Subscript values indicate fine-tuning processes. FlowNetS performs better than DepthNet, but by a fairly small margin, while being 5 times heavier and most of the time much slower, as shown in Table 4.
Fig 7 shows qualitative results from our validation set and from real-condition drone footage, for which we were careful to avoid camera rotation. These results did not benefit from any fine-tuning on real footage, indicating that our Still Box dataset, although not realistic in its scene structures and rendering, appears to be sufficient for learning to produce decent depth maps in real conditions.
As our network leverages the reduced dimensionality of our dataset due to its lack of rotation, it is hard to compare our method to anything else. Disparity estimation is equivalent to a lateral translation, which our network has been trained on, and could be used for comparison with other algorithms, but this reduced context seems unfair compared to methods designed especially for it.
Other datasets provide 6-DOF egomotion, on which our network has not been trained and is certain to give poor results. On the other hand, we could test some SLAM methods, but they work better when applied to long image sequences and not only image pairs. In short, our method sets the state of the art, but for a very particular problem that we hope will gain interest with time.
We trained depth inference from a moving camera assuming that its velocity is always the same. During flight, such a system can easily deduce the real depth map $\zeta$ from the network output $\beta$ and the drone speed $v$, knowing that the training speed was $v_0$:
$$\zeta = \frac{v}{v_0}\,\beta \qquad (2)$$
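A minimal sketch of this rescaling, assuming the network output is a 2D depth map in metres and that both speeds are expressed in the same unit. The function name is hypothetical.

```python
def real_depth(net_depth, drone_speed, train_speed):
    """Scale a predicted depth map (nested lists) by the speed ratio v / v0."""
    if train_speed <= 0:
        raise ValueError("train_speed must be positive")
    ratio = drone_speed / train_speed
    return [[ratio * z for z in row] for row in net_depth]

# A drone flying twice as fast as the training speed sees depths twice as large.
print(real_depth([[10.0, 50.0]], drone_speed=6.0, train_speed=3.0))
```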
One of the drawbacks of this learning method is that the value $f$ of our camera (the focal length divided by the sensor size per pixel) must be the same as the one used in training. Our dataset creation framework, however, allows us to change this value very easily for training. One must also make sure to have pinhole-equivalent frames, as during training.
Depending on the depth distribution of the ground truth depth map, it may be useful to adjust the frame shift. For example, when flying high above the ground, detecting and avoiding big structures requires knowing precise distance values that are outside the typical range of any RGBD sensor. The logical strategy would then be to increase the temporal shift between the frame pairs provided to DepthNet as inputs.
More generally, one must ensure a well-distributed depth map from 0 to 100m to get high-quality depth inference. This problem can be solved with two solutions (among others):
Deduce the optimal shift $\Delta_{n+1}$ from the distribution of the previous inference, e.g.:
$$\Delta_{n+1} = \Delta_n \frac{E(\beta_n)}{E_0}$$
where $E_0$ is 50m (because our network outputs depths from 0 to 100m) and $E(\beta_n)$ is the mean of the previous output, i.e.:
$$E(\beta_n) = \frac{1}{H W} \sum_{i,j} \beta_n(i,j)$$
Use batch inference to compute depth with multiple shifts $\Delta_i$. As shown in Table 4, a batch size greater than 1 can be used to some extent (especially for low resolutions) to efficiently compute multiple depth maps.
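The first strategy can be sketched as a simple feedback rule: the next shift is the current one scaled by the ratio between the mean predicted depth and the 50 m target, then rounded to a whole number of frames. The rounding and the `max_shift` bound are assumptions added so that the result is a usable frame offset.

```python
def mean_depth(depth_map):
    """Mean value of a 2D depth map given as nested lists."""
    values = [z for row in depth_map for z in row]
    return sum(values) / len(values)

def next_shift(current_shift, depth_map, target_mean=50.0, max_shift=10):
    """Frame shift that moves the mean predicted depth toward target_mean."""
    ideal = current_shift * mean_depth(depth_map) / target_mean
    return max(1, min(max_shift, round(ideal)))

# Everything looks far away (mean 100 m) with a shift of 2: double the shift.
print(next_shift(2, [[100.0, 100.0]]))  # 4
# Everything looks close (mean 15 m) with a shift of 3: fall back to 1 frame.
print(next_shift(3, [[10.0, 20.0]]))    # 1
```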
These multiple depth maps can then either be combined to construct a high-quality depth map, or used separately to run two different obstacle avoidance algorithms, e.g. one dedicated to long-range path planning (with a high shift value) and the other to reactive, short-range collision avoidance (with a low shift value). While one depth map will display closer areas at zero distance but more distant regions with precision, the other will set far regions to infinity (or 100m for DepthNet) but render closer regions with high resolution, since flow is lower than with a high shift and potentially within the range the network has been trained on.
We propose a novel way of computing dense depth maps from motion, along with a very comprehensive dataset for stabilized footage analysis. This algorithm can then be used for depth-based sense and avoid algorithms in a very flexible way, in order to cover all kinds of path planning, from collision avoidance to long-range obstacle bypassing.
Future work includes the implementation of such a path planning algorithm, and the construction of a real-condition fine-tuning dataset, using UAV footage and a preliminary, thorough offline 3D scan. This would allow us to measure the quality of our network quantitatively on real footage, and not only subjectively as is the case for now.
We also believe that our network can be extended to reinforcement learning applications, which could potentially result in a complete end-to-end sense and avoid neural network for monocular cameras.
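Theorem 1. For a random rotationless displacement of norm $V$, depth $\zeta(P)$ is an explicit function of disparity $\delta(P)$, focus of expansion $\Phi$ and optical center $P_0$:
$$\zeta(P) = \frac{V f}{\sqrt{f^2 + \|\overrightarrow{P_0\Phi}\|^2}} \left( \frac{\|\overrightarrow{\Phi P}\|}{\delta(P)} - 1 \right)$$
Proof. Let A and B be the two camera positions, with focal length $f$ and optical center $P_0 = (u_0, v_0)$. We assume no rotation, which means that $\Phi$ is the projection of B on A.
Let $m = (m_x, m_y, m_z)$ be the displacement from A to B, with $m_z > 0$. Then
$$\Phi = \left( u_0 + f \frac{m_x}{m_z},\; v_0 + f \frac{m_y}{m_z} \right) \qquad (3)$$
Let $P$ be a point with coordinates $(X, Y, Z)$ in the frame of camera B. For camera B we have
$$u_B = u_0 + f \frac{X}{Z}, \qquad v_B = v_0 + f \frac{Y}{Z}$$
The relative movement of $P$ is $-m$, so for camera A we have
$$u_A = u_0 + f \frac{X + m_x}{Z + m_z}, \qquad v_A = v_0 + f \frac{Y + m_y}{Z + m_z}$$
If we compute the flow for $u$:
$$u_B - u_A = f \frac{X m_z - Z m_x}{Z (Z + m_z)} = \frac{m_z}{Z + m_z} \left( u_B - u_0 - f \frac{m_x}{m_z} \right)$$
Similarly, with $v$, we get:
$$\overrightarrow{P_A P_B} = \frac{m_z}{Z + m_z}\, \overrightarrow{\Phi P_B} \qquad (4)$$
We consider disparity $\delta(P)$ as the norm of the flow expressed in frame B (which is correlated to depth at this frame).
$$\delta(P) = \frac{m_z}{Z + m_z}\, \|\overrightarrow{\Phi P}\| \qquad (5)$$
Consequently, we can deduce depth at frame B, $\zeta(P) = Z$, from disparity:
$$\zeta(P) = m_z \left( \frac{\|\overrightarrow{\Phi P}\|}{\delta(P)} - 1 \right) \qquad (6)$$
From our dataset construction, we know that $\|m\| = V$. Let us call $d_0 = \|\overrightarrow{P_0\Phi}\|$.
From (3), we get:
$$d_0 = f \frac{\sqrt{m_x^2 + m_y^2}}{m_z} = f \frac{\sqrt{V^2 - m_z^2}}{m_z} \quad \Rightarrow \quad m_z = \frac{V f}{\sqrt{f^2 + d_0^2}}$$
and then from (6) we get
$$\zeta(P) = \frac{V f}{\sqrt{f^2 + d_0^2}} \left( \frac{\|\overrightarrow{\Phi P}\|}{\delta(P)} - 1 \right) \qquad (7)$$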
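The result can be sanity-checked numerically by projecting a 3D point into two rotation-free pinhole cameras and recovering its depth from the measured disparity. The intrinsics, displacement and point below are hypothetical values, and the check assumes a positive forward component of the displacement.

```python
import math

def project(point, f, center):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (center[0] + f * x / z, center[1] + f * y / z)

f, center = 32.0, (32.0, 32.0)   # hypothetical intrinsics, in pixels
m = (0.1, -0.05, 0.25)           # displacement from camera A to camera B, metres
point_b = (1.5, 0.8, 12.0)       # 3D point expressed in camera B's frame
point_a = tuple(p + d for p, d in zip(point_b, m))  # same point in A's frame

p_a = project(point_a, f, center)
p_b = project(point_b, f, center)
foe = project(m, f, center)      # FOE is the projection of the displacement
disparity = math.dist(p_a, p_b)  # norm of the flow vector

v = math.sqrt(sum(c * c for c in m))
m_z = v * f / math.sqrt(f ** 2 + math.dist(center, foe) ** 2)
depth = m_z * (math.dist(foe, p_b) / disparity - 1.0)
print(depth)  # equals the z coordinate of point_b up to floating-point error
```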