[CVPR 2020] Estimation of the visible and hidden traversable space from a single color image
Understanding the shape of a scene from a single color image is a formidable computer vision task. However, most methods aim to predict the geometry of surfaces that are visible to the camera, which is of limited use when planning paths for robots or augmented reality agents. Such agents can only move when grounded on a traversable surface, which we define as the set of classes which humans can also walk over, such as grass, footpaths and pavement. Models which predict beyond the line of sight often parameterize the scene with voxels or meshes, which can be expensive to use in machine learning frameworks. We introduce a model to predict the geometry of both visible and occluded traversable surfaces, given a single RGB image as input. We learn from stereo video sequences, using camera poses, per-frame depth and semantic segmentation to form training data, which is used to supervise an image-to-image network. We train models from the KITTI driving dataset, the indoor Matterport dataset, and from our own casually captured stereo footage. We find that a surprisingly low bar for spatial coverage of training scenes is required. We validate our algorithm against a range of strong baselines, and include an assessment of our predictions for a path-planning task.
Computerized agents, for example a street cleaning robot or an augmented reality character, need to know how to explore both the visible and the hidden, unseen world. For AR agents, all paths must be planned and executed without camera egomotion, so no new areas of the real scene are revealed as the character moves. This makes typical approaches for path planning in unknown [65, 74] and dynamic environments less effective.
We introduce Footprints, a model for estimating both the visible and hidden traversable geometry given just a single color image (Figure 1). This enables an agent to know where they can walk or roll, beyond the immediately visible surfaces. Importantly, we model not just the surfaces’ overall shapes and geometry [21, 22], but also where moving and static objects in the scene preclude walking. We refer to these occupied regions of otherwise traversable surfaces as object footprints.
Previous approaches rely on bounding box estimates [27, 36, 57], which are limited to cuboid object predictions. Other approaches to estimating missing geometry have required complete, static training environments, which have either been small in scale  or synthetic [6, 63]. Surprisingly, our method can create plausible predictions of hidden surfaces given only partial views of real moving scenes at training time. We make three contributions:
We introduce a lightweight representation for hidden geometry estimation from a single color image, with a method to learn this from video depth data.
We present an algorithm to learn from videos with moving objects and incomplete observations of the scene, through masking of moving objects, a prior on missing data, and use of depth to give additional information.
We have produced human-annotated hidden surface labels for all 697 images in the KITTI test set. These are available to download from the project website. We also introduce evaluation methods for this task.
Our method is related to prior work in robotics, path planning, and geometry estimation and reconstruction.
If multiple camera views of a scene are available, camera poses can be found and a 3D model of a static scene can be reconstructed. The addition of a segmentation algorithm enables the floor surface geometry to be found [1, 41]. In our work, we make floor geometry predictions given just a single image as input. Other multi-view approaches include occupancy maps in 2D and 3D [46, 67, 75], where new observations are fused into a single map.
The planning of paths of virtual characters or robots in environments with known geometry is a well-studied problem [5, 18, 33, 54, 66]. Our prediction of walkable surfaces beyond the line of sight shares concepts with works which allow for path planning in environments where not all geometry can be observed [65, 74]. Gupta et al. learn to plan paths with a walkable geometry belief map similar to our world model, while others learn potential navigable routes for a robot from watching video. Rather than directly planning paths, though, in our work we directly learn and predict geometry, which is useful for path planning and more.
A well-studied task for geometry estimation is the prediction of a depth map given a single color image as input. The best results here come from supervised learning, e.g.[9, 14]. Acquiring supervised data for geometry estimation is hard, however, so a popular approach is self-supervised learning, where training data can be monocular [20, 52, 79] or stereo [15, 19, 49, 76] images. Depths are learned by minimising a reprojection loss between a target image and a warped source view. Like these works, we also learn from arbitrary videos to predict geometry, but our geometry predictions extend beyond the line of sight of the camera.
We fall into the category of works which predict geometry for parts of the scene which are not visible in the input view. For example, [48, 64] perform view extrapolation, where semantics and geometry outside the camera frustum are predicted. In contrast, we make predictions for geometry which is inside the camera frustum, but which is occluded behind objects in the scene.
Predicting the occupancy of unobserved voxels from a single view is one popular representation for hidden geometry prediction [6, 10, 63]. Training data for dense scene completion is difficult to acquire, though, often making synthetic data necessary [6, 63]. Further, voxels can be slow to process and their computation hard to scale for geometry prediction, making their use in real-time or on mobile platforms difficult. Meshes are a more lightweight representation, but incorporating meshes in a learning framework is still an active research topic; a typical approach is to go via an intermediary voxel representation. A complementary source of information is physical stability as a cue to complete scenes.
Recent works have taken a lightweight approach to predicting hidden scene structure by decomposing the visible image into layers of color and depth behind the immediately visible scene [8, 40, 60, 68]. Similarly, amodal segmentation [12, 51, 80] aims to predict overlapping semantic instance masks which extend beyond the line of sight. However, amodal segmentation does not label the contact points necessary to know the location of objects: it would label a ‘traversable surface’ as continuous under a car or person.
Similar to amodal segmentation are approaches that predict the floor map from a single color image, for example [55, 71]. Likewise, [21, 22] complete support surfaces in outdoor and indoor scenes respectively. The aim of these approaches is to predict support surfaces as if all objects were absent (Figure 2(c)), akin to amodal segmentation, while we aim to predict the walkable floor surface taking obstacles into account (Figure 2(d)). The Manhattan layout assumption can be useful to help infer the ground surface in indoor scenes (e.g. [27, 35, 36, 57]); however, it is less applicable outdoors. Our task is motivated by prior work, though our approach is novel.
One method to estimate the full extent of partially observed objects is via 3D detection, for example 3D bounding boxes [32, 37, 39, 53, 62]. Generic object bounding box detectors have been used to estimate indoor free space [28, 36, 57]. Bounding boxes only give convex footprints for ‘things’ in the image, so they are not suitable for the geometry of ‘stuff’ such as walls, piles of items, or shrubbery. To the best of our knowledge, object detection has not been effectively combined with amodal segmentation to give traversable surfaces. We compare to recent object detection baselines and show that our approach is better suited to our task (Section 5). Another detection approach is to fit 3D human models to help estimate the hidden layout [13, 42], while our aim also has similarities to work which recovers the places in a scene a human can stand, sit and reach. Such methods often operate with a static scene assumption and work best when the whole scene has been “explored” by the humans.
In comparison to these related works, we predict the hidden and visible traversable surfaces from a single image, taking all obstacles (whether ‘things’ or ‘stuff’) into account.
Our goal is to predict both the visible and hidden traversable surface for a single color image. A surface is defined as traversable if it is visually identifiable as one of a predefined set of semantic classes, listed in our supplementary material. The visible traversable surface can be represented with two single-channel maps:
A visible ground segmentation mask, marking the pixels which depict a traversable surface directly visible to the camera.
A visible depth map, giving the distance from the camera to each visible pixel in the scene.
Together, these two channels model the extent and geometry of all the visible ground which can be traversed – Figure 2(b). However, to know how an agent could move through areas of the scene beyond the line of sight, we also need to model geometric information about ground surfaces which are occluded by objects. To this end, our representation also incorporates two channels which model the hidden traversable surface:
A hidden ground segmentation mask, which represents the extent of the entire traversable floor surface inside the camera frustum, including occluded parts. Each pixel is 1 if the camera ray associated with that pixel intersects a walkable surface at any point (even behind objects visible in this view) and 0 otherwise. This can also be seen as a top-down floor map reprojected into the camera view.
A hidden depth map, which gives the geometry of the hidden ground surface. Each pixel contains the depth from the camera to the (visible or hidden) ground along that pixel’s ray. If the camera ray at a pixel does not intersect any traversable surface, the hidden depth at that pixel carries no valid value.
Our four-channel representation is a rich world model which enables many tasks in robotics and augmented reality, while being lightweight and able to be predicted by our standard image-to-image network.
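The four-channel representation described above can be sketched as a set of aligned image-space maps. The container and field names below are illustrative, not the paper's notation; the resolution is an arbitrary example.

```python
import numpy as np

# Hypothetical sketch of the four-channel Footprints representation.
# All channels share the input image's spatial resolution, so the output
# stays pixel-wise aligned with the input.
H, W = 192, 640
footprints = {
    "visible_ground": np.zeros((H, W), dtype=bool),        # visible traversable mask
    "visible_depth":  np.zeros((H, W), dtype=np.float32),  # depth to visible surface
    "hidden_ground":  np.zeros((H, W), dtype=bool),        # full floor mask, incl. occluded parts
    "hidden_depth":   np.zeros((H, W), dtype=np.float32),  # depth to the (possibly hidden) ground
}
```

Because all four channels are plain image-space arrays, they can be produced by a single image-to-image network head with four output channels.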
A semantic segmentation algorithm also gives us the pixels which an agent could walk on, but only those which are visible to the camera. Our model also represents the location of walkable ground surfaces which are not visible to the camera.
Assuming a planar floor surface, fitting a plane to the visible ground would give an estimate of the geometry of the walkable surface. However, this planar model does not give the extents of the walkable surface, meaning an agent traversing the scene would walk into objects.
Our image-space predictions are lightweight and memory efficient, and furthermore the output is pixel-wise aligned with the input image. Given that our main focus is on where an agent can walk, our representation is the minimal one necessary.
We could represent the world in top-down view instead of in reprojected camera space. While this would allow us to model the world outside the camera frustum, we would add complexity, with more complicated training and reliance on good test-time camera-pose estimation.
It is possible to estimate the visible channels using off-the-shelf prediction models. However, training a model to estimate the hidden channels requires additional sources of information. Human labeling is expensive and difficult to do at scale, since we would be asking an annotator to label occluded parts of a scene. Instead, we exploit two readily available sources of information: freely captured video and depth data. We use these to divide the pixels of each training image into three disjoint sets: the indices of pixels which are deemed to be traversable; the indices of pixels which we are confident cannot be traversed; and the indices of pixels about which we have no information. This last, unknown set arises from our use of freely captured video for training: some areas of the scene are never observed, and we have no information about whether those regions are traversable or not.
Freely captured video is easy to obtain and gives us the ability to generate training data for geometry behind visible objects. We use other frames in the video to provide information about the geometry and extent of the walkable surface, by projecting observations from each frame back into the target camera.
We use off-the-shelf tools to estimate camera intrinsics, per-frame depth maps, and relative camera poses between each source frame and the target frame. We then forward warp [56, 70] the depth values of traversable pixels from the source frame into the target frame. This results in a sparse depth map representing the geometry and extent of the traversable ground visible in the source frame, rendered from the target viewpoint. We repeat this forward warping for nearby frames, obtaining a set of reprojected ground depth maps.
Due to inaccuracies in floor segmentation, depth maps, and camera poses, many of the reprojected floor map images will contain errors. We therefore perform a robust aggregation of the multiple noisy segmentation and depth maps to form a single training image. Our traversable label set T is formed from the pixels for which at least K of the reprojected depth maps contain a nonzero value, i.e.

T = { i ∈ Ω : Σ_s [ D̃_s(i) > 0 ] ≥ K },

where [·] is the Iverson bracket, Ω is the set of all pixel indices in this image, D̃_s(i) is the i-th pixel of the s-th reprojected depth map, and K is a count threshold. See Figure 3 for an overview.
We subsequently obtain our ground depth map by taking the median depth value (ignoring zeros) at each pixel, if and only if there is at least one valid reprojected depth value at that location:

D(i) = median over { D̃_s(i) : D̃_s(i) > 0 }, for pixels i in the traversable label set,

where D̃_s denotes the s-th reprojected ground depth map.
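The vote-and-median aggregation described above can be sketched as follows. The `min_votes` parameter stands in for the paper's count threshold, whose value is not specified here.

```python
import numpy as np
import warnings

def aggregate_reprojections(depth_maps, min_votes=3):
    """Robustly aggregate N noisy reprojected ground-depth maps (zeros = no data).

    Returns a traversable mask (pixels supported by at least `min_votes`
    nonzero reprojections) and the per-pixel median of the nonzero depths.
    """
    stack = np.stack(depth_maps, axis=0)           # (N, H, W)
    valid = stack > 0
    traversable = valid.sum(axis=0) >= min_votes   # Iverson-bracket vote count
    # Median over nonzero entries only: replace zeros with NaN, then nanmedian.
    masked = np.where(valid, stack, np.nan)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN pixels warn
        ground_depth = np.nanmedian(masked, axis=0)
    ground_depth = np.where(traversable, ground_depth, 0.0)
    return traversable, np.nan_to_num(ground_depth)
```

Pixels failing the vote threshold receive a zero depth, matching the "no valid value" convention used for the sparse reprojected maps.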
We supervise our predictions with a loss computed against these aggregated training maps.
While the traversable label set is constructed from depth images of multiple source frames, models trained on it alone typically estimate object footprint boundaries incorrectly, often entirely missing the footprints of thin objects such as poles and pedestrians. Such mistakes are due to inaccuracies in camera pose tracking, traversable segmentation masks and visible depth maps, resulting in occasionally poor reprojections into the target frame that are not excluded by our robust aggregation. To tackle this problem, we exploit depth data from the target image to estimate the set of pixels in the image which are definitely not traversable. Subsequently, we redefine the traversable label set to exclude these pixels.
To find this non-traversable set, we first project all points in the depth map from camera space into world space. Next, we fit a plane, using RANSAC, to those points which are classified as visible ground in our segmentation mask. We then move each point in the world along the normal vector of the plane so that it lies on the plane, and ‘splat’ it in a small grid around the resulting position. After reprojecting these points back into camera space, we apply a filtering step (see supplementary material for details) to remove erroneous regions, obtaining the set of non-traversable pixels. An example is shown in Figure 4(c).
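The step of moving points along the plane normal onto the ground plane can be sketched as below. The RANSAC fit itself is omitted; here the plane (n·x + d = 0) is assumed to be already estimated.

```python
import numpy as np

def project_to_plane(points, plane_n, plane_d):
    """Move 3-D points along the plane normal so they lie on the plane
    n.x + d = 0, as in the ground-plane 'splatting' step described above.

    points:  (N, 3) array of world-space points.
    plane_n: (3,) plane normal (need not be unit length).
    plane_d: plane offset scalar.
    """
    plane_n = plane_n / np.linalg.norm(plane_n)
    dist = points @ plane_n + plane_d            # signed distance to the plane
    return points - dist[:, None] * plane_n[None, :]
```

Each projected point would then be splatted into a small neighbourhood and reprojected into the camera to mark non-traversable pixels.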
The computation of the traversable and non-traversable sets utilizes multiple frames, and makes the significant and unrealistic assumption that our training data comes from a static world, when in fact many objects will undergo significant motion between the source and target frames. To combat this, we identify and remove pixels from our training loss which are associated with moving objects. We could use semantic segmentation to remove non-static object classes, such as cars; however, this would prevent us learning about the hidden geometry of any cars, including parked ones. We could train on static scenes only, but would be limited by the availability of existing general-purpose datasets. We instead observe that most classes of moving objects are static at least some of the time. For example, while it is hard to learn the geometry of a moving vehicle, we can learn the shape of parked cars and apply this knowledge to moving cars at test time. Similarly, footprints of humans can be learned by observing those which are relatively static during training.
We compute a per-pixel binary mask which is zero for pixels depicting non-static objects. To compute this mask for a given frame, we compute the induced flow [69, 81] from that frame to a nearby frame, using the estimated depth and camera motion; this estimates where pixels would move under a static-scene assumption. We also separately estimate frame-to-frame optical flow. Pixels where the induced and optical flow differ are often pixels on moving objects: we set the mask to zero where the endpoints of the two flow maps differ by more than a threshold number of pixels, and to one otherwise. An example is shown in Figure 4(b).
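The induced-versus-optical flow consistency check can be sketched as follows. The `max_px` threshold is illustrative; the paper's value is not specified here.

```python
import numpy as np

def static_pixel_mask(induced_flow, optical_flow, max_px=1.0):
    """Binary mask that is 0 where a pixel likely belongs to a moving object.

    induced_flow / optical_flow: (H, W, 2) arrays of per-pixel (dx, dy).
    Pixels whose two flow endpoints disagree by more than `max_px` pixels
    are flagged as moving and excluded from the training loss.
    """
    disagreement = np.linalg.norm(induced_flow - optical_flow, axis=-1)
    return (disagreement <= max_px).astype(np.uint8)
```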
Our training loss comprises four parts, one for each output channel.
Segmentation masks — these are supervised using a standard binary cross-entropy loss.
Hidden depths — these are also supervised with the log loss, but we only apply the loss at pixels in the traversable label set.
Our final loss is the sum of each subloss over all pixels.
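A minimal sketch of one mask/depth loss pair follows: binary cross-entropy on a segmentation channel, plus a depth loss applied only at pixels with a valid training label. The L1-on-log-depth form of the depth term is an assumption here; the paper only states that a log loss is used.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Per-pixel BCE; predictions are clipped away from 0/1 for stability."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def footprints_loss(pred_mask, true_mask, pred_logdepth, true_logdepth, supervised):
    """Sketch of one mask/depth loss pair.

    `supervised` is a 0/1 mask: depth supervision applies only where a valid
    (e.g. traversable, non-moving) training label exists.
    """
    seg = binary_cross_entropy(pred_mask, true_mask)
    depth = np.abs(pred_logdepth - true_logdepth) * supervised
    return seg.sum() + depth.sum()
```

In the full model this would be summed over the visible and hidden channels to give the final training loss.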
To generate training signals for KITTI and our casually captured stereo data, camera extrinsics and intrinsics are estimated using ORB-SLAM2, while depth maps are inferred from stereo pairs. Segmentation masks are estimated using a simple image-to-image network trained on the ADE20K and Cityscapes datasets, and optical flow is estimated using [30, 47]. Our network architecture is an existing image-to-image design, modified to predict four sigmoided output channels. We adjust our training resolution to approximately match the aspect ratio of each dataset’s images (Matterport, KITTI, and our own stereo data). For Matterport, camera intrinsics, relative locations, and depth maps are provided, so we need only estimate segmentation masks; we do so using the same pretrained network, finetuned on a small subset of 5,000 labelled Matterport images. Except in some ablations, hyperparameters are kept fixed across experiments.
[Figure 6 columns: Input | Ground truth | Ours | Ours no depth mask | Bounding box]
Our experiments validate our approach by:
(1) quantifying the accuracy of our predictions indoors and outdoors (Matterport and KITTI),
(2) illustrating their quality across different scenarios,
(3) ablating to gauge the benefits of different design decisions, and
(4) illustrating a use-case where Footprints are used for path planning (Section 6).
We focus our evaluation here on hidden traversable surface estimation; the remaining output channels are evaluated in the supplementary material.
There are two aspects of predictions which are of interest: (1) the ability to estimate the overall extent of the traversable freespace in the image, and (2) the ability to estimate the footprint base of objects in the scene which must be avoided. To capture these we introduce two evaluation settings. The first, freespace evaluation, addresses (1) by evaluating our thresholded hidden-ground prediction over all pixels in the image using the standard binary detection metrics of IoU and F1. The second, footprints evaluation, addresses (2): here we focus on object footprints by evaluating only within the ground region. To evaluate all methods equivalently, we evaluate within the true ground segmentation (KITTI) and the convex hull of the true visible ground (Matterport) — see Figure 5.
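The binary detection metrics used in both evaluation settings can be sketched as below; the optional `eval_region` mask corresponds to restricting the evaluation to the ground region for the footprints setting.

```python
import numpy as np

def iou_f1(pred, gt, eval_region=None):
    """Binary-detection IoU and F1, optionally restricted to an evaluation region.

    pred, gt: boolean arrays of the same shape.
    eval_region: optional boolean mask selecting the pixels to evaluate.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    if eval_region is not None:
        pred, gt = pred[eval_region], gt[eval_region]
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / max(tp + fp + fn, 1)          # guard against empty union
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    return iou, f1
```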
We compare against several baselines, to demonstrate the efficacy of our method across tasks:
Visible ground — the prediction is set to the visible ground mask.
Convex hull — we estimate the hidden ground as the convex hull of the visible ground mask.
Bounding box — footprints of objects are estimated using 3D bounding box detectors, one for outdoor scenes and one for indoor scenes; indoors we evaluate both the ‘ScanNet’ and ‘SunRGBD’ models. Estimated object footprints are subtracted from the convex hull baseline to form the final prediction. Unlike our method, these detectors make predictions with access to the structured-light-inferred depth map at test time; we include their state-of-the-art results as an upper bound on what a bounding box method could achieve.
Voxel SSCNet — on indoor scenes, we use SSCNet to estimate the voxelized scene from a depth input. Voxels estimated as ‘floor’ are reprojected into the camera.
Depth masks only — we train a model to estimate footprints using only depth images at training time, without our multi-frame reprojection. For this we train a binary classifier to predict whether it expects each pixel to be non-traversable, and subtract these pixels from the convex hull.
Table 1: Freespace and footprint evaluation on KITTI.
| Method | Freespace IoU | Freespace F1 | Footprint IoU | Footprint F1 |
| Bounding box | 0.794 | 0.879 | 0.187 | 0.292 |
| Nothing traversable | 0.000 | 0.000 | 0.089 | 0.153 |
| Everything traversable | 0.344 | 0.506 | 0.000 | 0.000 |
Table 2: Ablation of our method on KITTI.
| Method | Freespace IoU | Freespace F1 | Footprint IoU | Footprint F1 |
| Project down baseline | 0.344 | 0.506 | 0.082 | 0.144 |
| Ours w/o moving object masks | 0.795 | 0.878 | 0.227 | 0.347 |
| as above w/o eqn. (3b) | 0.797 | 0.879 | 0.218 | 0.333 |
| Ours w/o eqn. (3b) | 0.793 | 0.877 | 0.225 | 0.343 |
Table 3: Freespace and footprint evaluation on Matterport.
| Method | Freespace IoU | Freespace F1 | Footprint IoU | Footprint F1 |
| Nothing traversable | 0.000 | 0.000 | 0.186 | 0.291 |
| Everything traversable | 0.480 | 0.611 | 0.000 | 0.000 |
| Bounding box (ScanNet) | 0.450 | 0.557 | 0.333 | 0.469 |
| Bounding box (SunRGBD) | 0.451 | 0.559 | 0.315 | 0.450 |
| Voxel SSCNet | 0.492 | 0.615 | 0.087 | 0.136 |
| Visible ground | 0.505 | 0.628 | 0.404 | 0.542 |
These results suggest that a careful choice of hyperparameters could further improve performance. The bounding box and SSCNet baselines have access to structured-light depth data at test time. Voxel SSCNet(+)’s geometry estimation failed on 178 scenes; we ignore these when averaging.
We first train and evaluate on the well-established KITTI benchmark using the Eigen split. To evaluate quantitatively, we generate human annotations for the entire test set. Labelers were instructed to draw a polygon bounding the hidden and visible walkable surface, and to separately label the footprint of each occluding object in the scene. Due to the nature of our task, labelers had to estimate the hidden extents of many objects, which might seem an error-prone task. However, this follows work in amodal labeling, where consistency between labelings was found to be reasonably high. These annotations are available from the project website.
We present quantitative results of our method alongside baselines in Table 1. Here we demonstrate the superior performance of our method in both freespace and footprint evaluation. Qualitative results can be seen in Figure 6. We see that Ours finds the footprints of a wider variety of objects than Bounding Box as we are not limited to predefined classes. We also better capture the overall shape of the traversable ground. Additionally, we ablate our method in Table 2, showing that our full method helps to improve results.
We use the Matterport dataset  for training and evaluation on indoor scenes. Here, camera poses and structured-light depth maps are provided, and the ground truth floor masks and geometry are rendered from the dataset’s semantically annotated mesh representation. We only train and evaluate on images from the forward- and downward-facing cameras on their rig, leaving us with 49,286 training images. We evaluate on the first 500 images from the test set. Results are shown in Figure 8 and Tables 3 and 4, where we again outperform all baselines. SSCNet  performs poorly as this method was mainly trained on synthetic data, where the footprints of objects are not separately delineated from the ground plane. We therefore create a reworking of their method, SSCNet+. Here, the voxel predictions of chairs, beds, sofas, tables and TVs are projected to the floor and subtracted to give more accurate footprint estimates. SSCNet+ achieves higher footprint scores than SSCNet, but lower freespace scores.
[Table 4: depth evaluation — RMSE, Abs. rel., Sq. rel.]
Table 5 compares our inference speed with competing methods. For a fair comparison, all methods were assessed with a batch size of one. Our simple image-to-image architecture is significantly quicker than alternatives, lending itself more readily to mobile deployment.
Table 5: Timings with a batch size of one.
| Method | Preprocessing (s) | Inference (s) |
| Voxels (SSCNet) | 43 | 66 |
| Bounding box | - | 0.417 |
| Bounding box | - | 0.520 |
| Bounding box (ScanNet) | 0.569 | 0.157 |
| Bounding box (SunRGBD) | 0.575 | 0.162 |
[Table 6: path-planning evaluation, partially recovered — Predicted visible ground: 0.512 / 0.126; Nothing traversable: 0.616 / 0.198; Bounding box: values missing]
One important use case for our system is to assist in the planning of paths, e.g. for an augmented reality character. For each Matterport test image we choose a random pixel on the ground truth ‘visible ground’ mask as the start point and a pixel in the ground truth ‘hidden ground’ mask as the end point. We plan a path between the two with A*, where the cost of traversing each pixel is derived from the unthresholded sigmoid output of our hidden-ground prediction, so that the planner prefers pixels confidently predicted as traversable. A planned path is ‘failed’ if it leaves the ground truth traversable area at any point; we also count the fraction of pixels in each path which leave the ground truth traversable area as ‘collisions’. Results are shown in Table 6, and examples of planned paths are shown in Figure 1 and in the supplementary material.
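A* over a per-pixel cost grid can be sketched as below. The cost values here are a plain positive grid; the mapping from the network's sigmoid output to per-pixel cost is left abstract, since its exact form is not given above. Manhattan distance is an admissible heuristic provided every step cost is at least 1.

```python
import heapq
from itertools import count

def astar(cost, start, goal):
    """A* search on a 2-D per-pixel cost grid with 4-connectivity.

    cost: nested list / 2-D array of positive per-pixel traversal costs.
    start, goal: (row, col) tuples. Returns the path as a list of pixels,
    or None if the goal is unreachable.
    """
    H, W = len(cost), len(cost[0])

    def h(p):  # Manhattan heuristic (admissible when all costs >= 1)
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    tie = count()  # tie-breaker so the heap never compares nodes directly
    frontier = [(h(start), next(tie), 0.0, start, None)]
    best_g = {start: 0.0}
    came_from = {}
    while frontier:
        _, _, g, node, parent = heapq.heappop(frontier)
        if node in came_from:
            continue  # already expanded via a cheaper route
        came_from[node] = parent
        if node == goal:
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < H and 0 <= nc < W:
                ng = g + cost[nr][nc]  # pay the cost of entering the cell
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(
                        frontier, (ng + h((nr, nc)), next(tie), ng, (nr, nc), node))
    return None
```

Assigning high cost to pixels predicted as non-traversable steers planned paths around object footprints, mirroring the failure/collision evaluation above.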
In this work we have presented a novel representation for predicting scene geometry beyond the line of sight, and we have shown how to learn to predict this geometry using only stereo or depth-camera video as input. We demonstrated our system’s performance on a range of challenging datasets, and compared against several strong baselines. Future work could address temporal consistency or persistent predictions.
Thanks to Eugene Valassakis for his help preparing this work’s precursor . Special thanks also to Galen Han and Daniyar Turmukhambetov for help capturing, calibrating and preprocessing our handheld camera footage, and to Kjell Bronder for facilitating the dataset annotation.
A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. Cited by: Figure 1, §6.
LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In CVPR, Cited by: §4.5.
3D bounding box estimation using deep learning and geometry. In CVPR, Cited by: Figure 6, §5, Table 1, Table 5.
ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. Transactions on Robotics. Cited by: §4.5.
A reimplementation of LiteFlowNet using PyTorch. Note: https://github.com/sniklaus/pytorch-liteflownet Cited by: §4.5.
Adversarial collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. arXiv:1805.09806. Cited by: §2.2.