High-speed, autonomous navigation is predicated on the ability to reason about the environment for effective, collision-free path planning. Existing approaches operate on current sensor readings to update an occupancy map corresponding to an internal representation of where obstacles exist in the environment. These occupancy maps are then used by planning algorithms to generate a collision-free path to a target goal. One of the limitations of this approach is that the planning horizon is limited to the field of view (FOV) of the sensor.
On the other hand, behavioral neuroscience and biological psychology point to the potential role of prediction for navigation in animals and humans. In particular, the hippocampus appears to exhibit neuronal structures as well as firing sequences that could support not only mapping but also predictive mapping capabilities. Indeed, humans continuously make predictions of what to expect based on past experiences. This allows us to adjust our control policy in real time depending on how closely our observations match our predictions. The advantage is most evident while running along a hallway approaching a T-intersection. Even though we cannot see the left or right paths, we generally assume the straight lines will continue and we can predict how the hallway will appear as we turn the corner. Because of this prediction, we do not adjust our running speed unless our prediction is inaccurate. Following this intuition, we believe future predictions of occupancy maps can enable risk-sensitive control policies for mobile and aerial vehicles. By being able to predict occupancy maps, we can enable faster navigation as the planning horizon can extend beyond the sensor’s limited FOV.
This concept is similar to image completion, a problem for which multiple solutions have been suggested in the past [1, 3]. We take an alternate approach, leveraging the fact that structural information from the observed geometry of the world can help us make useful predictions about the environment. Deep neural networks have significant advantages over other approaches when used for image completion or image generation [5]. These networks are trained adversarially as a minimax game between opposing generative and discriminative networks, and are capable of encoding a latent representation of images that is used to generate new examples from the latent space.
In this paper, we demonstrate the ability to generate future predictions of occupancy maps without an explicit model using a variety of different neural network architectures with two examples shown in Fig. 1.
The main contributions of this work include:
A dataset consisting of simulated and physical occupancy maps that can be used to train and validate neural networks for predicting occupancy maps,
A framework to evaluate the performance and accuracy of different neural network architectures,
Qualitative and quantitative analysis of the prediction capabilities, performance and accuracy of various network architectures,
Validation of our approach using occupancy maps generated by a physical LIDAR sensor.
II Related Work
Model Predictive Control
High-speed navigation has been an active area of research, primarily focusing on trajectory optimization, path planning and state estimation. Several papers have investigated model predictive control (MPC) techniques for navigation, including [18, 13]; however, these approaches typically model the vehicle dynamics to predict vehicle motion and not necessarily the environment.
Deep Learning for Generative Models
Oh et al. [16] used feedforward and recurrent neural networks to perform action-conditional video prediction using Atari games with promising results. These have also been used in image completion, e.g., by Ulyanov et al. [22]. In addition, GANs have proven to be a promising method for image generation [5]. Isola et al. [8] proposed an approach for training conditional GANs which create one image from another image.
Deep Learning for Navigation
More recently, several papers have described approaches to combine elements of deep neural networks with autonomous navigation. These include using deep neural networks for model predictive control. Tamar et al. [21] proposed Value Iteration Networks, which embed a planner inside a deep neural network architecture. Several papers investigate the use of deep reinforcement learning to develop collision-free planning without the need for an internal map; however, these are still restricted by the sensor’s FOV [20, 11].
While each of these papers makes promising contributions to its respective field, none of the prior works uses neural networks, and in particular GANs, to generate predictions of future occupancy maps, nor do they focus on extending the planning horizon beyond the sensor’s FOV.
III Proposed Architectures
The goal of our network architecture is to learn a function that maps an input occupancy map to an expanded occupancy map that extends beyond the FOV of the sensor. More formally, we are learning the function
$$y_p = f(x) = f_{dec}(f_{enc}(x)),$$
where $x$ represents the state, in this case the input occupancy map as an image, $y_p$ represents the output occupancy map, and $p$ represents the percent increase of the expanded occupancy map. Components of the function include an encoding function $f_{enc}$, which maps the state space of input occupancy maps to a hidden state $h = f_{enc}(x)$, and $f_{dec}$, which is a decoding function mapping the hidden state to an expanded, predicted occupancy map.
In our experiments, we compare several different neural network architectures including:
(A) a feedforward network based on a U-Net architecture (unet_ff)
(B) a feedforward network based on the ResNet architecture (resnet_ff)
(C) a GAN using the feedforward network from (A) as the generative network (gan)
A 5-layer multilayer perceptron was also evaluated as a baseline; however, in our testing, it did not produce reliable predictions.
III-A U-Net Feedforward Model
The U-Net feedforward model is based on the network architecture defined by Ronneberger et al. [19] and uses skip connections, which provide a direct connection between layers and make it possible to bypass the bottleneck associated with the downsampling layers in order to perform an identity operation. Similar to
filter and stride length of 2. The number of filters for the 8 layers in the encoder network are: (64, 128, 256, 512, 512, 512, 512, 512). The decoder network consists of 8 upsampling layers with the following number of filters: (512, 1024, 1024, 1024, 1024, 512, 256, 128).
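The layer bookkeeping above can be checked with a short sketch. It assumes pix2pix-style skip connections, where each decoder output is concatenated with the mirrored encoder feature map; under that assumption the listed decoder tuple reads as the number of channels entering each upsampling layer. The 256-pixel input side is also an assumption (it is not stated in the text) chosen so that eight stride-2 layers reduce the map to a single cell. The function names are illustrative only.

```python
# Encoder filter counts as listed in the text.
ENCODER_FILTERS = (64, 128, 256, 512, 512, 512, 512, 512)


def encoder_spatial_sizes(input_size, num_layers):
    """Side length after each stride-2 layer, assuming each layer halves it."""
    sizes = []
    size = input_size
    for _ in range(num_layers):
        size //= 2
        sizes.append(size)
    return sizes


def decoder_input_channels(encoder_filters):
    """Channels entering each upsampling layer under skip concatenation.

    Assumption: every decoder layer outputs as many channels as its mirrored
    encoder layer and is then concatenated with that layer's feature map.
    """
    n = len(encoder_filters)
    channels = [encoder_filters[-1]]  # first decoder layer reads the bottleneck
    for i in range(1, n):
        mirrored = encoder_filters[n - 1 - i]
        channels.append(2 * mirrored)  # decoder output plus skip connection
    return tuple(channels)


if __name__ == "__main__":
    # Assumed 256x256 input image (hypothetical value).
    print(encoder_spatial_sizes(256, len(ENCODER_FILTERS)))
    print(decoder_input_channels(ENCODER_FILTERS))
```

Under these assumptions the derived channel counts coincide with the decoder tuple given in the text, which suggests that the doubled values come from the skip concatenation rather than from wider decoder layers.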
III-B ResNet Feedforward Model
The ResNet feedforward model is based on the work by Johnson et al. [10], which consists of 2 convolution layers with stride 2, 9 residual blocks as defined by [6], and two deconvolution layers with a stride of 1/2. This network was selected for its ability to learn identity functions, which is key to image translation, and for the success in image-to-image translation demonstrated by the CycleGAN network [24].
III-C GAN Model
The GAN network is based on the pix2pix architecture [8], which has demonstrated impressive results in general-purpose image translation, including generating street scenes, building facades and aerial images to maps. This network uses the U-Net feedforward model defined in Section III-A as the generator, together with a 6-layer discriminator network with filter sizes (64, 128, 256, 512, 512, 512).
IV Simulated Data Experiments
Our approach to testing occupancy map prediction using the networks defined above first involved generating a dataset and then performing qualitative and quantitative analysis of the predicted images compared to the ground truth.
IV-A Data Collection
A dataset of approximately 6000 images of occupancy map subsets was created by simulating a non-holonomic robot moving through a two-dimensional map with a planar LIDAR sensor in C++ with ROS and the OctoMap library [7]. Two maps, shown in Fig. 2, were created in SolidWorks with the path width varying between 3.5 m and 10 m. These were converted into OctoMap’s binary tree format using binvox [14, 15] followed by OctoMap’s binvox2bt tool. The result is an occupancy map with all unoccupied space set as free. We require space outside of the walls, shown as grey in Fig. 2, to be marked as unknown to provide a ground truth for our estimated maps. These ground truth maps were created by fully exploring the original occupancy maps.
The robot is modeled as a Dubins car, with a state vector $\mathbf{x} = [x, y, \theta]^T$ and inputs $\mathbf{u} = [v, \dot{\theta}]^T$, where $(x, y)$ is the robot’s position, $v$ is the velocity, and $\theta$ and $\dot{\theta}$ are the heading angle and angular velocity, respectively. For simplicity, the robot is constrained to move at a fixed forward velocity of 0.5 m/s. A planar LIDAR sensor with a scanning area of 270° and a range of 20 m is used to simulate returns given the robot’s current pose against the ground truth map. These simulated returns are used to create the “estimated” occupancy map. Path planning is done with nonlinear model-predictive control and direct transcription at 10 Hz. At each time step, a subset of each map (both the estimated and the ground truth) is saved. A 5 m by 5 m square centered around the robot’s pose was chosen with a resolution of 0.05 m. At each time step, the robot’s current state and action space are also logged. Occupancy maps are expanded over time, so our simulation performs a continuous trajectory and the dataset is built consecutively instead of by randomly sampling throughout a map. A total of six trajectories were simulated. Four paths were used for training data (5221 images) and two were used as a test set (1090 images). Ground truth datasets of the expanded occupancy maps were also generated. These expanded occupancy maps range from 1.10x to 2.00x expansion in increments of 0.10x, e.g., a 2.00x expansion results in a 10 m by 10 m square subset centered around the robot.
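As a minimal sketch of the robot model described above, the Dubins-car kinematics can be integrated with a forward-Euler step. The state is taken as (x, y, heading) and the inputs as forward speed and angular velocity. The 0.1 s step is an assumption that ties the integration step to the 10 Hz planning rate; the simulator's actual integration scheme is not given in the text, and the function names are illustrative.

```python
import math


def dubins_step(state, v, omega, dt):
    """Advance a Dubins-car state (x, y, theta) by one forward-Euler step."""
    x, y, theta = state
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    # Wrap the heading into [-pi, pi].
    theta = math.atan2(math.sin(theta), math.cos(theta))
    return (x, y, theta)


def rollout(state, omegas, v=0.5, dt=0.1):
    """Apply a sequence of angular velocities at fixed forward speed v.

    v = 0.5 m/s follows the text; dt = 0.1 s is an assumed step size.
    """
    trajectory = [state]
    for omega in omegas:
        state = dubins_step(state, v, omega, dt)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # One second of straight driving followed by one second of turning.
    path = rollout((0.0, 0.0, 0.0), [0.0] * 10 + [0.5] * 10)
    print(path[-1])
```

Because the forward speed is fixed, the angular velocity is the only free input, which is what makes the vehicle non-holonomic in this model.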
IV-B Training Details
We trained each variant of the neural network from scratch on the expanded ground truth occupancy maps for 200 epochs with a batch size of 1. A total of 15 training sessions were performed to evaluate each of the three neural network architectures across five expansion increases (1.10x, 1.30x, 1.50x, 1.70x, and 2.00x). We use the Adam optimizer with an initial learning rate of 0.0002 and momentum parameters. In the feedforward models, L1 loss was used as proposed in PatchGAN [8], and in the GAN model L1 + discriminator loss was used. The decoder layers of the network used a dropout rate of 0.50 and weights were initialized from a normal distribution. All models were implemented using PyTorch [17].
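The two objectives mentioned above can be sketched in plain Python on flattened images. This is an illustration under stated assumptions, not the training code: the adversarial term is written as the usual negative log-likelihood of the discriminator labelling the prediction as real, and the weight on the L1 term (`l1_weight`) is a hypothetical value borrowed from the pix2pix defaults, since the text does not give one.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def generator_loss(pred, target, disc_prob_real, l1_weight=100.0):
    """Adversarial term plus weighted L1 term for the generator.

    disc_prob_real: discriminator's probability that the prediction is real.
    l1_weight: hypothetical weighting, not taken from the text.
    """
    eps = 1e-12  # avoids log(0) when the discriminator is fully confident
    adversarial = -math.log(disc_prob_real + eps)
    return adversarial + l1_weight * l1_loss(pred, target)


if __name__ == "__main__":
    prediction = [0.0, 0.5, 1.0, 1.0]
    ground_truth = [0.0, 1.0, 1.0, 0.0]
    print(l1_loss(prediction, ground_truth))
    print(generator_loss(prediction, ground_truth, disc_prob_real=0.3))
```

The feedforward models would correspond to using `l1_loss` alone, while the GAN variant adds the discriminator-driven term.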
IV-C Simulation Results
We evaluated the performance of each neural network architecture across five increasing occupancy map expansions. Fig. 3 provides a snapshot of the qualitative assessment of the predicted images for each of the neural networks. This example was selected because it demonstrates that even with very little information, the U-Net feedforward model was able to accurately predict the presence of the surrounding obstacles while the other networks were unable to detect them. Table I provides the structural similarity index (SSIM) for each of the networks. Based on SSIM, the U-Net feedforward model outperforms the other networks at 1.10x and 1.30x expansion, confirming the qualitative assessment. The quality of the prediction generally decreases as the expansion percentage increases, and at expansions of 1.50x and above the three networks achieve similar performance.
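For readers unfamiliar with SSIM, the following sketch computes a simplified, single-window version of the index over two flattened images. The standard metric averages the same expression over local sliding windows, so this global variant is an approximation for illustration and would not reproduce the values in Table I. The pixel range of 1.0 and the constants 0.01 and 0.03 are the conventional defaults, assumed here rather than taken from the text.

```python
def ssim_global(img_a, img_b, data_range=1.0):
    """Single-window SSIM over two equally sized, flattened images."""
    n = len(img_a)
    mu_a = sum(img_a) / n
    mu_b = sum(img_b) / n
    var_a = sum((a - mu_a) * (a - mu_a) for a in img_a) / n
    var_b = sum((b - mu_b) * (b - mu_b) for b in img_b) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(img_a, img_b)) / n
    # Stabilizing constants (conventional defaults, assumed).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical encoding: 0 = free, 0.5 = unknown, 1 = occupied.
    truth = [0.0, 0.0, 1.0, 1.0, 0.5, 0.5]
    guess = [0.0, 0.0, 1.0, 0.5, 0.5, 0.5]
    print(ssim_global(truth, truth))
    print(ssim_global(truth, guess))
```

A value of 1 indicates identical images, and the score falls as the predicted structure departs from the ground truth.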
V Physical Experiments
Our next experiment focused on validating our approach with occupancy maps generated by a physical LIDAR sensor.
V-A Data Collection and Training Details
In this experiment, we teleoperated a TurtleBot2 robot with a mounted Hokuyo UST-20LX LIDAR sensor (shown in Fig. 4(a)) around a building. The OctoMap library [7] along with a custom C++ implementation of a particle filter running at 20 Hz were used for simultaneous localization and mapping. The final map was used as ground truth (shown in Fig. 4(b)). At each time step, a 5 m by 5 m square subset centered around the robot’s current pose was saved from both the ground truth and estimated maps (100 images). Expanded ground truth occupancy maps were generated ranging from 1.10x to 2.00x in 0.10x increments.
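The window extraction described above can be sketched as follows. The sketch assumes the same 0.05 m grid resolution as in the simulated experiments, represents the map as a list of rows, and fills cells that fall outside the map with an "unknown" marker; the out-of-bounds handling and the `UNKNOWN` value are hypothetical choices, as the text does not describe them.

```python
UNKNOWN = -1  # hypothetical marker for cells outside the mapped area


def window_side_cells(side_m=5.0, resolution_m=0.05, expansion=1.0):
    """Side length, in grid cells, of the (optionally expanded) square window."""
    return round(side_m * expansion / resolution_m)


def crop_centered(grid, row, col, side_cells, fill=UNKNOWN):
    """Crop a square window of side_cells centered near (row, col)."""
    half = side_cells // 2
    window = []
    for r in range(row - half, row - half + side_cells):
        line = []
        for c in range(col - half, col - half + side_cells):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else fill)
        window.append(line)
    return window


if __name__ == "__main__":
    for expansion in (1.0, 1.1, 1.5, 2.0):
        print(expansion, window_side_cells(expansion=expansion))
```

With these numbers the base window is 100 by 100 cells and the 2.00x ground truth window is 200 by 200 cells, so the network has to fill in three times as many unseen cells as it observes at the largest expansion.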
Our objective was to evaluate whether training performed on a simulated dataset could be directly transferred to occupancy maps generated by a physical LIDAR sensor. For this reason, we opted not to fine-tune the networks using the physical dataset.
V-B Physical Results
Fig. 5 shows sample predictions obtained by running the networks trained using simulation data on the occupancy maps generated by the physical sensor. Table II displays the SSIM metric across each of the networks. In the physical experiments, the results are less conclusive. Similar to the simulation experiments, the quality of the predictions generally decreases as the predicted distance increases; however, there was no noticeable difference across the three networks.
The ability to perform predictions is key to navigation. This capability is also motivated from the perspective of behavioral neuroscience and psychology. In particular, it has been found that certain neuronal structures point to mapping capabilities and may be involved in encoding predictive mapping events based on past experience. The net product is that neurons do not activate solely based on current visual input, but also based on a sequence of locations, so as to enable prediction. In this paper, our goal is to develop techniques that enable future predictions of occupied space for robotic navigation.
The main intuition behind our predictive approach is that knowledge of the geometry of existing occupied space can serve as a prior for generating predictions. Prior to deep learning, the best methods of generating predictions were through explicit models; however, modeling observations and experiences can be difficult if not impossible. Deep learning makes it possible to find hidden representations that encode prior knowledge by collecting datasets that represent experiences. In our work, we leveraged the power of deep learning to encode prior knowledge of likely spatial structures and used this representation to generate future predictions without an explicit model.
Based on the above experiments, the proposed approach is generally very stable, particularly when predicting occupancy maps representing 1.10x or 1.30x expansion increases. Considering U-Net’s superior performance on the simulated data, we use it next to demonstrate the general robustness of our approach in Fig. 7, where we display five randomly selected images from the test dataset. A promising benefit of our method is that with very little information, predictions can be extremely accurate, as shown in Fig. 3. This is further evidenced by the supplementary video demonstrating accurate frame-by-frame prediction of a robot navigating a hallway in simulation. As expected, when the predicted area of the occupancy map increases, the results exhibit more uncertainty, as demonstrated by the 2.00x predictions in Fig. 6. While falling short of the exact ground truth, these examples still contain useful information beyond the observed input, which can be exploited by the planning algorithms.
Overall, when compared to the simulated data, the physical data performs worse quantitatively. This is likely due to the fact that the physical data exhibits more details that are hard to predict given the simulated training data, which does not have the same level of detail (e.g., chairs, boxes, people walking through the scene). Using augmentation methods may help address this issue.
Looking back at the physical data prediction from Fig. 5, one notes that at a high level the predictions are not only informative but also qualitatively correct, as they all point to an upcoming T-like intersection. This suggests that from the perspective of the end goal of assessing navigational risks, selecting the navigation behavior, or simply deciding if and when to decelerate, this high-level qualitative information is very useful.
We also note that in the physical data results in Table II, there is much less quantitative difference between the performance of the GAN and the fully convolutional models, and in fact the GAN seems to have a slight edge qualitatively over the other methods, as it is able to predict a more detailed map than the other approaches.
As demonstrated in Fig. 5, not only can our approach be used to predict occupied space, it also appears to have the beneficial effect of filtering out transient obstacles found by noisy sensor readings. While higher resolution details might be desirable for collision avoidance, this will be solved to a large extent by the current sensor measurements in the FOV. We argue that as we expand our temporal horizon, less spatial resolution is necessary in the prediction. In this sense, it would be beneficial to use alternate metrics that take this fact into account. One way to achieve this is to compute SSIM at different (coarser) resolution levels for more distant future time instants in order to characterize the ability of the prediction method to capture the future at different scales. This is left to future work.
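As one possible reading of the coarser-resolution comparison suggested above, both the prediction and the ground truth could be reduced by block averaging before a similarity metric is applied. The sketch below shows only that reduction step; block averaging is an assumed choice of downsampling, and the function name is illustrative.

```python
def block_average(grid, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks.

    Rows and columns that do not fill a complete block are dropped.
    """
    rows = len(grid) // factor
    cols = len(grid[0]) // factor
    coarse = []
    for r in range(rows):
        line = []
        for c in range(cols):
            total = 0.0
            for dr in range(factor):
                for dc in range(factor):
                    total += grid[r * factor + dr][c * factor + dc]
            line.append(total / (factor * factor))
        coarse.append(line)
    return coarse


if __name__ == "__main__":
    fine = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1],
    ]
    print(block_average(fine, 2))
```

Small misplacements of a wall that would be penalized at full resolution largely vanish after averaging, which matches the argument that distant predictions need less spatial detail.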
Additional future work will also focus on improving the current methods for extending predictions and combining them with the stable results generated by the shorter horizon predictions.
Our long-term objective is to develop risk-sensitive control algorithms capable of leveraging known obstacles in the environment as well as predicted obstacles. In this paper, we have laid the foundation by demonstrating that deep networks can be used to make predictions of occupancy maps that extend beyond the FOV of the sensor. In our evaluation, we uncovered conditions where predictions were highly accurate and examples where the predicted results could be improved. As future work, we plan to evaluate prediction mechanisms operating on raw depth data, to combine visual and depth data, to develop an on-line learning policy, and to further develop risk-sensitive control policies for high-speed navigation based on these predictions.
[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 417–424, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[2] R. L. Buckner. The role of the hippocampus in prediction and imagination. Annual Review of Psychology, 61:27–48, 2010.
[3] T. F. Chan and J. Shen. Mathematical models for local nontexture inpaintings. SIAM J. Appl. Math, 62:1019–1043, 2002.
[4] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2786–2793. IEEE, 2017.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[7] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
[9] M. Joch, M. Hegele, H. Maurer, H. Müller, and L. K. Maurer. Brain negativity as an indicator of predictive error processing: the contribution of visual action effect monitoring. Journal of Neurophysiology, 118(1):486–495, 2017. PMID: 28446578.
[10] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and super-resolution. CoRR, abs/1603.08155, 2016.
[11] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. CoRR, abs/1709.10489, 2017.
[12] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. arXiv preprint arXiv:1603.02199, 2016.
[13] S. Maniatopoulos, D. Panagou, and K. J. Kyriakopoulos. Model predictive control for the navigation of a nonholonomic vehicle with field-of-view constraints. In 2013 American Control Conference, pages 3967–3972, June 2013.
[14] P. Min. binvox. http://www.patrickmin.com/binvox, 2004 - 2017. Accessed: 2017-02-20.
[15] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using volumetric techniques. IEEE Transactions on Visualization and Computer Graphics, 9(2):191–205, 2003.
[16] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc., 2015.
[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[18] J. L. Piovesan and H. G. Tanner. Randomized model predictive control for robot navigation. In 2009 IEEE International Conference on Robotics and Automation, pages 94–99, May 2009.
[19] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
[20] L. Tai, G. Paolo, and M. Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. CoRR, abs/1703.00420, 2017.
[21] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
[22] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Deep image prior. CoRR, abs/1711.10925, 2017.
[23] R. A. Yeh, C. Chen, T. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.
[24] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.