Fast planning algorithms are crucial for the practical use of robots in changing environments and safety-critical tasks. The efficiency of heuristic search-based planning (e.g. A* search) depends largely on the quality of the heuristic function estimating the cost-to-go. Ideally, if we knew the exact cost-to-go (an oracle), we could find the optimal solution with minimal effort, effectively traversing greedily. If the robot operates in similar environments, learning the cost-to-go function from previous search experience is a promising approach, as planning and learning have complementary strengths. Learning approaches struggle to provide performance guarantees, explore safely, or learn long-term rewards; in these aspects, planning algorithms can provide valuable support. On the other hand, planning algorithms are rather slow in high-dimensional spaces, which can be improved if planning is properly guided. Effective synergy of planning and learning has provided exceptional results so far, including the seminal achievement of super-human performance in the game of Go.
The interaction of planning and learning has a long history, with several modern directions including end-to-end learned approximations of planning algorithms (e.g. inspired by the Value Iteration algorithm, MCTS, or MPC), planning to guide exploration in Reinforcement Learning, and learning to guide planning, as well as model learning and Model-Based Reinforcement Learning in general.
This work is in the direction of learning a value function in order to guide heuristic search-based planning. It is a significantly improved extension and generalization of earlier work that focused on the automated driving application and considered a search-based optimal motion planning framework (SBOMP) utilizing different admissible heuristics (numeric and model-based). Those admissible heuristics did not consider dynamic obstacles, so to improve performance the authors proposed learning heuristics that consider dynamic obstacles as well. The approaches most similar to the one presented in this work appeared in  and in . However, in  the authors used only nodes from the shortest path for learning, and in  the authors used backward Dijkstra's algorithm, which explores the whole search space.
The main contribution of this work is a novel approach for efficient and systematic exploration of the model, based on backward and prolonged heuristic search. This ensures that only interesting nodes are explored and that all explored nodes are used for value function learning.
Premise: For learning the value function, it is more beneficial to explore states in the neighborhood of the optimal path (policy) than elsewhere, as the agent will spend most of its time in that neighborhood.
Having explored the region around the optimal path helps the planner to get back to the optimal path if it deviates, and gives better coverage for the A* algorithm, which always examines neighboring states. Inspired by this idea, we prolong the search even after the optimal path is found, in order to explore a wider region around the shortest path. Additionally, the direction of the search is flipped, such that the search starts from the goal node. In this way, all expanded nodes lead to the goal state and can therefore be used in the dataset for learning.
The presented approach uses an existing admissible heuristic function and a known model of the system to generate a dataset of exact state-cost data points. The dataset is used for supervised learning of the value function. The learned value function is then used as the heuristic function in the search, bounded by the admissible heuristic to provide guarantees on sub-optimality.
In principle, this approach differs from reinforcement learning since it is supervised, and from imitation learning since the exact optimal solution is used instead of expert demonstrations. Nevertheless, this exploration approach can also be used within a Model-Based Reinforcement Learning framework. It enables theoretically inexhaustible data generation from different scenarios and initial conditions, thereby spending computational resources offline in order to plan faster online, when it matters. The approach can also be used in cases where it is important to focus exploration on certain parts of the state space to reduce uncertainty in the value function approximation.
II-A Dataset Generation
The dataset consists of data points that carry information about the scenario (obstacles, initial and goal state) and the current state, together with the corresponding cost-to-go (i.e. from the current state to the goal state). For dataset generation, planning algorithms can be used to produce data points with the exact cost-to-go. In the vanilla Shortest Path (SP) planning problem, the goal is to find a single collision-free path (i.e. from the initial to the goal state), so planning is stopped when the goal state is reached. As the goal state is reached from only one node (and each node has only one parent), the exact cost-to-go can be computed only for nodes on the optimal path. Contrary to SP, in dataset generation the objective is to generate as many different data points as possible. One approach is backward dynamic programming (Dijkstra's algorithm); however, this explores the whole search space, which is not practical in higher-dimensional problems.
We propose heuristic search-based exploration for dataset generation, as shown in Algorithm 1. In this framework, the search is done backwards from the goal node so that all explored nodes can be used in the dataset, as they contain the exact cost to the goal. Additionally, since the region of highest interest is the neighborhood of the optimal path, the search is not stopped when the initial node is reached by some path (as in the SP problem), but prolonged until the Closed list grows by a given factor. This prolongation ensures that more nodes in the neighborhood of the optimal path are explored. Data points are constructed such that, for each node in the Open and Closed lists, the corresponding scenario structure (grid) and cost-to-go are stored in the dataset. The cost-to-go from a node to the goal node is actually the cost-to-come in the backward search. In this way, paths do not have to be reconstructed and all expanded nodes are used in the dataset.
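A minimal sketch of this backward prolonged exploration is given below. For simplicity it uses uniform-cost (Dijkstra-style) expansion instead of the heuristic-guided backward search of Algorithm 1, and `k_prolong` is a hypothetical name for the prolongation factor; both are assumptions for illustration.

```python
import heapq

def backward_prolonged_search(grid, start, goal, k_prolong=3):
    """Backward search from `goal` on a 4-connected grid (0 = free cell).

    Because the search runs backwards, every expanded node's cost-to-come
    equals its exact cost-to-go to `goal`, so all expanded nodes become
    (state, cost-to-go) datapoints.  After `start` is reached, the search
    is prolonged until the closed set is `k_prolong` times larger.
    """
    rows, cols = len(grid), len(grid[0])
    open_list = [(0, goal)]
    closed = {}          # state -> exact cost-to-go
    budget = None        # expansion budget, set once `start` is reached
    while open_list:
        cost, node = heapq.heappop(open_list)
        if node in closed:
            continue
        closed[node] = cost
        if node == start and budget is None:
            budget = k_prolong * len(closed)   # prolong the search
        if budget is not None and len(closed) >= budget:
            break
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                if (nr, nc) not in closed:
                    heapq.heappush(open_list, (cost + 1, (nr, nc)))
    return list(closed.items())
```

Note that no path reconstruction is needed: the `closed` dictionary itself is the dataset of exact state-cost pairs.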
II-B Value Function Learning
Learning the value function is in this approach a supervised learning (regression) problem. The proposed model takes as input an image representing the node, the initial and goal nodes, and the situation (obstacles), as can be seen in the grayscale part of Figure 1, and returns a scalar value representing the estimated cost to reach the goal from that node.
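As an illustration only, one way to rasterize such a datapoint into a grayscale image is sketched below; the specific intensity values are an assumption, not the paper's actual encoding from Figure 1.

```python
import numpy as np

def encode_datapoint(obstacles, node, start, goal):
    """Encode one datapoint as a grayscale image.

    Hypothetical intensity coding: free cells 0.0, obstacles 0.25,
    start 0.75, goal 1.0, and the current node 0.5 (drawn last).
    `obstacles` is a 2D 0/1 array; the other arguments are (row, col).
    """
    img = 0.25 * np.asarray(obstacles, dtype=np.float32)
    img[start] = 0.75
    img[goal] = 1.0
    img[node] = 0.5
    return img
```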
As it is preferred that the heuristic function underestimates the exact cost (admissibility), a non-symmetric loss function can be used. Asymmetry can be introduced by augmenting the Mean Squared Error loss with a weighting parameter that emphasizes the penalty for positive errors (overestimates).
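One possible form of such an augmented loss is sketched below, with a hypothetical weight `a` applied to positive errors; the paper's exact formulation may differ in detail.

```python
import numpy as np

def asymmetric_mse(y_pred, y_true, a=2.0):
    """Mean squared error where positive errors (overestimates,
    y_pred > y_true) are weighted `a` times more than negative ones.

    With a > 1 the regressor is pushed toward underestimating the
    cost-to-go, which is the direction admissibility prefers.
    """
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    weights = np.where(err > 0, a, 1.0)   # penalize overestimates more
    return float(np.mean(weights * err ** 2))
```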
II-C Using the ML Value Function as a Heuristic Function
The learned value function is used as a heuristic function in the search. To provide guarantees on sub-optimality, the ML heuristic is bounded by an admissible heuristic. In this way, the heuristic is ε-admissible, so the solution cost is always at most ε times the optimal. Values of ε closer to 1 guarantee smaller deviation from the optimal solution but reduce computational performance. An alternative approach would be to use Multi-Heuristic A* Search (MHA*).
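The bounding step can be sketched as clamping the learned estimate by ε times the admissible heuristic; the function names below are illustrative, not from the paper.

```python
def bounded_heuristic(h_ml, h_adm, eps=1.5):
    """Combine a learned heuristic with an admissible one.

    Since h_adm never overestimates the true cost-to-go h*, the clamped
    value min(h_ml, eps * h_adm) <= eps * h*, i.e. the combined heuristic
    is eps-admissible and A* returns a solution costing at most eps times
    the optimum.  `eps` is the sub-optimality bound.
    """
    def h(node):
        return min(h_ml(node), eps * h_adm(node))
    return h
```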
For the experiment, a grid-world domain with 4-connected neighbors was used, with 33 % of cells covered by obstacles on average. In total, 531 different random scenarios were used for dataset generation, and from each scenario multiple data points are generated. For the purpose of comparison, two datasets were created: one (representing the prior approach) uses only nodes from the optimal path (12,007 data points), while the other uses all nodes explored by the Backward Prolonged Heuristic Search (122,449 data points), as proposed in this work. The advantage of Backward Prolonged Heuristic Search is clear, as about 10 times more data points are generated from the same number of scenarios, even in this simple 2D problem. The gap is expected to be even larger in higher-dimensional problems.
III-A Value Function Learning
For value function learning, a fully convolutional neural network was used in this experiment. The complete architecture can be seen in Table I. Each layer uses the SELU nonlinear activation. The network was trained on mini-batches of images with a decaying learning rate, initialized using the variance-scaling initializer, and optimized with the ADAM optimizer. The asymmetric loss function's weighting parameter emphasizes the penalty for positive errors.
III-B Using the ML Value Function as a Heuristic Function
Three different heuristic functions are used in the experiment and compared based on solution quality (i.e. path length) and planning efficiency (i.e. number of explored nodes). The first is the admissible heuristic function based on the Manhattan distance. The second is the value function trained on the dataset generated from the solution path only. The third is the value function trained on the dataset from the proposed prolonged heuristic search.
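The efficiency metric, the number of explored (expanded) nodes, can be illustrated with a small A* that counts expansions. Since the trained models cannot be reproduced here, the sketch compares the admissible Manhattan heuristic against blind (zero-heuristic) search as a stand-in for the three-way comparison; all names are illustrative.

```python
import heapq

def astar(grid, start, goal, h):
    """4-connected A* on a 0/1 grid; returns (path cost, nodes expanded)."""
    rows, cols = len(grid), len(grid[0])
    open_list = [(h(start), 0, start)]
    g = {start: 0}
    closed = set()
    expanded = 0
    while open_list:
        _, cost, node = heapq.heappop(open_list)
        if node in closed:
            continue
        closed.add(node)
        expanded += 1
        if node == goal:
            return cost, expanded
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0
                    and cost + 1 < g.get(nxt, float("inf"))):
                g[nxt] = cost + 1
                heapq.heappush(open_list, (cost + 1 + h(nxt), cost + 1, nxt))
    return None, expanded

def manhattan_to(goal):
    """Admissible Manhattan-distance heuristic toward `goal`."""
    return lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])
```

A better-informed heuristic yields the same optimal cost while expanding fewer nodes, which is exactly the comparison made between the three heuristics above.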
An extensive evaluation is still in progress and will be reported at the event. Initial results were extremely positive, showing fast convergence and very low MSE.
The presented approach offers the possibility to effectively include machine learning in a deterministic planning framework, promising significant performance improvements, manifested in a reduced number of explored nodes compared to the admissible heuristic, while keeping guarantees on the sub-optimality of the solution. The approach makes maximal use of the computational resources invested in planning, as all nodes expanded during planning are used for learning. Future steps include studying the behavior in higher-dimensional problems, kinodynamic motion, and an extension to an end-to-end Model-Based Reinforcement Learning framework.
The project leading to this study has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675999, ITEAM project.
VIRTUAL VEHICLE Research Center is funded within the COMET – Competence Centers for Excellent Technologies – programme by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT), the Federal Ministry of Science, Research and Economy (BMWFW), the Austrian Research Promotion Agency (FFG), the province of Styria and the Styrian Business Promotion Agency (SFG). The COMET programme is administrated by FFG.
- Aine et al. Sandip Aine, Siddharth Swaminathan, Venkatraman Narayanan, Victor Hwang, and Maxim Likhachev. Multi-Heuristic A*. The International Journal of Robotics Research, 35(1-3):224–243, 2016. ISSN 0278-3649.
- Ajanovic et al. Zlatan Ajanovic, Bakir Lacevic, Georg Stettinger, Daniel Watzenig, and Martin Horn. Safe learning-based optimal motion planning for automated driving. URL http://arxiv.org/pdf/1805.09994v2.
- Ajanovic et al. Zlatan Ajanovic, Bakir Lacevic, Barys Shyrokau, Michael Stolz, and Martin Horn. Search-based optimal motion planning for automated driving. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4523–4530. IEEE, 2018. ISBN 978-1-5386-8094-0. doi: 10.1109/IROS.2018.8593813.
- Ajanović et al. Zlatan Ajanović, Michael Stolz, and Martin Horn. Energy efficient driving in dynamic environment: Considering other traffic participants and overtaking possibility. In Daniel Watzenig and Bernhard Brandstätter, editors, Comprehensive energy management, volume 6 of SpringerBriefs in Applied Sciences and Technology, Automotive Engineering: Simulation and Validation Methods, 2191-530X, pages 61–80. Springer, Cham, Switzerland, 2017. ISBN 978-3-319-53164-9. doi: 10.1007/978-3-319-53165-6_4.
- Ajanović et al.  Zlatan Ajanović, Michael Stolz, and Martin Horn. A novel model-based heuristic for energy-optimal motion planning for automated driving. IFAC-PapersOnLine, 51(9):255–260, 2018. ISSN 24058963. doi: 10.1016/j.ifacol.2018.07.042.
- Amos et al.  Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289–8300, 2018.
- Barto et al.  Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995. ISSN 0004-3702. doi: 10.1016/0004-3702(94)00011-O.
- Bellman  Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716–719, 1952. ISSN 0027-8424.
- Bhardwaj et al. Mohak Bhardwaj, Sanjiban Choudhury, and Sebastian Scherer. Learning heuristic search via imitation. URL http://arxiv.org/pdf/1707.03034v1.
- Choudhury et al.  Sanjiban Choudhury, Mohak Bhardwaj, Sankalp Arora, Ashish Kapoor, Gireeja Ranade, Sebastian Scherer, and Debadeepta Dey. Data-driven planning via imitation learning. The International Journal of Robotics Research, 37(13-14):1632–1672, 2018. ISSN 0278-3649. doi: 10.1177/0278364918781001.
- Groshev et al.  Edward Groshev, Aviv Tamar, Maxwell Goldstein, Siddharth Srivastava, and Pieter Abbeel. Learning generalized reactive policies using deep neural networks. In 2018 AAAI Spring Symposium Series, 2018.
- Guez et al. Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, and David Silver. Learning to search with MCTSnets. URL http://arxiv.org/pdf/1802.04697v2.
- Hart et al.  Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968. ISSN 0536-1567. doi: 10.1109/TSSC.1968.300136.
- He et al. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- Helmert et al.  Malte Helmert, Gabriele Röger, et al. How good is almost perfect? In AAAI, volume 8, pages 944–949, 2008.
- Ichter et al. Brian Ichter, James Harrison, and Marco Pavone. Learning sampling distributions for robot motion planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7087–7094. IEEE, 2018. ISBN 978-1-5386-3081-5. doi: 10.1109/ICRA.2018.8460730.
- Kim et al. Beomjoon Kim, Leslie Pack Kaelbling, and Tomas Lozano-Perez. Learning to guide task and motion planning using score-space representation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2810–2817. IEEE, 2017. ISBN 978-1-5090-4633-1. doi: 10.1109/ICRA.2017.7989327.
- Kingma and Ba. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. URL http://arxiv.org/pdf/1412.6980v9.
- Klambauer et al.  Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pages 971–980, 2017.
- Korf  Richard E. Korf. Real-time heuristic search. Artificial Intelligence, 42(2-3):189–211, 1990. ISSN 0004-3702. doi: 10.1016/0004-3702(90)90054-4.
- Likhachev et al.  Maxim Likhachev, Geoffrey J. Gordon, and Sebastian Thrun. Ara*: Anytime a* with provable bounds on sub-optimality. In Advances in neural information processing systems, pages 767–774, 2004.
- Lowrey et al. Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. URL http://arxiv.org/pdf/1811.01848v3.
- Silver et al. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017. ISSN 1476-4687. URL https://www.nature.com/articles/nature24270?sf123103138=1.
- Tamar et al. Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2154–2162. Curran Associates, Inc, 2016. URL http://papers.nips.cc/paper/6046-value-iteration-networks.pdf.
- Weber et al. Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. URL http://arxiv.org/pdf/1707.06203v2.
- Zhang et al. Clark Zhang, Jinwook Huh, and Daniel D. Lee. Learning implicit sampling distributions for motion planning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3654–3661. IEEE, 2018. ISBN 978-1-5386-8094-0. doi: 10.1109/IROS.2018.8594028.