I Introduction
Having fast planning algorithms is crucial for the practical use of robots in changing environments and safety-critical tasks. The efficiency of heuristic search-based planning (A* search [13]) largely depends on the quality of the heuristic function used to estimate the cost-to-go [15]. Ideally, if we knew the exact cost-to-go (an oracle), we could find the optimal solution with minimum effort (practically a greedy traversal). If the robot operates in similar environments, learning the cost-to-go function from previous search experience is a promising approach, as planning and learning have complementary strengths. Learning approaches struggle to provide performance guarantees, to explore safely, or to learn long-term rewards; in these aspects, planning algorithms can provide valuable support. On the other hand, planning algorithms are rather slow in high-dimensional spaces, which can be improved if planning is properly guided. Effective synergy of planning and learning has provided exceptional results so far, including the seminal achievement of superhuman performance in the game of Go [23].
The interaction of planning and learning has a long history [8], [20], [7], with several modern directions including end-to-end learned approximations of planning algorithms (e.g., inspired by the Value Iteration algorithm [24], MCTS [12], and MPC [6]), planning to guide exploration in reinforcement learning [25], [22], and learning to guide planning [17], [9], [26], [16], as well as model learning and model-based reinforcement learning in general.
This work is in the direction of learning the value function in order to guide heuristic search-based planning. It is a significantly improved extension and generalization of the work in [2], which focused on the automated-driving application and considered a search-based optimal motion planning framework (SBOMP) [3] that utilized different admissible heuristics (numeric [4] and model-based [5]). Those admissible heuristics did not consider dynamic obstacles, so to improve performance the authors proposed learning heuristics that consider dynamic obstacles as well. The approaches most similar to the one presented in this work appeared in [11] and [10]. However, in [11] the authors used only nodes from the shortest path for learning, and in [10] the authors used backward Dijkstra's algorithm, which explores the whole search space.
The main contribution of this work is a novel approach for efficient and systematic exploration of the model based on backward, prolonged heuristic search. This ensures that only interesting nodes are explored and that all explored nodes are used for value-function learning.
Premise: For learning the value function, it is more beneficial to explore states in the neighborhood of the optimal path (policy) than elsewhere, as the agent will spend most of its time in that neighborhood.
Having explored the region around the optimal path helps the planner get back to that path if it deviates, and provides better coverage for the A* algorithm, which always examines neighboring states. Inspired by this idea, we prolong the search even after the optimal path has been found, in order to explore a wider region around the shortest path. Additionally, the direction of the search is flipped so that the search starts from the goal node. In this way, all expanded nodes lead to the goal state and can therefore be used in the dataset for learning.
II Method
The presented approach uses an existing admissible heuristic function and a known model of the system to generate a dataset of exact state-cost data points. The dataset is used for supervised learning of the value function. The value function is then used as the heuristic function in the search, bounded by the admissible heuristic to provide guarantees on suboptimality.
In principle, this approach differs from reinforcement learning since it is supervised, and from imitation learning since the exact optimal solution is used instead of expert demonstrations. Nevertheless, this approach to exploration can also be used within a model-based reinforcement learning framework. It enables theoretically inexhaustible data generation from different scenarios and initial conditions, thereby spending computational resources offline to achieve faster planning online, when necessary. It can also be used in cases where it is important to focus exploration on certain parts of the state space to reduce uncertainty in the value-function approximation.
II-A Dataset generation
The dataset consists of data points that carry information about the scenario (obstacles, initial and goal state) and the current state, together with the corresponding cost-to-go (i.e., from the current state to the goal state). For dataset generation, planning algorithms can be used to produce data points with the exact cost-to-go. In the vanilla shortest-path planning (SP) problem, the goal is to find only one collision-free path (i.e., from the initial to the goal state), so planning stops when the goal state is reached. As the goal state is reached from only one node (and each node has only one parent), the exact cost-to-go can be computed only for nodes on the optimal path. Contrary to SP, in dataset generation the objective is to generate as many different data points as possible. One approach is to use backward dynamic programming (Dijkstra's algorithm); however, this explores the whole search space, which is not practical in higher-dimensional problems.
We propose heuristic search-based exploration for dataset generation, as shown in Algorithm 1. In this framework, the search is done backwards from the goal node, so that all explored nodes can be used in the dataset, as they contain the exact cost to the goal. Additionally, as the region of higher interest is the neighborhood of the optimal path, the search is not stopped when the initial node is first reached by some path (as in the SP problem), but prolonged until the Closed list grows by a chosen factor. This prolongation assures that more nodes in the neighborhood of the optimal path are explored. Data points are constructed such that, for each node in the Open and Closed lists, the corresponding scenario structure (grid) and cost-to-go are stored in the dataset. The cost-to-go from a node to the goal node is actually the cost-to-come in the backward search. In this way, paths do not have to be reconstructed and all expanded nodes are used in the dataset.
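The core of this backward, prolonged exploration can be sketched as follows. This is a minimal illustration, not Algorithm 1 verbatim: it assumes a uniform-cost, 4-connected occupancy grid, uses a zero heuristic (i.e., a backward Dijkstra over a bounded region) for simplicity, and the prolongation factor `k` and the function name are hypothetical.

```python
import heapq

def backward_prolonged_search(grid, start, goal, k=4):
    """Explore backwards from `goal`: each expanded node's g-value is its
    exact cost-to-go. After `start` is first reached, the search is
    prolonged until the Closed set is `k` times its size at that moment
    (if `start` is unreachable, the whole reachable region is explored)."""
    rows, cols = len(grid), len(grid[0])
    open_heap = [(0, goal)]          # (cost-to-come in backward search, node)
    g = {goal: 0}
    closed = set()
    budget = None                    # Closed-list size at which to stop
    while open_heap:
        cost, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        if node == start and budget is None:
            budget = k * len(closed)     # prolong beyond the first solution
        if budget is not None and len(closed) >= budget:
            break
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # 4-connected
            nb = (r + dr, c + dc)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and grid[nb[0]][nb[1]] == 0:
                new_cost = cost + 1
                if new_cost < g.get(nb, float("inf")):
                    g[nb] = new_cost
                    heapq.heappush(open_heap, (new_cost, nb))
    # every (node, cost-to-go) pair becomes a training data point
    return {n: g[n] for n in closed}
```

Because the search runs backwards, no path reconstruction is needed: the returned g-values are exact cost-to-go labels for every expanded node.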
II-B Value Function Learning
Learning the value function in this approach is a supervised learning problem (regression). The proposed model takes as input an image representing the node, the initial and goal nodes, and the situation (obstacles), as can be seen in Figure 1 (grayscale part), and returns a scalar value representing the estimated cost to reach the goal from that node.
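The input described above can be encoded, for instance, as a stack of binary channels. The channel layout and the function name below are hypothetical, since the text only specifies that the queried node, the initial and goal states, and the obstacles enter the network as an image.

```python
import numpy as np

def encode_scenario(grid, node, start, goal):
    """Stack binary channels (obstacles, queried node, start, goal) into a
    multi-channel image; a hypothetical encoding consistent with the
    image input described in the text."""
    grid = np.asarray(grid, dtype=np.float32)
    chans = np.zeros((4,) + grid.shape, dtype=np.float32)
    chans[0] = grid            # obstacle map (1 = occupied)
    chans[1][node] = 1.0       # node whose cost-to-go is queried
    chans[2][start] = 1.0      # initial state
    chans[3][goal] = 1.0       # goal state
    return chans
```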
As it is preferred that the heuristic function underestimates the exact cost (admissibility), a non-symmetric loss function can be used. Asymmetry can be introduced by augmenting the mean-squared-error loss as:

(1)  e_i = V̂(x_i) − V(x_i)

(2)  L = (1/N) Σ_i w_i e_i²,  with w_i = a if e_i > 0, and w_i = 1 otherwise,

with the parameter a > 1 emphasizing the penalty for positive errors e_i > 0 (overestimates of the cost-to-go).
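A minimal sketch of such an asymmetric MSE, assuming the error is defined as prediction minus target and using a hypothetical penalty factor `a` for positive errors (overestimates):

```python
import numpy as np

def asymmetric_mse(y_pred, y_true, a=4.0):
    """MSE with penalty factor `a` (hypothetical value) applied to positive
    errors, i.e. to overestimates of the cost-to-go."""
    e = y_pred - y_true
    w = np.where(e > 0, a, 1.0)      # weight overestimates more heavily
    return float(np.mean(w * e ** 2))
```

Penalizing overestimates pushes the learned value function toward underestimation, which keeps it closer to admissible behavior in the search.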
II-C Using the ML Value Function as a Heuristic Function
The learned value function is used as a heuristic function in the search. To provide guarantees on suboptimality, the ML heuristic is bounded by the admissible heuristic scaled by a factor ε ≥ 1. In this way, the heuristic is ε-admissible, so the solution cost is always at most ε times the optimal [21]. Values of ε closer to 1 guarantee smaller deviation from the optimal solution but reduce computational performance. An alternative approach would be to use Multi-Heuristic A* search (MHA*) [1].
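One common way to realize such a bound (a sketch, not necessarily the authors' exact construction) is to clamp the learned heuristic at ε times the admissible one; since h_adm ≤ h*, the clamped heuristic never exceeds ε·h*, which yields the ε-suboptimality guarantee of weighted A* [21]:

```python
def bounded_heuristic(h_ml, h_adm, eps=1.5):
    """Clamp the learned heuristic `h_ml` by eps times the admissible
    heuristic `h_adm`. Since h_adm(n) <= h*(n), the returned heuristic
    satisfies h(n) <= eps * h*(n), so A* with it returns a solution at
    most eps times the optimal cost."""
    def h(node):
        return min(h_ml(node), eps * h_adm(node))
    return h
```

Usage: `h = bounded_heuristic(model_heuristic, manhattan_distance, eps=1.5)` and pass `h` to the A* search in place of the raw learned estimate.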
III Experiment
For the experiment, a grid-world domain with 4-connected neighbors is used, where on average 33% of the cells are covered with obstacles. In total, 531 different random scenarios were used for dataset generation; from each scenario, multiple data points are generated. For comparison, two datasets were created. One dataset (representing the approach of [11]) uses only nodes from the optimal path (12,007 data points), while the other uses all nodes explored by the backward prolonged heuristic search (122,449 data points), as proposed in this work. The advantage of backward prolonged heuristic search is clear: from the same number of scenarios, about ten times more data points are generated, even in this simple 2D problem. The advantage is expected to be even larger in higher-dimensional problems.
III-A Value Function Learning
For value-function learning, a fully convolutional neural network model was used in this experiment. The complete architecture can be seen in Table I. Each layer uses the SELU nonlinear activation [19]. The network was trained using mini-batches of images with a decaying learning rate, initialized using the variance-scaling initializer [14], and optimized with the Adam optimizer [18]. Training used the asymmetric loss function of Eq. (2).

Table I: network architecture, layers conv1–conv6 (kernel, stride, and output size per layer).
III-B Using the ML Value Function as a Heuristic Function
Three different heuristic functions are used in the experiment and compared on solution quality (i.e., path length) and planning efficiency (i.e., number of explored nodes). The first is an admissible heuristic function based on the Manhattan distance. The second is a value function trained on the dataset generated from the solution path only. The third is a value function trained on the dataset from the proposed prolonged heuristic search.
IV Results
An extensive evaluation is still in progress and will be reported at the event. Initial results were extremely positive, showing fast convergence and very low MSE.
V Conclusion
The presented approach offers the possibility to effectively include machine learning in a deterministic planning framework, promising significant performance improvements manifested in a reduced number of explored nodes compared to those obtained using the admissible heuristic, while keeping guarantees on the suboptimality of the solution. The approach makes maximal use of the computational resources invested in planning, as all nodes expanded during planning are used for learning. Future steps include studying the behavior in higher-dimensional problems, kinodynamic motion, and an extension to an end-to-end model-based reinforcement learning framework.

Acknowledgments
The project leading to this study has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675999, ITEAM project.
VIRTUAL VEHICLE Research Center is funded within the COMET – Competence Centers for Excellent Technologies – programme by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT), the Federal Ministry of Science, Research and Economy (BMWFW), the Austrian Research Promotion Agency (FFG), the province of Styria and the Styrian Business Promotion Agency (SFG). The COMET programme is administered by FFG.
References
 Aine et al. [2016] Sandip Aine, Siddharth Swaminathan, Venkatraman Narayanan, Victor Hwang, and Maxim Likhachev. Multi-Heuristic A*. The International Journal of Robotics Research, 35(1–3):224–243, 2016.
 [2] Zlatan Ajanovic, Bakir Lacevic, Georg Stettinger, Daniel Watzenig, and Martin Horn. Safe learning-based optimal motion planning for automated driving. URL http://arxiv.org/pdf/1805.09994v2.
 Ajanovic et al. [2018] Zlatan Ajanovic, Bakir Lacevic, Barys Shyrokau, Michael Stolz, and Martin Horn. Search-based optimal motion planning for automated driving. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4523–4530. IEEE, 2018. doi: 10.1109/IROS.2018.8593813.
 Ajanović et al. [2017] Zlatan Ajanović, Michael Stolz, and Martin Horn. Energy-efficient driving in dynamic environment: Considering other traffic participants and overtaking possibility. In Daniel Watzenig and Bernhard Brandstätter, editors, Comprehensive Energy Management, volume 6 of SpringerBriefs in Applied Sciences and Technology, pages 61–80. Springer, Cham, Switzerland, 2017. doi: 10.1007/978-3-319-53165-6_4.
 Ajanović et al. [2018] Zlatan Ajanović, Michael Stolz, and Martin Horn. A novel model-based heuristic for energy-optimal motion planning for automated driving. IFAC-PapersOnLine, 51(9):255–260, 2018. doi: 10.1016/j.ifacol.2018.07.042.
 Amos et al. [2018] Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289–8300, 2018.
 Barto et al. [1995] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1–2):81–138, 1995. doi: 10.1016/0004-3702(94)00011-O.
 Bellman [1952] Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716–719, 1952.
 [9] Mohak Bhardwaj, Sanjiban Choudhury, and Sebastian Scherer. Learning heuristic search via imitation. URL http://arxiv.org/pdf/1707.03034v1.
 Choudhury et al. [2018] Sanjiban Choudhury, Mohak Bhardwaj, Sankalp Arora, Ashish Kapoor, Gireeja Ranade, Sebastian Scherer, and Debadeepta Dey. Data-driven planning via imitation learning. The International Journal of Robotics Research, 37(13–14):1632–1672, 2018. doi: 10.1177/0278364918781001.
 Groshev et al. [2018] Edward Groshev, Aviv Tamar, Maxwell Goldstein, Siddharth Srivastava, and Pieter Abbeel. Learning generalized reactive policies using deep neural networks. In 2018 AAAI Spring Symposium Series, 2018.
 [12] Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, and David Silver. Learning to search with MCTSnets. URL http://arxiv.org/pdf/1802.04697v2.
 Hart et al. [1968] Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968. doi: 10.1109/TSSC.1968.300136.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
 Helmert et al. [2008] Malte Helmert, Gabriele Röger, et al. How good is almost perfect? In AAAI, volume 8, pages 944–949, 2008.
 Ichter et al. [2018] Brian Ichter, James Harrison, and Marco Pavone. Learning sampling distributions for robot motion planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7087–7094. IEEE, 2018. doi: 10.1109/ICRA.2018.8460730.
 Kim et al. [2017] Beomjoon Kim, Leslie Pack Kaelbling, and Tomas Lozano-Perez. Learning to guide task and motion planning using score-space representation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2810–2817. IEEE, 2017. doi: 10.1109/ICRA.2017.7989327.
 [18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. URL http://arxiv.org/pdf/1412.6980v9.
 Klambauer et al. [2017] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.
 Korf [1990] Richard E. Korf. Real-time heuristic search. Artificial Intelligence, 42(2–3):189–211, 1990. doi: 10.1016/0004-3702(90)90054-4.
 Likhachev et al. [2004] Maxim Likhachev, Geoffrey J. Gordon, and Sebastian Thrun. ARA*: Anytime A* with provable bounds on suboptimality. In Advances in Neural Information Processing Systems, pages 767–774, 2004.
 [22] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. URL http://arxiv.org/pdf/1811.01848v3.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017. URL https://www.nature.com/articles/nature24270?sf123103138=1.
 Tamar et al. [2016] Aviv Tamar, Y. I. Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2154–2162. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6046-value-iteration-networks.pdf.
 [25] Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. URL http://arxiv.org/pdf/1707.06203v2.
 Zhang et al. [2018] Clark Zhang, Jinwook Huh, and Daniel D. Lee. Learning implicit sampling distributions for motion planning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3654–3661. IEEE, 2018. doi: 10.1109/IROS.2018.8594028.