Path planning is an important technique for robotic applications such as autonomous mobile robots and arm manipulation. The objective of this technique is to find an optimal or a feasible path from the initial state to the goal, subject to constraints derived from a variety of factors including the non-holonomic property of wheeled mobile robots, collision avoidance, or the limits associated with the joints of a robotic manipulator.
These constraints and the existence of local traps mean that a large amount of computation is required to find a path. Traditional search algorithms such as A* (Hart et al., 1968) and heuristic-based Rapidly-exploring Random Trees (RRT) (Vemula et al., 2014) rely on heuristics to reduce the computation required to find a path. Typically, these heuristics are manually crafted for a particular robot and environment. For example, to accelerate the path search in three-dimensional (3D) space (i.e., the 2D position and heading angle of the robot) under the non-holonomic constraints of car-like robots, the Hybrid A* algorithm (Montemerlo et al., 2008) combines the following manually designed heuristics: 1) a non-holonomic shortest path cost calculated by the Dubins (Dubins, 1957) or Reeds-Shepp (Reeds and Shepp, 1990) algorithms (assuming that no obstacles exist), and 2) a holonomic shortest path cost with obstacles calculated by the backward Dijkstra algorithm. Both can be computed in considerably less time than the search itself. However, it is not trivial for humans to manually craft such heuristics for each specific search problem. Furthermore, although a simple combination of heuristics is effective in simple environments, such heuristics do not perform as reliably in more complicated environments.
In recent years, Convolutional Neural Networks (CNNs), a machine-learning technique, have achieved impressive results in a variety of domains such as vision (Noh et al., 2015; Yu and Koltun, 2016), language (Vinyals et al., 2014), audio (Hershey et al., 2016), and games (Silver et al., 2017; Mnih et al., 2015). In this paper, we propose CNN-based heuristic learning methods. As depicted in Figure 1, our convolutional networks take feature images, which are extracted from an obstacle map, and a goal position as inputs, and predict the heuristic (estimated cost-to-go) value at each position in a 2D map as the output. These outputs are used as heuristics in path planners to accelerate the search. CNNs have the following advantages for learning heuristics in path planning problems, especially for robots: 1) CNNs can capture both the global structure of environments (e.g., the road map) and local details (e.g., obstacle shapes), and 2) CNNs can generate spatially structured outputs (e.g., heuristic values at neighboring configurations tend to transition smoothly). We utilize path planning algorithms such as Backward Dijkstra and A* to generate ground-truth heuristic values. Because our CNNs predict heuristics in a fully convolutional way, both inference and training can be performed efficiently on all states in an environment at the same time. We also propose a learning method that combines supervision by a planner with the Temporal Difference (TD) learning method to improve sampling efficiency. Similarly to ours, the approach proposed by Bhardwaj et al. (Bhardwaj et al., 2017) learns a heuristic by imitating the Backward Dijkstra algorithm. However, it uses fully connected neural networks applied to each state feature to obtain the output for that state, so inference and training must be performed independently on each state. Our model can also learn heuristics from paths alone, rather than requiring dense cost-to-go values produced by an algorithm that performs a whole-state search, the computational cost of which would be prohibitive for more complicated problems.
In addition, our method predicts heuristics for all the states of a 2D map at the same time using fully convolutional NNs, whereas the method of (Bhardwaj et al., 2017) predicts a heuristic independently for every state using fully connected NNs, which is computationally inefficient. Consequently, they had to employ the DAgger algorithm (Ross et al., 2010; Ross and Bagnell, 2014) to sample training data efficiently. Furthermore, their method relies only on a whole-state-space search algorithm (i.e., Backward Dijkstra) to generate the ground truth, whereas we propose a more efficient method that relies only on an optimal path search algorithm (i.e., the A* algorithm). This enables our method to be applied to larger-scale problems and a wider range of domains.
(Wulfmeier et al., 2015, 2016) learned CNNs to produce cost maps from demonstrations (i.e., Inverse Reinforcement Learning). Their purpose was to learn previously unknown cost functions for planning so as to imitate the demonstrated behavior. Our objective is different in the sense that the heuristic is learned to reduce the computational time required for planning. Path planners have also been utilized for reactive CNN policy learning to solve robot navigation problems. For example, (Kanezaki et al., 2018) learned a reactive CNN policy using global path planner results as supervision signals. In another study, (Gao et al., 2017) utilized global planner paths as inputs to improve a CNN policy based on reinforcement learning. "Value Iteration Networks (VIN)" (Tamar et al., 2017) embed a differentiable planning module (i.e., value iteration) into CNNs, which can learn planners, including the mapping from observations to cost maps and the state transition probabilities, in an end-to-end fashion. (Gupta et al., 2017) applied VIN to mobile robot visual navigation problems to perform map localization and planning simultaneously in an end-to-end framework. In a VIN framework, the training objective is fundamentally arbitrary, and their experiments show imitation learning and reinforcement learning only, because their objective was not to use learned heuristics to speed up planners. The computational cost of value iteration becomes prohibitively large when the state space is large. As a result, VIN is limited to search problems with a small state space, e.g., a 2D grid world. Our method does not rely on value iteration at inference time, and can be applied to problems with a much larger search space. Although we limit our experiments to a path-finding problem in simple 2D grid worlds, our method could also be applied to larger problems such as 3D path planning with non-holonomic constraints. In summary, the main contributions of our paper are as follows:
- learning heuristics using CNNs in a fully convolutional way over states
- proposing three learning methods (Backward Dijkstra (BD), Sparse, and Sparse+TD) that imitate the cost-to-go values generated by either path planning algorithms or the Temporal Difference method
- demonstrating a significant reduction in search cost compared to a simple heuristic search method in our 2D grid world planning experiments
This paper is organized as follows: in Section 2, we describe our proposed framework as illustrated in Figure 1. First, we describe a search-based path planning algorithm and its heuristic function in Subsection 2.1. Further details of our heuristic learning approach are provided in Subsection 2.2, where we introduce three different algorithms depending on the characteristics of the training data. In Section 3, we describe the dataset and the implementation of the CNNs in our proposed framework, and present experiments that demonstrate the effectiveness of the framework. Finally, we summarize the results and discuss future work in Section 4.
2 Proposed Framework
2.1 Search-based Path Planning

We consider a search-based path planner on a graph as a baseline. The pseudocode of this planner is provided in Algorithm 1. A graph search begins at a start vertex. At each vertex evaluation, the planner expands the next search candidates using a successor function, which returns successor edges and child vertexes. Each candidate vertex is validated by a collision-checking function, which returns whether an edge is occupied by an obstacle in the environment. Each valid candidate is evaluated by a search score function $f$, and all candidate vertexes are pushed into a priority queue with their scores. At the next iteration, the queue pops the vertex with the best score, whose successor vertexes are then evaluated and pushed into the queue in turn. The procedure is repeated until the search reaches a goal or the queue becomes empty.
Based on Algorithm 1, the Dijkstra search is obtained by defining the score function using the cost-so-far value $g(v)$:

$$f(v) = g(v) \qquad (1)$$
The cost-so-far $g(v)$ is calculated by accumulating the costs $c(e)$ of edges along the shortest path found so far during the search. By defining a search heuristic function $h(v)$, the A* algorithm can be derived from the score function

$$f(v) = g(v) + h(v) \qquad (2)$$
and we define a search that depends only on the heuristic as the greedy search algorithm:

$$f(v) = h(v) \qquad (3)$$
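As a concrete illustration, the planner of Algorithm 1 with the three score functions of eqs. (1)-(3) can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the function names and the dictionary-based bookkeeping are our own assumptions.

```python
import heapq
import math

def search(start, goal, succ, is_valid, score):
    """Best-first search over a graph (a sketch of Algorithm 1).

    succ(v) yields (child, edge_cost) pairs, is_valid(u, v) rejects
    edges blocked by obstacles, and score(g, v) is the pluggable
    search score f(v); the queue always pops the best (lowest) score.
    """
    g = {start: 0.0}                      # cost-so-far values
    parent = {start: None}
    frontier = [(score(0.0, start), start)]
    while frontier:
        _, v = heapq.heappop(frontier)
        if v == goal:                     # goal reached: rebuild the path
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for child, cost in succ(v):
            if not is_valid(v, child):
                continue
            new_g = g[v] + cost
            if child not in g or new_g < g[child]:
                g[child] = new_g
                parent[child] = v
                heapq.heappush(frontier, (score(new_g, child), child))
    return None                           # queue exhausted: no path exists

# The three score functions of eqs. (1)-(3):
dijkstra_score = lambda g, v: g                      # f(v) = g(v)
a_star_score   = lambda h: (lambda g, v: g + h(v))   # f(v) = g(v) + h(v)
greedy_score   = lambda h: (lambda g, v: h(v))       # f(v) = h(v)
```

On the 8-connected grids used later in the paper, `succ` would yield horizontal and vertical moves with cost 1 and diagonal moves with cost sqrt(2).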
2.2 Learning Heuristics using Convolutional Neural Networks for Planner
Our goal is to find a more efficient heuristic function that minimizes the search cost (the number of vertexes visited or examined during the search). As shown in Figure 1, our method takes an environment, represented as a binary obstacle map, as input, extracts feature maps from it, and then uses a CNN to predict a heuristic value at every node in the graph, which we call a heuristic map. The predicted heuristic map is used as a look-up table for querying heuristic values during a graph search based on the planner described in the previous section. Note that one can extend our method to take a continuous-valued cost map as input, where each pixel represents the cost of visiting the corresponding state.
Because the CNN in this method is fully convolutional, it can simultaneously predict a heuristic value for every node in a graph (single-shot inference). In addition, it can leverage mature GPGPU implementations such as cuDNN. We learn the heuristic map with the aid of the planner, which is employed during the training of the CNNs as the target of prediction. We introduce three variants of the learning algorithm.
Dense target learning with Backward Dijkstra (BD):
Our CNN can be directly trained by minimizing the squared error between the prediction and the target cost-to-go value at every node. The cost-to-go of a vertex is defined as the cost accumulated along a shortest path to the goal. The Backward Dijkstra algorithm calculates the cost-to-go values of all valid vertexes in a graph by propagating the search from the goal in Algorithm 1 until no vertex remains to be opened. Our training is performed by minimizing the loss function

$$\mathcal{L} = \sum_{v} M(v) \left( h_\theta(v) - C(v) \right)^2 \qquad (4)$$
where $C(v)$ denotes the cost-to-go value map generated by Backward Dijkstra, $h_\theta(v)$ is the CNN prediction, and $M(v)$ is a mask that ignores invalid vertexes the Backward Dijkstra search cannot visit during target value generation, e.g., areas occupied or surrounded by obstacles (Figure 2).
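As a sketch, the masked loss amounts to a sum of squared errors over the heuristic map with invalid vertexes zeroed out. The toy array values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def masked_l2_loss(pred, target, mask):
    """Masked squared error between the predicted heuristic map and the
    cost-to-go targets; masked-out (invalid) vertexes contribute nothing."""
    diff = mask * (pred - target)
    return float(np.sum(diff ** 2))

# Toy 2x2 maps: the bottom-right vertex is unreachable (mask = 0),
# so its arbitrary target value does not contribute to the loss.
pred   = np.array([[1.0, 2.0], [3.0, 9.0]])
target = np.array([[1.0, 3.0], [3.0, 0.0]])
mask   = np.array([[1.0, 1.0], [1.0, 0.0]])
```

Here only the top-right vertex differs under the mask, so the loss is (2 - 3)^2 = 1.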
Sparse target learning with A* path search (Sparse): The computational time required to generate the cost-to-go targets with Backward Dijkstra is often prohibitively long for large-scale problems (planning in larger 2D grid maps, high-dimensional problems, etc.), and can become a bottleneck for learning heuristics. We therefore also propose a learning method that relies only on target cost-to-go values at the vertexes belonging to the shortest path found by the A* algorithm, given randomly sampled start and goal positions. A* is much faster than Backward Dijkstra, which improves the data collection efficiency in terms of the variation of environments. As in dense target learning, eq. (4) is used as the loss function, although the training mask $M(v)$ is 1 only at vertexes along a path.
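The only change relative to dense learning is how the mask in eq. (4) is built: from the vertexes on the A* path instead of from Backward Dijkstra reachability. A sketch, assuming paths are lists of (row, col) grid indices:

```python
import numpy as np

def sparse_mask(path, shape):
    """Training mask for the Sparse variant: 1 only at vertexes lying on
    the shortest path returned by A*, 0 everywhere else."""
    mask = np.zeros(shape)
    for r, c in path:
        mask[r, c] = 1.0
    return mask
```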
Sparse target learning with TD error minimization (Sparse+TD): Learning with sparse target signals may result in under-fitting because there is no supervision signal at pixels that are not visited by the A* path. We propose utilizing the temporal difference (TD) learning method to compensate for this lack of supervision.
A denser target estimate $\tilde{C}$ is obtained by the update

$$\tilde{C}(v) \leftarrow \min_{v' \in \mathrm{Succ}(v)} \left[ c(v, v') + \tilde{C}(v') \right]$$

where $\tilde{C}$ is initialized with the current prediction $h_\theta$ (with appropriate boundary conditions at the goal and obstacle vertexes). The value can be updated further by applying this step iteratively, and it can be implemented as a convolution with fixed kernels and biases followed by a minimum operation along an axis representing successor vertexes (Tamar et al., 2017). Note that we only update the value iteratively during training to obtain denser target values of the cost-to-go. Using the updated cost-to-go estimate, the loss function can be written as
$$\mathcal{L}_{TD} = \mathcal{L} + \lambda \sum_{v} \bar{M}(v) \left( h_\theta(v) - \tilde{C}(v) \right)^2 \qquad (5)$$

where $\bar{M}(v)$ is 1 at vertexes not on the A* path, 0 otherwise, and $\lambda$ balances the weight of the TD minimization loss. We can also update the value iteratively with multiple steps to obtain the target cost-to-go estimate. In our experiment, we used multiple steps to iteratively update the value.
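The min-over-successors update can be sketched directly with array shifts. The paper implements it as a fixed-kernel convolution followed by a minimum over a successor axis; the boundary conditions at the goal and obstacle cells below are our assumptions for this sketch.

```python
import math
import numpy as np

def td_backup(values, obstacles, goal):
    """One backup C(v) <- min over neighbors v' of [ c(v, v') + C(v') ]
    on an 8-connected grid, with straight moves costing 1 and diagonal
    moves costing sqrt(2)."""
    h, w = values.shape
    padded = np.full((h + 2, w + 2), np.inf)   # out-of-bounds = unreachable
    padded[1:-1, 1:-1] = values
    best = np.full((h, w), np.inf)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            cost = math.hypot(dr, dc)
            # value map shifted so entry (r, c) holds C(r + dr, c + dc)
            shifted = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            best = np.minimum(best, cost + shifted)
    best[obstacles] = np.inf   # assumed boundary condition: obstacles stay unreachable
    best[goal] = 0.0           # assumed boundary condition: zero cost-to-go at the goal
    return best
```

Applying the backup repeatedly propagates cost-to-go values outward from the goal, which is how denser targets can be obtained from an initial prediction.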
3 Experimental Setup and Results
We trained and evaluated our algorithms on a 2D grid world path planning problem with the dataset provided by (Bhardwaj et al., 2017). We used seven different types of environments in the dataset: Shifting Gaps, Bugtrap and Forest, Forest, Gaps and Forest, Single Bugtrap, Mazes, and Multiple Bugtraps. Each environment type contains different kinds of local traps. For example, the Shifting Gaps environment has an obstacle traversing the central section of the 2D map, obstructing passage between the left and right sides. The obstacle has an opening at a vertical position that is randomly sampled during dataset generation. Simple heuristics such as the Euclidean heuristic may undesirably guide the search into local traps by greedily moving towards the goal without considering the position of the opening.
Each environment type consists of 800 training 2D grid maps represented as binary images (each cell either occupied by an obstacle or not) and 100 testing maps. We consider each pixel in a map as a vertex, and the planning problem is to find a path from the start vertex to the goal vertex in an 8-connected grid. The cost is defined as the length of a path. Edges connected to a vertex occupied by an obstacle are considered invalid. Although we randomly sampled the start and goal positions to generate supervision during training, we fixed the start and goal positions during evaluation in order to be comparable with the evaluation in (Bhardwaj et al., 2017).
3.2 Implementation details
Our neural network architecture employs dilated convolutions and an encoder-decoder structure to extract global and local spatial context from the 2D input maps and to produce spatially consistent output images. The encoder CNN applies a convolution module three times to produce feature maps with smaller spatial dimensions and a wider spatial context. The convolution module consists of three convolution layers whose dilation factors (Yu and Koltun, 2016) are incremented from 1 to 3. The numbers of convolution channels of the three modules are 16, 32, and 64, respectively. The decoder CNN repeats a deconvolution module three times. The deconvolution module is similar to the convolution module, except that the first convolution is replaced with a deconvolution with an upscaling factor of 2. The numbers of convolution channels of the three modules are 32, 16, and 16, respectively, except that the last convolution of the third deconvolution module produces a single-channel output in the form of a heuristic map. The input to the CNN consists of feature maps extracted from the 2D obstacle map: 1) the obstacle map itself, 2) the distance from the nearest obstacle, and 3) the distance from the goal, each represented as an image and stacked as channels. The obstacle and goal distances are often used to construct Artificial Potential Fields (Qureshi and Ayaz, 2017), where the goal distance serves as an attractive potential and the obstacle distance as a repulsive potential. We provide these features so that the network can learn the heuristic more easily.
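The three input channels can be sketched as follows. Straight-line distances and the brute-force nearest-obstacle computation are simplifying assumptions of this sketch; the paper does not specify the exact distance computation.

```python
import numpy as np

def feature_maps(obstacle_map, goal):
    """Stack the three CNN input channels into a (3, H, W) array:
    1) the binary obstacle map, 2) the distance to the nearest obstacle
    (a repulsive-potential-like feature), and 3) the distance to the
    goal (an attractive-potential-like feature)."""
    h, w = obstacle_map.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    goal_dist = np.hypot(ys - goal[0], xs - goal[1])
    obs = np.argwhere(obstacle_map > 0)
    if len(obs) > 0:
        # brute-force distance to the nearest obstacle cell (small maps only)
        d = np.hypot(ys[..., None] - obs[:, 0], xs[..., None] - obs[:, 1])
        obstacle_dist = d.min(axis=-1)
    else:
        obstacle_dist = np.full((h, w), float(np.hypot(h, w)))
    return np.stack([obstacle_map.astype(float), obstacle_dist, goal_dist])
```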
During training, we randomly sample 32 maps from the dataset to construct a mini-batch for a stochastic gradient descent step. For each map, we randomly sample start and goal positions until A* finds a valid path between them, after which we generate the cost-to-go targets as described in Section 2.2 using either Backward Dijkstra or A*. A random image translation is applied to the inputs as data augmentation. We used Adam (Kingma and Ba, 2014) as the stochastic gradient descent algorithm. During testing, we used the greedy search of eq. (3) as the planner, with the heuristic map produced by our trained CNNs.
For each type of environment in the dataset, training is performed for a fixed number of epochs. Each training run takes approximately 10 hours using a single GTX 1080 Ti for the CNNs and a Core i7-7700K for on-the-fly ground-truth generation by the planner.
Figure 3 shows the training curves of the mean absolute error between the predicted heuristic values and the ground-truth cost-to-go values on the evaluation set. BD consistently produced a smaller error than Sparse and Sparse+TD because it can utilize ground truths at all states in an environment as training targets. Sparse and Sparse+TD produce fairly similar results, although they can only access ground truths at states along an optimal path. Table 1 shows the results of evaluating the trained models as heuristic value estimators in a greedy path planner on two metrics: search cost and path quality. The search cost is defined as the number of vertexes expanded during the search, where a smaller number corresponds to a shorter search time. The path quality is the accumulated distance moved along the generated path.
Consistent with the learning-curve evaluation, BD outperforms the others in both metrics, whereas no significant difference is observed between Sparse and Sparse+TD. We used Sparse in the subsequent experiments because it offers a good balance between training-data generation efficiency, algorithmic simplicity, and performance.
| Environment | Search cost (BD) | Search cost (Sparse) | Search cost (Sparse+TD) | Path quality (BD) | Path quality (Sparse) | Path quality (Sparse+TD) |
| --- | --- | --- | --- | --- | --- | --- |
| Bugtrap and Forest | 450 | 544 | 594 | 359 | 373 | 380 |
| Gaps and Forest | 300 | 357 | 445 | 337 | 348 | 350 |
Table 2 compares our trained heuristic with a simple Euclidean distance heuristic and a heuristic trained with SaIL (Bhardwaj et al., 2017).
The results quantitatively show that our method significantly outperforms the other methods in terms of the search cost in all environments.
Although the greedy path planner does not aim to find the optimal path, our methods produce paths that closely approximate the optimal path (Optimal) in terms of path quality. The Euclidean heuristic also produces path quality close to the ground-truth optimal path, because it leads the search aggressively towards the goal, which enables it to find the minimum path length in this simple holonomic 2D path-finding problem. However, its search cost is far larger than ours. Compared to (Bhardwaj et al., 2017), our results suggest that our convolutional model predicts more effective heuristics from simply generated feature maps, without the carefully designed features fed into the simple fully connected networks of (Bhardwaj et al., 2017). Our feature extraction relies almost entirely on the fully convolutional architecture.
We also compared the computational time on our local machine (GTX 1080 Ti and Intel Core i7-7700K) against the Euclidean heuristic baseline. We did not compare against the results of (Bhardwaj et al., 2017) because their pure-Python implementation is too slow for a fair timing comparison, whereas our planners are written in highly optimized C++ and the CNNs for heuristic prediction run on the GPU. We measured the average planning time of 1) A* with the Euclidean heuristic, 2) greedy search with the Euclidean heuristic, and 3) greedy search with our learned heuristic (CNN inference plus planning). The learned heuristic reduces the planning time considerably relative to the Euclidean baseline; even after adding the computational cost of the CNNs, our method is significantly faster than the baselines.
| | Search cost | Path quality |
| Planner | Greedy | A* | SaIL (Bhardwaj et al., 2017) | Greedy | SaIL (Bhardwaj et al., 2017) |
| Bugtrap and Forest | 273 | 20367 | 544 | 35056 | 751 | 325 | 352 | 373 | 395 |
| Gaps and Forest | 259 | 12386 | 357 | 19981 | 8913 | 316 | 322 | 348 | 945 |
As shown in Figure 4, the Euclidean heuristic often leads the search into local traps owing to its ignorance of the obstacle structure of the environment, which causes wasted search effort. One may notice that our model often produces jagged paths. This is because the heuristic predicted by our CNN model is globally consistent but locally noisy. However, our main interest is reducing the computational cost of the path search rather than achieving optimality; moreover, the obtained feasible paths could be locally optimized or smoothed as a post-processing step.
In this paper, we proposed a novel CNN-based heuristic learning framework for fast planners. Our experiments on path-finding problems in 2D grid worlds showed that the proposed learning approaches significantly decrease the search effort compared to a handcrafted heuristic search. Our convolutional method demonstrates a promising direction for learning heuristic functions that minimize the search cost in complicated environments. Our work can be extended to more complicated and high-dimensional search problems, such as non-holonomic path planning for mobile robots (e.g., implementing a learned heuristic as CNNs in the Hybrid A* algorithm) and robot arm motion planning (e.g., learning sampling heuristics or the distance metric in RRT).
- Hart et al. (1968) P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, July 1968.
- Vemula et al. (2014) Anirudh Vemula, Sanjiban Choudhury, and Sebastian Scherer. Learning motion planning assumptions. Carnegie Mellon University Technical Report, August 2014.
- Montemerlo et al. (2008) Michael Montemerlo, Jan Becker, Suhrid Bhat, Hendrik Dahlkamp, Dmitri Dolgov, Scott Ettinger, Dirk Haehnel, Tim Hilden, Gabe Hoffmann, Burkhard Huhnke, Doug Johnston, Stefan Klumpp, Dirk Langer, Anthony Levandowski, Jesse Levinson, Julien Marcil, David Orenstein, Johannes Paefgen, Isaac Penny, Anna Petrovskaya, Mike Pflueger, Ganymed Stanek, David Stavens, Antone Vogt, and Sebastian Thrun. Junior: The stanford entry in the urban challenge. J. Field Robot., 25(9):569–597, 2008.
- Dubins (1957) Lester E Dubins. On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics, 79:497–516, 1957.
- Reeds and Shepp (1990) J. A. Reeds and L. A. Shepp. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics, 145(2):367–393, 1990.
- Noh et al. (2015) Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pages 1520–1528, 2015.
- Yu and Koltun (2016) Fisher Yu and Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In Proc. Int. Conf on Learning Representations (ICLR), 2016.
- Vinyals et al. (2014) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
- Hershey et al. (2016) Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin W. Wilson. CNN architectures for large-scale audio classification. CoRR, abs/1609.09430, 2016.
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Bhardwaj et al. (2017) Mohak Bhardwaj, Sanjiban Choudhury, and Sebastian Scherer. Learning heuristic search via imitation. In Proc. 1st Annual Conf. on Robot Learning (CoRL), pages 271–280, 2017.
- Ross et al. (2010) Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. No-regret reductions for imitation learning and structured prediction. CoRR, abs/1011.0686, 2010.
- Ross and Bagnell (2014) Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979, 2014.
- Wulfmeier et al. (2016) Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch this: Scalable cost-function learning for path planning in urban environments. In Proc. IEEE Int. Conf. on Intelligent Robots and Systems (IROS), pages 2089–2095, 2016.
- Wulfmeier et al. (2015) Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. In Neural Information Processing Systems Conference, Deep Reinforcement Learning Workshop, 2015.
- Kanezaki et al. (2018) Asako Kanezaki, Jirou Nitta, and Yoko Sasaki. GOSELO: goal-directed obstacle and self-location map for robot navigation using reactive neural networks. IEEE Robotics and Automation Letters, 3(2):696–703, 2018.
- Gao et al. (2017) Wei Gao, David F. C. Hsu, Wee Sun Lee, Shengmei Shen, and Karthikk Subramanian. Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation. In Proc. 1st Annual Conf. on Robot Learning (CoRL), 2017.
- Tamar et al. (2017) Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI), 2017.
- Gupta et al. (2017) Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 7272–7281, 2017.
- Qureshi and Ayaz (2017) Ahmed Hussain Qureshi and Yasar Ayaz. Potential functions based sampling heuristic for optimal path planning. CoRR, abs/1704.00264, 2017.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.