A novel approach to model exploration for value function learning

06/06/2019 ∙ by Zlatan Ajanovic, et al.

Planning and Learning are complementary approaches. Planning relies on deliberative reasoning about the current state and the sequence of future reachable states to solve a problem. Learning, on the other hand, focuses on improving system performance based on experience or available data. Learning to improve the performance of planning from experience on similar, previously solved problems is an ongoing line of research. One approach is to learn a value function (cost-to-go) that can be used as a heuristic to speed up search-based planning. Existing approaches in this direction use the results of previous searches to learn the heuristic. In this work, we present a search-inspired approach of systematic model exploration for learning the value function: the search does not stop when a plan is available but is prolonged, so that not only the resulting optimal path is used for learning but also an extended region around it. This, in turn, improves both the efficiency and robustness of subsequent planning. Additionally, the loss of admissibility incurred by using an ML heuristic is managed by bounding the ML heuristic with an admissible heuristic.


I Introduction

Having fast planning algorithms is crucial for the practical use of robots in changing environments and safety-critical tasks. The efficiency of heuristic search-based planning (A* search [13]) largely depends on the quality of the heuristic function that estimates the cost-to-go [15]. Ideally, if we knew the exact cost-to-go (an oracle), we could find the optimal solution with minimum effort, practically by traversing greedily. If the robot operates in similar environments, learning the cost-to-go function from previous search experience is a promising approach, as Planning and Learning have complementary strengths. Learning approaches struggle to provide performance guarantees, to explore safely, or to learn long-term rewards; in these aspects, planning algorithms can provide valuable support. On the other hand, planning algorithms are rather slow in high-dimensional spaces, which can be improved if planning is properly guided. Effective synergy of Planning and Learning has provided exceptional results so far, including the seminal achievement of super-human performance in the game of Go [23].

The interaction of Planning and Learning has a long history [8][20][7], with several modern directions including end-to-end learning approximations of planning algorithms (e.g. inspired by the Value Iteration algorithm [24], MCTS [12], MPC [6]), planning to guide exploration in Reinforcement Learning [25], [22], and learning to guide planning [17][9][26][16], as well as model learning and Model-Based Reinforcement Learning in general.

This work is in the direction of learning the value function in order to guide heuristic search-based planning. It is a significantly improved extension and generalization of the work [2], which focused on an automated driving application and considered a search-based optimal motion planning framework (SBOMP) [3] that utilized different admissible heuristics (numeric [4] and model-based heuristics [5]). Since these admissible heuristics did not consider dynamic obstacles, the authors proposed learning heuristics that take dynamic obstacles into account as well. The approaches most similar to the one presented in this work appeared in [11] and [10]. However, in [11] the authors used only nodes from the shortest path for learning, and in [10] the authors used backward Dijkstra's algorithm, which explores the whole search space.

Fig. 1: Nodes explored in Vanilla Shortest Path Problem (left) and Prolonged Heuristic Search (right).

The main contribution of this work is a novel approach for efficient and systematic exploration of the model, based on backward and prolonged heuristic search. This ensures that only interesting nodes are explored and that all explored nodes are used for value function learning.

Premise: For learning the value function, it is more beneficial to explore states in the neighborhood of the optimal path (policy) than elsewhere, as the agent will spend most of its time in the neighborhood of the optimal path.

Having explored the neighboring region around the optimal path helps the agent get back to the optimal path if the planner deviates, and provides better coverage for the A* algorithm, which always examines neighboring states. Inspired by this idea, we prolong the search even after the optimal path is found, in order to explore a wider region around the shortest path. Additionally, the direction of the search is flipped, such that the search starts from the goal node. In this way, all expanded nodes lead to the goal state and can therefore be used in the dataset for learning.

II Method

The presented approach uses an existing admissible heuristic function and a known model of the system to generate a dataset of exact state-cost data points. The dataset is used for supervised learning of the value function. The learned value function is then used as a heuristic function in the search, bounded by the admissible heuristic to provide guarantees on sub-optimality.

In principle, this approach differs from reinforcement learning since it is supervised, and from imitation learning since the exact optimal solution is used instead of expert demonstrations. Nevertheless, this exploration approach can also be used within a Model-Based Reinforcement Learning framework. It enables theoretically inexhaustible data generation from different scenarios and initial conditions, thus using computational resources offline in order to plan faster online, when necessary. The approach can also be used when it is important to focus exploration on certain parts of the state space to reduce uncertainty in the value function approximation.

II-A Dataset generation

The dataset consists of data points that carry information about the scenario (obstacles, initial and goal state) and the current state, together with the corresponding cost-to-go (i.e. from the current state to the goal state). For dataset generation, planning algorithms can be used to produce data points with the exact cost-to-go. In the vanilla Shortest Path (SP) problem, the goal is to find only one collision-free path (i.e. from the initial to the goal state), so planning stops when the goal state is reached. As the goal state is reached from only one node (and each node has only one parent), the exact cost-to-go can be computed only for nodes on the optimal path. Contrary to SP, in dataset generation the objective is to generate as many different data points as possible. One approach is to use backward dynamic programming (Dijkstra's algorithm); however, this would explore the whole search space, which is not practical in higher-dimensional problems.

We propose heuristic search-based exploration for dataset generation, as shown in Algorithm 1. In this framework, the search is done backwards from the goal node, so that all explored nodes can be used in the dataset, as they contain the exact cost to the goal. Additionally, as the region of higher interest is the neighborhood of the optimal path, the search is not stopped when the initial node is reached by some path (as in the SP problem), but prolonged until the Closed list grows by a prescribed factor. This prolongation ensures that more nodes in the neighborhood of the optimal path are explored. Data points are constructed such that, for each node in the Open and Closed lists, the corresponding scenario structure (grid) and cost-to-go are stored in the dataset. The cost-to-go from a node to the goal node is in fact the cost-to-come in the backward search. In this way, paths do not have to be reconstructed, and all expanded nodes are used in the dataset.

input : scenario generator, admissible heuristic, prolongation factor
output : D   // Dataset
begin
      D ← ∅   // Dataset
      foreach scenario do
            (grid, n_start, n_goal) ← GenerateScenario()   // New scenario
            (Open, Closed) ← ProlongedBackwardHeuristicSearch(grid, n_start, n_goal)   // Search-based Exploration
            // Extracting data from the search
            foreach node n in Open ∪ Closed do
                  D ← D ∪ {(grid, n_start, n_goal, n, cost-to-go(n))}   // Data points
            end
      end
      return D
end
Algorithm 1 Prolonged Heuristic Search for dataset generation
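A minimal Python sketch of the backward prolonged search is given below. The function name, the `neighbors` callback, and the interpretation of the prolongation factor `k` (Closed list allowed to grow to `k` times its size at the moment the initial node is first closed) are illustrative assumptions, not the authors' exact implementation.

```python
import heapq

def prolonged_backward_search(goal, start, neighbors, h, k=3.0):
    """Backward best-first search from `goal`; keeps expanding until the
    Closed list is k times larger than when `start` was first reached.
    Every closed node carries its exact cost-to-go (the cost-to-come of
    the backward search), so all of them can become training data points."""
    open_heap = [(h(goal, start), 0.0, goal)]   # entries: (f, g, node)
    g_cost = {goal: 0.0}
    closed = {}                                 # node -> exact cost-to-go
    size_at_solution = None

    while open_heap:
        f, g, node = heapq.heappop(open_heap)
        if node in closed:
            continue                            # stale entry, skip
        closed[node] = g
        if node == start and size_at_solution is None:
            size_at_solution = len(closed)      # solution found, keep searching
        if size_at_solution and len(closed) >= k * size_at_solution:
            break                               # prolongation budget reached
        for nxt, step_cost in neighbors(node):
            g_new = g + step_cost
            if g_new < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = g_new
                heapq.heappush(open_heap, (g_new + h(nxt, start), g_new, nxt))
    return closed                               # {node: exact cost-to-go}
```

Each entry of the returned dictionary can then be paired with the scenario grid to form one data point, as in Algorithm 1.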

II-B Value Function Learning

Learning the value function in this approach is a supervised learning problem (regression). The proposed model takes as input an image representing the node, the initial and goal nodes, and the situation (obstacles), as can be seen in Figure 1 (grayscale part), and returns a scalar value representing the estimated cost to reach the goal from that node.
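One possible way to assemble such an input is to stack the scenario into separate image channels, as in the sketch below; the channel layout and the function name are assumptions for illustration, not necessarily the encoding used in the paper.

```python
import numpy as np

def encode_input(grid_obstacles, node, start, goal):
    """Stack the scenario into a multi-channel image for the CNN.
    Channels: obstacles / query node / initial state / goal state
    (one possible layout, assumed for illustration)."""
    h, w = grid_obstacles.shape
    x = np.zeros((4, h, w), dtype=np.float32)
    x[0] = grid_obstacles            # 1 where a cell is blocked
    x[1, node[0], node[1]] = 1.0     # node whose cost-to-go is queried
    x[2, start[0], start[1]] = 1.0   # initial state
    x[3, goal[0], goal[1]] = 1.0     # goal state
    return x
```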

As it is preferred that the heuristic function underestimates the exact cost (admissibility), a non-symmetric loss function can be used. Asymmetry can be introduced by augmenting the Mean Square Error loss function as:

e_i = ĥ(n_i) − h*(n_i)   (1)

L = (1/N) Σ_{i=1..N} (1 + a · 1[e_i > 0]) · e_i²   (2)

with parameter a emphasizing the penalty for positive errors e_i > 0 (overestimates).
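A short sketch of this asymmetric loss is shown below. The weighting scheme and the default value of `a` follow the reconstruction in (1)-(2) above and are assumptions consistent with the text, not the authors' exact implementation.

```python
import numpy as np

def asymmetric_mse(h_pred, h_true, a=2.0):
    """Asymmetric MSE: positive errors (overestimates of the cost-to-go)
    are penalised (1 + a) times more strongly than underestimates,
    pushing the learned heuristic towards admissibility."""
    e = h_pred - h_true                 # e > 0 means overestimate
    w = 1.0 + a * (e > 0)               # extra weight on positive errors
    return float(np.mean(w * e ** 2))
```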

II-C Using ML Value Function as Heuristic function

The learned value function is used as a heuristic function in the search. To provide guarantees on sub-optimality, the ML heuristic is bounded by an admissible heuristic. In this way, the heuristic is ε-admissible, so the solution is at most ε times more expensive than the optimal solution [21]. Values of ε closer to 1 guarantee smaller deviation from the optimal solution but reduce computational performance. An alternative approach would be to use Multi-Heuristic A* Search (MHA*) [1].
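A minimal sketch of this bounding is shown below; the function names and the value of `eps` are illustrative. Since min(h_ML, ε·h_adm) ≤ ε·h_adm ≤ ε·h*, an A* search using this bounded heuristic returns a solution at most ε times more expensive than the optimum.

```python
def bounded_heuristic(node, h_ml, h_adm, eps=1.1):
    """Clamp the learned heuristic by eps times an admissible heuristic,
    making the resulting search eps-admissible."""
    return min(h_ml(node), eps * h_adm(node))
```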

III Experiment

For the experiment, a grid-world domain with 4-connected neighbors was used, with on average 33% of the cells covered by obstacles. In total, 531 different random scenarios were used for dataset generation, and from each scenario multiple data points were generated. For comparison, two datasets were created: one (representing the approach of [11]) uses only nodes from the optimal path (12,007 data points), and the other uses all nodes explored by the Backward Prolonged Heuristic Search (122,449 data points), as proposed in this work. The advantage of the Backward Prolonged Heuristic Search is clear, as about 10 times more data points are generated from the same number of scenarios, even in this simple 2D problem. The difference is expected to be even larger in higher-dimensional problems.

III-A Value Function Learning

For value function learning in this experiment, a fully convolutional neural network was used as the ML model. The complete architecture can be seen in Table I. Each layer uses the SELU nonlinear activation [19]. The network was trained for a fixed number of steps using mini-batches of images and a decaying learning rate. The network was initialized using the variance-scaling initializer [14] and optimized with the ADAM optimizer [18]. The asymmetric loss function uses the parameter a introduced above.

layer kernel stride output size
conv1
conv2
conv3
conv4
conv5
conv6
TABLE I: Model Architecture
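The sketch below reflects the training setup described above (six convolutional layers with SELU activations, variance-scaling initialization, ADAM, decaying learning rate). The kernel sizes, strides, channel widths, and learning-rate schedule are placeholders, since the concrete values from Table I are not reproduced here.

```python
import torch
import torch.nn as nn

class ValueCNN(nn.Module):
    """Fully convolutional value-function model with SELU activations.
    Six conv layers mirror Table I; widths, kernels, and strides are assumed."""
    def __init__(self, in_channels=4):
        super().__init__()
        chans = [in_channels, 32, 32, 64, 64, 64, 1]   # assumed channel widths
        layers = []
        for i in range(6):
            layers.append(nn.Conv2d(chans[i], chans[i + 1], kernel_size=3,
                                    stride=2 if i % 2 else 1, padding=1))
            if i < 5:
                layers.append(nn.SELU())
        self.net = nn.Sequential(*layers)
        # Variance-scaling (LeCun normal) initialisation, as recommended for SELU
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity='linear')

    def forward(self, x):
        # Global average over the spatial map yields one scalar cost estimate
        return self.net(x).mean(dim=(2, 3)).squeeze(1)

model = ValueCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decaying learning rate; call scheduler.step() once per epoch during training
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```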

III-B Using ML Value Function as Heuristic function

Three different heuristic functions are used in the experiment and compared in terms of solution quality (path length) and planning efficiency (number of explored nodes). The first is an admissible heuristic function based on the Manhattan distance. The second is the value function trained on the dataset generated from the solution path only. The third is the value function trained on the dataset from the proposed prolonged heuristic search.

IV Results

An extensive evaluation is still in progress and will be reported at the event. Initial results were very positive, showing fast convergence and very low MSE.

V Conclusion

The presented approach offers the possibility of effectively including Machine Learning in a deterministic planning framework, promising significant performance improvements in the form of a reduced number of explored nodes compared to planning with the admissible heuristic alone, while keeping guarantees on the sub-optimality of the solution. The approach makes maximal use of the computational resources invested in planning, as all nodes expanded during planning are used for learning. Future steps include studying the behavior in higher-dimensional problems and kinodynamic motion planning, and extending the approach to an end-to-end Model-Based Reinforcement Learning framework.

Acknowledgments

The project leading to this study has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675999, ITEAM project.

VIRTUAL VEHICLE Research Center is funded within the COMET – Competence Centers for Excellent Technologies – programme by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT), the Federal Ministry of Science, Research and Economy (BMWFW), the Austrian Research Promotion Agency (FFG), the province of Styria and the Styrian Business Promotion Agency (SFG). The COMET programme is administrated by FFG.

References