Learning to Progressively Plan

09/30/2018 ∙ by Xinyun Chen, et al.

For problem solving, making reactive decisions based on the problem description is fast but inaccurate, while search-based planning using heuristics gives better solutions but can be exponentially slow. In this paper, we propose a new approach that improves an existing solution by iteratively picking and rewriting its local components until convergence. The rewriting policy employs a neural network trained with reinforcement learning. We evaluate our approach in two domains: job scheduling and expression simplification. Compared with common effective heuristics, baseline deep models, and search algorithms, our approach efficiently produces solutions of higher quality.
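
To make the rewriting loop concrete, here is a minimal sketch in Python; pick_region, rewrite_region, and score are hypothetical placeholders for the learned rewriting policy and the domain-specific cost function, not the paper's actual implementation.

def progressively_rewrite(solution, pick_region, rewrite_region, score,
                          max_steps=100):
    # Iteratively improve an initial solution by local rewrites until no
    # rewrite lowers the cost (a simple proxy for "until convergence").
    # score is assumed to be a cost to minimize.
    best = solution
    for _ in range(max_steps):
        region = pick_region(best)                 # policy picks a local component
        candidate = rewrite_region(best, region)   # policy proposes a rewrite
        if score(candidate) >= score(best):        # no improvement: stop
            break
        best = candidate
    return best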


1 Introduction

2 Optimality as a way to discover macro actions

2.1 Theory

If the optimal actions were always uniformly distributed, then finding macro actions would be meaningless, since there is an exponential number of them.

What makes the search meaningful is that macro actions do exist in general MDPs, and we should be able to find them from the optimal solutions.

Here is a theorem that needs to be proven:

Theorem 1.

Given an MDP with a known set of states and actions and known dynamics, the optimal action sequences cluster within a family of reward distributions.

Example: consider a 2D maze with multiple rooms and a sparse reward placed at some location in one of the rooms. No matter where the reward is placed, the optimal solution always begins by leaving the current room.

Proof.

Note that the values inside a room are completely determined by the values on its interface with the rest of the state space. If the ordering of the interface values remains the same, so does the optimal policy within the room (except at the interface). ∎
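
As a toy illustration of the first observation in the proof (not part of the paper's experiments), the snippet below assumes a 1D room with deterministic moves, zero reward inside the room, and a discount factor gamma: the in-room values, and hence the greedy in-room policy, are a function of the pinned interface values only.

import numpy as np

GAMMA = 0.95

def in_room_values(room_length, interface_values, iters=200):
    # Cells 1..room_length form the room; cells 0 and room_length+1 are the
    # interface, whose values are fixed by the rest of the maze.
    v = np.zeros(room_length + 2)
    v[0], v[-1] = interface_values
    for _ in range(iters):
        for s in range(1, room_length + 1):       # only in-room cells are updated
            v[s] = GAMMA * max(v[s - 1], v[s + 1])
    return v

# Same interface values give the same in-room values, hence the same greedy
# in-room policy, no matter where the reward sits elsewhere in the maze.
print(in_room_values(5, (1.0, 0.2)))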

Given this result, clustering makes sense: we first cluster the optimal trajectories into macro actions under a few reward distributions, and then apply these macro actions under new reward distributions.
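
A minimal sketch of this clustering step, under the simplifying assumption that optimal trajectories are given as lists of discrete action ids and that a macro is a frequent fixed-length subsequence; the length and count thresholds below are illustrative only.

from collections import Counter

def extract_macros(trajectories, length=3, min_count=5):
    # Return action subsequences that recur often enough across the optimal
    # trajectories to be treated as macro actions.
    counts = Counter()
    for actions in trajectories:
        for i in range(len(actions) - length + 1):
            counts[tuple(actions[i:i + length])] += 1
    return [macro for macro, c in counts.items() if c >= min_count]

# Macros found here can then be added to the action set used when planning
# under a new reward distribution.
macros = extract_macros([[0, 1, 1, 2, 0, 1, 1, 2], [0, 1, 1, 2, 3]],
                        length=3, min_count=2)
print(macros)   # [(0, 1, 1), (1, 1, 2)]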

3 Experiments

3.1 Generating complicated expressions

[Put a few sentences saying how the expression is generated?][Make sure at least some of them are from real cases.]

3.2 Comparison against rule-based systems

Halide has a well-engineered rule-based system. [Explain the system a bit here]. Our RL agent already beats the rule-based system using the policy network only (i.e., picking the most probable action from the network).

With the search-based method, the results are even better. [We need some numbers here.]
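
For concreteness, the sketch below shows the two decoding modes being compared; policy_topk, apply_action, and cost are hypothetical hooks standing in for the trained policy network and the domain-specific rewrite and cost functions. Greedy decoding (the "policy network only" setting) corresponds to beam width 1.

def beam_search_rewrite(expr, policy_topk, apply_action, cost,
                        beam_width=4, depth=10):
    # Keep the beam_width most probable rewrite sequences, scored by the
    # policy's accumulated log-probability; return the cheapest expression seen.
    beam = [(0.0, expr)]                           # (accumulated log-prob, expression)
    best = min(beam, key=lambda b: cost(b[1]))
    for _ in range(depth):
        candidates = []
        for logp, e in beam:
            for action, action_logp in policy_topk(e, beam_width):
                candidates.append((logp + action_logp, apply_action(e, action)))
        if not candidates:
            break
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_width]
        best = min(beam + [best], key=lambda b: cost(b[1]))
    return best[1]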

3.3 Extraction of Macro Actions

Using the principle of optimality, we were able to find patterns in the optimal action sequences. Fig. shows some of these patterns.

3.4 Generalization capability of patterns

We define these patterns as macro actions and apply them to unseen simplification cases. Do we reduce the number of steps?

Faster Exploration. Using macro actions, we can achieve much faster exploration and learn to solve more complicated problems (problems that involve a much deeper search tree). Show a few examples.
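
A sketch of this idea, assuming a gym-style env.step() interface and macros represented as tuples of primitive actions; this illustrates the augmented action set, not the exact training setup. Executing a macro as a single temporally extended action lets random exploration reach deeper states in fewer decisions.

import random

def step_macro(env, macro):
    # Execute a macro (a fixed sequence of primitive actions) as one action.
    total_reward, done = 0.0, False
    for action in macro:
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done

def explore(env, primitive_actions, macros, steps=1000):
    # Uniform exploration over the augmented action set (primitives + macros).
    obs = env.reset()
    for _ in range(steps):
        choice = random.choice(primitive_actions + macros)
        if isinstance(choice, tuple):              # macro: a tuple of primitives
            obs, reward, done = step_macro(env, choice)
        else:
            obs, reward, done, _ = env.step(choice)
        if done:
            obs = env.reset()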