Learning to Progressively Plan

by Xinyun Chen, et al.

For problem solving, making reactive decisions based on the problem description is fast but inaccurate, while search-based planning with heuristics gives better solutions but can be exponentially slow. In this paper, we propose a new approach that improves an existing solution by iteratively picking and rewriting its local components until convergence. The rewriting policy employs a neural network trained with reinforcement learning. We evaluate our approach in two domains: job scheduling and expression simplification. Compared to common effective heuristics, baseline deep models, and search algorithms, our approach efficiently finds solutions of higher quality.
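The iterative-rewriting idea in the abstract can be illustrated with a toy sketch: start from an initial solution and repeatedly rewrite one local component until no rewrite improves the objective. The rewrite rules and the cost function below are purely illustrative stand-ins, not the paper's learned policy.

```python
# Toy sketch of "improve a solution by rewriting local components until
# convergence". The rules (deleting redundant "*1" / "+0" fragments) and the
# cost (expression length) are illustrative assumptions, not the paper's method.
RULES = [("*1", ""), ("+0", "")]

def cost(expr):
    return len(expr)  # toy objective: shorter expression is better

def rewrite_until_convergence(expr):
    improved = True
    while improved:
        improved = False
        for lhs, rhs in RULES:
            candidate = expr.replace(lhs, rhs, 1)  # rewrite one local component
            if cost(candidate) < cost(expr):
                expr = candidate
                improved = True
                break
    return expr

print(rewrite_until_convergence("x*1+y+0"))  # -> x+y
```

In the paper, the choice of which component to rewrite and which rule to apply is made by a learned policy rather than this fixed rule scan.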





1 Introduction

2 Optimality as a way to discover macro actions

2.1 Theory

If the optimal actions are always uniformly distributed, then finding macro actions is meaningless, since there is an exponential number of candidate action sequences.

What makes the search meaningful is that macro actions do exist in general MDPs, and we should be able to recover them from optimal solutions.

Here is a theorem that needs to be proven:

Theorem 1.

Given an MDP with a known state and action space and known dynamics, the optimal action sequences cluster across a family of reward distributions.

Example: Consider a 2D maze with multiple rooms and a sparse reward placed in some room. No matter where the reward is placed, the optimal solution always begins by leaving the current room.


Note that the value inside the room is completely determined by the values on its interface with the rest of the state space. If the value ordering on the interface remains the same, so does the optimal policy within the room (except at the interface itself). ∎
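The maze intuition can be checked numerically on a minimal model. The sketch below uses a 1-D "two-room" chain (states 0-3 inside room A, state 4 the doorway, states 5-9 room B) rather than a true 2D maze; the chain, the absorbing-goal reward, and value iteration are assumptions made for illustration. Moving the reward around inside room B never changes the optimal policy inside room A.

```python
def greedy_policy(n, goal, gamma=0.9):
    """Value iteration on an n-state chain; the goal is absorbing with entry reward 1."""
    def q(s, d, V):
        s2 = min(max(s + d, 0), n - 1)          # actions: d = -1 (left) or +1 (right)
        return (1.0 if s2 == goal else gamma * V[s2])
    V = [0.0] * n
    for _ in range(200):                         # iterate well past convergence
        V = [0.0 if s == goal else max(q(s, d, V) for d in (-1, 1))
             for s in range(n)]
    return {s: max((-1, 1), key=lambda d: q(s, d, V)) for s in range(n) if s != goal}

# States 0..3 are room A, state 4 the doorway, 5..9 room B.
# Sweep the reward over several positions inside room B.
room_A_actions = {goal: [greedy_policy(10, goal)[s] for s in range(5)]
                  for goal in (5, 7, 9)}
print(room_A_actions)  # every room-A action is +1: "move toward the door"
```

For each reward placement, the optimal action in every room-A state is +1 (toward the doorway), matching the claim that the policy inside the room is fixed once the interface values keep their ordering.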

Given this theory, clustering makes sense: first cluster the optimal trajectories into macro actions under a few reward distributions, then reuse those macro actions under new reward distributions.
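One simple way to realize the clustering step above is frequent-subsequence mining over the optimal trajectories: subsequences shared across reward distributions become macro-action candidates. The fixed subsequence length, the count threshold, and the toy trajectories below are all illustrative assumptions.

```python
from collections import Counter

def mine_macros(trajectories, length=3, min_count=2):
    """Return action subsequences of the given length that recur across trajectories."""
    counts = Counter()
    for traj in trajectories:
        for i in range(len(traj) - length + 1):
            counts[tuple(traj[i:i + length])] += 1
    return [seq for seq, c in counts.most_common() if c >= min_count]

# Toy optimal trajectories under three reward placements ("R" = right, etc.);
# they share the prefix that leaves the starting room.
trajs = [list("RRRUU"), list("RRRDD"), list("RRRUD")]
macros = mine_macros(trajs)
print(macros)  # ('R','R','R') appears in all three trajectories
```

Here the shared prefix ('R', 'R', 'R') surfaces as the strongest macro-action candidate, mirroring the "always leave the room first" structure of the maze example.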

3 Experiments

3.1 Generating complicated expressions

[Put a few sentences saying how the expressions are generated.] [Make sure at least some of them are from real cases.]

3.2 Comparison against rule-based systems

Halide has a well-engineered rule-based simplification system. [Explain the system a bit here.] Our RL agent now beats the rule-based system using the policy network alone (i.e., picking the most probable action from the network).

With the search-based method, the results are even better. [We need some numbers here.]

3.3 Extraction of Macro Actions

Using the principle of optimality, we were able to find patterns in the optimal action sequences. Fig. shows some of these patterns.

3.4 Generalization capability of patterns

We define the patterns as macro actions and apply them to unseen simplification cases. Do we reduce the number of steps?
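The step-count question can be made concrete with a toy sketch: if a mined macro replaces a recurring subsequence with a single decision, an unseen trajectory needs fewer decisions. The trajectory and macro below are illustrative assumptions.

```python
def compress(traj, macro):
    """Rewrite a trajectory so each occurrence of the macro counts as one decision."""
    out, i = [], 0
    while i < len(traj):
        if tuple(traj[i:i + len(macro)]) == macro:
            out.append(macro)       # one macro decision
            i += len(macro)
        else:
            out.append(traj[i])     # one primitive decision
            i += 1
    return out

traj = list("RRRUURRRD")            # 9 primitive decisions
macro = ("R", "R", "R")
print(len(traj), len(compress(traj, macro)))  # 9 -> 5 decisions
```

With the macro available, the agent makes 5 decisions instead of 9, which is the kind of reduction this subsection should measure on real unseen cases.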

Faster Exploration

Using macro actions, we can achieve much faster exploration and learn to solve more complicated problems (problems that involve a much deeper search tree). [Show a few examples.]