The solution of many continuous decision problems can be described as a process: the agent sets out from an initial state, passes through a series of intermediate states, and finally reaches the goal state. Imagine an agent in a maze that needs to find some key positions and pass through them one by one to get out.
An agent has two types of behavior. One is the micro action taken at every state, which is similar to muscle activity and is called reaction. The other is the change of trend in the reactions taken over a period of time, which is similar to human thought and is called planning. For the agent in the maze, a reaction is each small movement step, and planning is each decision about which position it should reach next.
In a complicated scene with a high-dimensional data stream, a long-term decision process, and a sparse supervision signal, an agent trained only to react [10, 9] can hardly perform well (see Appendix A for a demonstration). However, combining reaction and planning [4, 14, 3] can significantly improve its capability.
The essence of this improvement is that the agent has limited reaction capability, and the introduction of planning releases the agent from having to react across the whole task. If the agent in the maze only knows how to reach a nearby position, then given consecutive adjacent positions from a planner, it is still able to reach a specified remote position.
What is the connection between reaction and planning? Improving reaction capability means spending much time training a connecting structure [13, 5] before the task. Planning is not required once the reaction capability reaches a certain level, but reaching that level is difficult in a complicated scene (see Appendix B for a demonstration). Improving planning capability means designing a planner that can provide useful information to the reactive policy or divide the original task into easier tasks. Planning, however, consumes much more resources during the task.
Considering these features of reaction and planning, there should be a way to make full use of their advantages: giving the agent enough reaction capability without consuming many resources before the task, and ensuring good performance without consuming many resources during the task. To achieve this, an evaluation of the reaction capability is necessary, which helps to obtain a compatible planner.
A recent work (SoRB) [3] showed a novel way to handle problems in a complicated scene: the agent first samples states as waypoints, next connects the waypoints to get a planning graph, then finds a shortest path in the graph, and finally reacts along the waypoints on the shortest path. The authors found a powerful tool for incorporating planning techniques into RL: distance estimates obtained from RL.
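The waypoint pipeline described above can be sketched in a few lines. This is a minimal illustration, not the SoRB implementation: the function names `build_graph` and `shortest_path` are hypothetical, and Euclidean distance (`math.dist`) stands in for the learned distance estimate.

```python
import heapq
import math


def build_graph(waypoints, dist, max_edge_len):
    """Connect each ordered pair of waypoints whose estimated distance is at most max_edge_len."""
    graph = {i: [] for i in range(len(waypoints))}
    for i, a in enumerate(waypoints):
        for j, b in enumerate(waypoints):
            if i != j:
                d = dist(a, b)
                if d <= max_edge_len:
                    graph[i].append((j, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm. Returns a list of waypoint indices, or None if no path exists."""
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            path = [goal]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return None


# Toy example: Euclidean distance is an assumed stand-in for the learned estimate.
waypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
graph = build_graph(waypoints, math.dist, max_edge_len=1.5)
path = shortest_path(graph, 0, 3)  # subgoals the agent would react along
```

Shrinking `max_edge_len` removes edges, so the same call can return `None`, which corresponds to a failed path search.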
Based on SoRB, this paper analyzes the effect of two planning parameters on planning: the number of waypoints in the planning graph and its maximum edge length. An online adapting algorithm is then proposed, which adjusts the planning parameters based on the complexity of the state space and the reaction capability of the agent. With this algorithm, tasks are handled with relatively little computational cost and a high success rate.
Before using the online adapting algorithm, the agent has obtained a reactive policy trained by RL and a planner that constructs a planning graph and uses Dijkstra’s algorithm to find the shortest path, dividing the original task by setting subgoals along the path. The two parameters mentioned above have a significant effect on performance. An optimization strategy is created to find a satisfactory setting of the parameters, and a modified pattern search method is created to accelerate the optimization.
Goal-Conditioned RL: The state of the agent is determined by its current and goal state: . At every state the agent takes a reaction: . It has a reactive policy: . The environment of the agent has a reward function: , and a transition function: . The reactive policy is learned by the DDPG [9] algorithm. The agent has a function that assesses the value of each pair of state and action: . Ideally, , where is a discount factor. By decreasing the Bellman error: , the values will approach the ideal ones. By choosing the reaction of larger value: , better performance can be achieved. After alternately optimizing and , the agent will get better reacting and evaluating capability.
Distance Estimates Obtained from RL: In order to construct a planning graph, the agent must estimate the distance between every pair of waypoints without additional information. If the environment has a special setting: and , then the DDPG algorithm will learn values that are closely connected to the shortest distance between two states and can be used to determine the length of each edge in the planning graph.
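To illustrate how a learned value can serve as a distance, the sketch below assumes the SoRB-style setting of a reward of -1 per step and no discounting, under which the value of the policy's action is the negative of the number of steps to the goal. The function names, the toy one-dimensional actor and critics, and the pessimistic `max` over an ensemble are all assumptions made for illustration.

```python
def estimate_distance(state, goal, actor, critics):
    """Distance estimate d(state, goal) = -Q(state, actor(state, goal), goal).

    Assumes reward -1 per step and no discounting. With several critics,
    the largest (most pessimistic) distance is kept; this aggregation
    rule is an assumption of this sketch.
    """
    action = actor(state, goal)
    return max(-critic(state, action, goal) for critic in critics)


# Toy stand-ins on a 1-D line where each step moves one unit.
def toy_actor(state, goal):
    return 1 if goal > state else -1


def toy_critic(state, action, goal):
    return -float(abs(goal - state))  # exact negative step count


def optimistic_critic(state, action, goal):
    return toy_critic(state, action, goal) + 0.5  # underestimates the distance


d = estimate_distance(2, 7, toy_actor, [toy_critic, optimistic_critic])
```

Each edge length of the planning graph would then be filled with such an estimate for the corresponding pair of waypoints.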
Planning: The idea of combining planning with RL has been around for a long time [2, 15]. A recent work [16] uses a CNN as a planner that conveys useful global information to the reactive policy. Another work [11] uses hierarchical RL, where a high-level controller sets goals and a low-level controller produces locomotion. Both controllers are trained in an actor-critic process. In this paper, waypoints are filled into the state space in advance rather than generated dynamically during the task. Since the agent can estimate the distance between two waypoints, it can search for a path without additional learning.
Pattern Search: The two planning parameters are optimized according to the test results of tasks. The optimization has no gradient. After the parameters change, the planning graph must also be rebuilt, which can take a long time. Therefore, pattern search is used to reduce the optimization time. In the original pattern search method [6], the search interval is gradually reduced to precisely locate the optimal value. In this paper, the search interval is instead increased to quickly find the range of the optimal value. Adjustments are made to ensure an appropriate termination of the search.
The online adapting algorithm has two parts: optimizing the planning parameters and pattern search. The optimization strategy is derived from an analysis of the relationship between planning and reaction. Pattern search accelerates the optimization, gives it a soft convergence condition, and sets a termination condition on it.
3.1 Optimizing Planning Parameters
The two planning parameters are changed according to three different test results of tasks: the agent reaches the goal successfully; the agent finds a path to the goal but cannot reach it; the agent cannot find a path to the goal.
Success: A shortest path is obtained by visiting waypoints in the planning graph using Dijkstra’s algorithm, which takes more time as the number of waypoints grows. If the agent can reach the goal, we can try setting fewer waypoints to get a quicker reaction.
Cannot Reach: This means there is a pair of adjacent waypoints on the path such that the agent cannot move from one to the other by reaction. The shortest distance between two states is estimated by a Q network trained through DDPG. This distance estimate is efficient but not accurate enough (see Appendix C for a demonstration). In SoRB, three Q networks are trained together and distributional Q values [1] are used to ensure robust distance estimates. If the problem still exists given these measures, the reaction capability of the agent must have been overestimated. Therefore, the maximum edge length of the planning graph should decrease so that easier subgoals are set for the agent.
No Path: This means the start and goal states are not connected in the planning graph. It is caused by sparsity of waypoints or edges. The solution is to increase both parameters, which brings more waypoints and edges into the graph.
Combining the above three situations, we get an optimization algorithm, shown in Algorithm 1.
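The three cases can be condensed into a single update rule. The sketch below is only an illustration of the direction of each adjustment, not a reproduction of Algorithm 1: the names `Outcome`, `PlanningParams`, and `adapt`, and the step sizes `wp_inc`, `wp_dec`, and `len_step`, are hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Outcome(Enum):
    SUCCESS = auto()       # agent reached the goal
    CANNOT_REACH = auto()  # path found, but the agent failed along it
    NO_PATH = auto()       # start and goal not connected in the graph


@dataclass(frozen=True)
class PlanningParams:
    num_waypoints: int
    max_edge_len: float


def adapt(params, outcome, wp_inc=20, wp_dec=10, len_step=0.5):
    """Return updated planning parameters for one test outcome (step sizes are assumed)."""
    if outcome is Outcome.SUCCESS:
        # Fewer waypoints make the shortest-path search cheaper.
        return replace(params, num_waypoints=max(2, params.num_waypoints - wp_dec))
    if outcome is Outcome.CANNOT_REACH:
        # Shorter edges give the reactive policy easier subgoals.
        return replace(params, max_edge_len=max(len_step, params.max_edge_len - len_step))
    # NO_PATH: the graph is too sparse, so grow both parameters.
    return PlanningParams(params.num_waypoints + wp_inc, params.max_edge_len + len_step)


params = PlanningParams(num_waypoints=100, max_edge_len=5.0)
for outcome in (Outcome.NO_PATH, Outcome.CANNOT_REACH, Outcome.SUCCESS):
    params = adapt(params, outcome)
```

The increment for waypoints is chosen larger than the decrement here so that the parameters recover quickly from a disconnected graph.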
3.2 Pattern Search
The aim of using pattern search is to quickly determine the range of the optimal value and then narrow this range. The planning parameters may fluctuate to some extent but stay close to the optimal value. Each parameter is optimized independently, and an extra group of parameters is used to record its optimization status. The process of pattern search is described below, taking the number of waypoints (denoted by ) as an example.
Initially, the number of waypoints is set to a small enough value. Its increment (denoted by ) is larger than its decrement (denoted by ). The reason for this setting is that a smaller number of waypoints is more likely to cause a failure, which is dangerous, while a larger number increases task time, which is relatively tolerable. A larger increment helps avoid converging into a dangerous area.
At the beginning, the number of waypoints increases continuously and its increment increases exponentially, which makes the number of waypoints far exceed the optimal value. Then the number of waypoints is set back to its last searching value, the increment is reset to its initial value, and the number of waypoints continues increasing. After several such repetitions, the optimal value is determined within a small range. Now we can fix the increment and try to reduce the number of waypoints. When ’No Path’ happens, meaning that the number of waypoints may have entered the dangerous area, it should increase. Although the agent may perform well in the following tasks, risk still exists. Therefore the decrement is reduced at the same time to restrict attempts at reducing the number of waypoints. As the decrement decreases, the number of waypoints gradually moves away from the dangerous area and fluctuates in a small range. The search is terminated when the decrement is small enough. The full process is shown in Algorithm 2. To further accelerate the optimization, another search process is provided (see Appendix E for details).
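One possible reading of this search, for a single parameter, is sketched below. It is not Algorithm 2 itself: the function name `pattern_search`, the deterministic oracle `needs_more` (standing in for "’No Path’ happens too often"), the halving of the decrement, and all numeric values are assumptions made for illustration.

```python
def pattern_search(needs_more, n_init, inc_init, dec_init, max_iters=10_000):
    """Two-phase search for one planning parameter.

    needs_more(n) returns True when n is too small. It is assumed to
    become False for large enough n, otherwise phase 1 never ends.
    """
    n = n_init
    # Phase 1: exponential growth until overshoot, then restart from the
    # last searching value with the increment reset, until one plain
    # increment is enough to leave the failing region.
    while True:
        inc, last, grew = inc_init, n, 0
        while needs_more(n):
            last = n
            n += inc
            inc *= 2
            grew += 1
        if grew <= 1:
            break
        n = last
    # Phase 2: fixed increment, shrinking decrement. Every failure pushes
    # n back up and halves the decrement; stop once the decrement vanishes.
    dec = dec_init
    for _ in range(max_iters):
        if dec < 1:
            break
        if needs_more(n):
            n += inc_init
            dec //= 2
        else:
            n -= dec
    return n


threshold = 200  # hypothetical smallest sufficient number of waypoints
result = pattern_search(lambda n: n < threshold, n_init=50, inc_init=10, dec_init=8)
```

With a deterministic oracle the result settles slightly above the threshold. In the paper's setting the outcome of each test is random, so the parameter would fluctuate instead of stopping at one fixed value.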
The experiments are conducted in a 2D environment (see Appendix). First, a satisfactory parameter setting is obtained using the adapting algorithm. Then, one of the planning parameters is fixed to see the effect of the other on task time and success rate.
4.1 Changing Process of Planning Parameters
When the number of waypoints converges (we concentrate on it since the maximum edge length is changed along with it), it fluctuates around a certain value. On average, it goes up times every time it goes down times. When it goes up, ’No Path’ happens in at least of the tasks (assume the number is exactly ). When it goes down, the frequency of ’No Path’ is less than (assume the number is exactly ). Then we can calculate the success rate of tasks when the number of waypoints is around its convergence value:
Notice that is much larger than both and . Equation 1 can be simplified:
This means that, using the adapting algorithm, we would finally get a convergent value that makes the agent succeed in of tasks. Such a prediction is not accurate enough since it is derived under many assumptions; however, it is useful for understanding the training result.
Two changing processes of the planning parameters are shown. The setting of the extra parameters in Figure 1(a) is the same as in Algorithms 1 and 2 except that is set to 10. In Figure 1(b), the setting is exactly the same. Reaction capability also influences the convergence values of the planning parameters (see Appendix D for details).
The randomness comes from two sources: waypoints and tasks are both randomly sampled. In each iteration, there are 40 different waypoint settings, in each of which 5 different tasks are given.
4.2 Comparison of Different Planning Parameters Settings
In Figure 1(b) we get a satisfactory setting of the planning parameters: 400 waypoints and a maximum edge length of 5. Taking this setting as the center, we now compare task time and success rate under different parameter settings. We first fix the maximum edge length to 5 and change the number of waypoints. Then we fix the number of waypoints to 400 and change the maximum edge length. The results are shown in Figures 2(a) and 2(b).
The distances between each pair of waypoints are cached before the task so that the time complexity of searching for the next waypoint is reduced from to . This makes task time grow almost linearly with the number of waypoints. Failed tasks are not counted when calculating the average task time.
The experiments show that, with an appropriate parameter setting, the pattern search can quickly find a satisfactory setting of the two planning parameters. This method can be extended to more sophisticated problems where there are more than three planning parameters to optimize without gradients, as long as an optimization strategy (similar to Algorithm 1) is given.
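A rough sketch of how such a cache can make the per-step search linear in the number of waypoints is given below. The details are assumptions rather than the paper's implementation: the cache is filled with a plain Floyd-Warshall pass, the goal is treated as one of the waypoints, and `all_pairs_shortest` and `next_waypoint` are hypothetical names.

```python
import math


def all_pairs_shortest(edge_len):
    """Floyd-Warshall over an N x N matrix of edge lengths (math.inf where no edge)."""
    n = len(edge_len)
    dist = [row[:] for row in edge_len]
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


def next_waypoint(dist_to_waypoints, cached, goal_index, max_edge_len):
    """Single pass over the waypoints: choose the reachable one that
    minimizes (distance to it) + (cached graph distance from it to the goal)."""
    best, best_cost = None, math.inf
    for w, d in enumerate(dist_to_waypoints):
        if d <= max_edge_len:
            cost = d + cached[w][goal_index]
            if cost < best_cost:
                best, best_cost = w, cost
    return best


INF = math.inf
# Three waypoints in a chain: 0 - 1 - 2, with no direct edge between 0 and 2.
edge_len = [[0.0, 1.0, INF],
            [1.0, 0.0, 1.0],
            [INF, 1.0, 0.0]]
cached = all_pairs_shortest(edge_len)  # computed once, before the task
choice = next_waypoint([0.5, 1.2, 3.0], cached, goal_index=2, max_edge_len=1.5)
```

The expensive all-pairs computation happens once per planning graph, while each decision during the task only scans the waypoints once.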
5 Discussion and Future Work
Combining planning and reaction can help to handle complicated tasks that have a high-dimensional data stream, a long-term decision process, and a sparse supervision signal. A good planning algorithm can make full use of limited reaction capability, which is usually obtained by deep reinforcement learning. A specific planning algorithm has parameters that need to be changed to accommodate the reaction capability of the agent. The optimization direction of these parameters cannot be derived by calculating gradients. Therefore, we need to figure out the relationship between planning and reaction to create an optimization strategy. After the strategy is determined, an improved pattern search method can be used to greatly accelerate the optimization.
In this paper, the planning method creates a memory of the state space from which the agent can get useful instructions during tasks. In the future, we could design a planner that creates and removes waypoints repeatedly to form a more efficient memory of the environment. We could also try to improve planning and reaction simultaneously, which might bring us powerful agents with excellent reactions (see Figure 4) in the whole environment. Furthermore, the agent needs to explore the environment when there are no waypoints initially.
[1] A distributional perspective on reinforcement learning. Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 449–458.
[2] (1993) Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles (Eds.), pp. 271–278.
[3] (2019) Search on the replay buffer: bridging planning and reinforcement learning. International Conference on Learning Representations (ICLR).
[4] (2018) PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120.
[5] Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
[6] (1961-04) “Direct search” solution of numerical and statistical problems. J. ACM 8 (2), pp. 212–229.
[7] (2016) Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
[8] (1993) Learning to achieve goals. In IJCAI.
[9] (2016) Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR).
[10] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[11] (2017-07) DeepLoco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36 (4).
[12] (2016) Learning real manipulation tasks from virtual demonstrations using LSTM.
[13] (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536.
[14] (2018) Semi-parametric topological memory for navigation. International Conference on Learning Representations (ICLR).
[15] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1), pp. 181–211.
[16] (2016) Value iteration networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, pp. 2154–2162.
[17] (2014) How transferable are features in deep neural networks?. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 3320–3328.
Appendix A Introducing Planning
Figure 3 shows the 2D environment where experiments are done.
It is difficult to train a DDPG agent (left), because the supervision signal only appears when the agent is close to the goal, and the agent gets no reward in most of its experience. If we set waypoints (right) that the agent can reach one by one, the remote goal is easier to achieve.
Appendix B Deliberate Training
There are tricks for improving reaction capability substantially. However, such methods do not generalize to the whole environment. In Figure 4, the agent can reach the goal by reacting alone, which means it remembers the walls in some way (in the parameters of the reacting network).
To achieve this, the agent is first trained in a simple environment where there are no walls. This makes it move in the right direction. The reacting network obtained is reused in the new environment [17, 7]. Then the agent is repeatedly trained on the same task (the start and goal are fixed) while the difficulty of the task is gradually increased (the agent needs to bypass increasingly thick walls). After such a complicated training process, the agent has excellent reactions in this small area.
To extend this ability to the whole environment, we need much more training data to let the reacting network memorize all the walls. The required scale of such a network is unknown. Besides, the training process should be set up carefully to avoid catastrophic forgetting.
To get more efficient training data, imitation learning could be used. This requires manually provided data obtained from human experience.
Appendix C Problematic Distance Estimates
The Q values learned by DDPG are accurate only in a small area around the agent. This is enough, because the Q values are used to determine the edge lengths of close waypoint pairs, and the reactive policy need not care about remote states.
Figure 5(a) shows an image of all the Q values. (33,33) is the center.
In goal-conditioned RL, a training episode ends when the agent is close enough to the goal (inside a center circle). Therefore, there is no transition whose start state is in the center circle (i.e., whose start and goal states are very close), causing a lack of training data and unreliable Q values near the center. Fortunately, this has not caused trouble in the experiments, because the agent is not likely to choose a subgoal that is extremely close to it.
Another problem can cause much trouble: there exist some pairs of states whose shortest path is long but is estimated to be short (notice the white part in the corner of Figure 5(a)). The agent would add an edge between such pairs in the planning graph, but in fact it cannot react from one to the other. Figure 5(b) shows a typical failure.
Appendix D Comparison of Different Reaction Capability
Agents with different reaction capabilities have different requirements for the planning parameters. The agent in Figure 6(a) is the same as the one in Section 4, which is trained for 200 thousand steps by DDPG, whereas the agent in Figure 6(b) is trained for only 40 thousand steps.
These training curves differ obviously from those in Section 4, because another pattern search algorithm (see Appendix E) is used to further accelerate the optimization of the planning parameters. The randomness of the experiments could hinder the optimization if we used Algorithm 2.
In Figure 6(b), the maximum edge length fluctuates around 7, which conflicts with intuition. It should have been smaller than the one in Figure 6(a), because an agent trained with fewer steps has worse reaction capability. The reason for the larger value is that the distances estimated by this agent are generally larger. An agent that performs badly often overestimates the distance between two states. This phenomenon further reveals a characteristic of the distance measurement: it is a heuristic created within the agent, and its fundamental purpose is to help the agent make decisions rather than predictions.
Appendix E Another Pattern Search Method
In Algorithm 2, we set an end condition that is fulfilled once the parameter has increased enough times. If the algorithm ends normally, the agent gets a high success rate. However, even if the parameter is at a small value, there is still some probability of success, which would cause an early end to the exponential growth and to the whole search. Figure 7 gives an example.
For a small parameter value, although we cannot expect failures to occur one after another, they still occur frequently. To make the parameter increase quickly in such a situation, we can extend the time span of the pattern search, creating conditions for exponential growth that are easier to meet. In Algorithm 2, exponential growth happens only when growth also happened in the previous iteration. Now, the condition considers not just the last iteration but the last several iterations, and termination no longer happens. Algorithm 3 shows the new pattern search process.
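The relaxed growth condition can be illustrated with a sliding window over recent iterations. This is a sketch of the idea only, not Algorithm 3: the class name `WindowedIncrement`, the doubling factor, and the window length are assumptions.

```python
from collections import deque


class WindowedIncrement:
    """Increment that doubles whenever growth happened in any of the
    last `window` iterations, instead of only in the previous one."""

    def __init__(self, inc_init, window):
        self.inc_init = inc_init
        self.inc = inc_init
        self.recent = deque(maxlen=window)  # True where the parameter grew

    def step(self, grow):
        """Return the amount to add to the parameter in this iteration."""
        if not grow:
            self.recent.append(False)
            if not any(self.recent):
                self.inc = self.inc_init  # no growth in the whole window: reset
            return 0
        self.inc = self.inc * 2 if any(self.recent) else self.inc_init
        self.recent.append(True)
        return self.inc


schedule = WindowedIncrement(inc_init=10, window=3)
steps = [schedule.step(g) for g in (True, False, True, False, False, False, True)]
```

In this toy run a single successful iteration between two failures does not interrupt the exponential growth, whereas three quiet iterations in a row reset the increment.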
Appendix F Environment and Hyperparameters
The 2D environment used in this paper is the same as the one in SoRB [3], except that the environment noise is smaller: the standard deviation of the noise is 1.0 in SoRB and 0.3 here. Smaller environment noise makes tasks easier and hence reduces the number of waypoints required, making the training process shorter. Settings for RL training are listed in Table 1.
Table 1: Settings for RL training.
| Hyperparameter | Value |
|---|---|
| training iterations | 2e5 and 4e4 |
| training steps per environment step | 1:1 |
| random steps at start of training | 1000 |
| replay buffer size | half of training iterations |
| OU-stddev, OU-damping | 1.0, 1.0 |
| target network update frequency | every 5 steps |
| target network update rate | 0.05 |