Reinforcement learning (RL) is widely used to train an agent, such as a robot, to perform a task from the reward feedback provided by its environment. Examples include training agents to play Atari games (Mnih et al., 2015), to defeat a champion at the game of Go (Silver et al., 2016), and to exceed human scores in 49 Atari games (Guo et al., 2016), as well as learning to control a manipulator arm to screw a cap onto a bottle (Levine et al., 2016) or to manipulate building blocks (Nair et al., 2018). Across these tasks, perhaps the most central idea in RL is the value function, which represents the overall reward feedback expected from any state (Sutton et al., 1998). The agent is trained to optimize the value function, which caches knowledge about reward, in order to learn a single task. In real environments, however, agents are required to perform multi-goal tasks in which rewards are sparse, such as placing different bricks at designated positions with a manipulator arm. How can we design a method that performs multi-goal tasks well in a sparse-reward environment?
We approach these problems the way a human would. When dealing with complex tasks, humans set goals for themselves and keep moving toward them. The agent should therefore set and achieve new goals during a self-supervised process; in this way, it can perform a set of tasks in a sparse-reward environment. To ensure that the agent can interact with a complex environment and learn how to approach the target during training, we must first choose a universal distance function. General value functions (Sutton et al., 2011) represent the reward of any state $s$ with respect to achieving a given goal $g$. In general RL, the value function is represented by a function approximator, such as a neural network with parameters $\theta$. The function approximator exploits the structure of the state space to learn the values of observed states and to generalise to unobserved ones. Goals are usually chosen from what the agent can achieve, so the goal space usually has the same structure as the state space. The idea of value function approximation can therefore be extended to both states and goals by using a universal value function approximator (Schaul et al., 2015). If the value function is related to distance, for example when the reward is the negative Mahalanobis distance in a latent space (Nair et al., 2018), then the smaller the distance, the larger the reward. We can likewise use a general function approximator to represent the distance function.
To solve tasks in sparse-reward or even reward-free environments, we ask whether the reward can be replaced by a distance, which exists in any environment. It is, however, quite difficult to train a function that accurately evaluates the distance between states directly from the raw signals provided by the environment. For instance, in most visual tasks the pixel-wise Euclidean distance between observations bears little relation to the distance between the underlying states (Ponomarenko et al., 2015; Zhang et al., 2018). We propose to address this challenge by using the minimum number of transitions needed to move from a state $s$ to a goal $g$: the distance function $D(s, g)$ is defined as the minimum number of transitions from the state $s$ to the goal $g$. In any task, the agent moves to a new state by repeatedly choosing actions until the goal is reached. Therefore, in a sparse-reward or even reward-free environment, the distance function can effectively train the agent to keep approaching any available goal until it reaches it.
In this paper, our main contribution is a general-purpose technique that performs several different tasks in sparse-reward or even reward-free environments. Learning to achieve multiple goals with little or no reliance on rewards matters in the real world, for example when robots are used to relieve humans of repetitive or dangerous work. To train effectively, we modify the framework of the standard RL algorithm: we abandon the action-value function and train our model directly with the distance function. In addition, our method incorporates the bridge-point idea of SoRB (Eysenbach et al., 2019), which allows it to perform better than SoRB on complex problems.
2.1 Reinforcement Learning
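Reinforcement learning studies an agent that interacts with an environment in order to maximize the reward it receives. We model this problem as a Markov decision process (MDP) defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, an initial state distribution with density $p_1(s_1)$, and transition probabilities $p(s_{t+1} \mid s_t, a_t)$. The agent's actions are selected by a policy $\pi_\theta$ with parameters $\theta$. The goal is to find the parameters $\theta$ that maximize the expected sum of future rewards from the start state, denoted by the performance objective $J(\pi_\theta)$. The total discounted reward from time step $t$ onwards is called the return, $r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$, where $0 < \gamma < 1$ is the discount factor. We can write the performance objective as an expectation,
$$J(\pi_\theta) = \mathbb{E}\left[ r_1^\gamma \mid \pi_\theta \right]. \tag{1}$$
Value functions are defined as the expected return: $V^\pi(s) = \mathbb{E}\left[ r_1^\gamma \mid S_1 = s; \pi \right]$ and $Q^\pi(s, a) = \mathbb{E}\left[ r_1^\gamma \mid S_1 = s, A_1 = a; \pi \right]$. They satisfy the following equation, called the Bellman equation,
$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]. \tag{2}$$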
The actor-critic is a widely used architecture based on the policy gradient theorem (Sutton et al., 1998; Peters and Schaal, 2008; Degris et al., 2012a). It consists of two eponymous components. An actor adjusts the parameters $\theta$ of the stochastic policy $\pi_\theta(s)$. Instead of the unknown true action-value function $Q^\pi(s, a)$, an approximate action-value function $Q^w(s, a)$ with parameter vector $w$ is used. A critic estimates this action-value function using an appropriate policy evaluation algorithm such as temporal-difference learning.
2.2 Deep Deterministic Policy Gradient Algorithms
Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016) is an off-policy actor-critic algorithm for continuous action spaces. DDPG consists of two parts, an actor and a critic. The actor represents a deterministic target policy $\mu_\theta(s)$, and the critic approximates the action-value function $Q^\mu(s, a)$, which helps the actor learn the policy. In contrast to ordinary stochastic policy gradients, DDPG uses a deterministic policy gradient. Generalised policy iteration (Sutton et al., 1998), which alternates policy evaluation and policy improvement, is used in most model-free reinforcement learning algorithms. Policy evaluation uses temporal-difference learning (Bhatnagar et al., 2007; Degris et al., 2012b; Peters and Schaal, 2008) or Monte Carlo evaluation to estimate the action-value function $Q^\mu(s, a)$. Policy improvement is a greedy maximisation of the estimated action-value function, $\mu^{k+1}(s) = \arg\max_a Q^{\mu^k}(s, a)$.
2.3 Hindsight Experience Replay
Tasks with multiple goals and sparse rewards have long been a major challenge in reinforcement learning. For sparse rewards, the standard solution is to introduce a new, informative reward function that guides the agent toward the goal. Although such shaped rewards solve some problems, they remain difficult to apply to more complex ones. Intuitively, multi-goal tasks also require more training samples, and more efficient use of them, than single-goal tasks. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) is a technique for learning effectively from samples collected in sparse-reward environments. HER not only improves sample efficiency but also makes it possible to learn from sparse reward signals. The method is based on training universal policies (Schaul et al., 2015) that take both the current state and the goal state as inputs. For any trajectory $s_1, \dots, s_T$, the central idea of HER is to store each transition $s_t \to s_{t+1}$ in the replay buffer not only with the original goal but also with a subset of other goals.
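The relabeling idea described above can be sketched as follows. This is a minimal illustration, not the implementation of Andrychowicz et al.: the function name `relabel_with_hindsight`, the tuple layout, and the choice of sampling substitute goals from states reached later in the same episode (the "future" strategy) are assumptions made for the example, and a goal-reached flag stands in for the sparse reward.

```python
import random
from typing import List, Optional, Tuple

State = Tuple[int, int]
Step = Tuple[State, int, State]                  # (state, action, next_state)
Stored = Tuple[State, int, State, State, bool]   # (state, action, next_state, goal, goal_reached)


def relabel_with_hindsight(
    trajectory: List[Step],
    original_goal: State,
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Stored]:
    """Store every transition with the original goal and with up to k substitute goals."""
    rng = rng or random.Random(0)
    stored: List[Stored] = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Transition paired with the goal the agent was actually pursuing.
        stored.append((s, a, s_next, original_goal, s_next == original_goal))
        # Assumed "future" strategy: substitute goals are states reached later in the episode.
        future = [step[2] for step in trajectory[t:]]
        for g in rng.sample(future, min(k, len(future))):
            stored.append((s, a, s_next, g, s_next == g))
    return stored


# Hypothetical three-step episode that never reaches its original goal (5, 5).
trajectory = [((0, 0), 1, (0, 1)), ((0, 1), 1, (0, 2)), ((0, 2), 0, (1, 2))]
replay = relabel_with_hindsight(trajectory, original_goal=(5, 5), k=2)
```

Even though the episode fails with respect to its original goal, some relabeled transitions end exactly at their substitute goal, so the buffer contains successful examples to learn from.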
3 Goal Distance Gradient
To use distance instead of reward in reinforcement learning, two points must be considered. First, replacing reward with distance means that the reward function and the action-value function are replaced by a distance function. How should the distance function be defined and estimated? Can the methods previously used to evaluate the action-value function also estimate the distance function? Second, without an action-value function, how can we use the distance function to improve the policy? By itself, the distance function cannot provide an effective gradient for policy improvement. We describe the estimation of the distance function in Section 3.1 and the Goal Distance Gradient method in Section 3.2.
3.1 Estimate the Distance Function by TD
The distance function $D(s, g)$ represents the minimum number of transitions from the state $s$ to the goal $g$. Compared with the value function $V(s)$, the distance function has a clear directionality, from $s$ to $g$. In fact, $V(s)$ also implicitly contains a goal: the state in which the maximum cumulative reward is obtained. $D(s, g)$ and $V(s)$ are equal if we set the reward obtained at each step to 1: each step is one transition, so counting the transitions made from $s$ to $g$ is the same as counting the accumulated reward. In this case, we can use temporal-difference learning to estimate the distance function. For example, the Sarsa update (Sutton et al., 1998) is used by the critic to estimate the action-value function in the on-policy deterministic actor-critic algorithm,
$$\delta_t = r_t + \gamma Q^w(s_{t+1}, a_{t+1}) - Q^w(s_t, a_t), \tag{3}$$
and the Q-learning update is used by the critic to estimate the action-value function in the off-policy deterministic actor-critic algorithm,
$$\delta_t = r_t + \gamma Q^w(s_{t+1}, \mu_\theta(s_{t+1})) - Q^w(s_t, a_t). \tag{4}$$
Therefore, we only need to replace the reward $r_t$ with the one-step distance $d(s_t, s_{t+1})$ to evaluate the distance function, whether on-policy or off-policy. We define $d(s_t, s_{t+1})$ as the number of transitions from $s_t$ to $s_{t+1}$,
$$\delta_t = d(s_t, s_{t+1}) + D^w(s_{t+1}, g) - D^w(s_t, g), \qquad d(s_t, s_{t+1}) = 1. \tag{5}$$
One important problem remains: $d(s_t, s_{t+1})$ is always positive, so as the iteration progresses the distance estimate may grow without bound and drift away from the correct value. We therefore need a fixed value as the distance benchmark between identical states. Let $s_g$ denote the state reached on arriving at the goal $g$; its number of transitions is 0,
$$D(s_g, g) = 0. \tag{6}$$
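As a small illustration of estimating a distance by temporal-difference updates with a fixed zero distance at the goal, the sketch below uses a hypothetical six-state line world and a tabular estimate `D[s, g]`. The environment, the names `step` and `estimate_distance`, and the use of the best successor as the bootstrap target are assumptions made for the example; the method in this paper uses a learned function approximator instead of a table.

```python
import numpy as np

N_STATES = 6          # states 0..5 on a line (hypothetical toy environment)
ACTIONS = (-1, +1)    # move left or move right


def step(s: int, a: int) -> int:
    """Deterministic transition; moving off either end leaves the state unchanged."""
    return int(np.clip(s + a, 0, N_STATES - 1))


def estimate_distance(alpha: float = 0.5, sweeps: int = 200, seed: int = 0) -> np.ndarray:
    """Tabular TD estimate of the number of transitions from every s to every g."""
    rng = np.random.default_rng(seed)
    D = np.zeros((N_STATES, N_STATES))   # D[s, g]
    for _ in range(sweeps * N_STATES * N_STATES):
        s, g = rng.integers(N_STATES, size=2)
        if s == g:
            D[s, g] = 0.0   # fixed benchmark: the distance between identical states is 0
            continue
        # Every transition costs 1; bootstrap from the closest successor state.
        target = 1.0 + min(D[step(s, a), g] for a in ACTIONS)
        D[s, g] += alpha * (target - D[s, g])   # move the estimate along the TD error
    return D


D = estimate_distance()
```

In this toy world the estimates settle at `|s - g|`. Without the fixed zero at `s == g`, every target would be one larger than some other estimate and the table would keep growing, which is the drift described above.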
3.2 Gradients of Distance Policies
Policy gradient algorithms are often used in continuous action spaces, where greedy policy improvement would require a global maximisation at every step. In deterministic policy algorithms, a simple and efficient alternative to global maximisation is to improve the policy along the gradient of the action-value function $Q^\mu(s, a)$. For each state $s$, the policy parameters $\theta$ are updated in the direction of the gradient $\nabla_\theta Q^\mu(s, \mu_\theta(s))$. However, the distance function $D(s, g)$ does not take the action as an input, so it cannot directly provide a gradient for updating $\theta$ in the way that the action-value function $Q^\mu(s, a)$ does. We therefore propose a new method for policy improvement.
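We define a deterministic policy $a = \mu_\theta(s)$ and a deterministic model $s' = f(s, a)$. The deterministic Bellman equation for the action-value function is $Q^\mu(s, a) = r(s, a) + \gamma Q^\mu(s', \mu_\theta(s'))$. The relationship between the action-value function and the value function is therefore
$$Q^\mu(s, a) = r(s, a) + \gamma V^\mu(f(s, a)). \tag{7}$$
Using the value function instead of the action-value function for policy improvement gives
$$\nabla_\theta Q^\mu(s, \mu_\theta(s)) = \nabla_\theta \mu_\theta(s)\, \nabla_a \left[ r(s, a) + \gamma V^\mu(f(s, a)) \right]_{a = \mu_\theta(s)}. \tag{8}$$
Although the value function itself cannot provide a gradient for policy improvement, the policy is improved indirectly through its relationship with the action-value function. The distance Bellman equation has the form $D(s, g) = d(s, s') + D(s', g)$, where $d(s, s')$ is the distance, that is, the number of transitions, between $s$ and $s'$. The agent aims to obtain a policy that minimizes the distance between the next state and the goal. By analogy with Equations 7 and 8, we have
$$D(s, g) = d(s, s') + D(f(s, a), g),$$
$$\nabla_\theta D(f(s, \mu_\theta(s)), g) = \nabla_\theta \mu_\theta(s)\, \nabla_a f(s, a)\, \nabla_{s'} D(s', g) \Big|_{a = \mu_\theta(s),\; s' = f(s, a)}.$$
The policy improvement method is thus to follow this gradient so as to minimize the distance function. The improved policy outputs an action that makes the distance from the next state to the goal sufficiently small, so that the agent finally reaches $g$.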
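The following sketch illustrates policy improvement by descending a distance gradient through a deterministic model, using the chain rule (policy gradient, then model Jacobian, then distance gradient with respect to the next state). Everything concrete in it is an assumption chosen so that the gradients are available in closed form: the model `s' = s + a`, the squared Euclidean distance used as a differentiable stand-in for a learned distance function, and a goal-conditioned policy with a single scalar parameter `theta`.

```python
import numpy as np


def model(s: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Assumed deterministic model s' = f(s, a): the action is a displacement."""
    return s + a


def distance(s: np.ndarray, g: np.ndarray) -> float:
    """Assumed differentiable stand-in for the distance function: ||s - g||^2."""
    return float(np.sum((s - g) ** 2))


def grad_distance_wrt_state(s: np.ndarray, g: np.ndarray) -> np.ndarray:
    return 2.0 * (s - g)


def policy(theta: float, s: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Goal-conditioned deterministic policy a = theta * (g - s)."""
    return theta * (g - s)


def distance_gradient(theta: float, s: np.ndarray, g: np.ndarray) -> float:
    """Chain rule: dD/dtheta = (da/dtheta) . (ds'/da) . (dD/ds')."""
    a = policy(theta, s, g)
    s_next = model(s, a)
    da_dtheta = g - s                    # gradient of the policy w.r.t. theta
    ds_next_da = np.eye(len(s))          # Jacobian of the model w.r.t. the action
    dD_ds_next = grad_distance_wrt_state(s_next, g)
    return float(da_dtheta @ ds_next_da @ dD_ds_next)


rng = np.random.default_rng(0)
theta, lr = 0.0, 0.05
for _ in range(200):
    s, g = rng.uniform(-1, 1, size=2), rng.uniform(-1, 1, size=2)
    theta -= lr * distance_gradient(theta, s, g)   # descend to reduce D(s', g)
```

In this toy setting the gain converges to 1, the value at which a single action carries the agent exactly onto the goal. No action-value function appears anywhere: the action is judged only through the distance from the resulting next state to the goal.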
We summarize the Goal Distance Gradient (GDG) reinforcement learning algorithm in Algorithm 1. The main idea is to make the agent aware of distances throughout the environment, that is, to estimate the number of transitions required to reach different states. Given a goal, the agent estimates the number of transitions needed to reach it and is then trained with the distance function to move in the direction that decreases this number, so as to reach the goal. At the beginning of each episode, we decide with a fixed probability whether to search for a bridge point that connects the start to the goal and brings them closer together. Then, as in standard algorithms, we use a simple exploration policy to collect data and store it in the replay buffer. Finally, we train the model on samples drawn at random from the collected data and fine-tune its parameters.
We summarize the method of searching for bridge points during training in Algorithm 2. In complex environments, it is often difficult to collect good experience for learning because exploration is insufficient. Instead of insisting on the direct route, we therefore look for an intermediate point that reduces the complexity of the problem, thereby increasing the probability of exploring and collecting good experience.
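One plausible way to select a bridge point from a learned distance function is sketched below; it is not a transcription of Algorithm 2. The function `find_bridge_point`, the candidate set, the threshold `max_leg` on each leg (in the spirit of discarding long, unreliable distance estimates), and the stand-in `toy_distance` are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float]


def find_bridge_point(
    start: Point,
    goal: Point,
    candidates: Sequence[Point],
    dist: Callable[[Point, Point], float],
    max_leg: float,
) -> Optional[Point]:
    """Return the candidate b minimising dist(start, b) + dist(b, goal), or None."""
    best, best_cost = None, float("inf")
    for b in candidates:
        d1, d2 = dist(start, b), dist(b, goal)
        if d1 > max_leg or d2 > max_leg:
            continue   # skip candidates with a leg that is too long to trust
        if d1 + d2 < best_cost:
            best, best_cost = b, d1 + d2
    return best


def toy_distance(a: Point, b: Point) -> float:
    """Stand-in for a learned distance: Manhattan distance, with a large penalty when
    a wall at x = 2 (open only for y >= 3) separates the two points."""
    manhattan = abs(a[0] - b[0]) + abs(a[1] - b[1])
    opposite_sides = (a[0] - 2) * (b[0] - 2) < 0
    below_gap = a[1] < 3 and b[1] < 3
    return manhattan + (100.0 if opposite_sides and below_gap else 0.0)


bridge = find_bridge_point(
    start=(0, 0), goal=(4, 0),
    candidates=[(1, 0), (2, 3), (3, 0)],
    dist=toy_distance, max_leg=20.0,
)
```

Here the two candidates on the straight line are rejected because one of their legs crosses the wall, and the point in the gap, `(2, 3)`, is chosen even though it lies away from the direct route.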
Our experiments were designed to address and answer the following questions:
How does distance compare with reward in terms of its effect on policy improvement?
How does bridging planning apply to our approach? Can GDG and bridging planning complement each other?
In a complex high-dimensional continuous environment, can our method accomplish the task? How does our method perform compared to other prior algorithms?
For the first question, we verify that our method can improve the policy using the 7-DOF Fetch robotic arm (Andrychowicz et al., 2017), for which the actual distance between the start and the goal equals the Euclidean distance. We compare our method with prior methods on this robot simulation task, which runs on the MuJoCo physics engine. For the second question, we apply our method to a simple 2D map in which the agent must be guided to the goal. We use the distance function to search for a bridge point connecting the start and the goal and to estimate the distances between them, so as to help the agent reach the target faster. For the last question, we build a more complex high-dimensional environment based on the previous 2D map model to evaluate our method and compare it with other methods.
In this paper, we focus only on the distance between states rather than on rewards. None of the environments therefore returns a reward; the only feedback is the current state, the goal, and whether the agent has reached the goal.
As shown in Figure 1-a, the 7-DOF robotic arm consists of seven joints. The state of the arm includes the velocities and angles of the joints. The initial position of the arm and the goal are chosen at random in space. We use the coordinates of the arm position and of the goal point to compute the Euclidean distance, which serves as the distance we need. This lets us test whether our method can improve the policy when the distance is known. The task is to control the arm so that it reaches the goal as quickly as possible.
Figures 1-b and 1-c show 2D maps of a simulated urban road model. The state of the 2D map is the coordinates of the agent in the environment, and the state space consists of the coordinates that the agent can reach. All obstacles in Figures 1-b and 1-c are inaccessible to the agent. When a navigation task is initialized, the start and the goal are chosen at random in free space, subject to the distance between them being sufficiently large. The task is to guide the agent from the start to the goal while avoiding obstacles. When the agent arrives near the goal, the environment reports that the goal has been reached. If an action would cause the agent to touch an obstacle, the agent remains in its current state.
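The transition rule of such a map can be sketched with a small grid world. The class `GridMap`, the discretisation into integer cells, the four-action set, and the tolerance `tol` for "near the goal" are assumptions made for this illustration; the maps used in the experiments may differ in all of these respects.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]
# Assumed action set: 0 = up, 1 = down, 2 = right, 3 = left.
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}


class GridMap:
    """Reward-free map: feedback is only the next state and a goal-reached flag."""

    def __init__(self, width: int, height: int, obstacles: Set[Cell], goal: Cell, tol: int = 0):
        self.width, self.height = width, height
        self.obstacles, self.goal, self.tol = obstacles, goal, tol

    def step(self, state: Cell, action: int) -> Tuple[Cell, bool]:
        dx, dy = MOVES[action]
        nxt = (state[0] + dx, state[1] + dy)
        inside = 0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height
        if not inside or nxt in self.obstacles:
            nxt = state   # a blocked move leaves the agent where it was
        near_goal = abs(nxt[0] - self.goal[0]) + abs(nxt[1] - self.goal[1]) <= self.tol
        return nxt, near_goal


env = GridMap(width=5, height=5, obstacles={(1, 0)}, goal=(2, 0))
blocked_result = env.step((0, 0), 2)   # moving right would enter the obstacle
free_result = env.step((0, 0), 0)      # moving up is allowed
goal_result = env.step((2, 1), 1)      # moving down lands on the goal
```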
4.2 How does it compare with the default algorithm
DDPG (Lillicrap et al., 2016) is a classic deterministic-policy reinforcement learning algorithm. The key to DDPG is to find an action that, once executed, leads to a state with as large a reward as possible. The key to our algorithm is to find an action that reaches the goal fastest. If the reward of each step is set to -1, the value function in DDPG and the distance function in our method have the same meaning. $D(s, g)$ represents the minimum number of transition steps required from state $s$ to state $g$, and $V(s)$ represents the maximum reward obtainable in the future from state $s$. If the combination of start and goal is regarded as the state, the value function takes the form $V(s \,\|\, g)$, where $\|$ denotes concatenation. In this case, the only difference between our method and DDPG is how an ideal action is found.
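The correspondence stated above can be written out as a short worked equation. The sketch assumes a deterministic environment, no discounting, a reward of -1 for every transition until the goal is reached, and a hypothetical symbol $T_\mu(s, g)$ for the number of transitions that a policy $\mu$ needs to reach $g$ from $s$:

```latex
V^{*}(s \,\|\, g)
  = \max_{\mu} \sum_{t=1}^{T_\mu(s, g)} (-1)
  = -\min_{\mu} T_\mu(s, g)
  = -D(s, g)
```

Under these assumptions, maximising the value function and minimising the distance function select the same actions, so the two methods differ only in how that action is found.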
To exclude the influence of other factors, we chose the 7-DOF Fetch robotic arm environment (Andrychowicz et al., 2017), in which the distance from the start to the goal is computable. The states of the start and the goal are represented by coordinates in Euclidean space, so the Euclidean distance can be calculated directly.
In this experiment, we consider the distance from the state to the goal in the two-argument form $D(s, g)$. We can therefore directly use the forms $D(s, g)$ and $V(s \,\|\, g)$ to represent the distance function in our method and the value function in DDPG (Lillicrap et al., 2016). Both our method and DDPG are based on the actor-critic architecture (Bhatnagar et al., 2007). Because the distance is computable, the critic is effectively known, which lets us observe the performance of the actor in isolation. The purpose of this experiment is to verify that our method is as effective as DDPG in finding the optimal action. In theory, the performance of our method should match that of DDPG in this experiment; if the results confirm this, our method is feasible and effective. Fig. 2 shows that both methods accomplish this task easily and that their convergence behaviour and final results are consistent. The results show that using distance to improve policies is feasible and effective.
4.3 Application of bridge point in GDG algorithm
This experiment illustrates that accurate distance estimates are crucial to our algorithm's success. Eysenbach et al. (2019) define the distance between two states as the expected number of steps needed to reach one from the other under the optimal policy. However, this approach is not suitable for all environments: in most real-world environments, the optimal distance from $s$ to $g$ cannot be estimated in advance. We therefore use our distance function in place of the distance used in SoRB (Eysenbach et al., 2019) to estimate the distance from $s$ to $g$, which helps the search for a bridge point that connects $s$ and $g$ and thus makes the navigation task easier to complete.
During training, it is difficult for agents to learn how to get around obstacles and reach the target. As in Figure 2-a, the agent usually moves directly toward the goal and hits an obstacle. The wider the obstacle, the harder it is to reach its opposite side, because the agent must first move away from a goal that appears to be close before it can reach it, and moving away from the goal is very difficult for the agent. If we can find a point $B$ that connects the start and the goal like a bridge, we can guide the agent to reach $B$ first and then to go from $B$ to the goal. As shown in Figure 2-b, we find a bridge point $B$ that the agent can reach, and Figure 2-c shows that the agent can also reach the goal from $B$. Figure 2-d shows that, with the help of the bridge point $B$, the agent can complete the task from the start to the goal. In practice, however, more than one bridge point may be needed to connect the start and the goal.
4.4 Performance comparison in complex environments
We now test the performance of our method in a more complex high-dimensional environment, illustrated in Figure 1-c. We use a similar construction to build an urban street environment with more obstacles, more complex paths, and a larger extent. Compared with the simple 2D map, the maximum distance (in steps) from the start to the goal increases from 120 to 240, and the agent must bypass more obstacles. We found it difficult to get agents to go around obstacles and turn. In the FourRooms environment, the agent needs to bypass at most two obstacles to reach the goal, whereas in the Urban environment it may need to bypass up to nine obstacles to reach the furthest goal.
A comparison with SoRB (Eysenbach et al., 2019) would be natural in this experiment, but the distance SoRB uses is obtained directly from the environment in advance, whereas the distance used in our method is learned from the environment. We therefore cannot compare our method with SoRB in this experiment. In this environment, we evaluate four methods: Goal-Distance Gradient, Goal-Distance Gradient with Bridging Planning, Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and a stochastic method. We use the same goal sampling distribution for every method.
During training, each method was tested 200 times after every 20,000 training episodes, and the average success rate was computed. The results in Figure 4 show that the success rate of DDPG in the late training period stays at about 0.2 and does not increase further. The success rate of our method also stops increasing in the late training period, but it stays at around 0.3. The success rate of GDG with bridging planning rises in waves and finally settles at 0.8.
We calculated the average success rate of each method in three environments. For each distance in each environment, we randomly generated 50 start-goal pairs and recorded whether each attempt succeeded. An attempt is recorded as a success if the goal is reached within 500 steps and as a failure otherwise. We repeated each experiment with five different random seeds. In Figure 5, the solid lines show the average over the five runs, and the translucent areas show the upper and lower limits of the five results. GDG with bridging planning maintains a relatively high success rate at longer distances, while the other methods complete tasks only at short distances.
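The aggregation described above (mean over seeds as the solid line, minimum and maximum over seeds as the band) can be sketched as follows. The function name `success_curve` and the array layout are assumptions, and the small `outcomes` array contains placeholder values for illustration only, not experimental results.

```python
import numpy as np


def success_curve(outcomes: np.ndarray):
    """Aggregate success flags of shape (n_seeds, n_distances, n_trials).

    Returns the mean success rate over seeds and the lower and upper limits
    over seeds, each as an array of length n_distances.
    """
    per_seed = outcomes.mean(axis=2)   # success rate for each seed and distance
    return per_seed.mean(axis=0), per_seed.min(axis=0), per_seed.max(axis=0)


# Placeholder flags: 2 seeds, 2 distances, 4 start-goal pairs each.
outcomes = np.array(
    [[[1, 1, 1, 0], [1, 0, 0, 0]],
     [[1, 1, 1, 1], [0, 0, 0, 0]]],
    dtype=bool,
)
mean_rate, lower, upper = success_curve(outcomes)
```

With five seeds and 50 start-goal pairs per distance, the same function would take an array of shape `(5, n_distances, 50)`, where each flag records whether the goal was reached within 500 steps.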
Figure 6 below shows the navigation results of our method at different distances between the starting point and the target in an urban environment.
We propose a method based on the actor-critic model that uses distance in place of the rewards of general reinforcement learning in order to solve problems in sparse-reward environments. To this end, we modify the existing policy gradient and propose a goal distance gradient. By comparison with the DDPG framework, we verified the feasibility and effectiveness of both the actor, which represents the policy function, and the critic, which represents the evaluation function. Because our method can dynamically evaluate the distance between any two states, it can be applied, unlike SoRB, to environments in which the distance between states cannot be evaluated directly. Moreover, our method can reach farther goals in more complex environments than SoRB (Eysenbach et al., 2019). Our method can also be combined with techniques such as TD3 (Fujimoto et al., 2018) to reduce overestimation.
5 Discussion and Future Work
Many reinforcement learning methods can solve complex problems in high-dimensional environments, but current methods need sufficient data and training to accomplish specific tasks. Once the environment changes or new, untrained tasks are added, previously obtained models no longer apply. Only by solving these problems can reinforcement learning be applied to more real-world environments instead of remaining in the simulator. We must therefore let our agents learn to analyze problems, not just judge on the basis of past experience. Any complex problem consists of many simple problems. For example, building a house with bricks amounts to placing each brick at a specified position in order. Our approach gives agents a way to analyze problems and split a complex problem into multiple simple ones. In our method, each start and goal are connected by multiple bridge points. Reaching each bridge point can be regarded as a simple task, and all bridge points together constitute a complex task. The key is to use what has been learned on simple tasks to find each bridge point in the complex task. The method proposed in this paper provides a theoretical approach to these problems.
- Andrychowicz et al.  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
- Bhatnagar et al.  Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural actor-critic algorithms. pages 105–112, 2007.
- Degris et al. [2012a] Thomas Degris, Patrick M Pilarski, and Richard S Sutton. Model-free reinforcement learning with continuous action in practice. In American Control Conference, 2012a.
- Degris et al. [2012b] Thomas Degris, Martha White, and Richard S. Sutton. Linear off-policy actor-critic. In International Conference on Machine Learning, 2012b.
- Eysenbach et al.  Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. arXiv preprint arXiv:1906.05253, 2019.
- Fujimoto et al.  Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv: Artificial Intelligence, 2018.
- Guo et al.  Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep learning for reward design to improve monte carlo tree search in atari games. arXiv preprint arXiv:1604.07095, 2016.
- Levine et al.  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Lillicrap et al.  Timothy Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. 2016.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Nair et al.  Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.
- Peters and Schaal  Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180–1190, 2008.
- Ponomarenko et al.  Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, et al. Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
- Schaul et al.  Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
- Silver et al.  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Sutton et al.  Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
- Sutton et al.  Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Zhang et al.  Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In