TTR-Based Rewards for Reinforcement Learning with Implicit Model Priors

03/23/2019 ∙ by Xubo Lyu, et al. ∙ Simon Fraser University

Model-free reinforcement learning (RL) provides an attractive approach for learning control policies directly in high-dimensional state spaces. However, many goal-oriented tasks involving sparse rewards remain difficult to solve with state-of-the-art model-free RL algorithms, even in simulation. One of the key difficulties is that deep RL, due to its relatively poor sample complexity, often requires a prohibitive number of trials to obtain a learning signal. We propose a novel, non-sparse reward function for robotic RL tasks by leveraging physical priors in the form of a time-to-reach (TTR) function computed from an approximate system dynamics model. TTR functions come from the field of optimal control and measure the minimal time required to move from any state to the goal. TTR functions are intractable to compute for complex systems, so we compute the TTR function in a lower-dimensional state space and then apply a simple transformation to convert it into a TTR-based reward function for the MDP in RL tasks. Our TTR-based reward function provides highly informative rewards that account for system dynamics.


I Introduction

Reinforcement learning (RL) is a principled mathematical framework for experience-driven autonomous learning and a powerful tool for solving sequential decision-making problems by training agents to maximize long-term accumulated reward. Generally, RL methods can be divided into two main streams, model-based and model-free, depending on whether the inner working principles (the model) relating agent and environment are directly employed in the learning process [1]. In model-based methods, the optimal policy is derived from internal simulations of a forward model that represents the robot's dynamics. This significantly reduces the number of trials needed for learning and leads to fast convergence to the optimal policy. On the other hand, the main disadvantage of model-based methods is that the algorithms depend heavily on the model's ability to accurately represent the transition dynamics.

The downsides of model-based methods are somewhat alleviated by model-free methods, which do not require a prior model; instead, optimal policies are derived by trial and error. For example, model-free methods allow learning policies that map observations directly to actions, a robotic task that is difficult for traditional control-theoretic tools and model-based methods to solve [2]. Although model-free methods avoid the issue of inaccurate dynamics representation, they often require an impractically large number of trials. High sample complexity is a key challenge of model-free RL and a fundamental barrier impeding its adoption in real-world settings, especially in robotics.

Fig. 1: Illustration of the TTR-based reward in a goal-oriented RL task

Two main causes of high sample complexity are the use of naive exploration strategies and delayed, sparse rewards. Naive exploration methods such as $\epsilon$-greedy simply choose, with high probability, the action the agent currently expects to yield the greatest reward at each timestep. Though the theoretical RL literature offers a variety of provably-efficient approaches to deep exploration [3], most of these are designed for Markov decision processes (MDPs) with small finite state spaces, while others require solving computationally intractable planning tasks [4]. These algorithms are not practical in complex environments with high-dimensional states. On the other hand, a delayed, sparse reward means the agent receives little feedback even after long periods of exploration, which significantly slows down the learning process. Various techniques have been proposed to overcome these challenges. Hierarchical reinforcement learning (HRL) decomposes a task into sub-tasks and learns to solve each sub-task individually. Curriculum learning [5] offers another attractive strategy by training the agent on simpler tasks first and increasingly difficult tasks afterwards. While this idea works successfully on a variety of systems [6], it struggles on highly dynamic or unstable systems, where it is not easy to determine reasonable curricula for each learning stage.

The combination of model-free and model-based approaches has attracted increasing research interest and may pave the way for overcoming the challenge of high sample complexity while maintaining the advantages of model-free RL. Feinberg et al. [7] proposed model-based value expansion (MVE), which controls uncertainty in the model by allowing imagination only to a fixed depth, thus improving Q-learning by providing higher-quality target values for training. BaRC [8] leverages physical priors in the form of an approximate system dynamics model to design a curriculum for a model-free policy optimization algorithm. In BaRC, high-dimensional input, including sensor data as well as motion states, is used to learn control policies for robotic systems.

In this paper, we also apply the high-level idea of combining model-based and model-free methods and propose Time-To-Reach (TTR) reward shaping. The TTR function is traditionally used in the field of optimal control and is defined as a function of the robot's state: it represents the minimum arrival time to the goal from any state under specific dynamics constraints. We adopt the TTR function as the reward in RL tasks because minimal arrival time is a far more reasonable measure of "how good" a state is. In our method, the TTR function is computed by implicitly leveraging an approximate physical model without altering the flexibility of model-free RL. The TTR function is hard to compute for high-dimensional systems, which is why we use an approximate, lower-dimensional model of the robot to compute it. As will be demonstrated in our experiments, the approximate physical model is sufficient for guiding policy learning, because highly accurate reward signals are normally not required in reinforcement learning.

Our method has distinct features for addressing the sample complexity issue. Unlike curriculum learning, there is no need to design curricula in TTR-based reward shaping, because the learning process depends not on curricula but on a globally shaped reward. Moreover, the TTR function only needs to be computed once for the entire state space. The advantage of our approach is that the TTR function captures an essential aspect of a task: the TTR-based reward measures "how good" a state is by the minimal arrival time of a simplified system model. In addition, only an approximate TTR function is required, so a low-fidelity approximate physical model can be used to compute it without affecting subsequent learning. Thus our method acts as a wrapper that augments any model-free RL algorithm and is capable of working on highly dynamic systems in which the forward and backward time dynamics differ vastly.

By using prior knowledge of the system through approximate dynamics, our approach drastically reduces sample complexity while still maintaining the potential of model-free methods to learn high-quality policies from complicated inputs.

We demonstrate our approach on a generic car environment and a planar quadrotor environment, and observe substantial improvements in sample efficiency compared to sparse or distance reward settings.

Contributions: We propose a TTR-based reward function that leverages physical model priors to efficiently guide policy learning in model-free RL tasks. In our method, an approximate system model with a lower-dimensional state space is selected first. Once an approximate model is chosen, very little hand-tuning is needed to compute the TTR function. After it is computed, we perform a simple transformation to convert the TTR function of the approximate model into a TTR-based reward function for the MDP, in order to guide policy learning in RL tasks. Moreover, our approach can be incorporated into any model-free algorithm for solving robotic tasks with different system dynamics in a highly flexible way. In particular, by incorporating system dynamics in an implicit manner compatible with deep RL, we retain the ability to learn policies that map robotic sensor inputs directly to actions. Thus, our approach represents a bridge between traditional analytical optimal control approaches and modern data-driven RL, and inherits the benefits of both. We evaluate our approach on two common robotic tasks and obtain significant improvements in learning efficiency and performance.

II Related Work

Integration of RL and neural networks has a long history [9, 10]. With the recent achievements of deep learning [11], deep neural networks have become prevalent in RL across areas such as games, robotics, and NLP. The deep Q-network, developed to play a range of Atari 2600 video games at a superhuman level directly from image pixels, addresses the instability of function approximation in value-based RL methods [12]. Policy-based methods, including guided policy search (GPS) [13], trust region policy optimization (TRPO) [14], and proximal policy optimization (PPO) [15], aim to directly find policies by means of gradient-free or gradient-based optimization. In this work, we evaluate our reward shaping method on two robotic tasks with a policy-based model-free learning algorithm and demonstrate competitive performance.

Exploration inefficiency, as the root cause of high sample complexity, remains a major challenge in RL [16]. "Deep exploration" strategies address this problem by maintaining a distribution over possible value functions and bootstrapping from it with random initialization [16]. Curriculum-based approaches pave another possible path by first training on easier sub-problems and increasing the difficulty of sub-problems as training progresses, until the original task becomes appropriate to train on directly [5]. Hierarchical RL frameworks try to improve exploration efficiency by building modules at different levels, where a top-level module learns a policy over options (subgoals) and bottom-level modules learn policies to accomplish the objective of each option [17, 18, 19].

Our work is most closely related to reward shaping, a technique that provides additional localized feedback based on prior knowledge to guide the learning process [20]. Reward shaping has proven to be a simple but powerful method for reducing sample complexity and speeding up many RL tasks. Randløv and Alstrøm [21] applied a shaping function to learn a control policy for riding a bicycle to a goal; however, the function requires significant engineering effort to hand-tune. Ng, Harada, and Russell [22] introduced potential-based reward shaping (PBRS), proving the conditions an arbitrary, externally specified shaping function must satisfy in order to be included in an RL system without modifying its optimal policy. Furthermore, Wiewiora et al. [23] proposed the potential-based advice (PBA) framework, which allows the potential function to also depend on actions while still satisfying the PBRS conditions.

Inverse reinforcement learning (IRL) can be used to learn a reward function from expert demonstrations, thereby avoiding manual specification [24, 25]. Recently, incorporating prior model knowledge into learning has seen some success. [8] applies physical model dynamics as prior knowledge to generate stable curricula that accelerate the learning process. [26] uses full transition information as a prior belief to provide shaping rewards directly, without learning through interactions with the environment. Our work is inspired by the high-level idea of utilizing physical system dynamics to improve deep exploration and thus speed up the learning process. Specifically, we leverage model dynamics as a prior to provide shaping rewards directly, without additional engineering effort or interaction with the environment.

Time-to-reach (TTR) problems are an important class of optimal control problems and involve finding the minimum time it takes a system to reach a goal starting from any initial state. The common approach to the TTR problem is to solve a Hamilton-Jacobi (HJ) partial differential equation (PDE). Normally, the computational cost is prohibitive for systems with high-dimensional state. However, model simplification and system decomposition techniques have made significant progress in alleviating the computational burden in a variety of problem setups [27, 28, 29]. As we will demonstrate, approximating the TTR function by simplifying the model dynamics or applying system decomposition is sufficient to generate an effective reward map for guiding policy learning.

III Preliminaries

The objective of our work is to implicitly utilize the system dynamics as a prior to automatically generate a dynamics-informed reward function that accelerates model-free RL.

III-A Markov Decision Processes

Consider a Markov decision process (MDP) whose reward function has finite variance, defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a finite set of actions. $\mathcal{P}(s' \mid s, a)$ gives the transition probabilities for transitioning to state $s'$ conditional on the agent being in state $s$ and taking action $a$. $R(s)$ is the deterministic reward function, $\rho_0$ is the start state distribution, and $\gamma$ is the discount rate.

In the context of model-free deep RL (DRL), we use a neural network to represent a policy $\pi_\theta(a \mid s)$, describing the probability of taking action $a$ conditional on being in state $s$, where $\theta$ denotes the parameters of the neural network. The policy can be either deterministic or stochastic. The objective of an RL agent is to find an optimal policy $\pi^*$ that maximizes the expected return $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^t R(s_t)\right]$, the expected cumulative reward starting from states $s_0 \sim \rho_0$. Here $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a complete trajectory with $s_0 \sim \rho_0$ and $a_t \sim \pi_\theta(\cdot \mid s_t)$.
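As an illustrative aside (not part of the original formulation), the expected return $J(\pi)$ can be estimated by averaging discounted returns over sampled trajectories; `sample_trajectory` below is a hypothetical stand-in for rolling out the current policy.

```python
import numpy as np

# Minimal sketch: Monte-Carlo estimate of the expected return J(pi).
# `sample_trajectory` is a hypothetical callable that rolls out the current
# policy once and returns the list of per-step rewards.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    discounts = np.power(gamma, np.arange(len(rewards)))
    return float(np.dot(discounts, rewards))

def estimate_return(sample_trajectory, num_trajectories=100, gamma=0.99):
    """Average the discounted return over several sampled trajectories."""
    returns = [discounted_return(sample_trajectory(), gamma)
               for _ in range(num_trajectories)]
    return float(np.mean(returns))
```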

III-B Model-Free Optimization

Policy-based algorithms coupled with nonlinear function approximators have become prevalent in model-free RL optimization. Policy-based methods have favorable convergence properties since they are optimized directly by gradients. Moreover, policy gradients are more effective in high-dimensional action spaces or when using continuous actions. For these reasons, the learning algorithm we apply in our tasks is proximal policy optimization (PPO) [15], which belongs to the category of policy-based methods. Policy-based methods do not need to maintain a value function, but directly search for an optimal policy $\pi^*$. Typically, a parameterized policy $\pi_\theta$ is chosen, whose parameters $\theta$ are updated to maximize the expected return using gradient-based optimization. Gradients provide a strong learning signal as to how to improve a parameterized policy. However, to compute the expected return we need to average over plausible trajectories induced by the current policy parameterization. Techniques such as REINFORCE [30] or the "re-parameterization trick" [31] allow neural networks to back-propagate through stochastic functions. Intuitively, gradient ascent using this estimator increases the log probability of the sampled action, weighted by the return. More formally, the REINFORCE rule computes the gradient of an expectation of a function $f$ of a random variable $x \sim p_\theta$ with respect to the parameters $\theta$:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right] \qquad (1)$$

Searching directly for a policy represented by a neural network with many parameters can be difficult, and the resulting policy may become stuck in local minima. Trust-region methods are commonly used to overcome this problem by restricting optimization steps to lie within a region where the approximation of the true cost function still holds. One of the newer algorithms in this line of work, trust region policy optimization (TRPO), has been shown to be relatively robust and applicable to domains with high-dimensional inputs [14]. However, the constrained optimization in TRPO requires calculating second-order gradients, limiting its applicability. The proximal policy optimization (PPO) algorithm performs unconstrained optimization, requiring only first-order gradient information while retaining the performance of TRPO. These characteristics have made PPO popular for a range of RL tasks [32, 15].
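For reference, the clipped surrogate objective maximized by PPO [15], with $\hat{A}_t$ an advantage estimate and $\epsilon$ the clipping parameter, is

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.$$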

III-C Approximate System Model and Time-to-Reach Functions

Given the following dynamical system in $\mathbb{R}^n$:

$$\dot{\tilde{s}} = \tilde{f}(\tilde{s}, \tilde{a}), \qquad (2)$$

where $\tilde{s} \in \mathbb{R}^n$ is the state of the approximate system model, $\tilde{a}$ is the control action, and $\tilde{\mathcal{A}}$ is the feasible set of controls. The TTR problem involves finding the minimum time it takes to reach a goal set $\Gamma$ from any initial state, subject to the system dynamics in (2). We assume that $\tilde{f}$ is Lipschitz continuous. Under these assumptions, the dynamical system has a unique solution.

Mathematically, the time it takes to reach the goal $\Gamma$ using a control policy $\tilde{\pi}$ is

$$T^{\tilde{\pi}}(\tilde{s}_0) = \min\{\, t \ge 0 : \tilde{s}(t) \in \Gamma \,\}, \quad \text{where } \dot{\tilde{s}} = \tilde{f}(\tilde{s}, \tilde{\pi}(\tilde{s})),\ \tilde{s}(0) = \tilde{s}_0, \qquad (3)$$

and our objective is to compute the TTR function, defined as follows:

$$\phi(\tilde{s}_0) = \min_{\tilde{\pi}} T^{\tilde{\pi}}(\tilde{s}_0). \qquad (4)$$

Through the dynamic programming principle, we can obtain $\phi$ by solving the following stationary HJ PDE:

$$\max_{\tilde{a} \in \tilde{\mathcal{A}}} \left\{ -\nabla \phi(\tilde{s}) \cdot \tilde{f}(\tilde{s}, \tilde{a}) \right\} = 1, \qquad (5)$$
$$\phi(\tilde{s}) = 0 \quad \forall \tilde{s} \in \Gamma. \qquad (6)$$

Detailed derivations and discussions are presented in [33, 34]. Well-studied numerical techniques [35, 27, 28] based on level set methods have been developed to solve (5).
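To make the computation concrete, the following is a minimal sketch (not the level-set toolbox of [27] used in this work) of approximating a TTR function by value iteration on a coarse grid for a constant-speed Dubins-car-style model; the grid bounds, goal set, speed, and control discretization are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's level-set solver):
# approximate the TTR function phi for a constant-speed Dubins car
#   x' = v cos(theta), y' = v sin(theta), theta' = omega
# by value iteration on a coarse grid with nearest-neighbor lookups.

v, dt = 1.0, 0.1                                 # assumed speed and time step
xs = np.linspace(-5.0, 5.0, 51)
ys = np.linspace(-5.0, 5.0, 51)
ths = np.linspace(-np.pi, np.pi, 36, endpoint=False)
omegas = np.array([-1.0, 0.0, 1.0])              # discretized turn-rate controls

X, Y, TH = np.meshgrid(xs, ys, ths, indexing="ij")
goal = np.hypot(X - 3.0, Y - 3.0) < 0.3          # hypothetical goal set Gamma

INF = 1e6
phi = np.where(goal, 0.0, INF)

def nn_index(grid, values):
    """Index of the nearest point on a uniform 1D grid for each value."""
    step = grid[1] - grid[0]
    return np.clip(np.round((values - grid[0]) / step).astype(int), 0, len(grid) - 1)

for _ in range(300):                             # fixed-point sweeps
    best = np.full(phi.shape, INF)
    for w in omegas:
        # One explicit-Euler step of the approximate dynamics under control w.
        xn = X + dt * v * np.cos(TH)
        yn = Y + dt * v * np.sin(TH)
        tn = (TH + dt * w + np.pi) % (2.0 * np.pi) - np.pi
        step_cost = dt + phi[nn_index(xs, xn), nn_index(ys, yn), nn_index(ths, tn)]
        best = np.minimum(best, step_cost)
    new_phi = np.where(goal, 0.0, np.minimum(phi, best))
    if np.allclose(new_phi, phi):
        break
    phi = new_phi
# phi now approximates the minimum time to reach Gamma from each grid state.
```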

IV Approach

In RL, an agent learns to optimize its behaviors in an unknown environment by executing control policies, experiencing the consequent rewards, and improving the policy based on those rewards. Without sufficiently strong reward signals, learning is very difficult. For many interesting problems, the state space is far too large to be exhaustively explored and the reward function is extremely sparse, leading to poor learning performance. Our method addresses this disparity based on the idea of reward shaping, a technique that provides localized feedback based on prior knowledge to guide the learning process. Intuitively, as with many forms of life, an agent (animal or human) learns faster when given localized feedback that indicates progress toward a target behavior.

In many goal-oriented RL tasks with sparse rewards, the desired behavior is to reach the goal as fast as possible. This requirement is naturally in accordance with the TTR-based reward, which offers a convenient form of localized feedback by indicating the minimal arrival time from any state to the goal region. Intuitively, a state with a lower arrival time is a better state because it is easier to reach the goal region from it, so it should be assigned a higher reward. In this way, we build a connection between the meaning of the TTR function and the reward signal, and obtain a non-sparse, continuous reward function. Note that unlike common shaping functions such as the distance to the goal region, the TTR-based reward function captures information about the state in all feasible dimensions in a dynamically-informed manner. By applying the TTR-based reward at each training timestep, we can dramatically improve learning performance without changing the model-free RL algorithm. The rest of this section describes our method in detail.

IV-1 Model Selection

"Model selection" here refers to picking a physical model in order to compute the TTR function. (Note that the meaning of "model selection" here is different from that in machine learning.) The physical model should be an ordinary differential equation (ODE) that describes the approximate system dynamics. For example, we choose (7) as our approximate model for the generic car RL task and (8) for the planar quadrotor RL task. Normally, the state domain of the approximate physical model is a subset of the full MDP state domain, so that the TTR function can be tractably computed. One feature of our method is that we do not have to know or select the exact MDP model followed during learning, since the TTR-based reward computed via the approximate physical model is sufficient to provide discriminative local feedback for guiding policy search. For example, the Gazebo simulator has a built-in physics engine serving as the MDP model, which we do not know exactly; the TTR-based reward can still guide policy learning even though it is computed from the approximate physical model chosen below.

Moreover, it is important to decide which states should be included in the approximate model. As demonstrated in our experiments, information about the pitch angle and angular velocity is very useful for describing the motion of the quadrotor. Therefore, if we include the angular state variables in the approximate model used to compute the TTR-based reward, the reward will guide the quadrotor toward policies that choose appropriate pitch angles in order to reach the goal region as quickly as possible.

$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega \qquad (7)$$

$$\begin{aligned}
\dot{x} &= v_x, & m\dot{v}_x &= -(T_1 + T_2)\sin\phi - C_D^v v_x, \\
\dot{z} &= v_z, & m\dot{v}_z &= (T_1 + T_2)\cos\phi - C_D^v v_z - mg, \\
\dot{\phi} &= \omega, & I_{yy}\dot{\omega} &= l(T_1 - T_2) - C_D^\phi\,\omega
\end{aligned} \qquad (8)$$

As an example, for the first experiment involving a simple car, we used the approximate model in (7), where the position $(x, y)$ and heading $\theta$ are the car's states and $v$ is the car's linear speed. The control variable is the angular velocity $\omega$.

For our second experiment involving a quadrotor, we used the approximate model in (8), with $\tilde{s} = (x, v_x, z, v_z, \phi, \omega)$ and $\tilde{a} = (T_1, T_2)$, where the state is given by the position in the vertical plane $(x, z)$, velocity $(v_x, v_z)$, pitch $\phi$, and pitch rate $\omega$. The control variables are the thrusts $T_1$ and $T_2$. In addition, the quadrotor has mass $m$, moment of inertia $I_{yy}$, and half-length $l$. Furthermore, $g$ denotes the acceleration due to gravity, $C_D^v$ the translational drag coefficient, and $C_D^\phi$ the rotational drag coefficient.
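As a small code sketch, the approximate planar quadrotor dynamics in (8) can be written as an ODE right-hand side, which is the form a TTR solver would consume; the numerical parameter values below are placeholders, not the simulator's.

```python
import numpy as np

# Sketch of the approximate planar-quadrotor ODE in (8); the parameter values
# below (mass, inertia, half-length, drag coefficients) are placeholders.

m, I_yy, l, g = 1.25, 0.03, 0.3, 9.81            # assumed physical constants
C_D_v, C_D_phi = 0.25, 0.02                      # assumed drag coefficients

def planar_quad_dynamics(state, thrusts):
    """state = (x, vx, z, vz, phi, w); thrusts = (T1, T2). Returns d(state)/dt."""
    x, vx, z, vz, phi, w = state
    T1, T2 = thrusts
    ax = (-(T1 + T2) * np.sin(phi) - C_D_v * vx) / m
    az = ((T1 + T2) * np.cos(phi) - C_D_v * vz) / m - g
    aphi = (l * (T1 - T2) - C_D_phi * w) / I_yy
    return np.array([vx, ax, vz, az, w, aphi])
```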

IV-2 TTR Function as Approximate Reward

Consider the real reward function in the formal MDP framework: $R(s)$ is the reward function that governs the reward signal at state $s$. Note that the state $s$ in the MDP may be high dimensional and is usually treated as the input of the model-free learning algorithm. Now we define $\tilde{R}(\tilde{s})$ as the approximate reward function, which only involves a subset of the full MDP state. Here $\tilde{s}$ refers to a subset of the full state and is the set of state variables in the approximate physical model chosen in (7) or (8). In our method, the TTR function $\phi(\tilde{s})$ defined in (4) is used as the approximate reward function:

$$\tilde{R}(\tilde{s}) = -\phi(\tilde{s}). \qquad (9)$$

By definition, $\phi(\tilde{s})$ is non-negative, and $\phi(\tilde{s}) = 0$ if and only if $\tilde{s} \in \Gamma$. We use $-\phi(\tilde{s})$ as the reward because a state with a shorter arrival time to the goal should be given a higher reward. The reasons for using the TTR function as the approximate reward in RL are three-fold: first, the reward function does not need to be highly accurate, so the TTR function is a good approximation as long as it captures key system behaviors, which is sufficient for guiding policy learning; second, the high-dimensional state in most RL tasks may include sensor data that is usually not directly relevant to the reward; third, the TTR function can be efficiently computed for a simplified system model that contains a subset of the MDP states.

To reduce the computational complexity of solving the PDE for complicated system dynamics (such as the quadrotor), we may apply system decomposition methods [27, 28, 29] to obtain an approximate TTR function without significantly impacting the overall policy training time. In particular, we first decompose the entire system into several sub-systems with overlapping state variables, so that each sub-system can be solved efficiently. Specifically, we use the open-source helperOC (https://github.com/HJReachability/helperOC) and Level Set Methods toolboxes [27] in MATLAB.
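As a concrete sketch of how the transformation in (9) can be wired into training, assume the TTR table has been exported from the toolbox onto a NumPy grid; the file name, grid bounds, and terminal reward constants below are placeholders.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Sketch of the TTR-to-reward transformation in (9). We assume the TTR table
# `phi` was exported onto an (x, y, theta) grid; the file name, grid bounds,
# and terminal reward constants are placeholders.

xs = np.linspace(-5.0, 5.0, 51)
ys = np.linspace(-5.0, 5.0, 51)
ths = np.linspace(-np.pi, np.pi, 36, endpoint=False)
phi = np.load("ttr_car_grid.npy")                # hypothetical file, shape (51, 51, 36)

ttr_interp = RegularGridInterpolator((xs, ys, ths), phi,
                                     bounds_error=False, fill_value=None)

GOAL_REWARD, COLLISION_PENALTY = 1000.0, -400.0  # placeholder terminal values

def ttr_reward(mdp_state, reached_goal, collided):
    """Per-step reward for the full MDP state: terminal values at goal/collision,
    otherwise the negative interpolated TTR of the subset used by (7)."""
    if reached_goal:
        return GOAL_REWARD
    if collided:
        return COLLISION_PENALTY
    x, y, theta = mdp_state[0], mdp_state[1], mdp_state[2]   # tilde-s subset
    return -float(ttr_interp(np.array([x, y, theta])))
```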

V Experiments

We perform simulation experiments on two dynamical systems, a generic car and a planar quadrotor, using Gazebo, a robot simulator with a robust physics engine [36]. We evaluate our reward shaping method on a variety of measurements relating to performance and efficiency, and apply PPO [15] as our model-free policy optimization method.

V-A Simple Car Model

A five-dimensional car model is used in our first experiment. Car models are widely used as standard test environments in motion planning [36] and RL [37]. The state of the car is represented as a vector with five elements, $s = (x, y, \theta, v, \omega)$, where $x$ and $y$ denote position in the plane, $\theta$ denotes heading angle, $v$ denotes linear velocity, and $\omega$ denotes angular velocity. The learning agent receives an observation that augments the 5D state with 8 range and bearing measurements, which may be obtained through a laser rangefinder. These measurements provide the distance and direction from the vehicle to the nearest obstacle, and the learning agent must learn to map the augmented state, consisting of the 5 internal states and the 8 measurements, directly to actions. The control input of this system is $a = (v, \omega)$, where $v$ is the linear velocity of the car and $\omega$ is the angular velocity. The car starts from some initial condition and aims to reach a goal region without colliding with obstacles. We set the goal state as a point in the upper-right corner of the environment. States within a small positional distance (in meters) and heading-angle tolerance (in radians) of the goal state are considered to have reached the goal.

Fig. 2: The average cumulative reward obtained with three different reward functions during training on the car model
Fig. 3: The success rate of reaching the goal obtained with three different reward functions during training on the car model
Fig. 4: The average cumulative reward obtained under the same sparse reward setting during evaluation on the car model

We apply three different reward functions to the same problem setting for comparison. The first is a sparse reward, where the agent obtains a positive reward at goal states, a negative reward when a collision happens, and no reward at any other state. The second is the commonly-used distance reward, which is based on the Euclidean distance from any position to the goal, while the agent obtains the same rewards at goal and collision states. The last is the TTR-based reward: the agent again obtains the same rewards at goal and collision states, but receives the negative TTR value as the reward at any other state. Note that the TTR value is computed based on the system dynamics we have selected, so at any state there is a corresponding dynamics-informed TTR-based reward indicating "how good" the current state is in terms of arrival time to the goal.
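For comparison, the two baseline reward functions can be sketched as follows; the goal position and terminal constants are placeholders, and we take the distance reward to be the negative Euclidean distance so that larger is better.

```python
import numpy as np

# Sketch of the two baseline reward functions used for comparison; the goal
# position and terminal constants are placeholders, not the paper's values.

GOAL_XY = np.array([3.0, 3.0])                   # hypothetical goal position
GOAL_REWARD, COLLISION_PENALTY = 1000.0, -400.0  # placeholder terminal values

def sparse_reward(state, reached_goal, collided):
    """Non-zero feedback only at goal or collision states."""
    if reached_goal:
        return GOAL_REWARD
    if collided:
        return COLLISION_PENALTY
    return 0.0

def distance_reward(state, reached_goal, collided):
    """Negative Euclidean distance to the goal position elsewhere."""
    if reached_goal:
        return GOAL_REWARD
    if collided:
        return COLLISION_PENALTY
    return -float(np.linalg.norm(np.asarray(state[:2], dtype=float) - GOAL_XY))
```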

Fig. 5: Performance visualization of the car in the Gazebo simulator. The picture is a composite of the same car at different time snapshots.

Figures 2-4 show the performance of the standard PPO algorithm [15] on the car model with the TTR-based reward, distance reward, and sparse reward. All statistics are based on five different runs, with the background shading representing the 95% confidence interval. Note that one algorithm iteration in Figure 3 contains 30 PPO iterations of Figure 2. Figure 3 indicates the success percentage of the agent reaching the goal versus training iterations of the outer loop, where the success percentage is measured as the fraction of episodes ending in the goal state out of the total number of episodes in one outer-loop iteration. As we can see, TTR-based reward shaping achieves fast convergence and a high success rate across all five runs in under ten algorithm iterations. In contrast, with the sparse reward the agent hardly learns any useful policy for reaching the goal, since strong reward signals occur too rarely to support valid policy optimization. The distance reward sometimes performs slightly better than the sparse reward, but it is very unstable across different runs of the same experiment, especially when the optimal trajectory to the goal position includes large curvature.

Figure 2 shows the average reward during training. The average reward is computed over all sampled episodes in a single PPO iteration. With a trend similar to the success percentage, TTR-based reward shaping rapidly reaches a high and stable cumulative reward level after around 50 or 60 PPO iterations, with a few oscillations during the initial exploration phase. We also evaluate the three trained policies in the identical environment under the same sparse reward setting; the results are shown in Figure 4. The control policy trained with the TTR-based reward performs well across all episodes, whereas the policies trained with the distance and sparse rewards perform much worse, since those agents never learn how to deliberately reach the goal.

V-B Planar Quadrotor Model

Our second experiment is conducted on a simulated planar quadrotor model in Gazebo to evaluate the performance of our TTR-based reward shaping method on a highly dynamic and unstable system. The planar quadrotor is a standard test problem in the control literature [36], [37]. "Planar" means the quadrotor flies only in the vertical plane, without changing its roll and yaw angles. The system has the state $s = (x, z, \phi, \dot{x}, \dot{z}, \dot{\phi})$, where $(x, z, \phi)$ denote the planar coordinates and pitch, and $(\dot{x}, \dot{z}, \dot{\phi})$ denote their time derivatives. The quadrotor's movement is controlled by the two motor thrusts, $T_1$ and $T_2$. Similar to the car task, the agent's observation augments the state with 8 laser rays evenly extracted from the "Hokoyu_utm30lx" ranging sensor for detecting nearby obstacles. The goal region is defined as a small neighborhood of the target state, and states in this region are considered to have reached the goal. For PPO training, the network outputs the mean and variance of a Gaussian distribution over the continuous actions, from which actions are sampled stochastically.
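A minimal sketch of the Gaussian action sampling described above is given below; `policy_net` is a hypothetical callable returning the mean and log standard deviation of the two thrust commands.

```python
import numpy as np

# Sketch of sampling a continuous action from the Gaussian policy head.
# `policy_net` is a hypothetical callable: observation -> (mean, log_std),
# each of shape (2,) for the two thrust commands.

def sample_action(policy_net, observation, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    mean, log_std = policy_net(observation)
    std = np.exp(log_std)
    action = rng.normal(mean, std)               # stochastic continuous action
    # Diagonal-Gaussian log-probability, used later in the PPO objective.
    log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                             + 2.0 * log_std + np.log(2.0 * np.pi))
    return action, log_prob
```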

Fig. 6: The average cumulative reward obtained with three different reward functions during training on the quadrotor model
Fig. 7: The success rate of reaching the goal obtained with three different reward functions during training on the quadrotor model
Fig. 8: The average cumulative reward obtained under the same sparse reward setting during evaluation on the quadrotor model

Similar to the car problem, we apply three different reward functions for comparison. In the sparse reward setting, the agent receives a positive reward for reaching the goal region and a negative penalty for collisions; at any other state it receives no reward. The distance reward is based on the Euclidean distance from any position to the goal, with the same rewards at goal and collision states. The TTR-based reward function preserves exactly the same rewards for the goal region and collisions, but assigns the negative TTR value as the reward at every other timestep. Each episode runs for a fixed maximum number of timesteps, and the maximum thrust of the quadrotor is set to 1.5 times its weight to approximate reality as closely as possible. As shown in our results, with the extremely sparse reward associated with reaching the goal on this unstable, highly dynamic system, it is extremely difficult to learn a good policy. However, by simply replacing it with the dynamics-informed TTR-based reward function, both the performance and the efficiency of the agent's learning are greatly improved.

Fig. 9: Performance visualization of the quadrotor in the Gazebo simulator. The picture is a composite of the same quadrotor at different time snapshots.

Figures 6-8 show the learning performance with the TTR-based reward function as well as the sparse and distance rewards. With the sparse reward, due to the extreme lack of effective feedback, the agent never reaches the goal and thus cannot learn to deliberately move toward the goal region; instead, it learns a locally optimal policy that hovers in a small safe region to avoid any collision. The distance reward only considers the absolute distance between the agent and the goal and ignores the angle information, so the quadrotor always tries to reach the neighborhood of the goal region in a near-hover configuration. However, in our environment setting the quadrotor has great difficulty reaching the goal without any tilting motion. As a result, the quadrotor trained with the distance reward ends up hovering near the goal but never produces any deliberate tilting behavior toward it.

In contrast, as shown in Figure 9, the TTR-based reward performs very well in guiding the quadrotor to the goal along short trajectories, starting from a highly tilted initial state. During training, the TTR-based reward leads the quadrotor to approach the goal area by considering not only the distance between positions but also the angular state. Gradually, the quadrotor finds that it accumulates higher reward if it manages to reach the goal in less time using highly tilted motions. The quadrotor can learn such complex behavior with the TTR-based reward because the reward captures the model dynamics in all state dimensions, including the pitch angle.

Specifically, Figures 6 and 7 plot the training curves in terms of cumulative reward and success rate versus training iterations. Even though the learning process with the TTR-based reward has comparatively large variance, it eventually finds a proper trajectory to the goal and thus achieves the highest reward and success rate compared to the distance and sparse rewards. This illustrates that the TTR-based reward encourages the agent to explore more efficiently and avoid getting stuck. The evaluation in Figure 8 tells the same story: the policy learned from the TTR-based reward keeps performing well and clearly beats the policies learned from the distance or sparse rewards, especially when the task requires highly complicated motions, such as the quadrotor flying in a tilted manner. (A short video demo of our work: https://youtu.be/KdA2UGr6T4g)

VI Conclusion

In this paper, we proposed the TTR-based reward shaping method, which addresses the challenge of high sample complexity in model-free reinforcement learning (RL) by using a TTR function to provide a reward at each timestep that guides the policy learning process. In our approach, we first pick a system model, which can be the same one used in the forward learning process or a simplified model preserving the main elements relevant to the reward. By computing the TTR function for the chosen model at the beginning of learning and integrating it into training as a per-timestep reward, the agent receives more dynamics-informed feedback and learns better and faster. TTR-based reward shaping is an effective way to leverage physical priors to dramatically increase the training speed of model-free algorithms without affecting their flexibility across a large variety of complicated RL problems.

There are several possible directions for future work. Since goal-oriented RL tasks usually involve obstacles in the scene, it would be of interest to incorporate obstacle information (position, shape) into the model dynamics to develop an even more informative TTR-based reward for learning policies. Moreover, TTR-based reward shaping has the potential to replace sparse reward settings whenever a physical model is accessible, and it could be integrated with other deep exploration techniques, such as hierarchical reinforcement learning, to obtain even better experimental performance.

References

  • [1] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning: Applications on robotics,” J. Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, May 2017.
  • [2] A. m. Farahmand, A. Shademan, M. Jagersand, and C. Szepesvari, “Model-based and model-free reinforcement learning for visual servoing,” in Proc. IEEE Int, Conf, Robotics and Automation, 2009.
  • [3] P. Auer, T. Jaksch, and R. Ortner, “Near-optimal regret bounds for reinforcement learning,” in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds.   Curran Associates, Inc., 2009, pp. 89–96.
  • [4] A. Guez, D. Silver, and P. Dayan, “Efficient bayes-adaptive reinforcement learning using sample-based search,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2012, pp. 1025–1033.
  • [5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proc. Annual Int. Conf. Machine Learning, 2009.
  • [6] C. Florensa, D. Held, M. Wulfmeier, and P. Abbeel, “Reverse curriculum generation for reinforcement learning,” CoRR, 2017. [Online]. Available: http://arxiv.org/abs/1707.05300
  • [7] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” CoRR, vol. abs/1803.00101, 2018.
  • [8] B. Ivanovic, J. Harrison, A. Sharma, M. Chen, and M. Pavone, “Barc: Backward reachability curriculum for robotic reinforcement learning,” in Proc. IEEE Int. Conf. Robotics and Automation, 2019.
  • [9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   USA: A Bradford Book, 2018.
  • [10] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85 – 117, 2015.
  • [11] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
  • [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
  • [13] S. Levine and V. Koltun, “Guided policy search,” in Proc. Annual Int. Conf. Machine Learning, 2013, pp. 1–9.
  • [14] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, “Trust region policy optimization,” in Proc. Annual Int. Conf. Machine Learning, 2015.
  • [15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
  • [16] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped dqn,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 4026–4034.
  • [17] T. G. Dietterich, “The maxq method for hierarchical reinforcement learning,” in Proc. Annual Int. Conf. Machine Learning, 1998.
  • [18] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1, pp. 181 – 211, 1999.
  • [19] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 3675–3683.
  • [20] A. D. Laud, “Theory and application of reward shaping in reinforcement learning,” Apr. 2011.
  • [21] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping,” in Proc. Annual Int. Conf. Machine Learning, 1998.
  • [22] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proc. Annual Int. Conf. Machine Learning, 1999.
  • [23] E. Wiewiora, G. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” in Proc. Annual Int. Conf. Machine Learning, 2003.
  • [24] S. Russell, “Learning agents for uncertain environments (extended abstract),” in Proc. Annual Conf. Computational Learning Theory, 1998.
  • [25] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proc. Annual Int. Conf. Machine Learning, 2000.
  • [26] O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” in AAAI, 2018.
  • [27] I. M. Mitchell, “The flexible, extensible and efficient toolbox of level set methods,” J. Scientific Computing, vol. 35, no. 2, pp. 300–329, Jun 2008.
  • [28] M. Chen, S. Herbert, and C. J. Tomlin, “Fast reachable set approximations via state decoupling disturbances,” in Proc. IEEE Conf. Decision and Control, 2016.
  • [29] M. Chen, S. L. Herbert, M. S. Vashishtha, S. Bansal, and C. J. Tomlin, “Decomposition of reachable sets and tubes for a class of nonlinear systems,” IEEE Transactions on Automatic Control, vol. 63, no. 11, pp. 3675–3688, Nov 2018.
  • [30] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992.
  • [31] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2013.
  • [32] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver, “Emergence of Locomotion Behaviours in Rich Environments,” arXiv e-prints, Jul 2017.
  • [33] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations, ser. Modern Birkhäuser Classics, 2008.
  • [34] M. Bardi and P. Soravia, “Hamilton-jacobi equations with singular boundary conditions on a free boundary and applications to differential games,” Transactions of the American Mathematical Society, vol. 325, no. 1, pp. 205–229, 1991.
  • [35] R. Takei and R. Tsai, “Optimal trajectories of curvature constrained motion in the hamilton–jacobi formulation,” J. Scientific Computing, vol. 54, no. 2, pp. 622–644, Feb 2013.
  • [36] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an open-source multi-robot simulator,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2004.
  • [37] D. J. Webb and J. van den Berg, “Kinodynamic rrt*: Asymptotically optimal motion planning for robots with linear dynamics,” in Proc. IEEE Int.Conf. Robotics and Automation, 2013.