I Introduction
Deep reinforcement learning (DRL) methods have shown recent success on continuous control tasks for robotic systems in simulation [1, 2, 3]. Such methods are applied with no prior knowledge of the systems, leading to problematic sample complexity and thus long training times. Unfortunately, little can be said about the stability or robustness of the resulting control policies, even when more traditional model-based optimal control solutions exist for these same systems.
Additionally, DRL has been almost exclusively applied in simulation, where a failed trial has no repercussions. In the real world a failure can have catastrophic consequences, including damaging the robot or causing injury to humans in the area. Some recent works have successfully learned a policy for a real robot [4, 5], or transferred policies learned in simulation to the real system [6, 7]. Of particular note in [7] is that the learned policies outperform the authors' previous model-based methods with respect to both energy efficiency and speed. However, instead of training from scratch, it would seem intuitive to use the model-based methods as an initial starting point for DRL, with the reasoning that at any given moment, our learned policy should do at worst as well as existing control solutions.

One might be tempted to perform imitation learning updates on trajectories taken from running a model-based optimal control policy, for example using DAgger
[8]. However, due to the often mismatched dynamics between the simplified system on which this model-based control policy is based and the real physical system, this may lead to overfitting to a suboptimal policy. There is also the additional concern of deviating too far from the expert trajectories into regions of the state space not previously visited, in which a policy learned only from expert data may perform poorly.

Instead, we propose interweaving optimal control samples during the policy rollouts of model-free DRL methods in the following manner: at each timestep we evaluate our policy network to get action $a_{RL}$, and query our trajectory optimization to get action $a_{TO}$. We then simulate the execution of each of these actions individually, and select the one giving the larger reward as the true action to use in the real world. Such a scheme, shown in Figure 1, should ensure that at worst the agent will always do as well as the model-based optimal control policy, and can only do better. At the beginning of training, we expect this approach to almost exclusively pick $a_{TO}$; because the network weights are randomly initialized, the policy is very unlikely to consistently outperform a model-based method. However, as training progresses, and with added policy exploration/exploitation, the number of selected on-policy samples will increase.
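This per-timestep selection can be sketched as follows; `policy`, `traj_opt`, and `simulate_step` are hypothetical stand-ins for the actor network, the trajectory optimizer, and a one-step simulator, not the paper's actual implementation:

```python
# Hedged sketch of the per-timestep action selection described above.
def select_action(state, policy, traj_opt, simulate_step):
    """Return whichever action earns the larger simulated one-step reward."""
    a_rl = policy(state)        # action proposed by the actor network
    a_to = traj_opt(state)      # action from the trajectory optimization
    _, r_rl = simulate_step(state, a_rl)   # simulate each action separately
    _, r_to = simulate_step(state, a_to)
    if r_rl >= r_to:
        return a_rl, "RL"       # on-policy sample
    return a_to, "TO"           # fall back to the model-based action
```

Early in training the "TO" branch should dominate, with the "RL" branch taking over as the policy improves.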
Related work on using trajectory optimization to help learn or guide a control policy includes [9], [10], and [11]. These works differ from the method proposed in this paper in that they focus on incorporating offline demonstrations or trajectories into training to guide the policy search, whereas in this work the trajectory optimization is run online at each timestep and compared with the current policy action, ensuring a worst-case baseline.
A related work combining prior knowledge of the system with learning is [12], where the policy chooses between actions computed with a simple PID controller and actions from evaluating the current actor network. Although the authors observe that this controller helps achieve faster and more stable learning, it is not optimal, and much can still be improved in terms of sample complexity. Additionally, there are no guarantees on the policy at any given time, and no minimum time-to-goal estimates or worst-case scenarios.
Another related work that combines learning with MPC is POLO [13]. POLO seeks to improve MPC by learning a global value function, of which it has only a local estimate when initialized. As a result, it cannot be run online and provides no policy guarantees, as it cannot achieve a desired result without first learning the global value function. This paper, in contrast, seeks to improve policy learning by using a trajectory optimization framework to guide the learning process, and provides a worst-case action that can be run online.
As a model-based optimal control policy, in this work we take direct inspiration from the work done in [14, 15, 16, 17], where full body control for underactuated systems is achieved via trajectory optimization and stabilization under constraints. [15] in particular introduces an algorithm (DIRCON) that extends the direct collocation method, incorporating manifold constraints to produce nominal trajectories with third-order integration accuracy.
The rest of this paper is organized as follows: Section II provides background details on reinforcement learning, imitation learning as behavioral cloning, and robot dynamics in the context of a car model. Section III describes the trajectory optimization framework used to calculate optimal trajectories for the car, and our algorithm combining this trajectory optimization with deep reinforcement learning (in this case PPO) is presented in Section IV. Section V presents results showing the benefits of our algorithm, and a brief conclusion is given in Section VI.
II Preliminaries
II-A Reinforcement Learning
The reinforcement learning framework, described thoroughly in [18] and elsewhere, typically consists of an agent interacting with an environment modeled as a Markov Decision Process (MDP). An MDP is given by a 4-tuple $(S, A, T, R)$, where $S$ is the set of states, $A$ is the set of actions available to the agent, $T : S \times A \times S \to [0, 1]$ is the transition function, where $T(s, a, s')$ gives the probability of being in state $s$, taking action $a$, and ending up in state $s'$, and $R : S \times A \times S \to \mathbb{R}$ is the reward function, where $R(s, a, s')$ gives the expected reward for being in state $s$, taking action $a$, and ending up in state $s'$. The goal of the agent is thus to interact with the environment by selecting actions that maximize future rewards.

In this paper, the states consist of a subset of a robot's positions and velocities, the actions are motor torques or positions, the transition function is modeled by a physics engine [19], and the reward is a potential-based function that minimizes distance to a target goal.
II-B Proximal Policy Optimization
Although we expect to see benefits from combining trajectory optimization with any deep reinforcement learning algorithm, for this paper we use the current state-of-the-art, Proximal Policy Optimization (PPO) [2]. In particular, PPO has achieved breakthrough results for continuous control robotics tasks by optimizing the following surrogate objective with clipped probability ratio:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right] \quad (1)$$

where $\hat{A}_t$ is an estimator of the advantage function at time step $t$ as in [20], and $r_t(\theta)$ denotes the probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \quad (2)$$

where $\pi_\theta$ is a stochastic policy, and
$\theta_{old}$ is the vector of policy parameters before the update. This objective penalizes too large a policy update, which means penalizing deviations of $r_t(\theta)$ from 1.

II-C Learning from Demonstration
In this work we use the classical behavioral cloning (BC) approach to imitation learning where we seek to minimize the error between an expert action and the maximum likelihood estimate action from the current policy:
$$L_{BC}(\theta) = \sum_i \left\| a_i^* - \hat{a}_i \right\|^2 \quad (3)$$

for expert demonstration state-action pairs $(s_i, a_i^*)$, where $\hat{a}_i$ is the maximum likelihood estimate action for state $s_i$ under the current policy $\pi_\theta$.
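As a minimal numerical sketch of the two losses above, assuming the standard clipped surrogate for Eq. (1) and a squared-error form for Eq. (3):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective (Eq. (1)): mean of min(r*A, clip(r)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # the min penalizes probability ratios that stray too far from 1
    return float(np.mean(np.minimum(unclipped, clipped)))

def bc_loss(expert_actions, policy_mle_actions):
    """Behavioral cloning loss (Eq. (3)): squared error between expert
    actions and the policy's maximum likelihood estimate actions."""
    diff = np.asarray(expert_actions) - np.asarray(policy_mle_actions)
    return float(np.sum(diff ** 2))
```

In practice the surrogate is maximized over minibatches of $(r_t, \hat{A}_t)$ pairs via SGD, while the BC loss is minimized on the expert dataset.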
II-D Robot Dynamics
The equations of motion for a robotic system can be written as:
$$M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) = A(q)^T\lambda + Bu + F \quad (4)$$

where $q$ are the generalized coordinates, $M(q)$ is the inertial matrix, $C(q, \dot{q})\dot{q}$ denotes centrifugal and Coriolis forces, $G(q)$ captures potentials (gravity), $A(q)^T\lambda$ are constraint forces (where the multipliers $\lambda$ are unknown a priori), $B$ maps control inputs $u$ into generalized forces, and $F$ contains nonconservative forces such as friction.
In this work, we specifically consider a simple car model, shown in Figure 2, with $q = [x,\ y,\ \theta,\ \psi]^T$, where $(x, y)$ are the center of mass coordinates in the world frame, $\theta$ is the yaw of the body with respect to the global x-axis, and $\psi$ is the steering angle of the front wheels.
For general wheeled mobile robots, $A(q)$ typically contains constraints ensuring no slip (free rolling in the direction the wheel is pointing) and no skid (no velocity along the wheel's rotation axis perpendicular to the free rolling direction), which come from writing these constraints in Pfaffian form $A(q)\dot{q} = 0$. The multipliers $\lambda$ can be explicitly solved for by differentiating $A(q)\dot{q} = 0$ and substituting in $\ddot{q}$ from Equation 4.
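As a concrete instance (our notation, assuming the standard kinematic-car no-skid condition on the body), the constraint forbidding lateral body velocity can be written as:

```latex
% No-skid constraint: no velocity perpendicular to the heading direction
% (a standard kinematic-car condition; notation follows q = [x, y, \theta, \psi]^T)
\dot{x}\sin\theta - \dot{y}\cos\theta = 0
\quad\Longleftrightarrow\quad
\underbrace{\begin{bmatrix} \sin\theta & -\cos\theta & 0 & 0 \end{bmatrix}}_{\text{one row of } A(q)} \dot{q} = 0
```

Each wheel contributes analogous rows to $A(q)$, evaluated at the wheel's own contact frame.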
II-D1 Remark
Because this constraint is in place, the trajectory optimization will not find solutions where skidding is a viable option allowing for greater reward (such as skidding into a parking space instead of parallel parking, or an aggressive turning maneuver to more quickly change directions). The alternative to this constraint would be to add friction approximations such as in [21], for which the optimization must then solve a Linear Complementarity Problem at each contact point. As this can be a very expensive computation, we avoid this consideration and instead entrust the DRL algorithm to use the trajectory optimization as a guide towards learning a better policy, in which slip may or may not be optimal.
III Trajectory Optimization
This section provides details for formulating the locomotion problem for a robotic system as a trajectory optimization. At a high level, the full nonlinear system is discretized, and direct collocation along with backward Euler integration is used to generate motion as in [14, 22]. More precisely, the problem is formulated as:
find $\ q_{0:N},\ \dot{q}_{0:N},\ u_{0:N},\ \lambda_{0:N}\ $ minimizing $J$  (5)
subject to
State Constraints:
$q_{min} \le q_k \le q_{max}, \quad \dot{q}_{min} \le \dot{q}_k \le \dot{q}_{max}$  (6)
$u_{min} \le u_k \le u_{max}$  (7)
Dynamics Constraints:
backward Euler integration of the constrained dynamics, Eqs. (10)-(12)  (8)
where each of the above constraints is detailed below, along with cost function considerations.
III-A Objectives
The cost function is defined as the weighted squared error between the goal coordinates $q_g$ and the body coordinates $q_{b,k}$, where $N$ is the number of sample points for the trajectory:

$$J = \sum_{k=0}^{N} (q_g - q_{b,k})^T W (q_g - q_{b,k}) \quad (9)$$

where the weights $W$ can vary based on the desired task, e.g. if final body orientation is important.
III-B State Constraints
The initial states $q_0$ and $\dot{q}_0$ are constrained to match the robot's current state exactly. For the remaining time points, $q_k$ and $\dot{q}_k$ are bounded by joint position and velocity limits. The input torques $u_k$ are also bounded, explicitly by the physical constraints of the robot, as well as implicitly through the $\dot{q}$ ranges.
III-C Dynamics Constraints
At each time step $k$, with $h$ the time step interval, the dynamics are constrained:

$$q_{k+1} = q_k + h\,\dot{q}_{k+1} \quad (10)$$
$$\dot{q}_{k+1} = \dot{q}_k + h\,\ddot{q}_{k+1} \quad (11)$$

with

$$M_{k+1}\ddot{q}_{k+1} + C_{k+1}\dot{q}_{k+1} + G_{k+1} = A_{k+1}^T\lambda_{k+1} + Bu_{k+1} + F_{k+1} \quad (12)$$

where we write $M(q_{k+1})$ as $M_{k+1}$, and similarly for the other terms.
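The integration constraints of Eqs. (10)-(11) can be sketched as NLP residuals; in the actual implementation these would be symbolic expressions handed to the solver, whereas plain floats are used here for illustration:

```python
def backward_euler_residuals(q_k, qd_k, q_k1, qd_k1, qdd_k1, h):
    """Residuals of the backward Euler constraints; the NLP solver drives
    these to zero at every knot point k (equality constraints)."""
    res_q = q_k1 - (q_k + h * qd_k1)      # Eq. (10): position uses qd_{k+1}
    res_qd = qd_k1 - (qd_k + h * qdd_k1)  # Eq. (11): velocity uses qdd_{k+1}
    return res_q, res_qd
```

Note the implicit (backward) form evaluates the derivative at the end of the interval, which is what gives the scheme its favorable stability for stiff, constrained dynamics.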
IV Cooperative Trajectory Optimization and Deep Reinforcement Learning
In this section we detail our algorithm, Cooperative Trajectory Optimization and PPO (CoTO-PPO), shown in Algorithm 1. The main idea is that for each new observation at time step $t$, the current PPO actor network is queried for action $a_{RL}$, and a trajectory optimization is solved for action $a_{TO}$. Each of these actions is simulated individually to get rewards $r_{RL}$ and $r_{TO}$. The action producing the larger simulated reward is selected as the true best action and used in the real world (or to step the actual simulation). The necessary transition information is then appended to either the PPO dataset $D_{RL}$ or the Supervised Learning (SL) dataset $D_{SL}$, depending on which action was selected. After $T$ time steps corresponding to the current policy/trajectory optimization rollout, the actor-critic PPO networks are updated by optimizing $L^{CLIP}$ on dataset $D_{RL}$, and the actor network is additionally updated with supervised learning by optimizing $L_{BC}$ on dataset $D_{SL}$.
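One rollout of this scheme might be sketched as below; the environment interface and helper names (`simulate_step`, the dataset routing) are our assumptions, not the exact implementation:

```python
def rollout(env, policy, traj_opt, simulate_step, horizon):
    """Collect one rollout, routing transitions to the PPO dataset D_RL or
    the supervised dataset D_SL based on simulated one-step rewards."""
    D_RL, D_SL = [], []
    s = env.reset()
    for _ in range(horizon):
        a_rl, a_to = policy(s), traj_opt(s)
        _, r_rl = simulate_step(s, a_rl)    # simulate each candidate action
        _, r_to = simulate_step(s, a_to)
        a = a_rl if r_rl >= r_to else a_to  # keep the better one
        s_next, r = env.step(a)             # execute it for real
        if r_rl >= r_to:
            D_RL.append((s, a, r, s_next))  # on-policy PPO sample
        else:
            D_SL.append((s, a_to))          # expert pair for the BC update
        s = s_next
    # afterwards: PPO update on D_RL, supervised (BC) update on D_SL
    return D_RL, D_SL
```

The two datasets then drive the two different update rules, so every environment step contributes to learning regardless of which controller won the comparison.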
V Results

V-A Implementation Details
We use a combination of OpenAI Gym [23] to represent the MDP and PyBullet [19] as the physics engine for training and simulation purposes.
We additionally use the OpenAI Baselines [24] implementation of PPO (which optimizes the objective discussed in Sec. II-B) with the default hyperparameters, but with the Beta distribution to select continuous actions, as suggested in [25], to avoid the bias introduced by limited control ranges when using the standard Gaussian distribution. The Beta distribution parameters $\alpha$ and $\beta$ are TensorFlow variables and are therefore updated during each SGD minibatch, so the action variance decreases as the policy converges. The Gym environment we use is similar to the standard HumanoidFlagRun environments, but the humanoid is replaced with the Bullet MIT racecar. Example snapshots from the environment are shown in Figure 3. The goal destination/flag is updated only when the car center of mass lies within 0.2 [m] of the current goal location, and is then placed randomly in a 10 by 10 [m] grid. The agent has 10 seconds to maximize rewards each trial, which typically consists of reaching several goal locations consecutively.

Our neural network architecture is the default multilayer perceptron, consisting of 2 fully connected hidden layers of 64 neurons each, with tanh activation. The policy and value networks each have this same structure.
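The mapping from a Beta policy to a bounded control range can be sketched as follows; the affine scaling and the mode-based maximum likelihood estimate are our assumptions about the implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_action(alpha, beta, low, high, deterministic=False):
    """Map a Beta-distributed sample on [0, 1] into the bounded action range
    [low, high]; the Beta policy natively respects action limits, avoiding
    the boundary bias of a clipped Gaussian."""
    if deterministic:
        # maximum likelihood estimate: mode of Beta(alpha, beta), for alpha, beta > 1
        x = (alpha - 1.0) / (alpha + beta - 2.0)
    else:
        x = rng.beta(alpha, beta)  # stochastic sample in (0, 1)
    return low + (high - low) * x
```

As $\alpha$ and $\beta$ grow during training, the distribution sharpens and the stochastic and deterministic evaluations converge.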
The trajectory optimization is implemented in Python with CasADi [26], using IPOPT [27] to solve the NLP. Due to the imposed torque and velocity limits, as well as the nonholonomic constraints added to the dynamics, the trajectory optimization finds solutions that are suboptimal with respect to the true best policy. We will see in the following subsection that even this suboptimal trajectory optimization provides a large benefit when combined with policies either learned from scratch or with our method, both during training and testing.
The observation space is:
$$s = \left[\, d_{goal},\ \theta_{goal},\ \psi,\ \dot{x},\ \dot{y},\ \dot{\theta},\ \dot{\psi} \,\right] \quad (13)$$

which is the distance from the center of mass to the goal location, the angle from the current body heading to the goal, the steering angle of the front wheels, the body velocities in the global $x$ and $y$ directions, the body yaw rate, and the yaw rate of the steering wheels.
The action space is $a = [v_{des},\ \psi_{des}]$: a desired body velocity set with velocity control mapped to a differential drive, and a desired steering angle set with position control.
We consider potential-based shaping reward functions of the form:

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s) \quad (14)$$

to guarantee consistency with the optimal policy, as proved by Ng et al. in [28]. The real-valued function $\Phi$ seeks to minimize the distance to a target goal $(x_g, y_g)$:

$$\Phi(s) = -\sqrt{(x_g - x)^2 + (y_g - y)^2} \quad (15)$$
This reward scheme gives dense rewards at each time step, rewarding incremental motion in the direction of the current goal and helping ensure the optimal policy is learned. Dense rewards are important in this framework because we choose between actions based on the simulated instantaneous reward, which would likely be 0 at most time steps under a sparse reward scheme.
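This shaping computation can be sketched under our reconstruction of Eqs. (14)-(15):

```python
import math

def potential(x, y, x_g, y_g):
    """Phi(s): negative Euclidean distance from the body to the goal."""
    return -math.hypot(x_g - x, y_g - y)

def shaping_reward(s, s_next, goal, gamma=0.99):
    """Potential-based shaping F = gamma * Phi(s') - Phi(s); positive
    whenever a step moves (discount-adjusted) closer to the goal."""
    x, y = s
    xn, yn = s_next
    xg, yg = goal
    return gamma * potential(xn, yn, xg, yg) - potential(x, y, xg, yg)
```

Because the shaping is a potential difference, it changes the reward density without changing which policy is optimal, per [28].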
V-B Experiments
We seek to compare and evaluate the following methods:

- pure PPO
- pure trajectory optimization
- CoTO-PPO (our method)
- CoTO-PPO, policy only: how well does the policy learned from CoTO-PPO perform on its own, without the fail-safe action of the trajectory optimization?
- CoTO-(pure PPO): how well does combining the trajectory optimization with an agent trained entirely separately with pure PPO perform?
The reader is encouraged to watch the accompanying video (https://youtu.be/mv2xw83NyWU) for simulations of the discussed policies.
Figure 4 shows the episode reward mean vs. number of training time steps for running pure PPO as well as CoTO-PPO on the CarFlagRun environment. The episode reward mean indicates how well the agent was able to continue progressing toward the goal location(s) during each trial. Because it does at worst as well as the trajectory optimization, CoTO-PPO begins with a very high reward mean, and only improves from there as the networks are updated with both DRL and supervised learning updates. Pure PPO, on the other hand, is forced to learn from scratch, and even after training for 1 million time steps is only able to do as well as the trajectory optimization combined with the essentially uniformly distributed random noise of the uninitialized policy from CoTO-PPO.
Figure 5 shows the percentage of samples picked by the policy network of PPO in CoTO-PPO that outperform the actions from the trajectory optimization over the course of training. As expected, when the policies are randomly initialized, few samples from PPO outperform even a suboptimal trajectory optimization. As training progresses, the percentage of maximal reward samples picked with PPO converges to around 75%. This shows there is still a benefit to using the trajectory optimization as a worst-case fallback, as it is still being picked 25% of the time after 1 million training time steps.
Table I

Algorithm | Reward Mean (Stochastic) | Reward Mean (Deterministic) | Percent PPO Actions (Stochastic) | Percent PPO Actions (Deterministic)
pure PPO | 12.9 | 13.8 | – | –
Trajectory Optimization | – | 12.1 | – | –
CoTO-PPO | 15.1 | 15.7 | 75 | 81
CoTO-PPO, policy only | 14.1 | 14.7 | – | –
CoTO-(pure PPO) | 14.5 | 14.5 | 44 | 57
Table I details the reward mean and percentage of PPO actions picked (where relevant) for the various algorithms and scenarios across 100 trials, after training for 1 million time steps. We evaluate each policy both by sampling from the output Beta distributions stochastically, and by deterministically evaluating the distributions with the maximum likelihood estimate. We see that combining the trajectory optimization with PPO significantly increases the mean reward, with our method CoTO-PPO having the best performance. Evaluating only the policy trained with CoTO-PPO still shows a significant improvement over the policy trained with pure PPO alone. We also evaluate the effect of combining the policy trained with pure PPO with the trajectory optimization, labeled CoTO-(pure PPO); the combination with the TO leads to a larger reward mean, making it clear that pure PPO has learned a suboptimal policy.
In this latter case, we also track what percentage of the time CoTO-(pure PPO) picks the TO action vs. the action selected from the policy network of pure PPO. Despite training for 1 million time steps, our algorithm finds that the trajectory optimization outperforms the policy trained from scratch roughly half of the time. In comparison, the policy trained with our method CoTO-PPO is picked over 75% of the time, despite far fewer on-policy samples (due to using the TO samples for supervised updates).
V-C Maximum Instantaneous Reward Discussion
A first look at the algorithm may suggest that it is greedy rather than optimal, as the agent selects the action leading to the maximum instantaneous reward rather than a function of expected returns. We experimented with simulating taking multiple actions from the TO and from RL over varying horizons, but found significantly worse performance with this method. One plausible explanation for this result is that when first initialized, the PPO actor network is essentially taking random actions. As the horizon over which we separately simulate taking actions from the TO and RL increases, the expected return of a series of random actions regresses toward 0 under our reward scheme, and thus the probability that RL will outperform the TO tends to 0. Said another way, it becomes increasingly unlikely to sample multiple “lucky actions” from RL in a row during exploration as the horizon increases. Since the agent will then almost always choose the actions from the suboptimal TO, the policy network will almost always be updated with supervised learning updates and will correspondingly converge, approximately, to this same suboptimal policy.
On the other hand, by using the maximum instantaneous transition reward over a horizon of 1, there is a much higher probability of sampling a “good” action from an uninitialized random policy. This allows for more efficient exploration of the environment than overwhelmingly following the (suboptimal) trajectory optimization actions, leading to a better overall policy (more aggressive throttle values, steering angles during turns, etc.), while still ensuring a reasonable worst-case action from the more dynamically conservative TO solution for cases in which a worse-performing action is sampled with RL. The short horizon also avoids overfitting to the suboptimal trajectory optimization expert.
VI Conclusion
In this work we have shown the benefits of combining trajectory optimization and deep reinforcement learning into one training process. Using the two methods cooperatively allows for online use of our algorithm at any point in the training process, with the knowledge that the worst case will be as good as a model-based trajectory optimization. This also leads to much greater sample efficiency, avoiding the unnecessary exploration of randomly initialized policies and helping to escape local optima.
Even if the trajectory optimization is suboptimal due to mismatched dynamics or overly conservative constraints, there is a clear advantage to incorporating prior knowledge of the system to speed up and guide learning. We also observe that trained policies, whether learned exclusively with deep reinforcement learning or with our combined method, are likely to converge to local optima and cannot exhaustively span all observation states, showing the benefit of model-based methods as a proven fail-safe option. The need to bound learned policies and guarantee some sort of behavior is clear, and this work presents preliminary steps in this direction.
The method detailed in this paper can be readily applied to any robotic system, and should be an effective way to reduce sample complexity, accelerate training, guide the policy search, deploy policies online at any point in the training process, and give an upper bound estimate on time-to-goal through the trajectory optimization.
References
 [1] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, and D. Silver, “Emergence of locomotion behaviours in rich environments,” CoRR, vol. abs/1707.02286, 2017. [Online]. Available: http://arxiv.org/abs/1707.02286
 [2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
 [3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
 [4] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 3389–3396.
 [5] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning,” CoRR, vol. abs/1812.11103, 2018. [Online]. Available: http://arxiv.org/abs/1812.11103
 [6] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018. [Online]. Available: http://arxiv.org/abs/1804.10332
 [7] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/26/eaau5872
 [8] S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” CoRR, vol. abs/1011.0686, 2010. [Online]. Available: http://arxiv.org/abs/1011.0686
 [9] I. Mordatch and E. Todorov, “Combining the benefits of function approximation and trajectory optimization,” in In Robotics: Science and Systems (RSS), 2014.
 [10] S. Levine and V. Koltun, “Variational policy search via trajectory optimization,” in Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1, ser. NIPS’13. USA: Curran Associates Inc., 2013, pp. 207–215. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999611.2999635

 [11] S. Levine and V. Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1–9. [Online]. Available: http://proceedings.mlr.press/v28/levine13.html
 [12] L. Xie, S. Wang, S. Rosa, A. Markham, and N. Trigoni, “Learning with training wheels: Speeding up training with a simple controller for deep reinforcement learning,” CoRR, vol. abs/1812.05027, 2018. [Online]. Available: http://arxiv.org/abs/1812.05027
 [13] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan online, learn offline: Efficient learning and exploration via modelbased control,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Byey7n05FQ
 [14] M. Posa and R. Tedrake, “Direct trajectory optimization of rigid body dynamical systems through contact,” in Algorithmic foundations of robotics X. Springer Tracts in Adv. Robotics, 2013, pp. 527–542.
 [15] M. Posa, S. Kuindersma, and R. Tedrake, “Optimization and stabilization of trajectories for constrained dynamical systems,” in Proc. IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 1366–1373.
 [16] M. Posa, C. Cantu, and R. Tedrake, “A direct method for trajectory optimization of rigid bodies through contact,” International Journal of Robotics Research, vol. 33, no. 1, pp. 69–81, 2014.
 [17] S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson, “Optimizationbased full body control for the DARPA Robotics Challenge,” Journal of Field Robotics, vol. 32, no. 2, pp. 293–312, 2015.
 [18] R. S. Sutton and A. G. Barto, Reinforcement learning  an introduction, ser. Adaptive computation and machine learning. MIT Press, 1998.
 [19] E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2019.
 [20] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” CoRR, vol. abs/1506.02438, 2015.
 [21] D. E. Stewart and J. C. Trinkle, “An implicit time-stepping scheme for rigid body dynamics with Coulomb friction,” in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2000, pp. 162–169. [Online]. Available: https://doi.org/10.1109/ROBOT.2000.844054
 [22] G. Bellegarda and K. Byl, “Trajectory optimization for a wheel-legged system for dynamic maneuvers that allow for wheel slip,” in 2019 IEEE Conference on Decision and Control (CDC), in press.
 [23] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
 [24] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “OpenAI Baselines,” https://github.com/openai/baselines, 2017.
 [25] P.W. Chou, D. Maturana, and S. Scherer, “Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 834–843. [Online]. Available: http://proceedings.mlr.press/v70/chou17a.html
 [26] J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi – A software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
 [27] A. Wächter and L. T. Biegler, “On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, Mar 2006. [Online]. Available: https://doi.org/10.1007/s101070040559y
 [28] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 278–287. [Online]. Available: http://dl.acm.org/citation.cfm?id=645528.657613