Deep reinforcement learning (DRL) methods have shown recent success on continuous control tasks in robotics systems in simulation [1, 2, 3]. Such methods are applied using no prior knowledge of the systems, leading to problematic sample complexity and thus long training times. Unfortunately, little can be said about the stability or robustness of these resulting control policies, even if more traditional model-based optimal control solutions exist for these same systems.
Additionally, DRL has been almost exclusively applied in simulation, where a failed trial has no repercussions. In the real world a failure can have catastrophic consequences, including damaging the robot or causing injury to humans in the area. Some recent works have successfully learned a policy for a real robot  , or transferred policies learned in simulation to the real system  . Of particular note in 
is that the learned policies outperform the authors’ previous model-based methods with respect to both energy-efficiency and speed. However, instead of training from scratch, it would seem intuitive to use the model-based methods as an initial starting point for DRL, with the reasoning that at any given moment, our learned policy should doat worst as well as existing control solutions.
One might be tempted to perform imitation learning updates on trajectories taken from running a model-based optimal control policy, for example using DAgger. However, due to often mismatching dynamics between the simplified system on which this model-based control policy is based off of, and the real physical system, this may lead to overfitting a suboptimal policy. There is also the additional concern of deviating too much from the expert trajectories into regions of the state space not previously visited, in which the policy learned only from expert data may perform poorly.
Instead, we propose interweaving optimal control samples during the policy rollouts of model-free DRL methods in the following manner: at each timestep we can evaluate our policy network to get action , as well as query our trajectory optimization to get action . We then simulate the execution of each of these actions individually, and select the one which gave the larger reward as our true action to use in the real world. Such a scheme, shown in Figure 1, should ensure that at worst the agent will always do as well as the model-based optimal control policy, and can only do better. At the beginning of training, we expect this approach to almost exclusively pick ; due to the network weights being randomly initialized, it is very unlikely to consistently outperform a model-based method. However as training progresses, and from added policy exploration/exploitation, the number of selected on-policy samples will increase.
Related work on using trajectory optimization to help learn or guide a control policy include , , and . These works differ from the proposed method in this paper as they focus more on incorporating offline demonstrations or trajectories into training to guide the policy search, whereas in this work the trajectory optimization is run online at each timestep and compared with the current policy action, ensuring a worst case scenario.
A related work combining prior knowledge of the system with learning is , where the policy chooses between actions computed with a simple PID controller and from evaluating the current actor network. Although the authors observe this controller helps achieve faster and more stable learning performance, it is not optimal and much can still be improved in terms of sample complexity. Additionally, there are no guarantees on the policy at any given time, and no minimum time to goal estimates or worst case scenarios.
Another related work that combines learning with MPC is POLO . POLO seeks to improve MPC by learning a global value function, of which it has only a local estimate when initialized. As a result, it cannot be run online and provides no policy guarantees, as it cannot achieve a desired result without first learning the global value function. This paper in contrast seeks to improve on policy learning by using a trajectory optimization framework to guide the learning process, and provides a worst-case scenario action that can be run online.
As a model-based optimal control policy, in this work we take direct inspiration from the work done in [14, 15, 16, 17], where full body control for underactuated systems is achieved via trajectory optimization and stabilization under constraints.  in particular introduces an algorithm (DIRCON) that extends the direct collocation method, incorporating manifold constraints to produce nominal trajectories with third-order integration accuracy.
The rest of this paper is organized as follows: Section II provides background details on reinforcement learning, imitation learning as behavioral cloning, and robot dynamics in the context of a car model. Section III describes the trajectory optimization framework used to calculate optimal trajectories for the car, and our algorithm combining this trajectory optimization with deep reinforcement learning (in this case PPO) is presented in Section IV. Section V shows results on the benefits of using our algorithm, and a brief conclusion is given in Section VI.
Ii-a Reinforcement Learning
The reinforcement learning framework, which is described thoroughly in 
and elsewhere, typically consists of an agent interacting with an environment modeled as a Markov Decision Process (MDP). An MDP is given by a 4-tuple, where S is the set of states, A is the set of actions available to the agent, is the transition function, where
gives the probability of being in state, taking action , and ending up in state , and is the reward function, where gives the expected reward for being in state , taking action , and ending up in state . The goal of an agent is thus to interact with the environment by selecting actions that will maximize future rewards.
In this paper, the states consist of a subset of a robot’s positions and velocities, the actions are motor torques or positions, the transition function is modeled by a physics engine , and the reward is a potential-based function to minimize distance to a target goal.
Ii-B Proximal Policy Optimization
Although we expect to see benefits from combining trajectory optimization with any deep reinforcement learning algorithm, for this paper we use the current state-of-the-art, Proximal Policy Optimization (PPO) . In particular, PPO has achieved breakthrough results for continuous control robotics tasks by optimizing the following surrogate objective with clipped probability ratio:
where is an estimator of the advantage function at time step as in , and denotes the probability ratio
where is a stochastic policy, and
is the vector of policy parameters before the update. This objective seeks to penalize too large of a policy update, which means penalizing deviations offrom 1.
Ii-C Learning from Demonstration
In this work we use the classical behavioral cloning (BC) approach to imitation learning where we seek to minimize the error between an expert action and the maximum likelihood estimate action from the current policy:
for expert demonstration state-action pairs , where is the maximum likelihood estimate action for state using policy .
Ii-D Robot Dynamics
The equations of motion for a robotic system can be written as:
where are the generalized coordinates, is the inertial matrix, denotes centrifugal and Coriolis forces, captures potentials (gravity), are constraint forces (where are unknown multipliers a priori), maps control inputs into generalized forces, and contains non-conservative forces such as friction.
In this work, we specifically consider a simple car model, shown in Figure 2, with , where are the center of mass coordinates in the world frame, is the yaw of the body with respect to the global x-axis, and is the steering angle of the front wheels.
For general wheeled mobile robots, typically contains constraints ensuring no slip (free rolling in the direction the wheel is pointing) and no skid (no velocity along the wheel’s rotation axis perpendicular to the free rolling direction), which come from writing these constraints in Pfaffian form . can be explicitly solved for by differentiating and substituting in from Equation 4.
Because this constraint is in place, the trajectory optimization will not find solutions where skidding is a viable option allowing for greater reward (such as skidding into a parking space instead of parallel parking, or an aggressive turning maneuver to more quickly change directions). The alternative to this constraint would be to add friction approximations such as in , for which the optimization must then solve a Linear Complementarity Problem at each contact point. As this can be a very expensive computation, we avoid this consideration and instead entrust the DRL algorithm to use the trajectory optimization as a guide towards learning a better policy, in which slip may or may not be optimal.
Iii Trajectory Optimization
This section provides details for formulating the locomotion problem for a robotic system as a trajectory optimization. At a high level, the full nonlinear system is discretized, and direct collocation along with backward Euler integration is used to generate motion as in  . More precisely, the problem is formulated as:
where each of the above constraints are detailed below, along with cost function considerations.
The cost function is defined as the weighted squared error between the goal coordinates and the body coordinates , where is the number of sample points for the trajectory:
where weights can vary based on the desired task, i.e. if final body orientation is important.
Iii-B State constraints
The initial states and are constrained exactly based on the robot’s current state. For the rest of the time points, and are bounded by joint position and velocity limits. The input torques are also bounded explicitly by the physical constraints of the robot, as well as implicitly by ranges.
Iii-C Dynamics constraints
At each time step , with the time step interval, the dynamics are constrained:
where we write as , and similar for other terms.
Iv Cooperative Trajectory Optimization and Deep Reinforcement Learning
In this section we detail our algorithm, Cooperative Trajectory Optimization and PPO (CoTO-PPO), shown in Algorithm 1. The main idea is that for each new observation at time step , the current PPO actor network is queried for action , and a trajectory optimization is solved for action . Each of these actions is simulated individually to get rewards and . The action that produces the larger simulated reward is the one that is selected as the true best action and used in the real world (or to step the actual simulation). Necessary transition information is then appended to either the PPO dataset
or Supervised Learning (SL) dataset, depending on which action was selected. After time steps corresponding to the current policy/trajectory optimization roll out, the actor-critic PPO networks are updated by optimizing on dataset , and the actor network is additionally updated with supervised learning by optimizing on dataset .
V-a Implementation Details
to avoid the bias introduced with limited control ranges when using the standard Gaussian distribution. The Beta distribution parametersand
are TensorFlow variables and are therefore updated during each SGD minibatch, so the action variance will decrease as the policy converges. The Gym environment we use is similar to the standard HumanoidFlagRun environments, but the humanoid is replaced with the Bullet MIT racecar. Example snapshots from the environment are shown in Figure3. The goal destination/flag is updated only when the car center of mass lies within 0.2 [m] of the current goal location, and then placed randomly in a 10 by 10 [m] grid. The agent has 10 seconds to maximize rewards each trial, which will typically consist of reaching several goal locations consecutively.
Our neural network architecture is the default Multi-Layer Perceptron, consisting of 2 fully connected hidden layers of 64 neurons each, with tanh activation. The policy and value networks each have this same structure.
The trajectory optimization is implemented in Python with CasADi , using IPOPT  to solve the NLP. Due to the imposed torque and velocity limits as well as the nonholonomic constraints added to the dynamics, the trajectory optimization will find solutions that are suboptimal to the true best policy. We will see in the following subsection that even this suboptimal trajectory optimization has a large benefit when combined with policies either learned from scratch or with our method, both during training and testing.
The observation space is:
which is the distance from the center of mass to the goal location, the angle from the current body heading to the goal, the steering angle of the front wheels, the body velocity in the global and directions, the body yaw rate, and the yaw rate of the steering wheels.
The action space is , which is a desired body velocity to be set with velocity control mapped to a differential drive, and desired steering angle to be set with position control.
We consider potential-based shaping reward functions of the form:
to guarantee consistency with the optimal policy, as proved by Ng et. al in . The real valued function seeks to minimize the distance to a target goal :
This reward scheme gives dense rewards at each time step, towards ensuring the optimal policy is learned, rewarding incremental motion in the direction of the current goal. Having dense rewards is important in this framework as we are choosing between actions based on the simulated instantaneous reward, which would likely be 0 at most time steps under sparse reward scenarios.
We seek to compare and evaluate the following methods:
pure trajectory optimization
CoTO-PPO - (our method)
CoTO-PPO, policy only - how well does the policy learned from CoTO-PPO perform on its own, without the fail-safe action of the trajectory optimization?
CoTO-(pure PPO) - how well does combining trajectory optimization with an entirely separately trained agent with PPO perform?
The reader is encouraged to watch the accompanying video111https://youtu.be/mv2xw83NyWU for simulations of the discussed policies.
shows the episode reward mean vs. number of training time steps for running pure PPO as well as CoTO-PPO on the CarFlagRun environment. The episode reward mean indicates how well the agent was able to continue to progress in the direction of the goal location(s) during each trial. Due to doing at worst as well as the trajectory optimization, CoTO-PPO begins with very high reward mean, and only improves from there as the networks are updated with both DRL and supervised learning updates. Pure PPO on the other hand is forced to learn from scratch, and even after training for 1 million time steps, is only able to do as well as the trajectory optimization combined with essentially uniformly distributed random noise of the uninitialized policy from CoTO-PPO.
Figure 5 shows the percentage of samples picked by the policy network of PPO in CoTO-PPO that outperform the actions from the trajectory optimization over training. As expected, when the policies are randomly initialized, few samples from PPO will outperform even a suboptimal trajectory optimization. Eventually as training progresses, the percentage of maximal reward samples picked with PPO converges to around 75% of the time. This shows there is still a benefit to using the trajectory optimization as a worst case scenario, as it is still being picked 25% of the time after 1 million training time steps.
|Reward Mean||Percent PPO Actions|
|CoTO-PPO, policy only||14.1||14.7||-||-|
Table I details the reward mean and percentage of PPO actions picked (if relevant) with various algorithms and scenarios across 100 trials, after training for 1 million time steps. We do trials of evaluating each policy by sampling from the output Beta distributions stochastically, as well as deterministically evaluating the distributions with the maximum likelihood estimate. We see that combining the trajectory optimization with PPO significantly increases the mean reward, with our method CoTO-PPO having the best performance. If we evaluate only the policy trained from CoTO-PPO, it is still a significant improvement over the policy trained from pure PPO alone. We also evaluate the effect of combining the policy trained with pure PPO with the trajectory optimization, labeled CoTO-(pure PPO), which makes it clear that pure PPO has learned a suboptimal policy, as the combination with the TO leads to a larger reward mean.
In this latter case, we also track what percent of the time CoTO-(pure PPO) picks the TO action vs. the action selected from the policy network of pure PPO. Despite training for 1 million time steps, our algorithm finds that the trajectory optimization performs better than the policy trained from scratch roughly half of the time. In comparison, the policy trained from our method CoTO-PPO is picked over 75% of the time, despite far fewer on-policy samples (due to using the TO samples for supervised updates).
V-C Maximum Instantaneous Reward Discussion
A first look at the algorithm may seem to imply that it is greedy, rather than optimal, as the agent selects the action leading to the maximum instantaneous reward, rather than a function of expected returns. We experimented with simulating taking multiple actions from the TO and from RL over varying horizons, but found significantly worse performance with this method. One plausible explanation for this result is that when first initialized, the PPO actor network is essentially taking random actions. As the horizon increases on which we simulate taking actions from TO and RL separately, the expected returns of taking a series of random actions regresses toward 0 under our reward scheme, and thus the probability that RL will outperform the TO tends to 0. Said another way, it becomes increasingly unlikely to have multiple “lucky actions” from RL in a row during exploration as the horizon increases. Since the agent will correspondingly almost always choose the actions from the suboptimal TO, the policy network will almost always be updated with supervised learning updates and will correspondingly converge, approximately, to this same suboptimal policy.
On the other hand, by using the maximum instantaneous transition reward from a horizon of 1, there is a much higher probability of sampling a “good” action from an uninitialized random policy. This allows for more efficient exploration of the environment than by overwhelmingly following the (suboptimal) trajectory optimization actions, leading to a better overall policy (such as more aggressive throttle values, steering angles during turns, etc.), while still ensuring a reasonable worst case scenario action from the more dynamically conservative TO solution, for cases in which we sample a worse-performing action with RL. The short horizon also avoids overfitting to the suboptimal trajectory optimization expert.
In this work we have shown the benefits of combining trajectory optimization and deep reinforcement learning methods into one training process. Using these two methods cooperatively allows for online use of our algorithm at any point in the training process, knowing that the worst case scenario will be as good as a model-based trajectory optimization. This additionally leads to much greater sample efficiency, and avoids unnecessary exploration of randomly initialized policies, towards avoiding local optima.
Even if the trajectory optimization is suboptimal due to mismatching dynamics or overly conservative constraints, there is a clear advantage to incorporating prior knowledge of the system to speed up and guide learning. We also observe that trained policies, whether exclusively learned with deep reinforcement learning or from our combined method, are likely to converge to local optima and cannot exhaustively span all observation states, showing the benefit of model-based methods as a proven fail-safe option. The need to be able to put bounds on learned policies and guarantee some sort of behavior is clear, and this work presents preliminary steps in this direction.
The method detailed in this paper can be readily applied to any robotic system, and should be an effective way to reduce sampling complexity, accelerate training, guide the policy search, deploy policies online at any point in the training process, and give an upper bound estimate on time-to-goal through the trajectory optimization.
-  N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, and D. Silver, “Emergence of locomotion behaviours in rich environments,” CoRR, vol. abs/1707.02286, 2017. [Online]. Available: http://arxiv.org/abs/1707.02286
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 3389–3396.
-  T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning,” CoRR, vol. abs/1812.11103, 2018. [Online]. Available: http://arxiv.org/abs/1812.11103
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018. [Online]. Available: http://arxiv.org/abs/1804.10332
-  J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/26/eaau5872
-  S. Ross, G. J. Gordon, and J. A. Bagnell, “No-regret reductions for imitation learning and structured prediction,” CoRR, vol. abs/1011.0686, 2010. [Online]. Available: http://arxiv.org/abs/1011.0686
-  I. Mordatch and E. Todorov, “Combining the benefits of function approximation and trajectory optimization,” in In Robotics: Science and Systems (RSS), 2014.
-  S. Levine and V. Koltun, “Variational policy search via trajectory optimization,” in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’13. USA: Curran Associates Inc., 2013, pp. 207–215. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999611.2999635
——, “Guided policy search,” in Proceedings of the 30th
International Conference on Machine Learning
, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1–9. [Online]. Available:http://proceedings.mlr.press/v28/levine13.html
-  L. Xie, S. Wang, S. Rosa, A. Markham, and N. Trigoni, “Learning with training wheels: Speeding up training with a simple controller for deep reinforcement learning,” CoRR, vol. abs/1812.05027, 2018. [Online]. Available: http://arxiv.org/abs/1812.05027
-  K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan online, learn offline: Efficient learning and exploration via model-based control,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Byey7n05FQ
-  M. Posa and R. Tedrake, “Direct trajectory optimization of rigid body dynamical systems through contact,” in Algorithmic foundations of robotics X. Springer Tracts in Adv. Robotics, 2013, pp. 527–542.
-  M. Posa, S. Kuindersma, and R. Tedrake, “Optimization and stabilization of trajectories for constrained dynamical systems,” in Proc. IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 1366–1373.
-  M. Posa, C. Cantu, and R. Tedrake, “A direct method for trajectory optimization of rigid bodies through contact,” International Journal of Robotics Research, vol. 33, no. 1, pp. 69–81, 2014.
-  S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson, “Optimization-based full body control for the DARPA Robotics Challenge,” Journal of Field Robotics, vol. 32, no. 2, pp. 293–312, 2015.
-  R. S. Sutton and A. G. Barto, Reinforcement learning - an introduction, ser. Adaptive computation and machine learning. MIT Press, 1998.
-  E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2019.
-  J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” CoRR, vol. abs/1506.02438, 2015.
-  D. E. Stewart and J. C. Trinkle, “An implicit time-stepping scheme for rigid body dynamics with Coulomb friction,” in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2000, pp. 162–169. [Online]. Available: https://doi.org/10.1109/ROBOT.2000.844054
-  G. Bellegarda and K. Byl, “Trajectory optimization for a wheel-legged system for dynamic maneuvers that allow for wheel slip,” in 2019 IEEE International Conference on Decision and Control (CDC), in Press.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
-  P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Openai baselines,” https://github.com/openai/baselines, 2017.
-  P.-W. Chou, D. Maturana, and S. Scherer, “Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 834–843. [Online]. Available: http://proceedings.mlr.press/v70/chou17a.html
-  J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi – A software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
-  A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, Mar 2006. [Online]. Available: https://doi.org/10.1007/s10107-004-0559-y
-  A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 278–287. [Online]. Available: http://dl.acm.org/citation.cfm?id=645528.657613