1 Introduction
Deep reinforcement learning algorithms have recently generated great interest due to their successful application to a range of difficult problems including Computer Go [silver2016mastering] and high-dimensional control tasks such as humanoid locomotion [schulman2015trust, lillicrap2015continuous]. While these methods are extremely general and can learn policies and value functions for complex tasks directly from raw data, they can also be sample inefficient, and partially optimized solutions can be arbitrarily poor. These challenges severely restrict RL’s applicability to real systems such as robots due to data collection challenges and safety concerns.
One straightforward way to mitigate these issues is to learn a policy or value function entirely in a high-fidelity simulator [todorov2012mujoco, airsim2017fsr] and then deploy the optimized policy on the real system. However, this approach can fail due to model bias, external disturbances, subtle differences between the real robot’s hardware and its simulated counterpart, and poorly modeled phenomena such as friction and contact dynamics. Sim-to-real transfer approaches based on domain randomization [tobin2017domain, sadeghi2016cad2rl] and model ensembles [kurutach2018model, shyam2019model] aim to make the policy robust by training it to be invariant to varying dynamics. However, learning a globally consistent value function or policy is hard due to optimization issues such as local optima and covariate shift between the exploration policy used for learning the model and the actual control policy executed on the task [ross2012agnostic].
Model predictive control (MPC) is a widely used method for generating feedback controllers that repeatedly re-optimizes a finite-horizon sequence of controls using an approximate dynamics model that predicts the effect of these controls on the system. The first control in the optimized sequence is executed on the real system and the optimization is performed again from the resulting next state. However, the performance of MPC can suffer due to approximate or simplified models and a limited lookahead. Therefore, the parameters of MPC, including the model and horizon, need to be carefully tuned to obtain good performance. While using a longer horizon is generally preferred, real-time requirements may limit the amount of lookahead, and a biased model can result in compounding model errors.
In this work, we present an approach to RL that leverages the complementary properties of model-free reinforcement learning and model-based optimal control. Our proposed method views MPC as a way to simultaneously approximate and optimize a local Q function via simulation, and Q-learning as a way to improve MPC using real-world data. We focus on the paradigm of entropy-regularized reinforcement learning where the aim is to learn a stochastic policy that minimizes the cost-to-go as well as the KL divergence with respect to a prior policy. This approach enables faster convergence by mitigating the over-commitment issue in the early stages of Q-learning and better exploration [fox2015taming]. We discuss how this formulation of reinforcement learning has deep connections to information theoretic stochastic optimal control where the objective is to find control inputs that minimize the cost while staying close to the passive dynamics of the system [theodorou2012relative]. This helps in both injecting domain knowledge into the controller as well as mitigating issues caused by over-optimizing the biased estimate of the current cost due to model error and the limited horizon of optimization. We explore this connection in depth and derive an infinite horizon information theoretic model predictive control algorithm based on williams2017information. We test our approach, called Model Predictive Q-Learning (MPQ), on simulated continuous control tasks and compare it against information theoretic MPC and soft Q-learning [haarnoja2017reinforcement], where we demonstrate faster learning with fewer system interactions and better performance as compared to MPC and soft Q-learning even in the presence of sparse rewards. The learned Q function allows us to truncate the MPC planning horizon, which provides additional computational benefits. Finally, we also compare MPQ versus domain randomization (DR) on sim-to-sim tasks. We conclude that DR approaches can be sensitive to the hand-designed distributions for randomizing parameters, which causes the learned Q function to be biased and suboptimal on the true system’s parameters, whereas learning from data generated on the true system is able to overcome biases and adapt to the real dynamics.
2 Related Work
Model predictive control has a rich history in robotics, ranging from control of mobile robots such as quadrotors [desaraju2016fast] and aggressive autonomous vehicles [williams2017information, wagener2019online] to generating complex behaviors for high-dimensional systems such as contact-rich manipulation [kumar2014real, fu2016one] and humanoid locomotion [erez2013integrated]. The success of MPC can largely be attributed to online policy optimization, which helps mitigate model bias. The information theoretic view of MPC aims to find a policy at every timestep that minimizes the cost over a finite horizon as well as the KL divergence with respect to a prior policy usually specified by the system’s passive dynamics [theodorou2012relative, williams2017information]. This helps maintain exploratory behavior and avoid over-commitment to the current estimate of the cost function, which is biased due to modeling errors and a finite horizon. Sampling-based MPC algorithms [williams2017information, wagener2019online] are also highly parallelizable, enabling GPU implementations that aid with real-time control. However, efficient MPC implementations still require careful system identification and extensive amounts of manual tuning.
Deep RL methods are extremely general and can optimize neural network policies from raw sensory inputs with little knowledge of the system dynamics. Both value-based and policy-based approaches [schulman2015trust] have demonstrated excellent performance on complex control problems. These approaches, however, fall short on several accounts when applied to a real robotic system. First, they have high sample complexity, potentially requiring millions of interactions with the environment. This can be very expensive on a real robot, not least because the initial performance of the policy can be arbitrarily bad. Using random exploration methods such as ε-greedy can further aggravate this problem. Second, a value function or policy learned entirely in simulation inherits the biases of the simulator. Even if a perfect simulation is available, learning a globally consistent value function or policy is an extremely hard task as noted in [silver2016mastering, zhong2013value]. This can be attributed to local optima when using neural network representations or the inherent biases in the Q-learning update rules [fox2015taming, van2016deep]. In fact, it can be difficult to explain why Q-learning algorithms work or fail [schulman2017equivalence].
Domain randomization aims to make policies learned in simulation more robust by randomizing simulation parameters during training with the aim of making the policies invariant to potential parameter error [sadeghi2016cad2rl, tobin2017domain, peng2018sim]. However, these policies are not adaptive to unmodelled effects, i.e., they take into account only aleatoric and not epistemic uncertainty. Also, such approaches are highly sensitive to the hand-designed distributions used for randomizing simulation parameters and can be highly suboptimal on the real system’s parameters, for example, if a very large range of simulation parameters is used. Model-based approaches aim to use real data to improve the model of the system and then perform reinforcement learning or optimal control using the new model or ensemble of models [kurutach2018model, shyam2019model, ross2012agnostic]. Although learning accurate models is a promising avenue, we argue that learning a globally consistent model is an extremely hard problem and instead we should learn a policy that can rapidly adapt to experienced real-world dynamics.
The use of entropy regularization has been explored in RL and Inverse RL for its better sample efficiency and exploration properties [ziebart2008maximum, fox2015taming, haarnoja2018soft, haarnoja2017reinforcement, schulman2017equivalence]. This framework allows incorporating prior knowledge into the problem and learning multimodal policies that can generalize across different tasks. fox2015taming analyze the theoretical properties of the update rule derived using mutual information minimization and show that this framework can overcome the overestimation issue inherent in the vanilla Q-learning update. In the past, todorov2009efficient have shown that using KL divergence can convert the optimal control problem into one that is linearly solvable.
Infinite horizon MPC aims to learn a terminal cost function that can add global information to the finite horizon optimization. [rosolia2017learning] learn a terminal cost as a control Lyapunov function and a safety set for the terminal state. These quantities are calculated using all previously visited states and they assume the presence of a controller that can deterministically drive any state to the goal. [tamar2017learning] learns a cost shaping to make a short horizon MPC mimic the actions produced by long horizon MPC offline. However, since their approach is to mimic a longer horizon MPC, the performance of the learner is fundamentally limited by the performance of the longer horizon MPC. In contrast, learning an optimal value function as the terminal cost can potentially lead to close to optimal performance.
Using local optimization is an effective way of improving an imperfect value function as noted in RL literature by [silver2016mastering, silver2017mastering, sun2018truncated, lowrey2018plan, anthony2017thinking]. However, these approaches assume that a perfect model of the system is available. In order to make the policy work on the real system, we argue that it is essential to learn a value function from real data and utilize local optimization to stabilize learning.
3 Preliminaries
We first develop relevant notation and introduce the entropy-regularized RL and information theoretic MPC frameworks. We show that they are complementary approaches to solving a similar problem.
3.1 Reinforcement Learning with Entropy Regularization
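A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, c, P, \gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $c$ is a one step cost function, $P$ is the transition function and $\gamma$ is a discount factor. A closed-loop policy $\pi(\cdot \mid s)$ is a distribution over actions given state. Given a policy $\pi$ and a prior policy $\bar{\pi}$, the KL divergence between them at a state $s_t$ is given by $\mathrm{KL}(\pi(\cdot \mid s_t) \,\|\, \bar{\pi}(\cdot \mid s_t)) = \mathbb{E}_{\pi}\left[\log\left(\pi(a_t \mid s_t) / \bar{\pi}(a_t \mid s_t)\right)\right]$. Entropy-regularized RL [fox2015taming] aims to optimize the objective
$$\pi^{*} = \arg\min_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(c_t + \lambda \mathrm{KL}_t\right)\right] \quad (1)$$
where $c_t$ and $\mathrm{KL}_t$ are shorthand for $c(s_t, a_t)$ and $\mathrm{KL}(\pi(\cdot \mid s_t) \,\|\, \bar{\pi}(\cdot \mid s_t))$ respectively, and $\lambda$ is a temperature parameter that penalizes deviation of $\pi$ from $\bar{\pi}$. Given $\pi$, we can define the soft value functions as
$$V^{\pi}(s_t) = \lambda \mathrm{KL}_t + \mathbb{E}_{\pi}\left[Q^{\pi}(s_t, a_t)\right], \qquad Q^{\pi}(s_t, a_t) = c_t + \gamma\, \mathbb{E}_{s_{t+1}}\left[V^{\pi}(s_{t+1})\right] \quad (2)$$
Footnote 1: In this work we consider costs instead of rewards and hence aim to find policies that minimize cumulative cost-to-go.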
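As a concrete check of these definitions, the sketch below evaluates a one-step soft value of the form V = λ·KL(π‖prior) + E_π[Q] for a single state with three discrete actions. The Q values, prior, and temperature are made-up numbers and the function names are illustrative, not taken from this work. It compares the prior policy with the Boltzmann policy proportional to prior × exp(−Q/λ), which is the standard minimizer of this one-step objective, and confirms numerically that the latter attains the value −λ log E_prior[exp(−Q/λ)].

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_value(policy, prior, q_values, lam):
    """One-step soft value: lam * KL(policy || prior) + E_policy[Q]."""
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    return lam * kl(policy, prior) + expected_q

def boltzmann_policy(prior, q_values, lam):
    """Policy proportional to prior * exp(-Q / lam) (costs, so lower Q is better)."""
    weights = [p * math.exp(-q / lam) for p, q in zip(prior, q_values)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical numbers for one state with three actions.
prior = [0.25, 0.25, 0.5]
q_values = [1.0, 2.0, 0.5]
lam = 0.5

pi_star = boltzmann_policy(prior, q_values, lam)
v_star = soft_value(pi_star, prior, q_values, lam)
v_prior = soft_value(prior, prior, q_values, lam)
free_energy = -lam * math.log(
    sum(p * math.exp(-q / lam) for p, q in zip(prior, q_values))
)
```

Here `v_star` matches `free_energy` up to floating-point error and is no larger than `v_prior`, which illustrates why the free energy appears later as the value of the optimal distribution.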
Given a horizon of H timesteps, we can use the above definitions to write the value functions as
(3) 
It is straightforward to verify that . The objective in Eq. (1) can equivalently be written as
(4) 
The above optimization can be performed either by policy gradient methods that aim to find the optimal policy via stochastic gradient descent [schulman2017equivalence] or value-based methods that try to iteratively approximate the value function of the optimal policy. In either case, the output of solving the above optimization is a global closed-loop control policy.
3.2 Information Theoretic MPC
Solving the above optimization can be prohibitively expensive and hard to accomplish online, i.e., at every time step as the system executes, especially when using complex policy classes like deep neural networks. In contrast to this approach, MPC performs online optimization of a simple policy class with a truncated horizon. To achieve this, MPC algorithms such as Model Predictive Path Integral Control (MPPI) [williams2017information] use an approximate dynamics model, which can be a deterministic simulator such as MuJoCo [todorov2012mujoco]. At timestep t, starting from the current state, an open-loop sequence of actions is sampled from the control distribution. The objective is to find an optimal sequence of actions to solve
(5)  
where is a terminal cost function and is the passive dynamics of the system, i.e., the distribution over actions produced when the control input is zero. The first action in the sequence is then executed on the system and the optimization is performed again from the resulting next state, effectively resulting in a closed-loop controller. The re-optimization and entropy regularization help mitigate the effects of model bias and optimization inaccuracies by avoiding over-commitment to the current estimate of the cost. A shortcoming of the MPC procedure is the finite horizon. This is especially pronounced in tasks with sparse rewards, where a short horizon can make the agent myopic to future rewards. In order to mitigate this, an approach known as infinite horizon MPC sets the terminal cost as a value function that adds global information to the problem.
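The receding-horizon procedure described above can be sketched in a few lines. The example below is a minimal illustration and not the MPPI implementation used in this work: the 1-D integrator `step`, the quadratic `cost`, the random-shooting planner, and all parameter values are assumptions chosen only to make the loop runnable.

```python
import random

def step(state, action):
    # Hypothetical dynamics: 1-D integrator, used as both model and "real" system.
    return state + 0.1 * action

def cost(state, action):
    # Hypothetical quadratic cost: drive the state to zero with small actions.
    return state ** 2 + 0.01 * action ** 2

def plan(state, horizon, num_samples, rng):
    """Random-shooting planner: return the lowest-cost sampled action sequence."""
    best_seq, best_cost = None, float("inf")
    for _ in range(num_samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            total += cost(s, a)
            s = step(s, a)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq

def run_mpc(initial_state, num_steps=50, horizon=5, num_samples=64, seed=0):
    """Closed-loop control: plan, execute the first action, then replan."""
    rng = random.Random(seed)
    state = initial_state
    for _ in range(num_steps):
        seq = plan(state, horizon, num_samples, rng)
        state = step(state, seq[0])  # only the first action is executed
    return state

final_state = run_mpc(initial_state=1.0)
```

Because only the first action of each plan is executed, errors in the remainder of the plan are corrected at the next replanning step, which is the feedback property the text relies on.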
Having introduced the fundamental concepts, in the next section we develop our approach to combine entropy regularized RL with information theoretic MPC and derive the MPPI update rule from williams2017information for the infinite horizon case.
4 Approach
Infinite-horizon MPC [zhong2013value] replaces the terminal cost by a value function to add global information to the finite-horizon optimization. We focus on MPPI [williams2017information], show that it implicitly optimizes an upper bound on the entropy-regularized objective, and derive the infinite horizon update rule. We start by deriving the expression for the optimal policy, which is intractable to sample from, and then derive a scheme to iteratively approximate it with a simple policy class similar to williams2017information. Unlike previous approaches, we argue that learning a value function from real system parameters is necessary to mitigate the effects of model error.
4.1 Optimal H-step Boltzmann Distribution
Let and be the joint open-loop control distribution and prior over H-horizon open-loop actions respectively and is shorthand for . Since is deterministic, the following holds
where the final inequality results from replacing the product of marginals by the joint distributions. Now, consider the following distribution over H-horizon action sequences
(8) 
where is a normalizing constant given by
(9) 
We show that this is the optimal control distribution as . Substituting Eq. (8) in (4.1)
Since is a constant, we have . Hence for in Eq. (8), the soft value function is a constant with gradient zero and is given by
(10) 
which is often referred to in the optimal control literature as the “free energy” of the system [theodorou2012relative, williams2017information]. For H=1, Eq. (10) takes the form of the soft value function from [fox2015taming, haarnoja2018soft]. We note that the inequality in Eq. (4.1) implies that the optimal distribution only optimizes an upper bound to the entropy-regularized objective. This provides the insight that optimal control algorithms such as MPPI that use this distribution have a fundamental performance limit. We wish to further investigate this in future work.
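To build intuition for the free energy, the sketch below estimates −λ log E[exp(−S/λ)] from a handful of equally weighted samples, assuming the samples are trajectory costs S drawn under the prior. The cost values and the function name `free_energy` are hypothetical. The quantity behaves as a soft minimum: for small λ it approaches the lowest sampled cost, and for large λ it approaches the average cost.

```python
import math

def free_energy(costs, lam):
    """Estimate -lam * log(mean(exp(-cost / lam))) with a log-sum-exp shift."""
    scaled = [-c / lam for c in costs]
    m = max(scaled)  # shift by the largest exponent to avoid overflow
    log_mean = m + math.log(sum(math.exp(x - m) for x in scaled) / len(scaled))
    return -lam * log_mean

# Hypothetical costs of three sampled trajectories.
trajectory_costs = [1.0, 2.0, 4.0]

cold = free_energy(trajectory_costs, lam=1e-3)  # close to min(costs)
warm = free_energy(trajectory_costs, lam=1.0)   # between min and mean
hot = free_energy(trajectory_costs, lam=1e3)    # close to mean(costs)
```

The temperature therefore controls how strongly the estimate commits to the best sampled trajectory, which is the over-commitment trade-off discussed in the text.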
4.2 Infinite Horizon MPPI Update Rule
Similar to [williams2017information], we derive the MPPI update rule which is used for online policy optimization. Since sampling actions from the optimal control distribution in Eq. (8) is intractable, we consider control policies which are easy to sample from. We then optimize for a vector of control inputs, such that the resulting action distribution minimizes the KL divergence with the optimal policy
(11)
The objective can be expanded out as
(12) 
Since the first term does not depend on the control input, we can remove it from the optimization
(13) 
Consider to be multivariate Gaussians over sequence of the controls with constant covariance at each timestep. We can write the control distribution and prior as follows
(14) 
where and are the control inputs and actions respectively at timestep and is the normalizing constant. Here the prior corresponds to the passive dynamics of the system [theodorou2012relative, williams2017information], although other choices of prior are possible. Substituting in Eq. (13) we get
(15) 
The objective can be simplified to the following by integrating out the probability in the first term
(16) 
Since this is a concave function with respect to every , we can find the maximum by setting its gradient with respect to to zero to solve for optimal
(17) 
where the second equality comes from importance sampling to convert the optimal controls into an expectation over the control distribution instead of the optimal distribution which is impossible to sample from. The importance weight can be written as follows (substituting from Eq. (8))
(18) 
Making a change of variables for the noise sequence sampled from independent Gaussians with zero mean and covariance we get
(19) 
Note that is the optimal H-step free energy derived in Eq. (10) and can be estimated from Monte Carlo samples as
(20) 
We can form the following iterative update rule where at every iteration the sampled control sequence is updated according to
(21) 
where is a step-size parameter as proposed by [wagener2019online]. This gives us the infinite horizon MPPI update rule. For H=1, this corresponds to soft Q-learning where stochastic optimization is performed to solve for the optimal action online. We now develop a soft Q-learning algorithm that utilizes infinite horizon MPPI to generate actions as well as Q targets.
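The iterative update can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact rule in Eq. (21): controls are scalar at each timestep, the weights use only exponentiated rollout costs (any control-cost correction term in the importance weight is omitted), and `toy_cost`, the parameter defaults, and the function names are hypothetical. The blend with step size `alpha` follows the idea attributed above to [wagener2019online].

```python
import math
import random

def mppi_update(mean, rollout_cost, lam=0.15, alpha=0.5, sigma=0.5,
                num_samples=64, rng=None):
    """One simplified MPPI-style iteration on a scalar control sequence."""
    rng = rng or random.Random(0)
    # Perturb the current mean sequence with zero-mean Gaussian noise.
    samples = [[u + rng.gauss(0.0, sigma) for u in mean]
               for _ in range(num_samples)]
    costs = [rollout_cost(seq) for seq in samples]
    # Exponentiated negative costs, shifted by the minimum for stability.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of the sampled controls, blended with the old mean.
    new_mean = []
    for l, u in enumerate(mean):
        avg = sum(w * seq[l] for w, seq in zip(weights, samples))
        new_mean.append((1.0 - alpha) * u + alpha * avg)
    return new_mean

def toy_cost(seq):
    # Hypothetical rollout cost: squared distance of each control from 1.0.
    return sum((a - 1.0) ** 2 for a in seq)

rng = random.Random(0)
mean = [0.0] * 4
initial_cost = toy_cost(mean)
for _ in range(20):
    mean = mppi_update(mean, toy_cost, rng=rng)
final_cost = toy_cost(mean)
```

In a full implementation `rollout_cost` would simulate the dynamics model and add the terminal Q value; the toy cost here only makes the weighting and averaging steps easy to inspect.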
4.3 Information Theoretic Model Predictive QLearning Algorithm
We consider Q functions parameterized by denoted by and update parameters by stochastic gradient descent on the loss for a batch of experience tuples sampled from a replay buffer [mnih2015human] where targets are given by
(22) 
Since the value function updates are performed offline, we can utilize large amounts of computation [tamar2017learning] to calculate . We do so by performing multiple iterations of the infinite horizon MPPI update in Eq. (21) from , which allows for directed exploration and better approximation of the free energy (akin to approaches such as Covariance Matrix Adaptation, although MPPI does not adapt the covariance). This helps in the early stages of learning by providing better quality targets than a random Q function. Intuitively, this update rule leverages the biased dynamics model for steps and a soft Q function at the end learned from interactions with the real system.
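A hedged sketch of how such a target might be computed is given below. It assumes the target is the observed one-step cost plus the discounted free-energy estimate at the next state, and that each sampled sequence of H actions is scored by discounted model costs for the first H−1 actions plus a discounted Q value for the last state-action pair, so that H=1 reduces to a soft Q-learning target as noted earlier. The exact form of Eq. (22) may differ; `model`, `cost`, `q_fn`, and both function names are hypothetical.

```python
import math

def rollout_cost(state, actions, model, cost, q_fn, gamma):
    """Discounted model costs for all but the last action, plus terminal Q."""
    total, s = 0.0, state
    for l, a in enumerate(actions[:-1]):
        total += (gamma ** l) * cost(s, a)
        s = model(s, a)
    horizon = len(actions)
    return total + (gamma ** (horizon - 1)) * q_fn(s, actions[-1])

def soft_q_target(cost_now, next_state, action_samples, model, cost, q_fn,
                  gamma, lam):
    """One-step cost plus discounted free-energy estimate at the next state."""
    costs = [rollout_cost(next_state, seq, model, cost, q_fn, gamma)
             for seq in action_samples]
    m = min(costs)  # shift for numerical stability
    mean_exp = sum(math.exp(-(c - m) / lam) for c in costs) / len(costs)
    free_energy = m - lam * math.log(mean_exp)
    return cost_now + gamma * free_energy

# Toy usage with hypothetical dynamics, cost, and an all-zero Q function.
toy_model = lambda s, a: s + a
toy_cost = lambda s, a: s ** 2
zero_q = lambda s, a: 0.0
target = soft_q_target(0.5, 1.0, [[0.0, 0.0]], toy_model, toy_cost, zero_q,
                       gamma=0.9, lam=0.1)
```

In practice the action samples would come from several iterations of the MPPI update started at the next state rather than from a fixed list.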
At every timestep during online rollouts, an H-horizon sequence of actions is optimized using infinite horizon MPPI and the first action is executed on the system. Online optimization with predictive models can look ahead to produce better actions, making ad hoc exploration strategies such as ε-greedy unnecessary. Using predictive models for generating value targets and online policy optimization helps accelerate convergence, as we demonstrate in our experiments in the next section. Algorithm 1 shows the complete MPQ algorithm. A closely related approach in the literature is POLO [lowrey2018plan], which also uses MPPI and offline value function learning. However, POLO assumes access to the true dynamics and does not explore the connection between MPPI and entropy-regularized RL, and thus does not use free energy targets.
5 Experiments
We evaluate the efficacy of MPQ on two fronts: (a) overcoming the shortcomings of both stochastic optimal control and model-free RL in terms of computational requirements, model bias, and sample efficiency; and (b) learning effective policies on systems for which accurate models are not known.
5.1 Experimental Setup
We focus on sim-to-sim continuous control tasks using the MuJoCo simulator [todorov2012mujoco] (except PendulumSwingup, which uses dynamics equations) to study the properties of our algorithm in a controlled manner. We consider robotics-inspired tasks with either sparse rewards or requiring long-horizon planning. The complexity is further aggravated as the agent is not provided with the true dynamics parameters, but rather a uniform distribution over them with a biased mean and added noise. Details of the tasks considered are as follows:


PendulumSwingup: the agent tries to swing up and stabilize a pendulum by applying torque on the hinge given a biased distribution over its mass and length. The cost penalizes the deviation from the upright position and angular velocity. The initial state is randomized after every episode of 10 s.

BallInCupSparse: a sparse version of the task from the DeepMind Control Suite [deepmindcontrolsuite2018]. Given a cup and ball attached by a tendon, the goal is to swing and catch the ball. The agent controls motors on the two slide joints on the cup and is provided with a biased distribution over the ball’s mass, moment of inertia and tendon stiffness. A cost of 1 is incurred at every timestep unless the ball is in the cup, in which case the cost is 0; this corresponds to success. The position of the ball is randomized after every episode, which is 4 seconds long.

FetchPushBlock: proposed by 1802.09464, the agent controls the Cartesian position and opening of a Fetch robot gripper to push a block to a goal location. The cost is the distance between the center of mass of the block and the goal. We provide the agent a biased distribution over the mass, moment of inertia, friction coefficients and size of the object. An episode is successful if the agent gets the block within 5 cm of the goal in 4 seconds. The positions of both block and goal are randomized after every episode.

FrankaDrawerOpen: based on a real-world manipulation problem from [chebotar2019closing] where the agent uses velocity control on a 7-DOF Franka Panda arm to open a cabinet drawer. A simple cost function based on the Euclidean distance and relative orientation of the end effector with respect to the handle and the displacement of the slide joint on the drawer is used. A biased distribution over damping and frictionloss of robot and drawer joints is provided. Every episode is 4 seconds long, after which the arm’s start configuration is randomized. Success corresponds to opening the drawer within 1 cm of a target displacement.
We used BallInCupSparse, FetchPushBlock and FrankaDrawerOpen because they are more realistic proxies for real-world robotics tasks as compared to standard OpenAI Gym [1606.01540] baselines such as Ant and HalfCheetah. The parameters we randomize are reasonable in real-world scenarios, as estimating moments of inertia and friction coefficients is especially error prone. Details of default parameters and randomization distributions are in Table 1. All experiments were performed on a desktop with 12 Intel Core i7-3930K @ 3.20GHz CPUs and 32 GB RAM with only a few hours of CPU training. Q functions are parameterized with feedforward neural networks that take as input an observation vector and action. Refer to A.1 for a detailed explanation of the tasks.
Table 1: Cost functions, true parameters, and biased parameter distributions for each task.
Environment       Cost Function             True Parameters                    Biased Distribution
PendulumSwingup
BallInCupSparse   0 if ball in cup, 1 else
FetchPushBlock
FrankaDrawerOpen                            frictionloss = 0.1, damping = 0.1
5.2 Analysis of Overall Performance
By learning a terminal value function from real data we posit that MPQ will adapt to true system dynamics and truncate the horizon of MPC by adding global information. Using MPC for Q targets, we also expect to be able to learn with significantly less data as compared to model-free soft Q-learning. Hence, we compare MPQ with the following natural baselines: MPPI using the same horizon as MPQ and no terminal value function, MPPI using a longer horizon, and SoftQLearning with target networks. Note that MPQ does not use a target network. We do not compare against model-based RL methods [kurutach2018model, chua2018deep] as learning globally consistent neural network models adds an additional layer of complexity and is beyond the scope of this work. Note that MPQ is a complementary approach to model learning and one can benefit from the other. We make the following observations:
O 1.
MPQ can truncate the planning horizon leading to computational efficiency over MPPI.
Fig. 1 shows that MPQ outperforms MPPI with the same horizon after only a few training episodes and ultimately performs better than MPPI with a much longer horizon. This phenomenon can be attributed to: (1) global information encapsulated in the Q function; (2) hardness of optimizing longer sequences; and (3) compounding model error in longer horizon rollouts [venkatraman2015improving]. In FetchPushBlock, MPPI with a short horizon (H=10) is unable to reach close to the block, whereas MPQ with H=10 is able to outperform MPPI with H=64 within the first 30 episodes of training, i.e., roughly 2 minutes of interaction with true simulation parameters. In the high-dimensional FrankaDrawerOpen, MPQ with H=10 achieves a success rate 5 times that of MPPI with H=10, and outperforms MPPI with H=64 within a few minutes of interaction. We additionally examine the effects of changing the MPC horizon during training and present the results in A.2.
O 2.
MPQ mitigates the effects of model bias through a combination of MPC, entropy regularization and a Q function learned from the true system.
Fig. 1 shows that MPQ with short horizon achieves performance close to, or better than, MPPI with access to true dynamics and a longer horizon (dashed gray line) in all tasks.
O 3.
Using MPC provides stable Q targets leading to sample efficiency over SoftQLearning.
In BallInCupSparse, FetchPushBlock and FrankaDrawerOpen, SoftQLearning does not converge to a consistent policy, whereas MPQ achieves good performance within a few minutes of interaction with true system parameters.
Case Study: Learning Policies for Systems With Inaccurate Models
Domain Randomization (DR) aims to make a policy learned in simulation robust by randomizing the simulation parameters. However, such policies can be suboptimal with respect to the true parameters due to bias in the randomization distribution.
Q 1.
Can a Q function learned using rollouts on a real system overcome model bias and perform better than DR?
We compare MPQ against a DR approach inspired by peng2018sim where simulated rollouts are generated by sampling different parameters at every timestep from a broad distribution shown in Table 1 whereas real system rollouts use the true parameters. Table 2 demonstrates that a Q function learned using DR with simulated rollouts only is unable to generalize to the true parameters during testing and MPQ has over twice the success rate in BallInCupSparse and thrice in FrankaDrawerOpen. Note that MPC always uses simulated rollouts, the difference is whether the data for learning the Q function is generated using biased simulation (DR) or true parameters.
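The per-timestep resampling used for the DR rollouts can be sketched as follows. This is a minimal illustration, assuming independent uniform ranges for each parameter; the ranges, the point-mass dynamics in `toy_step`, and all function names are hypothetical and are not the values or simulators used in Table 1.

```python
import random

# Hypothetical parameter ranges (low, high); not the values used in Table 1.
BIASED_RANGES = {"mass": (0.5, 1.5), "length": (0.5, 1.5)}

def sample_params(ranges, rng):
    """Draw one set of dynamics parameters from independent uniform ranges."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

def randomized_rollout(state, actions, step_fn, ranges, rng):
    """Simulate a rollout, resampling the dynamics parameters at every timestep."""
    states = [state]
    for action in actions:
        params = sample_params(ranges, rng)
        state = step_fn(state, action, params)
        states.append(state)
    return states

def toy_step(velocity, force, params):
    # Hypothetical dynamics: velocity update of a point mass over a 0.1 s step.
    return velocity + 0.1 * force / params["mass"]

rng = random.Random(0)
states = randomized_rollout(0.0, [1.0] * 10, toy_step, BIASED_RANGES, rng)
```

If the sampled range is broad or its mean is biased, the data used to fit the Q function reflects dynamics that the true system never exhibits, which is the failure mode the comparison above examines.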
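Table 2: Average success rate when the Q function is learned from rollouts with true parameters (REAL) versus domain randomization (DR).
Task (training episodes)  Agent        Avg. success rate
BallInCupSparse (350)     MPQH4REAL    0.85
BallInCupSparse (350)     MPQH4DR      0.41
BallInCupSparse (350)     MPQH1REAL    0.09
BallInCupSparse (350)     MPQH1DR      0.06
FrankaDrawerOpen (200)    MPQH10REAL   0.53
FrankaDrawerOpen (200)    MPQH10DR     0.17
FrankaDrawerOpen (200)    MPQH1REAL    0.0
FrankaDrawerOpen (200)    MPQH1DR      0.0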
6 Discussion
In this work we have presented a theoretical connection between information theoretic MPC and entropy-regularized RL that naturally provides an algorithm to leverage the benefits of both. The theoretical insight not only ties together the different fields, but also opens avenues to designing pragmatic RL algorithms for real-world systems. While the approach is effective on a range of tasks, some important questions are yet to be answered. First, the optimal horizon for MPC is inextricably tied with the model error and optimization artifacts. Investigating this dependence in a principled manner is important for real-world applications. Another interesting avenue of research is characterizing the quality of a parameterized Q function to adapt the horizon of MPC rollouts for smarter exploration.
References
Appendix A
A.1 Further Experimental Details
The learned Q function takes as input the current action and an observation vector per task:


PendulumSwingup: (3 dim)

BallInCupSparse: [] (12 dim) where is angle of line joining ball and target.

FetchPushBlock: [] (33 dim)

FrankaDrawerOpen: [] (39 dim)
For all our experiments we parameterize Q functions with feedforward neural networks with two layers containing 100 units each and activation. We use Adam [kingma2014adam] optimization with a learning rate of 0.001. For generating value function targets in Eq. (22), we use 3 iterations of MPPI optimization except 1 for FrankaDrawerOpen. The MPPI parameters used are listed in Table 3.
Environment       Cost function             Samples
PendulumSwingup                             24       4.0  0.15  0.5   0.9
BallInCupSparse   0 if ball in cup, 1 else  36       4.0  0.15  0.55  0.9
FetchPushBlock                              36       3.0  0.01  0.5   0.9
FrankaDrawerOpen                            36       4.0  0.05  0.55  0.9
A.2 Effect of MPC Horizon on Performance
The horizon of optimization is critical to the performance of MPC algorithms. To test the effect of the optimization horizon on training performance, we run MPQ for different values of H and compare against MPPI without a terminal value function. The results in Fig. 2 provide the following key takeaway:
O 4.
A longer optimization horizon can improve performance but also suffers greatly from model bias and optimization issues.
In BallInCupSparse, where a very broad range of dynamics parameters is used (see Table 1), MPQ with H=4 has the best performance, which subsequently degrades with increasing H. FetchPushBlock and FrankaDrawerOpen show a trend where performance initially improves with increasing horizon but starts to degrade after a certain point due to compounding effects of model bias. This phenomenon indicates that finding the sweet spot for the horizon is an interesting research direction which we wish to pursue thoroughly in the future.