1 Introduction
Model-free Reinforcement Learning (RL) is increasingly used in challenging sequential decision-making problems, including high-dimensional robotics control tasks (Haarnoja et al., 2018; Schulman et al., 2017) as well as video and board games (Silver et al., 2016, 2017). While these approaches are extremely general and can in principle solve complex problems with little prior knowledge, they also typically require a large quantity of training data to succeed. In robotics and engineering domains, data may be collected from real-world interaction, a process that can be dangerous, time-consuming, and expensive.
Model-Predictive Control (MPC) offers a simpler, more practical alternative. While RL typically uses data to learn a global model offline, which is then deployed at test time, MPC solves for a policy online by optimizing an approximate model for a finite horizon at a given state. This policy is executed for a single timestep and the process repeats. MPC is one of the most popular approaches for controlling complex, safety-critical systems such as autonomous helicopters (Abbeel et al., 2010), aggressive off-road vehicles (Williams et al., 2016), and humanoid robots (Erez et al., 2013), owing to its ability to use approximate models to optimize complex cost functions with nonlinear constraints (Mayne et al., 2000, 2011).
However, approximations in the model used by MPC can significantly limit performance. Specifically, model bias may result in persistent errors that eventually compound and become catastrophic. For example, in non-prehensile manipulation, practitioners often use a simple quasi-static model that assumes an object does not roll or slide away when pushed. For more dynamic objects, this can lead to aggressive pushing policies that perpetually overcorrect, eventually driving the object off the surface.
Recently, there have been several attempts to combine MPC with model-free RL, showing that the combination can improve over either approach alone. Many of these methods use RL to learn a terminal cost function, thereby increasing the effective horizon of MPC (Zhong et al., 2013; Lowrey et al., 2018; Bhardwaj et al., 2020). However, the learned value function is only applied at the end of the MPC horizon, so model errors within the horizon still persist, leading to suboptimal policies. Similar approaches have also been applied to great effect in discrete games with known models (Silver et al., 2016, 2017; Anthony et al., 2017), where value functions and policies learned via model-free RL are used to guide Monte-Carlo Tree Search. In this paper, we focus on a somewhat broader question: can machine learning be used to both increase the effective horizon of MPC and correct for model bias?
One straightforward approach is to learn (or correct) the MPC model from real data encountered during execution; however, there are practical barriers to this strategy. Hand-constructed models are often crude approximations of reality and lack the expressivity to represent the encountered dynamics. Moreover, increasing the complexity of such models leads to computationally expensive updates that can harm MPC's online performance. Model-based RL approaches such as Chua et al. (2018); Nagabandi et al. (2018); Shyam et al. (2019) aim to learn general neural network models directly from data. However, learning globally consistent models is an exceptionally hard task due to issues such as covariate shift (Ross and Bagnell, 2012).

We propose a framework, MPQ(λ), for weaving together MPC with learned value estimates to trade off errors in the MPC model and approximation error in a learned value function. Our key insight is to view MPC as tracing out a series of local Q-function approximations. We can then blend each of these Q-functions with value estimates from reinforcement learning. We show that by using a blending parameter λ, similar to the trace decay parameter in TD(λ), we can systematically trade off errors between these two sources. Moreover, by smoothly decaying λ over learning episodes, we can achieve the best of both worlds: a policy can depend on a prior model before it has encountered any data and then gradually become more reliant on learned value estimates as it gains experience.
To summarize, our key contributions are:
1. A framework that unifies MPC and model-free RL through value function approximation.
2. A theoretical analysis of finite-horizon planning with approximate models and value functions.
3. An empirical evaluation on challenging manipulation problems with varying degrees of model bias.
2 Preliminaries
2.1 Reinforcement Learning
We consider an agent acting in an infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined by a tuple M = (S, A, c, P, γ, μ), where S is the state space, A is the action space, c(s, a) is the per-step cost function, P(s′ | s, a) is the stochastic transition dynamics, γ ∈ (0, 1) is the discount factor, and μ is a distribution over initial states. A closed-loop policy π(· | s) outputs a distribution over actions given a state. Let ρ_π be the distribution over state-action trajectories obtained by running policy π on M. The value function for a given policy π is defined as V^π(s) = E_{ρ_π}[ Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s ], and the action-value function as Q^π(s, a) = c(s, a) + γ E_{s′∼P}[ V^π(s′) ]. The objective is to find an optimal policy π* = argmin_π E_{s∼μ}[ V^π(s) ]. We can also define the (dis)advantage function A^π(s, a) = Q^π(s, a) − V^π(s), which measures how good an action is compared to the action taken by the policy in expectation. It can be equivalently expressed in terms of the Bellman error as A^π(s, a) = c(s, a) + γ E_{s′∼P}[ V^π(s′) ] − V^π(s).

2.2 Model-Predictive Control
MPC is a widely used technique for synthesizing closedloop policies for MDPs. Instead of trying to solve for a single, globally optimal policy, MPC follows a more pragmatic approach of optimizing simple, local policies online. At every timestep on the system, MPC uses an approximate model of the environment to search for a parameterized policy that minimizes cost over a finite horizon. An action is sampled from the policy and executed on the system. The process is then repeated from the next state, often by warmstarting the optimization from the previous solution.
We formalize this process as solving a simpler surrogate MDP M̂ online, which differs from M by using an approximate cost function ĉ, approximate transition dynamics P̂, and a limited horizon H. Since it plans to a finite horizon, it is also common to use a terminal state-action value function Q̂ that estimates the cost-to-go. The start state distribution is a Dirac-delta function centered on the current state s_t. MPC can be viewed as iteratively constructing an estimate Q̂_H of the Q-function of the original MDP M, given policy π_θ at state s_t:
$$\hat{Q}_H(s,a) = \mathbb{E}_{\hat{P},\,\pi_\theta}\Big[\sum_{i=0}^{H-1} \gamma^i\, \hat{c}(s_i,a_i) + \gamma^H \hat{Q}(s_H,a_H)\Big],\qquad s_0 = s,\; a_0 = a \tag{1}$$
MPC then iteratively optimizes this estimate (at the current system state s_t) to update the policy parameters:

$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{a \sim \pi_\theta}\big[\hat{Q}_H(s_t, a)\big] \tag{2}$$
Alternatively, we can view the above procedure from the perspective of disadvantage minimization. Let us define an estimator for the 1-step disadvantage with respect to the potential function Q̂ as Â(s_i, a_i) = ĉ(s_i, a_i) + γ Q̂(s_{i+1}, a_{i+1}) − Q̂(s_i, a_i). We can then equivalently write the above optimization as minimizing the discounted sum of disadvantages over time via the telescoping-sum trick:

$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}\Big[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i\, \hat{A}(s_i,a_i)\Big] \tag{3}$$
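The equivalence rests on a telescoping identity; assuming the H-horizon estimate takes the rollout form in (1) and Â is the 1-step disadvantage defined above, the argument (reconstructed here from the surrounding definitions) is:

```latex
\begin{aligned}
\sum_{i=0}^{H-1}\gamma^i \hat{c}(s_i,a_i) + \gamma^H \hat{Q}(s_H,a_H)
&= \hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i\Big(\hat{c}(s_i,a_i)
   + \gamma\, \hat{Q}(s_{i+1},a_{i+1}) - \hat{Q}(s_i,a_i)\Big)\\
&= \hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i\, \hat{A}(s_i,a_i),
\end{aligned}
```

since the intermediate Q̂ terms cancel in pairs, leaving only the initial value and the discounted terminal value.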
Although the above formulation queries Q̂ at every timestep, it is still exactly equivalent to the original problem and hence does not mitigate the effects of model bias. In the next section, we build a concrete method to address this issue by formulating a novel way to blend Q-estimates from MPC and a learned value function that can balance their respective errors.
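As a concrete sketch (not the authors' implementation), the H-horizon estimate and its telescoped form can be computed from a deterministic model rollout; the dynamics, cost, and terminal Q-function below are toy stand-ins:

```python
def mpc_q_estimate(s, actions, dynamics, cost, q_term, gamma):
    """H-horizon MPC estimate (deterministic rollout): discounted sum of
    model costs over H steps plus a discounted terminal Q-value.
    `actions` has H+1 entries; the last one only queries q_term."""
    total, disc = 0.0, 1.0
    for a in actions[:-1]:
        total += disc * cost(s, a)
        s = dynamics(s, a)
        disc *= gamma
    return total + disc * q_term(s, actions[-1])


def mpc_q_telescoped(s, actions, dynamics, cost, q_term, gamma):
    """Equivalent form via the telescoping-sum trick: the Q estimate at the
    start plus the discounted sum of 1-step disadvantages."""
    states = [s]
    for a in actions[:-1]:
        s = dynamics(s, a)
        states.append(s)
    total = q_term(states[0], actions[0])
    for i in range(len(actions) - 1):
        total += gamma**i * (cost(states[i], actions[i])
                             + gamma * q_term(states[i + 1], actions[i + 1])
                             - q_term(states[i], actions[i]))
    return total
```

For any toy model and value function the two routines return the same number, mirroring the exact equivalence noted above.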
3 Mitigating Bias in MPC via Reinforcement Learning
In this section, we develop our approach to systematically deal with model bias in MPC by blending in learned value estimates. First, we take a closer look at the different sources of error in the estimate in (1) and then propose an easy-to-implement, yet effective strategy for trading them off.
3.1 Sources of Error in MPC
The performance of MPC algorithms critically depends on the quality of the Q-function estimator in (1). There are three major sources of approximation error. First, model bias can cause compounding errors in predicted state trajectories, which biases the estimates of the costs of different action sequences. The effect of model error becomes more severe as the horizon H grows. Second, the error in the terminal value function gets propagated back to the estimate of the Q-function at the start state. With discounting, the effect of an inaccurate terminal value function diminishes as H increases. Third, using a small H with an inaccurate terminal value function can make the MPC algorithm greedy and myopic to rewards further out in the future.
We can formally bound the performance of the MPC policy under approximate models and approximate learned value functions. In Theorem 3.1, we bound the loss in performance of the resulting policy as a function of the model error, the value function error, and the planning horizon.
Theorem 3.1 (proof in Appendix A.1.2).

Let MDP M̂ be an approximation of M such that, for all (s, a), the transition dynamics and cost function are within bounded error of those of M. Let the learned value function Q̂ be an approximation of the true value function Q*. The performance of the H-horizon MPC policy π̂ is bounded w.r.t. the optimal policy π* as
(4) 
This theorem generalizes several established results. Setting H = 1 gives the 1-step simulation lemma of Kearns and Singh (2002) (Appendix A.1.1). Setting the model error to zero, i.e., a true model, recovers the cost-shaping result of Sun et al. (2018).

Inspecting the terms in (4) further, we see that the model error increases with the horizon H (the first two terms), while the learned value error decreases with H, which matches our intuition.

In practice, the errors in the model and value function are usually unknown and hard to estimate, making it impossible to set the MPC horizon optimally. Instead, we next propose a strategy that blends the Q-estimates from MPC and the learned value function at every timestep along the horizon, instead of just the terminal step, so that we can properly balance the different sources of error.
3.2 Blending Model Predictive Control and Value Functions
A naive way to blend Q-estimates from MPC with Q-estimates from the value function would be to consider a convex combination of the two:

$$\hat{Q}_{\mathrm{blend}}(s,a) = \lambda\, \hat{Q}_H(s,a) + (1-\lambda)\, \hat{Q}(s,a) \tag{5}$$

where λ ∈ [0, 1]. Here the value function contributes a residual that is added to the MPC output, an approach commonly used to combine model-based and model-free methods (Lee et al., 2020). However, this solution is rather ad hoc. If we have a value function at our disposal, why invoke it only at the first and last timesteps? As the value function gets better, it should be useful to invoke it at all timesteps.
Instead, consider the following recursive formulation for the Q-estimate. Given (s_h, a_h), the state-action pair encountered at horizon h, the blended estimate is expressed as

$$\hat{Q}^{\lambda}(s_h,a_h) = (1-\lambda)\, \hat{Q}(s_h,a_h) + \lambda\,\Big(\hat{c}(s_h,a_h) + \gamma\, \mathbb{E}\big[\hat{Q}^{\lambda}(s_{h+1},a_{h+1})\big]\Big) \tag{6}$$

where λ ∈ [0, 1]. The recursion ends at h = H with Q̂^λ(s_H, a_H) = Q̂(s_H, a_H). In other words, the current blended estimate is a convex combination of the model-free value function and the one-step model-based return; the return in turn uses the future blended estimate. Note that, unlike (5), the model-free estimate is invoked at every timestep.
We can unroll (6) in time to show that Q̂^λ_H, the blended H-horizon estimate, is simply an exponentially weighted average of all horizon estimates:

$$\hat{Q}^{\lambda}_H(s,a) = (1-\lambda)\sum_{h=0}^{H-1} \lambda^{h}\, \hat{Q}_h(s,a) + \lambda^{H}\, \hat{Q}_H(s,a) \tag{7}$$

where Q̂_h is the h-horizon estimate as in (1) (with Q̂_0 = Q̂). When λ = 0, the estimator reduces to just using Q̂, and when λ = 1 we recover the original MPC estimate in (1). For intermediate values of λ, we interpolate smoothly between the two by interpolating all H estimates.

Implementing (7) naively would require running H versions of MPC and then combining their outputs, which is far too expensive. However, we can switch to the disadvantage formulation by applying a similar telescoping trick:
$$\hat{Q}^{\lambda}_H(s,a) = \hat{Q}(s,a) + \mathbb{E}\Big[\sum_{i=0}^{H-1} \lambda^{i+1}\gamma^{i}\, \hat{A}(s_i,a_i)\Big] \tag{8}$$
This estimator has a form similar to the TD(λ) estimator for the value function. However, while TD(λ) uses the λ parameter for a bias-variance tradeoff, our blended estimator aims to trade off bias in the dynamics model against bias in the learned value function.

Why use blending when one can simply tune the horizon H? First, H limits the resolution we can tune since it is an integer; as H gets smaller, the resolution becomes worse. Second, the blended estimator uses far more samples. Say we have access to an optimal horizon H* ≤ H. Even if both the H*-horizon estimate and the blended estimate had the same bias, the former uses a strict subset of the samples used by the latter. Hence the variance of the blended estimator will be lower, with high probability.
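A small numerical check (with stand-in costs and value estimates along a single rollout, so expectations drop out) confirms that the recursive blend, the exponentially weighted average of horizon estimates, and the telescoped disadvantage form all agree:

```python
import numpy as np

rng = np.random.default_rng(0)
H, gam, lam = 5, 0.99, 0.7
c = rng.normal(size=H)        # stand-in per-step model costs c_0..c_{H-1}
q = rng.normal(size=H + 1)    # stand-in value estimates Q_hat(s_h, a_h)

def blended_recursive(h):
    """Blend at step h: convex combination of the model-free estimate and
    the one-step model-based return on the future blended estimate."""
    if h == H:
        return q[H]
    return (1 - lam) * q[h] + lam * (c[h] + gam * blended_recursive(h + 1))

def h_horizon(h):
    """h-horizon estimate: h discounted costs, then bootstrap with q[h]."""
    return sum(gam**i * c[i] for i in range(h)) + gam**h * q[h]

# exponentially weighted average over all horizon estimates
weighted = (1 - lam) * sum(lam**h * h_horizon(h) for h in range(H)) \
           + lam**H * h_horizon(H)

# telescoped form: value estimate plus weighted 1-step disadvantages
telescoped = q[0] + sum(lam**(i + 1) * gam**i * (c[i] + gam * q[i + 1] - q[i])
                        for i in range(H))

assert np.isclose(blended_recursive(0), weighted)
assert np.isclose(blended_recursive(0), telescoped)
```

The telescoped form is the one worth implementing: it needs a single H-step rollout rather than H separate ones.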
4 The MPQ(λ) Algorithm
We develop a simple variant of Q-Learning, called Model-Predictive Q-Learning with λ-weights (MPQ(λ)), that learns a parameterized Q-function estimate Q_φ. Our algorithm, presented in Alg. 1, modifies Q-learning to use the blended Q-estimates described in (8) for both action selection and generating value targets. The parameter λ is used to trade off the errors due to model bias and the learned Q-function Q_φ. This can be viewed as an extension of the MPQ algorithm of Bhardwaj et al. (2020) that explicitly deals with model bias by incorporating the learned Q-function at all timesteps. Unlike MPQ, we do not explicitly consider the entropy-regularized formulation, although our framework can be modified to incorporate soft-Q targets.
At every timestep t, MPQ(λ) proceeds by using H-horizon MPC from the current state s_t to optimize a policy with parameters θ. We modify the MPC algorithm to optimize for the greedy policy with respect to the blended Q-estimator in (8), that is,

$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{a \sim \pi_\theta}\big[\hat{Q}^{\lambda}_H(s_t, a)\big] \tag{9}$$
An action sampled from the resulting policy is then executed on the system. A commonly used heuristic is to warm-start the above optimization by shifting forward the solution from the previous timestep, which serves as a good initialization if the noise in the dynamics is small (Wagener et al., 2019). This can significantly cut computational cost by reducing the number of iterations required to optimize (9) at every timestep.

Periodically, the parameters φ are updated via stochastic gradient descent to minimize the following loss function on minibatches of N experience tuples sampled from the replay buffer:

$$\mathcal{L}(\phi) = \sum_{i=1}^{N} \big(Q_\phi(s_i,a_i) - y_i\big)^2 \tag{10}$$
The H-horizon MPC with the blended Q-estimator is again invoked to calculate the targets:

$$y_i = \hat{c}(s_i,a_i) + \gamma\, \hat{Q}^{\lambda}_H(s'_i,a'_i), \qquad a'_i \sim \pi_{\theta}(\cdot \mid s'_i) \tag{11}$$
Using MPC to reduce error in Q-targets has been previously explored in the literature (Lowrey et al., 2018; Bhardwaj et al., 2020), where the model is either assumed to be perfect or model error is not explicitly accounted for. MPC with the blended Q-estimator and an appropriate λ allows us to generate more stable Q-targets than using Q_φ or model-based rollouts with a terminal Q-function alone. However, running H-horizon optimization for all samples in a minibatch can be time-consuming, forcing the use of smaller batch sizes and sparse updates. In our experiments, we employ a practical modification: during the action-selection step, MPC is also queried for value targets, which are then stored in the replay buffer, allowing us to use larger batch sizes and updates at every timestep.
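A sketch of this modification (the buffer layout, function names, and toy Q-function here are illustrative assumptions, not the authors' code): the blended target computed during action selection is cached with the transition, so the later update is plain minibatch regression.

```python
import random
from collections import deque

import numpy as np

buffer = deque(maxlen=10_000)  # small buffer of (state, action, target) tuples

def store_step(state, action, blended_target):
    """Cache the MPC-computed blended value target alongside the transition,
    so targets need not be recomputed at update time."""
    buffer.append((np.asarray(state), np.asarray(action), float(blended_target)))

def td_loss(q_fn, batch_size=32):
    """Mean squared error between Q_phi and the cached blended targets."""
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    err = [q_fn(s, a) - y for s, a, y in batch]
    return float(np.mean(np.square(err)))
```

Because each stored tuple already carries its target, gradient steps can run at every timestep without invoking H-horizon MPC inside the update loop.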
Finally, we also allow λ to vary over time. In practice, λ is decayed as more data is collected on the system. Intuitively, in the early stages of learning, the bias in Q_φ dominates, and hence we want to rely more on the model; a larger value of λ is appropriate as it upweights longer-horizon estimates in the blended Q-estimator. As the Q_φ estimates improve over time, a smaller λ is favorable to reduce the reliance on the approximate model.
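Since the paper's exact schedule is deferred to its appendix, the sketch below assumes a simple geometric decay (the function name and constants are illustrative):

```python
def lambda_schedule(step, lam0=0.95, decay=0.999, lam_min=0.0):
    """Geometric decay of the blending parameter lambda: rely on the model
    early in training, shift weight to Q_phi as data accumulates."""
    return max(lam_min, lam0 * decay**step)
```

Any monotonically decreasing schedule bounded in [0, 1] fits the same role; the decay rate controls how quickly trust shifts from the model to the learned value function.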
5 Experiments
Task Details: We evaluate MPQ(λ) on simulated robot control tasks based on the MuJoCo simulator (Todorov et al., 2012), including a complex manipulation task with a 7-DOF arm and in-hand manipulation with a 24-DOF anthropomorphic hand (Rajeswaran et al., 2018), as shown in Fig. 1. For each task, we provide the agent with a biased version of the simulation that is used as the dynamics model for MPC. We use Model Predictive Path Integral control (MPPI) (Williams et al., 2017), a state-of-the-art sampling-based algorithm, as our MPC algorithm throughout.
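A minimal single-iteration MPPI sketch, under simplifying assumptions (scalar toy dynamics, fixed noise scale, and none of the full algorithm's control-cost or covariance terms):

```python
import numpy as np

def mppi_update(mean_seq, s0, dynamics, cost, n_samples=64,
                sigma=0.5, temperature=1.0, rng=None):
    """One MPPI iteration: perturb the mean action sequence with Gaussian
    noise, roll out the model, and re-weight sequences by exponentiated
    negative cost (lower cost -> higher weight)."""
    rng = rng or np.random.default_rng(0)
    H = len(mean_seq)
    noise = rng.normal(scale=sigma, size=(n_samples, H))
    seqs = mean_seq[None, :] + noise
    costs = np.empty(n_samples)
    for k in range(n_samples):
        s, total = s0, 0.0
        for a in seqs[k]:
            total += cost(s, a)
            s = dynamics(s, a)
        costs[k] = total
    # softmax weights over sampled trajectories
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return w @ seqs  # new mean action sequence
```

In the receding-horizon loop, the first action of the returned mean is executed and the shifted sequence warm-starts the next iteration.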


Cartpole-Swingup: A classic control task where the agent slides a cart along a rail to swing up a pole attached via an unactuated hinge joint. Model bias is simulated by providing the agent incorrect masses for the cart and pole. The masses are set lower than the true values to make the problem harder for MPC, as the algorithm will always apply smaller controls than desired, as also noted in Ramos et al. (2019). The initial positions of the cart and pole are randomized at every episode.

Sawyer-PegInsertion: The agent controls a 7-DOF Sawyer arm to insert a peg attached to the end-effector into a hole at different locations on a table in front of the robot. We test the effects of inaccurate perception by simulating a sensor at the target location that provides noisy position measurements at every timestep. MPC uses a deterministic model that does not take sensor noise into account, as is commonly done in controls. This biases the cost of simulated trajectories, causing MPC to fail to reach the target.

In-Hand Manipulation: A challenging in-hand manipulation task with a 24-DOF dexterous hand from Rajeswaran et al. (2018). The agent must align a pen with a target orientation within a certain tolerance for success. The initial orientation of the pen is randomized at every episode. Here, we simulate bias by providing larger estimates of the mass and inertia of the pen as well as the friction coefficients, which causes the MPC algorithm to optimize overly aggressive policies and drop the pen.
Please refer to Appendix A.2 for more details on the tasks, success criteria, and biased simulations.

Baselines: We compare MPQ(λ) against both model-based and model-free baselines: MPPI with true dynamics and no value function, MPPI with biased dynamics and no value function, and Proximal Policy Optimization (PPO) (Schulman et al., 2017).
Learning Details: We represent the Q-function with a feedforward neural network. Simulation parameters such as masses or friction coefficients are biased using a multiplicative bias factor. We also employ a practical modification to Alg. 1 to speed up training, as discussed in Section 4. Instead of maintaining a large replay buffer and recalculating targets for every experience tuple in a minibatch, as done by Bhardwaj et al. (2020); Lowrey et al. (2018), we simply query MPC for the value targets online and store them in a smaller buffer, which allows us to perform updates at every timestep. For PPO, we used the publicly available implementation at https://rb.gy/61iarq. Refer to Appendix A.2 for more details.

[Figure 2 caption: Validation reward curves. The shaded region depicts the standard error of the mean, denoting the confidence on the average reward estimated from finite samples. Training is performed for 100k steps with validation after every 4k steps. When λ is decayed as per a schedule, it is fixed to its current value during validation. In (b), (d), (e), (f), λ denotes the value at the end of training. PPO asymptotic performance is reported as the average reward of the last 10 validation iterations. (g) shows the best validation reward at the end of training for different horizon values and numbers of MPPI trajectory samples (particles).]

5.1 Analysis of Overall Performance
O 1. MPQ(λ) is able to overcome model bias in MPC for a wide range of λ values.

Fig. 2(a) compares MPQ(λ) with MPPI using true and biased dynamics for various settings of λ. There exists a wide range of λ values for which MPQ(λ) can efficiently trade off model bias against the bias in the learned Q-function and outperform MPPI with biased dynamics. However, setting λ to a value near 1, which weighs longer horizons heavily, leads to poor performance, as the compounding effects of model bias are not compensated by Q_φ. Performance also drops when λ is set too low. MPQ(λ) outperforms MPPI with access to the true dynamics and reaches close to the asymptotic performance of PPO. This is not surprising, as the learned Q-function adds global information to the optimization and corrects for errors in optimizing longer horizons.
O 2. Faster convergence can be achieved by decaying λ over time.

As more data is collected on the system, we expect the bias in Q_φ to decrease, whereas model bias remains constant. A larger value of λ that favors longer horizons is better during the initial steps of training, as the effect of a randomly initialized Q_φ is diminished by discounting and better exploration is achieved by forward lookahead. Conversely, as Q_φ gets more accurate, model bias begins to hurt performance and a smaller λ is favorable. We test this by decaying λ using a fixed schedule and observe that faster convergence is indeed obtained by reducing the dependence on the model over training steps, as shown in Fig. 2(b). Figures 2(d) and 2(e) present ablations showing that MPQ(λ) is robust to a wide range of decay rates for two different horizon settings. When provided with true dynamics, MPPI with the shorter horizon performs better than with the longer one due to optimization issues with long horizons. MPQ(λ) reaches performance comparable to MPPI and the asymptotic performance of PPO in both cases, showing robustness to the horizon value, which is important since in practice we wish to set the horizon as large as our computation budget permits. However, decaying λ too fast or too slow can have adverse effects on performance. An interesting question for future work is whether λ can be adapted in a state-dependent manner. Refer to Appendix A.2 for details on the decay schedule.
O 3. MPQ(λ) is much more sample efficient than model-free RL on high-dimensional continuous control tasks, while maintaining asymptotic performance.

Figures 2 and 3 compare MPQ(λ) with the model-free PPO baseline. In all cases, we observe that MPQ(λ), through its use of approximate models, learned value functions, and a dynamically varying λ to trade off different sources of error, rapidly improves its performance and achieves average reward and success rate comparable to MPPI with access to ground-truth dynamics and to model-free RL in the limit. In In-Hand Manipulation, PPO performance does not improve at all over 150k training steps. In Sawyer-PegInsertion, the small reward difference between MPPI with true and biased models is due to the fact that, despite model bias, MPC is able to get the peg close to the table, but sensor noise inhibits the precise control needed to consistently insert it into the hole. Here, the value function learned by MPQ(λ) can adapt to the sensor noise and allow for fine-grained control near the table.
O 4. MPQ(λ) is robust to a large degree of model misspecification.

Fig. 2(c) shows the effects of different values of the bias factor used to vary the mass of the cart and pole for MPQ(λ) with a fixed decay rate. MPQ(λ) achieves performance better than MPPI with true dynamics and comparable to model-free RL in the limit for a wide range of bias factors, and the convergence rate is generally faster for smaller bias. For large values of the bias factor, MPQ(λ) either fails to improve or diverges as the compounding effects of model bias hurt learning, making model-free RL the more favorable alternative. A similar trend is observed in Figures 3(a) and 3(b), where MPQ(λ) outperforms MPPI with corresponding bias in the mass, inertia, and friction coefficients of the pen by a margin of over 30 in success rate. It also achieves performance comparable to MPPI with true dynamics and to model-free RL, but is unable to do so for the largest bias values. We conclude that although MPQ(λ) is robust to a large amount of model bias, if the model is extremely uninformative, relying on MPC can degrade performance.
O 5. MPQ(λ) is robust to the planning horizon and the number of trajectory samples in sampling-based MPC.

TD(λ)-based approaches are traditionally used for the bias-variance tradeoff in model-free value function estimation. In our framework, λ plays a similar role, but it trades off bias due to the dynamics model and learned value function against variance due to long-horizon rollouts. We empirically quantify this on the Cartpole-Swingup task by training MPQ(λ) with different values of the horizon and the number of particles. The results in Fig. 2(g) show that (1) using λ can overcome the effects of model bias irrespective of the planning horizon (except for very small values of the horizon or the number of particles), and (2) using λ can overcome variance due to a limited number of particles with long-horizon rollouts. The ablative study in Fig. 2(f) lends evidence to the fact that it is preferable to simply decay λ over time than to tune the discrete horizon value to balance model bias. Not only does decaying λ achieve a better convergence rate and asymptotic performance, the performance is also more robust to different decay rates (as evidenced by Fig. 2(d)), whereas the same does not hold for varying the horizon.
6 Conclusion
In this paper, we presented a general framework for mitigating model bias in MPC by blending in model-free value estimates using a parameter λ, to systematically trade off different sources of error. Our practical algorithm is theoretically well-founded and achieves performance close to MPC with access to the true dynamics and the asymptotic performance of model-free RL, while being sample efficient. An interesting avenue for future research is to vary λ in a state-adaptive fashion. In particular, reasoning about model and value function uncertainty may allow us to adjust λ to rely more or less on the model in certain parts of the state space. Another promising direction is to extend the framework to explicitly incorporate constraints by exploring different constrained MPC approaches.
Acknowledgments
The authors would like to thank Aravind Rajeswaran for help with code for the peg insertion task.
References
P. Abbeel, A. Coates, and A. Y. Ng (2010). Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29(13), pp. 1608–1639.
P. Abbeel and A. Y. Ng (2005). Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 1–8.
T. Anthony, Z. Tian, and D. Barber (2017). Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pp. 5360–5370.
M. Bhardwaj, A. Handa, D. Fox, and B. Boots (2020). Information theoretic model predictive Q-learning. In Learning for Dynamics and Control, pp. 840–850.
K. Chua, R. Calandra, R. McAllister, and S. Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.
T. Erez et al. (2013). An integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 292–299.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
S. Kakade, M. Kearns, and J. Langford (2003). Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312.
M. Kearns and S. Singh (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2-3), pp. 209–232.
D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
G. Lee, B. Hou, S. Choudhury, and S. S. Srinivasa (2020). Bayesian residual policy optimization: scalable Bayesian reinforcement learning with clairvoyant experts. arXiv preprint arXiv:2002.03042.
K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch (2018). Plan online, learn offline: efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848.
D. Q. Mayne et al. (2011). Tube-based robust nonlinear model predictive control. International Journal of Robust and Nonlinear Control 21(11), pp. 1341–1353.
D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert (2000). Constrained model predictive control: stability and optimality. Automatica 36(6), pp. 789–814.
A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566.
A. Rajeswaran et al. (2018). Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS).
F. Ramos, R. C. Possas, and D. Fox (2019). BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728.
S. Ross and J. A. Bagnell (2012). Agnostic system identification for model-based reinforcement learning. arXiv preprint arXiv:1203.1007.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
P. Shyam, W. Jaśkowski, and F. Gomez (2019). Model-based active exploration. In International Conference on Machine Learning, pp. 5779–5788.
D. Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484–489.
D. Silver et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
C. Summers et al. (2020). Lyceum: an efficient and scalable ecosystem for robot learning. arXiv preprint arXiv:2001.07343.
W. Sun, J. A. Bagnell, and B. Boots (2018). Truncated horizon policy search: combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240.
E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots (2019). An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967.
G. Williams et al. (2016). Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1433–1440.
G. Williams et al. (2017). Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721.
M. Zhong et al. (2013). Value function approximation and model predictive control. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 100–107.
Appendix A
A.1 Proofs
We present upper bounds on the performance of a greedy policy when using approximate value functions and models. We also analyze the case of finite-horizon planning with an approximate dynamics model and terminal value function, which can be seen as a generalization of Sun et al. (2018). For simplicity, we switch to using V̂ for the learned model-free value function (instead of Q̂).

Let V̂ be an approximation of V*. Let MDP M̂ be an approximation of M such that, for all (s, a), the transition dynamics and cost function are within bounded error of those of M.
A.1.1 A Gentle Start: Bound on the Performance of the 1-Step Greedy Policy
Theorem A.1.
Let the one-step greedy policy be

$$\hat{\pi}(s) = \arg\min_{a}\; \Big[\hat{c}(s,a) + \gamma\, \mathbb{E}_{s' \sim \hat{P}(\cdot|s,a)}\big[\hat{V}(s')\big]\Big] \tag{12}$$
The performance loss of π̂ w.r.t. the optimal policy π* on MDP M is bounded by
(13) 
A.1.2 Bound on the Performance of the H-Step Greedy Policy
Notation: For brevity, let us define the following macro,

$$J_H(\pi, \hat{V}; \hat{M}) = \mathbb{E}\Big[\sum_{i=0}^{H-1}\gamma^i\, \hat{c}(s_i,a_i) + \gamma^H \hat{V}(s_H)\Big] \tag{16}$$

which represents the expected cost achieved when executing policy π on M̂ for H steps using V̂ as the terminal cost. We can substitute different policies, terminal costs, and MDPs; for example, J_H(π, V̂; M̂) is the expected cost obtained by running policy π on the simulator M̂ for H steps with the approximate learned terminal value function V̂.
Lemma A.1.
For a given policy π, the optimal value function V*, and the MDPs M and M̂, the following performance difference holds:
Proof.
We temporarily introduce a new MDP that has the same cost function as M, but the transition function of M̂.
(17)  
Let the divergence term below represent the difference in the distribution of states encountered by executing π on M and M̂, respectively, starting from state s.
Expanding the RHS of (17):
(18) 
Since the first state is the same
(19)  
where the first inequality is obtained by applying the triangle inequality, and the second by applying the triangle inequality followed by the upper bound on the error in the cost function.
(20) 
By choosing the free parameter appropriately, we can ensure that the term inside is upper-bounded by
(21) 
∎
The above lemma builds on similar results in Kakade et al. (2003), Abbeel and Ng (2005), and Ross and Bagnell (2012).
We are now ready to prove our main theorem, i.e., the performance bound of an MPC policy that uses an approximate model and an approximate value function.
Proof of Theorem 3.1
Proof.
Since π̂ is the greedy policy when using M̂ and V̂,