Blending MPC & Value Function Approximation for Efficient Reinforcement Learning

12/10/2020 ∙ by Mohak Bhardwaj, et al.

Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter λ, similar to the trace decay parameter in TD(λ), we can systematically trade off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how errors from inaccurate models in MPC and from value function estimation in RL can be balanced. We further propose an algorithm that changes λ over time to reduce the dependence on MPC as our estimates of the value function improve, and test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation. We demonstrate that our approach can obtain performance comparable to MPC with access to true dynamics, even under severe model bias, and is more sample efficient than model-free RL.


1 Introduction

Model-free Reinforcement Learning (RL) is increasingly used in challenging sequential decision-making problems, including high-dimensional robotics control tasks (Haarnoja et al., 2018; Schulman et al., 2017) as well as video and board games (Silver et al., 2016, 2017). While these approaches are extremely general and can theoretically solve complex problems with little prior knowledge, they also typically require a large quantity of training data to succeed. In robotics and engineering domains, data may be collected from real-world interaction, a process that can be dangerous, time-consuming, and expensive.

Model-Predictive Control (MPC) offers a simpler, more practical alternative. While RL typically uses data to learn a global model offline, which is then deployed at test time, MPC solves for a policy online by optimizing an approximate model for a finite horizon at a given state. This policy is then executed for a single timestep and the process repeats. MPC is one of the most popular approaches for control of complex, safety-critical systems such as autonomous helicopters (Abbeel et al., 2010), aggressive off-road vehicles (Williams et al., 2016) and humanoid robots (Erez et al., 2013), owing to its ability to use approximate models to optimize complex cost functions with nonlinear constraints (Mayne et al., 2000, 2011).

However, approximations in the model used by MPC can significantly limit performance. Specifically, model bias may result in persistent errors that eventually compound and become catastrophic. For example, in non-prehensile manipulation, practitioners often use a simple quasi-static model that assumes an object does not roll or slide away when pushed. For more dynamic objects, this can lead to aggressive pushing policies that perpetually over-correct, eventually driving the object off the surface.

Recently, there have been several attempts to combine MPC with model-free RL, showing that the combination can improve over the individual approaches alone. Many of these approaches involve using RL to learn a terminal cost function, thereby increasing the effective horizon of MPC (Zhong et al., 2013; Lowrey et al., 2018; Bhardwaj et al., 2020). However, the learned value function is only applied at the end of the MPC horizon; model errors still persist within the horizon, leading to sub-optimal policies. Similar approaches have also been applied to great effect in discrete games with known models (Silver et al., 2016, 2017; Anthony et al., 2017), where value functions and policies learned via model-free RL are used to guide Monte-Carlo Tree Search. In this paper, we focus on a somewhat broader question: can machine learning be used to both increase the effective horizon of MPC and correct for model bias?

One straightforward approach is to try to learn (or correct) the MPC model from real data encountered during execution; however, there are some practical barriers to this strategy. Hand-constructed models are often crude approximations of reality and lack the expressivity to represent the dynamics actually encountered. Moreover, increasing the complexity of such models leads to computationally expensive updates that can harm MPC's online performance. Model-based RL approaches such as Chua et al. (2018); Nagabandi et al. (2018); Shyam et al. (2019) aim to learn general neural network models directly from data. However, learning globally consistent models is an exceptionally hard task due to issues such as covariate shift (Ross and Bagnell, 2012).

We propose a framework, MPQ, for weaving together MPC with learned value estimates to trade off errors in the MPC model against approximation error in a learned value function. Our key insight is to view MPC as tracing out a series of local Q-function approximations. We can then blend each of these Q-functions with value estimates from reinforcement learning. We show that by using a blending parameter λ, similar to the trace decay parameter in TD(λ), we can systematically trade off errors between these two sources. Moreover, by smoothly decaying λ over learning episodes we can achieve the best of both worlds: a policy can depend on a prior model before it has encountered any data and then gradually become more reliant on learned value estimates as it gains experience.

To summarize, our key contributions are:
1. A framework that unifies MPC and Model-free RL through value function approximation.
2. Theoretical analysis of finite horizon planning with approximate models and value functions.
3. Empirical evaluation on challenging manipulation problems with varying degrees of model-bias.

2 Preliminaries

2.1 Reinforcement Learning

We consider an agent acting in an infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, c, P, \gamma, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $c(s,a)$ is the per-step cost function, $P(s'|s,a)$ is the stochastic transition dynamics, $\gamma$ is the discount factor, and $\rho_0$ is a distribution over initial states. A closed-loop policy $\pi(\cdot|s)$ outputs a distribution over actions given a state. Let $\rho_\pi$ be the distribution over state-action trajectories obtained by running policy $\pi$ on $\mathcal{M}$. The value function for a given policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}_{\rho_\pi}\left[\sum_{t=0}^{\infty}\gamma^t c(s_t, a_t) \mid s_0 = s\right]$, and the action-value function as $Q^\pi(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V^\pi(s')\right]$. The objective is to find an optimal policy $\pi^* = \arg\min_\pi \mathbb{E}_{s \sim \rho_0}\left[V^\pi(s)\right]$. We can also define the (dis)advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, which measures how good an action is compared to the action taken by the policy in expectation. It can be equivalently expressed in terms of the Bellman error as $A^\pi(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s'}\left[V^\pi(s')\right] - V^\pi(s)$.
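
As a quick illustration (not from the paper), these quantities can be estimated from sampled trajectories; a minimal sketch, where v_fn is a hypothetical callable returning an estimate of V^pi(s):

def discounted_cost(costs, gamma):
    # Monte-Carlo sample of V^pi(s): the expectation of this quantity over
    # trajectories generated by running pi from s.
    return sum((gamma ** t) * c for t, c in enumerate(costs))

def disadvantage_sample(v_fn, c, s, s_next, gamma):
    # Bellman-error form of the (dis)advantage, estimated from one transition:
    # A^pi(s, a) = c(s, a) + gamma * E[V^pi(s')] - V^pi(s).
    return c + gamma * v_fn(s_next) - v_fn(s)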

2.2 Model-Predictive Control

MPC is a widely used technique for synthesizing closed-loop policies for MDPs. Instead of trying to solve for a single, globally optimal policy, MPC follows a more pragmatic approach of optimizing simple, local policies online. At every timestep on the system, MPC uses an approximate model of the environment to search for a parameterized policy $\pi_\phi$ that minimizes cost over a finite horizon. An action is sampled from the policy and executed on the system. The process is then repeated from the next state, often by warm-starting the optimization from the previous solution.

We formalize this process as solving a simpler surrogate MDP $\hat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{c}, \hat{P}, \gamma, \hat{\rho}_0)$ online, which differs from $\mathcal{M}$ by using an approximate cost function $\hat{c}$, approximate transition dynamics $\hat{P}$, and limiting the horizon to $H$. Since it plans to a finite horizon, it is also common to use a terminal state-action value function $\hat{Q}$ that estimates the cost-to-go. The start state distribution $\hat{\rho}_0$ is a Dirac-delta function centered on the current state $s_t$. MPC can be viewed as iteratively constructing an estimate $\hat{Q}_H$ of the Q-function of the original MDP $\mathcal{M}$, given policy $\pi_\phi$ at state $s_t$:

$\hat{Q}_H(s_t, a_t) = \mathbb{E}\left[\sum_{i=0}^{H-1}\gamma^i\,\hat{c}(s_i, a_i) + \gamma^H\,\hat{Q}(s_H, a_H)\right], \quad s_0 = s_t,\; a_0 = a_t,\; a_i \sim \pi_\phi(\cdot|s_i),\; s_{i+1} \sim \hat{P}(\cdot|s_i, a_i) \qquad (1)

MPC then iteratively optimizes this estimate (at the current system state $s_t$) to update the policy parameters $\phi$:

$\phi^* = \arg\min_{\phi}\; \mathbb{E}_{a_t \sim \pi_\phi(\cdot|s_t)}\left[\hat{Q}_H(s_t, a_t)\right] \qquad (2)$

Alternatively, we can also view the above procedure from the perspective of disadvantage minimization. Let us define an estimator for the 1-step disadvantage with respect to the potential function $\hat{Q}$ as $\hat{A}(s_i, a_i) = \hat{c}(s_i, a_i) + \gamma\,\hat{Q}(s_{i+1}, a_{i+1}) - \hat{Q}(s_i, a_i)$. We can then equivalently write the above optimization as minimizing the discounted sum of disadvantages over time via the telescoping sum trick:

$\hat{Q}_H(s_t, a_t) = \hat{Q}(s_t, a_t) + \mathbb{E}\left[\sum_{i=0}^{H-1}\gamma^i\,\hat{A}(s_i, a_i)\right] \qquad (3)$

Although the above formulation queries the value function $\hat{Q}$ at every timestep, it is still exactly equivalent to the original problem and hence does not mitigate the effects of model bias. In the next section, we build a concrete method to address this issue by formulating a novel way to blend Q-estimates from MPC and a learned value function that can balance their respective errors.
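
For concreteness, a minimal sketch of the H-horizon estimate in (1) and its telescoped form (3), assuming a deterministic approximate model for simplicity; f_hat, c_hat, q_hat, and pi are hypothetical callables. The two functions return the same value.

def mpc_q_estimate(s, a, pi, f_hat, c_hat, q_hat, gamma, H):
    # (1): discounted model-predicted costs plus a terminal value estimate.
    total, discount = 0.0, 1.0
    for _ in range(H):
        total += discount * c_hat(s, a)
        s = f_hat(s, a)
        a = pi(s)
        discount *= gamma
    return total + discount * q_hat(s, a)

def mpc_q_estimate_telescoped(s, a, pi, f_hat, c_hat, q_hat, gamma, H):
    # (3): the same quantity written as Q_hat(s, a) plus discounted 1-step disadvantages.
    total, discount = q_hat(s, a), 1.0
    for _ in range(H):
        s_next = f_hat(s, a)
        a_next = pi(s_next)
        total += discount * (c_hat(s, a) + gamma * q_hat(s_next, a_next) - q_hat(s, a))
        s, a = s_next, a_next
        discount *= gamma
    return total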

3 Mitigating Bias in MPC via Reinforcement Learning

In this section, we develop our approach to systematically deal with model bias in MPC by blending-in learned value estimates. First, we take a closer look at the different sources of error in the estimate in (1) and then propose an easy-to-implement, yet effective strategy for trading them off.

3.1 Sources of Error in MPC

The performance of MPC algorithms critically depends on the quality of the Q-function estimator in (1). There are three major sources of approximation error. First, model bias can cause compounding errors in predicted state trajectories, which biases the estimates of the costs of different action sequences. The effect of model error becomes more severe as H increases. Second, the error in the terminal value function gets propagated back to the estimate of the Q-function at the start state. With discounting, the effect of error due to an inaccurate terminal value function diminishes as H increases. Third, using a small H with an inaccurate terminal value function can make the MPC algorithm greedy and myopic to rewards further out in the future.

We can formally bound the performance of the policy with approximate models and approximate learned value functions. In Theorem 3.1, we show the loss in performance of the resulting policy as a function of the model error, value function error and the planning horizon.

Theorem 3.1 (Proof in Appendix A.1.2).

Let MDP $\hat{\mathcal{M}}$ be an α-approximation of $\mathcal{M}$ such that, for all $(s,a)$, we have $|\hat{c}(s,a) - c(s,a)| \le \alpha$ and $\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \alpha$. Let the learned value function $\hat{Q}$ be an ε-approximation of the true value function, $\|\hat{Q} - Q^*\|_\infty \le \epsilon$. The performance of the MPC policy is bounded w.r.t. the optimal policy as

(4)

This theorem generalizes over various established results. Setting H = 1 gives us the 1-step simulation lemma in Kearns and Singh (2002) (Appendix A.1.1). Setting α = 0, i.e. a true model, recovers the cost-shaping result in Sun et al. (2018).

Further inspecting terms in (4), we see that the model error increases with the horizon H (the first two terms) while the learned value error decreases with H, which matches our intuitions.

In practice, the errors in the model and value function are usually unknown and hard to estimate, making it impossible to optimally set the MPC horizon. Instead, we next propose a strategy to blend the Q-estimates from MPC and the learned value function at every timestep along the horizon, instead of just at the terminal step, so that we can properly balance the different sources of error.

3.2 Blending Model Predictive Control and Value Functions

A naive way to blend Q-estimates from MPC with Q-estimates from the value function would be to consider a convex combination of the two

$\hat{Q}^{\mathrm{blend}}(s_t, a_t) = \lambda\,\hat{Q}_H(s_t, a_t) + (1-\lambda)\,\hat{Q}_\theta(s_t, a_t) \qquad (5)$

where λ ∈ [0, 1] and $\hat{Q}_\theta$ is the learned value function. Here, the value function is contributing to a residual that is added to the MPC output, an approach commonly used to combine model-based and model-free methods (Lee et al., 2020). However, this solution is rather ad hoc. If we have a value function at our disposal, why invoke it only at the first and last timesteps? As the value function gets better, it should be useful to invoke it at all timesteps.

Instead, consider the following recursive formulation for the Q-estimate. Given $(s_i, a_i)$, the state-action pair encountered at horizon $i$, the blended estimate is expressed as

$\hat{Q}^{\lambda}_{H-i}(s_i, a_i) = (1-\lambda)\,\hat{Q}_\theta(s_i, a_i) + \lambda\left(\hat{c}(s_i, a_i) + \gamma\,\mathbb{E}\left[\hat{Q}^{\lambda}_{H-i-1}(s_{i+1}, a_{i+1})\right]\right) \qquad (6)$

where λ ∈ [0, 1]. The recursion ends at $\hat{Q}^{\lambda}_{0} = \hat{Q}_\theta$. In other words, the current blended estimate is a convex combination of the model-free value function and the one-step model-based return. The return in turn uses the future blended estimate. Note that unlike (5), the model-free estimate is invoked at every timestep.

We can unroll (6) in time to show that $\hat{Q}^{\lambda}_{H}$, the blended H-horizon estimate, is simply an exponentially weighted average of all horizon estimates

$\hat{Q}^{\lambda}_{H}(s, a) = (1-\lambda)\sum_{h=0}^{H-1}\lambda^{h}\,\hat{Q}_h(s, a) + \lambda^{H}\,\hat{Q}_H(s, a) \qquad (7)$

where $\hat{Q}_h$ is an h-horizon estimate as in (1) with $\hat{Q}_\theta$ as the terminal value function (and $\hat{Q}_0 = \hat{Q}_\theta$). When λ = 0, the estimator reduces to just using $\hat{Q}_\theta$, and when λ = 1 we recover the original MPC estimate in (1). For intermediate values of λ, we interpolate smoothly between the two by interpolating all H estimates.
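
A small sketch of the recursion (6) and the weighted average (7) as written above, along a single deterministic model rollout; costs[i] = ĉ(s_i, a_i) (length H) and q_theta[i] = Q_θ(s_i, a_i) (length H+1) are hypothetical precomputed arrays. Both functions return the same value.

def blended_recursive(costs, q_theta, gamma, lam, i=0):
    # (6): convex combination of the model-free value and the one-step return
    # that continues with the future blended estimate; the recursion ends at Q_theta.
    H = len(costs)
    if i == H:
        return q_theta[H]
    future = blended_recursive(costs, q_theta, gamma, lam, i + 1)
    return (1 - lam) * q_theta[i] + lam * (costs[i] + gamma * future)

def blended_weighted(costs, q_theta, gamma, lam):
    # (7): exponentially weighted average of the h-horizon estimates Q_h.
    H = len(costs)
    def q_h(h):  # h-step model return terminated with Q_theta
        return sum((gamma ** i) * costs[i] for i in range(h)) + (gamma ** h) * q_theta[h]
    return (1 - lam) * sum((lam ** h) * q_h(h) for h in range(H)) + (lam ** H) * q_h(H)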

Implementing (7) naively would require running H versions of MPC and then combining their outputs. This is far too expensive. However, we can switch to the disadvantage formulation by applying a similar telescoping trick

$\hat{Q}^{\lambda}_{H}(s, a) = \hat{Q}_\theta(s, a) + \mathbb{E}\left[\sum_{i=0}^{H-1}\lambda\,(\gamma\lambda)^{i}\,\hat{A}_\theta(s_i, a_i)\right] \qquad (8)$
where $\hat{A}_\theta$ is the 1-step disadvantage computed with $\hat{Q}_\theta$ as the potential function.

This estimator has a similar form to the TD(λ) estimator for the value function. However, while TD(λ) uses the λ parameter for a bias-variance trade-off, our blended estimator aims to trade off bias in the dynamics model against bias in the learned value function.
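
A sketch of (8) with the same hypothetical inputs as the sketch above; a single H-step rollout suffices, and for any λ in [0, 1] it returns the same value as the recursive form.

def blended_telescoped(costs, q_theta, gamma, lam):
    # (8): Q_theta at the root plus 1-step disadvantages weighted by lam * (gamma * lam)^i.
    H = len(costs)
    total = q_theta[0]
    for i in range(H):
        disadv = costs[i] + gamma * q_theta[i + 1] - q_theta[i]
        total += lam * ((gamma * lam) ** i) * disadv
    return total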

Why use blending when one can simply tune the horizon H? First, H limits the resolution we can tune, since it is an integer – as H gets smaller the resolution becomes worse. Second, the blended estimator uses far more samples. Say we have access to the optimal horizon H*. Even if the blended estimator and the H*-horizon estimate had the same bias, the latter uses a strict subset of the samples used by the former. Hence the variance of the blended estimator will be lower, with high probability.

4 The MPQ Algorithm

Input: Initial Q-function weights θ, approximate dynamics $\hat{P}$ and cost function $\hat{c}$
Parameters: MPC horizon H, λ schedule [λ_t], discount factor γ, minibatch size K, number of mini-batches N, update frequency T_update
1:  D ← ∅
2:  for t = 1, 2, ... do
3:      λ_t ← next value in the λ schedule                    // Update λ
4:      a_t ← sample from the MPC policy optimizing (9) at s_t // Blended MPC action selection
5:      Execute a_t on the system and observe (c_t, s_{t+1})
6:      D ← D ∪ {(s_t, a_t, c_t, s_{t+1})}
7:      if t mod T_update = 0 then
8:          Sample N minibatches of size K from D
9:          Compute blended MPC value targets via (11)         // Generate blended MPC value targets
10:         Update θ with SGD on the loss in (10)
11:     end if
12: end for
Algorithm 1: MPQ
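
A Python-flavored sketch of this loop, not the authors' implementation: env (with a cost-based step interface), initial_plan, lambda_schedule, select_action, sample_minibatches, and q_update are hypothetical helpers, and sketches of several of them appear later in this section.

def train_mpq(env, q_net, optimizer, f_hat, c_hat, gamma, H, total_steps, update_every, n_batches):
    buffer = []
    s = env.reset()
    plan = initial_plan(H)                       # warm-start buffer of H actions
    for t in range(total_steps):
        lam = lambda_schedule(t)                 # update lambda per the schedule
        a, plan = select_action(s, plan, f_hat, c_hat, q_net, gamma, lam)  # blended MPC, cf. (9)
        s_next, cost = env.step(a)
        buffer.append((s, a, cost, s_next))
        if (t + 1) % update_every == 0:
            # sample_minibatches is assumed to return (states, actions, targets),
            # with targets generated by the H-horizon blended MPC estimator, cf. (11)
            for states, actions, targets in sample_minibatches(buffer, n_batches):
                q_update(q_net, optimizer, states, actions, targets)  # SGD on loss (10)
        s = s_next
    return q_net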

We develop a simple variant of Q-learning, called Model-Predictive Q-Learning with λ-weights (MPQ), that learns a parameterized Q-function estimate $\hat{Q}_\theta$. Our algorithm, presented in Alg. 1, modifies Q-learning to use blended Q-estimates as described in (8) for both action selection and generating value targets. The parameter λ is used to trade off the errors due to model bias and the learned Q-function $\hat{Q}_\theta$. This can be viewed as an extension of the MPQ algorithm from Bhardwaj et al. (2020) to explicitly deal with model bias by incorporating the learned Q-function at all timesteps. Unlike that algorithm, we do not explicitly consider the entropy-regularized formulation, although our framework can be modified to incorporate soft-Q targets.

At every timestep t, MPQ proceeds by using H-horizon MPC from the current state to optimize a policy with parameters φ. We modify the MPC algorithm to optimize for the greedy policy with respect to the blended Q-estimator in (8), that is

$\phi^* = \arg\min_{\phi}\; \mathbb{E}_{a_t \sim \pi_\phi(\cdot|s_t)}\left[\hat{Q}^{\lambda}_{H}(s_t, a_t)\right] \qquad (9)$

An action sampled from the resulting policy is then executed on the system. A commonly used heuristic is to warm-start the above optimization by shifting forward the solution from the previous timestep, which serves as a good initialization if the noise in the dynamics is small (Wagener et al., 2019). This can significantly cut computational cost by reducing the number of iterations required to optimize (9) at every timestep.
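
A simplified sketch of sampling-based action selection with the blended objective and warm-starting. This is a generic random-shooting stand-in rather than the MPPI update rule used in the paper; f_hat, c_hat, and q_theta are hypothetical callables, and the hyperparameters are placeholders.

import numpy as np

def blended_cost(s0, actions, f_hat, c_hat, q_theta, gamma, lam):
    # Telescoped blended estimate (8) along one open-loop action sequence.
    # For simplicity the last action is reused when querying Q_theta at s_H.
    H = len(actions)
    sa = [(s0, actions[0])]
    s = s0
    for i in range(H):
        s = f_hat(s, actions[i])
        sa.append((s, actions[min(i + 1, H - 1)]))
    total = q_theta(*sa[0])
    for i in range(H):
        disadv = c_hat(*sa[i]) + gamma * q_theta(*sa[i + 1]) - q_theta(*sa[i])
        total += lam * ((gamma * lam) ** i) * disadv
    return total

def select_action(s0, prev_plan, f_hat, c_hat, q_theta, gamma, lam, n_samples=64, noise=0.3):
    # Warm start: shift the previous plan forward by one step, then perturb it.
    base = np.roll(prev_plan, -1, axis=0)
    base[-1] = base[-2]
    candidates = base[None] + noise * np.random.randn(n_samples, *base.shape)
    costs = [blended_cost(s0, c, f_hat, c_hat, q_theta, gamma, lam) for c in candidates]
    best = candidates[int(np.argmin(costs))]
    return best[0], best   # action to execute now, and the plan to warm-start the next step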

Periodically, the parameters θ are updated via stochastic gradient descent to minimize the following loss function over N mini-batches of experience tuples, each of size K, sampled from the replay buffer:

$\mathcal{L}(\theta) = \frac{1}{NK}\sum_{j=1}^{N}\sum_{k=1}^{K}\left(\hat{Q}_\theta(s_{k}^{j}, a_{k}^{j}) - y_{k}^{j}\right)^2 \qquad (10)$

The H-horizon MPC with the blended Q-estimator is again invoked to calculate the targets

$y = c + \gamma\,\mathbb{E}_{a' \sim \pi_{\phi^*}(\cdot|s')}\left[\hat{Q}^{\lambda}_{H}(s', a')\right] \qquad (11)$
where $(s, a, c, s')$ is an experience tuple from the buffer and $\pi_{\phi^*}$ is the MPC policy obtained by solving (9) at $s'$.

Using MPC to reduce error in Q-targets has been previously explored in the literature (Lowrey et al., 2018; Bhardwaj et al., 2020), where the model is either assumed to be perfect or model error is not explicitly accounted for. MPC with the blended Q-estimator and an appropriate λ allows us to generate more stable Q-targets than using $\hat{Q}_\theta$ or model-based rollouts with a terminal Q-function alone. However, running H-horizon optimization for all samples in a mini-batch can be time-consuming, forcing the use of smaller batch sizes and sparse updates. In our experiments, we employ a practical modification where, during the action selection step, MPC is also queried for value targets which are then stored in the replay buffer, thus allowing us to use larger batch sizes and updates at every timestep.
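
A sketch of the target computation and the Q update, under one plausible reading of this practical modification: the blended MPC value computed during action selection at a state is cached and reused to complete the target of the preceding transition. q_net is a hypothetical PyTorch module for Q_θ.

import torch

def make_target(cost, gamma, blended_value_next):
    # y = c + gamma * Q^lambda_H(s', a'), cf. (11); blended_value_next is the
    # blended MPC estimate cached when the controller visited s'.
    return cost + gamma * blended_value_next

def q_update(q_net, optimizer, states, actions, targets):
    # One SGD step on the squared loss (10) for a sampled minibatch of tensors.
    pred = q_net(states, actions)
    loss = torch.mean((pred - targets.detach()) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())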

Finally, we also allow λ to vary over time. In practice, λ is decayed as more data is collected on the system. Intuitively, in the early stages of learning, the bias in $\hat{Q}_\theta$ dominates and hence we want to rely more on the model. A larger value of λ is appropriate, as it up-weights longer-horizon estimates in the blended Q-estimator. As the estimates improve over time, a smaller λ is favorable to reduce the reliance on the approximate model.
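
The exact schedule used in the paper is deferred to Appendix A.2; the exponential decay toward a floor below is only an illustrative placeholder, and all constants are made up.

import math

def lambda_schedule(step, lam_init=0.95, lam_final=0.7, decay_rate=3e-5):
    # Smoothly decay lambda from lam_init toward lam_final as training progresses.
    return lam_final + (lam_init - lam_final) * math.exp(-decay_rate * step)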

5 Experiments

Figure 1: Tasks for evaluating MPQ. Left to right: cartpole, peg insertion with a 7DOF arm, and in-hand manipulation to align a pen (blue) with a target orientation (green) using a 24DOF dexterous hand.

Task Details: We evaluate MPQ on simulated robot control tasks based on the MuJoCo simulator (Todorov et al., 2012), including a complex manipulation task with a 7DOF arm and in-hand manipulation with a 24DOF anthropomorphic hand (Rajeswaran* et al., 2018), as shown in Fig. 1. For each task, we provide the agent with a biased version of the simulation that is used as the dynamics model for MPC. We use Model Predictive Path Integral control (MPPI) (Williams et al., 2017), a state-of-the-art sampling-based algorithm, as our MPC method throughout.


  1. CartpoleSwingup: A classic control task where the agent slides a cart along a rail to swing up a pole attached via an unactuated hinge joint. Model bias is simulated by providing the agent with incorrect masses of the cart and pole. The masses are set lower than the true values to make the problem harder for MPC, as the algorithm will always apply smaller controls than desired, as also noted in Ramos et al. (2019). The initial positions of the cart and pole are randomized at every episode.

  2. SawyerPegInsertion: The agent controls a 7DOF Sawyer arm to insert a peg attached to the end-effector into a hole at different locations on a table in front of the robot. We test the effects of inaccurate perception by simulating a sensor at the target location that provides noisy position measurements at every timestep. MPC uses a deterministic model that does not take sensor noise into account, as is commonly done in controls. This biases the cost of simulated trajectories, causing MPC to fail to reach the target.

  3. InHandManipulation: A challenging in-hand manipulation task with a 24DOF dexterous hand from Rajeswaran* et al. (2018). The agent must align the pen with a target orientation within a certain tolerance for success. The initial orientation of the pen is randomized at every episode. Here, we simulate bias by providing larger estimates of the mass and inertia of the pen as well as of the friction coefficients, which causes the MPC algorithm to optimize overly aggressive policies and drop the pen.

Please refer to Appendix A.2 for more details on the tasks, success criteria, and biased simulations.

Baselines: We compare MPQ against both model-based and model-free baselines: MPPI with true dynamics and no value function, MPPI with biased dynamics and no value function, and Proximal Policy Optimization (PPO) (Schulman et al., 2017).

Learning Details: We represent the Q-function with a feed-forward neural network. Simulation parameters such as masses or friction coefficients are perturbed according to a fixed formula parameterized by a bias factor. We also employ a practical modification to Alg. 1 to speed up training, as discussed in Section 4. Instead of maintaining a large replay buffer and re-calculating targets for every experience tuple in a mini-batch, as done by Bhardwaj et al. (2020); Lowrey et al. (2018), we simply query MPC for the value targets online and store them in a smaller buffer, which allows us to perform updates at every timestep. For PPO, we used the publicly available implementation at https://rb.gy/61iarq. Refer to Appendix A.2 for more details.
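
The exact biasing formula is not reproduced above; as an illustrative assumption only, a multiplicative bias factor applied to each true parameter would look like:

def bias_parameter(true_value, bias_factor):
    # Hypothetical scheme: e.g. bias_factor = 1.5 would inflate a mass, inertia,
    # or friction coefficient by 50%, while bias_factor < 1 would underestimate it.
    return bias_factor * true_value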

Figure 2: CartpoleSwingup experiments. Panels: (a) fixed λ; (b) fixed vs. decaying λ; (c) varying model bias; (d), (e) λ decay for two different planning horizons; (f) varying horizon vs. λ; (g) bias-variance trade-off. Solid lines show average rewards over 30 validation episodes (fixed start states) of length 100 steps and 3 runs with different seeds. The dashed lines show the average reward of MPPI for the same validation episodes. The shaded region depicts the standard error of the mean, which denotes the confidence in the average reward estimated from finite samples. Training is performed for 100k steps with validation after every 4k steps. When decaying λ according to a schedule, it is fixed to its current value during validation. In (b), (d), (e), (f), the reported λ denotes its value at the end of training. PPO asymptotic performance is reported as the average reward of the last 10 validation iterations. (g) shows the best validation reward at the end of training for different horizon values and numbers of MPPI trajectory samples (particles).

5.1 Analysis of Overall Performance

O 1. MPQ is able to overcome model bias in MPC for a wide range of λ values.

Fig. 2(a) shows a comparison of MPQ with MPPI using true and biased dynamics for various settings of λ. There exists a wide range of λ values for which MPQ can efficiently trade off model bias against the bias in the learned Q-function and outperform MPPI with biased dynamics. However, setting λ too high, which weighs longer horizons heavily, leads to poor performance, as the compounding effects of model bias are not compensated by the learned value function. Performance also drops when λ is decreased too far. MPQ outperforms MPPI with access to the true dynamics and reaches close to the asymptotic performance of PPO. This is not surprising, as the learned Q-function adds global information to the optimization and corrects for errors in optimizing over longer horizons.

O 2. Faster convergence can be achieved by decaying λ over time.

As more data is collected on the system, we expect the bias in $\hat{Q}_\theta$ to decrease, whereas model bias remains constant. A larger value of λ that favors longer horizons is better during the initial steps of training, as the effect of a randomly initialized $\hat{Q}_\theta$ is diminished by discounting and better exploration is achieved by forward lookahead. Conversely, as $\hat{Q}_\theta$ gets more accurate, model bias begins to hurt performance and a smaller λ is favorable. We test this by decaying λ using a fixed schedule and observe that faster convergence is indeed obtained by reducing the dependence on the model over training steps, as shown in Fig. 2(b). Figures 2(d) and 2(e) present ablations showing that MPQ is robust to a wide range of decay rates for two different planning horizons. When provided with true dynamics, MPPI with the shorter horizon performs better than with the longer one due to optimization issues with long horizons. MPQ reaches performance comparable to MPPI and the asymptotic performance of PPO in both cases, showing robustness to the horizon value, which is important since in practice we wish to set the horizon as large as our computation budget permits. However, decaying λ too fast or too slow can have adverse effects on performance. An interesting question for future work is whether λ can be adapted in a state-dependent manner. Refer to Appendix A.2 for details on the decay schedule.

Figure 3: Robustness and sample efficiency of MPQ. (a), (b) InHandManipulation reward and success rate when varying the bias factor over the mass, inertia, and friction of the pen. (c), (d) SawyerPegInsertion reward and success rate with noisy perception. The total episode length is 75 steps for both. The same bias factor is used for all altered properties per task. Curves depict average reward over 30 validation episodes with multiple seeds, and shaded areas show the standard error of the mean. Validation is done after every 3k steps, and λ is decayed to 0.85 by the end of 75k training steps in both. Asymptotic performance of PPO is the average of the last 10 validation iterations. Refer to Appendix A.2 for details on tasks and success metrics.
O 3. MPQ is much more sample efficient than model-free RL on high-dimensional continuous control tasks, while maintaining asymptotic performance.

Figures 2 and 3 show a comparison of MPQ with the model-free PPO baseline. In all cases, we observe that MPQ, through its use of approximate models, learned value functions, and a dynamically varying λ to trade off different sources of error, rapidly improves its performance and achieves average reward and success rate comparable to MPPI with access to ground-truth dynamics and to model-free RL in the limit. In InHandManipulation, PPO performance does not improve at all over 150k training steps. In SawyerPegInsertion, the small magnitude of the reward difference between MPPI with true and biased models is due to the fact that, despite model bias, MPC is able to get the peg close to the table, but sensor noise inhibits the precise control needed to consistently insert it into the hole. Here, the value function learned by MPQ can adapt to the sensor noise and allow for fine-grained control near the table.

O 4. MPQ is robust to a large degree of model misspecification.

Fig. 2(c) shows the effects of different values of the bias factor used to vary the mass of the cart and pole for MPQ with a fixed λ decay rate. MPQ achieves performance better than MPPI with true dynamics and comparable to model-free RL in the limit for a wide range of bias factors, and the convergence rate is generally faster for smaller bias. For very large bias factors, MPQ either fails to improve or diverges, as the compounding effects of model bias hurt learning, making model-free RL the more favorable alternative. A similar trend is observed in Figures 3(a) and 3(b), where MPQ outperforms MPPI with corresponding bias in the mass, inertia, and friction coefficients of the pen by a margin of over 30 in terms of success rate. It also achieves performance comparable to MPPI with true dynamics and to model-free RL, but is unable to do so for the largest bias factors. We conclude that although MPQ is robust to a large amount of model bias, if the model is extremely uninformative, relying on MPC can degrade performance.

O 5. MPQ is robust to the planning horizon and the number of trajectory samples in sampling-based MPC.

TD(λ)-based approaches are traditionally used for bias-variance trade-offs in model-free value function estimation. In our framework, λ plays a similar role, but it trades off bias due to the dynamics model and the learned value function against variance due to long-horizon rollouts. We empirically quantify this on the CartpoleSwingup task by training MPQ with different values of the planning horizon and different numbers of particles. Results in Fig. 2(g) show that (1) using λ can overcome the effects of model bias irrespective of the planning horizon (except for very small horizons or particle counts), and (2) using λ can overcome the variance due to a limited number of particles with long-horizon rollouts. The ablative study in Fig. 2(f) lends evidence to the fact that it is preferable to simply decay λ over time than to tune the discrete horizon value to balance model bias. Not only does decaying λ achieve a better convergence rate and asymptotic performance, the performance is also more robust to different decay rates (as evidenced by Fig. 2(d)), whereas the same does not hold for varying the horizon.

6 Conclusion

In this paper, we presented a general framework to mitigate model bias in MPC by blending in model-free value estimates using a parameter λ, to systematically trade off different sources of error. Our practical algorithm is theoretically well-founded and achieves performance close to MPC with access to the true dynamics and the asymptotic performance of model-free RL, while being sample efficient. An interesting avenue for future research is to vary λ in a state-adaptive fashion. In particular, reasoning about model and value function uncertainty may allow us to vary λ to rely more or less on our model in certain parts of the state space. Another promising direction is to extend the framework to explicitly incorporate constraints by exploring different constrained MPC approaches.

Acknowledgments

The authors would like to thank Aravind Rajeswaran for help with code for the peg insertion task.

References

  • P. Abbeel, A. Coates, and A. Y. Ng (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639. Cited by: §1.
  • P. Abbeel and A. Y. Ng (2005) Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pp. 1–8. Cited by: §A.1.2.
  • T. Anthony, Z. Tian, and D. Barber (2017) Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pp. 5360–5370. Cited by: §1.
  • M. Bhardwaj, A. Handa, D. Fox, and B. Boots (2020) Information theoretic model predictive q-learning. In Learning for Dynamics and Control, pp. 840–850. Cited by: §1, §4, §4, §5.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §1.
  • T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov (2013) An integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS International conference on humanoid robots (Humanoids), pp. 292–299. Cited by: §1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1.
  • S. Kakade, M. J. Kearns, and J. Langford (2003) Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312. Cited by: §A.1.2.
  • M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49 (2-3), pp. 209–232. Cited by: §3.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2.2.
  • G. Lee, B. Hou, S. Choudhury, and S. S. Srinivasa (2020) Bayesian residual policy optimization: scalable bayesian reinforcement learning with clairvoyant experts. arXiv preprint arXiv:2002.03042. Cited by: §3.2.
  • K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch (2018) Plan online, learn offline: efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848. Cited by: §A.2.2, §1, §4, §5.
  • D. Q. Mayne, E. C. Kerrigan, E. Van Wyk, and P. Falugi (2011) Tube-based robust nonlinear model predictive control. International Journal of Robust and Nonlinear Control 21 (11), pp. 1341–1353. Cited by: §1.
  • D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert (2000) Constrained model predictive control: stability and optimality. Automatica 36 (6), pp. 789–814. Cited by: §1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §1.
  • A. Rajeswaran*, V. Kumar*, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2018) Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), Cited by: §A.2.1, item 3, §5.
  • F. Ramos, R. C. Possas, and D. Fox (2019) BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728. Cited by: item 1.
  • S. Ross and J. A. Bagnell (2012) Agnostic system identification for model-based reinforcement learning. arXiv preprint arXiv:1203.1007. Cited by: §A.1.2, §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §5.
  • P. Shyam, W. Jaśkowski, and F. Gomez (2019) Model-based active exploration. In International Conference on Machine Learning, pp. 5779–5788. Cited by: §1.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §1, §1.
  • D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §1, §1.
  • C. Summers, K. Lowrey, A. Rajeswaran, S. Srinivasa, and E. Todorov (2020) Lyceum: an efficient and scalable ecosystem for robot learning. arXiv preprint arXiv:2001.07343. Cited by: §A.2.2.
  • W. Sun, J. A. Bagnell, and B. Boots (2018) Truncated horizon policy search: combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240. Cited by: §A.1, §3.1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §5.
  • N. Wagener, C. Cheng, J. Sacks, and B. Boots (2019) An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967. Cited by: §A.2.2, §4.
  • G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou (2016) Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1433–1440. Cited by: §1.
  • G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou (2017) Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. Cited by: §5.
  • M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov (2013) Value function approximation and model predictive control. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp. 100–107. Cited by: §1.

Appendix A

A.1 Proofs

We present upper bounds on the performance of a greedy policy when using approximate value functions and models. We also analyze the case of finite-horizon planning with an approximate dynamics model and terminal value function, which can be seen as a generalization of (Sun et al., 2018). For simplicity, we switch to using $\hat{V}$ to denote the learned model-free value function (instead of $\hat{Q}_\theta$).

Let $\hat{V}$ be an ε-approximation of the optimal value function, $\|\hat{V} - V^*\|_\infty \le \epsilon$. Let MDP $\hat{\mathcal{M}}$ be an α-approximation of $\mathcal{M}$ such that, for all $(s,a)$, we have $|\hat{c}(s,a) - c(s,a)| \le \alpha$ and $\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \alpha$.

A.1.1 A Gentle Start: Bound on Performance of the 1-Step Greedy Policy

Theorem A.1.

Let the one-step greedy policy be

$\hat{\pi}(s) = \arg\min_{a \in \mathcal{A}}\; \hat{c}(s, a) + \gamma\,\mathbb{E}_{s' \sim \hat{P}(\cdot|s,a)}\left[\hat{V}(s')\right] \qquad (12)$

The performance loss of $\hat{\pi}$ w.r.t. the optimal policy $\pi^*$ on MDP $\mathcal{M}$ is bounded by

(13)
Proof.

From (12) we have

(14)
(using )
(using )

Now, let be the state with the max loss ,

Substituting from (14)

Add and subtract and re-arrange

Since is the state with largest loss

Re-arranging terms we get

(15)

which concludes the proof. ∎

A.1.2 Bound on Performance of the H-Step Greedy Policy

Notation: For brevity, let us define the following macro,

$J_H(\pi, V, \mathcal{M})(s) = \mathbb{E}\left[\sum_{i=0}^{H-1}\gamma^i\,c(s_i, a_i) + \gamma^H\,V(s_H)\;\middle|\; s_0 = s,\; a_i \sim \pi,\; s_{i+1} \sim P\right] \qquad (16)$

which represents the expected cost achieved when executing policy π on $\mathcal{M}$ for H steps, using V as the terminal cost. We can substitute different policies, terminal costs, and MDPs. For example, $J_H(\pi, \hat{V}, \hat{\mathcal{M}})$ is the expected cost obtained by running policy π on the simulator $\hat{\mathcal{M}}$ for H steps with the approximate learned terminal value function $\hat{V}$.

Lemma A.1.

For a given policy , the optimal value function and MDPs the following performance difference holds

Proof.

We temporarily introduce a new MDP that has the same cost function as a , but transition function of

(17)

Let represent the difference in distribution of states encountered by executing on and respectively starting from state .

Expanding the RHS of (17)

(18)

Since the first state is the same

(19)

where the first inequality is obtained by applying the triangle inequality and the second one is obtained by applying the triangle inequality followed by the upper bound on the error in the cost function.

(20)

By choosing we can ensure that the term inside is upper-bounded by

(21)

The above lemma builds on similar results in (Kakade et al., 2003; Abbeel and Ng, 2005; Ross and Bagnell, 2012).

We are now ready to prove our main theorem, i.e. the performance bound of an MPC policy that uses an approximate model and approximate value function.

Proof of Theorem 3.1

Proof.

Since $\hat{\pi}$ is the greedy policy when using the approximate model $\hat{\mathcal{M}}$ and the learned value function $\hat{V}$,