Model-based Lookahead Reinforcement Learning

08/15/2019 ∙ by Zhang-Wei Hong, et al. ∙ 0

Model-based Reinforcement Learning (MBRL) allows data-efficient learning, which is required in real-world applications such as robotics. However, despite its impressive data-efficiency, MBRL does not achieve the final performance of state-of-the-art Model-free Reinforcement Learning (MFRL) methods. We leverage the strengths of both realms and propose an approach that obtains high performance with a small amount of data. In particular, we combine MFRL and Model Predictive Control (MPC). While MFRL's strength in exploration allows us to train a better forward dynamics model for MPC, MPC improves the performance of the MFRL policy by sampling-based planning. The experimental results in standard continuous control benchmarks show that our approach can achieve MFRL's level of performance while being as data-efficient as MBRL.




1 Introduction

Model-free Reinforcement Learning (MFRL) has succeeded in several domains, including video game playing [mnih2015human; mnih2016a3c] and robot control [lillicrap2016ddpg; schulman2015trust]. However, high sample complexity prevents applying MFRL to most complex real-world applications: MFRL directly optimizes the agent's policy from interactions with the environment, which usually requires millions of data samples [mnih2015human; mnih2016a3c; lillicrap2016ddpg].

Contrary to MFRL, model-based reinforcement learning (MBRL) is typically more data-efficient. However, when the perfect model is inaccessible, a forward dynamics model of the environment must be approximated [deisenroth2011pilco; kurutach2018modelensemble; nagabandi2017neural; kamthe2018gpmpc; chua2018deep]. Even with the additional cost of learning the forward dynamics model, MBRL often requires fewer interactions with the environment than learning a policy directly. Nevertheless, a major disadvantage of MBRL is that, due to model approximation errors [nagabandi2017neural], MBRL commonly cannot achieve the performance of MFRL at convergence.

Prior work takes advantage of both MFRL and MBRL to some extent. levine2013guided show that generating training samples for MFRL by model-based trajectory optimization may increase the performance of MFRL. In contrast, gu2016continuous find that insufficient exploration of MBRL may impair the performance of the resultant policy, and nagabandi2017neural show that naive exploration for training forward dynamics models limits the accuracy of the acquired forward dynamics model. Apart from generating samples using MBRL, silver2016mastering; tamar2016value; oh2017value; lowrey2018plan combine MFRL with online model-based planning (i.e. MBRL) to improve the performance of the reinforcement learning (RL) agent, but these methods are either restricted to discrete state and action spaces, rely on the assumption that the state space has 2-D structure, or assume a perfect forward dynamics model. In sum, despite the notable performance of combining MFRL with model-based planning, the strong assumptions prohibit applications to more complex tasks.

We propose a novel framework that unifies MFRL and MBRL through Model Predictive Control (MPC). Our approach leverages the merits of MFRL and MBRL at both training and testing time. At training time, we utilize the exploratory policy of MFRL to collect more diverse training samples in the environment than MBRL [gu2016continuous; nagabandi2017neural] and thus prevent the impact of insufficient exploration on policies and forward dynamics models. Next, at testing time, we combine sampling-based MPC (a model-based online planning approach) with MFRL to further increase the performance beyond training. Different from contemporary works [silver2016mastering; lowrey2018plan], our approach does not assume a perfect forward dynamics model or a particular structure of the state/action space. We use an approximated forward dynamics model that can be applied in arbitrary state/action spaces. Furthermore, we jointly leverage the value function and the policy to improve MPC planning performance and data-efficiency. Finally, we propose a soft-greedy approach that improves action selection in planning. To the best of our knowledge, we are the first to combine MFRL and MPC under minimal assumptions about the state/action space.

We demonstrate the effectiveness of our approach on well-known challenging continuous control tasks in MuJoCo [todorov2012mujoco]. The experimental results show that our approach leads to better performance with less data than using MFRL or MPC independently, particularly in complex tasks, and thus confirm that our approach of combining MBRL and MFRL is effective. Furthermore, the results show that our approach improves the quality of forward dynamics model training. In addition, we provide empirical analysis to verify the limitations of contemporary approaches.

The contributions of this paper are the following:

  • We show that using an MFRL policy to collect training data improves the quality of the trained forward dynamics model.

  • We show that using an MFRL policy can enhance MPC's performance.

  • We show advantages and limitations of using value function in MPC.

  • We provide a comprehensive evaluation of each design decision for combining MPC and MFRL.

The remaining parts of this paper are organized as follows. Section 2 introduces the background. Section 3 describes our new approach. Section 4 presents and analyzes the experimental results. Section 5 reviews related work. Finally, Section 6 concludes the paper.

2 Background

In order to elaborate the motivation and the implementation of the proposed method, this section starts with an introduction to Reinforcement Learning (RL) [sutton1998introduction], then explains MFRL, and finally discusses recently proposed MPC approaches for MBRL.

2.1 Reinforcement Learning (RL)

In this paper, we study a standard deterministic discrete time RL problem, consisting of a 4-tuple $(\mathcal{S}, \mathcal{A}, r, f)$. $\mathcal{S}$ denotes the state space and $\mathcal{A}$ the action space. $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a task-specific reward function encoding the task objective. $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the true forward dynamics model. At each time step $t$, an RL agent perceives the current state $s_t$, takes the control action $a_t$ and observes the next state $s_{t+1} = f(s_t, a_t)$ and the immediate reward $r_t = r(s_t, a_t)$. The training objective of RL is to search for an optimal controller that selects $a_t$ for each $s_t$ such that the expected return $\mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$ is maximized, where $T$ represents the horizon of the task and $\gamma$ denotes the constant discount factor. Note that while this paper adheres to deterministic cases for simplicity, our framework can be easily extended to a stochastic formulation.
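As an illustration of the return defined in this section, the following minimal sketch rolls out a deterministic system and accumulates discounted rewards. The function name `rollout_return` and the toy 1-D dynamics, reward, and controller are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def rollout_return(f, r, controller, s0, T, gamma):
    """Run one deterministic episode of T steps; return (discounted return, final state)."""
    s = np.asarray(s0, dtype=float)
    total = 0.0
    for t in range(T):
        a = controller(s)             # a_t is chosen from the current state s_t
        total += gamma**t * r(s, a)   # accumulate gamma^t * r(s_t, a_t)
        s = f(s, a)                   # deterministic transition s_{t+1} = f(s_t, a_t)
    return total, s

# Hypothetical toy system: the action shifts the state, the reward penalizes distance from 0.
f = lambda s, a: s + a
r = lambda s, a: -float(np.sum(s**2))
controller = lambda s: -0.5 * s

ret, s_final = rollout_return(f, r, controller, s0=[2.0], T=3, gamma=0.9)
print(ret, s_final)
```

With these toy choices the state halves at every step, so the return is $-4 - 0.9 \cdot 1 - 0.81 \cdot 0.25 = -5.1025$.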

2.2 Model-free Reinforcement Learning (MFRL)

We discuss both the training and evaluation phases in MFRL. First, using data collected by an exploratory policy $\pi_e$, the training phase learns a possibly stochastic control policy $\pi_\theta(a_t \mid s_t)$ that infers the action $a_t$ which maximizes the expected return for $s_t$, and a value function $V_\psi(s_t)$ estimating expected returns for $s_t$. $\theta$ and $\psi$ denote parameters. In this paper, we adhere to an appealing MFRL approach, policy gradient [sutton2000policy], which updates $\theta$ along the direction $\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(R_t - b(s_t))\right]$, where $R_t$ is the return from time step $t$. The baseline $b(s_t)$ is for variance reduction [williams1992simple]. A common choice for $b(s_t)$ is the value function estimate $V_\psi(s_t)$. $V_\psi$ can be obtained by various value function approximation approaches [schulman2015high; sutton1998introduction; bertsekas2005dynamic]. Finally, the evaluation phase simply samples control actions using $\pi_\theta$: $a_t \sim \pi_\theta(\cdot \mid s_t)$.
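The policy gradient direction with a baseline can be sketched as follows. This is a minimal Monte Carlo estimator under stated assumptions: a Gaussian policy with a linear mean `W @ s` and a fixed isotropic standard deviation `sigma`. Both the parameterization and the function names are hypothetical; the paper itself uses TRPO with neural networks.

```python
import numpy as np

def gaussian_score(W, sigma, s, a):
    """Gradient of log N(a; W @ s, sigma^2 I) with respect to W."""
    return np.outer((a - W @ s) / sigma**2, s)

def policy_gradient(W, sigma, states, actions, returns, baselines):
    """Monte Carlo estimate of E[grad log pi(a|s) * (R - b(s))]."""
    grad = np.zeros_like(W)
    for s, a, R, b in zip(states, actions, returns, baselines):
        grad += gaussian_score(W, sigma, s, a) * (R - b)  # weight the score by the advantage
    return grad / len(states)

# One hypothetical sample: the action exceeded the mean and the return beat the baseline,
# so the gradient pushes the mean toward that action.
W = np.zeros((1, 2))
grad = policy_gradient(
    W, sigma=1.0,
    states=[np.array([1.0, 0.0])],
    actions=[np.array([2.0])],
    returns=[3.0],
    baselines=[1.0],
)
print(grad)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is why a value function estimate is a common choice for it.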

2.3 Model Predictive Control for Model-based Reinforcement Learning (MPC-MBRL)

MPC has long been prevalent in robotic control [garcia1989model; tassa2014control] and has recently been applied to MBRL [nagabandi2017neural; kamthe2018gpmpc; chua2018deep; williams2017information]. MPC-MBRL simply trains a forward dynamics model to plan the control action at each time step during evaluation.

The training phase trains an approximated forward dynamics model $\hat{f}_\phi$ with parameters $\phi$, since the real forward dynamics model $f$ is usually inaccessible [nagabandi2017neural; kamthe2018gpmpc]. MPC-MBRL first collects an initial dataset $\mathcal{D}$ consisting of a series of transitions $(s_t, a_t, s_{t+1})$ by using uniformly random exploration in the environment, and then optimizes $\phi$ by minimizing a prediction loss (Eq. 1) on a batch of transitions sampled from $\mathcal{D}$. Next, MPC-MBRL appends data from online execution of MPC-planning to $\mathcal{D}$ [nagabandi2017neural; chua2018deep], and trains the forward dynamics model according to Eq. 1 again.
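The model-fitting step can be illustrated with a small sketch. It assumes a mean squared error between predicted and observed next states and a linear model trained by minibatch gradient descent; both are simplifying assumptions made for illustration (the paper uses neural networks, and the exact form of its loss is not reproduced here). The names `train_dynamics`, `A_true`, and `B_true` are hypothetical.

```python
import numpy as np

def train_dynamics(states, actions, next_states, steps=1000, batch_size=64, lr=0.2, seed=0):
    """Fit next_state ~= W @ [state; action] by minibatch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([states, actions], axis=1)    # inputs, shape (num_transitions, ds + da)
    Y = next_states                                  # targets, shape (num_transitions, ds)
    W = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch_size)   # sample a batch from the dataset
        err = X[idx] @ W.T - Y[idx]                      # prediction error on the batch
        W -= lr * 2.0 * err.T @ X[idx] / batch_size      # gradient step on the mean squared error
    return W

# Dataset from uniformly random exploration of a hypothetical linear system.
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [0.5]])
S = rng.uniform(-1.0, 1.0, size=(512, 2))
U = rng.uniform(-1.0, 1.0, size=(512, 1))
S_next = S @ A_true.T + U @ B_true.T

W = train_dynamics(S, U, S_next)
W_true = np.hstack([A_true, B_true])
print(np.abs(W - W_true).max())
```

Appending new transitions to `S`, `U`, and `S_next` and calling `train_dynamics` again corresponds to the data aggregation step described above.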

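In the evaluation phase, MPC-MBRL computes at each time step $t$ the control action $a_t$ for the current state $s_t$, with model-based planning consisting of three stages: trajectory sampling, trajectory evaluation, and action selection. The trajectory sampling stage simulates a set of trajectories $\{\tau^n\}_{n=1}^{N}$ starting from $s_0^n = s_t$, where $N$ denotes the number of simulated trajectories, $n$ denotes the index of a trajectory within the set, and $H$ indicates the planning horizon. $\tau^n$ can be obtained by sequentially applying

$$a_h^n \sim p(a), \qquad s_{h+1}^n = \hat{f}_\phi(s_h^n, a_h^n), \tag{2}$$

where $s_h^n$ denotes the simulated state at planning step $h$ within $\tau^n$, $a_h^n$ is the action applied to $s_h^n$, and $p(a)$ denotes a given action distribution. Next, the trajectory evaluation stage evaluates each trajectory with the task-specific reward function $r$ and the terminal reward function $r_{\mathrm{term}}$:

$$R(\mathbf{s}^n, \mathbf{a}^n) = \sum_{h=0}^{H-1} \gamma^h \, r(s_h^n, a_h^n) + \gamma^H \, r_{\mathrm{term}}(s_H^n), \tag{3}$$

where $R(\mathbf{s}^n, \mathbf{a}^n)$ denotes the simulated accumulated rewards associated with $\tau^n$, and $\mathbf{s}^n$ and $\mathbf{a}^n$ denote the state and action sequences extracted from $\tau^n$ respectively. Although $r_{\mathrm{term}}$ is typically ignored (i.e. $r_{\mathrm{term}}(\cdot) = 0$), we use $r_{\mathrm{term}}$ to illustrate our approach in Section 3. Finally, the action selection stage selects the first control action of the action sequence which yielded the highest value w.r.t. Eq. 3:

$$a_t = a_0^{n^*}, \qquad n^* = \arg\max_{n} R(\mathbf{s}^n, \mathbf{a}^n). \tag{4}$$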
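The three planning stages can be sketched as a random-shooting planner. This is a minimal illustration, assuming a uniform action distribution over a box, batched `model` and `reward_fn` callables, and a discounted sum of simulated rewards; the function name and the toy system are hypothetical.

```python
import numpy as np

def mpc_random_shooting(model, reward_fn, s_t, N, H, low, high, rng, gamma=0.99, terminal_fn=None):
    """Return the first action of the best of N simulated trajectories of length H."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    states = np.tile(np.asarray(s_t, dtype=float), (N, 1))   # every trajectory starts at s_t
    returns = np.zeros(N)
    first_actions = None
    for h in range(H):
        actions = rng.uniform(low, high, size=(N, len(low)))  # trajectory sampling
        if h == 0:
            first_actions = actions
        returns += gamma**h * reward_fn(states, actions)      # trajectory evaluation
        states = model(states, actions)                       # simulate one step with the model
    if terminal_fn is not None:
        returns += gamma**H * terminal_fn(states)             # optional terminal reward
    return first_actions[np.argmax(returns)]                  # greedy action selection

# Hypothetical 1-D system: the planner should push the state from 1.0 toward 0.
model = lambda s, a: s + a
reward_fn = lambda s, a: -np.sum(s**2, axis=1) - 0.01 * np.sum(a**2, axis=1)

a_t = mpc_random_shooting(model, reward_fn, s_t=[1.0], N=500, H=3,
                          low=[-1.0], high=[1.0], rng=np.random.default_rng(0))
print(a_t)
```

Only the first action is executed; planning is repeated from the next observed state, which is what makes the scheme a receding-horizon controller.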


3 The approach: Model Predictive Control with Model-free Reinforcement Learning (MPC-MFRL)

Typical MFRL algorithms and MBRL methods do not fully utilize the beneficial information that can be extracted from the environment. MFRL ignores the information encapsulated in the forward dynamics of the environment, while MPC neglects the utility of policies and value functions. Contrary to both, our approach jointly leverages forward dynamics, policies, and value functions, and is therefore more likely to perform better when data is scarce. Section 3.1 details the proposed training method and Section 3.2 the proposed online hybrid MPC-MFRL approach.

Figure 1: Overview of MPC-MFRL at evaluation time: In state $s_t$, MPC-MFRL samples trajectories using an MFRL policy, evaluates sampled trajectories by an MFRL value function, and then chooses an action based on Eq. 4. The environment transitions to $s_{t+1}$ and the process starts from the beginning. The upper row illustrates planning in simulation. The lower row depicts interaction with the real environment.

3.1 The training phase: learning a policy, value function, and a forward dynamics model

We jointly train the control policy $\pi_\theta$, the value function $V_\psi$, and the forward dynamics model $\hat{f}_\phi$ using the same data at the same time. The training phase iteratively executes the following three steps until convergence. First, we gather a trajectory $\tau$ using the exploratory policy $\pi_e$ (if using on-policy RL, $\pi_e$ must be set as $\pi_\theta$). $\pi_e$ collects data by interacting with the environment: $\tau = \{(s_t, a_t, r_t, s_{t+1})\}$ (as Section 2.1 describes). Second, we update $\pi_\theta$ and $V_\psi$ using $\tau$, as Section 2.2 describes. In this paper, we use Trust Region Policy Optimization (TRPO) [schulman2015trust], a trust region policy gradient method, for training, since TRPO has been successful in several domains and is a theoretically sound policy gradient algorithm, but other MFRL methods [lillicrap2016ddpg; schulman2017proximal] could be used instead. Finally, we append $\tau$ to the training dataset $\mathcal{D}$ ($\mathcal{D}$ is initialized as the empty set), and optimize the forward dynamics model $\hat{f}_\phi$ by minimizing the loss function defined in Eq. 1 with batches sampled from $\mathcal{D}$. Algorithm 5 in supplementary material details the training scheme. Note that though this paper focuses on training a deterministic forward dynamics model, probabilistic models [chua2018deep] or non-parametric models [kamthe2018gpmpc] can be easily applied in our approach.

Jointly training $\pi_\theta$, $V_\psi$, and $\hat{f}_\phi$ has the following advantages. First and foremost, $\pi_e$ can collect more extensive data for the forward dynamics model than uniform random exploration and MPC on-policy data aggregation [nagabandi2017neural]. Optimized to maximize reward, $\pi_e$ can collect successful experiences which uniform random exploration cannot. Also, MPC simply exploits rewards, while an MFRL exploratory policy balances exploitation and exploration and thereby promotes data diversity. Second, interacting with an environment using $\pi_e$ requires less computation time than collecting the training dataset using MPC-planning [nagabandi2017neural; chua2018deep]. Finally, our joint training procedure maximizes the utility of each interaction with the environment, compared to simply learning a policy (MFRL) or a forward dynamics model (MBRL).

3.2 The evaluation phase: planning the control action using the policy, the value function, and the forward dynamics model

Our approach at evaluation time is built on top of the MPC framework defined in Section 2.3 and improves each stage by MFRL. We use MFRL's control policy for trajectory sampling, MFRL's value function for trajectory evaluation, and a soft-greedy approach for action selection. Fig. 1 illustrates the planning process and Algorithm 1 summarizes the approach. Next, we discuss how our method improves trajectory sampling, trajectory evaluation, and action selection.

Trajectory sampling.

By replacing the action distribution $p(a)$ in Eq. 2 with the MFRL control policy $\pi_\theta$, our method can more efficiently sample trajectories of high value than uniform random sampling [richards2005robust] and the Cross Entropy Method (CEM) [rubinstein1999cross]. We sample $\tau^n$ as follows:

$$a_h^n \sim \pi_\theta(\cdot \mid s_h^n), \qquad s_{h+1}^n = \hat{f}_\phi(s_h^n, a_h^n). \tag{5}$$

The MFRL control policy improves trajectory sampling in the following aspects. To begin with, $\pi_\theta$ can readily result in high-value trajectories since $\pi_\theta$ is trained to maximize the expected return. Additionally, even before convergence, $\pi_\theta$ can restrain the search space in trajectory sampling and thus makes trajectories of high value more likely to be sampled than in uniform sampling [richards2005robust]. Moreover, in contrast to CEM [rubinstein1999cross], which uses simulated data for optimization, our method is less susceptible to forward dynamics approximation error since $\pi_\theta$ is trained with real experience. Finally, our method does not require costly computations since it does not rely on online iterative optimization like CEM [rubinstein1999cross].

Trajectory evaluation.

Similar to lowrey2018plan, our approach substitutes the terminal reward function $r_{\mathrm{term}}$ (Eq. 3) with the MFRL value function $V_\psi$ to resolve the shortsighted planning problem mentioned in Section 2.3. Trajectory evaluation becomes:

$$R(\mathbf{s}^n, \mathbf{a}^n) = \sum_{h=0}^{H-1} \gamma^h \, r(s_h^n, a_h^n) + \gamma^H \, V_\psi(s_H^n). \tag{6}$$

The MFRL value function $V_\psi$ in Eq. 6 estimates the expected return of a given state (see Section 2.2) and therefore prevents shortsighted planning even with a short planning horizon $H$. More importantly, planning with a short horizon avoids compounding errors in long-horizon simulation and saves computation time as well.
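The effect of a value function as terminal reward can be shown with a small numeric sketch. It assumes a discounted sum of simulated rewards plus a discounted value estimate of the final simulated state; the function name `evaluate_trajectories` and the toy numbers are hypothetical.

```python
import numpy as np

def evaluate_trajectories(rewards, terminal_states, value_fn, gamma):
    """Score N trajectories: discounted simulated rewards plus discounted terminal value."""
    rewards = np.asarray(rewards, dtype=float)             # shape (N, H)
    H = rewards.shape[1]
    discounts = gamma ** np.arange(H)                      # gamma^0, ..., gamma^(H-1)
    terminal = value_fn(np.asarray(terminal_states, dtype=float))
    return rewards @ discounts + gamma**H * terminal

# Two hypothetical trajectories with horizon H = 2.
rewards = [[1.0, 1.0],    # collects reward immediately, ends in a low-value state
           [0.0, 0.0]]    # collects nothing yet, ends in a high-value state
terminal_states = [[0.0], [2.0]]
value_fn = lambda s: 10.0 * s[:, 0]
zero_fn = lambda s: np.zeros(len(s))

with_value = evaluate_trajectories(rewards, terminal_states, value_fn, gamma=0.5)
without_value = evaluate_trajectories(rewards, terminal_states, zero_fn, gamma=0.5)
print(with_value, without_value)
```

Without the terminal value the first trajectory looks best (1.5 versus 0.0); with it the ranking flips (1.5 versus 5.0), which is the shortsightedness that a value function is meant to correct at short horizons.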

Different from lowrey2018plan, we use an approximated dynamics model rather than a perfect one. The assumption of a perfect dynamics model is unrealistic in most cases, especially on complex tasks, where the perfect dynamics model is inaccessible. In addition, estimating the expected return of a simulated state could be risky since the value function trained with real experience is unlikely to be accurate in states absent from the training data. Moreover, jointly using value function estimation and plain random action sampling in the simulation of MPC-planning could magnify this problem, since unconstrained or weakly constrained action sampling may induce many unreachable states (e.g. states that violate physical constraints) for which the value function cannot produce accurate estimates.

However, our approach can alleviate the above problem by sampling actions with an MFRL policy in trajectory sampling. MFRL policies can prevent agents from performing actions that could lead to unreachable states, since MFRL policies are trained to maximize the expected reward of the task and thus are less likely to perform those infeasible actions.

Action selection.

If we simply take the control action as the sampled action that yielded the maximum expected return (Eq. 4), the agent may be overly optimistic about the simulated results, which is likely to impair the performance. Thus, we propose a soft-greedy approach to alleviate the impact of approximation errors in the forward dynamics model $\hat{f}_\phi$. Our soft-greedy approach takes the control action as the average over the first actions of the $K$ best action sequences w.r.t. the simulated return $R$. Formally, we describe it as follows:

$$a_t = \frac{1}{K} \sum_{k=1}^{K} a_0^{\sigma(k)}, \tag{7}$$

where $\sigma$ sorts all action sequences according to $R$ in descending order.

Our proposed soft-greedy action selection approach can prevent biasing toward the best action obtained from imperfect simulation. Averaging has been shown to alleviate the inherent bias of the max-operator (Eq. 4) [everitt2011miscellaneous]. Preventing bias due to the max-operator allows MPC-MFRL to operate with inaccurate forward dynamics models caused by, for example, underfitting or overfitting forward dynamics models [burnham2003model].
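Soft-greedy selection can be sketched in a few lines, assuming the first actions and simulated returns of all sampled trajectories are available as arrays. The function name `soft_greedy_action` and the example values are hypothetical.

```python
import numpy as np

def soft_greedy_action(first_actions, returns, K):
    """Average the first actions of the K trajectories with the highest simulated return."""
    order = np.argsort(returns)[::-1]            # trajectory indices, best return first
    return first_actions[order[:K]].mean(axis=0)

first_actions = np.array([[0.0], [1.0], [2.0], [3.0]])
returns = np.array([1.0, 5.0, 3.0, 4.0])

soft = soft_greedy_action(first_actions, returns, K=2)    # averages trajectories 1 and 3
greedy = soft_greedy_action(first_actions, returns, K=1)  # K = 1 recovers greedy selection
print(soft, greedy)
```

Setting `K=1` reduces the rule to the max-operator, so `K` controls how strongly a single, possibly over-optimistic simulated trajectory can dominate the chosen action.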

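  Input: a control policy $\pi_\theta$, a value function $V_\psi$, a forward dynamics model $\hat{f}_\phi$, number of simulated trajectories $N$, planning horizon $H$, number of best action sequences $K$, a task-specific reward function $r$
  i. Trajectory sampling
  for n = 1, ..., N do
     $s_0^n \leftarrow s_t$
     for h = 0, ..., H-1 do
        Sample $a_h^n$ and compute $s_{h+1}^n$ according to Eq. 5
     end for
  end for
  ii. Trajectory evaluation
  Compute $R(\mathbf{s}^n, \mathbf{a}^n)$ for each trajectory according to Eq. 6
  iii. Action selection
  Compute $a_t$ according to Eq. 7
Algorithm 1 MPC-MFRL (Evaluation)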
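Algorithm 1 can be sketched end to end as follows. This is a minimal illustration rather than the authors' implementation: `policy_sample`, `value_fn`, `model`, and `reward_fn` are hypothetical batched callables standing in for the trained networks, and the discounting of simulated rewards is an assumption.

```python
import numpy as np

def mpc_mfrl_plan(s_t, policy_sample, value_fn, model, reward_fn, N, H, K, gamma, rng):
    """One planning step: policy-guided sampling, value-based evaluation, soft-greedy selection."""
    states = np.tile(np.asarray(s_t, dtype=float), (N, 1))   # all N trajectories start at s_t
    returns = np.zeros(N)
    first_actions = None
    # i. Trajectory sampling with the policy and the learned model
    for h in range(H):
        actions = policy_sample(states, rng)
        if h == 0:
            first_actions = actions
        returns += gamma**h * reward_fn(states, actions)
        states = model(states, actions)
    # ii. Trajectory evaluation: add the value estimate of the final simulated state
    returns += gamma**H * value_fn(states)
    # iii. Soft-greedy action selection over the K best trajectories
    top_k = np.argsort(returns)[::-1][:K]
    return first_actions[top_k].mean(axis=0)

# Hypothetical stand-ins for the trained components on a 1-D toy task.
policy_sample = lambda s, rng: -0.5 * s + 0.3 * rng.standard_normal(s.shape)
value_fn = lambda s: -np.sum(s**2, axis=1)
model = lambda s, a: s + a
reward_fn = lambda s, a: -np.sum(s**2, axis=1)

action = mpc_mfrl_plan([1.0], policy_sample, value_fn, model, reward_fn,
                       N=200, H=3, K=10, gamma=0.99, rng=np.random.default_rng(0))
print(action)
```

In a control loop, `action` would be applied to the real environment and the planner called again from the newly observed state.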

4 Experiments

The experiments are designed to answer the main question of whether MPC-MFRL successfully leverages MPC to bridge the gap between MBRL and MFRL, overcoming the drawbacks of prior works. Moreover, we perform additional experiments to answer the following more detailed questions: (1) Do we obtain a more accurate forward dynamics model by collecting training samples using an exploratory policy of MFRL? (2) Does an MFRL control policy enhance the planning performance of MPC? (3) Can the value function improve the performance of MPC planning even with an approximated forward dynamics model? (4) Does the proposed soft-greedy action selection improve performance under forward dynamics model approximation errors? Next, we briefly introduce the experimental setup, and then discuss the experimental results.

4.1 Experimental setup

Benchmark tasks.

We use standard continuous control MuJoCo benchmark tasks from OpenAI Gym [gym] that vary in difficulty and in the dimensions of the state and action spaces: Swimmer, Reacher, HalfCheetah, and Ant.

Implementation of MPC-MFRL.

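The MFRL control policy $\pi_\theta$, the MFRL value function $V_\psi$, and the approximated forward dynamics model $\hat{f}_\phi$ are implemented as neural networks. $\pi_\theta$ is modeled as a multivariate Gaussian distribution, $\pi_\theta(a_t \mid s_t) = \mathcal{N}(\mu_\theta(s_t), \Sigma_\theta(s_t))$, where $\mu_\theta(s_t)$ and $\Sigma_\theta(s_t)$ are the mean vector and the covariance matrix conditioned on learned parameters $\theta$ and current state $s_t$. We optimize $\pi_\theta$ by TRPO [schulman2015trust], while updating $V_\psi$ and $\hat{f}_\phi$ using the Adam optimizer [kingma2014adam].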
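Drawing an action from a Gaussian policy, stochastically or deterministically, can be sketched as below. The mean and covariance would come from the policy network; here they are fixed hypothetical values, and `gaussian_policy_action` is an illustrative helper rather than part of the paper.

```python
import numpy as np

def gaussian_policy_action(mean, cov, rng=None):
    """Sample a ~ N(mean, cov) when rng is given; otherwise return the mean action."""
    mean = np.asarray(mean, dtype=float)
    if rng is None:
        return mean                              # deterministic action
    return rng.multivariate_normal(mean, cov)    # stochastic action

mean = [0.5, -0.2]
cov = np.diag([0.04, 0.01])

stochastic = gaussian_policy_action(mean, cov, rng=np.random.default_rng(0))
deterministic = gaussian_policy_action(mean, cov)
print(stochastic, deterministic)
```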


For comparison against both MFRL and MPC-MBRL methods we select the following baselines (see supplementary material for details):

  • MF (S): TRPO that uses stochastic actions for evaluation.

  • MF (D): TRPO that uses deterministic actions for evaluation (i.e. the mean of the policy's action distribution).

  • MPC-Random: MPC that samples actions from a uniform distribution for trajectory roll-outs (Eq. 2) and uses uniform random exploration with on-policy data aggregation (see Section 2.3) for training the forward dynamics model.

  • MPC-CEM: MPC that uses the same model training approach as MPC-Random, while using CEM [rubinstein1999cross] for trajectory sampling since CEM has been shown to work well with MPC in prior work [chua2018deep].

4.2 Evaluation procedure

In order to assess the performance and data-efficiency of each method, we evaluate each method offline [chua2018deep] periodically w.r.t. the number of training samples used. At offline evaluation time, we fix all model parameters and measure the average total reward over 10 episodes. We then report the mean and bootstrapped confidence interval of the best average total reward (denoted "Average return" in figures below) so far over 5 distinct random seeds.
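The reporting procedure can be sketched as follows. The percentile bootstrap, the 95% level, and the number of resamples are assumptions made for illustration, since the text does not specify them; the function names and the example numbers are hypothetical.

```python
import numpy as np

def best_so_far(avg_returns):
    """Running maximum of the periodic evaluation results of one seed."""
    return np.maximum.accumulate(np.asarray(avg_returns, dtype=float))

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Mean over seeds with a percentile-bootstrap confidence interval of that mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))  # resample seeds with replacement
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), low, high

curve = best_so_far([1.0, 3.0, 2.0, 5.0, 4.0])   # one seed, five evaluation points
per_seed = [10.0, 12.0, 11.0, 15.0, 9.0]         # best-so-far value of five seeds at one point
mean, low, high = bootstrap_ci(per_seed)
print(curve, mean, low, high)
```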

4.3 The results of overall performance

Figure 2: Mean and bootstrapped confidence interval (solid lines and error bars, over 5 distinct random seeds) of "Average return" (see Section 4.2 for evaluation details and definition of "Average return") for different methods. "Num. timestep (M)" is the number of millions of interactions with the environment. Our method MPC-MFRL outperforms comparison methods. For comparison method and evaluation details see Section 4.1.

Fig. 2 shows the performance of each method w.r.t. the number of samples. MPC-MFRL achieves better performance than all baseline methods: MPC-MFRL exceeds the performance of MPC-MBRL while being more data-efficient than MFRL. Moreover, the improvement is particularly significant in the more challenging tasks like Ant and HalfCheetah. To conclude, this result verifies the effectiveness of MPC-MFRL on common benchmark tasks.

Interestingly, Fig. 2 shows that MPC-CEM loses to MPC-Random in Ant-v2 and HalfCheetah-v2. Also, we find that even though MPC-CEM obtains the highest expected return in simulation (Eq. 3), the expected return in the real environment is surprisingly low. Prior work on MBRL [sutton2012dyna; kurutach2018modelensemble] suggests that training a policy using fictitious data impairs performance. Thus, the poor performance of MPC-CEM in complex tasks could be caused by optimizing the action sequences using simulated data with high approximation errors.

4.4 The results of improved exploration for training

We show that our training approach improves the training quality of forward dynamics models (Section 2.3) by comparing various training schemes w.r.t. testing error of the forward dynamics model and evaluation performance. Fig. 5(a) shows the testing error of the forward dynamics models trained by different schemes. Half of the testing set consists of data from Random+MPC and half from Policy (see Section C in supplementary material for more details). Policy reduces the testing error faster and further than Random+MPC, which shows that collecting data by an MFRL policy can increase the accuracy of forward dynamics models. In addition to the accuracy gain, Fig. 5(b) further shows that MPC-MFRL (Policy) outperforms MPC-MFRL (Random+MPC), especially in later stages of evaluation. To summarize, these results verify that our approach is superior to prior approaches in terms of accuracy of forward dynamics models and evaluation performance, and also show that the accuracy of forward dynamics models greatly affects the performance.

Figure 5: (a) We measure "Average testing error" of a forward dynamics model using a pre-collected testing dataset. Policy indicates collecting data using an MFRL policy, while Random+MPC represents uniform random exploration with on-policy data aggregation [nagabandi2017neural]; (b) The evaluation results of MPC-MFRL with different training schemes: MPC-MFRL (Policy) is the original MPC-MFRL, while MPC-MFRL (Random+MPC) trains the forward dynamics model using data from Random+MPC. The remaining legends are the same as Fig. 2.

4.5 The results of sampling trajectories by an MFRL policy

Fig. 8(a) shows that MPC-MFRL with policy-based trajectory sampling outperforms MPC-MFRL with uniform trajectory sampling in all tasks, and therefore suggests that the MFRL control policy can more readily sample trajectories of high value in the real environment than the baselines. Furthermore, the policy-sampling variant still surpasses the uniform-sampling variant even in the early stages, where the pure model-free variants of MPC-MFRL (i.e. MF (S/D)) perform poorly compared to the uniform-sampling variant. This result confirms that even though the MFRL control policy has not yet converged, our method can still enhance MPC planning performance, and also shows the improvement in data-efficiency over MFRL. Note that we omit the comparison with CEM due to its poor performance in Fig. 2.

Figure 8: (a) Varying trajectory sampling methods: the two MPC-MFRL variants respectively denote the original MPC-MFRL and MPC-MFRL that replaces the MFRL policy with a uniform distribution for trajectory sampling. See Fig. 2 for details on the notation in the figure; (b) The evaluation results of different trajectory evaluation methods and planning horizons: each MPC-MFRL variant is labeled with its terminal reward function and its planning horizon. See Fig. 2 for more details on figure notation.

4.6 The strength and the limitations of trajectory evaluation with a value function

We investigate the strengths and limitations of our trajectory evaluation approach in this section. Fig. 8(b) shows that, at the same planning horizon, MPC-MFRL with the value function as terminal reward outperforms MPC-MFRL without it, and thereby verifies that our approach (Eq. 6) is effective. However, contrary to the results in prior work [lowrey2018plan], Fig. 8(b) shows that MPC-MFRL () approximates MPC-MFRL (), which suggests that compounding errors in a longer planning horizon may impair the benefit of the value function under an approximated dynamics model. In addition, we find that in the experiments of Section 4.5, the terminal rewards (i.e. the value estimates of the final simulated states) in simulated trajectories of the uniform-sampling variant are similar to those of the policy-sampling variant. This observation implies that uniform random sampling may reach states that lead to overestimation by the value function under an approximated dynamics model. A detailed investigation is left as future work.

4.7 The results of soft-greedy action selection

This section verifies that our soft-greedy action selection approach improves performance under an approximate forward dynamics model, and then studies its effectiveness with models of varying complexity. Fig. 11(a) shows that MPC-MFRL (w/ SG) outperforms MPC-MFRL (w/o SG), suggesting that soft-greedy action selection increases performance under an approximated forward dynamics model. Fig. 11(b) shows that MPC-MFRL (w/ SG) surpasses MPC-MFRL (w/o SG) for all model complexities, thereby suggesting that our soft-greedy action selection is superior to classical greedy action selection in both simple and complex models.

Figure 11: (a) The evaluation results of different action selection approaches: MPC-MFRL (w/ SG) indicates the original MPC-MFRL, while MPC-MFRL (w/o SG) represents MPC-MFRL without soft-greedy action selection. The rest of the legends are identical to Fig. 2; (b) Mean and bootstrapped confidence interval (bold bars and error bars, over 5 distinct random seeds) of performance for action selection approaches with different forward dynamics model complexities. "Num. hidden units" denotes the number of hidden units used. For evaluation details, see Section 4.2.

5 Related work

Methods following the classic Dyna framework [sutton1990integrated; silver2008sample; kurutach2018modelensemble; kalweit2017uncertainty; clavera2018model] learn a forward dynamics model for simulating experiences to train an MFRL agent, while we focus on the interplay of MFRL and MPC.

Guided Policy Search (GPS) [levine2013guided; chebotar2017path; levine2014learning] uses model-based controllers such as iLQR [tassa2012synthesis] and iLQG [todorov2005generalized] to generate training samples for an MFRL policy. The follow-up works [zhang2016learning; nagabandi2017neural] further adopt MPC to provide supervision for a policy. In contrast, our approach concentrates on improving MPC performance with MFRL.

gu2016continuous, feinberg2018model, and buckman2018sample assist value function learning by model-based approaches. gu2016continuous collect training samples for value function learning using a model-based controller. feinberg2018model and buckman2018sample leverage model-based rollout to compute targets of value function training, thus accelerating value function learning. On the contrary, our method focuses on leveraging value function to ameliorate the shortsighted planning of MPC.

silver2016mastering, weber2017imagination, oh2017value, and tamar2016value add planning to an MFRL policy. silver2016mastering and oh2017value, however, either rely on discrete state and action spaces or a perfect forward dynamics model. tamar2016value assume that the state space has 2-D structure. weber2017imagination learn planning by an MFRL approach, hence inheriting the data-inefficiency of MFRL. In contrast, our approach can be applied to an arbitrary type of state and action space and is more data-efficient than MFRL.

pong*2018temporal perform MPC planning solely with a goal-conditioned state-action value function. lowrey2018plan improve long-term planning by evaluating sampled trajectories using a value function. pong*2018temporal and lowrey2018plan do not utilize MFRL policies to enhance planning performance. Planning with only a value function [pong*2018temporal] inherits the data-inefficiency of value function learning in MFRL, while our approach combines simulated rewards and a value function, thereby mitigating this problem. lowrey2018plan assume a perfect forward dynamics model, whereas our work does not rely on such an unrealistic assumption.

6 Conclusion

We propose the MPC-MFRL framework, which leverages the advantages of MFRL and MPC to achieve MFRL's level of performance while being as data-efficient as MBRL. Moreover, MPC-MFRL allows the agent to continually improve performance with more environment interactions. Our MPC-MFRL framework also opens up promising directions for future work. Applying MPC-MFRL to real robotic systems, for example, is an appealing direction since MPC-MFRL shows superior data-efficiency, which is particularly crucial for real robots. Another direction could be guiding an MFRL policy by MPC. One can pre-train an MFRL policy on several tasks, then adapt it online to a new task by MPC or model-based policy search.