1.1 Motivation
Model-Free Reinforcement Learning (MfRL) algorithms are widely applied to challenging control tasks because they eliminate the need to model the complex dynamics of the system. However, these techniques are notoriously data hungry, often requiring millions of transitions. This severely limits training on hardware, where collecting such a large number of transitions is infeasible. To overcome this hurdle, various works have adopted a two-loop model-based approach, typically referred to as Model-Based Reinforcement Learning (MbRL). Such strategies exploit the explored dynamics of the system by learning a dynamics model and then determining an optimal policy within this model. This "inner-loop" optimization allows for a better choice of actions before interacting with the original environment.
The inclusion of model learning in RL has significantly improved sampling efficiency [11, 16], and there are numerous works in this direction. While exploring, DRL algorithms collect a significant number of state transitions, which can be used to build an approximate dynamics model of the system. In the context of robotics, such models have proven very beneficial for developing robust control strategies based on predictive simulations [8]; they have successfully handled minor disturbances and demonstrated sim2real feasibility. Moreover, planning with the learnt model is mainly motivated by Model Predictive Control (MPC), a well-known strategy in classical real-time control. Given the model and the cost formulation, a typical MPC scheme can be posed as a finite-horizon trajectory optimization problem. This motivates our work: a generalised framework combining model-free and model-based methods.
1.2 Related Work
Taking such an MbMf view and exploiting the approximated dynamics with random shooting, [16] demonstrated its efficacy in improving overall learning performance. That work also showed how model-based (Mb) additions can significantly accelerate typical model-free (Mf) algorithms. In the same context of MbMf RL, [13] introduced the use of value functions within an MPC formulation, and [5] presents a similar formulation with high-dimensional image observations. More recently, [29] showed adaptation to dynamic changes using MPC with world models, and [15] proposed an actor-critic framework using model-predictive rollouts, demonstrating applicability on real hardware. TOPDM [1], an approach close to DeMo RL, demonstrated spinning a pen between the fingers, one of the most challenging examples of dexterous hand manipulation.
1.3 Contribution
With a view toward strengthening existing MbMf approaches to learning, we propose a generic framework that integrates a model-based optimization scheme with model-free off-policy learning. Motivated by the success of online learning algorithms [27] on RC buggy models, we combine them with off-policy Mf learning, leading to a two-loop MbMf approach. In particular, we implement dynamic mirror descent (DMD) algorithms on a model estimate of the system, while the outer-loop MfRL operates on the real system. The main advantage of this setting is that the inner loop is computationally light; the number of iterations can be large without affecting overall performance. Since this is a hierarchical approach, the inner-loop policy helps improve the outer-loop policy by effectively utilizing the control choices made on the approximate dynamics. This approach, in fact, provides a more generic framework for some of the MbMf approaches (e.g., [15], [17]). In addition to the proposed framework, we introduce two new algorithms: DeMo RL and DeMo Layer. Dynamic Mirror-Descent Model Predictive RL (DeMoRL) uses Soft Actor-Critic (SAC) [4] in the outer loop as the off-policy RL algorithm, and the Cross-Entropy Method (CEM) in the inner loop as the DMD-MPC [27].
In particular, we use the exponential family of control distributions with CEM on the objective. In each iteration, the optimal control sequence obtained is applied to the model estimate to collect additional data. This data is appended to the buffer, which is then used by the outer loop for learning the optimal policy. We show that the DMD-MPC accelerates the learning of the outer loop by simply enriching the data with better choices of state-control transitions. We finally demonstrate this method on custom robotic environments and MuJoCo benchmark control tasks.
Simulation results show that the proposed methodology is better than, or at least as good as, MoPAC [15] and MBPO [8] in terms of sample efficiency. Furthermore, as our formulation is close to that of [15], it is worth mentioning that even though we do not show hardware results, the proposed algorithms can be used to train on hardware more effectively; this will be a part of future work.
The DeMo Layer, a special instance of the hierarchical framework, guides linear policies trained using Augmented Random Search (ARS). Experiments are conducted on Cartpole swing-up and quadrupedal walking. Our results show that the proposed DeMo Layer improves the policy and can be used end-to-end with any RL algorithm during deployment.
1.4 Outline of the Report
The report is structured as follows:

Chapter 2. Problem Formulation
In this chapter, we provide the preliminaries of Reinforcement Learning and Online Learning as used in the report. We further describe the specific RL algorithms: Augmented Random Search, Soft Actor-Critic, and the Online Learning approach to MPC.
Chapter 3. Methodology: Novel Framework & Algorithms
We describe the hierarchical framework for the proposed strategy, followed by a description of the DMD-MPC. With the proposed generalised framework, we then formulate the two novel algorithms associated with the strategy: DeMo RL and DeMo Layer.
Chapter 4. Experimental Results
In this chapter we run experiments of DeMo RL on benchmark MuJoCo control tasks and compare the results with the existing state-of-the-art algorithms MoPAC and MBPO. The experiments with the DeMo Layer were conducted on Cartpole swing-up and the custom-built quadruped Stoch2. Further, we discuss our experimental results and show the significance of the proposed algorithms.
Chapter 5. Conclusion & Future Work
Finally, we end the report by summarizing the work done and proposing some interesting future directions.
2.1 Optimal Control MPC
Model Predictive Control (MPC) is a widely applied control strategy that yields practical and robust controllers. It considers a stochastic dynamics model $\hat{f}$ as an approximation of the real system, solves an $H$-step optimisation problem at every time step, and applies the first control to the real dynamical system to move to the next state. A popular MPC objective is the expected $H$-step future cost
(2.1) $J(x_t, \mathbf{u}_t) = \mathbb{E}\left[\sum_{h=0}^{H-1} c(x_{t+h}, u_{t+h}) + c_H(x_{t+H})\right]$,
(2.2) subject to $x_{t+h+1} = \hat{f}(x_{t+h}, u_{t+h})$,
where $c$ is the cost incurred (for the control problem) and $c_H$ is the terminal cost.
Since the optimal control at time $t$ is obtained from an optimisation that starts from the current state $x_t$, MPC is effectively state feedback, as desired for a stochastic system, and is an effective tool for control tasks involving dynamic environments or non-stationary setups.
Though MPC sounds intuitively promising, the optimization is approximated in practice and the control command
needs to be computed in real time at high frequency. Hence, a common practice is to heuristically bootstrap the previous approximate solution as the initialization to the current problem.
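As a concrete illustration, the receding-horizon loop with warm-starting can be sketched as follows. This is a minimal random-shooting example on a known toy linear model; the function names, the dynamics, and the sampling scheme are our own illustration, not the report's implementation:

```python
import numpy as np

def rollout_cost(x0, controls, A, B, Q, R):
    """Cost of an H-step rollout under toy linear dynamics x' = Ax + Bu."""
    x, cost = x0, 0.0
    for u in controls:
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost + x @ Q @ x  # terminal cost

def mpc_step(x0, A, B, Q, R, H=10, n_samples=64, warm_start=None, rng=None):
    """One receding-horizon step: sample H-step control sequences around a
    warm start (the previous solution shifted one step) and return the first
    control of the best sequence, plus the warm start for the next problem."""
    rng = rng or np.random.default_rng(0)
    mean = np.zeros((H, B.shape[1])) if warm_start is None else warm_start
    candidates = mean + 0.5 * rng.standard_normal((n_samples, H, B.shape[1]))
    costs = [rollout_cost(x0, c, A, B, Q, R) for c in candidates]
    best = candidates[int(np.argmin(costs))]
    # bootstrap: reuse the tail of this solution to initialize the next problem
    next_warm = np.vstack([best[1:], np.zeros((1, B.shape[1]))])
    return best[0], next_warm
```

Only the first control is applied to the (real) system; `next_warm` carries the heuristic bootstrap described above into the next time step.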
2.2 Reinforcement Learning Framework
We consider an infinite-horizon Markov Decision Process (MDP) given by the tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$, where $\mathcal{S}$ refers to the set of states of the robot and $\mathcal{A}$ refers to the set of controls or actions. $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P$ refers to the function that gives the transition probabilities between two states for a given action, and $\gamma \in (0,1)$ is the discount factor of the MDP. The distribution over initial states is given by $\rho$, and the policy is represented by $\pi_\theta$, parameterized by $\theta \in \Theta$, a potentially high-dimensional feasible space. If a stochastic policy is used, then $u_t \sim \pi_\theta(\cdot \mid x_t)$. For ease of notation, we will use a deterministic policy $u_t = \pi_\theta(x_t)$ to formulate the problem; wherever a stochastic policy is used, we will show the extensions explicitly. In this formulation, the optimal policy is the policy that maximizes the expected return
$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, u_t)\right]$,
where the subscript $t$ denotes the step index. Note that the system model dynamics can be expressed in the form of an equation:
(2.3) $x_{t+1} = f(x_t, u_t)$.
Off-policy techniques like TD3 and SAC have shown better sample complexity than on-policy methods like TRPO and PPO. A simple random-search-based model-free technique, Augmented Random Search (ARS) [14], proposed a linear deterministic policy that is highly competitive with other model-free RL techniques such as TRPO, PPO and SAC. In the subsequent sections we describe the ARS algorithm in detail, along with an improvement in its implementation, and we also describe SAC.
2.3 Online Learning Framework
Online learning is another sequential decision-making framework, essentially with three components: the decision set, the learner's strategy for updating decisions, and the environment's strategy for updating per-round losses. At round $t$, the learner makes a decision $\theta_t$, possibly using side information; the environment then chooses a loss function $\ell_t$, and the learner suffers the cost $\ell_t(\theta_t)$, along with side information, such as the gradient of the loss, to aid in choosing the next decision. The learner's goal is to minimize the accumulated cost $\sum_t \ell_t(\theta_t)$, i.e., to minimize the regret. We describe the Online Learning approach to Model Predictive Control [27] in detail in subsequent sections.
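The round-by-round protocol above can be sketched with a toy online gradient-descent learner. This is our own minimal illustration (the quadratic per-round losses and the step size are assumptions, not from the report):

```python
import numpy as np

def online_gradient_descent(targets, theta0=0.0, eta=0.3):
    """Toy online learner for per-round losses f_t(theta) = (theta - z_t)^2.
    Each round: play theta_t, suffer f_t(theta_t), then update using the
    revealed gradient (the side information)."""
    theta, cost = float(theta0), 0.0
    for z in targets:
        cost += (theta - z) ** 2          # loss suffered this round
        theta -= eta * 2.0 * (theta - z)  # gradient step for the next round
    return theta, cost
```

Against a fixed target the learner's accumulated cost is far below that of any static initial decision, which is exactly the regret being minimized.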
2.4 Description of Algorithms
We describe the RL and online learning algorithms used in this work: Augmented Random Search (ARS), Soft Actor-Critic (SAC), and the Online Learning approach to MPC.
2.4.1 Augmented Random Search
Random search is a derivative-free optimisation method in which the gradient is estimated through finite differences [18]. The objective is to maximize the expected return of a policy parameterized by $\theta$ under noise $\nu$:
$\max_{\theta} \; \mathbb{E}_{\delta}\left[J(\theta + \nu\delta)\right].$
The gradient estimate is obtained from the gradient of this smoothed objective with Gaussian noise, rather than from the policy gradient theorem. The gradient of the smoothed objective is
$\nabla_{\theta}\,\mathbb{E}_{\delta}\left[J(\theta + \nu\delta)\right] = \frac{1}{\nu}\,\mathbb{E}_{\delta}\left[J(\theta + \nu\delta)\,\delta\right],$
where $\delta$ is zero-mean Gaussian. If $\nu$ is sufficiently small, the gradient estimate is close to the gradient of the original objective. The bias can be further reduced with a two-point estimate,
$\frac{1}{2\nu}\,\mathbb{E}_{\delta}\left[\left(J(\theta + \nu\delta) - J(\theta - \nu\delta)\right)\delta\right].$
A basic random search updates the policy parameters according to
(2.4) $\theta_{k+1} = \theta_k + \frac{\alpha}{N}\sum_{j=1}^{N}\left(J(\theta_k + \nu\delta_j) - J(\theta_k - \nu\delta_j)\right)\delta_j.$
Augmented Random Search defines the update rule
(2.5) $\theta_{k+1} = \theta_k + \frac{\alpha}{b\,\sigma_R}\sum_{j=1}^{b}\left(J(\theta_k + \nu\delta_{(j)}) - J(\theta_k - \nu\delta_{(j)})\right)\delta_{(j)}.$
The policy is a linear state-feedback law, $u = \theta x$, where $x$ is the state. ARS proposes three augmentations to basic random search:
i) Using the top $b$ best-performing directions: the perturbation directions $\delta_1, \ldots, \delta_N$ are ordered in decreasing order according to $\max\{J(\theta_k + \nu\delta_j),\, J(\theta_k - \nu\delta_j)\}$, and only the top $b$ directions are used.
ii) Scaling by the standard deviation $\sigma_R$ of the collected returns, which helps in adjusting the step size.
iii) Normalization of the states.
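The update (2.5) with augmentations i) and ii) can be sketched as follows. This is a toy, self-contained version that evaluates an abstract return function; state normalization (augmentation iii) is omitted, and all names and hyperparameter values are our own illustration:

```python
import numpy as np

def ars_update(theta, env_return, nu=0.03, alpha=0.02, N=8, b=4, rng=None):
    """One ARS step sketch: perturb the linear policy matrix theta in N random
    directions, evaluate the +/- perturbations, keep the top-b directions by
    the larger of the two returns, and scale the step by the reward std."""
    rng = rng or np.random.default_rng(0)
    deltas = rng.standard_normal((N,) + theta.shape)
    r_plus = np.array([env_return(theta + nu * d) for d in deltas])
    r_minus = np.array([env_return(theta - nu * d) for d in deltas])
    # augmentation i): order directions by max(r+, r-), keep the top b
    order = np.argsort(-np.maximum(r_plus, r_minus))[:b]
    # augmentation ii): scale the step by the std of the returns used
    sigma = np.concatenate([r_plus[order], r_minus[order]]).std() + 1e-8
    grad = sum((r_plus[k] - r_minus[k]) * deltas[k] for k in order) / b
    return theta + alpha / sigma * grad
```

On a simple quadratic return, iterating this update moves the linear policy parameters steadily toward the maximizer without ever computing an analytic gradient.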
Accelerating ARS
Practical implementations commonly accelerate stochastic gradient descent, e.g., with Adam or accelerated SGD [7]. With ARS, however, the estimated gradient is used directly and no acceleration technique is applied. We therefore define an acceleration-based gradient estimate for ARS for faster convergence; validating this approach is left as future work. The modified ARS algorithm uses a small and a large step size respectively.
2.4.2 Soft Actor Critic
Soft Actor-Critic (SAC) [4] is an off-policy model-free RL algorithm based on the principle of entropy maximization: the entropy of the policy is added to the reward. It uses soft policy iteration for policy evaluation and improvement. It maintains two Q-value functions to mitigate the positive bias of value-based methods, and the minimum of the two Q-functions is used for the value gradient and the policy gradient; the two Q-functions also speed up training.
It also uses target networks whose weights are updated by an exponential moving average with a smoothing constant $\tau$, to increase stability.
The SAC policy $\pi_\phi$ is updated using the loss function
$J_\pi(\phi) = \mathbb{E}_{x \sim \mathcal{D},\, u \sim \pi_\phi}\left[\alpha \log \pi_\phi(u \mid x) - Q(x, u)\right],$
where $\mathcal{D}$, $V$ and $Q$ represent the replay buffer, the value function and the Q-function associated with $\pi_\phi$. The exploration performed by SAC helps in learning the underlying dynamics. In each gradient step we update the SAC parameters using data from $\mathcal{D}$; $\bar{V}$ and $\bar{Q}$ represent the target networks.
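The clipped double-Q soft Bellman target described above can be sketched as follows (a minimal numpy illustration; the function name and scalar interface are ours):

```python
import numpy as np

def soft_q_target(r, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99, done=0.0):
    """Soft Bellman target for SAC's critics: take the minimum of the two
    target Q estimates (clipped double-Q, mitigating positive bias) and
    subtract the entropy term alpha * log pi before discounting."""
    soft_v = np.minimum(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_v
```

Both Q-networks regress toward this single target; using the minimum keeps either critic from running away on its own overestimates.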
2.4.3 Online Learning for MPC
Online learning (OL) makes a decision at each time step to optimise the regret over time, while MPC optimizes a finite-horizon cost at every time instant; the two are thus closely related [27]. The proposed work is motivated by such an OL approach to MPC, which considers a generic algorithm, Dynamic Mirror Descent MPC (DMD-MPC), a framework that encompasses different MPC algorithms. DMD is reminiscent of a proximal update with a Bregman divergence acting as a regularization, keeping the current control distribution, parameterized by $\eta_t$ at time $t$, close to the previous one.
The second step of DMD uses a shift model to anticipate the optimal decision for the next instant.
DMD-MPC proposes the shifted previous solution as the shift model, approximating the current problem. The proposed methodology also aims to obtain an optimal policy for a finite-horizon problem considering $H$ steps into the future using DMD-MPC.
Denote the sequences of states and controls as $\mathbf{x}_t = (x_{t,0}, x_{t,1}, \ldots, x_{t,H})$ and $\mathbf{u}_t = (u_{t,0}, u_{t,1}, \ldots, u_{t,H-1})$, with $x_{t,0} = x_t$. The cost for $H$ steps is given by
(2.6) $C(\mathbf{x}_t, \mathbf{u}_t) = \sum_{h=0}^{H-1} c(x_{t,h}, u_{t,h}) + c_H(x_{t,H}),$
where $c$ is the cost incurred (for the control problem) and $c_H$ is the terminal cost. Each of the states are related by
(2.7) $x_{t,h+1} = \hat{f}(x_{t,h}, u_{t,h}),$
with $\hat{f}$ being the estimate of the true dynamics $f$. We will use the short notation $\mathbf{x}_t = \hat{F}(x_t, \mathbf{u}_t)$ to represent (2.7). It will be shown later that in a two-loop scheme, the terminal cost can be the value function obtained from the outer loop.
Now, by following the principle of DMD-MPC, for a rollout at time $t$, we sample the control tuple $\mathbf{u}_t$ from a control distribution $\pi_{\eta_t}$ parameterized by $\eta_t$. To be more precise, $\eta_t$ is also a sequence of parameters,
$\eta_t = (\eta_{t,0}, \eta_{t,1}, \ldots, \eta_{t,H-1}),$
which yield the control tuple $\mathbf{u}_t$. Therefore, given the control distribution parameter $\eta_{t-1}$ at round $t-1$, we obtain $\eta_t$ at round $t$ from the following update rule:
(2.8) $\eta_t = \arg\min_{\eta}\; \gamma_t \left\langle \nabla J\!\left(\Phi(\eta_{t-1})\right),\, \eta \right\rangle + D_\psi\!\left(\eta \,\Vert\, \Phi(\eta_{t-1})\right),$
where $J$ is the MPC objective/cost expressed in terms of $x_t$ and $\eta$, $\Phi$ is the shift model, $\gamma_t$ is the step size for the DMD, and $D_\psi$ is the Bregman divergence for a strictly convex function $\psi$.
Note that the shift model $\Phi$ is critical for the convergence of this iterative procedure. Typically, convergence is ensured by making $\Phi$ dependent on the state $x_t$. In particular, for the proposed two-loop scheme, we make $\Phi$ dependent on the outer-loop policy $\pi_\theta$.
Also note that the resulting parameter $\eta_t$ is still state-dependent, as the MPC objective depends on $x_t$.
With the two policies, $\pi_\theta$ and $\pi_{\eta_t}$, at time $t$, we aim to develop a synergy in order to leverage the learning capabilities of both. In particular, the ultimate goal is to learn them in "parallel", i.e., in the form of two loops: the outer loop optimizes $\theta$ and the inner loop optimizes $\eta_t$ for the MPC objective. We discuss this in more detail in Chapter 3.
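For a Gaussian control distribution with fixed covariance, one DMD round, the proximal step followed by the shift model, can be sketched as follows. This is our own minimal illustration: the classical left-shift model is shown here, whereas the proposed framework instead obtains the shift from the outer-loop policy:

```python
import numpy as np

def left_shift(seq):
    """Classical shift model: advance the control sequence by one step,
    repeating the last entry to fill the horizon."""
    return np.vstack([seq[1:], seq[-1:]])

def dmd_round(mean_seq, grad_seq, shift=left_shift, step=0.5):
    """One Dynamic Mirror Descent round for a fixed-covariance Gaussian:
    the Bregman/proximal update reduces to a gradient step on the mean
    sequence, followed by the shift model that anticipates the next round."""
    updated = mean_seq - step * grad_seq  # mirror-descent (proximal) step
    return shift(updated)                 # initialize the next problem
```

Swapping `left_shift` for a policy-derived shift is exactly the modification the two-loop scheme makes.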
3.1 Generalised Framework: DMD MPC & RL
In classical MfRL, data from interactions with the original environment are used to obtain the optimal policy, parameterized by $\theta$. While the interactions of the policy are stored in a memory buffer for offline batch updates, they are also used to optimize the parameters of the approximated dynamics model $\hat{f}$. Such an optimized policy can then be used in the DMD-MPC strategy to update the control distribution $\pi_{\eta}$. The controls sampled from this distribution are rolled out with the model $\hat{f}$ to collect new transitions, which are stored in a separate buffer. Finally, we update $\theta$ using the data from both buffers via one of the off-policy approaches (e.g. DDPG [12], SAC [4]); in this work, we demonstrate this using Soft Actor-Critic (SAC) [4]. This gives a generalised hierarchical framework with two loops: Dynamic Mirror Descent (DMD) based Model Predictive Control (MPC) forming the inner loop, and model-free RL forming the outer loop. Graphical representations of model-free RL, model-based RL and the described framework are given in Figure 3.1, Figure 3.2 and Figure 3.3.
There are two salient features in the twoloop approach:

At round $t$, we obtain the shifting operator $\Phi$ by using the outer-loop parameter $\theta$. This is in stark contrast to the classical DMD-MPC method of [27], wherein the shifting operator depends only on the control parameter of the previous round, $\eta_{t-1}$.

Inspired by [13, 15], the terminal cost is the value of the terminal state for the finite-horizon problem, as estimated by the value function associated with the outer-loop policy $\pi_\theta$. This efficiently utilises the model learned via the RL interactions and, in turn, optimizes $\theta$ with the updated setup.
Since there is limited literature on theoretical guarantees for DRL algorithms, it is difficult to show convergence and regret bounds for the proposed two-loop approach. However, there are guarantees on regret bounds for dynamic mirror descent algorithms in the context of online learning [6]. We restate them here in our notation for ease of understanding, reusing their definitions.
By a slight abuse of notation, we have omitted the state in the arguments of $J$. We have the following:
Lemma 3.1
Let the sequence $\{\eta_t\}$ be as in (2.8), and let $\{\zeta_t\}$ be any feasible arbitrary sequence; then, for the class of convex MPC objectives $J$, we have
Theorem 3.1
Given the shift operator $\Phi$, dependent on the outer-loop policy parameterised by $\theta$ at state $x_t$, the Dynamic Mirror Descent (DMD) algorithm with a diminishing step-size sequence gives the overall regret, with respect to the comparator sequence $\{\zeta_t\}$, as
(3.1) 
with
Based on such a formulation, the regret bound is .
Proofs of both Lemma 3.1 and Theorem 3.1 are given in [6]. Theorem 3.1 bounds the regret when the shifting operator is dependent on the outer-loop policy. However, this result is not guaranteed for non-convex objectives, which will be a subject of future work.
Having described the main methodology, we will now study a widely used family of control distributions that can be used in the inner loop, the exponential family.
Exponential family of control distributions
We consider a parametric set of probability distributions for our control distributions in the exponential family, with natural parameter $\eta$, sufficient statistic $\Psi$ and expectation parameter $\mu = \mathbb{E}_{\pi_\eta}[\Psi(\mathbf{u})]$ [27]. Further, we set the Bregman divergence in (2.8) to the KL divergence, i.e., $D_\psi(\eta \,\Vert\, \tilde{\eta}) = \mathrm{KL}\left(\pi_{\tilde{\eta}} \,\Vert\, \pi_{\eta}\right)$. After employing the KL divergence, our update rule becomes:
(3.2) $\mu_t = \tilde{\mu}_t - \gamma_t\, \nabla J(\tilde{\eta}_t).$
The natural parameter of the shifted control distribution, $\tilde{\eta}_t$, is obtained with the proposed shift model from the outer-loop RL policy by setting the expectation parameter of $\pi_{\tilde{\eta}_t}$: $\tilde{\mu}_t = \Psi(\pi_\theta(\mathbf{x}_{t-1}))$. Note that we have overloaded the notation $\pi_\theta$ to map the state sequence to the corresponding control sequence. (If the policy is stochastic, then the controls are sampled, $u \sim \pi_\theta(\cdot \mid x)$; this is similar to the control choices made in [15, Algorithm 2, Line 4].) Then, we have the following gradient of the cost:
(3.3) $\nabla J(\tilde{\eta}_t) = -\,\frac{\mathbb{E}_{\mathbf{u} \sim \pi_{\tilde{\eta}_t}}\left[w(\mathbf{u})\left(\Psi(\mathbf{u}) - \tilde{\mu}_t\right)\right]}{\mathbb{E}_{\mathbf{u} \sim \pi_{\tilde{\eta}_t}}\left[w(\mathbf{u})\right]},$
where $\Psi$ is the sufficient statistic and $w(\mathbf{u})$ weights each sampled control sequence by its cost (lower cost yielding larger weight); for our experiments we choose a Gaussian distribution for the controls and $\Psi(\mathbf{u}) = \mathbf{u}$. We finally have the following update rule for the expectation parameter [27]:
(3.4) $\mu_t = (1 - \gamma_t)\,\tilde{\mu}_t + \gamma_t\,\frac{\mathbb{E}_{\mathbf{u} \sim \pi_{\tilde{\eta}_t}}\left[w(\mathbf{u})\,\mathbf{u}\right]}{\mathbb{E}_{\mathbf{u} \sim \pi_{\tilde{\eta}_t}}\left[w(\mathbf{u})\right]}.$
Based on the data collected in the outer loop, the inner loop is executed via DMDMPC as follows:

The shifting parameter $\tilde{\eta}_t$ is obtained by using the outer-loop parameter $\theta$. Now, considering an $H$-step horizon, for $h = 0, 1, \ldots, H-1$, obtain
(3.5) (3.6) (3.7) where represents the covariance for control distribution.

Collect the rollouts and their costs, and apply the DMD-MPC update (3.2) to obtain $\mu_t$.
MPC objective formulations
Similar to the exponential family, we can use different types of MPC objectives. Specifically, we will be using the method of elite fractions, which selects only the best transitions. This is given by the following:
(3.8) $w(\mathbf{u}) = \mathbb{1}\left\{C(\mathbf{x}_t, \mathbf{u}) \leq C_{e}\right\},$
where $C_{e}$ is the cost threshold of the top elite fraction of the estimated rollouts. Alternative formulations are also possible; specifically, the objective used by the MPPI method in [28] is obtained by setting $\gamma_t = 1$ in (3.4) and using the following weighting for some $\lambda > 0$:
(3.9) $w(\mathbf{u}) = \exp\left(-\tfrac{1}{\lambda}\, C(\mathbf{x}_t, \mathbf{u})\right).$
This shows that our formulation is more generic, and some of the existing approaches ([15], [1] and [8]) can be derived from it with suitable choices. The table below shows the specific DMD-MPC algorithm and the corresponding shift operator used in each case.
MbMf Algorithm  RL  DMD-MPC  Shift Operator
MoPAC  SAC  MPPI  Obtained from MfRL policy
TOPDM  TD3  MPPI with CEM  Left shift (obtained from the previous iterate)
DeMoRL  SAC  CEM  Obtained from MfRL policy
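The two weighting schemes, the elite-fraction indicator of (3.8) and the exponential MPPI weights of (3.9), can be sketched as follows (a minimal numpy illustration; the helper names are ours):

```python
import numpy as np

def mppi_weights(costs, lam=1.0):
    """MPPI-style weighting (3.9): every sampled rollout contributes with
    weight exp(-C/lambda). Subtracting the minimum cost is a standard
    numerical-stability trick that cancels in the normalization."""
    c = np.asarray(costs, float)
    w = np.exp(-(c - c.min()) / lam)
    return w / w.sum()

def cem_weights(costs, elite_frac=0.1):
    """Elite-fraction weighting (3.8): an indicator keeping only the
    lowest-cost fraction of rollouts, each with equal weight."""
    c = np.asarray(costs, float)
    n_elite = max(1, int(elite_frac * len(c)))
    thresh = np.sort(c)[n_elite - 1]   # cost threshold of the elites
    w = (c <= thresh).astype(float)
    return w / w.sum()
```

MPPI spreads weight over every rollout, while CEM concentrates all the weight on the elites; the latter is the choice that DeMoRL makes.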
3.2 DeMo RL Algorithm
The DeMoRL algorithm follows other MbMf methods in how it learns the dynamics, adopting a similar ensemble dynamics-model approach. It is shown in Algorithm 2 and has three parts: model learning, Soft Actor-Critic and DMD-MPC. We describe them below.
Model learning. The functions used to approximate the dynamics of the system are probabilistic deep neural networks [9], cumulatively represented as $\hat{f}$. Such a configuration is believed to account for the epistemic uncertainty of complex dynamics and overcomes the overfitting problem generally encountered with single models [2].
SAC. Our implementation of the proposed algorithm uses Soft Actor-Critic (SAC) [4] as the model-free RL counterpart. Based on the principle of entropy maximization, the choice of SAC ensures sufficient exploration, motivated by the soft policy updates, resulting in a good approximation of the underlying dynamics.
DMD-MPC. Here, we solve for $\mu_t$ using a Monte-Carlo estimation approach. For a horizon length of $H$, we collect trajectories using the current policy and the more accurate dynamics models of the ensemble, i.e., those with lower validation losses. For all trajectories, the complete cost is calculated using a deterministic reward estimate and the value function, through (2.6). After obtaining the complete state-action-reward $H$-step trajectories, we execute the following steps based on the CEM [21] strategy:

Choose the elite trajectories according to the total $H$-step cost incurred, and denote the chosen action trajectories and their costs as $\mathbf{u}^{e}$ and $C^{e}$ respectively. Note that we have also tested other values of the elite fraction; the ablations are shown in the Appendix attached as supplementary material.

Using the elite action trajectories $\mathbf{u}^{e}$ and their costs $C^{e}$, we calculate $\hat{\mu}_t$ as the reward-weighted mean of the actions, i.e.
(3.10) $\hat{\mu}_t = \frac{\sum_{j} e^{-C^{e}_j}\,\mathbf{u}^{e}_j}{\sum_{j} e^{-C^{e}_j}}.$
Finally, we update the current policy actions according to (3.4) as
(3.11) $\mu_t = (1 - \gamma_t)\,\tilde{\mu}_t + \gamma_t\,\hat{\mu}_t.$
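The three steps above, elite selection, the reward-weighted mean of (3.10), and the blending update of (3.11), can be sketched in one function. This is our own illustration (the names, the unit temperature in the exponential weights, and the example values of the elite fraction and step size are assumptions):

```python
import numpy as np

def demo_cem_update(mean_seq, samples, costs, elite_frac=0.1, gamma=0.8):
    """Sketch of one DeMo-style inner-loop update: keep the elite fraction of
    sampled action sequences by lowest total cost, form their reward-weighted
    mean (3.10), and blend it with the shifted mean as in (3.11)."""
    costs = np.asarray(costs, float)
    n_elite = max(1, int(elite_frac * len(costs)))
    elite = np.argsort(costs)[:n_elite]                # Step 1: elite rollouts
    w = np.exp(-(costs[elite] - costs[elite].min()))   # Step 2: reward weights
    w /= w.sum()
    elite_mean = np.tensordot(w, samples[elite], axes=1)
    return (1.0 - gamma) * mean_seq + gamma * elite_mean  # Step 3: blend
```

The returned mean parameterizes the next control distribution; rolling it out on the model produces the transitions appended to the MPC buffer.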
3.3 DeMo Layer
We consider a special case of the above generalised framework in which the outer-loop RL policy is not updated, having already been trained to convergence. At a given state, the RL policy gives a distribution over actions. With the shift model obtained from the trained RL policy, the DMD-MPC update now has a fixed shift model:
(3.12) $\mu_t = (1 - \gamma_t)\,\tilde{\mu} + \gamma_t\,\hat{\mu}_t,$
where $\tilde{\mu}$ is obtained from the RL policy by setting the expectation parameter from its actions.
The updated policy is optimal both in terms of the long-term expected reward and the short-term horizon-based cost. Following the derivations of the previous section, with a Gaussian distribution for the control policy we have a closed-form expression for the action: a convex combination of the RL action and the actions that are good according to the current cost.
We sample an action from the updated policy and apply it to the real environment, unlike the previous case, thus guiding the RL policy end-to-end. A graphical representation of the DeMo Layer framework is given in Figure 3.4. We describe the three parts of the DeMo Layer here: model learning, ARS and DMD-MPC.
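The convex combination described above can be sketched as follows (our own minimal illustration; the function name and the example mixing weight are assumptions):

```python
import numpy as np

def demo_layer_action(rl_action, mpc_action, gamma=0.5):
    """DeMo Layer sketch: with a frozen RL policy supplying the fixed shift
    model, the executed action is a convex combination of the RL policy's
    action and the cost-informed MPC action (gamma plays the role of the
    DMD step size)."""
    return (1.0 - gamma) * np.asarray(rl_action) + gamma * np.asarray(mpc_action)
```

With `gamma = 0` the layer is transparent and the RL policy acts alone; increasing `gamma` shifts authority toward the short-horizon MPC correction.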
Model learning:
Cartpole: We used the model given in OpenAI Gym, with a biased length for the MPC.
ARS: The policy is linear and deterministic; we implemented it with the modification in Algorithm 1 for faster convergence.
DMD-MPC: We use a strategy similar to the one described for DeMo RL.
4.1 DeMo RL: Results and Comparison
Several experiments were conducted on the MuJoCo [25] continuous control tasks from the OpenAI Gym benchmark, and the performance was compared with the recent related works MoPAC [15] and MBPO [8]. First, we discuss the hyperparameters used for all our experiments, and then the performance achieved in the conducted experiments.
As the baseline of our framework is built upon the MBPO implementation, we use the same hyperparameters for our experiments and for both algorithms. We compare results over three different seeds, and the reward performance plots are shown in Figure 4.1. For the inner DMD-MPC loop we choose a constant horizon length and perform trajectory rollouts; with our elite fraction, the updated model-based transitions are added to the MPC buffer. This process is iterated with a fixed batch size, completing the DMD-MPC block in Algorithm 2. Following MoPAC and MBPO, the number of true-environment interactions for SAC was kept constant in each epoch.
For HalfCheetah-v2, Hopper-v2 and InvertedDoublePendulum-v2, we clearly note accelerated progress in the reward performance curves. In Ant-v2, our rewards were comparable with MoPAC but still significantly better than MBPO. Our final rewards eventually match those achieved by MoPAC and MBPO, but the rate of progress is faster in all our experiments, requiring fewer true-environment interactions. Furthermore, all experiments were conducted with the same set of hyperparameters; tuning them individually might give better insights. Table 4.3 shows the empirical analysis of the acceleration achieved by DeMoRL.
Environment  Algorithm  20 Epochs  40 Epochs  60 Epochs
HalfCheetah-v2  DeMoRL  7333  10691  11037
HalfCheetah-v2  MoPAC  4978  8912  10212
HalfCheetah-v2  MBPO  7265  9461  10578
Ant-v2  DeMoRL  984.0  2278.4  3845.5
Ant-v2  MoPAC  593.6  2337.3  3649.5
Ant-v2  MBPO  907.5  1275.6  1891.9
Hopper-v2  DeMoRL  3077.3  3077.5  3352.4
Hopper-v2  MoPAC  789.9  3137.9  3270.2
Hopper-v2  MBPO  813.9  2683.5  3229.9
Here, we not only show a generic formulation of the DMD-MPC, but also demonstrate how new types of objectives can be obtained and further improvements made. As shown in the table, we perform better than, or at least as well as, MoPAC, which uses information-theoretic model predictive path integrals (i-MPPI) [28], a special case of our setup as shown in (3.9). The MPPI formulation uses all the rollouts to calculate the action sequence, while CEM uses only the elite rollouts, which contributes to the accelerated progress.
Here we present a detailed study of the elite percentage. Referring to the previous steps: after obtaining the complete state-action-reward $H$-step trajectories, we execute Steps 1 to 3 described above.
Given the sequence of controls, we collect the resulting trajectory and add it to our buffer. The quality of the chosen controls is therefore a significant factor affecting the quality of the data used for the outer-loop RL policy. Since the selection strategy is CEM, this quality depends on the choice of the elite fraction. We perform an ablation study over values of the elite fraction on the HalfCheetah-v2 OpenAI Gym environment. The analysis was performed on the reward performance curves shown in Fig. 4.2 (left). Additionally, we consider the number of epochs required to reach a certain level of performance to be a good metric of the acceleration achieved; such an analysis is provided in Fig. 4.2 (right). We make the following observations:

A smaller elite fraction ensures that the learned dynamics are exploited the most, but decreases the exploration performed in the approximated environment.

Similarly, a higher elite fraction, on the other hand, results in more exploration using a "not-so-perfect" policy and dynamics.
Thus, the elite fraction balances between exploration and exploitation.
4.2 DeMo Layer: Results for Cartpole and Stoch
We have conducted experiments on two different environments.
Cartpole:
A linear policy is trained on Cartpole using ARS; no linear policy exists that can achieve swing-up and balance. We show that the DeMo Layer can guide the linear policy to achieve swing-up and balance on Cartpole.
Stoch:
Stoch2, a quadruped robot, is trained using the linear policy approach given in [19]. With a neural network approximating the model dynamics, the DeMo Layer is implemented on Stoch2 to learn robust walking for an episode length of 500.
Environment  Linear Policy  Linear Policy with DeMo Layer
Cartpole  1400  1700
Stoch2  1500  1850

Environment  Horizon  Sampled Trajectories
Cartpole  120  90
Stoch2  20  200
The simulation results for Cartpole and Stoch can be found at https://github.com/soumyarani/EndtoEndGuidedRLusingOnlineLearning.
References
 Charlesworth and Montana [2020] Henry Charlesworth and G. Montana. Solving challenging dexterous manipulation tasks with trajectory optimisation and reinforcement learning. ArXiv, abs/2009.05104, 2020.
 Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/3de568f8597b94bda53149c7d7f5958cPaper.pdf.
 Dalal et al. [2018] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

 Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
 Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2555–2565. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/hafner19a.html.
 Hall and Willett [2013] Eric Hall and Rebecca Willett. Dynamical models and tracking regret in online convex programming. In International Conference on Machine Learning, pages 579–587. PMLR, 2013.
 Jain et al. [2017] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 3, 2017.
 Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Modelbased policy optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlchéBuc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/5faf461eff3099671ad63c6f3f094f7fPaper.pdf.
 Kolter and Manek [2019] J. Zico Kolter and Gaurav Manek. Learning stable deep dynamics models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlchéBuc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/0a4bbceda17a6253386bc9eb45240e25Paper.pdf.
 Lee et al. [2020] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain. Science robotics, 5(47), 2020.
 Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/levine13.html.
 Lillicrap et al. [2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016. URL http://arxiv.org/abs/1509.02971.
 Lowrey et al. [2019] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via modelbased control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byey7n05FQ.
 Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
 Morgan et al. [2021] Andrew S Morgan, Daljeet Nandha, Georgia Chalvatzaki, Carlo D'Eramo, Aaron M Dollar, and Jan Peters. Model predictive actor-critic: Accelerating robot skill acquisition with deep reinforcement learning. arXiv preprint arXiv:2103.13842, 2021.
 Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Nagabandi et al. [2020] Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pages 1101–1112. PMLR, 2020.
 Nesterov and Spokoiny [2017] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
 Paigwar et al. [2020] Kartik Paigwar, Lokesh Krishna, Sashank Tirumala, Naman Khetan, Aditya Sagi, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, and Shishir Kolathaya. Robust quadrupedal locomotion on sloped terrains: A linear policy approach, 2020.
 Peng et al. [2020] Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Edward Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. In Robotics: Science and Systems, July 2020. doi: 10.15607/RSS.2020.XVI.064.
 Pourchot and Sigaud [2019] Aloïs Pourchot and Olivier Sigaud. CEM-RL: Combining evolutionary and gradient-based methods for policy search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkeU5j0ctQ.
 Rajeswaran et al. [2018] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018. doi: 10.15607/RSS.2018.XIV.049.
 Robey et al. [2021] Alexander Robey, Lars Lindemann, Stephen Tu, and Nikolai Matni. Learning robust hybrid control barrier functions for uncertain systems. arXiv preprint arXiv:2101.06492, 2021.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015. URL http://proceedings.mlr.press/v37/schulman15.html.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 Tomar et al. [2020] Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. CoRR, abs/2005.09814, 2020. URL https://arxiv.org/abs/2005.09814.
 Wagener et al. [2019] Nolan Wagener, Ching-An Cheng, Jacob Sacks, and Byron Boots. An online learning approach to model predictive control. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June 2019. doi: 10.15607/RSS.2019.XV.033.
 Williams et al. [2017] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M. Rehg, Byron Boots, and Evangelos A. Theodorou. Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017. doi: 10.1109/ICRA.2017.7989202.
 Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14129–14142. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/a322852ce0df73e204b7e67cbbef0d0a-Paper.pdf.
 Zhang et al. [2020] Jesse Zhang, Brian Cheung, Chelsea Finn, Sergey Levine, and Dinesh Jayaraman. Cautious adaptation for reinforcement learning in safety-critical settings. In International Conference on Machine Learning, pages 11055–11065. PMLR, 2020.