1 Introduction
Reinforcement learning (RL) methods can generally be divided into model-free approaches, which directly optimize the cost, and model-based approaches, which additionally employ and/or learn models of the environment. Each approach has different strengths and limitations [1]. Model-free approaches are typically very effective at learning complex policies, but convergence might require millions of trials and can lead to (globally suboptimal) local minima. Model-based approaches, on the other hand, have the theoretical benefit of generalizing better to new tasks and environments, and in practice they can drastically reduce the number of trials required [2, 3]. However, this generalization requires an accurate model (either engineered or learned), which can itself be challenging to acquire. This issue is crucial because a bias in the model does not translate into a proportional bias in the policy: a weakly biased model might result in a strongly biased policy. As a consequence, model-based approaches have often been limited to low-dimensional spaces and frequently require a significant degree of engineering to perform well. It is therefore desirable to design approaches that leverage the respective advantages of the two families while overcoming their individual challenges.
A second motivation for understanding and bridging the gap between model-based and model-free approaches comes from neuroscience. Evidence suggests that humans employ both model-free and model-based strategies when learning new skills, and switch between the two during the learning process [4]. From the machine learning perspective, one possible explanation for this behavior can be found in the concept of bounded rationality [5]. The switch from a model-based to a model-free approach after learning a sufficiently good policy can be motivated by the need to devote limited computational resources to new, challenging tasks, while still being able to solve previous tasks (with sufficient generalization capability). In the reinforcement learning community, however, there are few coherent frameworks that combine these two approaches.
In this paper, we propose a probabilistic framework that integrates model-based and model-free approaches. This bridging is achieved by using the cost estimated by the model-based component as the prior for the intertwined probabilistic model-free component. In particular, we learn a Gaussian-process-based dynamics model. This probabilistic dynamics model is used to compute the trajectory distribution corresponding to a given policy, which in turn yields an estimate of the cost of the policy. This estimate is used by a Bayesian-optimization-based model-free policy search to guide policy exploration. In essence, this probabilistic framework allows us to model and combine the uncertainty in the cost estimates of the two methods. The advantage of doing so is that we exploit the structure and generality of the dynamics model throughout the entire state-action space, while the evidence provided by observations of the actual cost is integrated into the estimation of the posterior.
We demonstrate our method on a 2D navigation task and a more complex simulated manipulation task that requires pushing an object. Our results show that the proposed approach can overcome the model bias and inaccuracies in the dynamics model to learn a well-performing policy, while retaining and improving upon the fast convergence rate of model-based approaches.
The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we present a brief formulation of the problem, and in Section 4 a brief introduction to Gaussian processes and Bayesian optimization. The proposed approach is described in Section 5. In Section 6, we present the experimental results. Finally, we conclude in Section 7 and discuss future research directions.
2 Related Work
Gaussian processes have been widely used in the literature to learn dynamics models [2, 6, 7, 8, 9] and for control [1, 10, 11]. By learning a forward dynamics model, it is possible to predict the sequence of states (i.e., a trajectory) generated by a given policy. The main hypothesis here is that the dynamics model is a good, structure-rich proxy that allows the behavior of a given policy to be predicted without evaluating it on the real system. This structure is particularly valuable: the alternative is to predict the cost directly from the policy parameters (e.g., in Bayesian optimization), which can be challenging, especially for high-dimensional policies with thousands or millions of parameters. However, model-based approaches do not usually incorporate evidence from the cost. Hence, if the model is inaccurate (e.g., due to intrinsic limitations of the model class or the compounding of inaccuracies over trajectory propagation), the expected cost might be wrong, even if the cost has already been measured for the considered policy. This issue is often referred to as model bias [12].
To overcome model bias, [13] proposed to optimize not only the expected cost but also to incorporate the predicted variance. This modification results in an exploration-exploitation trade-off that closely resembles the acquisition functions used in Bayesian optimization. However, unlike our work, [13] does not make use of model-free approaches, and can therefore be considered a special case of our framework in which, once again, the evidence from directly observing the cost is disregarded.
Recently, it has been proposed to overcome model bias by training the dynamics model directly in a goal-oriented manner [14, 15] using the cost observations. However, this approach has the drawback that the generality of the dynamics model is lost. Moreover, directly optimizing a high-dimensional dynamics model in a goal-oriented fashion can be very challenging. In contrast, we learn a general dynamics model and still take the cost observations into account, which allows us to overcome the limitations of both pure model-based and pure model-free methods.
Several prior works have sought to combine model-based and model-free reinforcement learning, typically with the aim of accelerating model-free learning while minimizing the effects of model bias on the final policy. [16] proposed a method that generates synthetic experience for model-free learning using a learned model, and explored the types of models that might be suitable for continuous control tasks. This work follows a long line of research on using learned models to generate synthetic experience, including the classic Dyna framework [17]. The authors of [18] use models to improve the accuracy of model-free value function backups. In other works, models have been used to produce a good initialization for the model-free component [19, 20]. Our method, in contrast, directly combines model-based and model-free approaches into a single RL method without using synthetic samples whose quality degrades with modeling errors.
[21] also proposed to combine model-based and model-free methods, with the model-free algorithm learning the residuals from the model-based return estimates. Our approach likewise uses the model-based return estimates as a bias for model-free learning, but in contrast to [21], the model-based component is incorporated as a prior mean in a Bayesian model-free update, which allows our approach to reason about the confidence of the model-free estimates across the entire policy space. Our approach is most similar to [22], wherein a linear model is learned in a feature space and used to provide a prior mean for the model-free updates. However, the features used to learn that model are hand-picked, whereas our approach works with general dynamics models that are learned from scratch.
3 Problem Formulation
The goal of reinforcement learning is to learn a policy that maximizes the sum of future rewards (or, equivalently, minimizes the sum of future costs). In particular, we consider a discrete-time, potentially stochastic and nonlinear, dynamical system

$x_{t+1} = f(x_t, u_t),$ (1)

where $x_t$ denotes the state of the system, $u_t$ is the action, and $f$ is the state transition map. Our objective is to find the parameters $\theta$ of the policy $\pi_\theta$ that minimize a given cost function subject to the system dynamics. In the finite-horizon case, we minimize the cost function

$J(\theta) = \sum_{t=0}^{T-1} c(x_t, u_t), \quad u_t = \pi_\theta(x_t),$ (2)

where $T$ is the time horizon and $c(x_t, u_t)$ is the cost at timestep $t$. One of the key challenges in designing the policy is that the system dynamics in Equation (1) are typically unknown. In this work, we propose a novel approach that combines model-free and model-based RL methods to learn the policy parameters $\theta$ that minimize $J(\theta)$.
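To make Equation (2) concrete, the sketch below estimates the finite-horizon cost of a linear policy by rolling it out through known dynamics. The toy integrator dynamics, the quadratic cost, and all names are our own illustrative assumptions, not the paper's tasks:

```python
import numpy as np

def rollout_cost(theta, x0, T, f, cost):
    """Finite-horizon cost J(theta) of Eq. (2): roll out the linear policy
    u_t = theta @ x_t through the dynamics x_{t+1} = f(x_t, u_t)."""
    x, J = np.asarray(x0, dtype=float), 0.0
    for _ in range(T):
        u = theta @ x          # linear policy
        J += cost(x, u)        # accumulate the per-step cost c(x_t, u_t)
        x = f(x, u)            # propagate the (here, known) dynamics
    return J

# Toy example: a 2D point mass with the goal at the origin.
f = lambda x, u: x + 0.1 * u       # simple integrator dynamics (assumed)
cost = lambda x, u: float(x @ x)   # squared distance to the goal
theta = -np.eye(2)                 # a proportional controller
J = rollout_cost(theta, x0=[1.0, 0.0], T=50, f=f, cost=cost)
```

In the actual problem the dynamics $f$ are unknown, which is precisely what the learned model of Section 5 replaces.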
4 Background
Our general approach is to use Gaussian processes (GPs) to learn the dynamics model of the system, and Bayesian optimization (BO) to find the optimal policy parameters. In this section, we provide a brief overview of GPs and BO. In the next section, we combine these two methods to overcome some of the challenges faced by pure model-based and pure model-free RL methods.
4.1 Gaussian Process-Based Dynamics Models
GPs are a state-of-the-art probabilistic regression method [23]. In general, a GP can be used to model a nonlinear map, $h: \mathcal{Z} \to \mathbb{R}$, from an input vector $z$ to the function value $h(z)$. Hence, we assume that the function values $h(z)$, associated with different values of $z$, are random variables, and that any finite number of these random variables have a joint Gaussian distribution [23]. For GPs, we define a prior mean function, $m(z)$, and a covariance function, $k(z, z')$, which defines the covariance (or kernel) between any two function values, $h(z)$ and $h(z')$. The choice of kernel is problem-dependent and encodes general assumptions such as smoothness of the unknown function. In this work, we employ the squared exponential kernel, whose hyperparameters are optimized by maximizing the marginal likelihood [23].

The GP framework can be used to predict the distribution of the function value at an arbitrary input $z_*$ based on past observations $\mathcal{D} = \{(z_i, y_i)\}_{i=1}^{n}$. Conditioned on $\mathcal{D}$, the prediction of the GP at the input $z_*$ is a Gaussian distribution with posterior mean and variance

$\mu(z_*) = m(z_*) + k_*^\top K^{-1}\big(y - m(Z)\big), \qquad \sigma^2(z_*) = k(z_*, z_*) - k_*^\top K^{-1} k_*,$ (3)

where $K$ is the kernel matrix with entries $K_{ij} = k(z_i, z_j)$ (with the observation noise on its diagonal), $m(\cdot)$ is the prior mean function evaluated at the training inputs $Z$, and $k_* = [k(z_*, z_1), \ldots, k(z_*, z_n)]^\top$. Thus, the GP provides both the expected value of the function at any arbitrary point and a notion of the uncertainty of this estimate.
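The posterior in Equation (3) is straightforward to implement. Below is a minimal NumPy sketch of GP prediction with a squared exponential kernel and an explicit prior mean function; it is our own illustration (the paper's implementation uses the GPy package), and the fixed hyperparameters are arbitrary:

```python
import numpy as np

def sq_exp(A, B, ell=1.0, sf=1.0):
    """Squared exponential kernel: k(a, b) = sf^2 * exp(-|a-b|^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(Z, y, Zs, mean_fn=lambda z: np.zeros(len(z)), sn=1e-3):
    """Posterior mean and variance of Eq. (3) at test inputs Zs,
    given training data (Z, y) and a prior mean function m(.)."""
    K = sq_exp(Z, Z) + sn**2 * np.eye(len(Z))   # kernel matrix, noise on diagonal
    ks = sq_exp(Zs, Z)                          # cross-covariances k(z*, z_i)
    alpha = np.linalg.solve(K, y - mean_fn(Z))  # K^{-1} (y - m(Z))
    mu = mean_fn(Zs) + ks @ alpha               # posterior mean
    var = sq_exp(Zs, Zs).diagonal() - np.einsum(
        'ij,ji->i', ks, np.linalg.solve(K, ks.T))  # posterior variance
    return mu, var
```

Far from the data, the posterior reverts to the prior mean and prior variance, which is exactly the property exploited in Section 5.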
In the context of this paper, GPs have two separate uses within our approach: 1) to learn the unknown dynamics model in Equation (1), where the input $z$ represents the state-action pair $(x_t, u_t)$ and the output represents the next state $x_{t+1}$; and 2) within Bayesian optimization (discussed in Section 4.2), to map policy parameters to the predicted cost. Central to our choice of GPs is their capability of explicitly modeling the uncertainty in the underlying function. This uncertainty allows us to account for the model bias in the dynamics model, and to handle the exploration/exploitation trade-off in a principled manner in Bayesian optimization.
4.2 Bayesian Optimization
BO is a gradient-free optimization procedure that aims to find the global minimum of an unknown function [24, 25, 26]. At each iteration, BO uses the past observations to model the objective function, here with a GP. BO uses this model to determine the next informative sample location by optimizing the so-called acquisition function. Different acquisition functions have been used in the literature to trade off exploration and exploitation during the optimization process [26]. In this work, we use the expected improvement (EI) acquisition function [27]. Intuitively, EI selects the next parameter point where the expected improvement in performance is maximal. In this paper, we use BO as our model-free method, i.e., we use BO to find the optimal policy parameters that minimize the cost function in Equation (2) directly from the costs observed on the system, as we detail in the next section.
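For reference, EI for minimization has the closed form $EI(\theta) = (J^- - \mu)\,\Phi(\gamma) + \sigma\,\phi(\gamma)$ with $\gamma = (J^- - \mu)/\sigma$, where $J^-$ is the best cost observed so far and $\mu$, $\sigma$ are the GP posterior mean and standard deviation at $\theta$. A small standard-library sketch (our own, for illustration):

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - f, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)   # no uncertainty: deterministic improvement
    g = (best - mu) / sigma
    return (best - mu) * norm_cdf(g) + sigma * norm_pdf(g)
```

EI is large where the posterior mean is low (exploitation) or the posterior uncertainty is high (exploration), which is exactly the trade-off described above.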
5 Using a Model-Based Prior for Model-Free RL
We now present our novel approach to incorporating a model-based prior into model-free RL, which we term “Model-Based Model-Free” (MBMF). As with most model-based approaches, our algorithm starts by training a forward dynamics model from the observed single-step state transitions. This is done using GP regression (see Section 4.1); for multi-dimensional states, we learn a separate GP-based dynamics model for every dimension of the state. Once the dynamics model is trained, for any given policy parameterization $\theta$, we can predict the expected trajectory distribution by iteratively computing the distribution of the states over the horizon. We use Monte-Carlo simulation to find the trajectory distribution, which is highly parallelizable and known to be very effective for GPs [28]; nevertheless, other schemes can be used to compute a good approximation of this distribution [2]. Given the trajectory distribution, we compute the predicted distribution of the cost as a function of the policy parameters using Equation (2). We denote the expected value of this predicted cost function as $\hat{J}(\theta)$.
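The Monte-Carlo estimate of the expected cost described above can be sketched as follows. Here `sample_dynamics` stands in for sampling from the learned per-dimension GP predictive distributions; the linear policy and all names are our own assumptions:

```python
import numpy as np

def mc_cost_estimate(theta, x0, T, sample_dynamics, cost, n_particles=100, seed=0):
    """Monte-Carlo estimate of the expected cost of a linear policy under a
    probabilistic dynamics model: propagate a set of particles, sampling each
    next state from the model's predictive distribution."""
    rng = np.random.default_rng(seed)
    X = np.tile(np.asarray(x0, dtype=float), (n_particles, 1))
    total = np.zeros(n_particles)
    for _ in range(T):
        U = X @ theta.T                                        # u_t = theta x_t
        total += np.array([cost(x, u) for x, u in zip(X, U)])  # per-particle cost
        X = np.array([sample_dynamics(x, u, rng)               # sample x_{t+1}
                      for x, u in zip(X, U)])
    return total.mean()                                        # expected cost
```

Because the particles are propagated independently, this estimate is trivially parallelizable, matching the observation above.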
At the same time, similarly to BO, we train a GP-based response surface that predicts the cost of a policy from the measured tuples of policy parameters and observed costs, where the observed cost of a policy over the given horizon is defined as in Equation (2). However, unlike plain BO, we employ the cost predicted from the dynamics model as the prior mean of the response surface. (A more correct, but computationally harder, approach would be to treat the full predicted cost distribution as the prior for the response surface.) This modified response surface is then used to optimize the acquisition function and compute the next policy parameters to evaluate on the real system. The policy is then rolled out on the actual system, the observed state-action trajectories and the realized cost are added to the respective datasets, and the entire process is repeated. A summary of our algorithm is provided in Algorithm 1.
Intuitively, the learned dynamics model can estimate the cost of a particular policy, but it suffers from model bias, which translates into a bias in the estimated cost. The BO response surface, on the other hand, can predict the true cost of a policy in the regime where it has observed training samples, as it is trained directly on the observed performances; however, its cost estimates can be highly uncertain in the unobserved regime. Incorporating the model-based cost estimates as the prior allows BO to leverage the structure of the dynamics model to guide its exploration in this unobserved regime. Thus, using the model-based prior in BO leads to a sample-efficient exploration of the policy space while overcoming the biases in the model-based cost estimates, leading to optimal performance on the actual system.
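This interplay can be demonstrated on a one-dimensional toy policy space (entirely our own construction, not one of the paper's tasks): near the observed policies, the response surface matches the measured costs, while far from them it falls back on the (biased) model-based estimate used as the prior mean:

```python
import numpy as np

# Toy 1D policy space: true cost J and a biased model-based estimate Jhat.
J_true = lambda th: (th - 2.0) ** 2
J_hat = lambda th: (th - 1.5) ** 2 + 0.3      # model bias: shifted minimum, offset

def k(a, b, ell=0.5):
    """Squared exponential kernel over 1D policy parameters."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def posterior_mean(th_train, y_train, th_test, prior_mean, sn=1e-2):
    """GP posterior mean with the model-based estimate as the prior mean."""
    K = k(th_train, th_train) + sn**2 * np.eye(len(th_train))
    return prior_mean(th_test) + k(th_test, th_train) @ np.linalg.solve(
        K, y_train - prior_mean(th_train))

th_obs = np.array([0.0, 0.5])   # policies already evaluated on the real system
y_obs = J_true(th_obs)          # their observed costs

mu_near = posterior_mean(th_obs, y_obs, np.array([0.5]), J_hat)  # matches the data
mu_far = posterior_mean(th_obs, y_obs, np.array([5.0]), J_hat)   # reverts to prior
```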
Note that we collect trajectory data at each iteration, so, in theory, we can update the dynamics model, and hence the response surface prior, at each iteration. However, it might be desirable to update the prior only every $N$ iterations instead, as the GP model can change significantly between consecutive iterations, especially when the dataset is small. We demonstrate the effect of $N$ on the learning progress in Section 6.
It should also be noted that algorithms like PILCO [2] can be thought of as a special case of our approach in which the response surface consists exclusively of the prior mean provided by the dynamics model, without any consideration of the evidence (i.e., the measured costs). In other words, PILCO does not take the observed costs into account. Leveraging this evidence allows BO to learn an accurate response surface by accounting for the differences between the “belief” cost based on the dynamics model and the actual cost observed on the system.
Remark 1
Remark 2
Even though we use Gaussian processes to learn the dynamics model in this work, the proposed approach is agnostic to the function approximator used. Other dynamics models, such as time-varying linear models and neural networks, can thus easily be used.
6 Experimental Results
In this section, we compare the performance of MBMF with a pure model-based method, a pure model-free method, and a combination of the two in which the model is used to “warm start” the model-free method.
6.1 Experimental Setting
Task details
We apply the proposed approach and the baselines to two different tasks. In the first task, a 2D point mass moves in the presence of obstacles. The setup of the task is shown in Figure 2. The agent has no information about the position or the type of the obstacles (the grey cylinders). The goal of the agent is to reach the goal position (the green circle) from the starting position (the red circle). For the cost function, we penalize the squared distance from the goal position.
In the second task, an underactuated three degree-of-freedom (DoF) robotic arm (only two of the three joints can be controlled) must push an object from one position to another. The setup of the task is shown in Figure 4. The red box represents the object that needs to be moved to the goal position, denoted by the green box. As before, the squared distance of the object from its goal position is used as the cost function. These tasks pose challenging learning problems because they are under-observed and underactuated, and have both contact and non-contact modes, which result in discontinuous dynamics.
Implementation details
For learning the Gaussian process models, we use the Python-based GPy package [29]. We use the Dividing Rectangles (DIRECT) algorithm [30] for all the policy searches in this paper. For simulating the tasks, we use OpenAI Gym [31] environments and the MuJoCo [32] physics engine. In our experiments, we employ linear policies, but more complex policies can easily be incorporated as well.
Baselines details
For the model-based method, we learn a GP-based dynamics model and use it to perform policy search: given the dynamics model and the cost function, we learn a linear policy using DIRECT. The resultant policy is then executed on the real system, and the corresponding state and action trajectories, as well as the resultant cost, are recorded. The observed trajectories are added to the training set, and the entire process is repeated. We denote this baseline as MB in our plots.
For the model-free method, we use Bayesian optimization to directly find the optimal policy parameters, denoted as MF in the plots. In the final variant, we use the model-based method above to optimize the policy for a given number of iterations, after which we switch to BO and continue the optimization; the cost observations obtained during the executions of the model-based method are used to initialize the BO. We denote this variant as MB+MF in the plots and simulate it for different switching points, i.e., different numbers of iterations after which we switch to the model-free approach. Finally, we denote our approach as MBMF and simulate it for multiple prior update frequencies $N$.
6.2 2D Point Mass
The optimal costs obtained by the different approaches as learning progresses are shown in Figure 1, where each iteration corresponds to one execution on the real system. MBMF is able to learn a good policy within roughly 15 iterations. The model-free approach (the dot-dashed blue curve) improves as learning progresses, but is still significantly outperformed by all other approaches, indicating the data-inefficiency of a pure model-free approach.
The pure model-based approach (the dashed orange curve) continues to improve as learning progresses; however, it is outperformed by MBMF. Interestingly, in this case, using the model-based method to warm start the model-free method does not improve performance, as evident from the dotted red and the marked green curves, indicating that using the model-based component merely to initialize the model-free component may not be sufficient for policy improvement. In contrast, using the model information as a prior for the model-free method outperforms the other approaches and continues to improve as learning progresses, indicating the utility of systematically incorporating the model information during policy exploration. The mean and the standard deviation of the costs obtained by the baselines and the MBMF approach across 30 trials are listed in Table 1. We also note that the MBMF approach has a significantly smaller variance than all the other baselines, indicating the consistency of its performance.

We found that the frequency at which the prior is updated in the MBMF approach affects the learning, as indicated by the purple and the brown curves in Figure 1, for which the prior was updated every iteration and every 10 iterations, respectively. As evident from Figure 1, updating the model prior too frequently or too infrequently can both lead to suboptimal performance. Updating too often makes the method overly sensitive to changes in the dynamics model, which can change significantly, especially early on in the learning process, and can “misguide” the policy exploration. On the other hand, updating too infrequently may strip the model-free method of the full potential of the dynamics model. In this particular case, an intermediate update frequency turns out to be optimal, although nearby values of $N$ perform nearly as well. It is also worth noting that the MBMF approach roughly matches or outperforms the baselines for all values of $N$.


We also plot the trajectories obtained by executing the learned controllers on the actual system for the MB, MF, and MBMF approaches in Figure 2. The initial and goal positions are denoted by the red and green circles, respectively. For comparison purposes, the globally optimal trajectory (the dotted red curve) was also computed using the actual system dynamics obtained through MuJoCo. We plot the trajectories for different trials, which correspond to different (but identical across all methods) initial data. As evident from the figure, the MBMF approach consistently reaches the goal state, whereas the MB and MF approaches fail to achieve consistently good performance. In particular, the optimal trajectory requires the system to overcome the obstacle next to the starting position. A pure MB approach is unable to consistently learn this behavior, potentially because it requires learning a discontinuous dynamics model; consequently, it is often unable to reach the goal position and gets stuck at the obstacles (Figure 2). Similarly, a pure MF approach is unable to learn to overcome the obstacles within 25 iterations. The MBMF approach, however, can take the evidence into account and is able to overcome this challenge to reach the goal state consistently, demonstrating its robustness to the training data, which is also evident from the lower variance in the performance of MBMF in Table 1.
6.3 Three DoF Robotic Arm
As evident from Figure 3, MBMF outperforms the other approaches and continues to improve the policy over iterations. Interestingly, in this case, if the prior is updated too infrequently ($N$ is large), then MBMF initially lags behind the pure model-based approach, as it is not fully leveraging the dynamics model information; it nevertheless eventually catches up with the MB approach. However, if the right update frequency is chosen, then MBMF can leverage the advantages of both the MB and MF approaches and outperforms both. The mean and the standard deviation of the costs obtained by the baselines and the MBMF approach are listed in Table 1. As before, the MBMF approach has a smaller variance than all the other baselines, indicating that it is robust to the initial training data and consistently outperforms the other approaches.
The corresponding trajectory comparison between the MB and MBMF approaches in Figure 4 also highlights the efficacy of MBMF in leveraging the advantages of both the model-based and model-free components to quickly learn a good policy. A pure MB approach struggles to learn to move the object vertically in a straight line, potentially due to the complexity of the dynamics given the contact-rich nature of the task. The MBMF approach, on the other hand, can trade off the observed costs and the predicted cost; as a result, it is able to move the object close to the goal position within a small number of iterations (20 in this case).
7 Conclusion
In this paper, we proposed MBMF, a novel probabilistic framework that combines model-based and model-free RL methods. This bridging is achieved by using the cost estimated by the model-based component as the prior for the model-free component. Our results show that the proposed approach can overcome the model bias and inaccuracies in the dynamics model while retaining the fast convergence rate of model-based approaches. Several interesting future directions emerge from this work. First, it would be interesting to investigate how the approach performs on more complex tasks. Moreover, inference with Gaussian processes scales cubically with the number of training samples [23], which makes the proposed approach prohibitive for higher-dimensional systems or policies; exploring more scalable versions of the approach is an interesting future direction. Finally, a natural direction of research is the inclusion of other intermediate representations, such as value functions and trajectories, in the proposed approach.
References
 Deisenroth et al. [2013] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
 Deisenroth et al. [2015] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
 Levine and Abbeel [2014] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 Gläscher et al. [2010] J. Gläscher, N. Daw, P. Dayan, and J. P. O’Doherty. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4):585–595, 2010.
 Simon [1982] H. A. Simon. Models of bounded rationality: Empirically grounded economic reason, volume 3. MIT press, 1982.
 Nguyen-Tuong et al. [2009] D. Nguyen-Tuong, M. Seeger, and J. Peters. Model learning with local Gaussian process regression. Advanced Robotics, 23(15):2015–2034, 2009.
 Nguyen-Tuong and Peters [2011] D. Nguyen-Tuong and J. Peters. Model learning for robot control: a survey. Cognitive Processing, 12(4):319–340, 2011.
 Schreiter et al. [2015] J. Schreiter, P. Englert, D. Nguyen-Tuong, and M. Toussaint. Sparse Gaussian process regression for compliant, real-time robot control. In International Conference on Robotics and Automation, 2015.
 Pan and Theodorou [2014] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems, pages 1907–1915, 2014.
 Kocijan et al. [2004] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian process model based predictive control. In American Control Conference, pages 2214–2219. IEEE, 2004.
 Calandra et al. [2015] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1):5–23, 2015.
 Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011.
 McHutchon [2014] A. McHutchon. Nonlinear modelling and control using Gaussian processes. PhD thesis, Department of Engineering, University of Cambridge, 2014.
 Bansal et al. [2017] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin. Goal-driven dynamics learning via Bayesian optimization. arXiv preprint arXiv:1703.09260, 2017.
 Donti et al. [2017] P. L. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning. arXiv preprint arXiv:1703.04529, 2017.
 Gu et al. [2016] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. arXiv preprint arXiv:1603.00748, 2016.
 Sutton [1991] R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Heess et al. [2015] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Farshidian et al. [2014] F. Farshidian, M. Neunert, and J. Buchli. Learning of closed-loop motion control. In Intelligent Robots and Systems Conference, pages 1441–1446. IEEE, 2014.
 Nagabandi et al. [2017] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
 Chebotar et al. [2017] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.
 Wilson et al. [2014] A. Wilson, A. Fern, and P. Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.
 Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.
 Kushner [1964] H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86:97, 1964.
 Osborne et al. [2009] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In International Conference on Learning and Intelligent Optimization, pages 1–15, 2009.
 Shahriari et al. [2016] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Močkus [1975] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, 1975.
 Kupcsik et al. [2014] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann. Model-based contextual policy search for data-efficient generalization of robot skills. Artificial Intelligence, 2014.
 GPy [since 2012] GPy. GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
 Gablonsky et al. [2001] J. M. Gablonsky et al. Modifications of the DIRECT algorithm. 2001.
 Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
 Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems Conference, pages 5026–5033. IEEE, 2012.