Baconian: A Unified Opensource Framework for Model-Based Reinforcement Learning

04/23/2019 ∙ by Linsen Dong, et al. ∙ Nanyang Technological University

Model-Based Reinforcement Learning (MBRL) is one category of Reinforcement Learning (RL) methods which can improve sampling efficiency by modeling and approximating system dynamics. It has been widely adopted in research on robotics, autonomous driving, etc. Despite its popularity, there is still a lack of sophisticated and reusable opensource frameworks to facilitate MBRL research and experiments. To fill this gap, we develop a flexible and modularized framework, Baconian, which allows researchers to easily implement an MBRL testbed by customizing or building upon our provided modules and algorithms. Our framework frees users from re-implementing popular MBRL algorithms from scratch, thus greatly saving their effort.




1 Introduction

Model-Based Reinforcement Learning (MBRL) has been proposed to reduce the sample complexity of model-free Deep Reinforcement Learning (DRL) methods [Nagabandi et al.2018]. Specifically, MBRL approximates the system dynamics with a parameterized model, which can be utilized for policy evaluation and improvement when training data is limited and costly to obtain in the real world.

Implementing a DRL testbed from scratch can be tedious and error-prone. Fortunately, many opensource frameworks have been released to facilitate DRL-related research, including baselines [Dhariwal et al.2017], rllab [Duan et al.2016], Coach [Caspi et al.2017], and Horizon [Gauci et al.2018]. These frameworks, however, are mainly implemented for model-free DRL methods, and a unified MBRL opensource framework is still needed.

The challenges of implementing an MBRL framework are mainly twofold. First, the methods for modeling the dynamics differ from case to case [Polydoros and Nalpantidis2017, Deisenroth et al.2013]. A dynamics model can be classified as stochastic or deterministic, and as global or local, depending on the point of view. In implementation, the dynamics model can be approximated by different methods, including Gaussian Processes (GP), Gaussian Mixture Models (GMM), Neural Networks (NN), and Linear Regression. Second, the control flow of MBRL can differ substantially from that of model-free RL, and even varies across different MBRL algorithms [Sutton1991]. Developing a general framework that captures the major MBRL algorithms is therefore not a trivial task.
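As a concrete illustration of the first challenge, one of the simplest choices listed above is a deterministic global dynamics model fit by linear regression over observed transitions. The sketch below is a hypothetical minimal example (the class name and `fit`/`predict` methods are illustrative, not Baconian's actual API):

```python
import numpy as np

class LinearDynamicsModel:
    """Approximate s' = A s + B a by least squares over observed
    transitions. A minimal illustrative sketch of a deterministic,
    global dynamics model; not Baconian's actual implementation."""

    def __init__(self, state_dim, action_dim):
        self.state_dim = state_dim
        self.action_dim = action_dim
        # Stacked weight matrix [A.T; B.T], shape (sd + ad, sd)
        self.W = np.zeros((state_dim + action_dim, state_dim))

    def fit(self, states, actions, next_states):
        # Stack (state, action) pairs and solve the least-squares
        # problem  [S U] W = S'  for W.
        X = np.hstack([states, actions])
        self.W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

    def predict(self, state, action):
        # Predicted next state for a single (state, action) pair.
        return np.hstack([state, action]) @ self.W
```

A Gaussian Process or neural network model could expose the same `fit`/`predict` interface while capturing nonlinear or stochastic dynamics.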

To fill this gap, we design and implement a unified MBRL framework, Baconian, by trading off the diversity of MBRL algorithms included against the complexity of the framework. To the best of our knowledge, this is the first opensource MBRL framework.

2 System Overview

The system overview of Baconian is shown in Figure 1. We design Baconian with the objective of minimizing users' coding effort in developing and testing MBRL algorithms. With Baconian, the user can easily set up an MBRL experiment by configuring the target algorithms and modules, without needing to understand the inner mechanisms. Some usage examples of Baconian are given on our project website.

Experiment Configuration.

The Experiment Configuration layer consists of the Experiments Configurator, Status Collector, and Experiment Recorder. The Experiments Configurator manages the initialization of the modules. The Status Collector gathers status information from different modules, including the agent, environment, and algorithms, to form a globally shared status. The Experiment Recorder records the information generated during the experiment, such as training loss, rewards, and model checkpoints. These records are handed to the Logging/Visualization layer for rendering.

MBRL Training Flow.

In the MBRL Training Flow layer, the agent interacts with the environment to collect training samples, which are then used to update the dynamics model and the value function/policy. The training flows, including sampling from the environment and the training and testing processes for both the value function/policy and the dynamics model, are controlled according to the user's configuration.

Logging and Visualization.

The Logging and Visualization layer handles the records obtained from the Experiment Recorder for further processing and rendering, including logging, visualization, and analysis.

Figure 1: The system design of Baconian

3 Implementation

Baconian follows Object-Oriented Design (OOD) and applies design patterns including the strategy, observer, singleton, and decorator patterns to build a flexible and reusable framework. The simplified system implementation diagram is shown in Figure 2.

Figure 2: Simplified system UML diagram.

Algorithm Module.

Most popular MBRL algorithms with various dynamics models are implemented in Baconian. For example, we implement MBRL methods such as Dyna, iLQR, LQR, and MPC, as well as model-free DRL methods including DQN, DDPG, and PPO as supporting training algorithms for MBRL. These algorithms are implemented in a modularized manner. Different kinds of dynamics models are provided, such as Gaussian Process, Multi-Layer Neural Network, and Linear Regression. We also provide APIs to support user-defined dynamics models, terminal functions, and reward functions.
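As a rough illustration of how an MPC-style algorithm uses a learned dynamics model, the following is a generic random-shooting MPC sketch: sample candidate action sequences, roll each out through the model, and return the first action of the best sequence. All names and parameters here are hypothetical, not Baconian's implementation:

```python
import numpy as np

def mpc_random_shooting(dynamics, reward_fn, state, horizon=10,
                        n_candidates=100, action_low=-1.0,
                        action_high=1.0, rng=None):
    """Random-shooting MPC sketch for a 1-D action space:
    evaluate random action sequences under the learned model and
    return the first action of the highest-return sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        # Sample a candidate open-loop action sequence.
        seq = rng.uniform(action_low, action_high, size=horizon)
        s, total = state, 0.0
        # Roll the sequence out through the (learned) dynamics model.
        for a in seq:
            total += reward_fn(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_action = total, seq[0]
    # Only the first action is executed; MPC replans at every step.
    return best_action
```

In practice the `dynamics` callable would be the model learned from data, and the candidate distribution is often refined iteratively (as in CEM) rather than sampled uniformly.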

Control Flow Module.

In Baconian, the control flow of MBRL is delegated to an independent module, which improves the flexibility and extensibility of the framework. We abstract and implement two commonly used control flows: the model-free DRL flow, consisting of sampling, policy evaluation, and policy improvement; and the Dyna-like control flow [Sutton1991], which is used in model-based algorithms. Furthermore, users can define new control flows by inheriting from the control flow module, without extra changes to other modules.
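A Dyna-like control flow interleaves three steps: learning from real transitions, fitting the model, and learning from model-generated (imagined) transitions. A schematic sketch of one such iteration, with hypothetical module interfaces rather than Baconian's actual classes:

```python
def dyna_flow(env, agent, model, buffer, n_real=1, n_imagined=5):
    """One iteration of a Dyna-like control flow [Sutton1991].
    All interfaces (observe/step/act/update/fit/predict/sample)
    are illustrative placeholders."""
    # (a) Direct RL: learn from real environment interaction.
    for _ in range(n_real):
        s = env.observe()
        a = agent.act(s)
        s2, r = env.step(a)
        buffer.append((s, a, r, s2))
        agent.update((s, a, r, s2))
    # (b) Model learning: fit the dynamics model on collected data.
    model.fit(buffer)
    # (c) Planning: learn from imagined transitions under the model.
    for _ in range(n_imagined):
        s, a, _, _ = buffer.sample()
        s2_hat, r_hat = model.predict(s, a)
        agent.update((s, a, r_hat, s2_hat))
```

The model-free flow is the special case with steps (b) and (c) removed, which is one reason delegating the flow to its own module keeps both variants reusable.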

Status Control.

Status control is essential for DRL experiments. For instance, off-policy DRL methods need to switch between the behavior policy and the target policy during sampling and testing, or decay the exploration action noise with respect to the training progress. To reduce the users' effort in dealing with such issues, we develop a hierarchical status system which can be utilized by the agent and algorithms to dynamically control their behaviors. We further develop a global status collector module which collects and shares status information from other modules at runtime, such as the total number of samples obtained by an agent or the total number of training steps.
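The global status collector can be sketched in observer style: each module registers a zero-argument status getter, and the collector aggregates them into one shared snapshot on demand. The names below are illustrative, not Baconian's actual API:

```python
class StatusCollector:
    """Observer-style global status collector sketch: modules
    register callables that report their own status; the collector
    merges them into a globally shared snapshot (hypothetical API)."""

    def __init__(self):
        self._sources = {}

    def register(self, name, getter):
        # `getter` is a zero-argument callable returning that
        # module's current status as a dict.
        self._sources[name] = getter

    def collect(self):
        # Pull fresh status from every registered module and merge
        # it under the module's name.
        return {name: getter() for name, getter in self._sources.items()}
```

A decayed exploration-noise schedule, for example, could read `collect()['agent']['total_samples']` instead of each module tracking global progress itself.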

Parameter Management.

To facilitate hyper-parameter setting, tuning, and model storage/reload, we unify parameter management in a single module, which is composed into other modules that require it. Extra support for operations on TensorFlow variables and graphs is also included.
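A minimal sketch of such a dict-backed parameter module, composed into any module that needs hyper-parameters, might look as follows (hypothetical API; the real module additionally handles TensorFlow variables and graphs):

```python
class Parameters:
    """Dict-backed hyper-parameter store sketch. Composed into
    other modules that need parameter management; illustrative
    only, not Baconian's actual class."""

    def __init__(self, defaults):
        # `defaults` fixes the set of known hyper-parameters.
        self._values = dict(defaults)

    def __getitem__(self, key):
        return self._values[key]

    def set(self, key, value):
        # Reject unknown keys so typos in tuning scripts fail loudly.
        if key not in self._values:
            raise KeyError(f"unknown hyper-parameter: {key}")
        self._values[key] = value

    def to_dict(self):
        # Snapshot for logging or serialization to disk.
        return dict(self._values)
```

Composition (each module holding a `Parameters` instance) rather than inheritance keeps storage/reload logic in one place while letting every module define its own defaults.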

4 Benchmark Results

We conduct a benchmark on the continuous control task Pendulum-v0 from OpenAI Gym. The results are given in Table 1. In all experiments, 10,000 samples are used for training, and the cumulative reward is averaged over 10 independent tests for each algorithm with different random seeds. The benchmark code and hyper-parameter settings are also included on the project website.

Algorithms Cumulative Reward
DDPG (baselines) -499.89
DDPG -417.67
Dyna (DDPG, global MLP dynamics) -1076.82
MPC (global MLP dynamics) -618.72
Table 1: Preliminary benchmark on the continuous task Pendulum-v0. The DDPG (baselines) result is obtained with baselines [Dhariwal et al.2017] using the same hyper-parameters, for comparison.

5 Conclusion

In this paper, we presented a unified, reusable, and flexible framework, Baconian, for MBRL research. We aim to drive the development of MBRL by helping users conduct MBRL experiments effortlessly. In the future, we will continue developing Baconian, including implementing additional state-of-the-art MBRL algorithms, e.g., E2C [Watter et al.2015] and GPS [Levine and Koltun2013], and providing their benchmark results on different tasks.


  • [Caspi et al.2017] Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement learning coach, December 2017.
  • [Deisenroth et al.2013] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
  • [Dhariwal et al.2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines, 2017.
  • [Duan et al.2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • [Gauci et al.2018] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.
  • [Levine and Koltun2013] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
  • [Nagabandi et al.2018] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
  • [Polydoros and Nalpantidis2017] Athanasios S Polydoros and Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
  • [Sutton1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
  • [Watter et al.2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.