Intelligent Trainer for Model-Based Reinforcement Learning

by   Yuanlong Li, et al.
Nanyang Technological University

Model-based deep reinforcement learning (DRL) algorithms use data sampled from a real environment to learn the underlying system dynamics and construct an approximate cyber environment. By using the synthesized data generated from the cyber environment to train the target controller, the training cost can be reduced significantly. In current research, issues such as the applicability of the approximate model and the strategy for sampling and training from the real and cyber environments have not been fully investigated. To address these issues, we propose an intelligent trainer that makes proper use of the approximate model and controls the sampling and training procedure in model-based DRL. To do so, we package the training process of a model-based DRL algorithm as a standard RL environment and design an RL trainer to control the training process. The trainer has three control actions: the first action controls where to sample in the real and cyber environments; the second action determines how much data should be sampled from the cyber environment; and the third action controls how many times the cyber data should be used to train the target controller. Without the trainer, these actions would have to be controlled manually. The proposed framework is evaluated on five different tasks from OpenAI Gym, and the test results show that the proposed trainer achieves significantly better performance than a fixed-parameter model-based RL baseline algorithm. In addition, we compare the performance of the intelligent trainer to that of a random trainer and show that the intelligent trainer can indeed learn on the fly. The proposed training framework can be extended to more control actions with more sophisticated trainer designs to further reduce the tuning cost of model-based RL algorithms.



I Introduction

Deep reinforcement learning (DRL), which combines reinforcement learning (RL) and deep neural networks (DNN), has demonstrated its prowess in solving complex decision-making problems, such as Go [1]. A series of recent breakthroughs have also shown that DRL algorithms, such as Deep Deterministic Policy Gradients (DDPG) and Trust Region Policy Optimization (TRPO), perform well on continuous control problems [2] [3]. Along with this capability and promise, DRL also incurs a high training cost and thus presents great challenges in practice. For example, in [4] the authors showed that a DRL agent learns to score a goal with high probability only after about one million trials. For practical control applications that rely on data directly sampled from real physical systems, the time and resource costs required during training can be prohibitively high.


One existing approach to this challenge is model-based reinforcement learning [6], which has been applied to robot arm training [6] and online tree-search-based planning [7] [8] [9]. Model-based RL uses the data collected from the real-world system to train a system dynamics model. The learned model then generates synthesized data that, along with real-world data, are used to train the controller and search for good actions. Because producing synthesized data in a cyber environment is generally inexpensive, model-based RL has the advantage of low training cost.

Despite the efforts related to model-based RL, the following crucial issues have not been sufficiently investigated:

  • The effectiveness of the model-based approach as a data source. This depends on whether the model of the underlying system dynamics can be learned well. In some cases, as will be shown in this paper, using model-generated data can cause serious training degeneration. Even worse, the state observed for the decision-making problem may be only a partial observation of the real system, which makes it hard to foresee whether the learned system model will work in practice. When we try to solve a problem that has never been attempted before, we have no knowledge of whether the model can help.

  • The setting of hyper-parameters related to the model. General model-based RL approaches use a fixed amount of model-generated data in training, with the related hyper-parameters tuned manually. For example, because model-generated data can be noisy, we may need a proper amount of such data that helps exploration without affecting the final training quality. The additional tuning cost may cancel out the advantage of the lower sampling cost of model-based RL. A naive approach to this issue is to re-train the controller with different training parameters on the collected data samples. Such a solution may not work well in the RL case, as the training quality of an RL policy is largely determined by the quality of the data. When the policy used in sampling underperforms because of improper parameter settings, the acquired data are wasted because of their low quality. A training algorithm with self-adaptive parameter settings is therefore critical for model-based RL to achieve its target of saving sampling cost in practice.

To solve these problems and develop a practical model-based DRL algorithm, we propose an intelligent-trainer-enhanced model-based DRL framework, which learns the optimal model-related parameters and training/sampling settings online and achieves close-to-optimal performance (compared with the algorithm with manually optimized hyper-parameters). We first construct, based on the standard model-based DRL training process, a training process environment (TPE). We then add an intelligent trainer, which interacts with the TPE to control the sampling and training process of the target controller. Unlike existing approaches that directly modify the training algorithm of the target controller [9], this “reinforcement on reinforcement” design can work with different model-based RL controllers and with different trainer designs. We test a single-head trainer, in which a single DQN trainer is learned online to optimize the sampling and training of a single model-based RL controller. This online learning, however, is a single-head action process with all actions correlated, so the trainer could be trapped in an unfavorable state by random actions and never recover. To resolve this issue, we build an ensemble trainer comprising multiple trainers that take independent actions in their training processes; by comparing their performance, we can identify the best actions. Because of the sampling budget constraint, the trainers share the same sampling budget. We design a memory sharing mechanism so that all trainers have enough real data in training. In this case, however, the real data samples are generated by different policies of varying performance, and the data generated by an underperforming policy can lower the quality of the other policies. To solve this issue, based on the rewards accumulated by the different trainers, we select the target controller of the best trainer as the reference agent and partly use this reference agent for the other trainers when sampling from the real environment, so that the real data samples are more likely to come from well-trained agents. We also design a weight transfer process to reboot an underperforming trainer when its performance is unsatisfactory. The proposed multi-head operation uses the same amount of data as the single-head operation, with no extra data needed to guide the training and ensemble process. It is expected to achieve performance close to that of well-tuned model-based RL across different tasks, using the same setting of the newly introduced hyper-parameters in the ensemble trainer.

We extensively test the proposed framework on five representative tasks from OpenAI Gym [10]. We primarily investigate a deep Q-network (DQN) based intelligent trainer. The results show that the DQN trainer achieves good performance compared with a fixed-setting model-based RL baseline. Moreover, the ensemble trainer can overcome the performance degradation caused by the controllers’ sensitivity to cyber data. The test results show that, for all tasks, our framework delivers performance close to that of a manually optimized training algorithm, and in two of the five tasks it even achieves better results. The main contributions of the paper are:

First, we propose a “reinforcement on reinforcement” trainer framework for model-based RL. This framework decouples the cyber model related setting from the training algorithm of the target controller, thus provides much needed flexibility for tuning and optimization of these settings.

Second, we design an intelligent trainer based on DQN. This DQN trainer enables online learning of proper settings for training and sampling, without incurring additional sampling cost for model-based RL algorithm.

Third, we design an ensemble trainer, which enhances the performance of the single-head intelligent trainer with the same amount of real data sampling. With proposed memory sharing, reference sampling, and weight transfer schemes, the ensemble trainer successfully learns the best control settings for different scenarios and achieves good performance even when the training of the target controller is extremely sensitive to the quality of cyber data.

As a result, the proposed framework reduces the algorithm tuning cost and makes model-based RL algorithms more applicable in practice. To facilitate research on model-based DRL algorithms, we open-source our training framework [11].

The remainder of this paper is structured as follows. Section II briefly surveys related work. Section III provides a detailed description of the proposed trainer framework, including its key components, the single-head trainer design, and the ensemble trainer design. Section IV presents the numerical evaluation results of the proposed framework. Section V concludes the paper.

II Related Works

To build intelligent agents that can learn to accomplish various control tasks, researchers have been actively studying reinforcement learning for decades [12] [13] [14] [15] [16]. With the recent advancement of deep learning (DL), DRL [1] has demonstrated its strength in various applications. For example, in [17] a DRL agent is proposed to solve financial trading tasks; in [18] a neural RL agent is trained to mimic human motor skill learning; in [19] an off-policy RL method is proposed to solve nonlinear and nonzero-sum games.

Despite the significant performance improvement, the high sampling cost required by DRL has become a significant issue in practice. To address this issue, model-based RL learns the system dynamics model so as to reduce the data collection and sampling cost. In [6] the authors provided a model-based RL method for a robot controller that samples from both the real physical environment and a learned cyber emulator. In [20] the authors adapted a model, trained previously for other tasks, to train the controller for a new but similar task. This approach combines prior knowledge with online adaptation of the dynamics model and thus achieves better performance. In these approaches, the number of samples taken from the cyber environment to train the target controller is either predetermined or can only be adjusted manually, resulting in both sampling inefficiency and additional algorithm tuning cost. In [21] the authors proposed a model-assisted bootstrapped DDPG algorithm, which uses a variance ratio computed from the multiple heads of the critic network to decide whether the cyber data can be used. The method relies on the bootstrapped DQN design, which is not suitable for other cases.

Instead of treating the cyber model as a data source for training, some approaches use the cyber model to conduct pre-trial tree searches in applications where selecting the right action is highly critical [7] [8]. The cyber model can prevent the selection of unfavorable actions and thus accelerates the learning of the optimal policy. In [9], the authors introduced a planning agent and a manager that decides whether to sample from the cyber engine or to take actions, so as to minimize the training cost. Both approaches focus on tree search in action selection, which differs from our design, in which we aim to select the proper data source for sampling. Some recent works investigate integrating model-based and model-free approaches in RL. In [22] the authors combined model-based and model-free approaches for Building Optimization and Control (BOC), where a simulator is used to train the agent while a real-world test-bed is used to evaluate the agent’s performance. In [23] model-based DRL is used to train a controller agent. The agent is then used to provide weight initialization for a model-free DRL approach, so as to reduce the training cost. In contrast to this approach, we focus on sampling directly from the model to reduce the sampling cost in the real environment.

III Approach

In this section we present the proposed trainer framework. We first describe how we package the standard training process of model-based RL as an RL environment, and then present an RL-based trainer that can optimize the training inside this environment. Finally, we introduce a more robust ensemble trainer design that addresses the issue of action correlation.

The logic flow of the proposed intelligent trainer framework is shown in Fig. 1. The TPE is a standard model-based DRL system that uses the model as a data source for training. The training data are provided by the physical environment, which represents the real-world system, and the cyber environment, which is an emulator of the physical system. The emulator can be either knowledge-based or learning-based (e.g., a neural network prediction model). The target controller observes the states and takes actions to maximize the reward received. In addition, we set up an intelligent trainer as an independent entity, as opposed to treating it as part of the target controller [24]. This trainer is also an RL agent; it controls and optimizes the sampling and training process of the target controller in the real and cyber environments via feedback and action outputs. The proposed framework can thus be considered a “reinforcement on reinforcement” architecture. Such a modularized design works easily with different kinds of target controller training algorithms (such as DDPG and TRPO), and the extra layer of intelligent trainer can be any optimizer that can output the control action when given a TPE observation.

Fig. 1: Block diagram of the proposed intelligent trainer framework.

III-A Training Process Environment (TPE)

TPE has two important functions to execute the entire training process of a general model-based RL:

  • Initialization: execute initialization tasks for the model-based RL training process. These tasks include initializing the real training environment, the cyber emulator, and the target controller.

  • Step(state, action): execute one step of training of the model-based RL algorithm. As shown in Fig. 2, this process includes sampling from the real and cyber environments, training the target controller, and training the dynamics model of the cyber emulator. Note that in each step we keep the number of real data samples fixed while optimizing the amount of cyber data used in training. We found that this design is more stable in implementation because it simplifies the action design.

For the interaction between the TPE and the intelligent trainer, we define three components of the TPE, namely State, Action, and Reward, as follows. To distinguish the RL components in different layers, variables in the target controller layer and variables in the intelligent trainer layer carry different superscripts in the following.

  • State: The state is a vector exposed to an outside agent, which can use it to assess the training progress. Ideally, one would put as much information as possible into the state design to measure the training progress. However, we found that using a constant (zero) to represent the TPE state still works, as this simple setting allows the trainer to learn a good action quickly. We also tested other, more informative state representations, such as the last average sampling reward or the normalized sampling count. They can achieve better performance in certain cases. A comparative study of these designs is provided in the numerical evaluation (Section IV).


  • Action: the action space comprises three controllable parameters that are exposed to an outside agent, which can use them to control the training progress. We represent these parameters as probability values, all defined in the range [0, 1]. Such a representation greatly simplifies and accelerates the training process. Details of the three control parameters are given below.

    • Action $a_0$ is the ratio of the number of real data samples to the total number of data samples (real and cyber). Recall that in each step we sample a fixed number $K_r$ of real data samples. $a_0$ then controls the number $K_c$ of cyber data samples in each step by

      $$K_c = K_r \, \frac{1 - a_0}{a_0}. \qquad (1)$$

      The rationale of this design is to bound the action in the range [0, 1], so that it fits different tasks.

      In addition to the sampling setting, we also use action $a_0$ as the probability of taking a mini-batch from the real data memory when training the target controller. Naturally, $1 - a_0$ is the probability of taking a mini-batch from the cyber data buffer. With a fixed batch size, if we train with $T_r$ batches of real data, then $T_c$ batches of cyber data are used in this step:

      $$T_c = T_r \, \frac{1 - a_0}{a_0}. \qquad (2)$$

      Note that we use only one action to control both the sampling and the training process in order to accommodate DRL algorithms, such as TRPO, in which sampling and training cannot be decoupled.

    • Action $a_1$ relates to the selection of the starting state of a new episode when we sample from the cyber environment. In the sampling process, the starting state of an episode matters. For example, we can select a starting state from the real data buffer, which stores data collected previously from the physical system. In this case, the subsequent sampling process is a local search similar to the imagination process used in [8]. Alternatively, we can use a data point randomly selected from the state space to favor exploration. This action thus controls the trade-off between exploitation and exploration during the sampling process. In our design, $a_1$, with $a_1 \in [0, 1]$, represents the probability of choosing the starting state $s_0$ from the real data buffer:

      $$s_0 = \begin{cases} \text{a state drawn from the real data buffer}, & u < a_1, \\ \text{a randomly initialized state}, & \text{otherwise}, \end{cases} \qquad (3)$$

      where $u$ is a uniformly distributed random number drawn from [0, 1].

    • Action $a_2$, with $a_2 \in [0, 1]$, relates to the selection of the starting state of a new episode when we sample from the real environment. Similar to the second action, we use $a_2$ to control the trade-off between exploitation and exploration. We introduce the following sampling quality function for selecting a starting state $s$ when sampling in the physical environment:

      $$\Phi(s) = Q(s, \pi(s)), \qquad (4)$$

      where $Q$ is the value produced by the critic network of the target controller and $\pi$ is the current policy. We keep sampling random starting points in the physical environment until a high-quality starting point is found, as shown in Algorithm 1. In this way, we select initial states with higher values and thus accelerate the convergence of the critic network. Note that a higher value of $a_2$ indicates that we favor exploitation over exploration. In brief, when the second or third action approaches 1, the optimization process favors exploitation; when it approaches 0, the optimization process favors exploration.

  • Reward: We define the reward as

    $$r = \operatorname{sign}\left(\bar{r}_{t+1} - \bar{r}_{t}\right), \qquad (5)$$

    where $\bar{r}_{t+1}$ and $\bar{r}_{t}$ are the average sampling rewards of the target controller at steps $t+1$ and $t$, respectively, on the real environment. This means that, as long as the reward is increasing, the current training action is considered acceptable. Although such a simple design allows the trainer to learn the settings quickly, it may not be effective in all practical cases, especially when the cyber data do not degrade the performance but prolong the convergence. A more effective rank-based reward design is used in the ensemble trainer in Section III-C.
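The action and reward definitions above can be illustrated with a short Python sketch. The function names and arguments are hypothetical and are not taken from the authors' released code. The first function simply inverts the stated ratio definition of the first action, and the rounding to an integer count is an assumption.

```python
import random


def cyber_sample_count(n_real, ratio):
    """Cyber samples per step when ratio = n_real / (n_real + n_cyber)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must lie in (0, 1]")
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(n_real * (1.0 - ratio) / ratio)


def choose_cyber_start(prob_real, real_buffer, random_state, rng=random):
    """With probability prob_real start from a stored real state, else explore."""
    if real_buffer and rng.random() < prob_real:
        return rng.choice(real_buffer)
    return random_state()


def sign_reward(prev_avg, new_avg):
    """Sign of the change in average sampling reward: +1, 0, or -1."""
    return (new_avg > prev_avg) - (new_avg < prev_avg)
```

With a ratio of 1 no cyber data are sampled, which matches the real-data-only setting; smaller ratios request proportionally more cyber data.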

In summary, the action space of the TPE consists of the three components described above.

Fig. 2: Block diagram of the logic flow of the TPE, from sampling to training. In the implementation, in order to use the sampling process to collect reward information, the third module is placed before the first one. The definitions of the parameters are given in the text.

III-B Intelligent Trainer

The intelligent trainer is designed to optimize the three control actions during the training of the target controller in an online and on-policy manner. At each time step, the TPE advances by one step and the trainer collects one sample from it, as described in Algorithm 2. Note that with this design, only one target controller is involved in training, and all actions are tested in a single stream of training. This single-head trainer needs to learn quickly with limited training time steps and samples. Several trainer learning algorithms, such as DQN and REINFORCE, can be used for this problem. In the following, we use a DQN controller to demonstrate the trainer design. A comparison of different trainer designs is given in Section IV-C.

We implement a specialized DQN controller that carries out discretized control actions with a relatively small-scale Q network. At each time step, the trainer evaluates all the actions with the Q network and selects the action with the highest Q value.

The training of the DQN controller follows the standard epsilon-greedy exploration strategy [24]. To enhance the training stability, the DQN controller is equipped with a memory, like the replay buffer in DDPG [2]. As such, the trainer can extract good actions from the noisy data received from the TPE. During the experiments, we noticed that samples from a single action could flood the buffer. This homogeneity in actions could prolong or even halt the training of the DQN. To solve this problem, we limit the total number of samples for any given action to a cap determined by the size of the buffer and the size of the action set. If the number of samples for a given action exceeds this limit, a new sample replaces a randomly selected old one.
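The per-action cap on the trainer's memory can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name is hypothetical, and setting the cap to the buffer size divided by the number of actions is an assumption, since the text states only that the cap depends on these two quantities.

```python
import random


class BalancedReplayBuffer:
    """Replay buffer that caps how many samples any single action may hold."""

    def __init__(self, capacity, num_actions, rng=None):
        # Assumed cap: an equal share of the buffer for every discrete action.
        self.per_action_cap = max(1, capacity // num_actions)
        self.samples = {a: [] for a in range(num_actions)}
        self.rng = rng or random.Random()

    def add(self, state, action, reward):
        bucket = self.samples[action]
        item = (state, action, reward)
        if len(bucket) < self.per_action_cap:
            bucket.append(item)
        else:
            # At the cap: overwrite a randomly chosen old sample of this action.
            bucket[self.rng.randrange(len(bucket))] = item

    def sample(self, batch_size):
        pool = [s for bucket in self.samples.values() for s in bucket]
        return self.rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return sum(len(bucket) for bucket in self.samples.values())
```

Keeping one bucket per action means a frequently chosen action can never crowd out the few samples collected for rarely chosen ones.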

We present the pseudo code of the whole training framework in Algorithm 2 and the sampling reset procedure in Algorithm 1. We keep resetting the environment until a state with high quality is acquired. In practice, the reset function can be achieved in a virtual manner, since the realization of a state in the real physical system is not required in the computation of its quality. A real reset happens after the good state is selected. As such, the proposed reset process will not cause additional cost in practice.

Algorithm 1 Sampling Reset Procedure
  if the current sampling environment is the real environment then
     Initialize an empty data set of candidate states and an empty quality set.
     for each reset attempt do
        Generate one initial state and compute its quality.
        Append the state to the data set and its quality to the quality set.
        if the stopping conditions on the state quality are met then
           Break.
        end if
     end for
     Return the last state of the data set.
  else
     if a uniform random number drawn from [0, 1] is below the second action then
        Randomly select a state from the real data memory.
        Set the cyber environment to this state.
        Return this state.
     else
        Randomly initialize the cyber environment.
        Return the current state of the cyber environment.
     end if
  end if
Algorithm 2 Intelligent Trainer Enhanced Model-Based DRL Training Algorithm
  Initialization: initialize the trainer agent (with a DQN network), the training process environment, and the target controller. Initialize the real data memory and the cyber data memory as empty sets. Sample a small initial data set to initialize the cyber emulator, and initialize the real environment.
  Set the counter of total samples generated from the real environment, and set the maximum number of real samples allowed.
  Training Process:
  while the counter is below the maximum number of real samples allowed do
     Generate an action from the trainer agent.
     One step in TPE:
     Train the target controller if there is enough data in its memory buffer.
     Sample the fixed number of data points from the real environment according to the sampling reset procedure in Algorithm 1, and append the data to the real data memory.
     Sample data points from the cyber environment, in the amount set by the first action, and append the data to the cyber data memory.
     Train the dynamic model.
     Update the counter of real samples.
     Collect the state, action, and reward data of the TPE.
     Update the trainer agent.
  end while
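The outer loop of Algorithm 2 can be sketched in Python as below. `ToyTPE` and `ConstantTrainer` are hypothetical stand-ins introduced only so that the loop runs; a real TPE step would train the target controller, sample real and cyber data, and refit the dynamics model, and a real trainer would be a learning agent such as the DQN described above.

```python
class ToyTPE:
    """Stand-in TPE (hypothetical): reward is +1 when the action raises a score."""

    def __init__(self):
        self.score = 0.0

    def reset(self):
        self.score = 0.0
        return 0  # constant (zero) state, the simplest design mentioned in the text

    def step(self, action):
        # Placeholder for: train controller, sample real/cyber data, fit model.
        new_score = self.score + (action - 0.5)
        reward = 1 if new_score > self.score else -1
        self.score = new_score
        return 0, reward


class ConstantTrainer:
    """Stand-in trainer (hypothetical) that always outputs the same action."""

    def __init__(self, action):
        self.action = action
        self.seen = []

    def act(self, state):
        return self.action

    def update(self, state, action, reward, next_state):
        self.seen.append((state, action, reward, next_state))


def run_training(trainer, tpe, max_real_samples, real_per_step):
    """The trainer picks an action, the TPE advances one step, the trainer learns."""
    used = 0
    state = tpe.reset()
    history = []
    while used < max_real_samples:
        action = trainer.act(state)
        next_state, reward = tpe.step(action)
        trainer.update(state, action, reward, next_state)
        history.append((action, reward))
        state = next_state
        used += real_per_step  # only real samples count against the budget
    return history
```

The loop terminates on the real-sample budget alone, which reflects the point that cyber samples are treated as inexpensive.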

III-C Ensemble Trainer

In this subsection we present a more robust trainer design that learns by comparison. The design rationale is that the single-head trainer described previously cannot, in some cases, adequately assess the quality of an action. One action could degrade the subsequent training process; in other words, the actions could be correlated and their quality could become indistinguishable. Also, for actions that generate non-negative reward but could lead to slow convergence or a locally optimal policy, the reward function (5) is unable to assess their quality accurately. To address these issues, we propose an ensemble trainer that uses a multi-head training process, similar to the boosted DQN [25]. The idea is to diversify actions across different trainers without imposing additional sampling cost, and then to evaluate the actions by ranking their performance.

We design an ensemble trainer that consists of three trainers with different settings, as shown in Fig. 3. The actions of trainer 0 are provided by the intelligent trainer; the actions of trainer 1 are provided by a random trainer; trainer 2 uses only real data, which means setting the three actions to 1, 0, and 0. The settings of trainers 0 and 1 enable exploitation and exploration of the action space. Trainer 2 is a normal DRL training process that does not use the cyber data generated by the dynamic model. We choose to ensemble these three distinct trainers because together they provide sufficient coverage of different trainer actions and each of them can work well in different cases. Note that it is not trivial to build an efficient ensemble trainer without incurring additional real data cost, as we found that the samples from different trainers can differ in quality, which may degrade the ensemble’s overall performance. In the following, we propose solutions to this issue.

III-C1 Real-Data Memory Sharing

In the ensemble learning process, the target controller within each trainer's TPE is trained independently. To ensure that no additional sampling cost is introduced, we split the real data samples evenly among the three trainers, one third for each. With fewer real data, however, the target controller may not be adequately trained. To address this issue, we devise a memory sharing process that runs before the training of the target controller, as shown in Fig. 3. The memory sharing scheme is a pseudo sampling process: each target controller also collects the real data samples of the other two trainers. As a result, at each step each trainer receives the same amount of new real data as in single-head training. Note that with memory sharing, the real data from an underperforming target agent could degrade the ensemble performance or even cause it to fail. To solve this problem, we next introduce a reference sampling scheme.
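The budget split and the memory sharing step can be sketched as follows. Both helper names are hypothetical, and how a remainder is distributed when the budget is not divisible by three is an assumption of this sketch.

```python
def split_budget(total, num_trainers=3):
    """Split a per-step real-sample budget as evenly as possible."""
    base, extra = divmod(total, num_trainers)
    # Assumption: any remainder goes to the lowest-indexed trainers.
    return [base + (1 if i < extra else 0) for i in range(num_trainers)]


def share_real_memory(new_samples_per_trainer):
    """Give every trainer its own new real samples plus those of the others."""
    pooled = [s for samples in new_samples_per_trainer for s in samples]
    return [list(pooled) for _ in new_samples_per_trainer]
```

For example, three trainers that each collect a third of the budget end up training on the full pooled set, so no trainer sees less real data than a single-head run would.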

III-C2 Reference Sampling

The idea behind reference sampling is to select the best trainer and then let the other trainers use its target controller, with a certain reference sampling probability, when they sample real data. In our algorithm, at the first of every three steps this probability is forced to 0. This first step, in which no reference sampling takes place, thus serves as an evaluation step for the trainers. In the next two steps, the probability is determined by a function of the current step number of the trainers, the skewness ratio, which measures the degree of outperformance of the best trainer, and the estimated upper and lower bounds. The details of the skewness ratio are given in the weight transfer procedure below. With this design, the better the performance of the best trainer, the higher the reference sampling probability.

III-C3 Rank-based Trainer Reward Calculation

After the target controllers of all trainers have been trained, we calculate for each trainer the average sampling reward of its target controller and take it as the raw reward of this trainer. Note that this raw reward is different from the sign reward used in (5). Next, we sort the raw rewards of the three trainers in ascending order. We then define the rank of each trainer as the index of its raw reward in the sorted tuple. The reward of a trainer is then defined as its rank.

The rationale is that if the action of a trainer is good for training, it should help the trainer achieve better performance (measured by the average sampling reward) and thus lead to a higher rank.

Note that with the above reward design, the trainers generate three data samples at the trainer level in each step, and all of these data are used to update the intelligent trainer. Because of the reference sampling mechanism, the rank information may not correctly measure the performance of the trainers. To solve this issue, we discard these samples when the reference sampling probability is not zero.
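The rank-based reward described in this subsection can be sketched as below. The function names are hypothetical; using zero-based indices for the ranks and breaking ties by trainer index are assumptions, since the text does not specify either.

```python
def rank_rewards(raw_rewards):
    """Rank each trainer by raw reward in ascending order (0 = worst)."""
    order = sorted(range(len(raw_rewards)), key=lambda i: raw_rewards[i])
    ranks = [0] * len(raw_rewards)
    for rank, trainer_index in enumerate(order):
        ranks[trainer_index] = rank
    return ranks


def trainer_level_samples(actions, raw_rewards, reference_prob):
    """(action, rank reward) pairs for updating the intelligent trainer.

    Samples are discarded when reference sampling was active, because the
    ranks then no longer reflect each trainer's own controller.
    """
    if reference_prob != 0:
        return []
    return list(zip(actions, rank_rewards(raw_rewards)))
```

Because only the ordering matters, the reward is insensitive to the scale of the task's raw rewards.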

III-C4 Weight Transfer

After collecting the trainer reward data, we apply a weight transfer mechanism to address the issue that some target agents may fail because of unfavorable trainer actions. The rationale is that, after collecting reward information for a sufficiently large number of steps, we can judge with high confidence which trainer is currently the best. We can then transfer the best target agent to the other trainers, so that trainers that have fallen behind can restart from a good position. In particular, after the trainer reward data are collected, we examine the number of steps that have been taken since the last weight transfer. If this number is larger than a threshold, we compute for each trainer the reward it has accumulated over these steps, up to the current trainer step. The trainer with the maximum accumulated reward is set as the best trainer. We then check whether the DQN trainer is the best; if not, we transfer the weight parameters of the target controller trained by the best trainer to the target controller trained by the DQN trainer.
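The weight transfer decision can be sketched as follows. The function names, the plain (undiscounted) sum of rank rewards, and the representation of controller weights as copyable Python objects are all assumptions made for illustration.

```python
import copy


def best_trainer(rank_history, window):
    """Return the index of the trainer with the largest summed rank reward
    over the last `window` steps, together with all the sums."""
    recent = rank_history[-window:]
    totals = [sum(step[i] for step in recent) for i in range(len(recent[0]))]
    best = max(range(len(totals)), key=lambda i: totals[i])
    return best, totals


def maybe_transfer(weights, rank_history, steps_since_transfer, threshold,
                   dqn_index=0):
    """Copy the best controller's weights to the DQN trainer's controller
    when enough steps have passed and the DQN trainer is not the best."""
    if steps_since_transfer <= threshold:
        return False
    best, _ = best_trainer(rank_history, steps_since_transfer)
    if best == dqn_index:
        return False
    weights[dqn_index] = copy.deepcopy(weights[best])
    return True
```

A deep copy is used so that the two controllers continue to train independently after the transfer.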

We also utilize the accumulated rank reward to detect whether the best trainer is significantly better than the other trainers. We calculate a performance skewness ratio from the accumulative rewards of the best, median, and worst of the three trainers to measure the degree to which the best trainer outperforms the others. The skewness ratio is used to determine the reference sampling probability according to (6).

Algorithm 3 shows the operational flow of the ensemble trainer. In summary, the ensemble trainer evaluates the quality of the actions by ranking the rewards received by the target controllers. It maintains the training quality through the memory sharing scheme, without incurring additional sampling cost. It maintains the sample quality through reference sampling. It can recover an underperforming trainer from poor actions. Although it saves on the sampling cost, the ensemble trainer requires three times the training time. The increased training time can be partially reduced by stopping some underperforming trainers early when necessary.
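As a rough illustration of reference sampling, the sketch below chooses which target controller collects real data at a given step. It assumes, based on the wording of Algorithm 3, that with the reference sampling probability the best player's controller is used and otherwise the trainer's own controller is used. The function name and string identifiers are hypothetical.

```python
import random
from typing import Optional


def pick_sampling_controller(own: str,
                             best: str,
                             reference_prob: float,
                             rng: Optional[random.Random] = None) -> str:
    """Return the identifier of the controller used to sample real data.

    Assumption: the best player's controller is chosen with probability
    `reference_prob`; otherwise the trainer samples with its own controller.
    """
    rng = rng or random.Random()
    return best if rng.random() < reference_prob else own
```

With a probability of zero every trainer samples with its own controller, which corresponds to disabling reference sampling.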

Fig. 3: Flow chart of the ensemble trainer, which consists of three trainers and is incorporated with memory sharing, reference sampling, and weight transfer. The definitions of the parameters are given in the text.
1:  Initialization: initialize the three trainer agents and the corresponding training process environments, along with the target controllers. Run the initialization process for each trainer. Initialize the best player to be NoDyna trainer and set the initial probability of using the best player to sample.
2:  Initialize the count of total samples generated from the real environment. Set the maximum number of real samples allowed.
3:  Training Process:
4:  while the real sample count is below the allowed maximum do
5:     for trainer 0, 1, 2 do
6:        Generate action from the trainer agent.
7:        One step in TPE:
8:        Execute memory sharing procedure.
9:        Train the target controller if there is enough data in its memory buffer.
10:        Sample data points from the real environment, using the best player with the reference sampling probability, and append the data to the real data memory.
11:        Sample data from the cyber environment according to the trainer action, and append the data to the cyber data memory.
12:        Share the real data memory across all trainers.
13:        Train the dynamic model of the current trainer.
14:        Update the real sample count.
15:        Collect the state, action and raw reward data of TPE.
16:     end for
17:     Compute the reward for each trainer from the raw reward data and calculate the accumulative reward for each trainer.
18:     Store TPE data of all three trainers into the DQN memory to train the intelligent trainer.
19:     Update the trainer agents.
20:     Execute Algorithm 4 to do performance skewness analysis and weight transfer, and update the reference sampling probability accordingly.
21:  end while
Algorithm 3 Ensemble Trainer Algorithm
1:  if the number of steps since the last weight transfer exceeds the threshold then
2:     Compute the accumulative reward of each trainer.
3:     Update the best trainer index as the trainer with the maximum accumulative reward.
4:     Compute the skewness ratio for the best player.
5:     Update best player reference probability according to (6).
6:     if DQN trainer is not the best trainer then
7:        Do weight transfer from the best trainer to DQN trainer.
8:     end if
9:     Reset the step count since the last weight transfer.
10:  end if
Algorithm 4 Performance Skewness Analysis Procedure

IV Experiments and Discussions

In this section, we evaluate the proposed intelligent trainer and ensemble trainer for five different tasks (or cases) of OpenAI gym: Pendulum (V0), Mountain Car (Continuous V0), Reacher (V1), Half Cheetah ([26]), and Swimmer (V1).

IV-A Experiment Configuration

For the five test cases, different target controllers with promising published results are used: DDPG for Pendulum and Mountain Car; TRPO for Reacher, Half Cheetah, and Swimmer. The well-tuned parameters used in open-sourced codes [27][28] are used for the hyper-parameter settings of the target controller (including those defined in Section III). Simple neural networks following the guideline provided in [25] are used for the cyber models. As our experiments have shown, it is very useful to normalize both input and output for the dynamic model. In this paper, we use the normalization method provided by [27], in which the mean and standard deviation of the data are updated during the training process. For the two hyperparameters used in the reset procedure in Algorithm 1, we choose values that give maximum and minimum trial numbers of 50 and 5, respectively.
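To make the running normalization of the dynamic model's inputs and outputs concrete, the following is a minimal sketch of a normalizer whose mean and standard deviation are updated as data arrive. It is not the implementation from [27]; the class name `RunningNormalizer`, the scalar interface, and the use of Welford's online update are assumptions chosen for illustration. In practice one such normalizer would be kept per state and action dimension.

```python
import math


class RunningNormalizer:
    """Track a running mean and standard deviation with Welford's update."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero for constant data

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Fall back to 1.0 until at least two samples have been seen.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x: float) -> float:
        return (x - self.mean) / (self.std + self.eps)
```

Each new real sample would first call `update` and the dynamic model would then be trained on `normalize`d values.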

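| | NoCyber | Fixed | Random | DQN | DQN-5 actions | DQN-larger memory | REINFORCE | DQN-TPE V1 | DQN-TPE V2 |
|---|---|---|---|---|---|---|---|---|---|
| Group | Baseline | Baseline | Baseline | Intelligent | Intelligent | Intelligent | Intelligent | Intelligent | Intelligent |
| Trainer type | None | None | None | DQN | DQN | DQN | REINFORCE | DQN | DQN |
| Action | (1, 0, 0) | (0.6, 0.6, 0.6) | | | | | | | |
| Data source | Real | Real & Cyber | Real & Cyber | Real & Cyber | Real & Cyber | Real & Cyber | Real & Cyber | Real & Cyber | Real & Cyber |
| Memory size | - | - | - | 32 | 32 | 2000 | - | 32 | 32 |
| TPE state | - | - | - | Constant | Constant | Constant | Constant | Last sampling reward | Real sample count |

TABLE I: Configurations of different algorithms.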
| | Pendulum | Mountain Car | Reacher | Half Cheetah | Swimmer |
|---|---|---|---|---|---|
| TPE Steps | 1000 | 30000 | 1000 | 400 | 200 |

TABLE II: Number of total TPE steps for different tasks.

IV-B Comparison of Single-Head Intelligent Trainer with Baseline Algorithms

Multiple variants of the single-head intelligent trainer are compared with baseline algorithms. There are three baseline algorithms and four intelligent trainers. Their designs are summarized in Table I. The three baseline algorithms are:

  • The NoCyber trainer is a standard DRL training process without using cyber data.

  • The Fixed trainer follows the standard model-based RL, with all actions set to 0.6 throughout the training process.

  • The Random trainer outputs action 0.2 or 1.0 with equal probability. The same action values are used by the DQN trainer. These values are picked such that an extensive amount of cyber data can be used in the training; for example, when the action is set to 0.2, the amount of cyber data sampled is five times the amount of real data sampled. The value 0.2 is chosen without any tuning, i.e., it is not tuned to make the DQN trainer work better. Our focus is not to find the best settings of these parameters (as they will vary in practice), but to determine whether the proposed trainer can select the better action from the predefined action value set.

We notice that, for some tasks, the total number of steps of the TPE is only 200, as shown in Table II. To simplify the learning process, we discretize each dimension of the trainer action.

The four intelligent trainers are:

  • DQN trainer. Each dimension of the trainer action takes one of two values, 0.2 or 1.0, as in the Random trainer. The DQN controller is trained with a memory buffer of size 32. At each time step, four randomly selected batches of batch size eight are used to update the controller. For exploration, the epsilon-greedy method is used, with the first 10% of the trainer steps used for epsilon-greedy exploration and the final epsilon set to 0.1. Note that the setting 0.6 used in the Fixed trainer is the expected mean of the actions from the intelligent trainer if the trainer outputs uniformly random actions.

  • DQN-5 actions. To test the effect of more action values in the action discretization, we introduce a second trainer in which each action dimension is selected from five candidate values.

  • DQN-larger memory. To test the impact of a larger trainer memory, we introduce a third intelligent trainer with a memory size of 2000. In this case, more trainer samples are stored and relatively older historical data are used in training the DQN controller.

  • REINFORCE. The fourth intelligent trainer is the same as the DQN trainer except that the DQN controller is replaced by a REINFORCE controller. Because the REINFORCE algorithm requires data from multiple episodes to train, we treat five TPE steps (a manually tuned value) as one episode.
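The DQN trainer settings listed above (two values per action dimension, a memory of size 32, four batches of eight per update, and epsilon-greedy exploration) can be sketched as follows. This is an illustrative sketch only: treating the three action dimensions as one joint discrete action, and all names such as `JOINT_ACTIONS`, `make_memory`, `sample_batches`, and `epsilon_greedy`, are assumptions rather than details confirmed by the paper.

```python
import itertools
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

ACTION_VALUES = (0.2, 1.0)
# Assumption: the three trainer action dimensions form one joint discrete action.
JOINT_ACTIONS = list(itertools.product(ACTION_VALUES, repeat=3))  # 2**3 = 8 choices

Transition = Tuple[float, int, float, float]  # (state, action index, reward, next state)


def make_memory(size: int = 32) -> Deque[Transition]:
    """Bounded replay memory: the oldest transitions are dropped first."""
    return deque(maxlen=size)


def sample_batches(memory: Deque[Transition],
                   rng: random.Random,
                   num_batches: int = 4,
                   batch_size: int = 8) -> List[List[Transition]]:
    """Draw several random mini-batches for one controller update."""
    items = list(memory)
    k = min(batch_size, len(items))
    return [rng.sample(items, k) for _ in range(num_batches)]


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

The DQN-larger memory variant would correspond to `make_memory(2000)` in this sketch.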

The configurations for these algorithms are summarized in Table I.

Fig. 4: Accumulative rewards on (a) Pendulum V0, (b) Mountain Car Continuous V0, (c) Reacher V1, (d) Half Cheetah, and (e) Swimmer V1. The curves show the average accumulative reward while the shaded region shows the standard deviation of the reward in ten independent runs. (f) Mean action taken by the DQN trainer on the tasks of Mountain Car, Reacher, and Swimmer.

The test results of three baseline trainers and four intelligent trainers are shown in Fig. 4. We obtain the test results by periodically evaluating the target controller in an isolated test environment. This ensures that data collection from the test environment will not interfere with the training process. In other words, none of the data collected from the test environment is used in the training. We observe that:

  • The tasks of Pendulum, Mountain Car, and Reacher can benefit from cyber data used in training. For tasks of Half Cheetah and Swimmer, NoCyber trainer performs significantly better than trainers using cyber data. This indicates that using the cyber data may not be always beneficial. Thus, the use of cyber model should be considered carefully.

  • In most tasks, the intelligent trainer performs better than the Fixed trainer. For example, DQN-5 actions performs better than Fixed trainer for the tasks of Mountain Car, Reacher, and Half Cheetah, and performs similarly for the tasks of Pendulum and Swimmer. This indicates the viability of the intelligent trainer.

  • For the tasks of Pendulum and Mountain Car, the Random trainer performs the best. This can be attributed to the fact that adding more noise encourages exploration. For example, to achieve better performance, Mountain Car requires more exploration to avoid local optima that could lead the target agent in unfavorable search directions. We also observe that the performance of DQN-5 actions is more stable than that of DQN, due to the larger action space, which improves the training diversity. We argue that even though the DQN trainer is no better than the Random trainer in these tasks, the DQN trainer is still learning something. The reason is that we are trying to learn a fixed good action through the DQN trainer, which means that the DQN trainer cannot provide the randomness that proves to be beneficial in these tasks. We can also observe that for the Half Cheetah task, the DQN trainer is much better than the Random trainer. This suggests that the DQN trainer can indeed learn in an online manner.

  • We further examine the effect of using cyber data in the cases where it does not appear to help. For Half Cheetah, we examine the results of multiple independent runs and find that cyber data causes instability in performance, resulting in higher variance and a low mean reward over ten independent tests. For Swimmer, the poor performance with cyber data is due to a special feature of its state definition: the first two dimensions are linearly correlated. The trained cyber model in this case is unable to correctly identify this feature and predict the state transition. Our results show that even when only 10% cyber data is incorporated in training, severe performance degradation can occur. When cyber data are used, the target controller can be trapped in a local optimum that is difficult to recover from. We resolve this issue by using the ensemble trainer.

| Variants | Pendulum | Mountain Car | Reacher | Half Cheetah | Swimmer |
|---|---|---|---|---|---|
| DQN | -43323 | 1434.59 | -7846 | 696492 | 4918 |
| DQN-5 actions | -43204 | 1493.88 | -7724 | 985847 | 2473 |
| DQN-larger memory | -41329 | 1615.98 | -7831 | 1354488 | 2142 |
| DQN-TPE V1 | -41869 | 1849.96 | -7456 | 868597 | 1522 |
| DQN-TPE V2 | -46533 | 1826.19 | -7478 | 1172288 | 2233 |

TABLE III: Accumulative rewards of different trainer variants when using different trainer and TPE designs.

To analyze the behavior of the trainer, we show in Fig. 4(f) the actions taken by the DQN trainer for the tasks of Mountain Car, Reacher, and Swimmer during the training process. We observe that for Mountain Car, the mean value of the action fluctuates around 0.5. This agrees with our observation that for Mountain Car, the Random baseline algorithm performs the best. For Reacher and Swimmer, the trainer quickly learns to use more of the real data, with the mean value of the action eventually exceeding 0.6. This again indicates the viability of the trainer. Note that for Swimmer, even though the mean value of the action is larger than 0.6, the performance of the target controller is still very poor (Fig. 4) due to the sensitivity of the training process to cyber data. This again verifies the necessity of an ensemble trainer that can quickly recover from degraded performance during training.

Fig. 5: Accumulative rewards of ensemble trainer with the NoCyber, Random and DQN trainer on (a) Pendulum V0, (b) Mountain Car Continuous V0, (c) Reacher V1, (d) Half Cheetah, and (e) Swimmer V1. (f) shows the mean action taken by the DQN trainer in the ensemble trainer for Mountain Car, Reacher, and Swimmer.
Fig. 6: Accumulative reward of different individual trainers of the ensemble trainer on (a) Mountain Car Continuous V0, (b) Reacher V1, and (c) Swimmer V1.

IV-C Sensitivity Analysis on Various Trainer and TPE Designs

We have evaluated the performance of several variants of the DQN trainer and of different TPE designs. In addition to the previously mentioned DQN, DQN-5 actions, and DQN-larger memory, we have tested DQN trainers with two different TPE state designs, as also listed in Table I. DQN-TPE V1 adopts the last average sampling reward of the target controller as the state of TPE; DQN-TPE V2 adopts the ratio (a value in the range of [0,1]) of the real samples used to the predefined maximum number of real samples as the state of TPE. Table III presents the accumulative rewards for five test cases: Pendulum, Mountain Car, Reacher, Half Cheetah, and Swimmer.

  • For Mountain Car, Reacher and Half Cheetah, DQN-5 actions, DQN-larger memory, DQN-TPE V1 and DQN-TPE V2 consistently outperform DQN. This indicates that for some applications, an intelligent trainer that uses more action selections, a larger memory, or a more informative state representation can achieve better performance. The results hint that a smart design of the trainer or TPE can compensate for the lack of training data.

  • For Swimmer, we observe that none of the tested variants of DQN or TPE can achieve satisfactory performance. This is because even a very small amount of cyber data can cause the target controller to be trapped in a local minimum from which it cannot recover.
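The TPE state designs compared in this subsection can be sketched as simple feature functions. The function names, the one-element list representation, and the constant value 0.0 are assumptions for illustration; only the quantities themselves (a constant, the last average sampling reward, and the real-sample ratio in [0,1]) come from the text and Table I.

```python
from typing import List


def tpe_state_constant() -> List[float]:
    """Constant state used by the basic DQN trainer (the value is an assumption)."""
    return [0.0]


def tpe_state_v1(last_avg_sampling_reward: float) -> List[float]:
    """TPE V1: the last average sampling reward of the target controller."""
    return [last_avg_sampling_reward]


def tpe_state_v2(real_samples_used: int, max_real_samples: int) -> List[float]:
    """TPE V2: fraction of the real-sample budget consumed, clipped to [0, 1]."""
    if max_real_samples <= 0:
        raise ValueError("max_real_samples must be positive")
    ratio = real_samples_used / max_real_samples
    return [min(max(ratio, 0.0), 1.0)]
```

The V2 state gives the trainer a notion of training progress, which the constant state lacks.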

Fig. 7: Accumulative rewards of ensemble trainer with two variants: without memory sharing and without reference sampling, for Swimmer.

IV-D Solving Action Correlation with Multi-head Ensemble Trainer

As discussed in Section III-C, the purpose of constructing an ensemble trainer is to overcome the action correlation problem in the single-head trainer. In this subsection, we provide evidence of the virtue of the ensemble design by comparing its performance with single-head trainers. The ensemble trainer comprises a DQN trainer (with TPE state design V2), a Random trainer, and a NoCyber trainer. Following the design in Section III-C, these three trainers jointly sample and train three independent target controllers. The target controller of the best trainer is used in the test. The step threshold in weight transfer should be set to a TPE step count at which a just-sufficient number of trajectories (at least one episode) has been sampled. For this reason, we use the same threshold for all tasks except Mountain Car. For the Mountain Car task, each TPE step generates only one real sample, which is far from enough to evaluate the performance, so we set a separate threshold for this task. The upper and lower bounds are estimated in the experiments; we found that a single pair of values works well for all cases.

The results, as presented in Fig. 5, show that the ensemble trainer achieves good overall performance even in the cases where the single-head trainer fails. For the tasks of Pendulum, Mountain Car and Reacher, the ensemble trainer performs almost as well as the DQN or Random trainer. For the tasks of Swimmer and Half Cheetah, the ensemble trainer performs as well as the NoCyber trainer, even though the learning process makes it learn more slowly in the Half Cheetah task. With the proposed ensemble trainer, we are more likely to achieve sampling cost savings in practice, as it is hard to predict in advance which algorithm variant will deliver the best performance. We compute the expected saving of the ensemble trainer in Table IV, assuming that the baseline sampling cost is the average cost of the three single-head trainers: NoCyber, DQN trainer and Random trainer. Note that for the tasks Mountain Car, Half Cheetah and Swimmer, the single-head trainer may fail to achieve the predefined performance target; in this case we set the cost to the maximum number of samples we tried in the experiment. This means that the expected saving is actually larger than the number shown in Table IV.

| | Pendulum | Mountain Car | Reacher | Half Cheetah | Swimmer |
|---|---|---|---|---|---|
| Target reward | -500 | 75 | -10 | 2500 | 100 |
| Samples saving | 26% | 36% | 2% | 38% | 56% |

TABLE IV: Sampling saving to achieve certain predefined performance of the ensemble trainer. The baseline cost is the expected cost of the three algorithms NoCyber, Random trainer and DQN trainer.
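The expected saving reported in Table IV can be computed along the following lines. This is a sketch under stated assumptions: the function name `expected_saving` is hypothetical, a failed single-head run is represented by `None` and charged the maximum number of samples tried, and the numbers in the usage note are made up for illustration and are not taken from the experiments.

```python
from typing import Optional, Sequence


def expected_saving(ensemble_cost: float,
                    single_head_costs: Sequence[Optional[float]],
                    max_samples: float) -> float:
    """Fraction of real samples saved relative to the average single-head cost.

    A cost of None means that trainer never reached the target reward; it is
    then charged `max_samples`, so the reported saving is a lower bound.
    """
    capped = [max_samples if cost is None else cost for cost in single_head_costs]
    baseline = sum(capped) / len(capped)
    return 1.0 - ensemble_cost / baseline
```

For example, with hypothetical single-head costs of 1000 samples, a failed run capped at 1200, and 800 samples, the baseline is 1000; an ensemble that reaches the target after 600 samples then saves 40%.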

In Fig. 5 (f), we observe that the action taken by the DQN trainer varies significantly from the single-head case. For the Swimmer case, the action gradually converges to one, which allows better performance. For the Reacher case, we observe a phase transition in the middle, during which the trainer changes from preferring less cyber data to preferring more cyber data. This indicates that when and how much cyber data should be utilized may be related to the training progress. For the Mountain Car task, we observe that the action quickly converges to favor more cyber data, which is helpful in this task. This indicates that the proposed ensemble trainer can assess the control actions better than the single-head trainer.

In Fig. 6, we show the interactions of trainers in the ensemble by presenting individual results of the constituent trainers: DQN in ensemble, Random in ensemble, and NoCyber in ensemble, for the tasks of Mountain Car, Reacher, and Swimmer (in the remainder of this paragraph, we omit the term “in ensemble” for the sake of brevity). In all three cases, we can observe that within the ensemble, the trainer that was originally good as a single-head trainer still performs very well. For example, for the Mountain Car task, the Random trainer performs almost as well as the single-head Random trainer. For the Swimmer task, the DQN trainer can now perform as well as the NoCyber trainer, which indicates that the weight transfer process is working as expected.

To further examine the effect of memory sharing and reference sampling, in Fig. 7 we compare the performance of three different ensemble designs for the task of Swimmer. All of them comprise the same three trainers (DQN, Random, and NoCyber) but differ in the incorporated schemes: ensemble trainer (with memory sharing and reference sampling); ensemble trainer without memory sharing (with reference sampling); and ensemble trainer without reference sampling (with memory sharing). All these variants use weight transfer. The results show that, without memory sharing, the ensemble performance degrades. This is because each of the three intelligent trainers uses only one-third of the original data samples (which is why the curve stops at 1/3 of the others on the x-axis). Without reference sampling, the ensemble performs very similarly to the DQN trainer (Fig. 4). This is because without reference sampling, most of the real data samples come from the underperforming target controllers of the DQN and Random trainers. The data from underperforming target controllers deteriorate the learning process of the NoCyber trainer. The results indicate that memory sharing and reference sampling are essential for the ensemble trainer.

V Conclusion

In this paper we propose an intelligent trainer for general model-based reinforcement learning algorithms. The proposed approach treats the training process of model-based RL as the target system to optimize, and uses a trainer that monitors the sampling and training process. Furthermore, an ensemble trainer that can enhance the performance of the trainer without incurring additional sampling cost is used to solve the problem of limited and correlated training data for the trainer. With the proposed trainer framework, model-based RL can be used in practical applications to reduce the sampling cost while achieving close-to-optimal performance.

In future work, the proposed trainer framework will be further improved by adding more control actions to reduce the cost of algorithm adjustment. A more advanced design is to use one trainer to train different DRL controllers for multiple tasks, so that it can learn the common knowledge shared by different DRL algorithms for these tasks.