Asynchronous Episodic Deep Deterministic Policy Gradient: Towards Continuous Control in Computationally Complex Environments

03/03/2019 ∙ by Zhizheng Zhang, et al. ∙ USTC

Deep Deterministic Policy Gradient (DDPG) has proven to be a successful reinforcement learning (RL) algorithm for continuous control tasks. However, DDPG still suffers from data insufficiency and training inefficiency, especially in computationally complex environments. In this paper, we propose Asynchronous Episodic DDPG (AE-DDPG), an extension of DDPG that achieves more effective learning with less training time. First, we design a modified scheme for asynchronous data collection. Generally, for asynchronous RL algorithms, sample efficiency and/or training stability diminish as the degree of parallelism increases. We consider this problem from the perspectives of both data generation and data utilization. In detail, we redesign experience replay by introducing the idea of episodic control so that the agent can latch onto good trajectories rapidly. In addition, we inject a new type of noise into the action space to enrich exploration behaviors. Experiments demonstrate that AE-DDPG achieves higher rewards and requires less training time than most popular RL algorithms on the Learning to Run task, which has a computationally complex environment. Beyond control tasks in computationally complex environments, AE-DDPG also achieves higher rewards and a 2- to 4-fold improvement in sample efficiency on average compared to other variants of DDPG in MuJoCo environments. Furthermore, we verify the effectiveness of each proposed component through extensive ablation studies.




I Introduction

Deep neural networks have pushed the envelope of reinforcement learning in a wide variety of domains, such as Atari games [1], continuous systems control [2], and musculoskeletal model control for medical applications [3]. Deep reinforcement learning (Deep-RL) methods perform trial-and-error training through frequent interactions with their environments. Despite the impressive results, data insufficiency remains a serious problem for Deep-RL in computationally complex environments, where it leads to a huge, even intolerable, time cost for training.

Data throughput and sample efficiency largely determine the performance of Deep-RL algorithms. Numerous distributed methods [4, 5, 6, 7, 8, 9] have been proposed to address this problem, and their distributed implementations fall into two categories: communicating gradients and communicating experiences. The popular distributed algorithm A3C [5] was proposed to improve data throughput by executing multiple agents in parallel and communicating gradients with respect to the policy parameters to a central parameter server. However, distributed gradient calculation sacrifices training stability, especially when the degree of parallelism increases or when interaction becomes highly delayed in computationally complex environments. A feasible way to avoid training instability while increasing data throughput is to parallelize experience collection. The scalable distributed Deep-RL agent IMPALA [6] adopts asynchronous experience collection for training a single agent on many tasks simultaneously. In IMPALA, multiple actors interact with environments and communicate their trajectories to the learner responsible for policy updating. Although IMPALA has made great progress in solving multi-task problems, some problems remain when performing parallel data collection on a single task in computationally complex environments. Similar to IMPALA, the Ape-X architecture in [7] and D4PG in [8] extend the vanilla deep-learning-based frameworks to the distributed setting by involving a learner network and multiple actor networks.

We argue that data collection and policy learning in an asynchronous framework influence each other. Asynchronous data collection in off-policy Deep-RL methods facilitates exploring more potential strategies but increases the difficulty of distilling knowledge from the generated trajectories, which is also discussed in the early RL work [10]. An intuitive reason is that asynchronous frameworks are more likely to suffer from a mismatch between the speed of data collection and the speed of policy learning, which leads to a decline in the proportion of valuable training samples and to poor sample diversity. In this work, in addition to the asynchronous system, we focus on addressing two major challenges:

  • Sample imbalance. Asynchronous frameworks significantly increase data throughput, while a small learning rate is still maintained to ensure training stability and avoid convergence to locally optimal solutions. In this situation, parallel experience collection aggravates sample imbalance, where low-reward samples outnumber high-reward samples.

  • Sample diversity. When asynchronously collecting experiences and training agents with off-policy methods on a single task, many similar trajectories are put into the same memory buffer for experience replay. Crucially, poor sample diversity harms training efficiency.

In this paper, with respect to continuous control in computationally complex environments, we propose Asynchronous Episodic DDPG (AE-DDPG) to address the aforementioned challenges. Unlike A3C, which communicates gradients, the agent in AE-DDPG interacts with multiple stochastic environments simultaneously, which achieves very high data throughput. To tackle the problem of sample imbalance, we employ the idea of episodic control (EM) [11, 12] in redesigning the experience replay of DDPG, which enables the agent to latch onto high-reward policies rapidly. To the best of our knowledge, AE-DDPG is the first method that introduces episodic memory into Deep-RL methods for continuous problems. To improve sample diversity, we inject a power law signal with a $1/f^{2}$ spectrum as noise in the action space to enrich the agents’ exploration behaviors.

We evaluate our proposed method on a realistic physiologically-based model control task, namely Learning to Run [3]. Experimental results show that AE-DDPG outperforms not only vanilla DDPG but also other popular RL methods in training efficiency and in the quality of the final policies. We won first place in the first round of the NIPS 2017 Learning to Run Challenge using this model. We also conduct experiments on other continuous tasks in MuJoCo environments to evaluate its generalization to other domains. Finally, we verify the effectiveness of the components of AE-DDPG in our ablation study.

II Related Works

II-A Asynchronous Methods for Deep-RL

The early work [13] studied the convergence properties of Q-learning in asynchronous settings. With the development of Deep-RL, the popular asynchronous algorithm A3C [5] was proposed to reduce training time. As described before, the key idea of A3C is to execute multiple workers and communicate gradients. However, training stability and sample efficiency are negatively affected as the number of workers increases. The more recent asynchronous architectures IMPALA [6], Ape-X [7] and D4PG [8] all decouple experience collection from policy updating by involving a learner and multiple actors, in which the actors need to copy the parameters of the learner for n steps of interaction. Although the policy lag caused by copying parameters from the learner to the workers is mitigated by the V-trace algorithm in [6], the continuity of temporally correlated exploration in action space [2] is harmed, especially when training agents on a continuous control task with highly delayed interactions. Unlike communicating gradients in A3C and communicating experiences in IMPALA, Ape-X, D4PG, etc., we keep a single actor-critic pair and develop an asynchronous interactive mechanism to improve data throughput in continuous and computationally complex environments.

II-B Experience Replay

Experience replay [14] is a technique that allows agents to reuse past experience. Prioritized experience replay [15] weights the replay probabilities of experiences according to their measured temporal difference errors, but its additional run-time reduces training efficiency as the number of trajectories increases. Hindsight experience replay [16] allows sample-efficient learning from sparse and binary reward signals. In this paper, we introduce the idea of episodic control to rapidly assimilate knowledge from high-reward experiences and to improve the diversity of the trajectories actually sampled for experience replay.

II-C Episodic Control

Episodic control is inspired by the function of the hippocampus in the brain [17]. The key idea of previous works on episodic control [11, 12] is to use highly rewarded experiences to help recreate past successes in near-deterministic environments. Besides, episodic memory deep Q-networks [18] leverage episodic memory to regularize the learning target of deep Q-networks rather than for direct control. Note that episodic control in previous works generally requires table-based look-up. Therefore, it is mostly used to solve discrete problems in near-deterministic environments. In contrast, we use episodic memory to encourage more effective experience replay for continuous control in complex stochastic environments.

II-D Noise for Exploration

Noise for exploration in deep reinforcement learning mainly falls into two categories: action space noise [2] and parameter space noise [19]. For action space noise, uncorrelated Gaussian noise and noise based on the Ornstein-Uhlenbeck (OU) process are most commonly used to treat the problem of exploration. Parameter space noise [19] has been proposed as an alternative that acts on the agents’ parameters directly. In this paper, we introduce a new action space noise to alleviate the harmful effects of asynchronous experience collection on sample diversity.

III Background

RL commonly models the trial-and-error learning procedure as a Markov Decision Process (MDP). At time $t$, the agent observes the current state $s_t$ of its environment and chooses an action $a_t$ according to its policy $\pi$. Then the environment returns a scalar feedback signal $r_t$ to the agent and transitions to the next state $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$. The goal of finding an optimal policy can be formulated as the mathematical problem of maximizing the expected cumulative discounted return $R = \sum_{t} \gamma^{t} r_t$, where $\gamma \in [0, 1]$ is the discount factor.
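As a small worked illustration of the objective above, the discounted return of one finished episode can be computed as follows. This is only a sketch: the function name and the default discount factor of 0.99 are illustrative assumptions, not values taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards, so that each step
    # needs only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, three rewards of 1 with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.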

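DDPG [2] is an off-policy actor-critic algorithm [20] proposed for continuous control with Deep-RL, which can be viewed as a successful modification of the DPG algorithm [21]. DDPG consists of a neural-network-based policy function $\mu(s \mid \theta^{\mu})$ and a neural-network-based value function $Q(s, a \mid \theta^{Q})$, which correspond to the actor and the critic. We parameterize the actor and the critic by $\theta^{\mu}$ and $\theta^{Q}$ respectively. Similarly, the target networks for the actor and the critic, parameterized by $\theta^{\mu'}$ and $\theta^{Q'}$ respectively, are introduced to alleviate the training instability in DDPG. We update the actor by applying the policy gradient:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t}\left[ \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right] \tag{1}$$

We update the critic by minimizing the loss:

$$L(\theta^{Q}) = \mathbb{E}\left[ \left( Q(s_t, a_t \mid \theta^{Q}) - y_t \right)^{2} \right] \tag{2}$$

$$y_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \tag{3}$$

The target networks are updated by having them slowly track the learned networks with $\tau \ll 1$:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \tag{4}$$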
Previous works have modified the vanilla DDPG from different aspects. For example, MA-BDDPG with multi-actor [22] and Multi-DDPG with multi-critic [23] are representative variants that make use of bootstrapped models to improve the sample efficiency and training stability. Another notable modified version is the expansion introduced in robotics to solve mapless navigation problems [24]. This variant separates the sample collecting process to another thread from the training thread in a direct way. However, it hasn’t addressed the crucial issues we describe in the introduction.
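To make the DDPG updates recalled in this section more concrete, the sketch below writes the critic target, the critic loss and the soft target update in plain Python on scalar values and flat parameter lists. It is a simplified illustration under our own assumptions: the function names, the default values of `gamma` and `tau`, and the handling of terminal transitions are not taken from the paper, and the gradient step itself is left out.

```python
def td_target(reward, next_q, gamma=0.99, terminal=False):
    """One-step critic target: reward plus the discounted target-critic value.

    `next_q` stands for the target critic evaluated at the next state and the
    target actor's action. Dropping the bootstrap term on terminal
    transitions is an assumption of this sketch.
    """
    return reward + (0.0 if terminal else gamma * next_q)


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and their targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def soft_update(target_params, online_params, tau=0.001):
    """Move every target parameter a fraction tau toward its online value."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]
```

With a small `tau`, the target parameters change slowly, which is what stabilizes the bootstrapped critic target.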

IV Methodology

In this section, we first introduce the overall architecture of AE-DDPG, which is illustrated in Figure 1. We then elaborate on the algorithm we design for experience replay in this asynchronous architecture. Finally, we introduce a new action space noise for exploration.

IV-A AE-DDPG Architecture

The agent in AE-DDPG performs asynchronous experience collection and synchronous policy learning during training. We first introduce the asynchronous interaction that improves data throughput in AE-DDPG. Unlike the bootstrapped models, AE-DDPG has a single actor-critic pair. Asynchronous sample collection helps to collect more data for policy learning, especially in computationally complex environments, where interaction is commonly time-consuming. We therefore enable the actor in AE-DDPG to interact with multiple stochastic environments simultaneously. To achieve this, we run multiple environment simulators in parallel threads. These environment simulators are initialized randomly on the same task. In this way, the actor in AE-DDPG interacts with these environment simulators in parallel for asynchronous sample collection. A notable difference from the parallel, accelerated RL framework in [9] is that we do not need to gather all individual observations into a batch for inference at each step. Hence, the random fluctuations and straggler effect described in [9] are alleviated effectively in our framework. However, unlike [9], we do not consider better utilization of multiple CPUs and GPUs. Instead, we aim to improve sample efficiency and policy exploration for the distributed RL framework in this paper.

We then introduce the memory buffers for experience replay in AE-DDPG. Our framework has multiple experience cache buffers and two experience memory buffers. As depicted in Figure 1, the trajectories generated by each interaction thread are first cached in its own cache buffer. These buffers are thread-independent and are not used for experience replay directly. For the actual experience replay, we design two different memory buffers, “Memory” and “HMemory”. Trajectories in the cache buffers are put into “Memory” and/or “HMemory” according to the storing rule. Correspondingly, the trajectories stored in “Memory” and/or “HMemory” are sampled for policy updating according to the sampling rule. Both the storing rule and the sampling rule are described in detail in the following section.
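The collection side of this architecture can be sketched in Python as one shared actor, several environment threads, and one cache buffer per thread. Everything named here is a hypothetical stand-in: `ToyEnv` is a toy replacement for a slow simulator (not the L2R or MuJoCo API), `actor` is a placeholder policy, and the thread and episode counts are arbitrary.

```python
import random
import threading


class ToyEnv:
    """Hypothetical stand-in for a slow simulator with a fixed horizon."""

    def __init__(self, seed, horizon=5):
        self.rng = random.Random(seed)  # random initialization per environment
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.random()

    def step(self, action):
        self.t += 1
        next_state = self.rng.random()
        reward = -abs(action - next_state)
        return next_state, reward, self.t >= self.horizon


def actor(state):
    """Placeholder deterministic policy shared by all threads."""
    return 0.5 * state


def collect(env, cache, episodes):
    """Run episodes in one environment and cache each finished trajectory."""
    for _ in range(episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            action = actor(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state))
            state = next_state
        cache.append(trajectory)


def run_async(num_threads=4, episodes=3):
    """Start one interaction thread per environment, each with its own cache."""
    caches = [[] for _ in range(num_threads)]
    threads = [
        threading.Thread(target=collect, args=(ToyEnv(seed=i), caches[i], episodes))
        for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return caches
```

Because each thread writes only to its own cache buffer, no locking is needed at this stage. In CPython, threads give a real speed-up only when the simulator releases the global interpreter lock or runs in a separate process, so this sketch shows the data flow rather than the achievable throughput.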

Fig. 1: Architecture of asynchronous episodic deep deterministic policy gradient (AE-DDPG).

IV-B Bio-inspired Episodic Experience Replay

Asynchronous sample collection helps to improve data throughput, especially when adopting RL algorithms in computationally complex environments, but it also leads to diminishing returns in sample efficiency as the degree of parallelism increases. Numerous similar interaction transitions are pushed into the memory when experiences are shared in a distributed RL framework. Therefore, we should balance the speed of data generation and the speed of data utilization to avoid worsening sample imbalance and to improve sample diversity. We aim to achieve this by proposing a novel experience replay.

Our design is inspired by a biological study on reward-motivated learning [25], in which the researchers use event-related fMRI to examine anticipatory mechanisms of reward-motivated memory formation. The results of a 24-hour post-scan test suggest that subjects are significantly more likely to remember scenes that follow cues for high-value rather than low-value rewards. Additionally, a well-known result in psychology, the so-called Peak-End Rule [26], indicates that people are particularly sensitive to peak returns and/or end returns.

Aiming to design a problem-solving RL algorithm with human-like efficiency and adaptability, we propose to decide which interaction trajectories deserve more attention according to their cumulative reward at the end of an episode, called the “episodic reward” in the following description. To achieve this, we employ the idea of episodic control to improve experience replay. In detail, as described in the last section, we store the interaction trajectories in the thread-independent cache buffers and use two memory buffers for the actual experience replay. The two memory buffers have different functions. The module named “Memory” is similar to the memory buffer in vanilla DDPG, while “HMemory” is used for memorizing highly rewarded trajectories. We use multiple threads to generate trajectories asynchronously and a single thread to perform back-propagation on mini-batches of trajectories. The experience storing rule of our bio-inspired episodic experience replay is described in Algorithm 1, and the rule for sampling mini-batches of trajectories for parameter updating is given in Algorithm 2.

Algorithm 1 Storing Rule in Episodic Experience Replay

$T$: a set of transitions in one episode.
$(s_t, a_t, r_t, s_{t+1})$: one-step transition in trajectories.
$\mathcal{N}_t$: action space noise at time $t$.
$C_i$: cache buffer for the $i$-th interaction thread.
$M$: “Memory” depicted in Figure 1.
$M_H$: “HMemory” depicted in Figure 1.
$R$: episodic reward (cumulative reward for one episode).
$R_{max}$: highest episodic reward in history.

for each episode do
    for each time step $t$ do
        Receive observation from environments.
        Get the state vector $s_t$ from the observation.
        Execute action $a_t = \mu(s_t) + \mathcal{N}_t$.
        Observe reward $r_t$ and next state $s_{t+1}$.
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $C_i$ and copy it to $M$.
    end for
    Pack all transitions into $T$.
    Calculate the episodic reward $R$.
    if $R$ is high relative to $R_{max}$ then
        Copy $T$ to $M_H$.
    end if
    Update $R_{max}$.
end for

Algorithm 2 Sampling Rule in Episodic Experience Replay

$N$: number of transitions in a mini-batch.
$p$: hyper-parameter, probability of sampling from “HMemory”.
(Definitions of other symbols are the same as in Algorithm 1.)

for each sampling do
    for $n = 1, \dots, N$ do
        Generate a random number $u \in [0, 1]$.
        if $u < p$ then
            Sample transition from $M_H$.
        else
            Sample transition from $M$.
        end if
        Group all sampled transitions into a mini-batch.
    end for
end for

Note that both memory buffers are FIFO (First In First Out) buffers with limited memory space. The size of “HMemory” should be set smaller than the size of “Memory”. We observe two effects when training with the proposed experience replay: (1) highly rewarded trajectories are sampled more frequently; (2) relatively low-reward trajectories in “HMemory” are dequeued easily. These are consistent with our intuition that people are sensitive to their best experiences and tend to memorize and learn from them.
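The storing and sampling rules can be sketched together in Python with two bounded FIFO buffers. Several points are assumptions of this sketch and are not stated in the text: the class and method names, the buffer sizes, the default probability `p`, and in particular the admission test, which here copies a trajectory into “HMemory” when its episodic reward is at least the highest seen so far. Whole episodes are also stored at once rather than step by step, which fills “Memory” with the same transitions.

```python
import random
from collections import deque


class EpisodicReplay:
    """Two FIFO buffers: "Memory" for all data, "HMemory" for high-reward data."""

    def __init__(self, memory_size=10000, hmemory_size=1000, p=0.3, seed=None):
        self.memory = deque(maxlen=memory_size)    # oldest items drop out first
        self.hmemory = deque(maxlen=hmemory_size)  # kept smaller than memory
        self.p = p                                 # chance of drawing from HMemory
        self.best = float("-inf")                  # highest episodic reward so far
        self.rng = random.Random(seed)

    def store_episode(self, trajectory):
        """Store a list of (state, action, reward, next_state) transitions."""
        self.memory.extend(trajectory)
        episodic_reward = sum(transition[2] for transition in trajectory)
        # ASSUMPTION: admit the episode when it matches or beats the best
        # episodic reward in history; the exact threshold is our guess.
        if episodic_reward >= self.best:
            self.hmemory.extend(trajectory)
        self.best = max(self.best, episodic_reward)
        return episodic_reward

    def sample(self, batch_size):
        """Draw each transition from HMemory with probability p, else Memory."""
        batch = []
        for _ in range(batch_size):
            use_h = len(self.hmemory) > 0 and self.rng.random() < self.p
            source = self.hmemory if use_h else self.memory
            batch.append(self.rng.choice(source))
        return batch
```

Since “HMemory” is the smaller buffer, older and typically lower-reward trajectories are pushed out of it as soon as better episodes arrive, which matches the second effect noted above.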

IV-C Random Walk Noise for Exploration

Commonly, we explore potential policies in RL by perturbing the model parameters or the output actions of the RL agent. The latter is called “action space noise”. Here we denote the action space noise by $\mathcal{N}_t$ and formulate its usage as below:

$$a_t = \mu(s_t) + \mathcal{N}_t \tag{5}$$

where $s_t$ denotes the current state of the environment at time $t$, and $\mu$ denotes the policy function. We obtain the practical action $a_t$, namely the control signal, by adding the random signal $\mathcal{N}_t$ to the output of the actor network $\mu(s_t)$. The noise is used only for exploration in the training stage, while $\mu(s_t)$ is taken as the control signal directly in the testing stage.

Intuitively, the noise for exploration in continuous control problems should be not only temporally correlated but also instance uncorrelated. We need temporally correlated noise signals for this type of problem because the executed actions are continuous in time, so temporally correlated signals help explore more potential actions corresponding to a better continuous control policy. Instance uncorrelated means that a sequence sampled from the process is uncorrelated with a sequence generated by another sampling of the same process. Thus, instance uncorrelated noise signals help avoid repeated and redundant exploration behaviors.

Here we propose to inject a kind of power law noise into the action space for policy exploration. Power law noise [27] refers to a set of signals that exhibit a $1/f^{\alpha}$ spectrum. Theoretically, the process of power law noise with $\alpha = 2$ is substantially equal to a random walk, which meets the two requirements described above. We therefore adopt power law noise with a $1/f^{2}$ spectrum to address the problem of exploration in continuous control tasks. In the following, we give more detailed explanations and describe how to generate this type of signal.

We can obtain power law noise with a $1/f^{2}$ spectrum simply by first-order filtering of white noise. Mathematically, we consider power law noise with a $1/f^{2}$ spectrum as the realization of a random process $y(t)$ and take white noise as the realization of another random process $x(t)$. Because the $1/f^{2}$ spectrum power law signal can be obtained by filtering white noise, the relation between their spectral densities can be formulated as:

$$S_y(f) = |H(f)|^{2}\, S_x(f) \tag{6}$$

where $S_y(f)$ and $S_x(f)$ represent the power spectral densities of $y(t)$ and $x(t)$ respectively, and $H(f)$ corresponds to the filter we need. Since the power spectral density of white noise $S_x(f)$ is constant, we need to design the filter in Equation (6) as:

$$H(f) = \frac{1}{j 2 \pi f} \tag{7}$$

An equivalent discrete z-transform for Equation (7) is:

$$H(z) = \frac{1}{1 - z^{-1}} \tag{8}$$

Therefore, we can represent the filtered signal as:

$$Y(z) = H(z)\, X(z) = \frac{X(z)}{1 - z^{-1}} \tag{9}$$

According to the inverse z-transform, we obtain the signal in the time domain as:

$$y_t = y_{t-1} + x_t \tag{10}$$

Equations (6) to (10) show that signals with a $1/f^{2}$ spectrum can be obtained by filtering samples drawn from a standard Gaussian generator. Equation (10) represents a one-state autoregressive (AR) filter, which produces the current value of the noise on the basis of the previous value. This is why we call it random walk noise and why this type of noise is temporally correlated and instance uncorrelated.

In terms of exploration, random walk noise improves the efficiency of exploration by capturing the temporal correlation of actions. In addition, it helps improve sample diversity by enriching exploration behaviors, because different instances generated by this process are uncorrelated.
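A minimal Python sketch of the one-state AR filter described in this section is given below: each noise value is the previous value plus a fresh Gaussian sample. The function name, the zero starting value, and the `sigma` scale parameter are assumptions made for illustration and are not taken from the paper.

```python
import random
from itertools import accumulate


def random_walk_noise(length, sigma=1.0, seed=None):
    """Generate one random walk noise sequence of the given length.

    White Gaussian samples are passed through the running sum
    y[t] = y[t-1] + x[t], starting from y[-1] = 0 (an assumption).
    """
    rng = random.Random(seed)
    white = [rng.gauss(0.0, sigma) for _ in range(length)]
    return list(accumulate(white))
```

Consecutive values differ by only one Gaussian step, which gives the temporal correlation, while sequences drawn with independent random generators share no common samples. Because the variance of a random walk grows with the number of steps, a practical implementation would probably rescale the sequence or restart it for every episode; that choice is ours and is not specified in the text.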

V Experiments

V-A Environments

We evaluate our proposed method, AE-DDPG, on a highly-simulated, computationally complex environment and on six continuous tasks in a standard benchmark platform. Furthermore, we conduct a series of experiments to verify the effectiveness of each proposed component within AE-DDPG. We first evaluate AE-DDPG in a musculoskeletal environment with the task of controlling a highly-simulated human model to run like a human, namely Learning to Run [3]. We then compare AE-DDPG with other improved variants of DDPG on six continuous control tasks from OpenAI Gym [28] simulated in MuJoCo [29] to evaluate the generalization of AE-DDPG to other domains. Finally, we run ablation experiments on the Learning to Run environment. All of the simulation environments are illustrated in Figure 2.

Fig. 2: Illustration of evaluation environments and tasks: (a) Musculoskeletal environment: Learning to Run; (b) MuJoCo: Ant-v2; (c) MuJoCo: HalfCheetah-v2; (d) MuJoCo: Hopper-v2; (e) MuJoCo: Humanoid-v2; (f) MuJoCo: Swimmer-v2; (g) MuJoCo: Walker2d-v2.
Fig. 3: Performance of our proposed AE-DDPG and other state-of-the-art RL algorithms on the L2R task. To keep the same degree of parallelism for a convincing comparison, we extend the original version of DA-BDDPG in [22] by involving 16 actors to collect samples. We denote the original version and the expanded version by DA-BDDPG and DA-BDDPG respectively.

V-A1 Learning to Run (L2R) Environment

The simulated environment of the L2R task is implemented in OpenSim [30], which is built on the Simbody physics and biomechanics engine [31]. As shown in Figure 2, this environment provides a realistic physiologically-based human model that can achieve physically and physiologically accurate motion. Potential obstacles include external obstacles such as stumbling blocks and a slippery floor, along with internal obstacles such as materials weakness and motor noise. Besides, we can set different difficulty levels in the L2R environment, which correspond to different numbers of randomly occurring stumbling blocks.

Given an 18-dimensional action vector corresponding to the excitations of simulated muscles, the environment engine computes the physical force functions and returns the status of the musculoskeletal model in the form of a 41-dimensional observation vector. The task of L2R is to control the provided human model to navigate a complex obstacle course as quickly as possible, with a penalty for overusing ligaments taken into account.

V-A2 MuJoCo Environments

We use six continuous robotic control tasks from MuJoCo [29] environments, which run in a fast physical simulator and are shown in Fig. 2. The tasks of Ant-v2, HalfCheetah-v2, Hopper-v2, Humanoid-v2 and Walker2d-v2 are to control a four-legged creature model, a cheetah-like robot model, a two-dimensional one-legged robot, a humanoid robot and a two-dimensional bipedal robot, respectively, to move forward as fast as possible. The task of Swimmer-v2 involves a 3-link swimming robot in a viscous fluid. In this task, we need to make it swim as fast as possible by actuating the two joints of the robotic model.

Fig. 4: The network architecture for L2R task. Each convolution layer is represented by its kernel size (for convolution), layer type and number of channels. LeakyReLU is applied in all layers except for the last layers of actor and critic. We use tanh activation for actor’s last layer and linear activation for critic’s last layer.

V-B Training Settings

The difficulty level of L2R environment is set to be 2 for all of our experiments in this paper, which means that there are three stumbling blocks with random sizes and positions in each episode. To handle the high-dimensional observation vector and action vector, we specially design the network architecture depicted in Figure 4. Adam [13] is adopted to train the agent networks with a learning rate of . We use mini-batch size , discount factor , soft update rate , and size of replay buffer . Specially, we tune the probability of sampling from “HMemory” in according to the number of interaction threads.

In MuJoCo environments, we adopt fully connected networks with hidden sizes of (256, 256, 128) and (256, 128) to build the actor and critic respectively. And we use a learning rate of and a mini-batch size of 128. Other hyper-parameters keep the same settings of agents on L2R task.

V-C Results

For convincing comparisons, the comparative models follow the settings of the agents in AE-DDPG as closely as possible. In this experimental setting, we try our best to reduce the variability of deep reinforcement learning caused by the potential factors discussed in [32]. Therefore, when running the comparative models, we tune their own hyper-parameters, such as the exploration noise and experience replay, to give them better performance. For each experimental case, we run 5 independent experiments with different random seeds and report the best performance among them.

Fig. 5: The visualization results when training 10k episodes. Upper row: running postures of the agent trained by our AE-DDPG. Lower row: running postures of the agent trained by vanilla DDPG.
Fig. 6: Performance comparisons on six MuJoCo environments trained for 12 million timesteps, wherein one timestep equals one frame. The shaded region denotes the standard deviation over 5 random seeds.

V-C1 Evaluation in a Computationally Complex Environment

We compare our proposed method with state-of-the-art RL algorithms, including three distributed variants of DDPG, on the L2R task. The MA-BDDPG [22] and Multi-DDPG [23] used for comparison can be regarded as distributed extensions of DDPG with multi-critic and multi-actor respectively. They tend to encourage data generation and estimate Q-values more accurately by introducing bootstrapped models. Different from both MA-BDDPG and Multi-DDPG, AE-DDPG keeps a single actor-critic pair but has multiple environment threads interacting with the actor asynchronously. For a more convincing comparison, we further expand the vanilla version of MA-BDDPG by involving multiple actors to collect samples asynchronously, and we denote this expanded MA-BDDPG, which has both multi-actor and multi-critic within one agent, by “MA-BDDPG” in Fig. 3. Note that we keep a single actor-critic pair but enable the actor to interact with multiple environment threads in our proposed AE-DDPG.

To evaluate our proposed method in a fair setting, we have the actor in AE-DDPG interact with 16 stochastic L2R environments simultaneously and keep the same degree of parallelism in the other algorithms (except vanilla DDPG and MA-BDDPG). Their best training performance across five repeated experiments with different random seeds is reported in Figure 3. The mean returns are represented by lines and the standard deviations of the returns by shaded areas.

In the left sub-figure of Fig. 3, given the same number of samples, AE-DDPG achieves a higher mean reward. This indicates that AE-DDPG is more sample efficient than the other algorithms for learning a continuous control strategy in such a computationally complex environment. In the middle sub-figure, AE-DDPG achieves a higher mean reward in less time than the other algorithms once training tends to be stable. We maintain that asynchronous sample collection reduces training time, especially in computationally complex environments, but it also leads to rapidly diminishing returns in sample efficiency due to increasing sample imbalance and decreasing sample diversity, as mentioned in the previous sections. AE-DDPG shows a strong ability to solve this problem by introducing bio-inspired episodic experience replay and random walk noise to encourage exploration and latch onto good interaction trajectories rapidly. Implicitly, the right sub-figure shows better exploration ability for potential strategies.

A particularly notable point is the comparison with A3C: A3C is slightly more effective at the beginning of training but fails to keep this advantage in the subsequent learning. This might be caused by the mismatch between the speed of policy updating with gradient communication and the speed of data collection. We alleviate this problem by communicating experiences instead of communicating gradients. Here, IMPALA [6] is not included in the comparison since it is a set of scalable architectures designed for multi-task learning, and it does not address the issue of sample efficiency from the aspects of experience replay and noise.

We further visualize the running postures learned by different agents. The human model trained by AE-DDPG is the closest to a real adult runner. The model trained by vanilla DDPG can move forward a few steps but soon falls down (see Figure 5). When trained with Proximal Policy Optimization (PPO) [33], the simulated human always keeps its two legs together and performs jump-like behaviors when running forward. Trust Region Policy Optimization (TRPO) [34] performs poorly on this task, and the human model trained by this algorithm has difficulty keeping its balance.

Fig. 7: Comparative experiments with different experience replay methods. Blue curve: bio-inspired experience replay (ours). Green curve: prioritized experience replay. Orange curve: original experience replay in vanilla DDPG.

V-C2 Evaluation on a Standard Benchmark Platform

To verify how well our approach generalizes to simpler stochastic environments with a fast simulator, we choose six continuous-control MuJoCo environments from the standard benchmark platform OpenAI Gym as the evaluation tasks for this comparison experiment. Here, we compare a 16-thread AE-DDPG with the distributed DDPG-based variants, the modified MA-BDDPG and Multi-DDPG, on this benchmark platform.

Fig. 6 shows that, although the techniques in AE-DDPG are designed for continuous control in computationally complex environments, they remain effective and robust in standard benchmark environments with fast simulators. Despite having only a single actor-critic pair within the agent, AE-DDPG collects data efficiently thanks to its asynchronous interaction mechanism. By letting one actor interact with multiple environment threads simultaneously, AE-DDPG can avoid or alleviate the delay and the unstable interference caused by policy updating among different actors, or between the actors and the learners, during training. Based on the performance comparisons shown in Fig. 6, we argue that our proposed experience replay and action space noise help the agent explore potential actions and distill useful information from them, which leads to high sample efficiency. The comparison results across six different tasks show that AE-DDPG with a single actor-critic pair achieves higher sample efficiency than the bootstrapped models.
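The idea of one actor serving several environment threads that communicate experiences, rather than gradients, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the AE-DDPG implementation: the 1-D toy environment, `toy_policy`, `SharedReplayBuffer`, and `run_collectors` are all hypothetical names introduced here.

```python
import random
import threading
from collections import deque


class SharedReplayBuffer:
    """Thread-safe FIFO buffer that several collector threads write into."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, transition):
        with self._lock:
            self._data.append(transition)

    def sample(self, batch_size, rng):
        # Copy under the lock, then sample without blocking the writers.
        with self._lock:
            pool = list(self._data)
        return rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        with self._lock:
            return len(self._data)


def toy_policy(state):
    # Hypothetical stand-in for the single shared actor network.
    return -0.5 * state


def collect(worker_id, buffer, steps):
    """One environment thread: step a toy 1-D environment, share transitions."""
    rng = random.Random(worker_id)
    state = rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        action = toy_policy(state) + rng.gauss(0.0, 0.1)  # exploration noise
        next_state = state + action
        reward = -abs(next_state)  # toy reward: stay close to the origin
        buffer.add((state, action, reward, next_state))
        state = next_state


def run_collectors(num_workers=4, steps=50, capacity=10_000):
    """Start the environment threads and wait until all have finished."""
    buffer = SharedReplayBuffer(capacity)
    threads = [
        threading.Thread(target=collect, args=(i, buffer, steps))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffer
```

In this sketch the learner would draw mini-batches with `buffer.sample(...)` while the collectors keep running; the threads only append transitions, so no gradient has to be exchanged between them.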

Fig. 8: Comparative experiments with different types of noise injected in action space. Blue curve: random walk noise (ours). Green curve: noise sampled from a Gaussian distribution. Orange curve: noise generated by an Ornstein–Uhlenbeck process.

V-C3 Ablation Study for Episodic Experience Replay

To isolate the individual effect of our proposed experience replay, we conduct a series of experiments comparing it with prioritized experience replay [15] and the original experience replay, that is, vanilla Monte Carlo sampling. Only the experience replay method differs; all other modules and settings in this set of comparative experiments remain the same as described above.

The result depicted in Figure 7 shows that episodic memory alleviates the effects of sample imbalance by distilling important information from the large volume of experiences collected through asynchronous interactions. The bio-inspired experience replay takes advantage of the insight from episodic control, which encourages the agent to pay more attention to highly rewarded trajectories. An interesting observation is that prioritized experience replay shows the highest sample efficiency in the early stage of training, but the gain it brings declines gradually as experiences accumulate. This is because prioritized experience replay measures the significance of experience transitions by their TD errors. As a result, the values of some actions might be overestimated and new high-reward experiences are easily ignored, especially when this method is used together with asynchronous experience collection. In general, the mismatch between the speed of experience generation and the policy updating frequency is more serious in asynchronous frameworks, because we generate more transitions asynchronously to find more potential high-reward actions while using a small learning rate to keep the gradient-based optimization stable.

We also analyze the mediocre performance of the episodic experience replay in the early stage of training. The storing rule of our proposed experience replay leads to a short delay for the so-called “HMemory” buffer. Thus, the agent in AE-DDPG initially takes a conservative view of its potential success.
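To make the delay effect concrete, the following minimal sketch keeps a second buffer for highly rewarded trajectories next to the ordinary one. It does not reproduce the storing rule described earlier in the paper: the rule used here (copy an episode into the high-reward buffer only when its return beats the mean return of all earlier episodes) and the names `EpisodicReplay`, `high_memory`, and `high_fraction` are assumptions chosen purely for illustration.

```python
import random
from collections import deque


class EpisodicReplay:
    """Ordinary replay buffer plus a buffer of highly rewarded episodes."""

    def __init__(self, capacity=10_000, high_capacity=2_000,
                 high_fraction=0.5, seed=0):
        self.memory = deque(maxlen=capacity)            # all transitions
        self.high_memory = deque(maxlen=high_capacity)  # good episodes only
        self.high_fraction = high_fraction
        self._episode = []   # transitions of the episode in progress
        self._returns = []   # returns of all finished episodes
        self._rng = random.Random(seed)

    def add(self, transition, done):
        # transition = (state, action, reward, next_state)
        self.memory.append(transition)
        self._episode.append(transition)
        if done:
            self._finish_episode()

    def _finish_episode(self):
        episode_return = sum(t[2] for t in self._episode)
        # Assumed rule: keep episodes that beat the mean return seen so far.
        # The very first episode has nothing to compare against and is skipped.
        if self._returns:
            mean_return = sum(self._returns) / len(self._returns)
            if episode_return > mean_return:
                self.high_memory.extend(self._episode)
        self._returns.append(episode_return)
        self._episode = []

    def sample(self, batch_size):
        # Mix the two buffers; fall back to the ordinary one while
        # the high-reward buffer is still (nearly) empty.
        n_high = min(int(batch_size * self.high_fraction),
                     len(self.high_memory))
        n_plain = min(batch_size - n_high, len(self.memory))
        return (self._rng.sample(list(self.high_memory), n_high)
                + self._rng.sample(list(self.memory), n_plain))
```

Because an episode can only be judged after it has ended, the high-reward buffer stays empty during the first episodes and early mini-batches are drawn from the ordinary buffer alone, which is consistent with the slow start discussed above.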

V-C4 Ablation Study for Random Walk Noise

To assess whether our introduced noise can serve as a plug-in technique, we further analyze its individual role. We compare our proposed random walk noise with two popular types of noise injected in action space through experiments on the L2R task. One is sampled from a Gaussian distribution, while the other is generated by an Ornstein–Uhlenbeck (OU) process [35]. We tune the parameters of each type of noise to reach its best performance so that the comparison is convincing.

We can see that the RL agent using random walk noise achieves the highest mean reward, although with a relatively larger variance. This noise proves successful in encouraging exploration behaviors. Its temporal correlation helps the agent find effective actions in continuous space. In addition, the “instance uncorrelated” property of random walk noise helps to avoid repeating ineffective searches in the action space and substantially improves sample diversity.
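The difference in temporal correlation among the three noise types can be illustrated with a small, self-contained sketch. It is not the generator used in our experiments: the random walk is approximated here by a cumulative sum of Gaussian steps, the OU process uses a simple Euler discretization, and all function names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def gaussian_noise(n, sigma, rng):
    """Independent samples: no correlation between consecutive steps."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def ou_noise(n, theta, sigma, rng, dt=1.0, mu=0.0):
    """Euler-discretized Ornstein-Uhlenbeck process: mean-reverting."""
    x, out = 0.0, []
    for _ in range(n):
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def random_walk_noise(n, sigma, rng):
    """Cumulative sum of Gaussian steps: correlated, no pull toward a mean."""
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, sigma)
        out.append(x)
    return out


def lag1_autocorrelation(xs):
    """Sample autocorrelation between each value and its successor."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den
```

Calling `lag1_autocorrelation` on long sequences from each generator gives a value near zero for Gaussian noise and values close to one for the two correlated processes. Drawing every sequence from a freshly seeded generator keeps separate noise instances independent of each other.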

VI Conclusion

In this paper, we propose AE-DDPG, an asynchronous actor-critic method that provides a scalable and sample-efficient way to solve continuous control problems in computationally complex environments. Episodic control and power-law (random walk) noise are successfully introduced into an asynchronous framework to maintain, and even improve, sample efficiency while increasing data throughput. Experiments demonstrate that this modification of DDPG requires less training time and learns more efficiently in high-dimensional complex environments. It also generalizes satisfactorily to other prevalent continuous control tasks. We believe that the technical components of AE-DDPG have the potential to be applied to other deep RL algorithms in future work.


This work was supported in part by NSFC under Grants 61571413, 61632001, and 61390514.


  • [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [2] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” ICLR, 2016.
  • [3] Ł. Kidziński, S. Prasanna Mohanty, C. F. Ong, Z. Huang, S. Zhou, A. Pechenko, A. Stelmaszczyk, P. Jarosik, M. Pavlov, S. Kolesnikov, S. Plis, Z. Chen, Z. Zhang, J. Chen, J. Shi, Z. Zheng, C. Yuan, Z. Lin, H. Michalewski, and S. Delp, Learning to Run Challenge Solutions: Adapting Reinforcement Learning Methods for Neuromusculoskeletal Environments, 01 2018, pp. 121–153.
  • [4] Y.-H. Kuo, J.-P. Hsu, and C.-W. Wang, “A parallel fuzzy inference model with distributed prediction scheme for reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 2, pp. 160–172, 1998.
  • [5] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
  • [6] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 1407–1416.
  • [7] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, “Distributed prioritized experience replay,” ICLR, 2018.
  • [8] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” ICLR, 2018.
  • [9] A. Stooke and P. Abbeel, “Accelerated methods for deep reinforcement learning,” CoRR, 2018.
  • [10] B. Baddeley, “Reinforcement learning in continuous time and space: Interference and not ill conditioning is the main problem when using distributed function approximators,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 950–956, 2008.
  • [11] C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. Rae, D. Wierstra, and D. Hassabis, “Model-free episodic control,” arXiv preprint arXiv:1606.04460, 2016.
  • [12] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell, “Neural episodic control,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 2827–2836.
  • [13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2014.
  • [14] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine learning, vol. 8, no. 3-4, pp. 293–321, 1992.
  • [15] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” ICLR, 2016.
  • [16] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
  • [17] M. Lengyel and P. Dayan, “Hippocampal contributions to control: the third way,” in Advances in neural information processing systems, 2008, pp. 889–896.
  • [18] Z. Lin, T. Zhao, G. Yang, and L. Zhang, “Episodic memory deep q-networks,” IJCAI, 2018.
  • [19] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter space noise for exploration,” ICLR, 2018.
  • [20] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
  • [21] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
  • [22] G. Kalweit and J. Boedecker, “Uncertainty-driven imagination for continuous deep reinforcement learning,” in Conference on Robot Learning, 2017, pp. 195–206.
  • [23] Z. Yang, K. E. Merrick, H. A. Abbass, and L. Jin, “Multi-task deep reinforcement learning for continuous action control.” in IJCAI, 2017, pp. 3301–3307.
  • [24] L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on.   IEEE, 2017, pp. 31–36.
  • [25] R. A. Adcock, A. Thangavel, S. Whitfield-Gabrieli, B. Knutson, and J. D. Gabrieli, “Reward-motivated learning: mesolimbic activation precedes memory formation,” Neuron, vol. 50, no. 3, pp. 507–517, 2006.
  • [26] A. M. Do, A. V. Rupert, and G. Wolford, “Evaluations of pleasurable experiences: The peak-end rule,” Psychonomic Bulletin & Review, vol. 15, no. 1, pp. 96–98, 2008.
  • [27] J. Timmer and M. Koenig, “On generating power law noise.” Astronomy and Astrophysics, vol. 300, p. 707, 1995.
  • [28] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
  • [29] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on.   IEEE, 2012, pp. 5026–5033.
  • [30] S. L. Delp, F. C. Anderson, A. S. Arnold, P. Loan, A. Habib, C. T. John, E. Guendelman, and D. G. Thelen, “Opensim: open-source software to create and analyze dynamic simulations of movement,” IEEE transactions on biomedical engineering, vol. 54, no. 11, pp. 1940–1950, 2007.
  • [31] M. A. Sherman, A. Seth, and S. L. Delp, “Simbody: multibody dynamics for biomedical research,” Procedia Iutam, vol. 2, pp. 241–261, 2011.
  • [32] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” AAAI, 2018.
  • [33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
  • [35] G. E. Uhlenbeck and L. S. Ornstein, “On the theory of the brownian motion,” Physical review, vol. 36, no. 5, p. 823, 1930.