
Reinforcement learning with experience replay and adaptation of action dispersion

by Paweł Wawrzyński, et al.
Politechnika Warszawska

Effective reinforcement learning requires a proper balance of exploration and exploitation defined by the dispersion of action distribution. However, this balance depends on the task, the current stage of the learning process, and the current environment state. Existing methods that designate the action distribution dispersion require problem-dependent hyperparameters. In this paper, we propose to automatically designate the action distribution dispersion using the following principle: This distribution should have sufficient dispersion to enable the evaluation of future policies. To that end, the dispersion should be tuned to assure a sufficiently high probability (densities) of the actions in the replay buffer and the modes of the distributions that generated them, yet this dispersion should not be higher. This way, a policy can be effectively evaluated based on the actions in the buffer, but exploratory randomness in actions decreases when this policy converges. The above principle is verified here on challenging benchmarks Ant, HalfCheetah, Hopper, and Walker2D, with good results. Our method makes the action standard deviations converge to values similar to those resulting from trial-and-error optimization.


1 Introduction

Reinforcement learning (RL) (Sutton and Barto, 2018) considers an agent that takes successive actions in a dynamic environment. The actions change the state of the environment, and depending on them, the agent receives numeric rewards. The agent learns to designate actions based on states to receive the highest rewards in the future.

To optimize its behavior, the agent needs to observe the consequences of different actions, i.e., it needs to apply diverse actions in each state. Therefore, the agent uses a stochastic policy to designate actions, i.e., each time the agent draws them from a certain distribution conditioned on the state. The more dispersed this distribution is, the more experience the agent gathers, but the less likely it becomes that the agent gets to states that yield high rewards. This so-called exploration-exploitation trade-off is essential for efficient reinforcement learning. Despite a lot of significant research, adaptive optimization of this trade-off is still an open problem.

In this paper, we analyze the following approach to designating the dispersion of the action distribution, thereby quantifying the exploration-exploitation trade-off. We assume that the agent's experience is stored in a memory buffer of a fixed size, and the policy changes at a certain pace due to learning. The learning is driven mostly by actions representative of the evaluated policies. Therefore, the current action distribution should be dispersed enough for the currently taken actions to have sufficiently high density under future policies, so that these policies can be effectively evaluated and selected.

The contribution of the paper can be summarized in the following points:

  • We analyze the evaluation of future policies as a primary reason for exploration in RL. We propose a way to quantify exploration to enable that evaluation.

  • We introduce an RL algorithm that automatically designates a dispersion of action distribution (the amount of exploration) in the trained policy. This dispersion is sufficient to evaluate the current policy but not larger. Hence, when the policy converges, the dispersion is suppressed.

  • We present simulations that demonstrate the efficiency of the above algorithm on four challenging learning control problems: Ant, HalfCheetah, Hopper, and Walker2D.

The rest of the paper is organized as follows. Section 2 overviews literature related to the topic of this paper. The following section formulates the problem of designating the amount of randomness in an agent’s policy that optimizes its behavior with RL. Section 4 presents our approach to this problem. In Section 5, simulations are discussed in which our method is compared with RL algorithms PPO, SAC, and ACER. The last section concludes the paper.

2 Related work

Exploration in reinforcement learning.

Most existing reinforcement learning algorithms are designed to optimize policies with a fixed amount of randomness in actions. This amount is defined by a quantity such as action standard deviation or the probability of an exploratory action. Within the common approach to RL, this quantity is tuned manually. A simple approach to automating this tuning is to train this quantity as one of the policy parameters. This approach was first introduced as a part of the REINFORCE (Williams and Peng, 1991) algorithm and later applied in the Asynchronous Advantage Actor-Critic (Mnih et al., 2016) algorithm and the Proximal Policy Optimization (Schulman et al., 2017) algorithm. However, a policy learned this way degrades to a deterministic one without hand-crafted regularization (Mnih et al., 2016). This regularization is typically introduced as an entropy-based bonus term.

A different approach is to control exploration to increase state-space coverage. One way to achieve this goal is to reward the agent for visiting novel states. State novelty may be determined with a counting table (Tang et al., 2017) or estimated using environment dynamics prediction error (Pathak et al., 2017; Stadie et al., 2015). Another way is to maximize the expected difference between the current policy and past policies, thus increasing the coverage of the policy space (Hong et al., 2018).

Maximum entropy RL is a different approach to optimizing exploration while keeping the policy from degrading to a deterministic one (Jaynes, 1957; Ziebart et al., 2008). In this approach, the policy is optimized with regard to a quality index that combines actual rewards and the entropy of the action distribution. The first off-policy maximum entropy RL algorithm was Soft Actor-Critic (SAC) (Haarnoja et al., 2018). It is the most prominent RL method that tunes the amount of exploration during its operation. Although the idea of achieving that by rewarding the action distribution entropy was a breakthrough, it is still heuristic and sometimes does not work. Even when it does, it requires a handcrafted coefficient, namely the weight of the entropy. The follow-up version of SAC (Haarnoja et al., 2019) tunes this coefficient dynamically with gradient descent so that the entropy approaches a target level. This target level is set for the finite-dimensional action space $\mathcal{A}$ to $-\dim(\mathcal{A})$, a value that works empirically on benchmark tasks but is not justified otherwise. Meta-SAC (Wang and Ni, 2020) tunes the target entropy value with a meta-gradient approach.

Existing techniques reduce the balancing of exploitation and exploration to balancing of rewards and entropy. That generally requires problem-dependent coefficients. Within our approach, we define an independent criterion for the amount of exploration, thereby, in principle, avoiding problem-dependent settings.

Efficient utilization of previous experience in RL

The fundamental Actor-Critic architecture of reinforcement learning was introduced in (Barto et al., 1983). Approximators were applied to this structure for the first time in (Kimura and Kobayashi, 1998). To boost the efficiency of these algorithms, experience replay (ER) (Mahadevan and Connell, 1992) can be applied, i.e., storing the events in a database, sampling, and using them for policy updates several times per each actual event. ER was combined with the Actor-Critic architecture for the first time in (Wawrzyński, 2009).

Application of experience replay to Actor-Critic creates the following problem. The learning algorithm needs to estimate the quality of a given policy based on the consequences of actions registered when a different policy was in use. Importance sampling estimators are designed to do that, but they can have arbitrarily large variances. In (Wawrzyński, 2009), that problem was addressed by truncating the density ratios present in those estimators. In (Wang et al., 2016), specific correction terms were introduced for that purpose.

The significance of the relevance of samples was noted for the on-policy algorithms that use experience buffer. This class of algorithms shows another approach to the problem above. They prevent the algorithm from inducing a policy that differs too much from the one used to collect samples. That idea was first applied in Conservative Policy Iteration (Kakade and Langford, 2002). It was further extended in Trust Region Policy Optimization (Schulman et al., 2015). This algorithm optimizes a policy with the constraint that the Kullback-Leibler divergence between that policy and the tried one should not exceed a given threshold. The K-L divergence becomes an additive penalty in Proximal Policy Optimization algorithms, namely PPO-Penalty and PPO-Clip (Schulman et al., 2017).

A way to avoid the problem of estimating the quality of a given policy based on the tried one is to approximate the action-value function instead of estimating the value function. Algorithms based on this approach are Deep Q-Network (DQN) (Mnih et al., 2013), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). SAC uses noise as input for calculating policy, and it is considered one of the most efficient in this family of algorithms.

This paper combines actor-critic structure with experience replay in the old-fashioned way introduced in (Wawrzyński, 2009).

3 Problem formulation

We consider the typical RL setup (Sutton and Barto, 2018). An agent operates in its environment in discrete time $t = 1, 2, \dots$. At time $t$ it finds itself in a state, $s_t$, performs an action, $a_t$, receives a reward, $r_t$, and the state changes to $s_{t+1}$.

In this paper, we consider the actor-critic framework (Barto et al., 1983; Kimura and Kobayashi, 1998) of RL. The goal is to optimize a stationary control policy defined as follows. Actions are generated by a distribution

$$a_t \sim \pi(\cdot\,; \mu, \xi) \qquad (1)$$

parameterized by two vectors: $\mu$ that does not affect the dispersion of the action distribution, and $\xi$ that does. Approximators produce both $\mu$ and $\xi$:

$$\mu = A(s_t; \theta), \qquad \xi = \sigma(s_t; \eta), \qquad (2)$$

with $\theta$ and $\eta$ being their vectors of parameters.

Neural-normal policy.

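This policy is applicable for real-valued action vectors, $a \in \mathbb{R}^d$. Both $A$ and $\sigma$ are neural networks with input $s$ and weights $\theta$ and $\eta$, respectively. (Footnote 1: They could also be implemented as a single network outputting both $A$ and $\sigma$, but for brevity we use this distinction.)

The action is sampled from the normal distribution with mean $A(s;\theta)$ and covariance matrix $\mathrm{diag}(\sigma^2(s;\eta))$. Thus we denote

$$\pi(a; A(s;\theta), \sigma(s;\eta)) = \mathcal{N}\big(a; A(s;\theta), \mathrm{diag}(\sigma^2(s;\eta))\big) \qquad (3)$$

and generate an action as

$$a = A(s;\theta) + \epsilon \circ \sigma(s;\eta), \qquad (4)$$

where $\epsilon \sim \mathcal{N}(0, I)$, $\sigma(s;\eta)$ is a vector of standard deviations for different action components, and "$\circ$" denotes the Hadamard (elementwise) product.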
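The sampling rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the lists `mean` and `sigma` stand in for the outputs of the two networks for a single state, and their values, as well as the function names, are hypothetical.

```python
import math
import random

def sample_action(mean, sigma, rng):
    """Draw a = mean + eps * sigma with eps ~ N(0, I), component by component."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mean, sigma)]

def log_density(action, mean, sigma):
    """Log-density of a normal distribution with covariance diag(sigma^2)."""
    total = 0.0
    for a, m, s in zip(action, mean, sigma):
        z = (a - m) / s
        total += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total

rng = random.Random(0)
mean = [0.2, -0.5]   # hypothetical output of the mean network for some state
sigma = [0.3, 0.1]   # hypothetical output of the dispersion network
action = sample_action(mean, sigma, rng)
```

The mode of this distribution is `mean` itself, so `log_density(mean, mean, sigma)` is an upper bound on the log-density of any sampled action.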

Experience replay.

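We assume that the policy is optimized with the use of experience replay. The agent's experience, i.e., states, actions, and rewards, is stored in a memory buffer. Simultaneously with the agent's operation in the environment, the experience is recalled from the buffer and used to optimize $\theta$ and $\eta$.

The vectors of weights, $\theta$ and $\eta$, define a policy, $\pi$. The criterion of the policy optimization is the maximization of the value function

$$V^{\pi}(s) = E\Big(\sum_{i \ge 0} \gamma^i r_{t+i} \,\Big|\, s_t = s; \pi\Big) \qquad (5)$$

for each state $s$; $\gamma \in [0, 1)$ is a coefficient, the discount factor.

The value function may be estimated based on so-called $n$-step returns, namely

$$V^{\pi}(s) = E\Big(\sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V^{\pi}(s_{t+n}) \,\Big|\, s_t = s; \pi\Big) \qquad (6)$$

for any $n > 0$.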
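As a small worked example of an $n$-step return, the sketch below sums $n$ discounted rewards and adds a discounted bootstrap value for the state reached after $n$ steps. The reward sequence, the bootstrap value, and the discount factor are made-up numbers chosen for easy arithmetic, and the function name is hypothetical.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Sum_{i<n} gamma^i * rewards[i] + gamma^n * bootstrap_value, n = len(rewards)."""
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    # After the loop, discount equals gamma ** n.
    return total + discount * bootstrap_value

# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 10 = 2.75
target = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```

In an actor-critic setting the bootstrap value would come from the critic evaluated at the state reached after $n$ steps.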

The optimization criterion of $\theta$ maximizes the probability of experienced actions that led to high rewards afterward. However, $\eta$ should be tuned to keep a proper balance between exploration and exploitation in the agent's behavior.

4 Method

4.1 General idea

We consider an actor-critic algorithm with experience replay. The algorithm keeps a fixed-length window of previous events and continuously optimizes the policy. We postulate that the dispersion of the distribution of the current policy should be sufficient to produce actions that will enable the evaluation and selection of future policies, i.e., these actions should be likely in future policies. However, the future policies are unknown. We assume that they will be as different from the current policy as the current policy is different from those that produced the actions registered in the current memory buffer.

Figure 1: Illustration. 1 – the policy that generated action $a_i$; its mode is $m_i$, 2 – the current policy, 3 – a hypothetical future policy. Under the proposed method, the dispersion of the current policy ensures that the mode and action of the previous policy are likely according to the current one. Symmetrically, the current action should enable the evaluation of the future policy.

Following the above postulate, we propose two rules to optimize $\eta$:

R1. For the $i$-th event registered in the memory buffer, the mode of the distribution that has produced the action $a_i$ should be likely according to the current policy and the state $s_i$.

R2. For the same event, the action $a_i$ should be likely according to the current policy and the state $s_i$.

The above rules (illustrated in Fig. 1) are intended to have the following effects: 1) When the action distribution for a given state, $s$, changes due to learning, the current action distribution for that state is dispersed, thereby enabling the evaluation of a broad range of behaviors expected to be exercised in the course of the learning. 2) However, when the policy approaches the optimum and the pace of its change decreases, the distribution becomes less dispersed, enabling more precise action choices. Eventually, the action distribution converges to a deterministic choice when the policy converges.

4.2 Operationalization

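The mode of the distribution that has produced the action $a_i$ is defined as

$$m_i = \arg\max_a \pi\big(a; A(s_i;\theta), \sigma(s_i;\eta)\big) \qquad (7)$$

for $\theta$ and $\eta$ used at time $i$. Following the aforementioned rules R1 and R2, to optimize $\eta$ we minimize the loss

$$L(\eta, i) = -\ln \pi\big(m_i; A(s_i;\theta), \sigma(s_i;\eta)\big) - \alpha \ln \pi\big(a_i; A(s_i;\theta), \sigma(s_i;\eta)\big) \qquad (8)$$

averaged over the events collected in the memory buffer. Here $\theta$ and $\eta$ are the current parameters, and $\alpha$ is a coefficient.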

Neural-normal policy.

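For this policy the mode (7), denoted here by $m_i$, is equal to $A(s_i;\theta)$ for $\theta$ applied at time $i$. Up to an additive constant, the loss (8) then takes the form

$$L(\eta, i) = (1+\alpha)\,\mathbf{1}^T \ln \sigma(s_i;\eta) + \frac{1}{2}\,\mathbf{1}^T\, \frac{\big(m_i - A(s_i;\theta)\big)^2 + \alpha \big(a_i - A(s_i;\theta)\big)^2}{\sigma^2(s_i;\eta)} \qquad (9)$$

where $\mathbf{1}$ is a vector of ones ($\mathbf{1}^T x$ is a sum of elements in $x$), and the logarithm, the squares, and the division are elementwise.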
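To make the behavior of such a dispersion loss concrete, here is a minimal sketch for a single replayed event. It assumes the loss is the negative log-density of the previous mode plus a coefficient `alpha` times the negative log-density of the previous action, both under a normal distribution centered at the current mean, with additive constants dropped. Under that assumption, setting the derivative with respect to each standard deviation to zero gives the closed form $\sigma^2 = \big((m - A)^2 + \alpha (a - A)^2\big) / (1 + \alpha)$ per component. All names and numbers below are hypothetical; in the method itself the standard deviations come from a network and the loss is averaged over the buffer.

```python
import math

def dispersion_loss(sigma, mean_now, mode_then, action_then, alpha):
    """-ln N(mode_then) - alpha * ln N(action_then) under N(mean_now, diag(sigma^2)),
    with additive constants dropped."""
    loss = 0.0
    for s, m, mode, a in zip(sigma, mean_now, mode_then, action_then):
        loss += (1.0 + alpha) * math.log(s)
        loss += ((mode - m) ** 2 + alpha * (a - m) ** 2) / (2.0 * s * s)
    return loss

def optimal_sigma(mean_now, mode_then, action_then, alpha):
    """Per-component minimizer of dispersion_loss for a single event."""
    return [math.sqrt(((mode - m) ** 2 + alpha * (a - m) ** 2) / (1.0 + alpha))
            for m, mode, a in zip(mean_now, mode_then, action_then)]

alpha = 0.1  # hypothetical coefficient value
# The policy mean has drifted far from the old mode and action: large dispersion.
wide = optimal_sigma([0.0], [0.3], [0.5], alpha)
# The policy has barely changed since the event was recorded: small dispersion.
narrow = optimal_sigma([0.0], [0.03], [0.05], alpha)
```

The two calls illustrate the intended effect: the further the current mean has moved from what is stored in the buffer, the larger the standard deviation that minimizes the loss, and as the policy stops changing the minimizer shrinks.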

4.3 Algorithm

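1:  Sample an event to replay from the memory buffer.
2:  Compute the temporal difference multiplied by a softly truncated density ratio.
3:  Compute the Critic gradient estimate $\Delta\nu$.
4:  Compute the Actor gradient estimate $\Delta\theta$.
5:  Compute the dispersion loss gradient estimate $\Delta\eta$.
6:  Update $\nu$ with $\Delta\nu$, $\theta$ with $\Delta\theta$, and $\eta$ with $\Delta\eta$.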
Algorithm 1 Experience replay in Actor-Critic with Experience Replay and Adaptive eXploration, ACERAX

The algorithm presented here is Actor-Critic with Experience Replay and Adaptive eXploration, ACERAX. It is based on the Actor-Critic structure shown in Section 3, experience replay, and $n$-step returns. The algorithm uses a critic, $V(s;\nu)$, an approximator of the value function with weights $\nu$.

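At each time $t$ of the agent-environment interaction, a tuple describing the current event, including the state $s_t$, the action $a_t$, and the reward $r_t$, is registered in the memory buffer.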

As the agent-environment interaction continues, previous experience is being replayed; that is, Algorithm 1 is being recurrently executed.
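The bookkeeping described here, a fixed-size window of registered events from which one is picked for replay, can be sketched as follows. The class name, the event layout, and the uniform sampling are assumptions made for illustration only; the tuples actually stored by the algorithm carry more information than shown, and computing $n$-step returns would also require access to the events that follow the sampled one.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size window over the most recent events; the oldest ones are dropped."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)

    def register(self, event):
        self.events.append(event)

    def sample(self, rng):
        """Pick one stored event uniformly at random (an assumed sampling scheme)."""
        return self.events[rng.randrange(len(self.events))]

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Hypothetical event layout: (state, action, reward).
    buffer.register((f"s{t}", f"a{t}", float(t)))

replayed = buffer.sample(random.Random(0))
```

With a capacity of 3 and five registered events, only the three most recent ones remain available for replay.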

In Line 1, the algorithm selects an experienced event to replay. In Line 2, it determines the relative quality of the replayed actions, $d$, namely the temporal difference multiplied by a softly truncated density ratio. The policy is changing due to learning. Thus the conditional distribution of actions is now different from what it was at the time when the action was executed. The product of density ratios in $d$ accounts for this discrepancy in distributions. To limit the variance of the density ratios, the soft-truncating function $\psi_b$ is applied, e.g.,

$$\psi_b(x) = b \tanh(x / b)$$

for $b > 1$. In the ACER algorithm (Wawrzyński, 2009), the hard truncation function, $\psi_b(x) = \min\{x, b\}$, is used for the same purpose, which is limiting the density ratios needed to correct the updates for action distribution discrepancies. However, soft truncation distinguishes the magnitudes of density ratios and may be expected to work slightly better than hard truncation.
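The difference between the two kinds of truncation is easy to see numerically. The sketch below assumes a soft-truncating function of the form $b \tanh(x/b)$ and compares it with the hard truncation $\min\{x, b\}$; the bound `b` and the list of density ratios are made-up values.

```python
import math

def soft_truncate(x, b):
    """Soft truncation b * tanh(x / b): close to x for small x, saturating at b."""
    return b * math.tanh(x / b)

def hard_truncate(x, b):
    """Hard truncation min(x, b)."""
    return min(x, b)

b = 3.0
ratios = [0.5, 2.0, 5.0, 50.0]   # hypothetical density ratios
soft = [soft_truncate(x, b) for x in ratios]
hard = [hard_truncate(x, b) for x in ratios]
```

Both variants keep the result bounded by `b`, but the hard version maps the ratios 5 and 50 to the same value, whereas the soft version still orders them.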

In Line 3, an improvement direction of the parameters of the critic, $\Delta\nu$, is computed. $\Delta\nu$ is designed to make $V(\cdot\,;\nu)$ approximate the value function better.

In Line 4, an improvement direction for the actor parameter $\theta$ is computed. The increment $\Delta\theta$ is designed to increase/decrease the likelihood of occurrence of the sequence of actions proportionally to $d$. It also includes a penalty for improper values of $A(s;\theta)$, e.g., exceeding a box to which the actions should belong.

In Line 5, an improvement direction for the actor parameter $\eta$ is computed, for the loss $L$ defined in (8).

The improvement directions $\Delta\nu$, $\Delta\theta$, and $\Delta\eta$ are applied in Line 6 to update $\nu$, $\theta$, and $\eta$, respectively, with the use of either ADAM, SGD, or another method of stochastic optimization. They may be applied in minibatches, several at a time.

5 Experimental study

This section presents simulations whose purpose is to evaluate the ACERAX algorithm introduced in Sec. 4. In our first experiment, we determine the algorithm's sensitivity to its parameter $\alpha$ and its approximately optimal value across several RL problems. In the second experiment, we look at the action standard deviations the algorithm determines and compare them with constant action standard deviations optimized manually. We call the algorithm with constant, approximately optimal action standard deviations ACER, as it differs only in details from that presented in (Wawrzyński, 2009) under that name. In the third experiment, we compare ACERAX to two state-of-the-art RL algorithms: the Soft Actor-Critic (SAC) (Haarnoja et al., 2018) algorithm and the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm. We use the Stable Baselines3 implementation (Raffin and Stulp, 2020) of SAC and PPO in the simulations. Our experimental software is available online. (Footnote 2: provided in the final version of the paper.)

For the comparison of the RL algorithms to be the most informative, we chose four challenging tasks inspired by robotics, namely Ant, Hopper, HalfCheetah, and Walker2D (see Fig. 2) from the PyBullet physics simulator (Coumans and Bai, 2019).

5.1 Experimental settings

Each learning run lasted for 3 million timesteps. Every 30000 timesteps, the training was paused, and a simulation was made with frozen weights and without exploration for 5 test episodes. An average sum of rewards within a test episode was registered. Each run was repeated five times.
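The evaluation protocol implies a fixed amount of testing, which the short calculation below spells out. The variable names are hypothetical, and whether an additional evaluation takes place before training starts is not stated, so the counts assume one evaluation per completed interval.

```python
total_timesteps = 3_000_000   # length of one learning run
eval_interval = 30_000        # training is paused this often
test_episodes = 5             # episodes per evaluation, with frozen weights
runs = 5                      # independent repetitions of each run

# One evaluation per completed interval (assumption): 100 points per learning curve.
eval_points_per_run = total_timesteps // eval_interval
test_episodes_per_run = eval_points_per_run * test_episodes   # 500
test_episodes_total = test_episodes_per_run * runs            # 2500
```

Each point on a learning curve is therefore an average over 5 test episodes, and each curve aggregates 5 such runs.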

We have taken the implementations of SAC and PPO from (Raffin et al., 2021), and the hyperparameters that assure the state-of-the-art performance of these algorithms from (Raffin and Stulp, 2020). For each environment, hyperparameters for ACER and ACERAX, such as step-sizes, were optimized to yield the highest ultimate average rewards. The values of these hyperparameters are reported in the appendix.

Figure 2: Environments used in simulations. From the left: Ant, HalfCheetah, Hopper, Walker2D.

5.2 Searching for $\alpha$

Figure 3: Results of ACERAX for different values of $\alpha$. From the left: Ant, HalfCheetah, Hopper, Walker2D.

For each environment, we tried several values of $\alpha$. The results are depicted in Fig. 3. The learning is not very sensitive to this parameter. The loss makes the previous modes of action distributions likely according to the current policy and, with weight $\alpha$, also the previous actions. Making the previous actions likely is intended to be a form of regularization: It prevents the policy from degrading to a deterministic one too early. We use a single value of $\alpha$ for the rest of the experiments.

5.3 Different initial action standard deviations

Figure 4: Comparison of the behavior of $\sigma$ during training for different initial biases in the last layer of the $\sigma$ network. Each plot represents the maximum or minimum value of the mean calculated over the coordinates of the standard deviation vector. The averages were calculated based on five independent runs. From the left: Ant, HalfCheetah, Hopper, Walker2D.

Using a grid search, we optimized constant action standard deviations for ACER in these environments. The same value was selected in all environments. In another experiment, we verify the average outputs of the $\sigma$ network depending on its initialization. To this end, we impose different initial biases in the output neurons of this network. Afterward, we register the outputs of this network over time. Fig. 4 presents the trajectories of the average minimum and maximum components of $\sigma$. Regardless of the initialization, these outputs converge quite close to the value selected for ACER. We also see that the minimum and maximum components differ, which justifies the need to designate the standard deviations of different action dimensions separately.

5.4 Comparison of ACERAX, ACER, PPO, and SAC

Figure 5: Results for ACERAX, ACER, PPO, and SAC. From the left: Ant, HalfCheetah, Hopper, Walker2D.

Fig. 5 presents learning curves for all four environments for SAC, PPO, ACER, and our proposed ACERAX. All algorithms exhibit a similar speed of learning and ultimate rewards, with ACERAX performing very well according to both these criteria. (In fact, ACERAX yields the best average ultimate performance in 3 out of 4 environments, but this result is not statistically significant.) However, the reference algorithms control the exploration-exploitation balance with heuristic or problem-dependent coefficients. In ACERAX, this balance is based on a universal principle that is somewhat affected by the coefficient $\alpha$ but is not very sensitive to it.

5.5 Discussion

Exploration is necessary for reinforcement learning, but it inevitably deteriorates currently expected rewards. The best previous approaches to tuning the amount of exploration online are based on additional rewards for action distribution entropy. However, the scale of these rewards is generally problem-dependent, as it is never known in advance what fraction of the original rewards needs to be traded for action distribution entropy to assure proper exploration.

The approach analyzed here is based on the assumption that the policy is optimized on the data in the replay buffer. The current action distribution dispersion should be sufficient to support policy evaluation and selection when the current actions are replayed from the buffer in the future.

Our approach generally yielded good performance in the experiments. That happened at the cost of an additional neural network, $\sigma$, that controlled the action distribution dispersion. Our experiments suggest that this network should be much smaller than the $A$ network; it had ten times fewer neurons in each layer. We interpret this latter condition as follows: It should be impossible for the $\sigma$ network to overfit to the experience, as that would result in nearly zero action distribution dispersion for some states.

6 Conclusions

Exploration/exploitation balance is a fundamental problem in reinforcement learning. In this paper, we analyzed an approach to the adaptive designation of the amount of randomness in action distribution. In this approach, the dispersion is tuned to maximize the probability densities of the actions in the replay buffer and of the modes of the distributions that generated them. Consequently, current actions are likely to support the evaluation and selection of future policies. Furthermore, this strategy diminishes the randomness in actions when the policy converges, giving the agent increasing control over its actions. The RL algorithm based on this strategy introduced here, ACERAX, was verified on four challenging robotic-like benchmarks: Ant, HalfCheetah, Hopper, and Walker2D, with good results. Our method makes the action standard deviations converge to values similar to those resulting from trial-and-error optimization.

Our proposed strategy for optimizing the dispersion of the action distribution is based on the maximization of weighted log-densities of previous modes of action distributions and previous actions. The optimal weights are potentially problem-dependent, although the strategy appears to be barely sensitive to these weights. Getting rid of any potentially problem-dependent coefficients from this strategy is an interesting direction for further research.


Appendix A Algorithms’ hyperparameters

Hyperparameters of SAC and PPO have been taken from Raffin and Stulp (2020). They are presented in Tab. 1 and 2, respectively.

In all experiments we used a discount factor of . Common hyperparameters of ACER and ACERAX are presented in Tab. 3. When possible, we have adopted hyperparameters of these algorithms from SAC. These include the actor and critic sizes and the parameters that define the intensity of experience replay. Step-sizes have been selected from the set

Standard deviations for actions components in ACER have been selected from the set

Different action components for the same environment have the same standard deviation. For all environments, the selected value was the same.

The $\sigma$ network used in the ACERAX algorithm had 40 and 30 neurons in its two hidden layers. The step-sizes for this network are presented in Table 4. Biases of the network output layer were initialized with $-1$ (see Table 4), which determines the initial standard deviation of action components.

Parameter Value
Actor size
Critic size
Discount factor 0.98
Replay buffer size
Minibatch size 256
Entropy coefficient auto
Target entropy
Learning start
Gradient steps 8
Train frequency 8
Table 1: SAC hyperparameters. The actor and critic sizes define numbers of neurons in hidden layers of respective neural networks.
Parameter Value
Actor size
Critic size
Discount factor 0.99
GAE parameter 0.9
Minibatch size 128
Horizon 512
Number of epochs
Entropy param. 0.0
Clip range 0.4
Initial log $\sigma$
Table 2: PPO hyperparameters.
Parameter Value
Actor size
Critic size
Memory size
Minibatch size 256
Gradient steps 1
Discount factor 0.98
Actor step-size
Critic step-size
Actor step-size
Critic step-size
Actor step-size
Critic step-size
Actor step-size
Critic step-size
Table 3: ACER and ACERAX hyperparameters.
Parameter Value
Initial output bias -1
Ant step-size
HalfCheetah step-size
Hopper step-size
Walker2D step-size
Table 4: ACERAX hyperparameters.

Appendix B Computational resources

At the stage of code debugging, we used an external cluster. For the actual experimental study, we used a PC equipped with an AMD Ryzen Threadripper 1920X for about 7 weeks.