1 Introduction
In reinforcement learning (RL) (Sutton and Barto, 2018), an agent makes subsequent actions in a dynamic environment. The actions change the state of the environment and, depending on them, the agent receives numeric rewards. The agent learns to select actions based on states so as to receive the highest rewards in the future.
To optimize its behavior, the agent needs to observe the consequences of different actions, i.e., it needs to apply diverse actions in each state. Therefore, the agent uses a stochastic policy to designate actions: each time, the agent draws them from a certain distribution conditioned on the state. The more dispersed this distribution is, the more experience the agent gathers, but the less likely it becomes that the agent reaches states that yield high rewards. This so-called exploration-exploitation trade-off is essential for efficient reinforcement learning. Despite much significant research, adaptive optimization of this trade-off is still an open problem.
In this paper, we analyze the following approach to designating the dispersion of the action distribution, thereby quantifying the exploration-exploitation trade-off. We assume that the agent's experience is stored in a memory buffer of a fixed size and that the policy changes at a certain pace due to learning. The learning is driven mostly by actions representative of the evaluated policies. Therefore, the current action distribution should be dispersed enough for the currently taken actions to have sufficiently high density under future policies, enabling effective evaluation and selection of these policies.
The contribution of the paper can be summarized in the following points:

We analyze the evaluation of future policies as a primary reason for exploration in RL and propose a way to quantify the exploration needed to enable that evaluation.

We introduce an RL algorithm that automatically designates the dispersion of the action distribution (the amount of exploration) in the trained policy. This dispersion is sufficient to evaluate the current policy but no larger. Hence, when the policy converges, the dispersion is suppressed.

We present simulations that demonstrate the efficiency of the above algorithm on four challenging learning control problems: Ant, HalfCheetah, Hopper, and Walker2D.
The rest of the paper is organized as follows. Section 2 overviews literature related to the topic of this paper. Section 3 formulates the problem of designating the amount of randomness in an agent's policy that optimizes its behavior with RL. Section 4 presents our approach to this problem. In Section 5, simulations are discussed in which our method is compared with the RL algorithms PPO, SAC, and ACER. The last section concludes the paper.
2 Related work
Exploration in reinforcement learning.
Most existing reinforcement learning algorithms are designed to optimize policies with a fixed amount of randomness in actions. This amount is defined by a quantity such as the action standard deviation or the probability of an exploratory action. Within the common approach to RL, this quantity is tuned manually. A simple way to automate this tuning is to train the quantity as one of the policy parameters. This approach was first introduced as a part of the REINFORCE (Williams and Peng, 1991) algorithm and later applied in the Asynchronous Advantage Actor-Critic (Mnih et al., 2016) and Proximal Policy Optimization (Schulman et al., 2017) algorithms. However, a policy learned this way degrades to a deterministic one without handcrafted regularization (Mnih et al., 2016). This regularization is typically introduced as an entropy-based bonus term.
A different approach is to control exploration to increase state-space coverage. One way to achieve this goal is to reward the agent for visiting novel states. State novelty may be determined with a counting table (Tang et al., 2017) or estimated using the environment dynamics prediction error (Pathak et al., 2017; Stadie et al., 2015). Another way is to maximize the expected difference between the current policy and past policies, thus increasing the coverage of the policy space (Hong et al., 2018).
Maximum entropy RL is a different approach to optimizing exploration while keeping the policy from degrading to a deterministic one (Jaynes, 1957; Ziebart et al., 2008). In this approach, the policy is optimized with regard to a quality index that combines actual rewards and the entropy of the action distribution. The first off-policy maximum entropy RL algorithm was Soft Actor-Critic (SAC) (Haarnoja et al., 2018). It is the most prominent RL method that tunes the amount of exploration during its operation. Although the idea of achieving that by rewarding the action distribution entropy was a breakthrough, it is still a heuristic and sometimes does not work. And even when it does, it requires a handcrafted coefficient, namely the weight of the entropy. The follow-up version of SAC (Haarnoja et al., 2019) tunes this coefficient dynamically with gradient descent to approach a target entropy level. This target level is set to the negative of the action-space dimensionality, a value that works empirically on benchmark tasks but is not justified otherwise. Meta-SAC (Wang and Ni, 2020) tunes the target entropy value with a meta-gradient approach.
Existing techniques reduce the balancing of exploitation and exploration to the balancing of rewards and entropy. That generally requires problem-dependent coefficients. Within our approach, we define an independent criterion for the amount of exploration, thereby, in principle, avoiding problem-dependent settings.
Efficient utilization of previous experience in RL.
The fundamental Actor-Critic architecture of reinforcement learning was introduced in (Barto et al., 1983). Approximators were applied to this structure for the first time in (Kimura and Kobayashi, 1998). To boost the efficiency of these algorithms, experience replay (ER) (Mahadevan and Connell, 1992) can be applied, i.e., storing the events in a database and sampling them for policy updates several times per actual event. ER was combined with the Actor-Critic architecture for the first time in (Wawrzyński, 2009).
Application of experience replay to Actor-Critic creates the following problem. The learning algorithm needs to estimate the quality of a given policy based on the consequences of actions registered when a different policy was in use. Importance sampling estimators are designed to do that, but they can have arbitrarily large variances. In (Wawrzyński, 2009), that problem was addressed by truncating the density ratios present in those estimators. In (Wang et al., 2016), specific correction terms were introduced for that purpose.
The significance of sample relevance was also noted for on-policy algorithms that use an experience buffer. This class of algorithms takes another approach to the problem above: it prevents the algorithm from inducing a policy that differs too much from the one used to collect the samples. That idea was first applied in Conservative Policy Iteration (Kakade and Langford, 2002). It was further extended in Trust Region Policy Optimization (Schulman et al., 2015). This algorithm optimizes a policy with the constraint that the Kullback-Leibler divergence between that policy and the tried one should not exceed a given threshold. The KL divergence becomes an additive penalty in the Proximal Policy Optimization algorithms, namely PPO-Penalty and PPO-Clip (Schulman et al., 2017).
A way to avoid the problem of estimating the quality of a given policy based on the tried one is to approximate the action-value function instead of the value function. Algorithms based on this approach are Deep Q-Network (DQN) (Mnih et al., 2013), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). SAC uses noise as an input for calculating the policy and is considered one of the most efficient algorithms in this family.
This paper combines the Actor-Critic structure with experience replay in the old-fashioned way introduced in (Wawrzyński, 2009).
3 Problem formulation
We consider the typical RL setup (Sutton and Barto, 2018). An agent operates in its environment in discrete time $t = 1, 2, \dots$. At time $t$ it finds itself in a state, $s_t$, performs an action, $a_t$, receives a reward, $r_t$, and the state changes to $s_{t+1}$.
In this paper, we consider the actor-critic framework (Barto et al., 1983; Kimura and Kobayashi, 1998) of RL. The goal is to optimize a stationary control policy defined as follows. Actions are generated by a distribution
$a_t \sim \pi(\,\cdot \mid \mu_t, \sigma_t)$ (1)
parameterized by two vectors: $\mu_t$, which does not affect the dispersion of the action distribution, and $\sigma_t$, which does. Both are produced by approximators,
$\mu_t = \mu(s_t; \theta), \qquad \sigma_t = \sigma(s_t; \varphi)$, (2)
with $\theta$ and $\varphi$ being their vectors of parameters.
Neural-normal policy.
This policy is applicable for continuous actions, $a_t \in \mathbb{R}^d$. Both $\mu$ and $\sigma$ are neural networks with input $s_t$ and weights $\theta$ and $\varphi$, respectively.^1 The action is sampled from the normal distribution with mean $\mu(s_t; \theta)$ and covariance matrix $\mathrm{diag}(\sigma(s_t; \varphi))^2$. Thus we denote
$\pi(a \mid s; \theta, \varphi) = \mathcal{N}\big(a;\, \mu(s; \theta),\, \mathrm{diag}(\sigma(s; \varphi))^2\big)$ (3)
and generate an action as
$a_t = \mu(s_t; \theta) + \sigma(s_t; \varphi) \circ \varepsilon_t$, (4)
where $\varepsilon_t \sim \mathcal{N}(0, I)$, $\sigma(s_t; \varphi)$ is a vector of standard deviations for the different action components, and "$\circ$" denotes the Hadamard (elementwise) product.
^1 They could also be implemented as a single network outputting both $\mu$ and $\sigma$, but for brevity we use this distinction.
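The sampling rule (4) can be sketched as follows. This is a minimal illustration: `mu_net` and `sigma_net` are hypothetical single-layer stand-ins for the actual networks, not the architectures used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_net(s, theta):
    # Hypothetical mean network: a single linear layer, for illustration only.
    return theta @ s

def sigma_net(s, phi):
    # Hypothetical dispersion network; softplus keeps deviations positive.
    return np.log1p(np.exp(phi @ s))

def sample_action(s, theta, phi):
    """Draw an action via the reparameterized rule a = mu(s) + sigma(s) * eps,
    with eps ~ N(0, I); '*' is the elementwise (Hadamard) product."""
    eps = rng.standard_normal(mu_net(s, theta).shape)
    return mu_net(s, theta) + sigma_net(s, phi) * eps

s = np.array([0.5, -1.0, 2.0])   # example state (dimension 3)
theta = 0.1 * np.ones((2, 3))    # action dimension 2
phi = np.zeros((2, 3))           # sigma = softplus(0) ~ 0.69 initially
a = sample_action(s, theta, phi)
```

Sampling this way keeps the action differentiable with respect to $\theta$ and $\varphi$ for a fixed noise draw, which makes the form convenient for gradient-based policy optimization.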
Experience replay.
We assume that the policy is optimized with the use of experience replay. The agent's experience, i.e., states, actions, and rewards, is stored in a memory buffer. Simultaneously with the agent's operation in the environment, the experience is sampled from the buffer and used to optimize $\theta$ and $\varphi$.
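A minimal sketch of such a buffer; the fixed capacity and uniform sampling are the only properties assumed here, and the names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest events when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement for a policy-update batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(250):
    buf.store(t, 0.0, 1.0, t + 1)   # toy transitions
batch = buf.sample(32)
```

After 250 stored events, only the last 100 remain, mirroring the fixed-size window of experience the method relies on.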
Goal.
The vectors of weights, $\theta$ and $\varphi$, define a policy, $\pi$. The criterion of policy optimization is the maximization of the value function
$V^{\pi}(s) = E\big( \sum_{i \ge 0} \gamma^i r_{t+i} \,\big|\, s_t = s \big)$ (5)
for each state $s$; $\gamma \in (0, 1)$ is a coefficient, the discount factor.
The value function may be estimated based on so-called $n$-step returns, namely
$R^{(n)}_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n})$ (6)
for any $n \ge 1$.
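A direct way to compute such a return, assuming the standard form in which the critic's value estimate bootstraps the tail; the helper names are illustrative.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return: sum_{i<n} gamma^i * r_{t+i}  +  gamma^n * V(s_{t+n}).

    `rewards[t]` is the reward received at time t and `values[t]` the
    critic's estimate of V(s_t); both are plain lists here for illustration.
    """
    discounted_rewards = sum(gamma ** i * rewards[t + i] for i in range(n))
    return discounted_rewards + gamma ** n * values[t + n]

rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0, 10.0]
# 4-step return from t=0 with gamma=0.5:
# 1 + 0.5 + 0.25 + 0.125 + 0.5**4 * 10 = 2.5
r4 = n_step_return(rewards, values, t=0, n=4, gamma=0.5)
```

Larger $n$ propagates reward information faster but increases the variance of the estimate; $n = 1$ recovers the ordinary one-step temporal-difference target.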
The optimization of $\theta$ maximizes the probability of experienced actions that led to high rewards afterward. However, $\varphi$ should be tuned to keep a proper balance between exploration and exploitation in the agent's behavior.
4 Method
4.1 General idea
We consider an actor-critic algorithm with experience replay. The algorithm keeps a window of the last $N$ events and continuously optimizes the policy. We postulate that the dispersion of the current action distribution should be sufficient to produce actions that will enable the evaluation and selection of future policies, i.e., these actions should be likely under future policies. However, the future policies are unknown. We assume that they will be as different from the current policy as the current policy is from those that produced the actions registered in the current memory buffer.
Following the above postulate, we propose two rules to optimize $\varphi$:
 R1:
For the $i$-th event registered in the memory buffer, the mode $\bar a_i$ of the distribution that produced the action $a_i$ should be likely according to the current policy and the state $s_i$.
 R2:
For the same event, the action $a_i$ itself should be likely according to the current policy and the state $s_i$.
The above rules (illustrated in Fig. 1) are intended to have the following effects: 1) When the action distribution for a given state, $s$, changes due to learning, the current action distribution for that state is dispersed, thereby enabling the evaluation of a broad range of behaviors expected to be exercised in the course of learning. 2) However, when the policy approaches the optimum and the pace of its change decreases, the distribution becomes less dispersed, enabling more precise action choices. Eventually, the action distribution converges to a deterministic choice when the policy converges.
4.2 Operationalization
The mode of the distribution that produced the action $a_i$ is defined as
$\bar a_i = \mu(s_i; \theta_i)$ (7)
for the $\theta_i$ used at time $i$. Following the aforementioned rules R1 and R2, to optimize $\varphi$ we minimize the loss
$\mathcal{L}_i(\varphi) = -\ln \pi(\bar a_i \mid s_i; \theta, \varphi) - \alpha \ln \pi(a_i \mid s_i; \theta, \varphi)$ (8)
averaged over the events collected in the memory buffer; $\alpha \ge 0$ is a coefficient.
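To see how such a loss behaves, consider a one-dimensional sketch. The additive form with weight `alpha` below is our reading of the two rules, not necessarily the paper's exact formula; all names are illustrative.

```python
import math

def log_normal_density(x, mu, sigma):
    # Log-density of a univariate normal N(mu, sigma^2).
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def dispersion_loss(mode_old, action_old, mu, sigma, alpha):
    """One-event sketch of the dispersion loss: the negative log-likelihood
    of the stored mode plus, weighted by alpha, that of the stored action,
    both under the current policy N(mu, sigma^2)."""
    return -(log_normal_density(mode_old, mu, sigma)
             + alpha * log_normal_density(action_old, mu, sigma))

# If the policy has not moved (old mode == current mu), widening sigma
# only wastes density, so the loss favors a small sigma:
static_narrow = dispersion_loss(0.0, 0.1, mu=0.0, sigma=0.2, alpha=0.1)
static_wide = dispersion_loss(0.0, 0.1, mu=0.0, sigma=0.8, alpha=0.1)
# If the policy has moved away from the old mode, a larger sigma is
# needed to keep that mode likely:
moved_narrow = dispersion_loss(1.0, 1.1, mu=0.0, sigma=0.2, alpha=0.1)
moved_wide = dispersion_loss(1.0, 1.1, mu=0.0, sigma=0.8, alpha=0.1)
```

This reproduces the intended qualitative behavior: a fast-changing policy is pushed toward larger dispersion, while a converged one is allowed to suppress it.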
4.3 Algorithm
The algorithm presented here is Actor-Critic with Experience Replay and Adaptive eXploration, ACERAX. It is based on the actor-critic structure presented in Section 3, experience replay, and $n$-step returns. The algorithm uses a critic, $V(s; \nu)$, an approximator of the value function with weights $\nu$.
At each time $t$ of the agent-environment interaction, the tuple $\langle s_t, a_t, r_t, s_{t+1} \rangle$ is registered in the buffer. As the agent-environment interaction continues, previous experience is replayed; that is, Algorithm 1 is recurrently executed.
In Line 1, the algorithm selects an experienced event to replay. In Line 2, it determines the relative quality of the replayed actions, namely the temporal difference multiplied by a softly truncated density ratio. The policy is changing due to learning; thus the conditional distribution of actions given the state is now different than it was at the time when the actions were executed. The product of density ratios accounts for this discrepancy in distributions. To limit the variance of the density ratios, a soft-truncating function is applied, e.g.,
$\psi_b(x) = b \tanh(x / b)$ (10)
for $b > 0$. In the ACER algorithm (Wawrzyński, 2009), the hard truncation function, $\min\{x, b\}$, is used for the same purpose, namely limiting the density ratios necessary to designate updates under action distribution discrepancies. However, soft truncation distinguishes the magnitudes of density ratios beyond the threshold and may be expected to work slightly better than hard truncation.
In Line 3, an improvement direction, $\Delta\nu$, of the critic parameters is computed. It is designed to make $V(\cdot\,; \nu)$ approximate the value function better.
In Line 4, an improvement direction, $\Delta\theta$, for the actor parameter $\theta$ is computed. The increment is designed to increase or decrease the likelihood of the occurrence of the sequence of actions proportionally to its estimated relative quality. An additional penalty function discourages improper values of actions, e.g., ones exceeding a box to which the actions should belong.
In Line 5, an improvement direction, $\Delta\varphi$, for the actor parameter $\varphi$ is computed from the loss defined in (8).
The improvement directions $\Delta\theta$, $\Delta\varphi$, and $\Delta\nu$ are applied in Line 6 to update $\theta$, $\varphi$, and $\nu$, respectively, with the use of ADAM, SGD, or another method of stochastic optimization. They may be applied in minibatches, several at a time.
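The replay loop can be sketched at the level of a single critic update. Everything here (scalar linear critic stub, dummy density ratio of 1, plain SGD) is an illustrative stand-in under stated assumptions, not the paper's actual Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def value(s, nu):
    # Stub critic: linear in a scalar state, for illustration only.
    return nu * s

def replay_update_step(buffer, nu, gamma=0.98, lr=0.1, b=3.0):
    """One schematic replay step: pick an event (Line 1), form the
    temporal difference weighted by a (here dummy) softly truncated
    density ratio (Line 2), and apply an SGD step on the critic
    (Lines 3 and 6)."""
    s, a, r, s_next = buffer[rng.integers(len(buffer))]   # Line 1
    ratio = 1.0                                           # stand-in density ratio
    td = r + gamma * value(s_next, nu) - value(s, nu)     # temporal difference
    quality = td * b * np.tanh(ratio / b)                 # soft truncation
    grad_nu = quality * s                                 # Line 3: critic direction
    return nu + lr * grad_nu                              # Line 6: SGD update

# A single-transition toy buffer: from state 1.0, reward 1.0, to state 0.0.
buffer = [(1.0, 0.0, 1.0, 0.0)]
nu = 0.0
for _ in range(300):
    nu = replay_update_step(buffer, nu)
```

In this toy setting the repeated replay updates drive $V(1.0)$ toward its target of $1.0$, illustrating how the same stored event can be reused many times per actual interaction step.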
5 Experimental study
This section presents simulations whose purpose is to evaluate the ACERAX algorithm introduced in Sec. 4. In our first experiment, we determine the algorithm's sensitivity to its coefficient $\alpha$ and find its approximately optimal value across several RL problems. In the second experiment, we look at the action standard deviations the algorithm determines and compare them with constant action standard deviations optimized manually. We call the algorithm with constant, approximately optimal action standard deviations ACER, as it differs only in details from the algorithm presented in (Wawrzyński, 2009) under that name. In the third experiment, we compare ACERAX to two state-of-the-art RL algorithms: Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Proximal Policy Optimization (PPO) (Schulman et al., 2017). We use the Stable Baselines3 implementation (Raffin and Stulp, 2020) of SAC and PPO in the simulations. Our experimental software is available online (the link is provided in the final version of the paper).
For the comparison of the RL algorithms to be the most informative, we chose four challenging tasks inspired by robotics, namely Ant, Hopper, HalfCheetah, and Walker2D (see Fig. 2) from the PyBullet physics simulator (Coumans and Bai, 2019).
5.1 Experimental settings
Each learning run lasted for 3 million timesteps. Every 30000 timesteps, the training was paused, and simulations were run with frozen weights and without exploration for 5 test episodes. The average sum of rewards within a test episode was registered. Each run was repeated five times.
We have taken the implementations of SAC and PPO from (Raffin et al., 2021) and the hyperparameters that assure the state-of-the-art performance of these algorithms from (Raffin and Stulp, 2020). For each environment, hyperparameters of ACER and ACERAX, such as step-sizes, were optimized to yield the highest ultimate average rewards. The values of these hyperparameters are reported in the appendix.
5.2 Searching for $\alpha$
For each environment, we tried several values of $\alpha$. The results are depicted in Fig. 3. It is seen that the learning is not very sensitive to this parameter. A positive $\alpha$ means that, in addition to the previous modes of the action distributions, the previous actions themselves are made likely according to the current policy, which is intended as a form of regularization: it prevents the policy from degrading to a deterministic one too early. The experiments suggest that performance is almost insensitive to this parameter within a broad range. We fix a single value of $\alpha$ for the rest of the experiments.
5.3 Different initial action standard deviations
We have optimized, in a grid search, constant action standard deviations for ACER in these environments. The same value was selected in all environments. In another experiment, we verify the average outputs of the $\sigma$ network depending on its initialization. To this end, we impose different initial biases on the output neurons of this network. Afterward, we register the outputs of this network over time. Fig. 4 presents the trajectories of the average outputs. It is seen that, regardless of the initialization, these outputs converge quite close to the manually optimized value. We also see that the standard deviations differ across action components, which justifies the need to designate the standard deviations of different action dimensions separately.
5.4 Comparison of ACERAX, ACER, PPO, and SAC
Fig. 5 presents learning curves for all four environments for SAC, PPO, ACER, and our proposed ACERAX. It is seen that all algorithms exhibit a similar speed of learning and similar ultimate rewards, with ACERAX performing very well according to both criteria. (In fact, ACERAX yields the best average ultimate performance in 3 out of 4 environments, but this result is not statistically significant.) However, the reference algorithms control the exploration-exploitation balance with heuristic or problem-dependent coefficients. In ACERAX, this balance is based on a universal principle that is somewhat affected by the coefficient $\alpha$ but is not very sensitive to it.
5.5 Discussion
Exploration is necessary for reinforcement learning, but it inevitably deteriorates currently expected rewards. The best previous approaches to tuning the amount of exploration online are based on additional rewards for the action distribution entropy. But the scale of these rewards is generally problem-dependent, as it is never known in advance what fraction of the original rewards needs to be traded for action distribution entropy to assure proper exploration.
The approach analyzed here is based on the assumption that the policy is optimized on the data in the replay buffer. The current action distribution dispersion should be sufficient to support policy evaluation and selection when these actions are replayed from the buffer in the future.
Our approach generally yielded good performance in the experiments. That happened at the cost of an additional neural network, $\sigma$, that controlled the action distribution dispersion. Our experiments suggest that this network should be much smaller than the $\mu$ network; here it had ten times fewer neurons in each layer. We interpret this condition as follows: it should be impossible for the $\sigma$ network to overfit to the experience, as that would result in nearly zero action distribution dispersion for some states.
6 Conclusions
The exploration/exploitation balance is a fundamental problem in reinforcement learning. In this paper, we analyzed an approach to the adaptive designation of the amount of randomness in the action distribution. In this approach, the current policy is optimized to assign high probability density to the modes of the action distributions recorded in the replay buffer. Consequently, current actions are likely to support the evaluation and selection of future policies. Furthermore, this strategy diminishes the randomness in actions when the policy converges, giving the agent increasing control over its actions. The RL algorithm based on this strategy introduced here, ACERAX, was verified on four challenging robotic-like benchmarks: Ant, HalfCheetah, Hopper, and Walker2D, with good results. Our method makes the action standard deviations converge to values similar to those resulting from trial-and-error optimization.
Our proposed strategy for optimizing the dispersion of the action distribution is based on the maximization of weighted log-densities of previous modes of action distributions and previous actions. The optimal weights are potentially problem-dependent, although the strategy appears to be barely sensitive to them. Eliminating any potentially problem-dependent coefficients from this strategy is an interesting direction for further research.
References
 Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834–846.
 Coumans and Bai (2019) Coumans, E. and Bai, Y. (2016–2019). PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290.
 Haarnoja et al. (2019) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. (2019). Soft actor-critic algorithms and applications. arXiv:1812.05905.
 Hong et al. (2018) Hong, Z.-W., Shann, T.-Y., Su, S.-Y., Chang, Y.-H., Fu, T.-J., and Lee, C.-Y. (2018). Diversity-driven exploration strategy for deep reinforcement learning. In Neural Information Processing Systems (NeurIPS).
 Jaynes (1957) Jaynes, E. T. (1957). Information theory and statistical mechanics. ii. Physical Review, 108:171–190.
 Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), pages 267–274.
 Kimura and Kobayashi (1998) Kimura, H. and Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In International Conference on Machine Learning (ICML).
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In ICLR.
 Mahadevan and Connell (1992) Mahadevan, S. and Connell, J. (1992). Automatic programming of behavior based robots using reinforcement learning. Artificial Intelligence, 55(2–3):311–365.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. arXiv:1602.01783.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602.
 Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. arXiv:1705.05363.
 Raffin et al. (2021) Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. (2021). Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8.
 Raffin and Stulp (2020) Raffin, A. and Stulp, F. (2020). Generalized state-dependent exploration for deep reinforcement learning in robotics. arXiv:2005.05719.
 Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. arXiv:1502.05477.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
 Stadie et al. (2015) Stadie, B., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814.
 Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
 Tang et al. (2017) Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2017). #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv:1611.04717.
 Wang and Ni (2020) Wang, Y. and Ni, T. (2020). Meta-SAC: Auto-tune the entropy temperature of soft actor-critic via metagradient. arXiv:2007.01932.
 Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv:1611.01224.
 Wawrzyński (2009) Wawrzyński, P. (2009). Realtime reinforcement learning by sequential actor–critics and experience replay. Neural Networks, 22(10):1484–1497.
 Williams and Peng (1991) Williams, R. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 1433–1438.
Appendix A Algorithms’ hyperparameters
Hyperparameters of SAC and PPO have been taken from Raffin and Stulp (2020). They are presented in Tab. 1 and 2, respectively.
The discount factors used are listed in the tables. Common hyperparameters of ACER and ACERAX are presented in Tab. 3. When possible, we have adopted the hyperparameters of these algorithms from SAC. These include the actor and critic sizes and the parameters that define the intensity of experience replay. Step-sizes have been selected by grid search, as have the constant action standard deviations for ACER. Different action components for the same environment have the same standard deviation, and the same value was selected for all environments.
The $\sigma$ network used in the ACERAX algorithm had 40 and 30 neurons in its two hidden layers. The step-sizes for this network are presented in Table 4. The biases of the network output layer were initialized as reported there, which determines the initial standard deviation of the action components.
Table 1: Hyperparameters of SAC.
Parameter  Value 
Actor size  
Critic size  
Discount factor  0.98 
Replay buffer size  
Minibatch size  256 
Entropy coefficient  auto 
Target entropy  
Learning start  
Gradient steps  8 
Train frequency  8 
Stepsize  
Initial 
Table 2: Hyperparameters of PPO.
Parameter  Value 
Actor size  
Critic size  
Discount factor  0.99 
GAE parameter  0.9 
Minibatch size  128 
Horizon  512 
Number of epochs  20 
Stepsize  
Entropy param.  0.0 
Clip range  0.4 
Initial log 
Table 3: Common hyperparameters of ACER and ACERAX.
Parameter  Value 
Actor size  
Critic size  
10  
3  
Memory size  
Minibatch size  256 
Gradient steps  1 
Discount factor  0.98 
Ant  
Actor stepsize  
Critic stepsize  
HalfCheetah  
Actor stepsize  
Critic stepsize  
Hopper  
Actor stepsize  
Critic stepsize  
Walker2D  
Actor stepsize  
Critic stepsize 
Table 4: Hyperparameters of the $\sigma$ network in ACERAX.
Parameter  Value 
size  
Initial output bias  1 
Ant stepsize  
HalfCheetah stepsize  
Hopper stepsize  
Walker2D stepsize 
Appendix B Computational resources
At the stage of code debugging, we used an external cluster. For the actual experimental study, we used a PC equipped with an AMD Ryzen Threadripper 1920X for about 7 weeks.