Marginalized State Distribution Entropy Regularization in Policy Optimization

12/11/2019 ∙ by Riashat Islam, et al. ∙ McGill University

Entropy regularization is used to improve optimization performance in reinforcement learning tasks. A common form of regularization is to maximize policy entropy, which avoids premature convergence and leads to more stochastic policies for exploration through the action space. However, this does not ensure exploration in the state space. In this work, we instead consider the distribution of discounted weighting of states, and propose to maximize the entropy of a lower bound approximation to the weighting of a state, based on a latent space state representation. We propose entropy regularization based on the marginal state distribution, to encourage the policy to have a more uniform distribution over the state space for exploration. Our approach based on the marginal state distribution achieves superior state space coverage on complex gridworld domains, which translates into empirical gains in sparse reward 3D maze navigation and continuous control domains compared to entropy regularization with stochastic policies.




1 Introduction

A key ingredient of a successful reinforcement learning algorithm is sufficient exploration of the environment. This is particularly important when rewards that the environment provides are sparse. In policy optimization, entropy regularization is added to the objective to prevent the policy from prematurely converging to a deterministic policy (Mnih et al., 2016; Haarnoja et al., 2018) leading to improved optimization performance (Ahmed et al., 2019). Simply regularizing by the policy entropy induces a stochastic policy via the action space. However, exploration in action space does not imply exploration in state space: This policy would perform a random walk in environments with sparse rewards.

In contrast, an effective exploration policy should also seek to maximize coverage of the state space. Most recently, Hazan et al. (2018) proposed a framework that maximizes exploration by maximizing the entropy of the discounted stationary state distribution induced by a policy, $d_{\pi}$. While the guarantee is useful, their technique relies on having access to an (approximate) model and requires state discretization in continuous state space environments. An alternative is to compute the normalized discounted weighting of a state, commonly known as the discounted future state distribution. However, as we discuss in later sections, this is also difficult to compute, as it requires estimating the probability of a state with a $(1 - \gamma)$ probability of termination. Additionally, estimating the entropy of the discounted occupancy measure would require separate density estimates, while the entropy of the stationary distribution requires knowledge of the environment transition dynamics.

In this work, we propose a practical model-free approach to estimating the distribution over the discounted weighting of states, for entropy regularization with the state distribution. Since the normalized discounted weighting is difficult to compute, we instead estimate the marginal state distribution, which measures the probability of being in state $s$ at time $t$ and depends on the policy parameters. We propose a lower bound approximation to this distribution, based on a latent space representation of the state, and use the entropy of the latent representation for entropy regularization. We hypothesize that maximizing the entropy of the marginal state distribution can be an effective exploration method that maximizes coverage of the state space, and propose a computationally feasible algorithm based on a variational approach.

Effective exploration in sparse reward domains is a challenging problem and several heuristics have been proposed to incentivize the agent to visit unseen regions of the state space. These methods introduce explicit reward shaping bonuses for exploration (Bellemare et al., 2016; Pathak et al., 2017; Ostrovski et al., 2017; Machado et al., 2018) but do not explicitly regularize policies to maximize an exploration objective. Specifically, there is no gradient of the exploration bonus with respect to the policy parameters. In policy optimization, regularization is a simple way to extract specific behaviours from policies (Bachman et al., 2018; Goyal et al., 2019; Eysenbach et al., 2019; Pong et al., 2019).

In this work, we investigate the following question: Can maximizing entropy of marginal state distribution for regularization be useful as a simple exploration strategy? Our contributions are:

  • To justify the maximization of state space coverage, we first study the use of the discounted state distribution in episodic settings as an effective entropy regularizer. These results complement Hazan et al. (2018).

  • We discuss why computing the normalized discounted weighting of a state, or the discounted future state distribution is difficult in practice. We then propose an approximation to the discounted weighting of a state, by instead computing a lower bound representation of the discounted weighting. We show that we can tractably compute the discounted per state weighting, or equivalently the marginal state distribution, based on latent state representation.

  • We then propose to use the entropy of the marginal state distribution, for entropy regularization with the approximation in policy optimization. By introducing a tractable algorithm, we show that it induces a policy that maximizes state space coverage and study other qualitative and quantitative behaviours in Grid Worlds and maze tasks.

  • In more complicated domains, we demonstrate that the maximum marginal state entropy regularization objective can induce an effective exploration strategy. In particular, it shows improved performance across a wide range of tasks, including challenging maze navigation domains with sparse reward, partially observable environments, and even dense reward continuous control tasks.

2 Preliminaries

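In reinforcement learning, we aim to find a policy, $\pi$, that maps a state, $s$, to a distribution over actions, $\pi(a \mid s)$, such that it maximizes the cumulative discounted return, $J(\pi) = \mathbb{E}_{s \sim d_{\pi},\, a \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$, where $d_{\pi}$ is the stationary state distribution of the policy, $\gamma \in [0, 1)$ is a discount factor favouring immediate rewards and $r_{t}$ is the reward at time $t$ in the trajectory. In the infinite horizon setting, the state distribution is the normalized distribution given by $d_{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} d_{t,\pi}(s)$. Here $d_{t,\pi}(s)$ is the per-step state distribution induced by a policy $\pi$. In policy gradient methods the goal is to find a parameterized policy, $\pi_{\theta}$, that maximizes the discounted cumulative return, $J(\theta)$. A practical algorithm is given by following the policy gradient, $\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d_{\pi_{\theta}},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q_{\pi_{\theta}}(s, a)\right]$ (Sutton et al., 1999). Maximum entropy based objectives (Schulman et al., 2017a; Ahmed et al., 2019) augment the reward with a policy entropy term, $\mathcal{H}(\pi_{\theta}(\cdot \mid s))$, to help with exploration by avoiding premature convergence of $\pi_{\theta}$ to a deterministic policy. The entropy regularized objective is $J_{\lambda}(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \left(r_{t} + \lambda\, \mathcal{H}(\pi_{\theta}(\cdot \mid s_{t}))\right)\right]$, where $\lambda$ determines the relative importance of the entropy term against the reward. The gradient of $J_{\lambda}(\theta)$ is given by $\mathbb{E}_{s \sim d_{\pi_{\theta}},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\lambda}_{\pi_{\theta}}(s, a) + \lambda \nabla_{\theta} \mathcal{H}(\pi_{\theta}(\cdot \mid s))\right]$, where $Q^{\lambda}_{\pi_{\theta}}(s, a)$ is the action-value function describing the expected discounted entropy augmented return obtained by executing action $a$ in state $s$ and then acting on-policy by sampling $a \sim \pi_{\theta}(\cdot \mid s)$.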

One could imagine that in addition to maximizing the policy entropy, $\mathcal{H}(\pi_{\theta})$, we can also maximize the entropy of the state distribution, $\mathcal{H}(d_{\pi_{\theta}})$. The main focus of our work is to propose a tractable approach to maximizing this entropy via the entropy of the per-step state distribution (Section 3.2) and a variational approximation (Section 3.3).

3 Approach: Entropy Regularization with Marginal State Distribution

3.1 Marginal State Distribution

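To understand the meaning of the marginal state distribution, let us first recall that the exact solution for the policy evaluation equation in vector form is given by $V_{\pi} = (I - \gamma P_{\pi})^{-1} r$, where $r$ is the reward and $P_{\pi}$ defines the dynamics of the Markov chain. The inverse of $(I - \gamma P_{\pi})$ can be written as $(I - \gamma P_{\pi})^{-1} = \sum_{t=0}^{\infty} \gamma^{t} P_{\pi}^{t}$, also known as the Neumann series, and each row of $P_{\pi}^{t}$ is the distribution over next states $t$ steps in the future. Therefore, we can write the probability of the agent being in state $s'$, $t$ steps in the future, as $P(s_{t} = s' \mid s_{0} = s) = P_{\pi}^{t}(s, s')$, and each entry of the inverse matrix can be written as $\sum_{t=0}^{\infty} \gamma^{t} P(s_{t} = s' \mid s_{0} = s)$.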
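As a quick numerical check of the Neumann series identity above, the following minimal sketch builds a small, hypothetical three-state Markov chain (the transition matrix `P` and the discount `gamma` are illustrative assumptions, not values from our experiments), sums the truncated series, and compares it against the defining property of the inverse. It also shows that every row of the summed series adds up to roughly 1/(1 - gamma) rather than 1.

```python
# Hypothetical 3-state Markov chain induced by some fixed policy.
# Each row is a distribution over next states (assumed values).
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
gamma = 0.9  # assumed discount factor


def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(P, gamma, horizon):
    """Truncated series sum_{t < horizon} gamma^t P^t."""
    n = len(P)
    total = [[0.0] * n for _ in range(n)]
    power = identity(n)  # P^0
    weight = 1.0         # gamma^0
    for _ in range(horizon):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = matmul(power, P)
        weight *= gamma
    return total


M = neumann_sum(P, gamma, horizon=500)

# (I - gamma P) M should be close to the identity matrix.
I_minus_gamma_P = [[(1.0 if i == j else 0.0) - gamma * P[i][j]
                    for j in range(3)] for i in range(3)]
residual = matmul(I_minus_gamma_P, M)

# Each row of M sums to about 1 / (1 - gamma), so M is not a distribution.
row_sums = [sum(row) for row in M]
print(row_sums)
```

The row sums are the reason a normalization by (1 - gamma) is needed before the discounted weighting can be treated as a probability distribution.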

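Furthermore, as in Sutton et al. (1999), the policy gradient discounted objective given an initial state distribution $\alpha$ is given by $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} \sim \alpha, \pi_{\theta}\right]$. Given the distribution of initial states, we can define the discounted weighting of states as $d_{\pi_{\theta}}(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_{t} = s \mid s_{0} \sim \alpha, \pi_{\theta})$, such that the equivalent definition of the policy gradient objective is $J(\theta) = \sum_{s} d_{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(a \mid s)\, r(s, a)$. The discounted weighting of states can further be written as a recursive expression, similar to the policy evaluation equations, where $d_{\pi_{\theta}}$ plays the role of the value function $V_{\pi_{\theta}}$. The recursive expression for the discounted weighting of states can therefore be written as an expectation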


Note that the value function is also defined given a starting state, or as an expectation where the states are drawn from a starting state distribution $\alpha$, since $V_{\pi}(s_{0}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0}, \pi\right]$. We can therefore write the policy gradient objective as an equivalent expectation with respect to the starting state distribution, $J(\theta) = \mathbb{E}_{s_{0} \sim \alpha}\left[V_{\pi_{\theta}}(s_{0})\right]$. Since both the discounted weighting of states, given in equation 1, and the value function are defined with respect to the starting state, we can further write a joint expectation over a value function with augmented rewards. We define the probability of being in state $s$ at time $t$ as the marginal discounted weighting of states. Further, from equation 1, we refer to this per-step term as the marginal state distribution, so that the discounted weighting of states is a discounted sum of these per-step terms over time.


The policy gradient objective is then expressed through value functions with augmented rewards. The intuition for using such augmented rewards is similar to count-based occupancy measures, for example, where the agent is provided an exploration bonus for states that are visited less often. However, the key difference is that here the occupancy measure depends on the policy directly, as it derives from the discounted weighting of future states under the policy $\pi$.

However, from equation 1 note that the discounted weighting of states is not in fact a distribution, since the rows of $(I - \gamma P_{\pi})^{-1}$ do not sum to 1, but instead sum to $1/(1 - \gamma)$. Therefore, to consider the distribution of discounted weighting of states, we would need to consider the normalized variant, obtained by scaling the weighting by $(1 - \gamma)$. We would therefore also need to consider the normalized variant of the marginal discounted weighting of states, which we define as the marginal state distribution. This means that instead of augmenting rewards with the unnormalized weighting, we would need to augment them with its normalized counterpart. This, however, is quite difficult in practice, since in order to estimate the normalized weighting we would need to estimate the probability of being in a state with a $(1 - \gamma)$ probability of termination. In other words, for the normalized discounted weighting of states defined above, we would now need to sample, upon entering the next state, from the geometric distribution induced by $\gamma$ to decide whether to terminate the trajectory prematurely, with a probability of $(1 - \gamma)$, or not.
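The termination-based view of the normalized discounted weighting can be illustrated with a minimal sketch. It assumes a small, hypothetical tabular chain (the names `P`, `alpha` and `gamma` and their values are illustrative assumptions) where the exact quantity is available, and compares it with states sampled by stopping each rollout with probability 1 - gamma after every step. In large or continuous problems the exact sum is unavailable and only the sampling route remains, which is what makes the normalized quantity hard to estimate.

```python
import random

# Hypothetical chain under a fixed policy; rows are next-state distributions.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
alpha = [1.0, 0.0, 0.0]  # assumed initial state distribution
gamma = 0.9              # assumed discount factor
n_states = len(P)


def normalized_discounted_weighting(alpha, P, gamma, horizon=500):
    """(1 - gamma) * sum_t gamma^t * Pr(s_t = s), truncated at `horizon`."""
    n = len(P)
    dist = list(alpha)            # Pr(s_t = .), starting at t = 0
    total = [0.0] * n
    weight = 1.0 - gamma          # (1 - gamma) * gamma^0
    for _ in range(horizon):
        total = [acc + weight * p for acc, p in zip(total, dist)]
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= gamma
    return total


def sample_state(alpha, P, gamma, rng):
    """Roll out the chain, terminating with probability 1 - gamma per step."""
    states = list(range(len(P)))
    s = rng.choices(states, weights=alpha)[0]
    while rng.random() < gamma:   # continue with probability gamma
        s = rng.choices(states, weights=P[s])[0]
    return s                      # state at the (geometric) stopping time


rng = random.Random(0)
num_samples = 20000
counts = [0] * n_states
for _ in range(num_samples):
    counts[sample_state(alpha, P, gamma, rng)] += 1

empirical = [c / num_samples for c in counts]
exact = normalized_discounted_weighting(alpha, P, gamma)
print(exact, empirical)
```

Unlike the unnormalized weighting, the `exact` vector sums to one, and the empirical frequencies from terminated rollouts approach it as the number of samples grows.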

3.2 Marginalized State Distribution Entropy Regularization

Instead of augmenting rewards with the weighting of states directly, we can consider its entropy. That is, rather than value functions with weighting-augmented rewards, we consider entropy augmented rewards, similar to entropy regularized policies (Mnih et al., 2016). We can therefore consider regularized value functions with the entropy of the marginal state distribution. We can re-write equation 2 with entropy augmented value functions, with a weighting coefficient and taking the expectation over all states along the trajectory, to get the policy gradient objective with parameterized policies given by


We can therefore regularize the policy gradient objective with the entropy of the marginal state distribution. As with policy entropy regularization, we augment the reward with the entropy of the marginal state distribution, with weighting terms on both the policy entropy and the state entropy. (Footnote 1: Observe that, unlike other reward bonuses, both entropy terms depend on the policy parameters, $\theta$, and will therefore have non-zero gradients with respect to $\theta$.) For simplicity, we will focus on the second term in the following derivations. Similar to count based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), this reward term would incentivize policies which increase the state entropy and therefore the diversity of states visited.

Using this reward bonus results in a slightly different policy gradient objective, in which the gradient of the entropy of the marginal state distribution, can act as regularizer, leading to the state entropy regularized policy gradient update:


where the action-value term is the cumulative discounted augmented reward from state $s$, taking action $a$ and then acting on-policy according to $\pi_{\theta}$. The derivation of the marginal state entropy regularized policy gradient is given in Appendix S.2.

However, note that in equations 3 and 4, we have defined the objective with respect to the unnormalized version of the marginal discounted weighting of states. To obtain the marginal state distribution, we would instead need the normalized marginal probability of a state, which turns the weighting into a distribution. This is intractable in practice for two reasons: (a) we cannot estimate the probability with the normalization factor $(1 - \gamma)$, since it means computing the marginal probability under termination with a probability of $(1 - \gamma)$; (b) for continuous state spaces and in the case of function approximation, we cannot get exact estimates of the marginal probability of being in state $s$ at time $t$, which depends on the policy $\pi_{\theta}$. Computing the marginal state distribution for time-step $t$ therefore requires an intractable integral across all the time-steps of a trajectory.

In the next section, we discuss how we can compute an approximation to the marginal state distribution, which changes with the policy $\pi_{\theta}$, by introducing a variational approximation (Kingma and Welling, 2014).

3.3 Approximation to Marginal State Distribution Entropy

In practice, it is difficult to compute the normalized marginal probability of a state, which depends on the policy parameters $\theta$. We therefore introduce a variational approximation to the marginal state distribution. In particular, we assign a probability distribution over each state by using an encoder to map each state $s$ to a latent representation, $z$, of that state. This can be achieved by defining a policy network which outputs both the action distribution and the latent representation, so that the output of the policy decomposes into these two parts. That is, we define a policy function which, in addition to mapping states to actions, also maps states to a latent representation $z$. Informally, this means that for each state visited at time $t$ under the current policy parameterization $\theta$, we compute the corresponding marginal probability of being in that state. However, since this probability cannot be computed exactly for the marginal state distribution, we instead compute a lower bound approximation, i.e., a latent state representation, and compute the marginal probability distribution in the latent space. Using the variational approximation, we can now re-define the discounted weighting of states based on the latent representation, which is a lower bound approximation to the discounted weighting previously given in equation 1


Therefore, based on the latent representation, we can again define the normalized latent state distribution relating to the marginal latent state distribution as . The marginal state distribution is therefore given by .

In marginal state entropy regularized policy gradients with parameterized policies $\pi_{\theta}$, we therefore require a tractable approximation to the entropy of the marginal state distribution. The variational entropy gives an approximation to the marginal state distribution entropy, and therefore we can maximize this approximation instead. This approximation, however, may have drawbacks from a theoretical standpoint, since the variational and the true marginal distributions may decouple the assigned distributions over a given state $s$. However, we find that maximizing the lower bound still maximizes the marginal state entropy and leads to benefits with exploration in practice. We argue that this approach still provides a suitable approximation to the policy dependent marginal state distribution, which may otherwise be difficult to compute exactly. Using the variational approximation, we get a lower bound to the objective in equation 3.


From the above equation, we get the following approximation to the policy gradient with the variational state distribution entropy, where the entropy term is the variational marginal entropy of the latent state representation and acts as a regularizer in the policy gradient update, since the entropy term per state relates to the changes in the policy.



Our algorithm is summarized in Algorithm 1. As described in Section 3.3, we use a policy architecture that outputs both the action probabilities and the encoded latent state representation of states.

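Require: A policy $\pi_{\theta}$ and the regularization coefficients.
Require: The number of episodes, $N$, and update interval, $M$.
  for episode $= 1$ to $N$ do
     Take action $a_{t}$, get reward $r_{t}$ and observe next state $s_{t+1}$
     Store tuple ($s_{t}, a_{t}, r_{t}, s_{t+1}$) as trajectory rollouts or in replay buffer
     if episode $\bmod\ M = 0$ then
        Update policy parameters $\theta$ following any policy gradient method, according to the variational state entropy regularized policy gradient
     end if
  end for
Algorithm 1 Regularization with Entropy of Marginal State Distribution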
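The objective optimized in Algorithm 1 can be sketched per sample as follows. This is only an illustration under assumptions that the text above does not fix: we assume a discrete action space, an encoder head that outputs the log standard deviations of a diagonal Gaussian over the latent representation (in the style of Kingma and Welling, 2014), and hypothetical names `lambda_policy` and `lambda_state` for the two regularization coefficients. The closed-form Gaussian entropy stands in for the variational marginal state entropy.

```python
import math


def categorical_entropy(probs):
    """Entropy of a discrete action distribution pi(. | s)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diag_gaussian_entropy(log_std):
    """Closed-form entropy of a diagonal Gaussian over the latent z.

    Assumption: the encoder head outputs one log standard deviation per
    latent dimension; the paper text does not fix this parameterization.
    """
    const = 0.5 * math.log(2.0 * math.pi * math.e)
    return sum(ls + const for ls in log_std)


def regularized_surrogate(log_prob_action, advantage, action_probs,
                          latent_log_std, lambda_policy, lambda_state):
    """Per-sample surrogate to be maximized (all names are hypothetical).

    policy-gradient term + lambda_policy * H(pi) + lambda_state * H(q(z | s))
    """
    pg_term = log_prob_action * advantage
    policy_bonus = lambda_policy * categorical_entropy(action_probs)
    state_bonus = lambda_state * diag_gaussian_entropy(latent_log_std)
    return pg_term + policy_bonus + state_bonus


# Illustrative numbers only: a uniform policy over four actions.
action_probs = [0.25, 0.25, 0.25, 0.25]
common = dict(log_prob_action=math.log(0.25), advantage=1.0,
              action_probs=action_probs,
              lambda_policy=0.01, lambda_state=0.1)

# A wider latent distribution has higher entropy and a larger surrogate.
narrow = regularized_surrogate(latent_log_std=[-1.0, -1.0], **common)
wide = regularized_surrogate(latent_log_std=[0.0, 0.0], **common)
print(narrow, wide)
```

In an actual implementation the surrogate would be built with an automatic differentiation library, so that gradients of both entropy terms flow into the shared policy parameters.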

4 Experiments

In this section, we demonstrate the usefulness of state entropy regularization, specifically marginal state entropy regularization, for exploration. We provide a detailed experimental setup for each section as well as reproducibility efforts in Appendix S.4. In the experimental results below, MaxEntPolicy denotes regularization with the policy entropy only, and MaxEntState denotes regularization with the marginal state entropy in addition to the policy entropy. (Footnote 2: Unless specified, all our baselines use .)

The goal of our experiments is to demonstrate that a simple idea and regularization method for marginal state distribution entropy maximization can lead to better performance on a range of sparse reward tasks. Firstly, in Section 4.1, we verify the hypothesis from Hazan et al. (2018) that maximizing the entropy of the stationary state distribution can lead to faster learning when computing the exact policy gradient. In Section 4.2 we show that our tractable approximation to maximizing the marginal state entropy, , can induce a policy which will maximize state coverage, leading to better exploration of the state space in a variety of gridworlds, compared to maximizing policy entropy. Finally we show that the increased state space coverage translates into empirical gains on a variety of sparse reward navigation tasks (Section 4.3) and continuous control environments (Section 4.4). These results demonstrate the effectiveness of using the marginal state entropy as a regularizer that induces exploration.

4.1 Does Regularizing Entropy of the State Distribution Make Sense?

In this section, we expand the results from Hazan et al. (2018) to validate the hypothesis that regularizing policy gradients with the entropy of the state distribution can be useful for improved performance.

Experimental Setup: We use Frozen Lake (Brockman et al., 2016), where the entire state space is enumerable, to obtain the reward and transition dynamics. In this environment, we can compute the exact policy gradient with the state distribution entropy and the policy entropy, each weighted by its own regularization parameter.

Result: We find that using state distribution entropy in addition to policy entropy can lead to faster learning compared to either in isolation (Figure 1(a)). Furthermore, we found that while policy entropy performed quite well (Figure 1(a), and 1(a) in Appendix) with respect to the solution found, it required a decay on its regularization coefficient which was difficult to find (Figure 1(c)). In contrast, state distribution entropy had a stable performance for a wide range of hyperparameters (Figure 1(b) and 1(d)) despite reaching a suboptimal solution. This is not surprising, as the optimal solution will not be the one that maximizes state space coverage. Additionally, state distribution entropy regularization has a stabilizing interactive effect on policy entropy regularization (Figure S3 in the appendix).

Figure 1: Proof-of-concept experiments indicating the utility of maximizing state distribution entropy and our corresponding approximation, marginal state distribution entropy, on FrozenLake: (a) A combination of state distribution entropy, , with , and policy entropy, , with , performs best in an exact policy gradient algorithm (b) Adding the marginal state entropy regularization improves upon the performance compared with the policy entropy baseline, . See Figure S2 and S3 for a closer analysis of and

4.2 State Space coverage in Complex GridWorlds

The maximization of state distribution entropy should lead to a more uniform coverage of the state space. In the following section, we empirically verify increased state space coverage compared to the baseline of maximizing policy entropy in Grid World domains.

Experimental Setup: To confirm our intuitions about state space coverage, we use three environments: (1) Pachinko where a simple grid world is periodically dotted with impassable walls; (2) Double-slit where the agent starts at the extreme bottom left and must navigate three rooms with small doors to reach a goal at the extreme top right; and (3) Four rooms (Schaul et al., 2015). The agents for (1) and (2) are trained with Reinforce (Williams, 1992) while the agent for (3) is trained using a simple actor-critic method with GAE (Konda and Tsitsiklis, 2000; Schulman et al., 2015).

Results: Our first result provides evidence for the validity of our approximation of the state distribution. When using MaxEntPolicy the agent does not move far from the starting position (Figure 2(a)), indicating that a more random policy will not successfully navigate beyond a few walls. In contrast, MaxEntState (Figure 2(b)) is able to reach a larger portion of the state space, providing empirical validation that the technique encourages the policy to navigate into a wider area of the world. In the double-slit environment, MaxEntPolicy finds a mostly random policy and only ends up visiting a small region of the grid, barely making it to the other rooms (Figure 2(c)). In contrast, MaxEntState successfully finds a policy that reaches the goal state in the last room (Figure 2(d)).

(a) MaxEntPolicy
(b) MaxEntState
(c) MaxEntPolicy
(d) MaxEntState
Figure 2: Improved state space coverage on a complex gridworld leads to better policies: A Pachinko world demonstrates that (a) using MaxEntPolicy regularization produces a policy that centers around the starting position while (b) MaxEntState leads to a more dispersed policy with wider state space coverage. In a double-slit environment, the agent must navigate three rooms to reach the goal. (c) MaxEntPolicy results in a random walk policy that leaves the first room but never makes it into the third, whereas (d) MaxEntState successfully reaches the goal in the third room. We highlight the fact that the agent would never be able to reach this room without leaving the second room. Light colored regions represent areas where a trained policy spent time in the grid.

To quantify whether this improved exploration translates to improved learning speed, we examine both the qualitative heatmaps and the return on the four rooms domain during learning. Consistent with the other environments, the final policy learned using MaxEntState (Figure 3(b)) has an increased state space coverage compared to a policy learned with MaxEntPolicy alone (Figure 3(a)). These qualitative results translate into quantitative improvements in learning performance.

(a) MaxEntPolicy
(b) MaxEntState
(c) Learning Curves
Figure 3: MaxEntState has improved exploration and sample efficiency in the Four Rooms Domain. Policies trained with MaxEntState in (b) visit a larger region of the grid compared to MaxEntPolicy in (a). The cumulative returns plot in (c), comparing MaxEntState regularization with baselines (count based exploration and MaxEntPolicy), shows improved sample efficiency.

4.3 MaxEntState Improves Performance on Partially Observed Sparse Reward Tasks

The results from Section 4.2, and in particular Figure 2, suggest that agents which maximize state distribution entropy do indeed maximize state space coverage. We expect that these qualitative traits provide empirical performance improvements in complicated sparse-reward and partially observable domains. In this section, we quantitatively measure performance improvements in deep RL by considering a variety of partially observed 2D and 3D tasks.

Experimental Setup: To measure quantitatively the impact of using MaxEntState, we consider a range of partially observable sparse reward 2D environments generated with MiniGrid (Chevalier-Boisvert et al., 2018) and 3D environments generated with MiniWorld (Chevalier-Boisvert, 2018). (Footnote 3: These environments are alternatives to the VizDoom or DeepMind Lab environments.) In MiniGrid, the environments are set up as 2D grids and the agent only receives a small area as an observation (Figure S6). In MiniWorld, the agent receives a high dimensional image of a 3D space and then must learn to navigate a complex maze consisting of rooms, doors and hallways. Both environments are challenging to solve in practice because a positive reward is only given at the end of a successful episode. For both the MiniGrid and MiniWorld environments, we use both A2C (Mnih et al., 2016) and PPO (Schulman et al., 2017b) policy gradient algorithms.

Figure 4: MaxEntState improves learning speed on challenging 2D partially observable environments. With MaxEntState (blue), the agent is quickly able to find the rewards in (a) PutNearSXNY, (b) DoorKeySXNY and (c) MultiRoomSXNY compared to using MaxEntPolicy alone (Baseline, orange). SX denotes the sizes of the rooms and NY denotes the number of rooms. PutNear involves placing objects near a target while DoorKey involves finding a key before navigating into a room with the goal.
Figure 5: MaxEntState provides improved exploration to solve difficult, partially observable sparse reward environments. MiniWorld environments with A2C, with different weightings for the state entropy regularization. The baseline (green) is a standard A2C with policy entropy regularization. Additional results using PPO can be seen in Figure S4 in Appendix.

Results: MaxEntState improves learning speed on all MiniGrid environments tested (Figure 4). In particular, the agent consistently finds the goal within the first 1000-2000 timesteps of experience. In the more complicated 3D tasks, MaxEntState entropy regularization consistently performs better than the MaxEntPolicy baseline (Figure 5 and S4). We were able to find some environments where only policies trained using MaxEntState were able to learn to solve the task (Figures 4(c), 3(a), and 3(b)). These environments are quite difficult for the MaxEntPolicy baseline to solve with random exploration, since they have a 3D observation space, are partially observable and have sparse rewards. In these tasks, MaxEntPolicy typically fails, as exploration of the state space plays a key role in these multiple room navigation tasks.

4.4 MaxEntState on Continuous Control

Finally, we compare the effectiveness of marginal state entropy regularization for common continuous control tasks. In control environments, exploration of the state space often plays a key role for sample efficiency in solving these tasks.

Experimental Setup: We use the popular MuJoCo simulator for continuous control (Todorov et al., 2012) with environments from OpenAI Gym (Brockman et al., 2016). We use the Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016) frameworks. We use open-source tuned implementations of DDPG and SAC, as these algorithms have high variance and are often unstable to use in practice (Henderson et al., 2018). In DDPG, we use MaxEntState as an additional state entropy regularizer, and compare with baseline DDPG only. In SAC, we use the MaxEntState regularizer in addition to the MaxEnt framework.

Results: We find that using MaxEntState on continuous control problems provides marginal improvements with SAC (Figure 6). These improvements are much larger in DDPG (Figure S5 in Appendix), since baseline DDPG uses deterministic policies only, and exploration is achieved with random noise in action space only.

Figure 6: Slight improvements in control domains with state space exploration. Continuous Control Benchmarks on Soft Actor-Critic. We compare with SAC here only, since it uses MaxEnt framework for exploration and improved stability, and we compare with and without state entropy regularization. We achieve more significant improvements with DDPG as shown in Figure S5.

5 Related Work

MaxEntState joins a family of methods that perform reward shaping (Ng et al., 1999) by augmenting external rewards from the environment with an additional reward signal. While the external reward is set by the environment, we are free to pick the internal reward. Some examples of internal rewards include using a hash-map based count of the states visited (Tang et al., 2017), the norm of the successor feature (Machado et al., 2018) and feature space prediction error (Pathak et al., 2017). Particularly related are count-based models for exploration. Following pseudo-count based models (Bellemare et al., 2016), the empirical distribution can be measured by the number of occurrences of a state in the sequence of visited states. This pseudo-count can also be defined in terms of a density model (Ostrovski et al., 2017). The entropy of the induced state distribution can thus be measured and provided as an intrinsic motivation similar to MaxEntState. Unlike our work, no gradient of this term exists with respect to the policy parameters.
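For comparison with the count-based view, the entropy of the empirical state distribution can be computed directly from visit counts. The sketch below uses hypothetical state labels and trajectories. Note that the result is a plain number computed from a buffer of visited states, so, unlike the regularizer proposed in this work, it carries no gradient with respect to the policy parameters.

```python
import math
from collections import Counter


def empirical_state_entropy(visited_states):
    """Entropy (in nats) of the empirical distribution over visited states."""
    counts = Counter(visited_states)
    total = len(visited_states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical visit sequences of equal length.
narrow_visits = ["s0"] * 9 + ["s1"]                 # stays near the start
broad_visits = ["s0", "s1", "s2", "s3", "s4"] * 2   # uniform over 5 states

narrow_entropy = empirical_state_entropy(narrow_visits)
broad_entropy = empirical_state_entropy(broad_visits)
print(narrow_entropy, broad_entropy)
```

A uniform visitation pattern attains the maximum value, log of the number of distinct states, which is the sense in which higher state entropy corresponds to wider coverage.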

Our method deviates from the usual intrinsic motivation algorithms as MaxEntState also serves as a regularizer. Specifically, there exists a gradient signal given by the bonus with respect to the policy parameters (Schulman et al., 2017a). Regularizers are often used to extract specific kinds of behaviour to exploit the structure of certain tasks. For example, mutual information regularizers can limit the dependence of the policy actions on the environment goals, encouraging more generalizable behaviour (Goyal et al., 2019; Galashov et al., 2019). Additionally, certain maximum entropy regularizers can encourage diversity amongst policies (Bachman et al., 2018; Eysenbach et al., 2019), enhance state space coverage (Hazan et al., 2018) and diversity of goals (Pong et al., 2019).

6 Discussions

In this work, we have shown that regularizing with maximum state entropy can perform better than using maximum entropy policy by itself.

When and why does maximizing the entropy of the marginal state distribution matter? In this work, we provide a regularized policy gradient which directly influences state space coverage for exploration. Our results show, through visualizations of the state space in gridworlds, that the marginal state space regularization does indeed improve the state space coverage (Figure 2(b)). However, we do not expect a state distribution regularization scheme to work everywhere: policies that maximize state space coverage are effectively sub-optimal (Figure 1(a)). This can be problematic when there is a dense reward signal and our agent is being incentivized to simultaneously follow the reward as well as maximize state space coverage. Despite this intuition, we have shown that there is a small benefit in dense reward continuous control tasks as well.

In contrast, we do expect that the gains from this state space coverage translate to quicker learning in environments with sparse or no reward structure: in such environments, a good random exploration strategy is important. Policy entropy and state distribution entropy induce two very different kinds of random exploration. While a mechanism that increases the randomness of policies will do a random walk in space (Figure 2(a) and 3(a)), a mechanism that visits novel states is likely to be more successful in discovering an eventual reward. Indeed, we found this to be true for the proposed algorithm in the sparse reward environments we have tested here. We expect that decaying the weighting term should eventually recover the true objective we care about, the discounted sum of rewards.

Limitations and Future Work: One major limitation of our approach is that we use the marginal, per-step state distribution for entropy, instead of maximizing the entropy of the stationary or discounted state distributions. Maximizing the entropy of the stationary state distribution can directly influence coverage, since stationarity implies exploration independent of time, whereas the discounted state distribution, based on future occupancy measures, would be a more reliable means of exploration in episodic settings. In practice, the stationary or discounted state distribution can be directly estimated by learning an explicit density estimator that is directly influenced by the policy parameters.
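As a rough illustration of what a discounted state distribution looks like, the sketch below forms a tabular Monte Carlo estimate from sampled trajectories and computes its entropy. This is a count-based stand-in for small discrete state spaces, not the learned density estimator mentioned above, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def discounted_state_distribution(
    trajectories: Sequence[Sequence[Hashable]], gamma: float
) -> Dict[Hashable, float]:
    """Estimate a discounted state distribution from state trajectories.

    Each visit to a state at time step t contributes weight gamma**t;
    the weights are then normalized to sum to one.
    """
    weights: Dict[Hashable, float] = defaultdict(float)
    for states in trajectories:
        for t, state in enumerate(states):
            weights[state] += gamma ** t
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no state visits to normalize")
    return {state: w / total for state, w in weights.items()}


def entropy(dist: Dict[Hashable, float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


if __name__ == "__main__":
    sampled = [[0, 1, 2], [0, 1, 1]]  # toy trajectories of state indices
    d = discounted_state_distribution(sampled, gamma=0.9)
    print(d, entropy(d))
```

Setting `gamma=1.0` weights all time steps equally, while smaller values emphasize states visited early in an episode.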

Conclusion: This work proposes an entropy regularization technique in policy optimization based on maximizing the marginal state distribution , which can be used as an approximation to . We provide a simple mechanism that can be used on top of any existing policy gradient algorithm to directly influence state space coverage. We showed that this scheme can learn policies that induce a larger state space coverage which can be an effective exploration objective. Finally, this regularization improved performance on a range of sparse reward environments. In closing, we believe our work provides a step towards extending entropy regularization that can directly influence state space coverage, which can perhaps tackle fundamental problems of exploration in reinforcement learning.


We thank Anirudh Goyal for preliminary discussions regarding this work. We are grateful to Emmanuel Bengio, Pascal Lamblin, John Martin and Pablo Samuel Castro for comments on drafts of this manuscript. This research was enabled in part by support provided by and


  • Z. Ahmed, N. L. Roux, M. Norouzi, and D. Schuurmans (2019) Understanding the impact of entropy on policy optimization. International Conference on Machine Learning. External Links: Link, 1811.11214 Cited by: §1, §2.
  • P. Bachman, R. Islam, A. Sordoni, and Z. Ahmed (2018) Vfunc: a deep generative model for functions. arXiv preprint arXiv:1807.04106. Cited by: §1, §5.
  • M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1471–1479. External Links: Link Cited by: §1, §3.2, §5.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §S.4, §4.1, §4.4.
  • M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for openai gym. GitHub. Note: Cited by: §S.4, §4.3.
  • M. Chevalier-Boisvert (2018) Gym-miniworld environment for openai gym. GitHub. Note: Cited by: §S.4, §4.3.
  • B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2019) Diversity is all you need: learning skills without a reward function. International Conference on Learning Representations. Cited by: §1, §5.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1582–1591. External Links: Link Cited by: §S.4.
  • A. Galashov, S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2019) Information asymmetry in kl-regularized rl. International Conference on Learning Representations. External Links: Document, Link Cited by: §5.
  • A. Goyal, P. Brakel, W. Fedus, T. P. Lillicrap, S. Levine, H. Larochelle, and Y. Bengio (2018) Recall traces: backtracking models for efficient reinforcement learning. CoRR abs/1804.00379. External Links: Link, 1804.00379 Cited by: §S.4.
  • A. Goyal, R. Islam, D. Strouse, Z. Ahmed, M. Botvinick, H. Larochelle, S. Levine, and Y. Bengio (2019) InfoBot: transfer and exploration via the information bottleneck. International Conference on Learning Representations abs/1901.10902. External Links: Link, 1901.10902 Cited by: §1, §5.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1856–1865. External Links: Link Cited by: §1, §4.4.
  • E. Hazan, S. M. Kakade, K. Singh, and A. V. Soest (2018) Provably efficient maximum entropy exploration. CoRR abs/1812.02690. External Links: Link, 1812.02690 Cited by: 1st item, §1, §4.1, §4, §5.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3207–3214. External Links: Link Cited by: §4.4.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, External Links: Link Cited by: §3.2.
  • V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §4.2.
  • I. Kostrikov (2018) PyTorch implementations of reinforcement learning algorithms. GitHub. Note: Cited by: §S.4, §S.4.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: §4.4.
  • M. C. Machado, M. G. Bellemare, and M. Bowling (2018) Count-based exploration with the successor representation. arXiv preprint arXiv:1807.11622. Cited by: §1, §5.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937. External Links: Link Cited by: §1, §3.2, §4.3.
  • A. Y. Ng, D. Harada, and S. J. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999, pp. 278–287. Cited by: §5.
  • G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2721–2730. External Links: Link Cited by: §1, §3.2, §5.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 2778–2787. External Links: Link Cited by: §1, §5.
  • V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2019) Skew-fit: state-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698. Cited by: §1, §5.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1312–1320. External Links: Link Cited by: §4.2.
  • J. Schulman, X. Chen, and P. Abbeel (2017a) Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440. Cited by: §2, §5.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §4.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017b) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §4.3.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 1057–1063. External Links: Link Cited by: §S.2, §2, §3.1.
  • H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel (2017) # exploration: a study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pp. 2753–2762. Cited by: §5.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control.. In IROS, pp. 5026–5033. External Links: ISBN 978-1-4673-1737-5, Link Cited by: §4.4.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256. External Links: Link, Document Cited by: §4.2.

Appendix S Supplementary Material: Marginalized State Distribution Entropy Regularization in Policy Optimization

s.1 Additional Experimental Results

(a) One Room Maze Domain (3D)
(b) Maze Domain (3D)
Figure S1: Examples of MiniWorld environments.
(c) with decay on
(d) with decay on
Figure S2: (a)-(d) The full range of experiments on FrozenLake that systematically study the impact of the two kinds of entropy with and without a decay on their regularization coefficients.
Figure S3: (a)-(c) The full range of experiments on FrozenLake that systematically study the impact of the interaction between two kinds of entropy.
Figure S4: MiniWorld Environments with PPO, with different weightings for the marginal state entropy regularization. Comparison with standard PPO baseline with policy entropy regularization. We find that with MaxEntState regularization, PPO can solve these hard exploration 3D maze navigation tasks much faster than the baseline with MaxEntPolicy.
Figure S5: Continuous Control Benchmarks on DDPG. Comparison of DDPG with and without state entropy regularization. We find that with MaxEntState, we can significantly improve the performance of DDPG, since this algorithm cannot maximize entropy with deterministic policies. Maximizing entropy with the marginal state distribution therefore plays a key role in exploration, improving the sample efficiency of this off-policy algorithm.
(a) Doorkey Env
(b) Multi-Room Env
(c) Four Rooms Env
Figure S6: Examples of MiniGrid environments.

s.2 Derivation of State Entropy Regularized Policy Gradient

In this section, we derive the policy gradient theorem with the state entropy regularized objective. Since the objective function is given by , with state entropy regularization, we can write as follows


We can now derive the policy gradient theorem, with the explicit state entropy regularized objective given by


The gradient of the term is :


Following the policy gradient theorem [Sutton et al., 1999], and using the log-derivative trick, we can therefore write


from which we can therefore get the following policy gradient theorem. The key difference here is that, the derived for the entropy of the marginal state distribution gives a regularized policy gradient update, where the original objective can be retrieved with .


In practice, however, computing exactly is difficult. This is because this term can be interpreted as using a probability density estimate of each state which is explicitly dependent on the policy which is parameterized.
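Since the displayed equations of this derivation are not reproduced here, the following is a sketch of the steps under assumed notation, not a transcription of the original equations. We assume a discrete state space, write $d_{\theta}(s)$ for the normalized state distribution induced by the policy with parameters $\theta$, $J(\theta)$ for the expected discounted return, $H$ for the Shannon entropy, and $\lambda \geq 0$ for the weighting term.

```latex
\begin{align}
\tilde{J}(\theta) &= J(\theta) + \lambda\, H(d_{\theta}),
\qquad H(d_{\theta}) = -\sum_{s} d_{\theta}(s) \log d_{\theta}(s), \\
\nabla_{\theta} H(d_{\theta})
&= -\sum_{s} \nabla_{\theta} d_{\theta}(s)\, \log d_{\theta}(s)
   - \sum_{s} d_{\theta}(s)\, \nabla_{\theta} \log d_{\theta}(s) \\
&= -\sum_{s} \nabla_{\theta} d_{\theta}(s)\, \log d_{\theta}(s)
   - \nabla_{\theta} \sum_{s} d_{\theta}(s) \\
&= -\mathbb{E}_{s \sim d_{\theta}}\!\left[ \nabla_{\theta} \log d_{\theta}(s)\, \log d_{\theta}(s) \right], \\
\nabla_{\theta} \tilde{J}(\theta)
&= \nabla_{\theta} J(\theta)
   - \lambda\, \mathbb{E}_{s \sim d_{\theta}}\!\left[ \nabla_{\theta} \log d_{\theta}(s)\, \log d_{\theta}(s) \right].
\end{align}
```

The second sum vanishes because $d_{\theta}$ sums to one for every $\theta$, and the last step of the entropy gradient uses the log-derivative identity $\nabla_{\theta} d_{\theta}(s) = d_{\theta}(s)\, \nabla_{\theta} \log d_{\theta}(s)$. Under these assumptions, setting $\lambda = 0$ leaves only the standard policy gradient term.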

s.3 Reproducibility Checklist

We follow the reproducibility checklist from Pineau, 2018 and include further details here. For all the models and algorithms, we have included details that we think would be useful for reproducing the results of this work.

  • For all models and algorithms presented, check if you include:

    1. Description of Algorithm and Model : We have included an algorithm box that clearly outlines the proposed approach. Our method can be used on top of any existing policy-gradient based RL algorithm, and in our experiments, we mostly used Reinforce, PPO and A2C algorithms for demonstrations. To implement our approach, we simply need to add an encoder to the policy network that maps states to a fixed dimensional latent variable (in most experiments, we use the size of the latent space ). We assume a Gaussian distribution over the latent representation, and compute the variational entropy and the KL divergence. We use this as a regularization in the policy gradient update. We further use the entropy to provide an exploration bonus in hard exploration tasks. This regularization is added on top of existing max-entropy policy regularization (which is also added as an exploration bonus). This term is added with a weighting term.

    2. Analysis of Complexity : We do not include any separate analysis of the complexity of our algorithm. Computation-wise, our approach requires the extra computation of the KL-term for the regularizer (which is similar to a lot of existing related works).

    3. Link to downloadable source code : See the experimental details section below, where we include further details for each of our experimental setups. All our implementations use existing open-sourced RL implementations (details of which we list below). We provide the code used in our experiments in a separate zip file, and agree that we will open source our implementations for any result figures used in this paper, as well as the scripts used for launching and plotting the experimental results, for absolute clarity.

  • For any theoretical claim, check if you include:

    1. We include a statement of our theoretical result in the main paper. We clearly describe the theoretical steps required to derive our modified objective.

    2. Complete Proof of Claim : In the appendix, we have also included a clear derivation of our proposed approach, using an existing theorem from the literature, to clarify how exactly our proposed approach differs.

    3. A clear explanation of any assumptions : We clearly describe the assumptions made to make our model work in practice. Since we introduce an encoder in our policy network for practical realization of the algorithm, we clearly mention that the only assumption being made is assuming a standard Gaussian distribution as output of the encoder, and a unit Gaussian prior for computing the KL divergence term.

  • For all figures and tables that present empirical results, check if you include:

    1. Data collection process : We did not need to include a complete description of the data collection process. This is because we use any standard RL algorithm, and use the same number of timesteps or episodes typically used for practical implementations. Both our proposed approach and the baseline are trained with the same number of samples.

    2. Downloadable version of environment : We use open-sourced OpenAI gym environments for most of our experiments, including the Mujoco simulator. For experiments, where we used other environments that are typically not used, we either include a link to the environment code repo that we used, or provide the actual code of the environment in the accompanying codebase of this paper.

    3. Description of any pre-processing step : We do not require any data pre-processing step for our experiments.

    4. Sample allocation for training and evaluation : We use standard RL evaluation framework for our experimental results. In our experiments, as done in any RL algorithm, the trained policy is evaluated at fixed intervals, and the performance is measured by plotting the cumulative returns. In most of our presented experimental results, we plot the cumulative return performance measure. In 2 of our results, we plot the state visitation heatmap for a qualitative analysis of our method and to present the intuition. In the accompanying code, we also include details of how to generate these heatmaps to reproduce the results.

    5. Range of hyper-parameters considered : For our experiments, we did not do any extensive hyperparameter tuning. We took existing implementations of RL algorithms (details of which are given in the Appendix experimental details section below), which generally contain tuned implementations. For our proposed method, we only introduced the extra hyperparameter for the state entropy weighting. We tried our experiments with only 3 different lambda values () and compared to the baseline with for a fair comparison. Both our proposed method and the baseline contain the same network architectures and other hyperparameters that are used in existing open-sourced RL algorithms. We include more details of our experiment setups in the next section of the Appendix.

    6. Number of Experiment Runs : For all our experimental results, we plot results over random seeds. Each of our hyper-parameter tuning is also done with experiment runs with each hyperparameter. These random seeds are sampled at the start of any experiment, and plots are shown averaged over 5 runs. Because many deep RL algorithms suffer from high variance, some of our experiment results show a high-variance region.

    7. Statistics used to report results : In the resulting figures, we plot the mean and the standard error (the shaded region), to demonstrate the variance across runs and around the mean. We note that some of the environments we used in our experiments are very challenging to solve (e.g. 3D maze navigation domains), resulting in the high variance (shaded region) around the plots. The MuJoCo control experiments done in this work have the standard shaded region as expected in the performance of the baseline algorithms we have used (DDPG and SAC).

    8. Error bars : The error bars or shaded region show the standard error over the number of experiment runs.

    9. Computing Infrastructure : We used both CPUs and GPUs in all of our experiments, depending on the complexity of the tasks. For some of our experiments, we could have run for more than random seeds, for each hyperparameter tuning, but it becomes computationally challenging and a waste of resources, for which we limit the number of experiment runs, with both CPU and GPU to be a standard of across all setups.
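The Gaussian encoder described in the first checklist item can be made concrete with a short sketch of the two quantities it mentions: the entropy of a diagonal Gaussian over the latent variable and its KL divergence to a unit Gaussian prior. These are the standard closed-form expressions; the function names and the log-variance parameterization are assumptions for illustration and are not taken from the accompanying code.

```python
import math
from typing import Sequence


def gaussian_entropy(log_var: Sequence[float]) -> float:
    """Differential entropy (nats) of a diagonal Gaussian.

    log_var holds the log-variance of each latent dimension.
    """
    k = len(log_var)
    return 0.5 * (k * math.log(2.0 * math.pi * math.e) + sum(log_var))


def kl_to_unit_gaussian(mean: Sequence[float],
                        log_var: Sequence[float]) -> float:
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


if __name__ == "__main__":
    # Hypothetical encoder output for a 2-dimensional latent variable.
    mean, log_var = [0.3, -0.1], [0.0, -0.5]
    print(gaussian_entropy(log_var), kl_to_unit_gaussian(mean, log_var))
```

In an actual implementation the mean and log-variance would be produced by the encoder network for each state, and these terms would be computed with the tensor library of the underlying RL code so that gradients flow to the policy parameters.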

s.4 Additional Experimental Details

In this section, we include further experimental details and setup for the results presented in the paper.

Experiment setup in Demonstrating Hypothesis in Section 4.1: In Section 4.1 we demonstrate the effect of maximizing on a simple FrozenLake task. We use the standard FrozenLake gym environment [Brockman et al., 2016] and a simple actor-critic implementation with one-hot encoding of states as features and a one-layer neural network approximator.
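For readers unfamiliar with the feature representation mentioned above, a one-hot encoding of a discrete state index can be sketched as follows. The function name is hypothetical, and the example assumes the 16-state (4x4) FrozenLake layout.

```python
from typing import List


def one_hot(state: int, n_states: int) -> List[float]:
    """Return a length-n_states vector with 1.0 at index `state`."""
    if not 0 <= state < n_states:
        raise ValueError("state index out of range")
    features = [0.0] * n_states
    features[state] = 1.0
    return features


if __name__ == "__main__":
    # Assumed 4x4 grid, so 16 discrete states.
    print(one_hot(5, 16))
```

With this encoding, a one-layer network reduces to a table lookup: each state selects one row of the weight matrix.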

Experiment setup for State Space Coverage in Section 4.2: Pachinko world and the double-slit experiment are implemented in the open-source package EasyMDP. For this task, we use a parallel threaded Reinforce implementation, and only compare the performance of our proposed approach qualitatively by plotting the state visitation heatmaps.

For the four rooms domain, we used the open-source four rooms code available from [Goyal et al., 2018]. We use an actor-critic with a GAE implementation for this task, and compare with and without our proposed state entropy regularization. We provide the code for this task in the accompanying code for the paper.
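The GAE component mentioned above can be sketched as follows, following the generalized advantage estimator of Schulman et al. (2015). This is a minimal illustration rather than the code used in the experiments: the function name and default values are assumptions, the sequence is treated as a single episode segment with no terminal states inside it, and `lam` here is the GAE parameter, not the state entropy weighting term.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float],
                   values: Sequence[float],
                   gamma: float = 0.99,
                   lam: float = 0.95) -> List[float]:
    """Compute generalized advantage estimates for one trajectory segment.

    `values` must contain one more entry than `rewards`; the last entry is
    the bootstrap value of the state reached after the final reward.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted TD residuals backwards in time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([0.0, 0.0, 1.0], [0.1, 0.2, 0.5, 0.0]))
```

With `lam=0.0` the estimate reduces to the one-step TD residual, and with `lam=1.0` it becomes the discounted sum of residuals over the rest of the segment.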

Experiment setup in Maze Navigation Tasks in Section 4.3

For the sparse reward POMDP gridworld tasks, we use the open-sourced Gym-Minigrid environments available in [Chevalier-Boisvert et al., 2018]. This implementation uses a standard A2C implementation from [Kostrikov, 2018]. We used the same network architectures, learning rates and optimizers, as used in the open-source implementation, with no further hyper-parameter tuning. Figure S6 shows the POMDP environments from the Minigrid [Chevalier-Boisvert et al., 2018]. For this task, we provide code for our implementation built on top of existing A2C code.

In the 3D maze navigation tasks, we used the open-source Miniworld environments from [Chevalier-Boisvert, 2018] (Figure S1). For our experiments, we used both A2C and PPO with the open-sourced implementations from [Kostrikov, 2018]. We used the same network architecture, optimizers and learning rates for our implementation, as used in the baseline code of [Kostrikov, 2018]. Figure below further shows some of the environments used in this work

Experiment setup in Continuous Control Tasks in Section 4.4

For the continuous control experiments, we used the open-source implementation of DDPG available from the accompanying paper [Fujimoto et al., 2018]. We further use a SAC implementation, from a modified implementation of DDPG. Both the implementations of DDPG and SAC are provided with the accompanying codebase. We used the same architectures and hyperparameters for DDPG and SAC as reported in [Fujimoto et al., 2018].