Efficient Entropy for Policy Gradient with Multidimensional Action Space

06/02/2018
by   Yiming Zhang, et al.
NYU

In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces, such as in Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers the entropy bonus, which is used to encourage exploration in policy gradient. For high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation passes for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators substantially improve performance with marginal additional computational cost.

1 Introduction

In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces, such as the game of Go alpha_go and Atari games original_atari_paper ; human_level_control ; multitask_learning_model_based_rl ; combine_pg_dqn ; actor_mimic_multi_task_transfer_learning_rl ; sample_efficient_actor_critic ; sobolev_training . In all of these success stories, the size of the action space was relatively small. Many Reinforcement Learning (RL) problems, however, involve high-dimensional action spaces as well as high-dimensional state spaces. Examples include StarCraft deepmind_starcraft ; facebook_starcraft , where there are many agents, each of which can take a finite number of actions; and coordinating self-driving cars at an intersection, where each car can take a finite set of actions comm_net .

In policy gradient, in order to encourage sufficient exploration, an entropy bonus term is typically added to the objective function. However, in the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible.

In this paper, we develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. For all of these parameterizations, actions can be efficiently sampled from the policy distribution, and backpropagation can be employed for training. These parameterizations can be combined with the entropy bonus estimators and stochastic gradient descent, giving a new class of policy gradient algorithms with desirable exploration. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators can substantially improve performance with marginal additional computational cost.

2 Preliminaries

Consider a Markov Decision Process (MDP) with a $K$-dimensional action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_K$. Denote $a = (a^1,\dots,a^K)$ for an action in $\mathcal{A}$. A policy $\pi$ specifies for each state $s$ a distribution $\pi(\cdot|s)$ over the action space $\mathcal{A}$. In the standard RL setting, an agent interacts with an environment over a number of discrete time steps sutton_book ; silver_lectures . At time step $t$, the agent is in state $s_t$ and samples an action $a_t$ from the policy distribution $\pi(\cdot|s_t)$. The agent then receives a scalar reward $r_t$ and the environment enters the next state $s_{t+1}$. The agent then samples $a_{t+1}$ from $\pi(\cdot|s_{t+1})$ and so on. The process continues until the end of the episode, denoted by $T$. The return $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the discounted accumulated reward from time step $t$ until the end of the episode, where $\gamma \in (0,1]$ is the discount factor.

In policy gradient, we consider a set of parameterized policies $\pi_\theta(a|s)$, $\theta \in \Theta$, and attempt to find a good $\theta$ within the parameter set $\Theta$. Typically, the policy $\pi_\theta$ is generated by a neural network, with $\theta$ denoting the network's weights and biases. The parameters $\theta$ are updated by performing stochastic gradient ascent on the expected reward. One example of such an algorithm is REINFORCE william , where in a given episode, at time step $t$, the parameters are updated as follows:

$\theta \leftarrow \theta + \alpha \,(G_t - b(s_t))\, \nabla_\theta \log \pi_\theta(a_t|s_t)$

where $b(s_t)$ is a baseline. It is well known that the policy gradient algorithm often converges to a local optimum. To discourage convergence to a highly suboptimal policy, the gradient of the policy entropy is typically added to the update rule:

$\theta \leftarrow \theta + \alpha \,(G_t - b(s_t))\, \nabla_\theta \log \pi_\theta(a_t|s_t) + \beta \, \nabla_\theta H(\pi_\theta(\cdot|s_t))$   (1)

where

$H(\pi_\theta(\cdot|s)) = -\sum_{a \in \mathcal{A}} \pi_\theta(a|s) \log \pi_\theta(a|s).$   (2)

This approach is often referred to as adding an entropy bonus or entropy regularization william and is widely used in different applications, such as optimal control in Atari games async_rl , multi-agent games multi_agent_openai , and optimizer search for supervised machine learning with RL optimizer_search . $\beta$ is referred to as the entropy weight.
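To make the update rule concrete, the following is a minimal PyTorch-style sketch (ours, not the authors' implementation) of a single REINFORCE update with an exact entropy bonus over a small one-dimensional action space; the network sizes and the names policy_net, optimizer and beta are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative policy: a small feed-forward network over a one-dimensional
# discrete action space, where the exact entropy in (2) is cheap to compute.
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
beta = 0.01  # entropy weight

def reinforce_step(state, action, return_minus_baseline):
    """One REINFORCE update with an exact entropy bonus, as in (1)."""
    dist = Categorical(logits=policy_net(state))
    # Gradient ascent on (G_t - b(s_t)) * log pi(a_t|s_t) + beta * H(pi(.|s_t)),
    # implemented as gradient descent on the negated objective.
    loss = -(return_minus_baseline * dist.log_prob(action) + beta * dist.entropy())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with dummy data.
reinforce_step(torch.randn(8), torch.tensor(2), torch.tensor(1.5))
```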

3 Policy Parameterization for Multidimensional Action Space

For problems with discrete action spaces, policies are commonly parameterized as a feed-forward neural network (FFN) with a softmax output layer of dimension $|\mathcal{A}|$. Sampling from such a policy therefore requires $O(|\mathcal{A}|)$ effort. For multidimensional action spaces, $|\mathcal{A}|$ grows exponentially with the number of dimensions $K$.

In order to sample efficiently from our policy, we consider autoregressive models from which each dimension can be sampled sequentially. In our discussion, we will assume that the action sets along the different dimensions all have the same size $A$, so that $|\mathcal{A}| = A^K$; to handle action sets of different sizes, we can include inconsequential actions. Here we review two such models, and note that sampling from both models only requires $O(KA)$ effort, as opposed to $O(A^K)$ effort. We emphasize that our use of an autoregressive model to create multi-dimensional probability distributions is not novel. However, we need to provide a brief review to motivate our entropy calculation algorithms.

3.1 Using an LSTM to Generate the Parameterized Policy

LSTMs have recently been used with great success for autoregressive models in language translation tasks lstm_translation . An LSTM can also be used to create a parameterized multi-dimensional distribution and to sample from that distribution (Figure 1(a)). To generate $a^k$, we run a forward pass through the LSTM with the input being the previously generated component $a^{k-1}$ and the current state $s$ (the output also depends implicitly on the earlier components, which influence the LSTM's hidden state). This produces a hidden state $h^k$, which is then passed through a linear layer, producing an $A$-dimensional vector. The softmax of this vector is taken to produce the one-dimensional conditional distribution $\pi_\theta(a^k \mid a^1,\dots,a^{k-1}, s)$, $a^k \in \mathcal{A}_k$. The component $a^k$ is sampled from this one-dimensional distribution and is then fed into the next stage of the LSTM to produce $a^{k+1}$. We note that this approach is an adaptation of sequence modeling in supervised machine learning wave_net to reinforcement learning and has also been proposed by google_sdqn_cont_action and actor_critic_sequence_prediction .
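As an illustration of this sampling procedure, the sketch below (our own, with assumed dimension sizes and a one-hot encoding of the previous component) generates a $K$-dimensional action one component at a time with an LSTM cell, accumulating the log-probability and the per-dimension conditional entropies along the way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class LSTMPolicy(nn.Module):
    """Autoregressive policy over a K-dimensional discrete action space."""
    def __init__(self, state_dim, num_dims, actions_per_dim, hidden=128):
        super().__init__()
        self.K, self.A = num_dims, actions_per_dim
        # Input at each step: the state and a one-hot of the previous component.
        self.cell = nn.LSTMCell(state_dim + actions_per_dim, hidden)
        self.head = nn.Linear(hidden, actions_per_dim)

    def sample(self, state):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        prev = torch.zeros(1, self.A)   # placeholder "previous component" for k = 1
        action, log_prob, dim_entropies = [], 0.0, []
        for _ in range(self.K):
            h, c = self.cell(torch.cat([state, prev], dim=1), (h, c))
            dist = Categorical(logits=self.head(h))   # k-th conditional distribution
            a_k = dist.sample()
            action.append(a_k.item())
            log_prob = log_prob + dist.log_prob(a_k)  # accumulates log pi(a|s)
            dim_entropies.append(dist.entropy())      # entropy of the k-th conditional
            prev = F.one_hot(a_k, self.A).float()
        return action, log_prob, dim_entropies

policy = LSTMPolicy(state_dim=10, num_dims=5, actions_per_dim=9)
action, log_prob, dim_entropies = policy.sample(torch.randn(1, 10))
```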

(a) The RNN architecture. To generate $a^k$, we input $a^{k-1}$ and $s$ into the RNN and then pass the resulting hidden state through a linear layer and a softmax to generate a distribution, from which we sample $a^k$.

(b) The MMDP architecture. To generate $a^k$, we input $s$ and $a^1,\dots,a^{k-1}$ into an FFN. The output is passed through a softmax layer, providing a distribution from which we sample $a^k$. Since the input size of the FFN is fixed, when generating $a^k$, constants serve as placeholders for the components that have not yet been selected in the input to the FFN.
Figure 1: The RNN and MMDP architectures for generating parameterized policies.

3.2 Using MMDP to Generate the Parameterized Policy

The underlying MDP can be modified to create an equivalent MDP for which the action space is one-dimensional. We refer to this MDP as the Modified MDP (MMDP). In the original MDP, we have state space $\mathcal{S}$ and the $K$-dimensional action space $\mathcal{A}$. In the MMDP, the state encapsulates the original state $s$ and all the action components selected for state $s$ so far (Figure 1(b)). We note that google_sdqn_cont_action recently and independently proposed the reformulation of the MDP into the MMDP.
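A minimal sketch of an MMDP-style parameterization as described above (illustrative only; sizes are assumptions, and the already-chosen components are fed back as raw indices rather than one-hot vectors for brevity):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MMDPPolicy(nn.Module):
    """FFN policy for the modified MDP: one forward pass per action dimension."""
    def __init__(self, state_dim, num_dims, actions_per_dim, hidden=128):
        super().__init__()
        self.K, self.A = num_dims, actions_per_dim
        # Fixed-size input: the state plus K slots for already-chosen components.
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_dims, hidden), nn.Tanh(),
            nn.Linear(hidden, actions_per_dim))

    def sample(self, state, placeholder=-1.0):
        chosen = torch.full((1, self.K), placeholder)  # constants for unchosen slots
        action, log_prob = [], 0.0
        for k in range(self.K):
            dist = Categorical(logits=self.net(torch.cat([state, chosen], dim=1)))
            a_k = dist.sample()
            action.append(a_k.item())
            log_prob = log_prob + dist.log_prob(a_k)
            chosen[0, k] = a_k.item()  # reveal the k-th component for later passes
        return action, log_prob

policy = MMDPPolicy(state_dim=10, num_dims=5, actions_per_dim=9)
action, log_prob = policy.sample(torch.randn(1, 10))
```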

4 Entropy Bonus Approximation for Multidimensional Action Space

As shown in (1), an entropy bonus is typically included to enhance exploration. However, for a large multi-dimensional action space, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation passes for each action. In this section, we develop computationally efficient unbiased estimates for the entropy and its gradient. These computationally efficient algorithms can be combined with the autoregressive models discussed in the previous section to provide end-to-end computationally efficient schemes.

To abbreviate notation, we write $a^{1:k}$ for $(a^1,\dots,a^k)$ and $\pi_\theta(\cdot \mid s, a^{1:k-1})$ for the conditional distribution of the $k$-th action component given the state $s$ and the first $k-1$ components. We consider autoregressive models whereby the sample components $a^k$, $k = 1,\dots,K$, are sequentially generated. In particular, after obtaining $a^{1:k-1}$, we generate $a^k$ from a parameterized distribution $\pi_\theta(\cdot \mid s, a^{1:k-1})$ defined over the one-dimensional set $\mathcal{A}_k$. After generating the distributions and the action components sequentially, we then define

$\pi_\theta(a|s) = \prod_{k=1}^{K} \pi_\theta(a^k \mid s, a^{1:k-1}).$

Let $A = (A^1,\dots,A^K)$ denote a random variable with distribution $\pi_\theta(\cdot|s)$. Let $H(s)$ denote the exact entropy of the distribution $\pi_\theta(\cdot|s)$:

$H(s) = -\sum_{a \in \mathcal{A}} \pi_\theta(a|s) \log \pi_\theta(a|s).$

4.1 Crude Unbiased Estimator

During training within an episode, for each state $s_t$, the policy generates an action $a_t$. We refer to this generated action as the episodic sample. A crude approximation of the entropy bonus is:

$\hat{H}_{\mathrm{crude}}(s_t) := -\log \pi_\theta(a_t|s_t).$

This approximation is an unbiased estimate of $H(s_t)$, but its variance is likely to be large. To reduce the variance, we can generate multiple action samples when in $s_t$ and average the log action probabilities over the samples. However, generating a large number of samples is costly, especially when each sample is generated from a neural network, since each sample requires one additional forward pass.
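For illustration, assuming a policy object with a sample(state) method that returns the sampled action and its log-probability (as in the sketches of Section 3), the crude estimate and its multi-sample variant can be written as:

```python
def crude_entropy_estimate(policy, state, num_samples=1):
    """Average of -log pi(a|s) over sampled actions; unbiased for H(pi(.|s)).

    num_samples=1 gives the crude estimate based on a single episodic sample;
    larger values reduce variance at the cost of one forward pass per sample.
    """
    total = 0.0
    for _ in range(num_samples):
        _, log_prob = policy.sample(state)[:2]
        total = total - log_prob
    return total / num_samples
```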

4.2 Smoothed Estimator

This section proposes an alternative unbiased estimator for $H(s_t)$ which only requires the one episodic sample $a_t$ and accounts for the entropy along each dimension of the action space:

$\hat{H}_{\mathrm{smooth}}(s_t, a_t) := \sum_{k=1}^{K} H^k(s_t, a_t^{1:k-1}),$

where

$H^k(s_t, a_t^{1:k-1}) := -\sum_{a^k \in \mathcal{A}_k} \pi_\theta(a^k \mid s_t, a_t^{1:k-1}) \log \pi_\theta(a^k \mid s_t, a_t^{1:k-1}),$

which is the entropy of $A^k$ conditioned on $A^{1:k-1} = a_t^{1:k-1}$. This estimate of the entropy bonus is computationally efficient since, for each dimension $k$, we would need to obtain $\pi_\theta(\cdot \mid s_t, a_t^{1:k-1})$, its log, and its gradient anyway during training. We refer to this approximation as the smoothed entropy.
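In the sketches of Section 3 the per-dimension conditional distributions are materialized while sampling the episodic action, so the smoothed estimate is simply the sum of their entropies. The snippet below illustrates this; the Categorical distributions stand in for the $K$ conditionals.

```python
import torch
from torch.distributions import Categorical

def smoothed_entropy(conditional_dists):
    """Sum of per-dimension conditional entropies H(pi(.|s, a^1..a^{k-1}))."""
    return torch.stack([d.entropy() for d in conditional_dists]).sum()

# Example with three illustrative conditional distributions.
dists = [Categorical(logits=torch.randn(9)) for _ in range(3)]
H_smooth = smoothed_entropy(dists)
```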

The smoothed estimate of the entropy has several appealing properties. The proofs of Theorem 1 and Theorem 3 are straightforward and omitted.

Theorem 1.

$\hat{H}_{\mathrm{smooth}}(s, A)$ is an unbiased estimator of the exact entropy $H(s)$.
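A short justification (our paraphrase of the omitted argument) follows from the chain rule of entropy:

```latex
% Unbiasedness of the smoothed estimate: take the expectation over the episodic
% sample A ~ pi_theta(.|s) and apply the chain rule of entropy.
\mathbb{E}_{A}\!\left[\hat{H}_{\mathrm{smooth}}(s, A)\right]
  = \sum_{k=1}^{K} \mathbb{E}_{A^{1:k-1}}\!\left[ H\!\left(\pi_\theta(\cdot \mid s, A^{1:k-1})\right) \right]
  = \sum_{k=1}^{K} H\!\left(A^{k} \mid A^{1:k-1}\right)
  = H(A) = H(s).
```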

Theorem 2.

If $A$ has a multivariate normal distribution with mean and covariance depending on $s$ and $\theta$, then

$\hat{H}_{\mathrm{smooth}}(s, a) = H(s) \quad \text{for all } a.$

Thus, the smoothed estimate of the entropy equals the exact entropy for a multivariate normal parameterization of the policy.

See Appendix B for proof.

Theorem 3.

(i) If there exists a sequence of weights $\{\theta_n\}$ in $\Theta$ such that $\pi_{\theta_n}(\cdot|s)$ converges to the uniform distribution over $\mathcal{A}$, then $\hat{H}_{\mathrm{smooth}} \rightarrow \log|\mathcal{A}|$, the supremum of the exact entropy.

(ii) If there exists a sequence of weights $\{\theta_n\}$ in $\Theta$ such that $\pi_{\theta_n}(\cdot|s)$ converges to a point mass at some $a^\ast \in \mathcal{A}$, then $\hat{H}_{\mathrm{smooth}} \rightarrow 0$, the infimum of the exact entropy.

Thus, the smoothed estimate of the entropy mimics the exact entropy in that it has the same supremum and infimum values as the exact entropy.

The above theorems indicate that $\hat{H}_{\mathrm{smooth}}$ may serve as a good proxy for $H(s)$: it is an unbiased estimator for $H(s)$; it has the same minimum and maximum values when varying $\theta$; and in the special case when $A$ has a multivariate normal distribution, it is actually equal to $H(s)$. Our numerical experiments have shown that the smoothed estimator typically has lower variance than the crude estimator $\hat{H}_{\mathrm{crude}}$. However, it is not generally true that the smoothed estimator always has lower variance, as counterexamples can be found.

4.3 Smoothed Mode Estimator

For the smoothed estimate of the entropy $\hat{H}_{\mathrm{smooth}}(s, a)$, we use the episodic action $a$ to form the estimate. We now consider alternative choices of actions, which may improve performance at modest additional computational cost. First consider $\hat{H}_{\mathrm{smooth}}(s, a^\ast)$, where $a^\ast = \arg\max_a \pi_\theta(a|s)$. Thus in this case, instead of calculating the smoothed estimate of the entropy with the episodic action $a$, we calculate it with the most likely action $a^\ast$. The problem here is that it is not easy to find $a^\ast$ when the conditional probabilities are not in closed form but only available algorithmically as outputs of neural networks. A more computationally efficient approach is to choose the action greedily: $\tilde{a}^1 = \arg\max_{a^1} \pi_\theta(a^1|s)$ and $\tilde{a}^k = \arg\max_{a^k} \pi_\theta(a^k \mid s, \tilde{a}^{1:k-1})$ for $k = 2,\dots,K$. This leads to the definition $\hat{H}_{\mathrm{mode}}(s) := \hat{H}_{\mathrm{smooth}}(s, \tilde{a})$. The action $\tilde{a}$ is an approximation for the mode of the distribution $\pi_\theta(\cdot|s)$. As often done in NLP, we can use beam search to determine an action with higher probability, that is, one closer to the true mode $a^\ast$. Indeed, the above definition is beam search with beam size equal to 1. We refer to $\hat{H}_{\mathrm{mode}}(s)$ as the smoothed mode estimate.

$\hat{H}_{\mathrm{mode}}$ with an appropriate beam size may be a better approximation for the entropy than $\hat{H}_{\mathrm{smooth}}$. However, calculating $\hat{H}_{\mathrm{mode}}$ and its gradient comes with some computational cost. For example, with a beam size equal to one, we would have to make two passes through the policy neural network at each time step: one to obtain the episodic sample $a$ and the other to obtain the greedy action $\tilde{a}$. Larger beam sizes require correspondingly more passes. We note that $\hat{H}_{\mathrm{mode}}$ is a biased estimator for $H(s)$ but with no variance. Thus there is a bias-variance tradeoff between $\hat{H}_{\mathrm{smooth}}$ and $\hat{H}_{\mathrm{mode}}$. Note that $\hat{H}_{\mathrm{mode}}$ also satisfies Theorems 2 and 3 in subsection 4.2.
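The greedy (beam size 1) computation can be sketched as follows, assuming a hypothetical conditional_logits(state, prefix) method that returns the logits of the next conditional distribution:

```python
import torch
from torch.distributions import Categorical

def greedy_action_and_smoothed_entropy(policy, state, num_dims):
    """Beam-size-1 (greedy) action and the smoothed entropy evaluated along it."""
    prefix, entropy = [], 0.0
    for _ in range(num_dims):
        # Hypothetical interface: logits of the next conditional distribution.
        logits = policy.conditional_logits(state, prefix)
        entropy = entropy + Categorical(logits=logits).entropy()
        prefix.append(int(torch.argmax(logits)))  # most likely next component
    return prefix, entropy
```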

4.4 Estimating the Gradient of the Entropy

So far we have been looking at estimates of the entropy. But the update rule (1) uses the gradient of the entropy rather than the entropy itself. As it turns out, the gradients of the estimators $\hat{H}_{\mathrm{smooth}}$ and $\hat{H}_{\mathrm{mode}}$ are not unbiased estimates of the gradient of the entropy. In this subsection, we provide unbiased estimators for the gradient of the entropy. For simplicity, in this section we assume a one-step decision setting, such as in a multi-armed bandit problem, and suppress the dependence on the state. A straightforward calculation shows:

$\nabla_\theta H = -\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a)\,\big[\log \pi_\theta(a) + 1\big] = -\mathbb{E}_{A \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(A)\,(\log \pi_\theta(A) + 1)\big].$   (3)

Suppose $a$ is one sample from $\pi_\theta$. A crude unbiased estimator for the gradient of the entropy is therefore $-\nabla_\theta \log \pi_\theta(a)\,(\log \pi_\theta(a) + 1)$. Note that this estimator is equal to the gradient of the crude estimator $\hat{H}_{\mathrm{crude}}$ multiplied by a correction factor.
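In an automatic-differentiation framework, this crude estimator of the entropy gradient can be obtained by backpropagating through a surrogate in which the correction factor is detached from the graph; the sketch below (ours, not the authors' code) assumes log_prob is the differentiable log-probability of the sampled action.

```python
def crude_entropy_gradient_surrogate(log_prob):
    """Surrogate whose gradient is -(1 + log pi(a)) * grad log pi(a).

    The correction factor (1 + log pi(a)) is detached so that it is treated
    as a constant during backpropagation. Subtracting beta times this
    surrogate from the loss being minimized therefore performs gradient
    ascent on the entropy using the crude single-sample gradient estimate.
    """
    return -log_prob * (1.0 + log_prob.detach())
```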

Analogous to the smoothed estimator for entropy, we can also derive a smoothed estimator for the gradient of the entropy.

Theorem 4.

If $a$ is a sample from $\pi_\theta$, then the smoothed estimator of the entropy gradient, obtained by adding a correction term to the gradient of the smoothed estimate $\hat{H}_{\mathrm{smooth}}$, is an unbiased estimator for the gradient of the entropy.

See Appendix C for proof.

Note that this estimate for the gradient of the entropy is equal to the gradient of the smoothed estimate plus a correction term. We refer to this estimate of the entropy gradient as the unbiased gradient estimate.

5 Experimental Results

We designed experiments to compare the different entropy estimators for the LSTM, MMDP, and CommNet models; CommNet is a related approach introduced by comm_net . As a baseline, we use the Independent Sampling (IS) model, an FFN that takes the state as input, creates a representation of the state, and from that representation outputs $K$ softmax heads, from which the value of each action dimension is sampled independently comm_net . In this case, the smoothed estimate is equal to the exact entropy. For each entropy approximation, the entropy weight was tuned to give the highest reward. For IS and MMDP, the number of hidden layers was tuned from 1 to 7. For CommNet, the number of communication steps was tuned from 2 to 5, the learning rate was tuned between 3e-3 and 3e-4, and the size of the policy hidden layer was tuned between 128 and 256.
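A minimal sketch of the IS parameterization as we read it (sizes and names are illustrative): a shared FFN trunk followed by $K$ independent softmax heads, for which the sum of the head entropies is the exact policy entropy.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ISPolicy(nn.Module):
    """Independent Sampling policy: one softmax head per action dimension."""
    def __init__(self, state_dim, num_dims, actions_per_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, actions_per_dim) for _ in range(num_dims)])

    def sample(self, state):
        rep = self.trunk(state)
        dists = [Categorical(logits=head(rep)) for head in self.heads]
        action = [d.sample() for d in dists]
        log_prob = sum(d.log_prob(a) for d, a in zip(dists, action))
        # Because the dimensions are independent, the sum of head entropies
        # is the exact entropy of pi(.|s).
        entropy = sum(d.entropy() for d in dists)
        return [a.item() for a in action], log_prob, entropy

policy = ISPolicy(state_dim=50, num_dims=5, actions_per_dim=9)
action, log_prob, entropy = policy.sample(torch.randn(1, 50))
```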

5.1 Hunters and Rabbits

In this environment, there is an $n \times n$ grid. At the beginning of each episode, $m$ hunters and $m$ rabbits are randomly placed on the grid. The rabbits remain fixed during the episode, and each hunter can move to a neighboring square (including diagonal neighbors) or stay at its current square. Each hunter therefore has nine possible actions, and altogether there are $9^m$ actions at each time step. When a hunter enters a square with a rabbit, the hunter captures the rabbit and remains there until the end of the episode. In each episode, the goal is for the hunters to capture the rabbits as quickly as possible. Each episode is allowed to run for at most ten thousand time steps.

To provide a dense reward signal, we modify the goal as follows: capturing a rabbit gives a fixed reward, which is discounted by the number of time steps taken since the beginning of the episode. The discount factor is 0.8. The goal is to maximize the episode's total discounted reward. After a hunter captures a rabbit, they both become inactive.
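For concreteness, the following compact sketch implements the environment as described above; the per-capture reward of 1 and the flat state encoding are our assumptions and are not specified in the text.

```python
import random

class HuntersAndRabbits:
    """Sketch of the grid game: hunters move in 8 directions or stay; captures are permanent."""
    MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]  # 9 actions per hunter

    def __init__(self, n=5, m=5, gamma=0.8, capture_reward=1.0, max_steps=10_000):
        self.n, self.m, self.gamma = n, m, gamma
        self.capture_reward, self.max_steps = capture_reward, max_steps

    def reset(self):
        cells = [(x, y) for x in range(self.n) for y in range(self.n)]
        spots = random.sample(cells, 2 * self.m)
        self.hunters = spots[:self.m]
        self.rabbits = set(spots[self.m:])    # rabbits not yet captured
        self.captured = [False] * self.m      # hunters that have already captured
        self.t = 0
        return self._state()

    def step(self, action):                   # action: one move index per hunter
        reward = 0.0
        for i, a in enumerate(action):
            if self.captured[i]:              # hunter is inactive after a capture
                continue
            dx, dy = self.MOVES[a]
            x = min(max(self.hunters[i][0] + dx, 0), self.n - 1)
            y = min(max(self.hunters[i][1] + dy, 0), self.n - 1)
            self.hunters[i] = (x, y)
            if (x, y) in self.rabbits:
                self.rabbits.remove((x, y))
                self.captured[i] = True
                # Capture reward, discounted by the elapsed number of time steps.
                reward += self.capture_reward * (self.gamma ** self.t)
        self.t += 1
        done = not self.rabbits or self.t >= self.max_steps
        return self._state(), reward, done

    def _state(self):
        # Simplified flat encoding of hunter and remaining-rabbit coordinates.
        return [c for pos in self.hunters + sorted(self.rabbits) for c in pos]

env = HuntersAndRabbits()
state = env.reset()
state, reward, done = env.step([random.randrange(9) for _ in range(env.m)])
```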

Comparison of different entropy estimates for IS, LSTM, MMDP and CommNet

Table 1 shows the performance of the IS, LSTM, MMDP and CommNet models with the different entropy estimates. Training and evaluation were performed on a 5 × 5 grid with 5 hunters and 5 rabbits. Results are averaged over 5 seeds. For each seed, training and evaluation were run for 1 million and 1 thousand episodes, respectively.

As compared with no entropy bonus, the crude entropy estimate can actually reduce performance. However, the smoothed entropy and smoothed mode entropy always increase performance, often significantly. For the LSTM model, the best-performing approximation is the smoothed entropy, which substantially reduces the mean episode length and increases the mean episode reward compared with training without an entropy bonus. We also note that there is not a significant difference in performance between the smoothed entropy, smoothed mode entropy, and unbiased gradient approaches. When comparing the four models, we see that the LSTM model with smoothed entropy does significantly better than the other three models. The CommNet model could potentially be improved by allowing the hunters to see more of the state; this could be investigated in future research.

Columns, left to right: Without Entropy, Crude Entropy, Smoothed Entropy, Smoothed Mode Entropy, Unbiased Gradient Estimate (values reported as mean ± variation across the 5 seeds).

Mean Episode Length
IS:      98.7 ± 78.9 | 32 ± 12.3   | 11.8 ± 1.9 | 11.8 ± 1.9 | 11.8 ± 1.9
LSTM:    10.1 ± 1.9  | 19 ± 8.7    | 6.0 ± 0.2  | 6.0 ± 0.1
MMDP:    21.5 ± 3.7  | 37.3 ± 29.6 | 10.6 ± 0.7 | 10.6 ± 0.7 | 9.8 ± 0.6
CommNet: 22.7 ± 0.6  | 22.3 ± 0.4  | 21.9 ± 0.4 | 21.9 ± 0.4 | 21.9 ± 0.4

Mean Episode Reward
IS:      2.2 ± 0.03 | 2.4 ± 0.05 | 2.7 ± 0.01 | 2.7 ± 0.01 | 2.7 ± 0.01
LSTM:    3.0 ± 0.06 | 3.0 ± 0.03 | 3.2 ± 0.04 | 3.2 ± 0.02
MMDP:    2.8 ± 0.03 | 2.7 ± 0.03 | 2.9 ± 0.03 | 2.8 ± 0.04 | 2.9 ± 0.02
CommNet: 2.5 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01

Table 1: Performance of IS, LSTM, MMDP and CommNet across different entropy approximations.

The smoothed estimator is also more robust with respect to the initial seed than training without entropy, as shown in Figure 2. For example, for the LSTM model without entropy, seed 0 leads to significantly worse results than seeds 1-4. This does not happen with the smoothed estimator.

(a) IS    (b) LSTM    (c) MMDP (each panel compares training without entropy and with the smoothed entropy)
Figure 2: IS, LSTM and MMDP results across 5 seeds (y-axis denotes mean episode length).

Entropy approximations versus exact entropy

We now consider how policies trained with entropy approximations compare with policies trained with the exact entropy. In order to calculate the exact entropy in an acceptable amount of time, we reduced the problem to 4 hunters and 4 rabbits. Training was run for 50,000 episodes. Table 2 shows the performance of policies trained with entropy approximations and with the exact entropy. We see that the best entropy approximations perform only slightly worse than the exact entropy for both LSTM and MMDP. Once again we see that the LSTM model performs better than the MMDP model.

Columns, left to right: LSTM Smoothed Entropy, LSTM Exact Entropy, MMDP Unbiased Gradient Estimate, MMDP Exact Entropy.
Mean Episode Length: 9.0 ± 0.3 | 11.5 ± 0.3 | 10.7 ± 0.4
Mean Episode Reward: 2.14 ± 0.02 | 2.01 ± 0.01 | 2.1 ± 0.01

Table 2: LSTM and MMDP results for entropy approximation versus exact entropy.

5.2 Multi-agent Multi-arm Bandits

We examine a multi-agent version of the standard multi-armed bandit problem, in which there are $m$ agents, each pulling one of $n$ arms, with $m < n$. The $i$-th arm generates a reward $r_i$. The total reward in a round is generated as follows. In each round, each agent chooses an arm. All of the chosen arms are then pulled, with each pulled arm generating a reward. Note that the total number of arms chosen may be less than $m$, since some agents may choose the same arm. The total reward is the sum of the rewards from the chosen arms. The optimal policy is for the agents to collectively pull the $m$ arms with the highest rewards. Additionally, among all the optimal assignments of agents to the arms that yield the highest reward, we add a bonus reward with probability $p$ if one particular agents-to-arms configuration is chosen.

We performed experiments with 4 agents and 10 arms, with the $i$-th arm providing a reward of $i$. The exceptional assignment gets a bonus of 166 (making a total reward of 200) with probability 0.01, and no bonus with probability 0.99. Thus the maximum expected reward is 35.66. Training was run for 100,000 rounds for each of 10 seeds. Table 3 shows average results for the last 500 of the 100,000 rounds.
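A sketch of one round's reward under this setup; the identification of the $i$-th arm's reward with $i$ is consistent with the stated maximum expected reward of 35.66, and the particular special configuration below is an arbitrary illustrative choice.

```python
import random

ARM_REWARD = {i: float(i) for i in range(1, 11)}  # arm i pays reward i
SPECIAL_CONFIG = (10, 9, 8, 7)                    # illustrative agents-to-arms assignment
BONUS, BONUS_PROB = 166.0, 0.01

def round_reward(chosen_arms):
    """Total reward for one round, given the arm chosen by each of the 4 agents."""
    reward = sum(ARM_REWARD[a] for a in set(chosen_arms))  # duplicate choices pay once
    if tuple(chosen_arms) == SPECIAL_CONFIG and random.random() < BONUS_PROB:
        reward += BONUS  # rare bonus for one particular optimal configuration
    return reward

# Expected reward of the special configuration: 34 + 0.01 * 166 = 35.66.
print(round_reward([10, 9, 8, 7]))
```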

Columns, left to right: Without Entropy, Crude Entropy, Smoothed Entropy, Unbiased Gradient Estimate (values reported as mean ± variation across the 10 seeds).
IS Mean Reward:   34.2 ± 1.3 | 34.4 ± 1.3 | 34.2 ± 1.3
LSTM Mean Reward: 34.9 ± 0.8 | 35.5 ± 1.1 | 35.9 ± 0.8
IS Percentage Optimal Config Pulled:   19.8 ± 39.7 | 29.7 ± 49.6 | 19.7 ± 39.7
LSTM Percentage Optimal Config Pulled: 39.8 ± 35.9 | 59.4 ± 35.7 | 95.0 ± 1.9
Table 3: Performance of IS and LSTM policy parameterizations.

The results for the multi-agent bandit problem are consistent with those for the hunter-rabbit problem. Policies obtained with the entropy approximations all perform better than policies obtained without entropy or with the crude entropy, particularly in the percentage of rounds in which the arms are pulled in the optimal configuration. Note that LSTM with the unbiased gradient estimator gives the best results.

6 Related Work

Metz et al. google_sdqn_cont_action recently and independently proposed the reformulation of the MDP into the MMDP and the LSTM policy parameterization. They inject noise into the action space to encourage exploration. Usunier et al. sc_episodic_explore use the MMDP and noise injection in the parameter space to achieve high performance in multi-agent StarCraft micro-management tasks. Instead of noise injection, we propose novel estimators for the entropy bonus, which is often used to encourage exploration in policy gradient.

While entropy regularization has mostly been used in policy gradient algorithms, Schulman et al. equivalence_pg_soft_q apply entropy regularization to Q-learning. They make an important observation about the equivalence between policy gradient and entropy-regularized Q-learning.

To the best of our knowledge, no prior work has dealt with approximating the policy entropy for MDPs with large multi-dimensional discrete action spaces. On the other hand, there have been many attempts to devise methods to encourage beneficial exploration for policy gradient. Nachum et al. urex modify the entropy term by adding weights to the log action probabilities, leading to a new optimization objective termed under-appreciated reward exploration.

Dulac-Arnold et al. drl_large_discrete_action embed discrete actions in a continuous space, pick actions in the continuous space, and map these actions back into the discrete space. However, their algorithm introduces a new hyper-parameter that requires tuning for every new task. Our approach involves no new hyper-parameters other than those normally used in deep learning.

The LSTM policy parameterization can be seen as the adaptation of sequence modeling techniques in supervised machine learning, such as in speech generation wave_net or machine translation lstm_translation to reinforcement learning, as was previously done in actor_critic_sequence_prediction .

7 Conclusion

In this paper, we developed several novel unbiased estimators for the entropy bonus and its gradient. We carried out experiments in two environments with large multi-dimensional action spaces. We found that the smoothed estimate of the entropy and the unbiased estimate of the entropy gradient can significantly increase performance with marginal additional computational cost.

Appendix A. Hyperparameters

Hyperparameters for hunter-rabbit game

For IS, the numbers of hidden layers for smoothed entropy, unbiased gradient estimate, crude entropy and without entropy are 1, 1, 5 and 7, respectively. The entropy weights for smoothed entropy, unbiased gradient estimate and crude entropy are 0.03, 0.02 and 0.01, respectively. The hyper-parameters for smoothed mode entropy are not listed since the smoothed mode entropy equals the smoothed entropy for IS.

For CommNet, the number of communication steps is 2 for without entropy, crude entropy, smoothed entropy and unbiased entropy gradient. The sizes of the policy hidden layer for without entropy, crude entropy, smoothed entropy and unbiased entropy gradient are 256, 256, 256 and 128, respectively. The entropy weights for crude entropy, smoothed entropy and unbiased entropy gradient are 0.04, 0.04 and 0.01, respectively. The policies were optimized using Adam adam with learning rate 3e-4. We found 3e-4 gives better performance than the learning rate of 3e-3 originally used in comm_net .

The LSTM policy has 128 hidden nodes. For the MMDP policy, the numbers of hidden layers for smoothed entropy, smoothed mode entropy, unbiased gradient estimate, crude entropy and without entropy are 5, 3, 3, 4 and 3, respectively. Each MMDP layer has 128 nodes. We parameterize the baseline in (1) with an FFN with one hidden layer of size 64. This network was trained using the first-visit Monte Carlo return to minimize the L1 loss between actual and predicted values of states visited during the episode.

Both the policies and the baseline are optimized after each episode with RMSprop rmsprops . The RHS of (2) is clipped before updating the policy parameters. The learning rates for the baseline, IS, LSTM and MMDP are , , , respectively.

To obtain the results in Table 1, the entropy weights for LSTM smoothed entropy, LSTM smoothed mode entropy, LSTM unbiased gradient estimate, LSTM crude entropy, MMDP smoothed entropy, MMDP smoothed mode entropy, MMDP unbiased gradient estimate and MMDP crude entropy are 0.02, 0.021, 0.031, 0.04, 0.02, 0.03, 0.03 and 0.01 respectively.

To obtain the results in Table 2, the entropy weights for LSTM smoothed entropy, LSTM exact entropy, MMDP unbiased gradient estimate and MMDP exact entropy are 0.03, 0.01, 0.03 and 0.01 respectively. The MMDP networks have three layers with 128 nodes in each layer. Experimental results are averaged over five seeds (0-4).

Hyperparameters for Multi-Agent Multi-Arm Bandits

The experiments were run with 4 agents and 10 arms. The reward of the $i$-th arm is $i$, for $i = 1,\dots,10$. The LSTM policy has 32 hidden nodes. The baseline in (1) is a truncated average of the reward over the last 100 rounds. The entropy weights for crude entropy, smoothed entropy and unbiased gradient estimate are 0.005, 0.001 and 0.003, respectively. The learning rates for without entropy, crude entropy, smoothed entropy and unbiased gradient estimate are 0.006, 0.008, 0.002 and 0.005, respectively. Experimental results are averaged over ten seeds.

Appendix B. Proof of Theorem 2

Theorem 2. If $A$ has a multivariate normal distribution with mean and covariance depending on $s$ and $\theta$, then $\hat{H}_{\mathrm{smooth}}(s, a) = H(s)$ for all $a$. Thus, the smoothed estimate of the entropy equals the exact entropy for a multivariate normal parameterization of the policy.

Proof.

We first note that if $X = (X_1, X_2)$ is multivariate normal, where $X_1$ and $X_2$ are random vectors with joint mean $(\mu_1, \mu_2)$ and covariance blocks $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$, then the conditional distribution of $X_2$ given $X_1 = x_1$ is normal with mean $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$ and covariance $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$. Observe that the covariance matrix of the conditional distribution does not depend on the value of $x_1$ applied_multi_stats .

Also note that for $X \sim \mathcal{N}(\mu, \Sigma)$, the entropy of $X$ takes the form

$H(X) = \tfrac{1}{2}\log\big((2\pi e)^{d}\,|\Sigma|\big),$

where $d$ is the dimension of $X$ and $|\Sigma|$ denotes the determinant of $\Sigma$. Therefore, the entropy of a multivariate normal random variable depends only on the covariance and not on the mean.

Because $A$ is multivariate normal, the distribution of $A^k$ given $A^{1:k-1} = a^{1:k-1}$ is normal with a covariance that does not depend on $a^{1:k-1}$. Therefore each term of $\hat{H}_{\mathrm{smooth}}(s, a)$ does not depend on $a^{1:k-1}$, and hence $\hat{H}_{\mathrm{smooth}}(s, a)$ does not depend on $a$. Combining this with the fact that $\hat{H}_{\mathrm{smooth}}(s, A)$ is an unbiased estimator for $H(s)$ gives $\hat{H}_{\mathrm{smooth}}(s, a) = H(s)$ for all $a$. ∎

Appendix C. Proof of Theorem 4

Theorem 4. If $a$ is a sample from $\pi_\theta$, then the smoothed estimator of the entropy gradient, obtained by adding a correction term to the gradient of the smoothed estimate $\hat{H}_{\mathrm{smooth}}$, is an unbiased estimator for the gradient of the entropy.

Proof.

From Equation (3), the gradient of the entropy can be expanded into a double sum over the action dimensions; we refer to this expansion as (4). We now use conditional expectations to calculate the terms in the double sum, treating the terms in three separate cases. Combining the three resulting conditional expectations with (4), we obtain that the expectation of the estimator equals the gradient of the entropy.

Alternatively, Theorem 4 could also be proven by applying Theorem 1 of schulman2015gradient . ∎

Appendix D. State Representation For CommNet

Sukhbaatar et al. comm_net propose CommNet to handle multi-agent environments in which each agent observes only part of the state and the number of agents changes throughout an episode. We therefore modify the state representation of the hunters-and-rabbits environment to better reflect the strengths of CommNet: each hunter sees only its own id, its own position, and the positions of all the rabbits. More precisely, the state each hunter receives is [hunter id, hunter position, all rabbit positions].

Acknowledgements

We would like to thank Martin Arjovsky for his input and suggestions at both the early and latter stages of this research. Our gratitude also goes to the HPC team at NYU, NYU Shanghai, and NYU Abu Dhabi.

References