Reinforcement learning (RL) tackles sequential decision making problems by formulating them as tasks where an agent must learn how to act optimally through trial and error interactions with the environment. The goal is to maximize the sum of the numerical reward signal observed at each time step. These rewards are usually provided to the agent by the environment, either continuously or sparsely. Here we focus on the problem of exploration in RL, which aims to reduce the number of interactions an agent needs in order to learn to perform well.
The most common approach to exploration in the absence of any knowledge about the environment is to perform random actions. As knowledge is gained, the agent can use it to improve its performance by taking greedy actions, while retaining some chance of choosing random actions to explore the environment further (epsilon-greedy exploration). However, if rewards are sparse or not sufficiently informative to allow performance improvements, epsilon-greedy fails to explore sufficiently far. Several methods have been described that bias the agent’s actions towards novelty, mostly by using optimistic initialization or curiosity signals.
In particular, we opt for empowerment (Klyubin et al., 2005), an information-theoretic formulation of the agent’s influence on the near future. The value of a state, its empowerment value, is given by the maximum mutual information between a control input and the successor state the agent could achieve.
Mutual information is known to be very hard to calculate. Thankfully, recent advances in neural estimation (Belghazi et al., 2018) enable effective computation of mutual information between high-dimensional input/output pairs of deep neural networks, and in this work we leverage these techniques to calculate the empowerment of a state. We then use this quantity as an intrinsic reward signal to train a DQN agent on Montezuma’s Revenge.
2 Empowerment-Driven Exploration
Our agent is composed of two networks: a reward generator that outputs an empowerment-driven intrinsic reward signal and a policy that outputs a sequence of actions to maximize that reward signal. In addition to intrinsic rewards, the agent may optionally receive some extrinsic reward from the environment. Let the intrinsic reward generated by the agent at time $t$ be $r^i_t$ and the extrinsic reward be $r^e_t$. The policy sub-system is trained to maximize the sum of these two rewards, $r_t = r^i_t + r^e_t$, with $r^e_t$ mostly (if not always) zero.
The empowerment value of a state is defined as the channel capacity between the action and the following state (Klyubin et al., 2005),
$$\mathcal{E}(s) = \max_{\pi} I(s', a \mid s),$$
where $I$ is the mutual information and $\pi$ is the empowerment-maximizing policy. Empowerment is therefore the maximum information an agent can transfer to its environment by changing the next state through its actions.
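This mutual information can also be written as a KL divergence between the joint distribution and the product of its marginals,
$$\begin{aligned}
I(s', a \mid s) &= D_{KL}\big(p(s', a \mid s) \,\|\, p(s' \mid s)\,\pi(a \mid s)\big) \\
&= \iint p(s', a \mid s) \ln \frac{p(s', a \mid s)}{p(s' \mid s)\,\pi(a \mid s)} \, ds' \, da.
\end{aligned}$$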
To compute the mutual information, we use the formulation of Belghazi et al. (2018).
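As a small worked example of the KL form above, the following sketch computes the mutual information exactly for a discrete toy problem. The table layout `joint[a][j]` (standing for $p(s' = j, a \mid s)$ at a fixed state $s$) and the two toy distributions are assumptions made for illustration; they are not taken from the experiments.

```python
import math

def mutual_information(joint):
    """I(s', a | s) in nats for a table joint[a][j] = p(s'=j, a | s)."""
    n_actions = len(joint)
    n_states = len(joint[0])
    # Marginals: pi(a | s) sums over next states, p(s' | s) sums over actions.
    pi = [sum(joint[a]) for a in range(n_actions)]
    p_next = [sum(joint[a][j] for a in range(n_actions)) for j in range(n_states)]
    mi = 0.0
    for a in range(n_actions):
        for j in range(n_states):
            p = joint[a][j]
            if p > 0.0:  # the term 0 * ln(0) is taken to be 0
                mi += p * math.log(p / (pi[a] * p_next[j]))
    return mi

# Hypothetical case 1: each action leads to its own next state,
# so the next state fully reveals the action.
deterministic = [[0.5, 0.0], [0.0, 0.5]]
# Hypothetical case 2: the next state does not depend on the action.
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_det = mutual_information(deterministic)  # equals ln 2
mi_ind = mutual_information(independent)    # equals 0
```

In the first case the agent has full control over the next state and the mutual information is $\ln 2$ nats; in the second its actions have no influence and the mutual information is zero.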
2.2 Mutual Information Neural Estimation
Mutual Information Neural Estimation (MINE; Belghazi et al., 2018) learns a neural estimate of the mutual information between continuous variables and is strongly consistent. It can be used to learn the empowerment value of a state by using a forward dynamics model to obtain samples from the marginal distribution of $s'$, $p(s' \mid s)$, and the policy, $\pi(a \mid s)$.
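Following Belghazi et al. (2018), we train a discriminator (a classifier) to distinguish between samples coming from the joint distribution, $p(s', a \mid s)$, and samples coming from the marginal distributions, $p(s' \mid s)$ and $\pi(a \mid s)$. MINE relies on a lower bound on the mutual information based on the Donsker-Varadhan representation of the KL divergence (Donsker and Varadhan, 1983),
$$\begin{aligned}
I(s', a \mid s) &= D_{KL}\big(p(s', a \mid s) \,\|\, p(s' \mid s)\,\pi(a \mid s)\big) \\
&\geq \mathbb{E}_{p(s', a \mid s)}[T_\omega] - \log \mathbb{E}_{p(s' \mid s)\,\pi(a \mid s)}\big[e^{T_\omega}\big],
\end{aligned}$$
where $T_\omega$ is the statistics network with parameters $\omega$.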
We use this estimate of the mutual information (the empowerment of the state $s$) as the intrinsic reward to train the policy. The policy is therefore encouraged to select actions that provide the maximum mutual information.
Note that the distribution $p(s' \mid s)$ is computed using the forward dynamics model, which is discussed in the following section.
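To illustrate how the Donsker-Varadhan bound behaves, the sketch below evaluates it exactly on a small discrete joint distribution and tightens it by gradient ascent. It makes two simplifying assumptions that differ from MINE proper: the statistics network is replaced by a lookup table `critic[a][j]`, and the expectations are computed exactly rather than estimated from samples. The toy joint table and all function names are hypothetical.

```python
import math

def marginals(joint):
    """Return (pi, p_next) for a table joint[a][j] = p(s'=j, a | s)."""
    pi = [sum(row) for row in joint]
    p_next = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    return pi, p_next

def exact_mi(joint):
    """Exact mutual information in nats, used only as a reference value."""
    pi, p_next = marginals(joint)
    return sum(p * math.log(p / (pi[a] * p_next[j]))
               for a, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0.0)

def dv_bound(joint, critic):
    """E_joint[T] - log E_marginals[exp(T)] for a tabular critic T."""
    pi, p_next = marginals(joint)
    e_joint = sum(p * critic[a][j]
                  for a, row in enumerate(joint) for j, p in enumerate(row))
    e_marg = sum(pi[a] * p_next[j] * math.exp(critic[a][j])
                 for a in range(len(pi)) for j in range(len(p_next)))
    return e_joint - math.log(e_marg)

def fit_critic(joint, steps=2000, lr=0.5):
    """Gradient ascent on the bound; the table stands in for T_omega."""
    pi, p_next = marginals(joint)
    critic = [[0.0] * len(p_next) for _ in pi]
    for _ in range(steps):
        z = sum(pi[a] * p_next[j] * math.exp(critic[a][j])
                for a in range(len(pi)) for j in range(len(p_next)))
        for a in range(len(pi)):
            for j in range(len(p_next)):
                # d(bound)/dT[a][j] = joint - exponentially tilted marginals
                tilted = pi[a] * p_next[j] * math.exp(critic[a][j]) / z
                critic[a][j] += lr * (joint[a][j] - tilted)
    return critic

joint = [[0.4, 0.1], [0.1, 0.4]]  # hypothetical toy distribution
untrained = dv_bound(joint, [[0.0, 0.0], [0.0, 0.0]])
trained = dv_bound(joint, fit_critic(joint))
target = exact_mi(joint)
```

With an all-zero critic the bound is zero; after training it approaches the exact mutual information from below, which is the property that makes the bound usable as a training objective.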
2.3 Forward dynamics model
The forward dynamics model, $p(s' \mid s, a)$, is used to sample from the marginal distribution, $p(s' \mid s)$, by marginalizing out the actions,
$$p(s' \mid s) = \int p(s' \mid s, a)\, \pi(a \mid s)\, da.$$
The dynamics model is trained simultaneously with the statistics network, $T_\omega$, and the policy, $\pi$. Predicting in raw pixel space is not ideal, since pixels are hard to predict directly. We therefore use a random feature space, as in Burda et al. (2018), to train the forward dynamics model as well as the policy, since random features were shown to be sufficient for representing Atari game frames.
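Marginalizing out the actions can be sketched as ancestral sampling: draw an action from the policy, draw a next state from the forward model given that action, and discard the action. The example below assumes a discrete toy setting in which the policy and the forward model are plain probability tables; in the agent described here both are neural networks operating on encoded observations, so the tables and function names are purely illustrative.

```python
import random

def exact_marginal(policy, forward):
    """p(s' | s) = sum_a pi(a | s) * p(s' | s, a) for tabular distributions."""
    n_states = len(forward[0])
    return [sum(policy[a] * forward[a][j] for a in range(len(policy)))
            for j in range(n_states)]

def sample_marginal(policy, forward, rng):
    """Draw s' ~ p(s' | s) by sampling an action first and then dropping it."""
    action = rng.choices(range(len(policy)), weights=policy)[0]
    return rng.choices(range(len(forward[action])), weights=forward[action])[0]

# Hypothetical tables for one fixed state s: three actions, three next states.
policy = [0.5, 0.3, 0.2]            # pi(a | s)
forward = [[0.9, 0.1, 0.0],         # p(s' | s, a=0)
           [0.1, 0.8, 0.1],         # p(s' | s, a=1)
           [0.0, 0.2, 0.8]]         # p(s' | s, a=2)

rng = random.Random(0)
n_samples = 20000
counts = [0, 0, 0]
for _ in range(n_samples):
    counts[sample_marginal(policy, forward, rng)] += 1

empirical = [c / n_samples for c in counts]
exact = exact_marginal(policy, forward)
```

The empirical frequencies of the sampled next states agree with the exact marginal up to sampling noise, which is all the lower bound needs: samples from the product of marginals, not its density.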
3 Experimental Setup
The proposed empowerment-driven DQN agent is composed of a policy network, the DQN, and an intrinsic reward generator that is composed of the statistics network and the forward dynamics model. The implementation uses PyTorch (Paszke et al., 2017). Inputs are a stack of four 84 x 84 grayscale frames. All observations are encoded to a 64-dimensional encoding using a random convolutional encoder. All other networks share the same architecture but are separate neural networks. We used double Q-learning with target network updates every 2000 steps and an experience replay buffer with a capacity of 1,000,000 steps. Training of the models starts at 1000 steps and follows Algorithm 1. All three networks are trained with a batch size of 64, using Adam optimizers with learning rates of 1e-2, 1e-3 and 1e-4 for the forward dynamics, statistics and policy networks respectively.
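For reference, the hyperparameters listed above can be collected in a single configuration object. This is a minimal sketch: the class name, field names and the two helper methods are hypothetical, and the assumption that training begins at step 1000 inclusive is an interpretation of the text rather than a confirmed detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmpowermentDQNConfig:
    """Hyperparameters as stated in the text; all names are illustrative."""
    frame_stack: int = 4               # stacked grayscale frames per input
    frame_size: int = 84               # frames are 84 x 84
    encoding_dim: int = 64             # output size of the random encoder
    batch_size: int = 64
    replay_capacity: int = 1_000_000   # experience replay buffer size
    target_update_every: int = 2_000   # steps between target network syncs
    learning_starts: int = 1_000       # step at which training begins
    lr_forward_dynamics: float = 1e-2
    lr_statistics: float = 1e-3
    lr_policy: float = 1e-4

    def should_train(self, step: int) -> bool:
        # Assumption: updates begin once `learning_starts` steps have passed.
        return step >= self.learning_starts

    def should_sync_target(self, step: int) -> bool:
        # Assumption: the target network is copied on exact multiples.
        return step % self.target_update_every == 0

config = EmpowermentDQNConfig()
```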
The DQN agent predicts 18 Q-values corresponding to the 18 actions in ALE, uses ReLU activations and a discount factor of 0.99. The extrinsic reward is clipped to [-1, 1] and the gradients of the temporal difference loss are clamped to [-1, 1]. The value used for is 0.1.
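The reward handling and the double Q-learning target can be sketched as follows. This is an illustration under stated assumptions, not the exact update used in the experiments: the function names are hypothetical, Q-values are plain lists instead of network outputs, and the extrinsic reward is assumed to be clipped before it is added to the intrinsic reward.

```python
def clip(value, low=-1.0, high=1.0):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))

def total_reward(r_intrinsic, r_extrinsic):
    """r_t = r^i_t + r^e_t, with the extrinsic part clipped to [-1, 1]."""
    return r_intrinsic + clip(r_extrinsic)

def double_q_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double Q-learning target for a single transition.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation of plain Q-learning.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]

# Hypothetical transition with three actions instead of the 18 ALE actions.
reward = total_reward(r_intrinsic=0.2, r_extrinsic=100.0)  # 0.2 + 1.0
next_q_online = [0.1, 0.7, 0.3]   # online network picks action 1
next_q_target = [0.5, 0.4, 0.9]   # target network scores action 1 at 0.4
target = double_q_target(reward, next_q_online, next_q_target, done=False)
terminal_target = double_q_target(reward, next_q_online, next_q_target, done=True)
```

Here the target is built from the target network's value for the action chosen by the online network (0.4), not from the target network's own maximum (0.9).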
The DQN agent trained with the empowerment intrinsic motivation is able to consistently exit the first room and gather rewards, whereas the agent trained only on the game's reward signal fails to receive any reward.
Owing to computational limits, the size of the environment encoding was limited to 64, which could prove insufficient for the Atari environment. This parameter needs to be investigated further in future work.
4 Conclusion and future work
The experiments show that empowerment, calculated using mutual information neural estimation, can serve as an intrinsic motivator that helps an agent to consistently achieve rewards.
Compared to an agent that receives only the external reward signal from the game, the empowerment-driven agent consistently achieves the rewards in the first level of Montezuma’s Revenge and enters the second room.
Using empowerment as an intrinsic motivator has also been explored in previous research, but this work has the following two advantages. First, empowerment is calculated and maximized using stochastic gradient descent. Second, we use only the mutual information as the intrinsic reward and simply update the agent using Q-learning. No new source policy distributions are introduced, which makes the algorithm easy to implement.
Future work includes testing the method on the entire Atari game suite as well as increasing the model and embedding sizes.
- Klyubin et al. (2005) Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. All else being equal be empowered. In Mathieu S. Capcarrère, Alex A. Freitas, Peter J. Bentley, Colin G. Johnson, and Jon Timmis, editors, Advances in Artificial Life, pages 744–753, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31816-3.
- Belghazi et al. (2018) Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R. Devon Hjelm, and Aaron C. Courville. MINE: Mutual information neural estimation. CoRR, abs/1801.04062, 2018. URL http://arxiv.org/abs/1801.04062.
- Donsker and Varadhan (1983) M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983. doi: 10.1002/cpa.3160360204. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160360204.
- Burda et al. (2018) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. arXiv:1808.04355, 2018.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.