Empowerment-driven Exploration using Mutual Information Estimation

10/11/2018, by Navneet Madhu Kumar, et al.

Exploration is a difficult challenge in reinforcement learning and is of prime importance in sparse-reward environments. However, many state-of-the-art deep reinforcement learning algorithms that rely on epsilon-greedy exploration fail in these environments. In such cases, empowerment can serve as an intrinsic reward signal that enables the agent to maximize the influence it has over the near future. We formulate empowerment as the channel capacity between actions and the following states and calculate it by estimating the mutual information between the actions and the states they lead to. The mutual information is estimated using the Mutual Information Neural Estimator and a forward dynamics model. We demonstrate that an empowerment-driven agent significantly improves the score of a baseline DQN agent on the game of Montezuma's Revenge.


1 Introduction

Reinforcement learning (RL) tackles sequential decision making problems by formulating them as tasks where an agent must learn how to act optimally through trial and error interactions with the environment. The goal is to maximize the sum of the numerical reward signal observed at each time step. These rewards are usually provided to the agent by the environment, either continuously or sparsely. Here we focus on the problem of exploration in RL, which aims to reduce the number of interactions an agent needs in order to learn to perform well.

The most common approach to exploration in the absence of any knowledge about the environment is to perform random actions. As knowledge is gained, the agent can use it to increase its performance by taking greedy actions, while retaining some chance to choose random actions to further explore the environment (epsilon-greedy exploration). However, if rewards are sparse or not sufficiently informative to allow performance improvements, epsilon-greedy fails to explore sufficiently far. Several methods have been proposed that bias the agent's actions towards novelty, mostly by using optimistic initialization or curiosity signals.
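For concreteness, here is a minimal sketch of epsilon-greedy action selection over a vector of Q-values; the function and variable names are illustrative, not taken from the paper.

```python
import random

import torch


def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: 1-D tensor of Q-value estimates, one entry per action.
    """
    if random.random() < epsilon:
        return random.randrange(q_values.shape[0])  # explore
    return int(q_values.argmax().item())            # exploit
```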

In particular, we opt for empowerment Klyubin et al. [2005], an information-theoretic formulation of the agent's influence on the near future. The value of a state, its empowerment value, is given by the maximum mutual information between a control input and the resulting successor state that the agent can achieve.

Mutual information is known to be very hard to calculate. Thankfully, recent advances in neural estimation Belghazi et al. [2018] enable effective estimation of the mutual information between high-dimensional input/output pairs of deep neural networks. In this work we leverage these techniques to calculate the empowerment of a state and use this quantity as an intrinsic reward signal to train a DQN on Montezuma's Revenge.

2 Empowerment-Driven Exploration

Our agent is composed of two networks: a reward generator that outputs an empowerment-driven intrinsic reward signal, and a policy that outputs a sequence of actions to maximize that reward signal. In addition to intrinsic rewards, the agent may optionally also receive some extrinsic reward from the environment. Let the intrinsic reward generated by the agent at time $t$ be $r^i_t$ and the extrinsic reward be $r^e_t$. The policy sub-system is trained to maximize the sum of these two rewards, $r_t = r^i_t + r^e_t$, with $r^e_t$ mostly (if not always) zero.

2.1 Empowerment

The empowerment value of a state $s$ is defined as the channel capacity between the action and the following state Klyubin et al. [2005],

$$\mathcal{E}(s) = \max_{\pi} I(a; s' \mid s),$$

where $I$ is the mutual information and $\pi$ is the empowerment-maximizing policy. Empowerment, therefore, is the maximum information an agent can transfer to its environment by changing the next state through its actions.

The mutual information in Section 2.1 can further be represented in the form of a KL divergence,

$$I(a; s' \mid s) = D_{\mathrm{KL}}\big(p(s', a \mid s) \,\|\, p(s' \mid s)\, \pi(a \mid s)\big) = \iint p(s', a \mid s) \ln \frac{p(s', a \mid s)}{p(s' \mid s)\, \pi(a \mid s)} \, da \, ds'.$$

To compute the mutual information, we use the formulation of Belghazi et al. [2018].

2.2 Mutual Information Neural Estimation

Mutual Information Neural Estimation (MINE, Belghazi et al. [2018]) learns a neural estimate of the mutual information of continuous variables and is strongly consistent. It can be used to learn the empowerment value of a state by using a forward dynamics model to obtain samples from the marginal distribution of the next state, $p(s' \mid s)$, and the policy, $\pi(a \mid s)$.

Following Belghazi et al. [2018], we train a discriminator (the statistics network $T_\omega$) to distinguish between samples coming from the joint distribution, $p(s', a \mid s)$, and the product of the marginal distributions, $p(s' \mid s)$ and $\pi(a \mid s)$. MINE relies on a lower bound on the mutual information based on the Donsker and Varadhan [1983] representation of the KL divergence,

$$I(s'; a \mid s) = D_{\mathrm{KL}}\big(p(s', a \mid s) \,\|\, p(s' \mid s)\,\pi(a \mid s)\big) \ge \mathbb{E}_{p(s', a \mid s)}[T_\omega] - \log \mathbb{E}_{p(s' \mid s)\,\pi(a \mid s)}\big[e^{T_\omega}\big].$$
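As a rough illustration of how this lower bound can be computed in practice, the sketch below defines a statistics network $T_\omega$ and evaluates the Donsker-Varadhan estimate on a batch, scoring "joint" triples $(s, a, s')$ against triples whose next state is drawn from the marginal $p(s' \mid s)$. The network architecture and input encoding are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class StatisticsNetwork(nn.Module):
    """T_omega: assigns a scalar score to a (state, action, next_state) triple."""

    def __init__(self, feat_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a_onehot, s_next):
        return self.net(torch.cat([s, a_onehot, s_next], dim=-1))


def dv_lower_bound(T, s, a_onehot, s_next_joint, s_next_marginal):
    """Donsker-Varadhan estimate: E_joint[T] - log E_marginal[exp(T)]."""
    t_joint = T(s, a_onehot, s_next_joint)      # next states paired with their actions
    t_marg = T(s, a_onehot, s_next_marginal)    # next states drawn independently of a
    return t_joint.mean() - torch.log(torch.exp(t_marg).mean() + 1e-8)
```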

We use this estimate of the mutual information (i.e., the empowerment value of the state $s$) as the intrinsic reward to train the policy. The policy is, therefore, encouraged to predict actions which provide the maximum mutual information.

One thing to note is that the marginal distribution, $p(s' \mid s)$, is obtained using the forward dynamics model, which is discussed in the following section.

2.3 Forward dynamics model

The forward dynamics model, $f$, is used to sample from the marginal distribution, $p(s' \mid s)$, by marginalizing out the actions,

$$p(s' \mid s) = \sum_{a} \pi(a \mid s)\, f(s' \mid s, a).$$

The dynamics model is trained simultaneously with the statistics network, $T_\omega$, and the policy, $\pi$. Predicting in raw pixel space is not ideal, since pixels are hard to predict directly. We therefore use the random feature space of Burda et al. [2018] to train the forward dynamics model as well as the policy, since random features were shown to be sufficient for representing Atari game frames.
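Below is a minimal sketch of a frozen random convolutional encoder and a forward dynamics model operating in its feature space. The paper only specifies four stacked 84 x 84 grayscale frames and a 64-dimensional random encoding; the layer sizes and the deterministic feature-prediction form of the forward model are assumptions.

```python
import torch
import torch.nn as nn


class RandomEncoder(nn.Module):
    """Fixed, randomly initialised convolutional encoder phi(s); never trained."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feat_dim),
        )
        for p in self.parameters():
            p.requires_grad_(False)  # random features stay fixed

    def forward(self, frames):  # frames: (B, 4, 84, 84)
        return self.conv(frames.float() / 255.0)


class ForwardModel(nn.Module):
    """f(phi(s), a) -> predicted phi(s'), trained with a squared-error loss."""

    def __init__(self, feat_dim: int = 64, num_actions: int = 18, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, phi_s, a_onehot):
        return self.net(torch.cat([phi_s, a_onehot], dim=-1))
```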

Require: accumulation horizon $K$; initializations for the policy $\pi$, the statistics network $T_\omega$, and the forward dynamics model $f$
repeat
     for $k = 1, \dots, K$ do
          Sample a batch of transitions from the replay buffer
          for each transition $(s, a, r^e, s')$ do
               Update $f$ to reduce the forward prediction error
               Update $T_\omega$ to increase the mutual information lower bound $\mathbb{E}_{p(s', a \mid s)}[T_\omega] - \log \mathbb{E}_{p(s' \mid s)\,\pi(a \mid s)}[e^{T_\omega}]$
               Update the reward for the transition, $r = r^e + \hat{I}(s'; a \mid s)$
          end for
          Update the agent using the Bellman update $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
     end for
until convergence
Algorithm 1: Joint training of the policy $\pi$, the statistics network $T_\omega$, and the forward dynamics model $f$
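A condensed PyTorch-style sketch of one inner step of Algorithm 1, assuming the encoder, forward model, and statistics network sketched above plus a DQN head over the random features. Buffer handling, target-network updates, and the way marginal next states are produced (uniform action resampling through the forward model) are simplifications and assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def train_step(batch, encoder, fwd_model, stats_net, q_net, q_target,
               opt_f, opt_T, opt_q, num_actions=18, gamma=0.99):
    """One inner-loop step of Algorithm 1 on a sampled batch (simplified sketch)."""
    s, a, r_ext, s_next = batch                        # a: (B,) int64, rewards: (B,)
    phi_s, phi_next = encoder(s), encoder(s_next)      # frozen random features
    a_onehot = F.one_hot(a, num_actions).float()

    # 1) Update f to reduce the forward prediction error in feature space.
    f_loss = F.mse_loss(fwd_model(phi_s, a_onehot), phi_next)
    opt_f.zero_grad(); f_loss.backward(); opt_f.step()

    # 2) Update T_omega to increase the Donsker-Varadhan lower bound. Marginal
    #    next states come from the forward model with resampled actions
    #    (uniform resampling here is a simplification of sampling from the policy).
    with torch.no_grad():
        a_marg = torch.randint(0, num_actions, a.shape, device=a.device)
        phi_marg = fwd_model(phi_s, F.one_hot(a_marg, num_actions).float())
    t_joint = stats_net(phi_s, a_onehot, phi_next)
    t_marg = stats_net(phi_s, a_onehot, phi_marg)
    mi = t_joint.mean() - torch.log(torch.exp(t_marg).mean() + 1e-8)
    opt_T.zero_grad(); (-mi).backward(); opt_T.step()

    # 3) Combined reward: extrinsic plus the MI estimate as the intrinsic term
    #    (a single batch-level estimate is broadcast here for simplicity).
    reward = r_ext + mi.detach()

    # 4) Bellman update of the DQN over the random features.
    with torch.no_grad():
        target = reward + gamma * q_target(phi_next).max(dim=1).values
    q_sa = q_net(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = F.smooth_l1_loss(q_sa, target)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()
```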

3 Experimental Setup

3.1 Agent

The proposed empowerment-driven DQN agent is composed of a policy network (the DQN) and an intrinsic reward generator, which consists of the statistics network $T_\omega$ and the forward dynamics model $f$. The implementation uses PyTorch Paszke et al. [2017]. Inputs are a stack of four 84 x 84 grayscale frames. All observations are encoded into a 64-dimensional embedding using a random convolutional encoder. All other networks share the same architecture but use separate parameters. We use double Q-learning with target network updates every 2000 steps and an experience replay buffer with a capacity of 1,000,000 transitions. Training of the models starts after 1000 steps and follows Algorithm 1. All three networks are trained with a batch size of 64, using Adam optimizers with learning rates of 1e-2, 1e-3 and 1e-4 for the forward dynamics, statistics and policy networks respectively.
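The optimizer configuration reported above could be set up as follows; this reuses the hypothetical ForwardModel and StatisticsNetwork classes sketched earlier, and the simple Q-head over the 64-dimensional random features is likewise an assumption (only the batch size, learning rates, and action count come from the text).

```python
import torch
import torch.nn as nn

feat_dim, num_actions, batch_size = 64, 18, 64           # values reported in the text

fwd_model = ForwardModel(feat_dim, num_actions)           # hypothetical class, Sec. 2.3 sketch
stats_net = StatisticsNetwork(feat_dim, num_actions)      # hypothetical class, Sec. 2.2 sketch
q_net = nn.Sequential(                                    # assumed Q-head over random features
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

opt_f = torch.optim.Adam(fwd_model.parameters(), lr=1e-2)   # forward dynamics model
opt_T = torch.optim.Adam(stats_net.parameters(), lr=1e-3)   # statistics network
opt_q = torch.optim.Adam(q_net.parameters(), lr=1e-4)       # policy network (DQN)
```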

The DQN agent predicts 18 Q-values corresponding to the 18 actions in the ALE, uses ReLU activations, and has a discount factor of 0.99. The extrinsic reward is clipped to [-1, 1] and the gradients of the temporal difference loss are clamped to [-1, 1]. The remaining hyperparameter is set to 0.1.

3.2 Results

The DQN agent trained with the empowerment-based intrinsic motivation is consistently able to exit the first room and gather rewards, whereas the agent trained only on the game's reward signal fails to receive any reward.

Figure 1: Rewards in Montezuma’s Revenge

Owing to computational limits, the size of the environment encoding was limited to 64, which could prove insufficient for the Atari environment. This is a parameter that needs to be investigated further in future work.

4 Conclusion and future work

The experiments show that using empowerment, calculated using mutual information neural estimation, as an intrinsic motivator can help an agent to consistently achieve rewards.

Compared to an agent that receives only the external reward signal from the game, the empowerment-driven agent is able to consistently achieve the rewards in the first level of Montezuma's Revenge and enter the second room.

Using empowerment as an intrinsic motivator has also been explored in previous work, but our approach offers two advantages. First, empowerment is calculated and maximized using stochastic gradient descent. Second, we use the mutual information directly as the intrinsic reward and simply update the agent with Q-learning; no new source policy distributions are introduced, making the algorithm easy to implement.

Future work includes testing the method on the entire Atari game suite, as well as increasing the model and embedding sizes.

References