Multimodal Reward Shaping for Efficient Exploration in Reinforcement Learning

Maintaining long-term exploration ability remains one of the challenges of deep reinforcement learning (DRL). In practice, reward shaping-based approaches are leveraged to provide intrinsic rewards that motivate the agent to explore. However, most existing intrinsic reward modules rely on attendant models or additional memory to record and analyze learning procedures, which leads to high computational complexity and low robustness. Moreover, they overemphasize the influence of a single state on exploration and cannot evaluate exploration performance from a global perspective. To tackle this problem, state entropy-based methods have been proposed to encourage the agent to visit the state space more equitably. However, their estimation error and sample complexity are prohibitive in environments with high-dimensional observations. In this paper, we introduce a novel metric entitled Jain's fairness index (JFI) to replace the entropy regularizer, which requires no additional models or memory. In particular, JFI overcomes the vanishing intrinsic rewards problem and can be generalized to arbitrary tasks. Furthermore, we use a variational auto-encoder (VAE) model to capture the life-long novelty of states. The global JFI score and local state novelty are then combined to form a multimodal intrinsic reward, controlling the exploration extent more precisely. Finally, extensive simulation results demonstrate that our multimodal reward shaping (MMRS) method achieves higher performance than other benchmark schemes.




1 Introduction

Balancing the tradeoff between exploration and exploitation is a crucial problem in reinforcement learning (RL) sutton2018reinforcement . In general, learning the optimal policy requires the agent to visit all possible state-action pairs infinitely often. However, most existing RL algorithms have sophisticated exploitation mechanisms but poor exploration strategies. As a result, the policy may prematurely fall into local optima after finite steps and never improve again stadie2015incentivizing . Therefore, the critical problem is to maintain exploration throughout the whole learning procedure. To address the problem, a simple approach is to employ stochastic policies such as the $\epsilon$-greedy policy and Boltzmann exploration, which select every possible action with a non-zero probability in each state. Such techniques are likely to learn the optimal policy eventually in the tabular setting, but they tend to be futile in complex environments with high-dimensional observations.
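As a minimal illustration of the two stochastic policies mentioned above, the following sketch selects actions from a list of action values. The function names, the `temperature` parameter, and the example values are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over action values; every action keeps a non-zero probability."""
    top = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    q = [0.1, 0.9, 0.3]  # hypothetical action values
    print(epsilon_greedy(q, epsilon=0.1))
    print(boltzmann_probs(q, temperature=0.5))
```

Both rules explore by injecting randomness into action selection only, which is why they scale poorly when the state space is large.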

To cope with the exploration problem in complex tasks, reward shaping is leveraged to form multimodal rewards that improve exploration. More specifically, recent approaches provide intrinsic rewards to the agent to reward its exploration performance dayan2002reward . In sharp contrast to the extrinsic rewards given explicitly by the environment, intrinsic rewards represent the inherent learning motivation or curiosity of the agent, which is difficult to characterize and evaluate. Many prior works have been devoted to realizing computable intrinsic reward modules, and they can be broadly categorized into novelty-based and prediction error-based approaches. For instance, strehl2008analysis ; ostrovski2017count employed a state visitation function to measure the novelty of states. Such a visitation function assigns a higher bonus to infrequently-seen states, incentivizing the agent to revisit novel states and increasing the probability of learning a better policy. Methods in pathak2017curiosity ; yu2020intrinsic ; yuan2021hybrid ; stadie2015incentivizing followed the second idea, in which the prediction error of a dynamic model is utilized as the intrinsic reward. Given an observed transition, an attendant model is designed to predict the next state based on the current state-action pair. Then the intrinsic reward is computed as the Euclidean distance between the predicted next state and the true next state.
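A count-based novelty bonus of the kind described above can be sketched as follows. The inverse-square-root form and the scale `beta` are one common choice, assumed here for illustration; states are assumed to be hashable.

```python
import math
from collections import Counter


def count_based_bonuses(states, beta=1.0):
    """Return a bonus beta / sqrt(N(s)) for each visited state, in visiting order."""
    counts = Counter()
    bonuses = []
    for s in states:
        counts[s] += 1  # N(s) includes the current visit
        bonuses.append(beta / math.sqrt(counts[s]))
    return bonuses


if __name__ == "__main__":
    # The bonus for state "a" shrinks each time it is revisited.
    print(count_based_bonuses(["a", "b", "a", "a"]))
```

The example also shows why such bonuses shrink over time: the bonus of a state decreases monotonically with its visit count.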

All the methods above suffer from vanishing intrinsic rewards, i.e., the intrinsic rewards decrease with visits. Once a state is given a minimal intrinsic reward, the agent has no motivation to explore it further. To maintain the long-term exploration ability, badia2020never proposed a never-give-up (NGU) framework that learns combined intrinsic rewards consisting of episodic and life-long state novelty. NGU evaluates the episodic state novelty through a slot-based memory and a pseudo-count method bellemare2016unifying , encouraging the agent to visit more distinct states in each episode. Since the memory is wiped at the beginning of each episode, the intrinsic rewards do not decay with the training process. Meanwhile, NGU further introduced a random network distillation (RND) module to capture the life-long novelty of states burda2018exploration . By controlling the learning rate of RND, the life-long intrinsic reward can discourage the agent from visiting familiar states more gently. However, NGU has a complex architecture and high computational complexity, making it difficult to generalize to arbitrary environments. A simpler framework entitled rewarding impact-driven exploration (RIDE) is proposed in raileanu2020ride . RIDE first trains an embedding network to encode the state space following the inverse-forward pattern of pathak2017curiosity . Then the Euclidean distance between two consecutive encoded states is utilized as the intrinsic reward. It encourages the agent to take actions that result in more state changes, maintaining exploration and avoiding the television dilemma reported in savinov2018episodic .

However, both NGU and RIDE overemphasize the influence of a single state on exploration and cannot reflect the global exploration extent. Moreover, the aforementioned methods have poor mathematical interpretability and rely heavily on attendant models. To circumvent this problem, islam2019entropy proposed to maximize the entropy of the state distribution, forcing the agent to visit all the states more equitably. In particular, islam2019entropy estimates the state entropy using a variational auto-encoder (VAE) kingma2013auto . The VAE accepts the parameters of the policy as input and predicts the state distribution. Therefore, minimizing the reconstruction error of the VAE is equivalent to maximizing the estimate of the state entropy. Finally, the policy and the VAE model can be updated together through the policy gradient method. In zhang2018dissection , the Shannon entropy regularizer is further extended to the Rényi entropy to adapt to arbitrary tasks. To realize an efficient and stable entropy estimate, seo2021state proposed a random encoder for efficient exploration (RE3) framework that requires no representation learning. In each episode, the observations are collected and encoded using a fixed deep neural network (DNN). Then a $k$-nearest neighbor estimator singh2003nearest is leveraged to estimate the state entropy. Simulation results demonstrated that RE3 improved the sample efficiency of both model-free and model-based RL algorithms. However, it is difficult to choose an appropriate value of $k$, and the estimation error and sample complexity grow exponentially with the size of the state space.

Inspired by the discussions above, we consider replacing the entropy regularizer with a simpler and computable metric. Moreover, we aim to combine this metric with state novelty to realize a comprehensive exploration evaluation. In this paper, we propose multimodal reward shaping (MMRS), a model-free, memory-free, and generative-model empowered method for providing high-quality intrinsic rewards. The main contributions of this paper are summarized as follows:

  • We first analyze the sample complexity of the entropy-based intrinsic rewards. After that, a novel metric entitled Jain’s fairness index (JFI) is introduced to replace the entropy-regularizer, and we prove the utility equivalence between the two metrics. Furthermore, we dive into the employment of JFI both in tabular settings and environments with high-dimensional observations.

  • Since JFI evaluates the global exploration performance, the life-long state novelty is leveraged to form multimodal intrinsic rewards. In particular, we use a VAE to perform state embedding and capture the life-long novelty of states. Such a method requires no additional memory and avoids overfitting, which is more efficient and robust than RND.

  • Finally, extensive simulations are performed to compare MMRS with other state-of-the-art (SOTA) methods. Numerical results demonstrate that MMRS outperforms SOTA exploration methods with simpler architecture and higher robustness. Furthermore, we conduct qualitative analysis to show that MMRS eliminates the vanishing intrinsic rewards and maintains long-term exploration ability.

2 Problem Formulation

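We study the RL problem formulated as a Markov decision process (MDP), defined as follows sutton2018reinforcement :

Definition 1 (MDP).

The MDP can be defined as a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho_0, \gamma \rangle$, where:

  • $\mathcal{S}$ is the state space;

  • $\mathcal{A}$ is the action space;

  • $\mathcal{T}(s' \mid s, a)$ is the transition probability;

  • $r(s, a, s')$ is the reward function;

  • $\rho_0$ is the initial state distribution;

  • $\gamma \in [0, 1)$ is a discount factor.

Note that the reward function here conditions on a full transition $(s, a, s')$. Furthermore, we denote by $\pi(a \mid s)$ the policy of the agent, which observes the state of the environment before choosing an action from the action space. Given the MDP $\mathcal{M}$, the objective of RL is to find the optimal policy $\pi^{*}$ that maximizes the expected discounted return:

$$\pi^{*} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^{t} r_{t} \right], \tag{1}$$

where $r_{t} = r(s_{t}, a_{t}, s_{t+1})$, $\Pi$ is the set of all stationary policies, and $\tau = (s_{0}, a_{0}, \dots, a_{T-1}, s_{T})$ is the trajectory generated by the policy.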

In this paper, we aim to improve the exploration of the state space, in which the agent is expected to visit as many distinct states as possible within a limited learning procedure. Mathematically, the policy should provide an equitable visitation probability for all the states. To evaluate the exploration of single states, we define the following state visitation distribution (SVD):


where $P(\cdot)$ denotes the probability and $S_{t}$ is the random variable of the state at step $t$. Therefore, improving exploration is equivalent to maximizing the following entropy:


Finally, the RL objective is reformulated as:


It is difficult to solve this optimization problem directly due to the complex entropy term. The following section proposes a reward shaping-based method that replaces the entropy with a novel and simple metric.

3 State Visitation Entropy and Fairness Index

In this section, we first analyze the sample complexity of the original entropy-based intrinsic rewards. Then an alternative method is proposed for replacing the entropy regularizer.

3.1 Sample Complexity of Estimating Entropy

To compute the state entropy, we first need to estimate the SVD defined in Eq. (2). Given a trajectory $\tau$ generated by policy $\pi$, a reasonable estimate of the SVD can be formulated as:


where $\mathbb{1}(\cdot)$ is the indicator function. The following lemma indicates the sample complexity of such estimation:

Lemma 1.

Assume sampling for steps and estimating the SVD following Eq. (5), then with probability as , such that:


where $|\mathcal{S}|$ is the cardinality of the state space.

Furthermore, a reasonable estimate of Eq. (3) based on the Monte-Carlo method is:


See proof in Appendix A. ∎

Similar to the estimation of SVD, we can prove that:

Lemma 2.

Assume sampling for steps and estimating the entropy following Eq. (5), then with probability as , it holds:


See proof in Appendix B. ∎

Equipped with Lemma 1 and Lemma 2, we obtain the sample complexity of estimating the state entropy:


Eq. (9) demonstrates that the estimation error converges sub-linearly as $T \to \infty$. In practice, we can only sample a finite number of steps in each episode, so such an estimate may have high variance and negatively influence policy learning.
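To make the estimation procedure concrete, the sketch below computes an empirical visitation distribution from one trajectory with an indicator-style count and plugs it into the Shannon entropy. The normalization by the trajectory length and the function names are assumptions for illustration; the estimators analyzed in the lemmas may differ in detail.

```python
import math
from collections import Counter


def empirical_visitation(states):
    """Fraction of steps spent in each visited state (states must be hashable)."""
    counts = Counter(states)
    total = len(states)
    return {s: c / total for s, c in counts.items()}


def plugin_entropy(states):
    """Shannon entropy (in nats) of the empirical visitation distribution."""
    dist = empirical_visitation(states)
    return -sum(p * math.log(p) for p in dist.values())


if __name__ == "__main__":
    uniform_walk = [0, 1, 2, 3] * 5      # visits four states equally often
    stuck_walk = [0] * 17 + [1, 2, 3]    # mostly stays in one state
    print(plugin_entropy(uniform_walk), plugin_entropy(stuck_walk))
```

With only a short trajectory, most states of a large space have zero counts, which is one way to see why the estimate becomes unreliable in high-dimensional environments.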

3.2 JFI Regularizer

To replace the complicated entropy regularizer, we introduce the JFI to evaluate the global exploration performance. JFI is a count-based metric that was first used to measure allocation fairness in wireless resource scheduling jain1999throughput . The following definition formulates the JFI for state visitation in the tabular setting, in which the state space is finite.

Definition 2 (JFI for state visitation).

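Given a trajectory $\tau = (s_{0}, a_{0}, \dots, s_{T})$, denote by $C(s) = \sum_{t=0}^{T} \mathbb{1}(s_{t} = s)$ the episodic visitation counter of state $s$. The visitation fairness based on JFI can be computed as:

$$J(\tau) = \frac{\left( \sum_{s \in \mathcal{S}} C(s) \right)^{2}}{|\mathcal{S}| \sum_{s \in \mathcal{S}} C(s)^{2}}. \tag{10}$$

The JFI ranges from $1/|\mathcal{S}|$ (worst case) to $1$ (best case), and it is maximized when $C(s)$ takes the same value for all the states.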
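Jain's fairness index itself is a one-line computation on a vector of counts. The sketch below uses the standard definition of the index; the function name and the convention of returning 0 for an all-zero count vector are assumptions for illustration.

```python
def jain_fairness_index(counts):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2) for n non-negative counts."""
    total = sum(counts)
    sum_sq = sum(c * c for c in counts)
    if sum_sq == 0:
        return 0.0  # convention chosen here for an empty visitation record
    return total * total / (len(counts) * sum_sq)


if __name__ == "__main__":
    print(jain_fairness_index([2, 2, 2, 2]))  # equal visitation of four states
    print(jain_fairness_index([8, 0, 0, 0]))  # all visits on a single state
```

Equal counts give the maximum value of 1, while concentrating all visits on one of `n` states gives the minimum `1/n`.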

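Based on Definition 2, we define the shaped JFI that conditions on a specific state as:

$$J(s_{t}) = \frac{\left( \sum_{s \in \mathcal{S}} C_{t}(s) \right)^{2}}{|\mathcal{S}| \sum_{s \in \mathcal{S}} C_{t}(s)^{2}}, \tag{11}$$

where $C_{t}(s)$ only performs counting in the sub-episode $(s_{0}, \dots, s_{t})$. Immediately, a shaping function can be formulated as:

$$F(s_{t}, a_{t}, s_{t+1}) = J(s_{t+1}) - J(s_{t}). \tag{12}$$

Eq. (12) indicates the gain in JFI when transiting from $s_{t}$ to $s_{t+1}$, and the following theorem proves the utility equivalence between JFI and state visitation entropy.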

Theorem 1 (Consistency).

Given a trajectory , maximizing the state visitation entropy is equivalent to maximizing the JFI using Eq. (12) when .


See proof in Appendix C. ∎

Figure 1: Two exploration trajectories in the GridWorld game, where the black node and red node denote the start and end, and integers denote the index of actions.

Furthermore, we employ a representative example to demonstrate the advantage of JFI. Fig. 1 shows two trajectories generated by the agent when interacting with the GridWorld game. The trajectory in Fig. 1(a) explores better than the one in Fig. 1(b) because it visits more states within the same number of steps. Equipped with Eq. (11), the shaped JFI from the fifth step to the eighth step are computed as:


where each value is marked as an increase or a decrease relative to the previous step. It is obvious that continued exploration increases the JFI, while repeatedly visiting the known region is punished. In sharp contrast to the entropy regularizer, the JFI is very sensitive to changes in the exploratory situation, and it is more computable and robust.
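The behavior described above can be reproduced on a toy example. The sketch below evaluates the JFI on every prefix of a trajectory and takes the difference between consecutive values as the per-step gain; this reading of the shaped JFI and of the shaping function is an assumption, and the four-state trajectories are hypothetical rather than those of Fig. 1.

```python
from collections import Counter


def shaped_jfi(trajectory, num_states):
    """JFI of the visitation counts after each step of the trajectory."""
    counts = Counter()
    sum_sq = 0  # running sum of squared counts
    values = []
    for t, s in enumerate(trajectory, start=1):
        c = counts[s]
        sum_sq += 2 * c + 1  # (c + 1)^2 - c^2
        counts[s] = c + 1
        values.append(t * t / (num_states * sum_sq))
    return values


def jfi_gains(trajectory, num_states):
    """Change in shaped JFI caused by each transition s_t -> s_{t+1}."""
    j = shaped_jfi(trajectory, num_states)
    return [j[t + 1] - j[t] for t in range(len(j) - 1)]


if __name__ == "__main__":
    print(jfi_gains([0, 1, 2, 3], num_states=4))  # keeps visiting new states
    print(jfi_gains([0, 1, 0, 0], num_states=4))  # returns to a known state
```

In the first trajectory every transition yields a positive gain, whereas in the second the gain turns negative as soon as the agent revisits a state.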

JFI for infinite state space. The JFI is easy to compute in tabular settings. However, many environments have an infinite state space, and episodic exploration can only visit a small part of it. Moreover, the observed states in an episode are usually uncountable. To address the problem, $k$-means clustering is leveraged to discretize the observed states and make them countable macqueen1967some .

Given a set of observed states $\{s_{i}\}_{i=1}^{N}$, $k$-means clustering aims to partition the states into $k$ sets $\mathcal{C} = \{\mathcal{C}_{1}, \dots, \mathcal{C}_{k}\}$ that minimize the sum of within-cluster distances. Formally, the algorithm aims to optimize the following objective:

$$\arg\min_{\mathcal{C}} \sum_{i=1}^{k} \sum_{s \in \mathcal{C}_{i}} \left\| s - \mu_{i} \right\|^{2}, \tag{14}$$

where $\mu_{i}$ is the mean of samples in $\mathcal{C}_{i}$.

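With the clustered states, Eq. (11) is rewritten as:

$$J(s_{t}) = \frac{\left( \sum_{i=1}^{k} \tilde{C}_{t}(i) \right)^{2}}{k \sum_{i=1}^{k} \tilde{C}_{t}(i)^{2}}, \tag{15}$$

where $\tilde{C}_{t}(i)$ performs counting for the cluster labels in the sub-episode $(s_{0}, \dots, s_{t})$. In practice, we perform clustering on the encodings of the raw states to reduce variance and computational complexity. To realize efficient and generalized encoding, a VAE model is built in the following section.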
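The clustering step can be sketched with a plain Lloyd-style $k$-means on a handful of latent vectors, followed by a JFI computed over the resulting cluster labels. This is a simplified illustration: the deterministic initialization at the first `k` points, the fixed iteration count, and all function names are assumptions, and `k` must not exceed the number of points.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iters=20):
    """Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point takes the label of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def clustered_jfi(labels, k):
    """JFI computed over cluster-label counts instead of raw state counts."""
    counts = [labels.count(j) for j in range(k)]
    total = sum(counts)
    return total * total / (k * sum(c * c for c in counts))


if __name__ == "__main__":
    latents = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # hypothetical encodings
    labels = kmeans_labels(latents, k=2)
    print(labels, clustered_jfi(labels, k=2))
```

In practice a library implementation with better initialization would likely be preferable; the point here is only that counting cluster labels makes the fairness index computable for continuous observations.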

4 Multimodal Reward Shaping

Figure 2: The overview of MMRS, where denotes the Euclidean distance.

In this section, we propose an intrinsic reward module entitled MMRS that is model-free, memory-free, and generative-model empowered. As illustrated in Fig. 2, MMRS is composed of two major modules, namely the global intrinsic reward block (GIRB) and the local intrinsic reward block (LIRB). GIRB evaluates the global exploration performance using the shaped JFI, while LIRB traces the life-long novelty of states across episodes. Finally, the two kinds of exploration bonuses form multimodal intrinsic rewards for policy updates.

To capture the long-term state novelty, a general method is to employ an attendant model to record the visited states, such as RND and ICM pathak2017curiosity . However, such discriminative models suffer from overfitting and have poor generalization ability. To address the problem, we propose to evaluate the state novelty using a variational auto-encoder (VAE), which is a powerful generative model based on Bayesian inference kingma2013auto . A vanilla VAE has a recognition model and a generative model, which can be represented as a probabilistic encoder and a probabilistic decoder. We leverage the VAE to encode and reconstruct the state samples to capture their life-long novelty. In particular, the output of the encoder can be leveraged to perform the clustering operation defined in Section 3.

Denote by $q_{\phi}(z \mid s)$ the recognition model represented by a deep neural network (DNN) with parameters $\phi$, which accepts a state and encodes it into latent variables. Similarly, we define the generative model $p_{\theta}(s \mid z)$ using a DNN with parameters $\theta$, which accepts the latent variables and reconstructs the state. Given a trajectory $\tau$, the VAE is trained by minimizing the following loss function:

$$L(s_{t}; \theta, \phi) = -\mathbb{E}_{q_{\phi}(z \mid s_{t})} \left[ \log p_{\theta}(s_{t} \mid z) \right] + D_{\mathrm{KL}} \left( q_{\phi}(z \mid s_{t}) \,\|\, p_{\theta}(z) \right), \tag{16}$$

where $t = 0, \dots, T$ and $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the Kullback-Leibler (KL) divergence. For an observed state $s_{t}$ at step $t$, its life-long novelty is computed as:

$$N(s_{t}) = \eta \left( \left\| s_{t} - \hat{s}_{t} \right\|_{2} \right), \tag{17}$$

where $\hat{s}_{t}$ is the reconstructed state and $\eta(\cdot)$ is a normalization operator. This definition indicates that infrequently-seen states will produce a high reconstruction error, which motivates the agent to explore them further. Note that the VAE model will inevitably produce diminishing intrinsic rewards. To control the decay rate, a low and diminishing learning rate is indispensable when updating the VAE model.
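The novelty computation can be sketched without a trained VAE by taking the reconstructions as given. In the sketch below, the Euclidean reconstruction error and the min-max normalization over one episode are assumptions chosen for illustration; the normalization operator actually used is not specified in the text, and the function name is hypothetical.

```python
import math


def reconstruction_novelty(states, reconstructions, eps=1e-8):
    """Min-max normalized Euclidean reconstruction error for each state."""
    errors = [math.dist(s, r) for s, r in zip(states, reconstructions)]
    lo, hi = min(errors), max(errors)
    # eps avoids division by zero when all errors are identical.
    return [(e - lo) / (hi - lo + eps) for e in errors]


if __name__ == "__main__":
    states = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
    # Hypothetical decoder outputs: the last state is reconstructed poorly.
    recons = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
    print(reconstruction_novelty(states, recons))
```

States that the decoder reconstructs poorly receive values close to 1, and well-reconstructed states receive values close to 0.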

We refer to the JFI as the global intrinsic reward and to the state novelty as the local intrinsic reward. Equipped with the two kinds of exploration bonuses, we are ready to propose the following shaping function:


where are two weighting coefficients. Finally, the workflow of MMRS is summarized in Algorithm 1.
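One plausible way to combine the two bonuses with the extrinsic reward is a weighted sum, sketched below. The additive form, the parameter names `lambda_global` and `lambda_local`, and their default values are assumptions for illustration and are not taken from the paper.

```python
def mixed_rewards(extrinsic, jfi_gains, novelties, lambda_global=0.1, lambda_local=0.1):
    """Per-step reward: extrinsic + weighted global (JFI) and local (novelty) bonuses."""
    if not (len(extrinsic) == len(jfi_gains) == len(novelties)):
        raise ValueError("all reward sequences must have the same length")
    return [
        r + lambda_global * g + lambda_local * n
        for r, g, n in zip(extrinsic, jfi_gains, novelties)
    ]


if __name__ == "__main__":
    print(mixed_rewards([1.0, 0.0], [0.25, -0.05], [0.5, 1.0]))
```

The two coefficients trade off the global fairness signal against the local novelty signal, so they would need tuning per task.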

1:  Initialize the policy network , recognition model and generative model ;
2:  Set the coefficients and the number of clusters ;
3:  for episode  do
4:     Execute policy and collect the trajectory ;
5:     Use states from to train the VAE model by minimizing the loss function defined in Eq. (16);
6:     Use the recognition model to encode the observed states and collect the corresponding latent variables ;
7:     if  then
8:        Perform -means clustering to refine and label with integers;
9:     end if
10:     Calculate the multimodal intrinsic reward for each state-action pair in :
11:     Update with respect to the mixed reward using any RL algorithms.
12:  end for
Algorithm 1 Multimodal Reward Shaping

5 Experiments

In this section, we evaluate our MMRS framework on both discrete and continuous control tasks of the OpenAI Gym library brockman2016openai ; coumans2016pybullet . We carefully select several representative algorithms as benchmarks, namely RE3, RIDE and RND. A brief introduction to these benchmark schemes can be found in Appendix D. With RND, we can verify that MMRS overcomes the problem of diminishing intrinsic rewards. With RE3 and RIDE, we can validate that MMRS achieves higher performance when handling environments with high-dimensional observations. In particular, the agent is also trained without intrinsic reward modules as an ablation baseline.

5.1 Discrete Control Tasks

We first test MMRS on Atari games with discrete action spaces, and the details of the selected games are listed in Table 1. Since the Atari games output frames continuously, we stack four consecutive frames as one observation. Moreover, the frames are resized to $84 \times 84$ to reduce computation. In particular, some games have complex action spaces, which can be used to evaluate the robustness and generalization ability of MMRS.

Game Observation shape Action space size
Assault (84, 84, 4) 7
Breakout (84, 84, 4) 4
Beam Rider (84, 84, 4) 9
Kung Fu Master (84, 84, 4) 14
Space Invaders (84, 84, 4) 6
Seaquest (84, 84, 4) 18
Table 1: The details of Atari games with discrete action space.

5.1.1 Experimental Setting

We leveraged convolutional neural networks (CNNs) to build MMRS and the benchmark algorithms. The LIRB of MMRS needs to learn an encoder and a decoder. The encoder was composed of four convolutional layers and one dense layer, in which each convolutional layer is followed by a batch normalization (BN) layer. The decoder utilized four deconvolutional layers to perform upsampling. Moreover, a dense layer and a convolutional layer were employed at the top and the bottom of the decoder. Note that no BN layer is included in the decoder. Finally, we used the LeakyReLU activation function for both the encoder and the decoder, and more detailed network architectures can be found in Appendix E.


We trained MMRS with five million environment steps. In each episode, the agent was set to interact with eight parallel environments with different random seeds. Moreover, one episode had a length of 128 steps, producing 1024 transitions. Each observed state was first processed by the encoder of the VAE to generate a latent vector. Then the latent vector was sent to the decoder to reconstruct the state. The pixel values of the true state and the reconstructed state were normalized into , and the reconstruction error was employed as the state novelty. After that, we performed $k$-means clustering on the latent vectors with . Note that the number of clusters can be larger if a longer episode length is employed.

Equipped with the multimodal intrinsic rewards (), we used the proximal policy optimization (PPO) schulman2017proximal method to update the policy network. More specifically, we used a PyTorch implementation of the PPO method, which can be found in kostrikov2018github . To make a fair comparison, we employed an identical policy network and value network for all the algorithms, and their architectures can be found in Table 5. The PPO was trained with a learning rate of , an entropy coefficient of , a value function coefficient of , and a GAE parameter of schulman2015high . In particular, a gradient clipping operation with threshold was performed to stabilize the learning procedure.

After the policy was updated, the transitions were utilized to update the VAE model of LIRB. For the hyper-parameter settings, the batch size was set to 64, and the Adam optimizer was leveraged to perform gradient descent. In particular, a linearly decaying learning rate was employed to prevent the life-long state novelty from diminishing rapidly. Finally, we trained the benchmark schemes following their default settings reported in the literature.
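A linearly decaying learning rate of the kind mentioned above can be sketched as a simple schedule function. The initial rate, the final rate, and the number of updates below are hypothetical values chosen for illustration, since the actual settings are not recoverable from the text.

```python
def linear_decay_lr(initial_lr, update_idx, total_updates, final_lr=0.0):
    """Linearly interpolate from initial_lr to final_lr over total_updates updates."""
    frac = min(max(update_idx / total_updates, 0.0), 1.0)  # clamp progress to [0, 1]
    return initial_lr + frac * (final_lr - initial_lr)


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, linear_decay_lr(1e-3, step, total_updates=100))
```

The returned value would be assigned to the optimizer's learning rate before each VAE update.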

5.1.2 Performance Comparison

Next, we compare the performance of MMRS against the benchmark methods, and the episode return is utilized as the key performance indicator (KPI). Table 2 lists the performance of MMRS and the benchmarks, in which the highest performance is shown in bold numbers. In particular, "T" indicates that the method outperforms the vanilla PPO agent (VPA), while "F" indicates that it does not.

Game VPA RE3 RIDE RND MMRS
Assault 2589.95 T/2870.87 T/2751.55 T/2642.08 T/3303.64
Breakout 52.74 T/58.65 T/62.62 T/55.37 T/70.77
Beam Rider 597.16 T/1225.72 T/748.51 F/541.27 T/1383.00
Kung Fu Master 14450.0 F/13534.62 T/15830.43 F/12500.0 T/17295.65
Space Invaders 558.90 T/631.75 T/749.32 T/577.60 T/751.71
Seaquest 879.29 T/886.43 T/884.14 F/853.10 T/892.14
Table 2: Performance comparison in six Atari games.

As shown in Table 2, MMRS achieved the highest performance in all the selected games. RE3 outperforms the VPA in five games while failing in one game. RIDE outperforms the VPA in all games and achieves the second-best performance in three games. In contrast, RND outperforms the VPA in three games and fails in three games. Furthermore, Fig. 3 illustrates the moving average of the episode return during training. MMRS improves faster than the other benchmarks, but it shows more oscillation over the whole learning procedure.

Figure 3: Moving average of the episode return in Atari games training.
Figure 4: Computational complexity comparison. All the experiments were performed on the Ubuntu 18.04 LTS operating system with an Intel 10900x CPU and an NVIDIA RTX3090 GPU.

Since we learn policies in environments with high-dimensional observations, it is non-trivial to evaluate the computational complexity of these intrinsic reward modules. In practice, we use the frames per second (FPS) during training as the KPI. For instance, if the agent takes $t$ seconds to accomplish sampling and updating in one episode, then the FPS is computed as the ratio of the episode length to the time cost $t$. As shown in Fig. 4, RE3 achieves the highest FPS during training because it requires no additional models or memories. RIDE employs three DNNs to reconstruct the transition process and introduces a pseudo-count method to compute the state visitation frequency. Therefore, it achieves the lowest FPS. Our MMRS achieves the highest performance at the cost of lower computational efficiency, which can be further improved by simplifying the network architecture of the VAE model.
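The FPS metric reduces to a single division, sketched below. Counting the transitions of all parallel environments as frames is an assumption made here; the elapsed time in the example is a hypothetical value.

```python
def frames_per_second(num_envs, steps_per_env, elapsed_seconds):
    """Transitions processed per second for one episode of sampling and updating."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_envs * steps_per_env / elapsed_seconds


if __name__ == "__main__":
    # Eight parallel environments with 128 steps each, as in the Atari setting.
    print(frames_per_second(num_envs=8, steps_per_env=128, elapsed_seconds=2.0))
```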

5.2 Continuous Control Tasks

In this section, we further test MMRS on Bullet games with continuous action spaces, and three classical games are selected in Table 3. Unlike the Atari games that have image observations, Bullet games use fixed-length vectors to describe the environments. For instance, the "Ant" game uses 28 features to describe the state of the agent, and its action is a vector consisting of 8 values within (-1.0, 1.0).

Game Observation shape Action extent Action shape
Ant (28, ) (-1.0, 1.0) (8, )
Hopper (15, ) (-1.0, 1.0) (3, )
Humanoid (44, ) (-1.0, 1.0) (17, )
Table 3: The details of Bullet games with continuous action space.

5.2.1 Experimental Setting

We leveraged multilayer perceptrons (MLPs) to implement MMRS and the benchmarks, and the detailed network architectures can be found in Appendix E. Note that no BN layers were introduced in this experiment. We trained MMRS with one million environment steps. The agent was again set to interact with eight parallel environments with different random seeds in each episode, and a diagonal Gaussian distribution was used to sample actions. The rest of the updating procedure was consistent with the Atari experiments, but no normalization was applied to the states. For computing the multimodal intrinsic rewards, the coefficients were set as .

5.2.2 Performance Comparison

Table 4 lists the performance comparison between MMRS and the benchmarks. MMRS outperforms the VPA in all three games while achieving the best performance in two games. RE3 beats all the other algorithms in one game and achieves the second-best performance in another game. RIDE and RND outperform the VPA in two games but fail in one game. Furthermore, Fig. 5 shows the moving average of the episode return during training. MMRS achieves stable and efficient growth when compared with the benchmarks. In summary, MMRS shows great potential for obtaining considerable performance in both discrete and continuous control tasks.

Game VPA RE3 RIDE RND MMRS
Ant 648.14 T/679.88 T/661.81 T/675.85 T/692.85
Hopper 1071.58 F/971.55 F/889.53 F/765.63 T/1127.04
Humanoid 18.19 T/27.85 T/23.03 T/20.88 T/25.31
Table 4: Performance comparison in three Bullet games.
Figure 5: Moving average of the episode return in Bullet games training.

6 Conclusion

In this paper, we have investigated the problem of improving exploration in RL. We first analyzed the sample complexity of the entropy-based approaches and obtained an exact lower bound. To eliminate the prohibitive sample complexity, a novel metric entitled JFI was introduced to replace the entropy regularizer. Moreover, we proved the utility consistency between the JFI and the entropy regularizer, and demonstrated the practical usage of JFI both in the tabular setting and in infinite state spaces. Equipped with the JFI metric, the state novelty was integrated to build multimodal intrinsic rewards, which evaluate the exploration extent more precisely. In particular, we used a VAE model to capture the life-long state novelty across episodes. It avoids overfitting and learns excellent state representations when compared with discriminative models. Finally, extensive simulations were performed on both discrete and continuous tasks of the OpenAI Gym library. The numerical results demonstrated that our algorithm outperformed the benchmarks, showing great effectiveness for realizing efficient exploration.


Appendix A Proof of Lemma 1

Considering sampling for steps and collecting a dataset , then a reasonable estimate of the SVD is:

According to the McDiarmid’s inequality mcdiarmid1989method , we have that :

Take logarithm on both sides, such that:

Assume the , so with probability , it holds:

Let , such that:


This concludes the proof.

Appendix B Proof of Lemma 2

Considering sampling for steps and collecting a dataset , then a reasonable estimate of state entropy is:

For , the Hoeffding’s inequality hoeffding1994probability indicates that:


Let , take logarithm on both sides, such that:

Finally, it holds:

This concludes the proof.

Appendix C Proof of Theorem 1

Given a trajectory , the undiscounted return based on Eq. (12) is:

where . Recall the optimal condition of JFI, , it holds:

When , the state visitation probability satisfies:

Therefore, the state visitation entropy obtains its optima. This concludes the proof.

Appendix D Benchmark Schemes

D.1 RE3

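Given a trajectory $\tau$, RE3 first uses a randomly initialized DNN to encode the observed states. Denote by $\{y_{i}\}_{i=1}^{N}$ the encoding vectors. RE3 estimates the entropy of the state distribution using a $k$-nearest neighbor ($k$-NN) entropy estimator singh2003nearest :

$$\hat{H} = \frac{1}{N} \sum_{i=1}^{N} \log \frac{N \cdot \left\| y_{i} - y_{i}^{k\text{-NN}} \right\|_{2}^{q} \cdot \hat{\pi}^{q/2}}{k \cdot \Gamma(q/2 + 1)} + C_{k}, \tag{20}$$

where $y_{i}^{k\text{-NN}}$ is the $k$-NN of $y_{i}$ within the set $\{y_{i}\}_{i=1}^{N}$, $q$ is the dimension of the encoding vector, $\hat{\pi} \approx 3.14159$, $C_{k} = \log k - \Psi(k)$ is a bias correction term, $\Gamma(\cdot)$ is the Gamma function, and $\Psi(\cdot)$ is the digamma function.

Equipped with Eq. (20), the intrinsic reward of each transition is computed as:

$$r(s_{i}) = \log \left( \left\| y_{i} - y_{i}^{k\text{-NN}} \right\|_{2} + 1 \right).$$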
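The RE3-style reward can be sketched on a small set of encodings with a brute-force nearest-neighbor search. The log-distance form follows the reward reported for RE3; the brute-force search, the function name, and the one-dimensional example encodings are assumptions for illustration.

```python
import math


def re3_rewards(embeddings, k=3):
    """log(distance to the k-th nearest neighbor + 1) for each encoding vector."""
    if k >= len(embeddings):
        raise ValueError("k must be smaller than the number of embeddings")
    rewards = []
    for i, y in enumerate(embeddings):
        # Distances from y to every other encoding, sorted ascending.
        dists = sorted(
            math.dist(y, other) for j, other in enumerate(embeddings) if j != i
        )
        rewards.append(math.log(dists[k - 1] + 1.0))
    return rewards


if __name__ == "__main__":
    encodings = [(0.0,), (1.0,), (3.0,), (10.0,)]  # hypothetical 1-D encodings
    print(re3_rewards(encodings, k=1))
```

Encodings that lie far from their neighbors, i.e. in sparsely visited regions, receive the largest rewards.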


D.2 RIDE

RIDE inherits the architecture of the intrinsic curiosity module (ICM) in pathak2017curiosity , which is composed of an embedding module, an inverse dynamic model, and a forward dynamic model. Given a transition, the inverse dynamic model predicts the action using the encodings of the state and the next state. Meanwhile, the forward dynamic model accepts the encoded state and the true action to predict the representation of the next state. Given a trajectory, the three models are trained to minimize the following loss function:


where denotes the loss function that measures the distance between true actions and predicted actions, e.g., the cross entropy for discrete action space.

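Finally, the intrinsic reward of each transition is computed as:

$$r(s_{t}, a_{t}, s_{t+1}) = \frac{\left\| \phi(s_{t+1}) - \phi(s_{t}) \right\|_{2}}{\sqrt{N_{\mathrm{ep}}(s_{t+1})}},$$

where $\phi$ is the embedding module and $N_{\mathrm{ep}}(s_{t+1})$ is the number of times that state $s_{t+1}$ has been visited during the current episode. It can be obtained using the pseudo-count method ostrovski2017count .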
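The impact-driven reward can be sketched with precomputed embeddings and exact episodic counts. Using hashable state keys for counting (instead of a pseudo-count model) and counting the initial state as one visit are simplifying assumptions; the function name and example values are hypothetical.

```python
import math
from collections import Counter


def ride_rewards(embeddings, state_keys):
    """Embedding change between consecutive states, discounted by episodic visit counts.

    embeddings[t] is the encoding of s_t and state_keys[t] a hashable id of s_t.
    """
    counts = Counter([state_keys[0]])  # the initial state counts as one visit
    rewards = []
    for t in range(len(embeddings) - 1):
        counts[state_keys[t + 1]] += 1
        impact = math.dist(embeddings[t], embeddings[t + 1])
        rewards.append(impact / math.sqrt(counts[state_keys[t + 1]]))
    return rewards


if __name__ == "__main__":
    emb = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]
    keys = ["a", "b", "a", "b"]
    print(ride_rewards(emb, keys))  # the same move is rewarded less when repeated
```

The count-based denominator prevents the agent from collecting reward by oscillating between two states with very different embeddings.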

D.3 RND

RND leverages DNNs to record the visited states and compute their novelty. It consists of a predictor network and a target network. The target network serves as the reference. It is randomly initialized and kept fixed to set the prediction problem. The predictor network is trained on the data collected by the agent across episodes. Denote by $f: \mathcal{S} \rightarrow \mathbb{R}^{m}$ and $\hat{f}: \mathcal{S} \rightarrow \mathbb{R}^{m}$ the target network and predictor network, where $m$ is the embedding dimension. The RND is trained to minimize the following loss function:

$$L(s_{t}) = \left\| \hat{f}(s_{t}) - f(s_{t}) \right\|_{2}^{2}.$$

Finally, the intrinsic reward of each transition is computed as:

$$r(s_{t}, a_{t}, s_{t+1}) = \left\| \hat{f}(s_{t+1}) - f(s_{t+1}) \right\|_{2}^{2}.$$
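The following toy sketch illustrates why an RND-style reward diminishes for a repeatedly visited state. Linear maps stand in for the target and predictor networks, and the learning rate, embedding dimension, seed, and example state are all assumptions for illustration; the step size is small enough for gradient descent to converge on this state.

```python
import random


def rnd_rewards_over_time(state, steps=50, lr=0.1, embed_dim=4, seed=0):
    """Squared prediction error on one state while the predictor is trained on it."""
    rng = random.Random(seed)
    n = len(state)
    # Target network: a fixed, randomly initialized linear map (never trained).
    target = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(embed_dim)]
    # Predictor network: a linear map trained to match the target.
    predictor = [[0.0] * n for _ in range(embed_dim)]
    rewards = []
    for _ in range(steps):
        # Per-dimension prediction error on this state.
        diff = [
            sum((p - t) * x for p, t, x in zip(p_row, t_row, state))
            for p_row, t_row in zip(predictor, target)
        ]
        rewards.append(sum(d * d for d in diff))  # intrinsic reward = squared error
        # One gradient-descent step on the squared error.
        for i in range(embed_dim):
            for j in range(n):
                predictor[i][j] -= lr * 2.0 * diff[i] * state[j]
    return rewards


if __name__ == "__main__":
    r = rnd_rewards_over_time([1.0, 0.5])
    print(r[0], r[-1])  # the reward shrinks as the state becomes familiar
```

A lower predictor learning rate slows this decay, which is the lever that long-term novelty modules rely on.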


Appendix E Network architectures

E.1 Atari games

Module | Policy network | Encoder | Decoder
Input | State | State | Latent variables
Layer 1 | 8×8 Conv. 32, ReLU | 3×3 Conv. 32, LeakyReLU | Dense 64, LeakyReLU
Layer 2 | 4×4 Conv. 64, ReLU | 3×3 Conv. 32, LeakyReLU | Dense 1024, LeakyReLU
Layer 3 | 3×3 Conv. 32, ReLU | 3×3 Conv. 32, LeakyReLU | 3×3 Deconv. 64, LeakyReLU
Layer 4 | Dense 512, ReLU | 3×3 Conv. 32 | 3×3 Deconv. 64, LeakyReLU
Layer 5 | Categorical Distribution | Dense 512 & Dense 512 | 3×3 Deconv. 64, LeakyReLU
Layer 6 | - | Gaussian sampling | 8×8 Deconv. 32
Layer 7 | - | - | 1×1 Conv. 4
Output | Action | Latent variables | Reconstructed state
Table 5: The CNN-based network architectures.

For instance, "8×8 Conv. 32" represents a convolutional layer that has 32 filters of size 8×8. A categorical distribution was used to sample an action based on the action probabilities of the stochastic policy. Note that "Dense 512 & Dense 512" in Table 5 means that there are two branches for outputting the mean and variance of the latent variables, respectively.

E.2 Bullet games

Module | Policy network | Encoder | Decoder
Input | State | State | Latent variables
Layer 1 | Dense 64, Tanh | Dense 32, Tanh | Dense 32, Tanh
Layer 2 | Dense 64, Tanh | Dense 64, Tanh | Dense 64, Tanh
Layer 3 | Categorical Distribution | Dense 256 | Dense observation shape
Layer 4 | - | Dense 256 & Dense 512 | -
Layer 5 | - | Gaussian sampling | -
Output | Action | Latent variables | Reconstructed state
Table 6: The MLP-based network architectures.