MMRS
Multimodal Reward Shaping for Efficient Exploration in Reinforcement Learning
Maintaining long-term exploration ability remains one of the challenges of deep reinforcement learning (DRL). In practice, reward-shaping-based approaches are leveraged to provide intrinsic rewards that incentivize the agent's motivation. However, most existing intrinsic reward shaping (IRS) modules rely on attendant models or additional memory to record and analyze learning procedures, which leads to high computational complexity and low robustness. Moreover, they overemphasize the influence of a single state on exploration and cannot evaluate exploration performance from a global perspective. To tackle this problem, state-entropy-based methods have been proposed to encourage the agent to visit the state space more equitably. However, their estimation error and sample complexity are prohibitive in environments with high-dimensional observations. In this paper, we introduce a novel metric entitled Jain's fairness index (JFI) to replace the entropy regularizer; it requires no additional models or memory. In particular, JFI overcomes the vanishing-intrinsic-rewards problem and can be generalized to arbitrary tasks. Furthermore, we use a variational autoencoder (VAE) model to capture the lifelong novelty of states. Finally, the global JFI score and local state novelty are combined to form a multimodal intrinsic reward, controlling the exploration extent more precisely. Extensive simulation results demonstrate that our multimodal reward shaping (MMRS) method achieves higher performance than other benchmark schemes.
Balancing the trade-off between exploration and exploitation is a crucial problem in reinforcement learning (RL) sutton2018reinforcement . In general, learning the optimal policy requires the agent to visit all possible state-action pairs infinitely often. However, most existing RL algorithms have sophisticated exploitation mechanisms but poor exploration strategies. As a result, the policy may prematurely fall into a local optimum after finitely many steps and never improve again stadie2015incentivizing . Therefore, the critical problem is to maintain exploration throughout the whole learning procedure. To address the problem, a simple approach is to employ stochastic policies such as the ε-greedy policy and Boltzmann exploration, which select every possible action with a nonzero probability in each state. Such techniques can eventually learn the optimal policy in the tabular setting, but they are likely to be futile when handling complex environments with high-dimensional observations.
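As a minimal illustration of such a stochastic policy, an ε-greedy action selector can be sketched as follows (a generic sketch over tabular action values, not tied to any particular RL library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action index from estimated action values.

    With probability epsilon, pick a uniformly random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because every action keeps a nonzero selection probability, an ε-greedy agent eventually visits all state-action pairs in the tabular setting, though, as noted above, this guarantee weakens in high-dimensional environments.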
To cope with the exploration problem in complex tasks, reward shaping is leveraged to form multimodal rewards that improve exploration. More specifically, recent approaches provide intrinsic rewards that reward the agent's exploration performance dayan2002reward . In sharp contrast to the extrinsic rewards given explicitly by the environment, intrinsic rewards represent the inherent learning motivation or curiosity of the agent, which is difficult to characterize and evaluate. Many prior works have been devoted to realizing computable intrinsic reward modules, and they can be broadly categorized into novelty-based and prediction-error-based approaches. For instance, strehl2008analysis ; ostrovski2017count employed a state visitation function to measure the novelty of states. Such a visitation function assigns a higher bonus to infrequently seen states, incentivizing the agent to revisit novel states and increasing the probability of learning a better policy. Methods in pathak2017curiosity ; yu2020intrinsic ; yuan2021hybrid ; stadie2015incentivizing followed the second idea, in which the prediction error of a dynamics model is utilized as the intrinsic reward. Given an observed transition, an attendant model is designed to predict the next state based on the current state-action pair. The intrinsic reward is then computed as the Euclidean distance between the predicted next state and the true next state.
All the methods above suffer from vanishing intrinsic rewards, i.e., the intrinsic rewards decrease with repeated visits. Once a state yields a minimal intrinsic reward, the agent has no motivation to explore it further. To maintain long-term exploration ability, badia2020never proposed a never-give-up (NGU) framework that learns a combined intrinsic reward consisting of episodic and lifelong state novelty. NGU evaluates the episodic state novelty through a slot-based memory and a pseudo-count method bellemare2016unifying , encouraging the agent to visit more distinct states in each episode. Since the memory is wiped at the beginning of each episode, the intrinsic rewards do not decay over training. Meanwhile, NGU further introduced a random network distillation (RND) module to capture the lifelong novelty of states burda2018exploration . By controlling the learning rate of RND, the lifelong intrinsic reward more gently discourages the agent from revisiting familiar states. However, NGU has a complex architecture and high computational complexity, making it difficult to generalize to arbitrary environments. A simpler framework entitled rewarding impact-driven exploration (RIDE) is proposed in raileanu2020ride . RIDE first trains an embedding network to encode the state space, following the inverse-forward pattern of pathak2017curiosity . The Euclidean distance between two consecutive encoded states is then utilized as the intrinsic reward. It encourages the agent to take actions that result in larger state changes, maintaining exploration and avoiding the noisy-TV dilemma reported in savinov2018episodic .
However, both NGU and RIDE overemphasize the influence of a single state on exploration and cannot reflect the global exploration extent. Moreover, the aforementioned methods have poor mathematical interpretability and rely heavily on attendant models. To circumvent this problem, islam2019entropy proposed to maximize the entropy of the state distribution, forcing the agent to visit all states more equitably. In particular, islam2019entropy estimates the state entropy using a variational autoencoder (VAE) kingma2013auto . The VAE accepts the parameters of the policy as input before predicting the state distribution; therefore, minimizing the reconstruction error of the VAE is equivalent to maximizing the estimate of the state entropy. Finally, the policy and the VAE model can be updated together through the policy gradient method. In zhang2018dissection , the Shannon entropy regularizer is further extended to Rényi entropy to adapt to arbitrary tasks. To realize an efficient and stable entropy estimate, seo2021state
proposed a random encoder for efficient exploration (RE3) framework that requires no representation learning. In each episode, the observations are collected and encoded using a fixed deep neural network (DNN). Then a k-nearest-neighbor estimator singh2003nearest is leveraged to estimate the state entropy. Simulation results demonstrated that RE3 improves the sample efficiency of both model-free and model-based RL algorithms. However, it is difficult to choose an appropriate value of k, and the estimation error and sample complexity grow exponentially with the size of the state space.

Inspired by the discussions above, we consider replacing the entropy regularizer with a simpler, computable metric. Moreover, we aim to combine this metric with state novelty to realize a comprehensive exploration evaluation. In this paper, we propose multimodal reward shaping (MMRS), a model-free, memory-free, and generative-model-empowered method for providing high-quality intrinsic rewards. The main contributions of this paper are summarized as follows:
We first analyze the sample complexity of entropy-based intrinsic rewards. After that, a novel metric entitled Jain's fairness index (JFI) is introduced to replace the entropy regularizer, and we prove the utility equivalence between the two metrics. Furthermore, we detail the use of JFI both in tabular settings and in environments with high-dimensional observations.
Since JFI evaluates the global exploration performance, the lifelong state novelty is leveraged to form multimodal intrinsic rewards. In particular, we use a VAE to perform state embedding and capture the lifelong novelty of states. Such a method requires no additional memory and avoids overfitting, making it more efficient and robust than RND.
Finally, extensive simulations are performed to compare MMRS with other state-of-the-art (SOTA) methods. Numerical results demonstrate that MMRS outperforms SOTA exploration methods with a simpler architecture and higher robustness. Furthermore, we conduct a qualitative analysis to show that MMRS eliminates vanishing intrinsic rewards and maintains long-term exploration ability.
We study the RL problem modeled as a Markov decision process (MDP) sutton2018reinforcement . The MDP can be defined as a tuple M = (S, A, P, r, ρ₀, γ), where:
S is the state space;
A is the action space;
P(s′ | s, a) is the transition probability;
r(s, a, s′) is the reward function;
ρ₀ is the initial state distribution;
γ ∈ [0, 1) is a discount factor.
Note that the reward function here conditions on a full transition (s, a, s′). Furthermore, we denote by π the policy of the agent, which observes the state of the environment before choosing an action from the action space. Given the MDP M, the objective of RL is to find the optimal policy that maximizes the expected discounted return:
(1) π* = argmax_{π ∈ Π} E_{τ∼π} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ],
where Π is the set of all stationary policies, and τ = (s₀, a₀, s₁, a₁, …) is the trajectory generated by the policy.
In this paper, we aim to improve the exploration of the state space, in which the agent is expected to visit as many distinct states as possible within a limited learning procedure. Mathematically, the policy should assign equitable visitation probability to all the states. To evaluate the exploration of single states, we define the following state visitation distribution (SVD):
(2) d_π(s) = (1 − γ) Σ_{t=0}^{∞} γ^t Pr(S_t = s),
where Pr(·) denotes the probability and S_t is the random variable of the state at step t. Therefore, improving exploration is equivalent to maximizing the following entropy:
(3) H(d_π) = − Σ_{s∈S} d_π(s) log d_π(s).
Finally, the RL objective is reformulated as:
(4) π* = argmax_{π ∈ Π} E_{τ∼π} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ] + H(d_π).
It is difficult to solve this optimization problem directly due to the complex entropy term. The following section proposes a reward-shaping-based method that replaces the entropy with a novel and simple metric.
In this section, we first analyze the sample complexity of the original entropybased intrinsic rewards. Then an alternative method is proposed for replacing the entropy regularizer.
To compute the state entropy, we first need to estimate the SVD defined in Eq. (2). Given a trajectory τ = {s₁, …, s_T} generated by policy π, a reasonable estimate of d_π(s) can be formulated as:
(5) d̂_π(s) = (1/T) Σ_{t=1}^{T} 1[s_t = s],
where 1[·] is the indicator function. The following lemma gives the sample complexity of this estimate:
Assume we sample for T steps and estimate the SVD following Eq. (5); then, with probability at least 1 − δ, it holds that:
(6) 
where |S| is the cardinality of the state space.
Furthermore, a reasonable Monte-Carlo estimate of Eq. (3) is:
(7) Ĥ(d_π) = − (1/T) Σ_{t=1}^{T} log d̂_π(s_t).
See proof in Appendix B. ∎
Similar to the estimation of SVD, we can prove that:
Assume we sample for T steps and estimate the entropy following Eq. (7); then, with probability at least 1 − δ, it holds that:
(8) 
See proof in Appendix A. ∎
Equipped with Lemma 1 and Lemma 2, the sample complexity of estimating the state entropy is:
(9) 
Eq. (9) demonstrates that the estimation error converges sublinearly as T → ∞. In practice, we can only sample finitely many steps in each episode, so the estimate may exhibit high variance and negatively influence policy learning.
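As a concrete sketch, the plug-in estimators of Eqs. (5) and (7) can be written as follows (a minimal illustration for finite state spaces; states are assumed hashable):

```python
import math
from collections import Counter

def empirical_svd(trajectory):
    """Estimate the SVD (Eq. (5)) by relative visit frequencies."""
    counts = Counter(trajectory)
    T = len(trajectory)
    return {s: c / T for s, c in counts.items()}

def entropy_estimate(trajectory):
    """Plug-in Monte-Carlo estimate of the state visitation entropy (Eq. (7))."""
    d_hat = empirical_svd(trajectory)
    # Averaging -log d_hat(s_t) over the trajectory equals the plug-in entropy.
    return -sum(math.log(d_hat[s]) for s in trajectory) / len(trajectory)
```

For a large state space, the table of frequencies itself becomes the bottleneck, which mirrors the sample-complexity issue discussed above.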
To replace the complicated entropy regularizer, we introduce JFI to evaluate the global exploration performance. JFI is a count-based metric that was first used to measure allocation fairness in wireless resource scheduling jain1999throughput . The following definition formulates the JFI for state visitation in the tabular setting, in which the state space is finite.
Given a trajectory τ, denote by N(s) the episodic visitation count of state s ∈ S. The visitation fairness based on JFI can be computed as:
(10) JFI(τ) = ( Σ_{s∈S} N(s) )² / ( |S| Σ_{s∈S} N(s)² ).
The JFI ranges from 1/|S| (worst case) to 1 (best case), and it is maximal when N(s) takes the same value for all the states.
Based on Definition 2, we define the shaped JFI that conditions on a specific state as:
(11) JFI(τ_{0:t}) = ( Σ_{s∈S} N_t(s) )² / ( |S| Σ_{s∈S} N_t(s)² ),
where N_t(s) only performs counting in the sub-episode τ_{0:t} = (s₀, …, s_t). Immediately, a shaping function can be formulated as:
(12) F(s_t) = JFI(τ_{0:t}) − JFI(τ_{0:t−1}).
Eq. (12) indicates the gain in JFI when transiting from s_{t−1} to s_t, and the following theorem proves the utility equivalence between JFI and the state visitation entropy.
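These two quantities can be sketched for the tabular case as follows (counting over the full state space, per our reading of Eq. (10); the helper names are ours):

```python
def jfi(counts):
    """Jain's fairness index over visitation counts: (sum x)^2 / (n * sum x^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total ** 2 / (len(counts) * sum(c * c for c in counts))

def shaped_jfi_gain(trajectory, t, state_space):
    """Gain in JFI when transiting into the state at step t (cf. Eq. (12))."""
    def counts_up_to(end):
        # Visitation counts over the sub-episode trajectory[0:end].
        return [trajectory[:end].count(s) for s in state_space]
    return jfi(counts_up_to(t + 1)) - jfi(counts_up_to(t))
```

Visiting a previously unseen state yields a positive gain, while revisiting an already frequent state yields zero or negative gain, matching the intuition behind the shaping function.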
Given a trajectory τ, maximizing the state visitation entropy is equivalent to maximizing the JFI using Eq. (12) as T → ∞.
See proof in Appendix C. ∎
Furthermore, we employ a representative example to demonstrate the advantage of JFI. Fig. 1 illustrates two trajectories generated by the agent when interacting with the GridWorld game. The trajectory in Fig. 1(a) explores better than that in Fig. 1(b) because it visits more distinct states within the same number of steps. Equipped with Eq. (11), the shaped JFI from the fifth step to the eighth step is computed as:
(13)  
where ↑ represents an increase and ↓ represents a decrease. Evidently, continued exploration increases the JFI, while repeatedly visiting known regions is punished. In sharp contrast to the entropy regularizer, the JFI is highly sensitive to changes in the exploration situation, and it is easier to compute and more robust.
JFI for infinite state space. The JFI is easy to compute in tabular settings. However, many environments have an infinite state space, and episodic exploration can only visit a small part of it. Moreover, the observed states in an episode rarely repeat exactly, making them hard to count. To address this problem, k-means clustering is leveraged to refine the observed states and make them countable macqueen1967some .
Given a set of observed states, k-means clustering aims to shatter the states into k clusters C = {C₁, …, C_k} that minimize the sum of within-cluster distances. Formally, the algorithm optimizes the following objective:
(14) argmin_C Σ_{i=1}^{k} Σ_{x∈C_i} ‖x − μ_i‖²,
where μ_i is the mean of the samples in C_i.
With the clustered states, Eq. (11) is rewritten as:
(15) 
where the counter now performs counting over cluster labels in the sub-episode. In practice, we perform clustering on encodings of the raw states to reduce variance and computational complexity. To realize efficient and generalized encoding, a VAE model is built in the following section.
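A minimal Lloyd's-iteration sketch of Eq. (14) over encoded states (plain NumPy; the initialization scheme and iteration budget are illustrative choices, not the paper's settings):

```python
import numpy as np

def kmeans_labels(latents, k, iters=20, seed=0):
    """Assign each encoded state to one of k clusters (cf. Eq. (14)).

    latents: (N, d) array of state encodings. Returns an integer cluster
    label per state, which makes visits countable for the shaped JFI.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct encoded states.
    centers = latents[rng.choice(len(latents), size=k, replace=False)]
    for _ in range(iters):
        # Assign each encoding to its nearest center.
        dists = np.linalg.norm(latents[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned encodings.
        for j in range(k):
            if (labels == j).any():
                centers[j] = latents[labels == j].mean(axis=0)
    return labels
```

The cluster labels then replace raw states in the visitation counter, so the JFI of the previous section applies unchanged.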
In this section, we propose an intrinsic reward module entitled MMRS that is model-free, memory-free, and generative-model-empowered. As illustrated in Fig. 2, MMRS is composed of two major modules, namely the global intrinsic reward block (GIRB) and the local intrinsic reward block (LIRB). GIRB evaluates the global exploration performance using the shaped JFI, while LIRB traces the lifelong novelty of states across episodes. Finally, the two kinds of exploration bonuses form multimodal intrinsic rewards for policy updates.
To capture the long-term state novelty, a general method is to employ an attendant model to record the visited states, such as RND and ICM pathak2017curiosity . However, such discriminative models suffer from overfitting and have poor generalization ability. To address this problem, we propose to evaluate the state novelty using a variational autoencoder (VAE), which is a powerful generative model based on Bayesian inference kingma2013auto . A vanilla VAE has a recognition model and a generative model, which can be represented as a probabilistic encoder and a probabilistic decoder. We leverage the VAE to encode and reconstruct the state samples to capture their lifelong novelty. In particular, the output of the encoder can be leveraged to perform the clustering operation defined in Section 3.

Denote by q_φ(z|s) the recognition model, represented by a deep neural network (DNN) with parameters φ, which accepts a state and encodes it into latent variables z. Similarly, we define the generative model p_θ(s|z) using a DNN with parameters θ, which accepts the latent variables and reconstructs the state. Given a trajectory τ, the VAE is trained by minimizing the following loss function:
(16) L(θ, φ) = E_{q_φ(z|s)}[ −log p_θ(s|z) ] + D_KL( q_φ(z|s) ‖ p(z) ),
where p(z) is the prior over the latent variables and D_KL(· ‖ ·) is the Kullback–Leibler (KL) divergence. For an observed state s_t at step t, its lifelong novelty is computed as:
(17) n(s_t) = Norm( ‖s_t − ŝ_t‖ ),
where ŝ_t is the reconstructed state and Norm(·) is a normalization operator. This definition indicates that infrequently seen states produce high reconstruction errors, which motivates the agent to explore them further. Note that the VAE model will inevitably produce diminishing intrinsic rewards. To control the decay rate, a low and diminishing learning rate is indispensable when updating the VAE model.
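Given reconstructions from a trained VAE, the novelty score can be sketched as follows; the z-score normalization used here is one plausible instantiation of the normalization operator, not necessarily the paper's exact choice:

```python
import numpy as np

def lifelong_novelty(states, reconstructions):
    """Lifelong novelty as normalized reconstruction error (cf. Eq. (17)).

    states, reconstructions: (N, ...) arrays of true and VAE-reconstructed
    states. Rarely seen states reconstruct poorly and therefore score high.
    """
    errors = np.array([np.linalg.norm(s - r)
                       for s, r in zip(states, reconstructions)])
    spread = errors.std() + 1e-8  # epsilon avoids division by zero
    # Hypothetical normalization choice: z-score across the batch.
    return (errors - errors.mean()) / spread
```

Training the VAE with a low, decaying learning rate then slows the shrinkage of these errors, which is exactly the decay-rate control mentioned above.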
We refer to the JFI as the global intrinsic reward and the state novelty as the local intrinsic reward, respectively. Equipped with the two kinds of exploration bonuses, we are ready to propose the following shaping function:
(18) r_t = r_t^e + λ₁ F(s_t) + λ₂ n(s_t),
where r_t^e is the extrinsic reward, n(s_t) is the lifelong state novelty, and λ₁, λ₂ are two weighting coefficients. Finally, the workflow of MMRS is summarized in Algorithm 1.
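Putting the pieces together, the per-step shaped reward can be sketched directly (the default coefficient values below are placeholders, not the paper's settings):

```python
def multimodal_reward(extrinsic, jfi_gain, novelty, lam_global=0.5, lam_local=0.5):
    """Combine the extrinsic reward with the global (shaped-JFI) and
    local (lifelong-novelty) bonuses via two weighting coefficients."""
    return extrinsic + lam_global * jfi_gain + lam_local * novelty
```

Because the global term reflects the whole sub-episode while the local term reflects a single state, the two coefficients trade off episode-level fairness against per-state curiosity.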
In this section, we evaluate our MMRS framework on both discrete and continuous control tasks from the OpenAI Gym library brockman2016openai coumans2016pybullet . We carefully select several representative algorithms as benchmarks, namely RE3, RIDE, and RND; a brief introduction to these benchmark schemes can be found in Appendix D. With RND, we can verify that MMRS overcomes the problem of diminishing intrinsic rewards. With RE3 and RIDE, we can validate that MMRS achieves higher performance when handling environments with high-dimensional observations. In particular, the agent is also trained without any intrinsic reward module as an ablation baseline.
We first test MMRS on Atari games with discrete action spaces; the details of the selected games are listed in Table 1. Since the Atari games output frames continuously, we stack four consecutive frames as one observation. Moreover, the frames are resized to 84×84 to reduce computation. In particular, some games have complex action spaces, which can be used to evaluate the robustness and generalization ability of MMRS.
Game  Observation shape  Action space size 

Assault  (84, 84, 4)  7 
Breakout  (84, 84, 4)  4 
Beam Rider  (84, 84, 4)  9 
Kung Fu Master  (84, 84, 4)  14 
Space Invaders  (84, 84, 4)  6 
Seaquest  (84, 84, 4)  18 
We leveraged convolutional neural networks (CNNs) to build MMRS and the benchmark algorithms. The LIRB of MMRS needs to learn an encoder and a decoder. The encoder was composed of four convolutional layers and one dense layer, in which each convolutional layer is followed by a batch normalization (BN) layer. The decoder utilized four deconvolutional layers to perform upsampling; moreover, a dense layer and a convolutional layer were employed at the top and the bottom of the decoder, respectively. Note that no BN layer is included in the decoder. Finally, we used the LeakyReLU activation function for both the encoder and the decoder; more detailed network architectures can be found in Appendix E.

We trained MMRS for five million environment steps. In each episode, the agent interacted with eight parallel environments with different random seeds. Moreover, one episode had a length of 128 steps, producing 1024 transitions. An observed state was first processed by the encoder of the VAE to generate a latent vector, which was then sent to the decoder to reconstruct the state. The pixel values of the true state and the reconstructed state were normalized into [0, 1], and the reconstruction error was employed as the state novelty. After that, we performed k-means clustering on the latent vectors with a fixed number of clusters; note that the number of clusters can be larger if a longer episode length is employed. Equipped with the multimodal intrinsic rewards of Eq. (18), we used a proximal policy optimization (PPO) schulman2017proximal
method to update the policy network. More specifically, we used a PyTorch implementation of the PPO method, which can be found in kostrikov2018github . To make a fair comparison, we employed identical policy and value networks for all the algorithms; their architectures can be found in Table 5. The PPO agent was trained with fixed values of the learning rate, entropy coefficient, value-function coefficient, and GAE parameter schulman2015high . In particular, a gradient clipping operation was performed to stabilize the learning procedure.

After the policy was updated, the transitions were utilized to update the VAE model of the LIRB. For the hyperparameter setting, the batch size was set to 64, and the Adam optimizer was leveraged to perform gradient descent. In particular, a linearly decaying learning rate was employed to prevent the lifelong state novelty from diminishing too rapidly. Finally, we trained the benchmark schemes following the default settings reported in the literature.
Next, we compare the performance of MMRS against the benchmark methods, using the episode return as the key performance indicator (KPI). Table 2 illustrates the performance comparison of MMRS and the benchmarks, in which the highest performance is shown in bold. In particular, "T" indicates that the method outperforms the vanilla PPO agent (VPA), while "F" indicates that it does not.
Game  PPO  PPO+RE3  PPO+RIDE  PPO+RND  PPO+MMRS 

Assault  2589.95  T/2870.87  T/2751.55  T/2642.08  T/3303.64 
Breakout  52.74  T/58.65  T/62.62  T/55.37  T/70.77 
Beam Rider  597.16  T/1225.72  T/748.51  F/541.27  T/1383.00 
Kung Fu Master  14450.0  F/13534.62  T/15830.43  F/12500.0  T/17295.65 
Space Invaders  558.90  T/631.75  T/749.32  T/577.60  T/751.71 
Seaquest  879.29  T/886.43  T/884.14  F/853.10  T/892.14 
As shown in Table 2, MMRS achieved the highest performance in all the selected games. RE3 outperforms the VPA in five games while failing in one. RIDE outperforms the VPA in all games and achieves suboptimal performance in one game. In contrast, RND outperforms the VPA in three games and fails in the other three. Furthermore, Fig. 3 illustrates the moving average of the episode return during training. The growth rate of MMRS is faster than that of the other benchmarks, but it shows more oscillation over the whole learning procedure.
Since we learn policies in environments with high-dimensional observations, it is non-trivial to evaluate the computational complexity of these intrinsic reward modules. In practice, we use the frames per second (FPS) during training as the KPI: if the agent takes a certain number of seconds to accomplish sampling and updating in one episode, the FPS is the ratio between the episode length and that time cost. As shown in Fig. 4, RE3 achieves the highest FPS during training because it requires no additional models or memories. RIDE employs three DNNs to reconstruct the transition process and introduces a pseudo-count method to compute the state visitation frequency; therefore, it achieves the lowest FPS. Our MMRS achieves the highest performance at the cost of lower computational efficiency, which can be further improved by simplifying the network architecture of the VAE model.
In this section, we further test MMRS on Bullet games with continuous action spaces; three classical games are listed in Table 3. Unlike the Atari games, which have image observations, Bullet games use fixed-length vectors to describe the environment. For instance, the "Ant" game uses 28 features to describe the state of the agent, and its action is a vector consisting of 8 values within (−1.0, 1.0).
Game  Observation shape  Action extent  Action shape 

Ant  (28, )  (−1.0, 1.0)  (8, ) 
Hopper  (15, )  (−1.0, 1.0)  (3, ) 
Humanoid  (44, )  (−1.0, 1.0)  (17, ) 
We leveraged multilayer perceptrons (MLPs) to implement MMRS and the benchmarks; the detailed network architectures can be found in Appendix E. Note that no BN layers were introduced in this experiment. We trained MMRS for one million environment steps. The agent again interacted with eight parallel environments with different random seeds in each episode, and a diagonal Gaussian distribution was used to sample actions. The rest of the updating procedure was consistent with the Atari experiments, but no normalization was applied to the states, and fixed weighting coefficients were used for computing the multimodal intrinsic rewards.

Table 4 illustrates the performance comparison between MMRS and the benchmarks. MMRS outperforms the VPA in all three games while achieving the best performance in two of them. RE3 beats all the other algorithms in one game and demonstrates suboptimal performance in another. RIDE and RND outperform the VPA in two games but fail in one. Furthermore, Fig. 5 demonstrates the moving average of the episode return during training. It is obvious that MMRS realizes stable and efficient growth compared with the benchmarks. In summary, MMRS shows great potential for obtaining considerable performance in both discrete and continuous control tasks.
Game  PPO  PPO+RE3  PPO+RIDE  PPO+RND  PPO+MMRS 

Ant  648.14  T/679.88  T/661.81  T/675.85  T/692.85 
Hopper  1071.58  F/971.55  F/889.53  F/765.63  T/1127.04 
Humanoid  18.19  T/27.85  T/23.03  T/20.88  T/25.31 
In this paper, we have investigated the problem of improving exploration in RL. We first dived into the sample complexity of the entropy-based approaches and obtained an exact lower bound. To eliminate the prohibitive sample complexity, a novel metric entitled JFI was introduced to replace the entropy regularizer. Moreover, we proved the utility consistency between the JFI and the entropy regularizer, and demonstrated the practical usage of JFI both in the tabular setting and in infinite state spaces. Equipped with the JFI metric, the state novelty was integrated to build multimodal intrinsic rewards, which evaluate the exploration extent more precisely. In particular, we used a VAE model to capture the lifelong state novelty across episodes; it avoids overfitting and learns excellent state representations compared with discriminative models. Finally, extensive simulations were performed on both discrete and continuous tasks of the OpenAI Gym library. The numerical results demonstrated that our algorithm outperforms the benchmarks, showing great effectiveness for realizing efficient exploration.
Considering sampling for T steps and collecting a dataset, a reasonable estimate of the SVD is:
According to McDiarmid's inequality mcdiarmid1989method , we have that:
Taking the logarithm on both sides yields:
Assume a failure probability δ, so that with probability 1 − δ it holds:
Let , such that:
(19) 
This concludes the proof.
Considering sampling for T steps and collecting a dataset, a reasonable estimate of the state entropy is:
For each state, Hoeffding's inequality hoeffding1994probability indicates that:
Therefore,
Taking the logarithm on both sides, it holds that:
Finally, it holds:
This concludes the proof.
Given a trajectory , the undiscounted return based on Eq. (12) is:
Recalling the optimal condition of JFI, it holds:
As T → ∞, the state visitation probability satisfies:
Therefore, the state visitation entropy obtains its optima. This concludes the proof.
Given a trajectory τ, RE3 first uses a randomly initialized DNN to encode the observed states. Denoting the encoding vectors by {y_i}, RE3 estimates their entropy using a k-nearest-neighbor (k-NN) entropy estimator singh2003nearest :
(20)  
where y_i^k is the k-NN of y_i within the set, d is the dimension of the encoding vector, Γ(·) is the Gamma function, and Ψ(·) is the digamma function.
Equipped with Eq. (20), the intrinsic reward of each transition is computed as:
(21) 
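A compact sketch of such a k-NN bonus (the log(1 + distance) squashing is a common choice we assume here; it is not necessarily RE3's exact formula):

```python
import numpy as np

def re3_intrinsic(encodings, k=3):
    """k-nearest-neighbor intrinsic reward in the spirit of RE3.

    encodings: (N, d) fixed-random-encoder outputs for the episode's states.
    Each state is rewarded by its distance to its k-th nearest neighbor:
    isolated (rarely visited) regions of the encoding space score high.
    """
    # Pairwise distances; after sorting each row, column 0 is the
    # self-distance (zero) and column k is the k-th nearest neighbor.
    dists = np.linalg.norm(encodings[:, None, :] - encodings[None, :, :], axis=-1)
    dists.sort(axis=1)
    knn = dists[:, k]
    return np.log(1.0 + knn)
```

Because the encoder is fixed and random, no representation learning is needed, which is why RE3 attains the highest FPS in our comparison.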
RIDE inherits the architecture of the intrinsic curiosity module (ICM) in pathak2017curiosity , which is composed of an embedding module, an inverse dynamics model, and a forward dynamics model. Given a transition, the inverse dynamics model predicts the action using the encodings of the state and the next state. Meanwhile, the forward dynamics model accepts the encoded state and the true action to predict the representation of the next state. Given a trajectory, the three models are trained to minimize the following loss function:
(22) 
where the action loss measures the distance between true and predicted actions, e.g., the cross entropy for discrete action spaces.
Finally, the intrinsic reward of each transition is computed as:
(23) 
where the episodic count is the number of times the state has been visited during the current episode; it can be obtained using the pseudo-count method ostrovski2017count .
RND leverages DNNs to record the visited states and compute their novelty; it consists of a predictor network and a target network. The target network serves as the reference: it is fixed and randomly initialized to set the prediction problem. The predictor network is trained on the data collected by the agent across episodes. The target and predictor networks are two DNNs that map states into an embedding space of fixed dimension. The RND is trained to minimize the following loss function:
(24) 
Finally, the intrinsic reward of each transition is computed as:
(25) 
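A linear toy version of the predictor/target mechanics can be sketched as follows (real RND uses deep networks; the linear maps below only illustrate how the prediction error yields the bonus of Eq. (25) and shrinks as states become familiar):

```python
import numpy as np

def make_rnd(obs_dim, embed_dim, seed=0):
    """Minimal linear RND sketch: a frozen random target map and a
    trainable predictor map (stand-ins for the two networks)."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=(obs_dim, embed_dim))  # fixed reference
    predictor = np.zeros((obs_dim, embed_dim))      # trained on visited states
    return target, predictor

def rnd_intrinsic(obs, target, predictor):
    """Per-state intrinsic reward: prediction error against the fixed target."""
    return np.linalg.norm(obs @ target - obs @ predictor, axis=-1)

def rnd_update(obs, target, predictor, lr=0.05):
    """One gradient step on the predictor's mean squared error (cf. Eq. (24))."""
    err = obs @ predictor - obs @ target
    return predictor - lr * (obs.T @ err) / len(obs)
```

Repeated updates on the same batch of states shrink their intrinsic reward, which is exactly the vanishing-bonus behavior that MMRS's JFI term is designed to counteract.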
Module  Policy network  Encoder  Decoder  
Input  State  State  Latent Variables  
Arch. 




Output  Action  Latent variables  Reconstructed state 
For instance, "8×8 Conv. 32" represents a convolutional layer that has 32 filters of size 8×8. A categorical distribution was used to sample actions based on the action probabilities of the stochastic policy. Note that "Dense 512 & Dense 512" in Table 5 means that there are two branches outputting the mean and variance of the latent variables, respectively.
Module  Policy network  Encoder  Decoder  
Input  State  State  Latent Variables  
Arch. 




Output  Action  Latent variables  Reconstructed state 