# Rényi State Entropy for Exploration Acceleration in Reinforcement Learning

One of the most critical challenges in deep reinforcement learning is to maintain the long-term exploration capability of the agent. To tackle this problem, it has been recently proposed to provide intrinsic rewards for the agent to encourage exploration. However, most existing intrinsic reward-based methods proposed in the literature fail to provide sustainable exploration incentives, a problem known as vanishing rewards. In addition, these conventional methods incur complex models and additional memory in their learning procedures, resulting in high computational complexity and low robustness. In this work, a novel intrinsic reward module based on the Rényi entropy is proposed to provide high-quality intrinsic rewards. It is shown that the proposed method actually generalizes the existing state entropy maximization methods. In particular, a k-nearest neighbor estimator is introduced for entropy estimation while a k-value search method is designed to guarantee the estimation accuracy. Extensive simulation results demonstrate that the proposed Rényi entropy-based method can achieve higher performance as compared to existing schemes.

## 1 Introduction

Reinforcement learning (RL) algorithms have to be designed to achieve an appropriate balance between exploitation and exploration [1]. However, many existing RL algorithms suffer from insufficient exploration, i.e., the agent cannot keep exploring the environment to visit all possible state-action pairs [2]. As a result, the learned policy prematurely falls into local optima after finitely many iterations [3]. To address this problem, a simple approach is to employ stochastic policies such as the ε-greedy policy and Boltzmann exploration [4]. These policies select every action with a non-zero probability in each state. For continuous control tasks, an additional noise term can be added to the action to perform limited exploration. Although such techniques can eventually learn the optimal policy in the tabular setting, they are futile when handling complex environments with high-dimensional observations.
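The two stochastic policies mentioned above can be sketched in a few lines; this is an illustrative implementation (the function names and the Q-value interface are ours), not code from the paper:

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values: every action keeps a non-zero probability."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```

With `epsilon = 0` the first policy degenerates to pure exploitation, while the Boltzmann policy always assigns strictly positive probability to every action.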

To cope with the exploration problems above, recent approaches leverage intrinsic rewards to encourage exploration. In sharp contrast to the extrinsic rewards explicitly given by the environment, intrinsic rewards represent the inherent learning motivation or curiosity of the agent [5]. Most existing intrinsic reward modules can be broadly categorized into novelty-based and prediction-error-based approaches [6, 7, 8, 9]. For instance, [10, 11, 12] employed a state visitation counter to evaluate the novelty of states, and the intrinsic rewards are defined to be inversely proportional to the visitation frequency. As a result, the agent is encouraged to revisit infrequent states while increasing the probability of exploring new states. In contrast, [13, 14, 3] followed an alternative approach in which the prediction error of a dynamics model is utilized as the intrinsic reward. Given a state transition, an auxiliary model is designed to predict the successor state based on the current state-action pair, and the intrinsic reward is computed as the Euclidean distance between the predicted and the true successor states. Notably, [15] attempted to perform RL using only intrinsic rewards, showing that the agent could achieve considerable performance in many experiments. Despite their good performance, these count-based and prediction-error-based methods suffer from vanishing intrinsic rewards, i.e., the intrinsic rewards decrease as visits accumulate [16]. The agent has no additional motivation to explore the environment further once the intrinsic rewards decay to zero. To maintain exploration across episodes, [17] proposed a never-give-up (NGU) framework that learns mixed intrinsic rewards composed of episodic and life-long state novelty. NGU evaluates episodic state novelty using a slot-based memory and a pseudo-count method [12], which encourages the agent to visit as many distinct states as possible in each episode. Since the memory is reset at the beginning of each episode, the intrinsic rewards do not decay during training. Meanwhile, NGU further introduces a random network distillation (RND) module to capture the life-long novelty of states, which prevents the agent from revisiting familiar states across episodes [8]. However, NGU suffers from a complicated architecture and high computational complexity, making it difficult to apply to arbitrary tasks. A more straightforward framework entitled rewarding-impact-driven exploration (RIDE) was proposed in [18]. RIDE inherits the inverse-forward pattern of [13], in which two dynamics models are leveraged to reconstruct the transition process. More specifically, the Euclidean distance between two consecutive encoded states is utilized as the intrinsic reward, which encourages the agent to take actions that result in larger state changes. Moreover, RIDE uses episodic state visitation counts to discount the generated rewards, preventing the agent from lingering at states that produce large embedding differences and avoiding the noisy-TV dilemma reported in [19].

However, both NGU and RIDE pay excessive attention to specific states while failing to reflect the global extent of exploration. Furthermore, they suffer from poor mathematical interpretability and performance loss incurred by auxiliary models. To circumvent these problems, [20] proposed a state entropy maximization method entitled random-encoder-for-efficient-exploration (RE3), forcing the agent to visit the state space more equitably. In each episode, the observation data is collected and encoded using a randomly initialized deep neural network. After that, a k-nearest neighbor estimator is leveraged to realize efficient entropy estimation [21]. Simulation results demonstrated that RE3 significantly improved the sampling efficiency of both model-free and model-based RL algorithms at a low computational cost. Despite its many advantages, RE3 ignores the important k-value selection problem, while its default random encoder entails low adaptability and robustness. Furthermore, [22] found that a Shannon entropy-based objective function may lead to a policy that visits some states with a vanishing probability, and proposed to maximize the Rényi entropy of the state-action distribution (MaxRényi). In contrast to RE3, MaxRényi provides a more appropriate optimization objective for sustainable exploration. However, [22] leverages a variational auto-encoder (VAE) to estimate the state-action distribution, which incurs high computational complexity and may mislead the agent due to imperfect estimation [23].

Inspired by the discussions above, we devise a more efficient and robust method for state entropy maximization to improve exploration in RL. In particular, we propose a Rényi State Entropy (RISE) maximization framework that provides high-quality intrinsic rewards. Our main contributions are summarized as follows:

• We propose a Rényi entropy-based intrinsic reward module that generalizes existing state entropy maximization methods such as RE3, and we provide a theoretical analysis of the Rényi entropy-based learning objective. The new module can be applied to arbitrary tasks, significantly improving exploration efficiency for both model-based and model-free RL algorithms;

• By leveraging a variational auto-encoder (VAE) model, the proposed module realizes efficient and robust encoding for accurate entropy estimation, which guarantees its generalization capability and adaptability. Moreover, a search algorithm is devised for the k-value selection to reduce the performance loss caused by a randomly chosen k;

• Finally, extensive simulations are performed to compare RISE against existing methods using both discrete and continuous control tasks as well as several hard-exploration games. The simulation results confirm that the proposed module achieves superior performance with higher efficiency.

## 2 Problem Formulation

We study the following RL problem that considers a Markov decision process (MDP) characterized by a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma \rangle$ [1], in which $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability, $r$ is the reward function, $\rho_0$ is the initial state distribution, and $\gamma \in [0,1)$ is a discount factor. We denote by $\pi$ the policy of the agent, which observes the state of the environment before choosing an action from the action space. The objective of RL is to find the optimal policy that maximizes the expected discounted return:

$$\pi^{*} = \mathop{\arg\max}_{\pi \in \Pi} \, \mathbb{E}_{\tau \sim \pi} \sum_{t=0}^{T-1} \gamma^{t} r_t(s_t, a_t), \tag{1}$$

where $\Pi$ is the set of all stationary policies, and $\tau$ is the trajectory collected by the agent.

In this paper, we aim to improve exploration in RL. To guarantee the completeness of exploration, the agent is required to visit all possible states during training. Such an objective can be regarded as the coupon collector's problem conditioned on a nonuniform probability distribution [24], in which the agent is the collector and the states are the coupons. Denote by $d^{\pi}(s)$ the state distribution induced by the policy $\pi$. Assuming that the agent takes $\tilde{T}$ environment steps to finish the collection, we can compute the expectation of $\tilde{T}$ as

$$\mathbb{E}_{\pi}(\tilde{T}) = \int_{0}^{\infty} \Big( 1 - \prod_{i=1}^{|\mathcal{S}|} \big( 1 - e^{-d^{\pi}(s_i)\, t} \big) \Big) \,\mathrm{d}t, \tag{2}$$

where $|\mathcal{S}|$ stands for the cardinality of the state space $\mathcal{S}$.
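Eq. (2) is the classical Poissonization identity for the coupon collector's problem, and it can be checked numerically against the exact inclusion-exclusion formula $\mathbb{E}[\tilde{T}] = \sum_{\emptyset \neq J \subseteq \mathcal{S}} (-1)^{|J|+1} / \sum_{i \in J} d(s_i)$. A small sketch (the helper names are ours):

```python
import math

def expected_steps_integral(p, t_max=200.0, dt=1e-3):
    """Evaluate E[T] = ∫_0^∞ (1 - Π_i (1 - e^{-p_i t})) dt by truncated
    trapezoidal quadrature; the integrand decays exponentially in t."""
    total, t = 0.0, 0.0
    prev = 1.0  # integrand value at t = 0
    while t < t_max:
        t += dt
        cur = 1.0 - math.prod(1.0 - math.exp(-pi * t) for pi in p)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total

def expected_steps_inclusion_exclusion(p):
    """Exact E[T] for the coupon collector via inclusion-exclusion."""
    n = len(p)
    total = 0.0
    for mask in range(1, 1 << n):
        subset = [p[i] for i in range(n) if mask >> i & 1]
        sign = 1.0 if len(subset) % 2 == 1 else -1.0
        total += sign / sum(subset)
    return total
```

For three states with probabilities 0.5, 0.3, and 0.2, both routes give roughly 6.65 expected steps, and the expectation grows quickly as any single probability shrinks, which is exactly the pathology the entropy objectives below try to avoid.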

For simplicity of notation, we sometimes omit the superscript $\pi$ in $d^{\pi}(s)$ in the sequel. Efficient exploration aims to find a policy that minimizes $\mathbb{E}_{\pi}(\tilde{T})$. However, it is non-trivial to evaluate Eq. (2) due to the improper integral, let alone solve the resulting optimization problem. To address the problem, it is common to leverage the Shannon entropy of the state distribution as a tractable objective function, defined as

$$H(d) = -\mathbb{E}_{s \sim d(s)}\big[\log d(s)\big]. \tag{3}$$

However, this objective function may lead to a policy that visits some states with vanishing probability. In the following section, we will first employ a representative example to demonstrate the practical drawbacks of Eq. (3) before introducing the Rényi entropy to address the problem.

## 3 Rényi State Entropy Maximization

### 3.1 Rényi State Entropy

We first formally define the Rényi entropy as follows:

###### Definition 1 (Rényi Entropy).

Let $X \in \mathbb{R}^{m}$ be a random vector with density function $f(x)$ with respect to the Lebesgue measure on $\mathbb{R}^{m}$, and let $\mathcal{X}$ be the support of the distribution. The Rényi entropy of order $\alpha$ ($\alpha > 0$, $\alpha \neq 1$) is defined as [22]:

$$H_{\alpha}(f) = \frac{1}{1-\alpha} \log \int_{\mathcal{X}} f^{\alpha}(x) \,\mathrm{d}x. \tag{4}$$

Using Definition 1, we propose the following Rényi state entropy (RISE):

$$H_{\alpha}(d) = \frac{1}{1-\alpha} \log \int_{\mathcal{S}} d^{\alpha}(s) \,\mathrm{d}s. \tag{5}$$

Fig. 1 uses a toy example to visualize the contours of different objective functions when an agent learns in an environment with only three states. As shown in Fig. 1, the Rényi entropy $H_{\alpha}(d)$ decreases rapidly when any state probability approaches zero, which prevents the agent from visiting a state with vanishing probability while encouraging it to explore infrequently-seen states. In contrast, the Shannon entropy remains relatively large as a state probability approaches zero and is far less aggressive in penalizing small probabilities. Interestingly, Fig. 1 also shows that the Rényi entropy better matches $\mathbb{E}_{\pi}(\tilde{T})$, and its order $\alpha$ provides a flexible control of the exploration intensity.
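The contrast in Fig. 1 can be checked numerically: for $\alpha \in (0,1)$, the Rényi entropy's gradient with respect to a vanishing state probability is much steeper than the Shannon entropy's, so the marginal incentive to revisit a nearly-unvisited state is far stronger. A minimal sketch with the same three-state setup (helper names are ours):

```python
import math

def shannon(d):
    """Shannon entropy of a discrete distribution, Eq. (3)."""
    return -sum(p * math.log(p) for p in d if p > 0)

def renyi(d, alpha):
    """Discrete Rényi entropy of order alpha, Eq. (5)."""
    return math.log(sum(p ** alpha for p in d)) / (1.0 - alpha)

def three_state(p):
    """Toy distribution: one rare state, the rest split evenly."""
    return [p, (1.0 - p) / 2.0, (1.0 - p) / 2.0]

# Marginal entropy gain from nudging the rare state's probability upward.
p, eps, alpha = 1e-6, 1e-6, 0.5
shannon_gain = shannon(three_state(p + eps)) - shannon(three_state(p))
renyi_gain = renyi(three_state(p + eps), alpha) - renyi(three_state(p), alpha)
```

Near the boundary the Rényi gain dominates the Shannon gain by more than an order of magnitude, which is the "more aggressive penalty on small probabilities" behavior described above.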

### 3.2 Theoretical Analysis

To maximize $H_{\alpha}(d)$, we consider the maximum entropy policy computation (MEPC) algorithm proposed in [25], which relies on the following two oracles:

###### Definition 2 (Approximating planning oracle).

Given a reward function $r$ and a gap $\epsilon_1 > 0$, the planning oracle returns a policy $\pi$ such that

$$V^{\pi} \geq \max_{\pi' \in \Pi} V^{\pi'} - \epsilon_1, \tag{6}$$

where $V^{\pi}$ is the state-value function.

###### Definition 3 (State distribution estimation oracle).

Given a gap $\epsilon_2 > 0$ and a policy $\pi$, this oracle returns an estimate $\hat{d}$ of the state distribution $d^{\pi}$ such that

$$\big\| d^{\pi} - \hat{d} \big\|_{\infty} \leq \epsilon_2. \tag{7}$$

Given a set of stationary policies $\{\pi_i\}$, we define a mixed policy as $\pi_{\mathrm{mix}} = (\{\pi_i\}, \boldsymbol{\omega})$, where $\boldsymbol{\omega}$ contains the non-negative weighting coefficients with $\sum_i \omega_i = 1$. The induced state distribution is then

$$d^{\pi_{\mathrm{mix}}}(s) = \sum_{i} \omega_i \, d^{\pi_i}(s). \tag{8}$$

Finally, the workflow of MEPC is summarized in Algorithm 1.

Consider the discrete case of the Rényi state entropy with $\alpha \in (0, 1)$:

$$H_{\alpha}(d) = \frac{1}{1-\alpha} \log \sum_{s \in \mathcal{S}} d^{\alpha}(s). \tag{9}$$

Since the logarithm is monotonically increasing, to maximize $H_{\alpha}(d)$ we can alternatively maximize

$$\tilde{H}_{\alpha}(d) = \frac{1}{1-\alpha} \sum_{s \in \mathcal{S}} d^{\alpha}(s). \tag{10}$$

Since $\tilde{H}_{\alpha}(d)$ is not smooth near the boundary of the probability simplex, we consider a smoothed version $\tilde{H}_{\alpha,\sigma}(d)$ defined as

$$\tilde{H}_{\alpha,\sigma}(d) = \frac{1}{1-\alpha} \sum_{s \in \mathcal{S}} \big( d(s) + \sigma \big)^{\alpha}, \tag{11}$$

where $\sigma > 0$ is a smoothing parameter.

###### Lemma 1.

$\tilde{H}_{\alpha,\sigma}(d)$ is $\beta$-smooth, such that

$$\big\| \nabla \tilde{H}_{\alpha,\sigma}(d) - \nabla \tilde{H}_{\alpha,\sigma}(d') \big\|_{\infty} \leq \beta \, \| d - d' \|_{\infty}, \tag{12}$$

where $\beta = \alpha \sigma^{\alpha-2}$.

###### Proof.

See proof in Appendix .2. ∎
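The Lipschitz constant $\beta = \alpha\sigma^{\alpha-2}$ of Lemma 1 can be verified numerically on finite distributions; a small sketch (function names are ours):

```python
import random

def grad_smoothed_renyi(d, alpha, sigma):
    """Gradient of H̃_{α,σ}(d) = (1/(1-α)) Σ_s (d(s)+σ)^α, Eq. (11)."""
    return [alpha / (1.0 - alpha) * (p + sigma) ** (alpha - 1.0) for p in d]

def check_beta_smooth(alpha=0.5, sigma=0.1, n=4, trials=200, seed=0):
    """Check Eq. (12) with β = α σ^{α-2} on random distribution pairs."""
    rng = random.Random(seed)
    beta = alpha * sigma ** (alpha - 2.0)
    for _ in range(trials):
        d = [rng.random() for _ in range(n)]
        s = sum(d); d = [p / s for p in d]        # random distribution d
        d2 = [rng.random() for _ in range(n)]
        s = sum(d2); d2 = [p / s for p in d2]     # random distribution d'
        g = grad_smoothed_renyi(d, alpha, sigma)
        g2 = grad_smoothed_renyi(d2, alpha, sigma)
        lhs = max(abs(a - b) for a, b in zip(g, g2))
        rhs = beta * max(abs(a - b) for a, b in zip(d, d2))
        if lhs > rhs + 1e-12:
            return False
    return True
```

The inequality holds deterministically here (by the mean value theorem the per-coordinate slope of the gradient is bounded by $\alpha(\xi+\sigma)^{\alpha-2} \le \alpha\sigma^{\alpha-2}$), so the random check never finds a violation.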

Now we are ready to give the following theorem:

###### Theorem 1.

For any $\epsilon > 0$, let the oracle gaps $\epsilon_1$ and $\epsilon_2$ be chosen proportional to $\epsilon$. It holds that

$$\tilde{H}_{\alpha,\sigma}\big(d^{\pi_{\mathrm{mix}},T}\big) \geq \max_{\pi \in \Pi} \tilde{H}_{\alpha,\sigma}\big(d^{\pi}\big) - \epsilon, \tag{13}$$

if Algorithm 1 is run for

$$T \geq \frac{10\alpha\sigma^{\alpha-2}}{\epsilon} \log \frac{10\alpha\sigma^{\alpha-1}}{(1-\alpha)\epsilon}. \tag{14}$$
###### Proof.

See proof in Appendix .3. ∎

Theorem 1 characterizes the computational complexity of maximizing $\tilde{H}_{\alpha,\sigma}(d)$ with MEPC. Moreover, a small $\alpha$ contributes to the exploration phase, which is consistent with the analysis in [22].

### 3.3 Fast Entropy Estimation

However, it is non-trivial to apply MEPC to complex environments with high-dimensional observations. To address this problem, we propose to utilize the following k-nearest neighbor estimator to realize efficient estimation of the Rényi entropy [26]. Note that $\pi$ in Eq. (15) denotes the ratio of the circumference of a circle to its diameter.

###### Theorem 2 (Estimator).

Denote by $\{X_1, \ldots, X_N\}$ a set of independent random vectors drawn from the distribution $f(x)$ on $\mathbb{R}^m$. For each $X_i$, let $\tilde{X}_i$ stand for the $k$-nearest neighbor of $X_i$ among the set. We estimate the Rényi entropy using the sample mean as follows:

$$\hat{H}^{k,\alpha}_{N}(f) = \frac{1}{N} \sum_{i=1}^{N} \Big[ (N-1) \, V_m \, C_k \, \big\| X_i - \tilde{X}_i \big\|^{m} \Big]^{1-\alpha}, \tag{15}$$

where $C_k = \big[\Gamma(k)/\Gamma(k+1-\alpha)\big]^{1/(1-\alpha)}$, $V_m = \pi^{m/2}/\Gamma(\frac{m}{2}+1)$ is the volume of the unit ball in $\mathbb{R}^m$, and $\Gamma(\cdot)$ is the Gamma function. Moreover, it holds that

$$\lim_{N \to \infty} \hat{H}^{k,\alpha}_{N}(f) = H_{\alpha}(f). \tag{16}$$
###### Proof.

See proof in [26]. ∎
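A minimal 1-D sketch of this estimator, following the construction in [26]: the sample mean in Eq. (15) estimates $\int f^{\alpha}(x)\,\mathrm{d}x$, and applying $\log(\cdot)/(1-\alpha)$ recovers $H_{\alpha}(f)$. For sorted scalar data the $k$ nearest neighbours lie within $k$ positions in the ordering, which keeps the sketch short; for uniform samples on $[0,1]$ the true Rényi entropy is 0 for every $\alpha$:

```python
import math
import random

def renyi_knn_estimate(samples, k, alpha):
    """k-NN Rényi entropy estimate for 1-D samples."""
    xs = sorted(samples)
    n, m = len(xs), 1
    v_m = 2.0                                   # volume of the unit ball in R^1
    c_k = (math.gamma(k) / math.gamma(k + 1.0 - alpha)) ** (1.0 / (1.0 - alpha))
    acc = 0.0
    for i in range(n):
        # in sorted 1-D data the k nearest neighbours are within k positions
        window = [abs(xs[i] - xs[j])
                  for j in range(max(0, i - k), min(n, i + k + 1)) if j != i]
        rho = sorted(window)[k - 1]             # distance to the k-th neighbour
        acc += ((n - 1) * c_k * v_m * rho ** m) ** (1.0 - alpha)
    integral = acc / n                          # ≈ ∫ f^α dx
    return math.log(integral) / (1.0 - alpha)   # ≈ H_α(f)

rng = random.Random(7)
data = [rng.random() for _ in range(2000)]      # uniform on [0, 1]: H_α = 0
h_hat = renyi_knn_estimate(data, k=5, alpha=0.5)
```

With 2000 samples the estimate lands close to the true value of 0; in higher dimensions a k-d tree or ball tree would replace the sorted-window trick.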

Given a trajectory $\tau$ collected by the agent, we approximate the Rényi state entropy of Eq. (5) using Eq. (15) as

$$\hat{H}^{k,\alpha}_{T}(d) = \frac{1}{T} \sum_{i=0}^{T-1} \Big[ (T-1) \, V_m \, C_k \, \big\| y_i - \tilde{y}_i \big\|^{m} \Big]^{1-\alpha} \propto \frac{1}{T} \sum_{i=0}^{T-1} \big\| y_i - \tilde{y}_i \big\|^{1-\alpha}, \tag{17}$$

where $y_i$ is the encoding vector of $s_i$ and $\tilde{y}_i$ is the $k$-nearest neighbor of $y_i$. After that, we define the intrinsic reward that treats each transition as a particle:

$$\hat{r}(s_i) = \big\| y_i - \tilde{y}_i \big\|^{1-\alpha}, \tag{18}$$

where the hat distinguishes the intrinsic reward $\hat{r}$ from the extrinsic reward $r$. Eq. (18) indicates that the agent needs to visit as many distinct states as possible to obtain higher intrinsic rewards.

Such an estimation method requires no additional auxiliary models, which significantly improves the learning efficiency. Equipped with the intrinsic reward, the total reward of each transition is computed as

$$r^{\mathrm{total}}_{t} = r(s_t, a_t) + \lambda_t \cdot \hat{r}(s_t) + \zeta \cdot H\big(\pi(\cdot \mid s_t)\big), \tag{19}$$

where $H(\pi(\cdot \mid s_t))$ is an action-entropy regularizer that improves exploration in the action space, $\lambda_t$ and $\zeta$ are two non-negative weight coefficients, and $\lambda_t$ decays over time at a fixed rate.
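Eqs. (18) and (19) translate directly into code; a brute-force $O(T^2)$ sketch (the function names are ours, and in practice a k-d tree would replace the pairwise distance scan):

```python
import math

def intrinsic_rewards(embeddings, k, alpha):
    """r̂(s_i) = ||y_i - ỹ_i||^{1-α}, with ỹ_i the k-NN of y_i (cf. Eq. (18))."""
    rewards = []
    for i, y in enumerate(embeddings):
        dists = sorted(
            math.dist(y, other) for j, other in enumerate(embeddings) if j != i
        )
        rewards.append(dists[k - 1] ** (1.0 - alpha))
    return rewards

def total_reward(extrinsic, intrinsic, policy_entropy, lam, zeta):
    """Cf. Eq. (19): extrinsic + decayed intrinsic + action-entropy bonus."""
    return extrinsic + lam * intrinsic + zeta * policy_entropy
```

An embedding far from the rest of the batch receives the largest bonus, which is precisely the "visit distinct states" incentive described above.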

## 4 Robust Representation Learning

While the Rényi state entropy encourages exploration in high-dimensional observation spaces, several implementation issues have to be addressed for its practical deployment. First of all, observations have to be encoded into low-dimensional vectors before the intrinsic reward is calculated. While a randomly initialized neural network can be utilized as the encoder, as proposed in [20], it cannot handle more complex and dynamic tasks, which inevitably incurs performance loss. Since training an encoder is far less computationally expensive than RL itself, we propose to leverage the VAE, a powerful generative model based on Bayesian inference [23], to realize efficient and robust embedding. As shown in Fig. 2(a), a standard VAE is composed of a recognition model and a generative model, which represent a probabilistic encoder and a probabilistic decoder, respectively.

We denote by $q_{\phi}(z \mid s)$ the recognition model, represented by a neural network with parameters $\phi$, which accepts an observation and encodes it into latent variables. Similarly, we denote by $p_{\psi}(s \mid z)$ the generative model, represented by a neural network with parameters $\psi$, which accepts the latent variables and reconstructs the observation. Given a trajectory, the VAE model is trained by maximizing the following evidence lower bound:

$$\mathcal{L}(s_t; \phi, \psi) = \mathbb{E}_{q_{\phi}(z \mid s_t)}\big[ \log p_{\psi}(s_t \mid z) \big] - D_{\mathrm{KL}}\big( q_{\phi}(z \mid s_t) \,\|\, p_{\psi}(z) \big), \tag{20}$$

where $p_{\psi}(z)$ is the prior over the latent variables, and $D_{\mathrm{KL}}$ is the Kullback–Leibler (KL) divergence.
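For a diagonal-Gaussian encoder with a standard-normal prior, and assuming a Bernoulli decoder (an illustrative choice, not prescribed by the paper), both terms of Eq. (20) have closed forms; a minimal sketch of the resulting loss:

```python
import math

def kl_to_standard_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x | z) for a Bernoulli decoder with mean x_hat."""
    return sum(
        xi * math.log(max(xh, eps)) + (1 - xi) * math.log(max(1 - xh, eps))
        for xi, xh in zip(x, x_hat)
    )

def negative_elbo(x, x_hat, mu, logvar):
    """The loss minimized in training: -E_q[log p(x|z)] + KL(q || p)."""
    return -bernoulli_log_likelihood(x, x_hat) + kl_to_standard_normal(mu, logvar)
```

The KL term vanishes exactly when the encoder outputs the prior ($\mu = 0$, $\log\sigma^2 = 0$), which is a handy sanity check for an implementation.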

Next, we elaborate on the choice of the $k$ value to improve the estimation accuracy of the state entropy. [21] investigated the performance of this entropy estimator for specific probability distributions such as the uniform and Gaussian distributions. Their simulation results demonstrated that the estimation accuracy first increases and then decreases as the $k$ value grows. To circumvent this problem, we propose the $k$-value search scheme shown in Fig. 2(b). We first divide the observation dataset into subsets, and the encoder encodes the data into low-dimensional embedding vectors. Assuming that all data samples are independent and identically distributed, an appropriate $k$ value should produce comparable estimates on different subsets. Exploiting this intuition, we propose to search for the $k$ value that minimizes the min-max ratio of the entropy estimates over the subsets. Denoting by $\pi_{\theta}$ the policy network, the detailed search procedure is summarized in Algorithm 2.
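The search scheme can be sketched as follows; we use the spread of the per-subset estimates as a stand-in for the min-max ratio, and a simple log-distance entropy proxy for 1-D data (both simplifications are ours):

```python
import math
import random

def knn_entropy_proxy(samples, k):
    """Per-subset proxy: mean log distance to the k-th nearest neighbour."""
    out = 0.0
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        out += math.log(dists[k - 1] + 1e-12)
    return out / len(samples)

def search_k(dataset, num_subsets=4, k_candidates=(2, 3, 5, 8, 12), seed=0):
    """Pick the k whose entropy estimates agree best across random subsets."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    size = len(data) // num_subsets
    subsets = [data[i * size:(i + 1) * size] for i in range(num_subsets)]
    best_k, best_spread = None, float("inf")
    for k in k_candidates:
        ests = [knn_entropy_proxy(s, k) for s in subsets]
        spread = max(ests) - min(ests)   # disagreement across subsets
        if spread < best_spread:
            best_k, best_spread = k, spread
    return best_k
```

The key design choice mirrors the text: a good $k$ is not the one that maximizes the estimate, but the one whose estimates are most stable across i.i.d. subsets of the data.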

Finally, we are ready to present the complete RISE framework, which exploits the optimal $k$ value derived above. As shown in Fig. 2(c), RISE first encodes the high-dimensional observation data into low-dimensional embedding vectors through the trained encoder. After that, the Euclidean distance between each embedding and its $k$-nearest neighbor is computed as the intrinsic reward. Algorithm 3 and Algorithm 4 summarize the on-policy and off-policy RL versions of RISE, respectively. In the off-policy version, the entropy estimation is performed on the sampled transitions in each step, so a larger batch size improves the estimation accuracy. It is worth pointing out that RISE can be straightforwardly integrated into any existing RL algorithm such as Q-learning or soft actor-critic, providing high-quality intrinsic rewards for improved exploration.

## 5 Experiments

In this section, we evaluate our RISE framework in both the tabular setting and environments with high-dimensional observations. We compare RISE against two representative intrinsic reward-based methods, namely RE3 and MaxRényi. A brief introduction to these benchmarking methods can be found in Appendix .1. We also train the agent without intrinsic rewards for ablation studies. As for the hyper-parameter settings, we only report the values of the best experimental results.

### 5.1 Maze Games

In this section, we first leverage a simple but representative example to highlight the effectiveness of Rényi state entropy-driven exploration. We introduce the grid-based environment Maze2D [27], illustrated in Fig. 3. The agent can move one position at a time in one of four directions, namely left, right, up, and down. The goal of the agent is to find the shortest path from the start point to the end point. In particular, the agent can teleport between portals bearing identical marks.

#### 5.1.1 Experimental Setting

The standard Q-learning (QL) algorithm [2] is selected as the baseline. We perform extensive experiments on three mazes of different sizes; note that the problem complexity increases exponentially with the maze size. In each episode, the maximum number of environment steps grows with the maze size. We initialized the Q-table with zeros and updated it in every step for efficient training. The update rule is given by:

$$Q(s, a) \leftarrow Q(s, a) + \eta \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big], \tag{22}$$

where $Q(s, a)$ is the action-value function and $\eta$ is the step size. An $\epsilon$-greedy policy with a fixed exploration rate was employed.
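Eq. (22) in code, as one tabular update step (the `defaultdict` Q-table mirrors the zero initialization described above):

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, eta=0.1, gamma=0.99):
    """One tabular Q-learning step, cf. Eq. (22)."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += eta * (r + gamma * best_next - q[(s, a)])

q = defaultdict(float)  # Q-table initialised to zero
actions = ["left", "right", "up", "down"]
q_update(q, s=(0, 0), a="right", r=1.0, s_next=(0, 1), actions=actions)
```

Combining RISE with this loop only changes the reward argument: `r` becomes the total reward of Eq. (19) instead of the extrinsic reward alone.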

#### 5.1.2 Performance Comparison

To compare exploration performance, we choose the minimum number of environment steps needed to visit all states as the key performance indicator (KPI): the fewer steps the agent needs to visit every reachable grid of the maze, the better its exploration. As seen in Fig. 4, the proposed Q-learning+RISE achieved the best performance in all three maze games. Moreover, RISE with a smaller $\alpha$ takes fewer steps to finish the exploration phase. This experiment confirms the strong capability of Rényi state entropy-driven exploration.

### 5.2 Atari Games

Next, we test RISE on the Atari games with discrete action spaces, in which the player aims to score as many points as possible while remaining alive [28]. To generate the observation of the agent, we stacked four consecutive frames as one input. The frames were cropped to a fixed size to reduce the required computational complexity.

#### 5.2.1 Experimental Setting

To handle graphic observations, we leveraged convolutional neural networks (CNNs) to build RISE and the benchmarking methods. For a fair comparison, the same policy network and value network are employed for all algorithms; their architectures can be found in Table 1. For instance, "8×8 Conv. 32" represents a convolutional layer with 32 filters of size 8×8. A categorical distribution was used to sample an action from the action probabilities of the stochastic policy. The VAE blocks of RISE and MaxRényi each need to learn an encoder and a decoder. The encoder is composed of four convolutional layers and one dense layer, in which each convolutional layer is followed by a batch normalization (BN) layer [29]. Note that "Dense 512 & Dense 512" in Table 1 means that two branches output the mean and variance of the latent variables, respectively. The decoder utilizes four deconvolutional layers to perform upsampling, while a dense layer and a convolutional layer are employed at the top and the bottom of the decoder, respectively. Finally, no BN layer is included in the decoder, and the ReLU activation function is employed for all components.

In the first phase, we initialized a policy network and let it interact with eight parallel environments with different random seeds. We first collected observation data over ten thousand environment steps, after which the VAE encoder generated fixed-dimensional latent vectors from the observation data. The latent vectors were then sent to the decoder to reconstruct the observation tensors, and the parameters were updated with an Adam optimizer. Finally, we divided the observation dataset into subsets before searching for the optimal $k$ value using Algorithm 2.

Equipped with the selected $k$ and the learned encoder, we trained RISE for ten million environment steps. In each episode, the agent again interacted with eight parallel environments with different random seeds, and the fixed-length episodes produced batches of transitions. After that, we calculated the intrinsic reward for all transitions using Eq. (18). Finally, the policy network was updated using the proximal policy optimization (PPO) method [30]; more specifically, we used the PyTorch implementation of PPO found in [31]. The PPO method was trained with a fixed learning rate, value-function coefficient, action-entropy coefficient, and generalized-advantage-estimation (GAE) parameter [32]. In particular, gradient clipping was performed to stabilize the learning procedure. As for the benchmarking methods, we trained them following the default settings reported in the literature [22, 20].

#### 5.2.2 Performance Comparison

The average one-life return is employed as the KPI in our performance comparison. Table 2 illustrates the performance comparison over eight random seeds on nine Atari games; for instance, 5.24k±1.86k represents a mean return of 5.24k with a standard deviation of 1.86k. The highest performance is shown in bold. As shown in Table 2, RISE achieved the highest performance in all nine games, while RE3 and MaxRényi each achieved the second-highest performance in three games. Furthermore, Fig. 5 illustrates the evolution of the average episode return during training for two selected games. It is clear that the return of RISE grows faster than that of all the benchmarking methods.

Next, we compare the training efficiency of RISE and the benchmarking methods, using frames per second (FPS) as the KPI: the FPS is computed as the ratio of the episode length to the time taken to finish training on that episode. For the vanilla PPO agent, the time cost involves only interaction and policy updates, whereas for the other methods it further involves intrinsic reward generation and auxiliary model updates. As shown in Fig. 6, the vanilla PPO method achieves the highest computational efficiency, while RISE and RE3 achieve the second-highest FPS. In contrast, MaxRényi has a far lower FPS than RISE and RE3, mainly because RISE and RE3 require no auxiliary model updates during policy learning, while MaxRényi trains a VAE to estimate the probability density function. Therefore, RISE has great advantages in both policy performance and learning efficiency.

### 5.3 Bullet Games

#### 5.3.1 Experimental Setting

Finally, we tested RISE on six Bullet games [33] with continuous action spaces, namely Ant, Half Cheetah, Hopper, Humanoid, Inverted Pendulum, and Walker 2D. In all six games, the target of the agent is to move forward as fast as possible without falling to the ground. Unlike the Atari games, which have graphic observations, the Bullet games use fixed-length vectors as observations. For instance, the "Ant" game describes the state of the agent with a fixed-length parameter vector, and its action is a vector of bounded continuous values.

We leveraged multilayer perceptrons (MLPs) to implement RISE and the benchmarking methods; the detailed network architectures are illustrated in Table 4. Note that the encoder and decoder were designed for MaxRényi, and no BN layers were introduced in this experiment. Since the state space is far simpler than that of the Atari games, the entropy can be estimated directly from the observations, and the training procedure for the encoder is omitted. We trained RISE for ten million environment steps. The agent again interacted with eight parallel environments with different random seeds, and a Gaussian distribution was used to sample actions. The rest of the update procedure was consistent with the Atari experiments.

#### 5.3.2 Performance Comparison

Table 3 illustrates the performance comparison between RISE and the benchmarking methods. Inspection of Table 3 suggests that RISE achieved the best performance in all six games. In summary, RISE has shown great potential for achieving excellent performance in both discrete and continuous control tasks.

## 6 Conclusion

In this paper, we have investigated the problem of improving exploration in RL by proposing a Rényi state entropy maximization method that provides high-quality intrinsic rewards. Our method generalizes existing state entropy maximization methods to achieve higher generalization capability and flexibility. Moreover, a $k$-value search algorithm has been developed, together with a VAE-based encoder, to obtain efficient and robust entropy estimation, making the proposed method practical for real-life applications. Finally, extensive simulations have been performed on both discrete and continuous tasks from the OpenAI Gym and Bullet libraries. Our simulation results confirm that the proposed algorithm can substantially outperform conventional methods through efficient exploration.

### .1 Benchmarking Methods

#### .1.1 RE3

Given a trajectory $\tau$, RE3 first uses a randomly initialized DNN to encode the visited states. Denoting by $\{x_0, \ldots, x_{T-1}\}$ the encoding vectors of the observations, RE3 estimates the entropy of the state distribution using a $k$-nearest neighbor entropy estimator [21]:

$$\hat{H}^{k}_{T}(d) = \frac{1}{T} \sum_{i=0}^{T-1} \log \frac{T \cdot \| x_i - \tilde{x}_i \|_2^{m} \cdot \pi^{m/2}}{k \cdot \Gamma(\frac{m}{2}+1)} + \log k - \Psi(k) \propto \frac{1}{T} \sum_{i=0}^{T-1} \log \| x_i - \tilde{x}_i \|_2, \tag{23}$$

where $\tilde{x}_i$ is the $k$-nearest neighbor of $x_i$ within the set, $m$ is the dimension of the encoding vectors, $\Gamma(\cdot)$ is the Gamma function, and $\Psi(\cdot)$ is the digamma function. Note that $\pi$ in Eq. (23) denotes the ratio of the circumference of a circle to its diameter. Equipped with Eq. (23), the total reward for each transition is computed as:

$$r^{\mathrm{total}} = r(s_t, a_t) + \lambda_t \cdot \log\big( \| x_t - \tilde{x}_t \|_2 + 1 \big), \tag{24}$$

where $\lambda_t$ is a weight coefficient that decays over time. Our RISE method is a generalization of RE3 that provides more aggressive exploration incentives.
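The qualitative difference between the two bonuses is easy to see numerically: RE3's logarithmic bonus saturates for distant neighbours, while RISE's power-law bonus (for $\alpha \in (0,1)$) keeps growing polynomially, which is the sense in which RISE is "more aggressive":

```python
import math

def re3_reward(knn_dist):
    """RE3 bonus: log(||x - x̃||_2 + 1), cf. Eq. (24) -- grows logarithmically."""
    return math.log(knn_dist + 1.0)

def rise_reward(knn_dist, alpha=0.5):
    """RISE bonus: ||y - ỹ||^{1-α}, cf. Eq. (18) -- grows polynomially."""
    return knn_dist ** (1.0 - alpha)
```

For a neighbour distance of 100, the RISE bonus (10 at $\alpha = 0.5$) is more than double the RE3 bonus ($\log 101 \approx 4.6$), so isolated states are rewarded far more strongly.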

#### .1.2 MaxRényi

The MaxRényi method aims to maximize the Rényi entropy of the state-action distribution $d^{\pi}_{\rho}(s, a)$. The gradient of its objective function is

$$\nabla_{\theta} H_{\alpha}\big(d^{\pi}_{\rho}\big) \propto \frac{\alpha}{1-\alpha} \, \mathbb{E}_{(s,a) \sim d^{\pi}} \Big[ \nabla_{\theta} \log \pi(a \mid s) \Big( \frac{1}{1-\gamma} \big\langle d^{\pi}_{s,a}, \big(d^{\pi}_{\rho}\big)^{\alpha-1} \big\rangle + \big( d^{\pi}_{\rho}(s, a) \big)^{\alpha-1} \Big) \Big]. \tag{25}$$

MaxRényi uses a VAE to estimate $d^{\pi}_{\rho}$ and takes the evidence lower bound (ELBO) as the density estimate [23], which suffers from low efficiency and high variance.

### .2 Proof of Lemma 1

Since the Hessian $\nabla^2 \tilde{H}_{\alpha,\sigma}(d)$ is a diagonal matrix, we have

$$\big\| \nabla \tilde{H}_{\alpha,\sigma}(d) - \nabla \tilde{H}_{\alpha,\sigma}(d') \big\|_{\infty} \leq \max_{\varsigma \in [0,1]} \big| \nabla^2 \tilde{H}_{\alpha,\sigma}\big( \varsigma d + (1-\varsigma) d' \big) \big| \cdot \| d - d' \|_{\infty} \leq \alpha \sigma^{\alpha-2} \, \| d - d' \|_{\infty}, \tag{26}$$

where the first inequality follows from Taylor's theorem. This concludes the proof.

### .3 Proof of Theorem 1

Equipped with Lemma 1 and choosing the step size $\eta$ appropriately, we have (as proved in [25]):

$$\tilde{H}_{\alpha,\sigma}\big(d^{\pi^*}\big) - \tilde{H}_{\alpha,\sigma}\big(d^{\pi_{\mathrm{mix}},T}\big) \leq B \exp(-T\eta) + 2\beta\epsilon_2 + \epsilon_1 + \eta\beta, \tag{27}$$

where $B = \alpha\sigma^{\alpha-1}/(1-\alpha)$. Thus it suffices to set the gaps $\epsilon_1$, $\epsilon_2$, and $\eta$ proportional to $\epsilon$. When Algorithm 1 is run for

$$T \geq 10\beta\epsilon^{-1} \log\big( 10 B \epsilon^{-1} \big), \tag{28}$$

it holds that

$$\tilde{H}_{\alpha,\sigma}\big(d^{\pi_{\mathrm{mix}},T}\big) \geq \max_{\pi \in \Pi} \tilde{H}_{\alpha,\sigma}\big(d^{\pi}\big) - \epsilon. \tag{29}$$

Substituting $\beta = \alpha\sigma^{\alpha-2}$ and $B$ into Eq. (28) accounts for the smoothing imposed on $\tilde{H}_{\alpha}(d)$ and yields the bound of Theorem 1. This concludes the proof.