The trade-off between exploration and exploitation is at the core of reinforcement learning (RL). Designing efficient exploration algorithms, while being a highly nontrivial task, is essential to success in many RL tasks [burda2018exploration, ecoffet2019go]. Hence, it is natural to ask the following high-level question: What can we achieve by pure exploration? To address this question, several settings related to meta reinforcement learning (meta RL) have been proposed (see, e.g., [wang2016learning, duan2016rl, finn2017model]). One common setting in meta RL is to learn a model in a reward-free environment in the meta-training phase, and to use the learned model as an initialization for fast adaptation to new tasks in the meta-testing phase [eysenbach2018diversity, gupta2018unsupervised, nagabandi2018learning]. Since the agent still needs to explore the environment under the new tasks in the meta-testing phase (some new tasks may require many additional samples and others few), it is unclear how to evaluate the effectiveness of the exploration in the meta-training phase. Another setting is to learn a policy in a reward-free environment and test the policy under a task with a specific reward function (such as the score in Montezuma's Revenge) without further training on the task [burda2018exploration, ecoffet2019go, burda2018large]. However, there is no guarantee that the algorithm has fully explored the transition dynamics of the environment unless we test the learned model against arbitrary reward functions. Recently, Jin et al. [jin2020reward] proposed a theoretical framework that fully decouples exploration and exploitation. Further, they designed a provably efficient algorithm that conducts a finite number of steps of reward-free exploration and returns near-optimal policies for arbitrary reward functions.
However, their algorithm is designed for the tabular case and can hardly be extended to continuous or high-dimensional state spaces since they construct a policy for each state.
In this paper, we consider a similar zero-shot meta RL framework, as follows: First, a single policy is trained to explore the dynamics of the reward-free environment in the exploration phase (i.e., the meta-training phase). Then, a dataset of trajectories is collected by executing the learned exploratory policy. In the planning phase (i.e., the meta-testing phase), an arbitrary reward function is specified and a batch RL algorithm [lange2012batch, fujimoto2018off] is applied to solve for a good policy solely based on the dataset, without further interaction with the environment. This framework is suited to scenarios in which there are many reward functions of interest or the reward is designed offline to elicit desired behavior. The key to success in this framework is to obtain, with as few samples as possible, a dataset with good coverage over all possible situations in the environment, which in turn requires the exploratory policy to fully explore the environment.
Several methods that encourage various forms of exploration have been developed in the reinforcement learning literature. The maximum entropy framework [haarnoja2017reinforcement] maximizes the cumulative reward while simultaneously maximizing the entropy over the action space conditioned on each state. This framework results in several efficient and robust algorithms, such as soft Q-learning [schulman2017equivalence, nachum2017bridging], SAC [haarnoja2018soft] and MPO [abdolmaleki2018maximum]. On the other hand, maximizing the state space coverage may result in better exploration. Various kinds of objectives and regularizations have been used to better explore the state space, including information-theoretic metrics [houthooft2016vime, mohamed2015variational, eysenbach2018diversity] (especially the entropy of the state space [hazan2018provably, islam2019entropy]), the prediction error of a dynamical model [burda2018large, pathak2017curiosity, de2018curiosity], the state visitation count [burda2018exploration, bellemare2016unifying, ostrovski2017count], and other heuristic signals such as novelty [lehman2011novelty, conti2018improving], surprise [achiam2017surprise] or curiosity [schmidhuber1991possibility].
To obtain an exploratory policy for the zero-shot meta RL framework, we propose to maximize the Rényi entropy over the state-action space in the exploration phase. In particular, we demonstrate the advantage of using the state-action space, instead of the state space, via a simple example (see Section 3 and Figure 1). Moreover, Rényi entropy generalizes a family of entropies, including the commonly used Shannon entropy. We justify the use of Rényi entropy as the objective function theoretically by providing an upper bound on the number of samples in the dataset that ensures a near-optimal policy is obtained for any reward function in the planning phase.
Further, we derive a gradient ascent update rule for maximizing the Rényi entropy over the state-action space. The derived update rule is similar to the vanilla policy gradient update, with the reward replaced by a function of the discounted stationary state-action distribution of the current policy. We use a variational autoencoder (VAE) [kingma2013auto] as the density model to estimate this distribution. The corresponding reward changes over iterations, which makes it hard to accurately estimate a value function under the current reward. To address this issue, we propose to estimate a state value function using off-policy data with the reward relabeled by the current density model. This enables us to efficiently update the policy in a stable way using PPO [schulman2017proximal]. Afterwards, we collect a dataset by executing the learned policy. In the planning phase, when a reward function is specified, we augment the dataset with the reward function and use a batch RL algorithm, batch-constrained deep Q-learning (BCQ) [fujimoto2018off, fujimoto2019benchmarking], to plan for a good policy under the reward function. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces. The experimental results indicate that our algorithm is effective, sample-efficient and robust in the exploration phase, and achieves good performance under the zero-shot meta RL framework.
Our contributions are summarized as follows:
(Section 3) We consider a zero-shot meta RL framework that separates exploration from exploitation and therefore places a higher requirement for an exploration algorithm. To efficiently explore under this framework, we propose a novel objective that maximizes the Rényi entropy over the state-action space in the exploration phase and justify this objective theoretically.
(Section 4) We propose a practical algorithm based on a derived policy gradient formulation to maximize the Rényi entropy over the state-action space for the zero-shot meta RL framework.
(Section 5) We conduct a wide range of experiments and the results indicate that our algorithm is efficient and robust during the exploration and results in superior performance in the downstream planning phase for arbitrary reward functions.
A reward-free environment can be formulated as a controlled Markov process (CMP) [fernandez1994controlled] $(\mathcal{S}, \mathcal{A}, P, \mu, \gamma)$, which specifies the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition dynamics $P(s' \mid s, a)$, the initial state-action distribution $\mu$ and the discount factor $\gamma \in (0, 1)$. A (stationary) policy $\pi_\theta$ parameterized by $\theta$ specifies the probability $\pi_\theta(a \mid s)$ of choosing the actions on each state. The stationary discounted state visitation distribution (or simply the state distribution) under the policy $\pi$ is defined as $d^\pi_{\mathcal{S}}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s \mid \pi)$. The stationary discounted state-action visitation distribution (or simply the state-action distribution) under the policy $\pi$ is defined as $d^\pi(s, a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s, a_t = a \mid \pi)$. Unless otherwise stated, we use $d^\pi$ to denote the state-action distribution. We also use $d^\pi_{s,a}$ to denote the distribution whose trajectories start from the state-action pair $(s, a)$ instead of from the initial state-action distribution $\mu$.
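For a small tabular CMP, the discounted state-action distribution defined above can be computed directly. The following sketch is our own illustration (the array layout `P[s, a, s']` and the variable names are not from the paper); it accumulates the discounted visitation sums term by term:

```python
import numpy as np

def discounted_sa_distribution(P, pi, mu, gamma, iters=2000):
    """Compute d^pi(s, a) = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a).

    P[s, a, s'] : transition probabilities, pi[s, a] : policy probabilities,
    mu[s] : initial state distribution (placed over states for simplicity).
    """
    S, A = pi.shape
    state_dist = mu.copy()                 # Pr(s_t = s)
    d = np.zeros((S, A))
    coef = 1.0                             # gamma^t
    for _ in range(iters):
        sa = state_dist[:, None] * pi      # Pr(s_t = s, a_t = a)
        d += coef * sa
        # next-state distribution: sum over (s, a) of Pr(s_t=s, a_t=a) P(s'|s,a)
        state_dist = np.einsum("sa,sax->x", sa, P)
        coef *= gamma
    return (1.0 - gamma) * d
```

Since the series is geometric, truncating after a few thousand terms is numerically exact for moderate discount factors.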
When a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is specified, the CMP becomes a Markov decision process (MDP) [sutton2018introduction]. The objective for the MDP is to find a policy that maximizes the expected cumulative reward, where $Q^{\pi}$ denotes the Q function. The policy gradient for this objective is $\nabla_\theta J(\pi_\theta) \propto \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\left[Q^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)\right]$ [williams1992simple].
Rényi entropy for a distribution $d$ is defined as $H_\alpha(d) = \frac{1}{1-\alpha}\log\left(\sum_{x} d(x)^{\alpha}\right)$ for $\alpha \geq 0$, $\alpha \neq 1$, where $d(x)$ is the probability mass or the probability density function on $x$ (and the summation becomes an integration in the latter case). When $\alpha = 0$, Rényi entropy becomes Hartley entropy and equals the logarithm of the size of the support of $d$. When $\alpha \to 1$, Rényi entropy becomes Shannon entropy [bromiley2004shannon, sanei2016renyi].
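The definition above, together with its two limiting cases, is easy to check numerically. A minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # zero-mass outcomes contribute nothing
    if np.isclose(alpha, 1.0):         # limit alpha -> 1: Shannon entropy
        return -np.sum(p * np.log(p))
    if alpha == 0.0:                   # Hartley entropy: log of support size
        return np.log(len(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)
```

For a uniform distribution, $H_\alpha$ equals the logarithm of the support size for every $\alpha$, and $H_\alpha$ approaches Shannon entropy as $\alpha \to 1$.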
Given a distribution $d$, the corresponding density model parameterized by $\phi$ gives a probability density estimate of $d$ based on samples drawn from $d$. The variational auto-encoder (VAE) [kingma2013auto] is a popular density model that maximizes the variational lower bound (ELBO) of the log-likelihood. Specifically, VAE maximizes the lower bound of $\log p_\phi(x)$, i.e., $\mathbb{E}_{z \sim q_e(z \mid x)}[\log p_d(x \mid z)] - D_{\mathrm{KL}}(q_e(z \mid x) \,\|\, p(z))$, where $p_d$ and $q_e$ are the decoder and the encoder respectively and $p(z)$ is a prior distribution for the latent variable $z$.
3 The objective for the exploration phase
The objective for the exploration phase under the zero-shot meta RL framework is to induce an informative and compact dataset: the informative condition is that the dataset should have good coverage, so that the planning phase generates good policies for arbitrary reward functions; the compact condition is that the size of the dataset should be as small as possible while still ensuring successful planning. In this section, we show that the Rényi entropy over the state-action space, $H_\alpha(d^\pi)$, is a good objective function for the exploration phase. We first demonstrate the advantage of maximizing the state-action space entropy over maximizing the state space entropy with a toy example. Then, we provide a motivation for using Rényi entropy by analyzing a deterministic setting. Finally, we provide an upper bound on the number of samples needed in the dataset for successful planning if we have access to a policy that maximizes the Rényi entropy over the state-action space. For ease of analysis, we assume the state-action space is discrete in this section, and derive a practical algorithm that deals with continuous state-action spaces in the next section.
Why maximize the entropy over the state-action space? We demonstrate the advantage of maximizing the entropy over the state-action space with a toy example shown in Figure 1. The example contains an MDP with two actions and five states. The first action always drives the agent back to the first state, while the second action moves the agent to the next state. For simplicity of presentation, we consider a particular value of the discount factor, but other values behave similarly. The policy that maximizes the entropy of the state distribution is a deterministic policy that takes the actions shown in red. The dataset obtained by executing this policy is poor, since the planning algorithm fails when, in the planning phase, a sparse reward is assigned to one of the state-action pairs that the policy visits with zero probability (e.g., a reward function that is nonzero on exactly one such pair). In contrast, a policy that maximizes the entropy of the state-action distribution avoids this problem. For example, when executing the policy that maximizes the Rényi entropy over the state-action space, the expected size of the dataset needed to contain all the state-action pairs is only 44 (cf. Appendix The toy example in Figure 1). Note that, when the transition dynamics is known to be deterministic, a dataset containing all the state-action pairs is sufficient for the planning algorithm to obtain an optimal policy, since then the full transition dynamics is known.
Why using Rényi entropy? We analyze a deterministic setting where the transition dynamics is known to be deterministic. In this setting, the objective for the framework can be expressed as a specific objective function for the exploration phase. This objective function is hard to optimize but motivates us to use Rényi entropy as the surrogate.
We define $K$ as the cardinality of the state-action space. Given an exploratory policy $\pi$, we assume the dataset is collected in a way such that the transitions in the dataset can be treated as i.i.d. samples from $d^\pi$, where $d^\pi$ is the state-action distribution induced by the policy $\pi$.
In the deterministic setting, we can recover the exact transition dynamics of the environment from a dataset of transitions that contains all the state-action pairs. Such a dataset leads to successful planning for arbitrary reward functions, and therefore satisfies the informative condition. In order to obtain such a dataset that is also compact, we stop collecting samples once the dataset contains all the pairs. Given the distribution $d$ from which we collect samples, the average size of the dataset is $F(d)$, where
$F(d) = \int_{0}^{\infty} \left( 1 - \prod_{i=1}^{K} \left( 1 - e^{-d_i t} \right) \right) dt$,
which is a result of the coupon collector's problem [flajolet1992birthday]. Accordingly, the objective for the exploration phase can be expressed as minimizing $F(d^\pi)$ over the policy $\pi$. We show the contour of this function in Figure 2a. We can see that, when any component of the distribution approaches zero, $F(d)$ increases rapidly.
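The expected dataset size in this coupon-collector argument can be evaluated numerically (for evaluation, not for optimization). The sketch below is our own illustration of the Poissonized identity cited above, and shows that a skewed distribution needs far more samples than a near-uniform one:

```python
import numpy as np
from scipy.integrate import quad

def expected_cover_time(p):
    """Expected number of i.i.d. draws from p until every outcome has appeared,
    via the Poissonization identity:
        F(p) = integral_0^inf ( 1 - prod_i (1 - exp(-p_i * t)) ) dt.
    """
    p = np.asarray(p, dtype=float)
    integrand = lambda t: 1.0 - np.prod(1.0 - np.exp(-p * t))
    val, _ = quad(integrand, 0, np.inf, limit=200)
    return val
```

For two equally likely outcomes the classical answer is $2(1 + 1/2) = 3$ draws, and pushing mass away from uniform (e.g., probabilities 0.9 and 0.1) sharply increases the expected cover time.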
However, this function involves an improper integral that is hard to handle, and therefore cannot be directly used as an objective function in the exploration phase. One common choice for a tractable objective function is Shannon entropy [hazan2018provably, islam2019entropy]. Still, Shannon entropy suffers from the following problem: the policy that maximizes Shannon entropy may visit some state-action pairs with vanishing probability, so that many more samples are needed to collect these pairs under such a policy. We provide an illustrative example in Figure 2b. Consider a CMP with three state-action pairs (specified in Appendix The example in Figure 2). For this CMP, the feasible region $\{d^\pi : \pi \in \Pi\}$ in the distribution space is the green line, where $\Pi$ is the set of all stationary policies. The state-action distribution of the Shannon-entropy-maximizing policy is marked by the green star. We can see that, although the maximum probability with which any policy can reach the last state-action pair is already small, this policy visits the pair with an even smaller probability, so the expected number of samples needed to ensure successful planning is much larger. By comparison, the expected number of samples is substantially smaller when Rényi entropy with a small $\alpha$ is used as the objective function. This indicates that Rényi entropy with a small $\alpha$ is a better objective function than Shannon entropy.
Theoretical justification for the objective function. Next, we formally justify the use of Rényi entropy over the state-action space with the following theorem. For ease of analysis, we consider a standard episodic case: the MDP has a finite planning horizon $H$ with the objective to maximize the cumulative reward $\mathbb{E}\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]$, where the initial state $s_1$ is picked from an initial state distribution $\mu$. We assume the reward function is deterministic. A (non-stationary, stochastic) policy $\pi = \{\pi_h\}_{h=1}^{H}$ specifies the probability of choosing the actions on each state and at each step. The state-action distribution induced by $\pi$ on the $h$-th step is denoted $d^\pi_h$.
Denote by $\Phi = \{\pi^{(h)}\}_{h=1}^{H}$ a set of policies, where $\pi^{(h)}$ maximizes the Rényi entropy $H_\alpha(d^\pi_h)$ of the state-action distribution on the $h$-th step and $\alpha \in [0, 1)$. Construct a dataset $\mathcal{D}$ with $N$ trajectories, each of which is collected by first uniformly randomly choosing a policy from $\Phi$ and then executing that policy. Assume $N$ is sufficiently large; the precise bound depends on the size of the state-action space, the horizon $H$, the accuracy $\epsilon$, the confidence $\delta$, the entropy parameter $\alpha$ and an absolute constant. Then, there exists a planning algorithm such that, for any reward function $r$, with probability at least $1 - \delta$, the output policy $\hat{\pi}$ of the planning algorithm based on $\mathcal{D}$ is $\epsilon$-optimal, i.e., $V^{\hat{\pi}} \geq \max_{\pi} V^{\pi} - \epsilon$.
We provide the proof in Appendix Proof of Theorem 3.1. The theorem justifies that Rényi entropy with small $\alpha$ is a proper objective function for the exploration phase, since the number of samples needed to ensure successful planning is bounded when $\alpha$ is small. When $\alpha \to 1$, the bound becomes infinite. The algorithm proposed by Jin et al. [jin2020reward] requires sampling a number of trajectories where $\tilde{O}$ hides a logarithmic factor, which matches our bound when $\alpha$ is small. However, they construct a policy for each state on each step, whereas we only need $H$ policies, which easily adapts to the non-tabular case.
In this section, we develop an algorithm that works in the non-tabular case. In the exploration phase, we update the policy to maximize $H_\alpha(d^{\pi_\theta})$. We first derive a gradient ascent update rule, which is similar to vanilla policy gradient with the reward replaced by a function of the state-action distribution of the current policy. Afterwards, we estimate the state-action distribution using VAE. We also estimate a value function and update the policy using PPO, which is more sample-efficient and robust than vanilla policy gradient. Then, we obtain a dataset by collecting samples from the learned policy. In the planning phase, we use a popular batch RL algorithm, BCQ, to plan for a good policy when the reward function is specified. One may also use other batch RL algorithms. We show the pseudocode of the process in Algorithm 1, the details of which are described in the following paragraphs.
Policy gradient formulation. Let us first consider the gradient of the objective function $H_\alpha(d^{\pi_\theta})$, where the policy is approximated by a policy network with parameters denoted as $\theta$. We omit the dependency of $d^{\pi_\theta}$ on $\theta$. The gradient of the objective function is given in (2). As a special case, when $\alpha \to 1$, the Rényi entropy becomes the Shannon entropy and the gradient turns into (3). Due to the space limit, the derivation can be found in Appendix Policy gradient of state-action space entropy.
There are two terms in the gradients.
The first term equals the policy gradient with the reward proportional to $(d^{\pi_\theta}(s,a))^{\alpha-1}$ (or $-\log d^{\pi_\theta}(s,a)$ in the Shannon case), which resembles the policy gradient (of cumulative reward) for a standard MDP.
This term encourages the policy to choose the actions that lead to rarely visited state-action pairs.
In a similar way, the second term resembles the policy gradient of the instant reward and encourages the agent to choose the actions that are rarely selected at the current step.
For stability, we replace this term with the gradient of the entropy of the policy over the actions, which encourages the agent to choose the actions uniformly conditioned on the state samples and therefore plays a similar role (see also Appendix .3). We found that using this term leads to more stable performance empirically, since it does not suffer from the high variance induced by the estimation of $d^{\pi_\theta}$. Accordingly, we update the policy based on the following formula, where $\beta$ is a hyperparameter:
Discussion. Islam et al. [islam2019entropy] motivate the agent to explore by maximizing the Shannon entropy over the state space, resulting in an intrinsic reward that is similar to ours when $\alpha \to 1$. Bellemare et al. [bellemare2016unifying] use an intrinsic reward proportional to $\hat{N}(s)^{-1/2}$, where $\hat{N}(s)$ is an estimate of the visit count of $s$. Our algorithm with $\alpha = 1/2$ induces a similar reward.
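To make the comparison above concrete, the intrinsic rewards can be sketched as follows, assuming the entropy-induced reward is proportional to the estimated density raised to the power alpha - 1 (consistent with the count-based analogy in the text, but written in our own notation):

```python
import numpy as np

def intrinsic_reward(density, alpha):
    """Intrinsic reward induced by maximizing the visitation entropy:
    rarely visited pairs (small density) receive large reward.

    alpha -> 1 recovers the Shannon case (-log d); alpha = 0.5 gives a
    count-style bonus proportional to 1 / sqrt(d), as in pseudo-count methods.
    """
    d = np.asarray(density, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.log(d)
    return d ** (alpha - 1.0)
```

Smaller $\alpha$ amplifies the bonus on rarely visited pairs, which matches the motivation for preferring small $\alpha$ in the previous section.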
Sample collection. To estimate $d^{\pi_\theta}$ for calculating the underlying reward, we collect samples in the following way (cf. Line 5 of Algorithm 1): in each iteration, we sample a batch of trajectories. In each trajectory, we terminate the rollout at each step with probability $1 - \gamma$, so that the collected state-action pairs follow the discounted visitation distribution. In this way, we obtain a set of trajectories of varying lengths. Then, we use VAE to estimate $d^{\pi_\theta}$ based on these samples (cf. Line 6 of Algorithm 1).
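The geometric-termination rollout above can be sketched as follows; `env_step`, `reset` and `policy` are assumed interfaces for illustration, not part of the original algorithm:

```python
import numpy as np

def collect_discounted_samples(env_step, reset, policy, gamma, n_traj, rng):
    """Collect (s, a) samples whose marginal distribution matches the
    discounted visitation distribution d^pi: after recording each step,
    terminate the rollout with probability 1 - gamma."""
    samples = []
    for _ in range(n_traj):
        s = reset()
        while True:
            a = policy(s)
            samples.append((s, a))
            if rng.random() < 1.0 - gamma:   # geometric termination
                break
            s = env_step(s, a)
    return samples
```

Trajectory lengths are geometric with mean $1/(1-\gamma)$, so the number of environment steps per iteration is easy to budget in advance.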
Value function. Instead of performing vanilla policy gradient, we update the policy using PPO, which is more robust and sample-efficient. However, the underlying reward function changes across iterations. This makes it hard to incrementally learn the value function that is used to reduce variance in PPO. We propose to train a value function network using relabeled off-policy data. In each iteration, we maintain a replay buffer that stores the trajectories of the last several iterations (cf. Line 5 of Algorithm 1). Next, we recalculate the reward for each state-action pair in the buffer with the latest density estimator, i.e., we assign $\hat{d}(s,a)^{\alpha-1}$ or $-\log \hat{d}(s,a)$ for each pair $(s, a)$, where $\hat{d}$ is the density estimate. Based on these rewards, we estimate the target value for each state using the truncated TD($\lambda$) estimator [sutton2018introduction], which balances bias and variance (see the details in Appendix Value function estimator). Then, we train the value function network $V_\psi$ (where $\psi$ is the parameter) to minimize the mean squared error with respect to the target values.
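The paper's exact truncated TD($\lambda$) estimator is given in its appendix; a standard backward-recursive form, which we assume here as a sketch, interpolates between one-step TD targets ($\lambda = 0$) and Monte Carlo returns bootstrapped at the truncation point ($\lambda = 1$):

```python
import numpy as np

def truncated_td_lambda_targets(rewards, values, gamma, lam):
    """TD(lambda) targets via the backward recursion
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    bootstrapping from V at the (truncated) end of the trajectory.

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (length T + 1).
    """
    T = len(rewards)
    targets = np.zeros(T)
    g = values[T]                       # bootstrap at truncation
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * g)
        targets[t] = g
    return targets
```

With the relabeled rewards from the density model plugged in as `rewards`, the resulting `targets` serve as regression targets for the value network.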
Policy network. In each iteration, we update the policy network to maximize the clipped surrogate objective function that is used in PPO, where the advantage is estimated using GAE [schulman2015high] and the learned value function $V_\psi$.
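The PPO clipped surrogate referred to above can be sketched as follows (a standard form of the objective, not the paper's exact equation):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate: mean of min(rho * A, clip(rho, 1-eps, 1+eps) * A),
    where rho = pi_new(a|s) / pi_old(a|s) is the importance ratio."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv, dtype=float)
    return np.mean(np.minimum(rho * adv,
                              np.clip(rho, 1.0 - eps, 1.0 + eps) * adv))
```

The clipping caps the incentive to move the policy far from the data-collecting policy, which is what makes the relabeled off-policy updates stable in practice.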
We conduct experiments to answer the following questions: 1) Does our exploration algorithm MaxRenyi empirically obtain near-optimal policies in terms of the entropy over the state-action space? 2) How does MaxRenyi perform compared with previous exploration algorithms in the exploration phase? 3) Does MaxRenyi lead to better performance in the planning phase compared with previous exploration algorithms? 4) Does our algorithm succeed in complex environments? The experiments are conducted five times with different random seeds. The lines in the figures indicate the average and the shaded areas or the error bars indicate the standard deviation. The detailed experiment settings and hyperparameters can be found in Appendix Experiment settings and hyperparameters.
Experiments on Pendulum and FourRooms. To answer the first question, we compare the entropy induced by the agent during training in MaxRenyi with the maximum entropy the agent can achieve. We implement MaxRenyi for two simple environments, Pendulum and FourRooms, where we can solve for the maximum entropy by brute-force search. Pendulum from OpenAI Gym [1606.01540] has a continuous state-action space. For this task, the entropy is estimated by discretizing the state space into grids and the action space into two discrete actions. FourRooms is a grid-world environment with four actions and deterministic transitions. We show the results in Figure 3. The results indicate that our exploration algorithm approaches the optimum in terms of the corresponding state-action space entropy for different $\alpha$, in both discrete and continuous settings.
Experiments on MultiRooms. To answer the second and the third questions, we implement our algorithm for the MultiRooms environment from minigrid [gym_minigrid], which is hard to explore for standard RL algorithms. In the exploration phase, the agent has to navigate and open doors to explore through a series of rooms. In the planning phase, we randomly assign a goal state to one of the grids, reward the agent if it reaches that state, and then train a policy with this reward function. There are four actions: turning left/right, moving forward and opening the door. The observation is the first-person perspective of the agent, which is high-dimensional and partially observable (cf. Figure 4a).
To answer the second question, we compare our exploration algorithm MaxRenyi with RND [burda2018exploration] (which uses the prediction error as the intrinsic reward), MaxEnt [hazan2018provably] (which maximizes the state space entropy) and an ablated version of our algorithm, MaxRenyi(VPG) (which updates the policy directly by vanilla policy gradient using (2) and (3)). We show the performance of the different algorithms in the exploration phase in Figure 4b. First, we observe that using PPO to update the policy performs better than the ablated VPG version, indicating that MaxRenyi benefits from the variance reduction of a value function. Second, we see that MaxRenyi is more sample-efficient than MaxEnt, which invokes multiple runs of a standard RL algorithm. Also, MaxRenyi is more stable than RND, which explores well at the start but later degenerates as the agent becomes familiar with all the states.
To answer the third question, we collect datasets of different sizes by executing different exploratory policies, and use the datasets to compute policies with different reward functions using BCQ. We show the results in Figure 4c. First, we observe that the datasets generated by running random policies do not lead to successful planning, indicating the importance of learning a good exploratory policy in this framework. Second, a dataset with only 8k samples leads to successful planning (with a cumulative reward larger than 0.8) using MaxRenyi, whereas a dataset with 16k samples is needed to succeed in the planning phase when using MaxEnt or RND. This illustrates that MaxRenyi leads to better performance in the planning phase (i.e., it attains good policies with fewer samples) than the previous exploration algorithms.
Experiments on Montezuma's Revenge. To answer the last question, we implement our algorithm for Montezuma's Revenge [machado18arcade], a video game with high-dimensional observations and discrete actions. We show the result for the exploration phase in Figure 5a. We observe that our algorithm with different values of $\alpha$ successfully explores 10 to 20 rooms within 0.2 billion steps and performs better than RND. Then, we collect a dataset with 100 million samples by executing an obtained exploratory policy. We design two sparse reward functions that only reward the agent if it goes through Room 3 or Room 7 (to the next room). We show the trajectories of the policies planned with the two reward functions in red and blue respectively in Figure 5b. We see that, although the reward functions are sparse, the agent chooses the correct path (e.g., opening the correct door in Room 1 with the only key) and successfully reaches the specified room. This indicates that our algorithm generates good policies for different reward functions based on offline planning in complex environments.
In this paper, we consider a zero-shot meta RL framework, which is useful for training agents with multiple reward functions or for designing the reward function offline. In this framework, an exploratory policy is learned by interacting with a reward-free environment in the exploration phase and generates a dataset. In the planning phase, when the reward function is specified, a policy is computed offline to maximize the corresponding cumulative reward using a batch RL algorithm based on the dataset. We propose a novel objective function, the Rényi entropy over the state-action space, for the exploration phase. We justify this objective theoretically and design a practical algorithm to optimize it. In the experiments, we show that our exploration algorithm is effective under this framework, while being more sample-efficient and more robust than the previous exploration algorithms.
The zero-shot meta RL framework studied in this paper may be useful for designing safe reinforcement learning agents. Under this framework, we can iteratively design the reward function and train the agent offline to avoid behaviors that are dangerous to the system. Furthermore, it is possible to have the trajectories collected by humans, which ensures safety during the overall training process. In this case, what constitutes a good trajectory dataset is an interesting research question, and our work may be seen as an attempt to answer this question.
The toy example in Figure 1
In this example, the initial state-action pair is fixed. Due to the simplicity of this toy example, we can solve for the policy that maximizes the Rényi entropy of the state-action distribution in closed form. The optimal policy is
where each entry represents the probability of choosing a given action on the corresponding state. The corresponding state-action distribution is
where each element represents the probability mass of the state-action distribution on the corresponding state-action pair. Using equation (1), we obtain the expected number of samples collected from this distribution until the dataset contains all the state-action pairs, which is 44.
The example in Figure 2
In this example, we consider a CMP with three state-action pairs. Specifically, there are two states $s_1$ and $s_2$. The state $s_1$ has only one action $a_1$. The state $s_2$ has two actions $a_1$ and $a_2$. Therefore, the state-action space is $\{(s_1, a_1), (s_2, a_1), (s_2, a_2)\}$ and the state space is $\{s_1, s_2\}$. The transition matrix of this CMP is
and the initial state action distribution of this CMP is
This example illustrates a property that holds widely in other CMPs: a good objective function should encourage the policy to visit the hard-to-reach states as frequently as possible, and Rényi entropy does a better job than Shannon entropy in this respect.
Proof of Theorem 3.1
We provide Theorem 3.1 to justify our objective that maximizes the entropy over the state-action space in the exploration phase for the zero-shot meta RL setting. In the following analysis, we consider the standard episodic and tabular case, as follows: the MDP is defined as $(\mathcal{S}, \mathcal{A}, H, P, r, \mu)$, where $\mathcal{S}$ is the state space with $|\mathcal{S}| = S$, $\mathcal{A}$ is the action space with $|\mathcal{A}| = A$, $H$ is the length of each episode, $P = \{P_h\}_{h=1}^{H}$ is the set of unknown transition matrices, $r = \{r_h\}_{h=1}^{H}$ is the set of reward functions and $\mu$ is the unknown initial state distribution. We denote by $P_h(s' \mid s, a)$ the transition probability from $(s, a)$ to $s'$ on the $h$-th step. We assume that the instant reward $r_h(s, a)$ for taking the action $a$ on the state $s$ on the $h$-th step is deterministic. A policy $\pi = \{\pi_h\}_{h=1}^{H}$ specifies the probability of choosing the actions on each state and at each step. We denote the set of all such policies as $\Pi$.
The agent with policy $\pi$ interacts with the MDP as follows: at the start of each episode, an initial state $s_1$ is sampled from the distribution $\mu$. Next, at each step $h = 1, \ldots, H$, the agent first observes the state $s_h$ and then selects an action $a_h$ from the distribution $\pi_h(\cdot \mid s_h)$. The environment returns the reward $r_h(s_h, a_h)$ and transits to the next state $s_{h+1}$ according to the probability $P_h(\cdot \mid s_h, a_h)$. The state-action distribution induced by $\pi$ on the $h$-th step is $d^\pi_h(s, a) = \Pr(s_h = s, a_h = a \mid \pi)$.
The state value function for a policy $\pi$ is defined as the expected cumulative reward from a given state onward. The objective is to find a policy that maximizes the cumulative reward $V^\pi = \mathbb{E}\left[\sum_{h=1}^{H} r_h(s_h, a_h) \mid \pi\right]$.
The proof goes through in a similar way to that of Jin et al. [jin2020reward].
Proof of Theorem 3.1.
Given the dataset specified in the theorem, a reward function and the level of accuracy , we use the planning algorithm shown in Algorithm 2 to obtain a near-optimal policy . The planning algorithm first estimates a transition matrix based on and then solves for a policy based on the estimated transition matrix and the reward function .
Define $\hat{V}^\pi$ as the value function of the policy $\pi$ on the MDP with the estimated transition dynamics $\hat{P}$. We first decompose the error into the following terms:
Then, the proof of the theorem goes as follows:
To bound the two evaluation error terms, we first present Lemma .4 to show that the policy which maximizes Rényi entropy is able to visit the state-action space reasonably uniformly, by leveraging the convexity of the feasible region in the state-action distribution space (Lemma .2). Then, with Lemma .4 and the concentration inequality in Lemma .5, we show that the two evaluation error terms can be bounded for any policy and any reward function (Lemma .7).
To bound the optimization error term, we use the natural policy gradient (NPG) algorithm as the APPROXIMATE-MDP-SOLVER in Algorithm 2 to solve for a near-optimal policy on the estimated and the given reward function . Finally, we apply the optimization bound for the NPG algorithm [agarwal2019optimality] to bound the optimization error term (Lemma .8).
Definition .1 (Feasible region).
The feasible region in the state-action distribution space on the $h$-th step is defined as $\mathcal{K}_h = \{d^\pi_h : \pi \in \Pi\} \subseteq \Delta_{K}$, where $K$ is the cardinality of the state-action space.
Hazan et al. [hazan2018provably] proved the convexity of such feasible regions in the infinite-horizon and discounted setting. For completeness, we provide a similar proof on the convexity of in the episodic setting.
[Convexity of $\mathcal{K}_h$] $\mathcal{K}_h$ is convex. Namely, for any $\pi_1, \pi_2 \in \Pi$ and $x \in [0, 1]$, denoting $d_1 = d^{\pi_1}_h$ and $d_2 = d^{\pi_2}_h$, there exists a policy $\pi \in \Pi$ such that $d^\pi_h = x d_1 + (1 - x) d_2$.
Proof of Lemma .2.
Define a mixture policy that first chooses from $\{\pi_1, \pi_2\}$ with probability $x$ and $1 - x$ respectively at the start of each episode, and then executes the chosen policy through the episode. Define the state distribution (or the state-action distribution) for this mixture policy on each step as $\tilde{d}_h$. Obviously, $\tilde{d}_h = x d_1 + (1 - x) d_2$. For any mixture policy, we can construct a (non-mixture) policy $\pi$ with $\pi_h(a \mid s) \propto \tilde{d}_h(s, a)$ such that $d^\pi_h = \tilde{d}_h$ [puterman2014markov]. In this way, we find a policy $\pi \in \Pi$ such that $d^\pi_h = x d_1 + (1 - x) d_2$. ∎
Similar to Jin et al. [jin2020reward], we define $\epsilon$-significance for state-action pairs on each step and show that the policy that maximizes Rényi entropy is able to reasonably uniformly visit the significant state-action pairs.
Definition .3 ($\epsilon$-significance).
A state-action pair $(s, a)$ on the $h$-th step is $\epsilon$-significant if there exists a policy $\pi$ such that the probability of reaching $(s, a)$ on the $h$-th step following the policy $\pi$ is greater than $\epsilon$, i.e., $d^\pi_h(s, a) > \epsilon$.
Recall the way we construct the dataset $\mathcal{D}$: with the set of policies $\Phi = \{\pi^{(h)}\}_{h=1}^{H}$, where $\pi^{(h)}$ maximizes the Rényi entropy of the state-action distribution on the $h$-th step, we first uniformly randomly choose a policy from $\Phi$ at the start of each episode, and then execute that policy through the episode. Note that there is a set of policies that maximize the Rényi entropy on the $h$-th step, since the policy on the subsequent steps does not affect the entropy on the $h$-th step. We denote the induced state-action distribution on the $h$-th step as $\bar{d}_h$, i.e., the dataset can be regarded as being sampled from $\bar{d}_h$ for each step $h$. Therefore, $\bar{d}_h = \frac{1}{H}\sum_{h'=1}^{H} d^{\pi^{(h')}}_h \geq \frac{1}{H} d^{\pi^{(h)}}_h$.
If the state-action pair $(s, a)$ is $\epsilon$-significant on the $h$-th step, then the lower bound (7) on $\bar{d}_h(s, a)$ holds.
Proof of Lemma .4.
For any $\epsilon$-significant pair $(s, a)$, consider a policy $\pi$ that reaches $(s, a)$ on the $h$-th step with probability greater than $\epsilon$. Denote $d_1 = d^{\pi^{(h)}}_h$ and $d_2 = d^{\pi}_h$. We can treat each distribution
as a vector with $K$ dimensions and use the first dimension to represent the probability on $(s, a)$. Denote $d_x = (1 - x) d_2 + x d_1$. Since $d_1, d_2 \in \mathcal{K}_h$, by Lemma .2, $d_x \in \mathcal{K}_h$. Since $H_\alpha$ is concave over the probability simplex when $\alpha \in [0, 1)$ and $d_1$ maximizes $H_\alpha$ over $\mathcal{K}_h$, $H_\alpha(d_x)$ will monotonically increase as we increase $x$ from 0 to 1, i.e.,
This is true when , i.e., when , we have
Since are non-negative, we have
Note that , we have
Since , we have . Using the fact that , we can obtain the inequality in (7). ∎
When , we have
Suppose $\hat{P}_h$ is the empirical transition matrix estimated from samples that are i.i.d. drawn from $\bar{d}_h$ and $P_h$ is the true transition matrix. Then, with probability at least $1 - \delta$, for any $h$ and any value function $V$, we have:
Proof of Lemma .5.
The lemma is proved in a similar way to Lemma C.2 in [jin2020reward], which uses a concentration inequality and a union bound. The proof differs in that we do not need to count the different events on the action space for the union bound, since the state-action distribution is given here (instead of only the state distribution). This results in a missing factor in the logarithm term compared with their lemma. ∎
Lemma .6 (Lemma E.15 in Dann et al. [dann2017unifying]).
For any two MDPs $M'$ and $M''$ with rewards $r'$ and $r''$ and transition probabilities $P'$ and $P''$, the difference in the values $V'$ and $V''$ with respect to the same policy $\pi$ is
Under the conditions of Theorem 3.1, with probability at least $1 - \delta$, for any reward function $r$ and any policy $\pi$, we have:
Proof of Lemma .7.
The proof is similar to that of Lemma 3.6 in Jin et al. [jin2020reward] but differs in several details. Therefore, we still provide the proof for completeness. The proof is based on the following observations: 1) the total contribution of all insignificant state-action pairs is small; 2) the estimated transition dynamics $\hat{P}$ is reasonably accurate for significant state-action pairs, due to the concentration inequality in Lemma .5.
Using Lemma .6 on the value functions $V^\pi$ and $\hat{V}^\pi$ with the true transition dynamics $P$ and the estimated transition dynamics $\hat{P}$ respectively, we have
Let $X_h$ be the set of $\epsilon$-significant state-action pairs on the $h$-th step. We have
By the definition of $\epsilon$-significant pairs, we have
As for the remaining term, by the Cauchy–Schwarz inequality and Lemma .4, we have
By Lemma .5, we have