# Linear Reinforcement Learning with Ball Structure Action Space

We study the problem of Reinforcement Learning (RL) with linear function approximation, i.e. assuming the optimal action-value function is linear in a known d-dimensional feature mapping. Unfortunately, based on this assumption alone, the worst-case sample complexity has been shown to be exponential, even under a generative model. Instead of making further assumptions on the MDP or the value functions, we assume that our action space is such that there always exist playable actions to explore any direction of the feature space. We formalize this assumption as a “ball structure” action space, and show that being able to freely explore the feature space allows for efficient RL. In particular, we propose a sample-efficient RL algorithm (BallRL) that learns an ϵ-optimal policy using only Õ(H^5d^3/ϵ^3) trajectories.


## 1 Introduction

Reinforcement Learning (RL) is a well-studied framework for sequential decision making that has been successfully applied to real-world problems in fields such as game-play (Atari, AlphaGo, Starcraft), robotics, operations management, and more (mnih2013playing; silver2016mastering; vinyals2017starcraft; kober2013reinforcement). However, many of the existing theoretical results cannot be applied to practical problems due to an intractably large number of states and/or actions. A common modeling assumption to address this issue is the existence of a known feature mapping that maps each state-action pair to a d-dimensional feature vector, and that either the underlying MDP dynamics or the value functions are linear in this feature mapping. In this work, we consider the common setting where the optimal action-value function (or Q*-function) is linear, i.e. it can be written as the inner product of the feature mapping of a state-action pair and some unknown parameter vector. The primary goal is to determine whether there exist algorithms that can achieve a near-optimal policy using an efficient number of samples. Here, efficient sample complexity refers to a polynomial number of samples with respect to the feature dimension d, the horizon H, and the size of the action set.

This setting has garnered much attention recently; however, in the general case, pessimistic results have been shown in weisz2021exponential; weisz2022tensorplan; du2019good; wang2021exponential, which indicate that this problem is exponentially hard in the horizon or the size of the action set. Furthermore, this pessimistic result holds even with access to a generative model that allows for arbitrary state “resets.” Recently, several works have made further assumptions on the MDP that allow for efficient learning when the Q*-function is linear (jin2020provably; amortila2022few). These typically include additional assumptions on the transition or reward model, or access to additional side information such as expert queries. However, many of these assumptions are restrictive, unrealistic, or infeasible for many practical use cases, since in the real world we typically do not have well-behaved transition models or access to expert oracles.

We seek a general yet practical assumption that is novel, realistic, and amenable to efficient learning. Our work is motivated by the observation that, in some difficult real-world RL applications such as game-play and operations management, it may be easier to think of actions (or consecutive actions) in feature space rather than state space. For example, in a typical dungeon-survival game with various tasks such as fighting monsters, eating food, or searching for treasure, the feature space could include combat statistics, health, and special items. Instead of actions consisting of low-level controls (e.g. movement, engage, run, etc.), we would consider higher-level “feature space” actions (e.g. fight monster, eat food, dig for treasure). Now, we conjecture that if a learning algorithm is always able to play actions to properly explore the feature space, then, combined with the d-dimensional feature mapping that exists in the case of linear RL, it should be able to learn a near-optimal policy efficiently. In order to mathematically characterize this property, we introduce the concept of a “ball structure” action space. This assumes that our action space always forms a d-dimensional ball of radius ρ, so that every direction of the feature space has a corresponding action that can be taken, and therefore at any time step we are able to explore in any direction of the feature space. However, a perfect ball-shaped action space may be somewhat unrealistic; therefore, we allow some flexibility on the degree of exploration in each direction by considering less-restrictive settings, such as when the action set is instead contained within a convex set, or when the radius of the ball is allowed to differ from one time step to the next.

Our main result is the BallRL (pronounced “baller”) algorithm, which leverages the exploration capabilities of the ball structure assumption and achieves sample-efficient learning guarantees. The results hold under a very mild trajectory-learning (PAC) setting: we do not assume prior knowledge of the action sets, and each sampled trajectory reveals only the action sets along the trajectory, together with the total reward. The algorithm takes advantage of the ball structure action space for exploration, which can be shown to be efficient using the closed-form solution of the optimal Bellman equation, and enjoys a sample complexity bound of Õ(H^5d^3/ϵ^3) trajectories for an ϵ-optimal policy. Furthermore, a similar algorithm and complexity bound hold when the action set is a convex set instead of a ball, under additional mild assumptions. Altogether, our results show that with a ball structure action set, we can achieve an exponential improvement over algorithms for the linear Q* problem without the ball structure assumption. We also demonstrate that our algorithm is easy to implement and computationally efficient.

### 1.1 Organization of the Paper

The rest of the paper is structured as follows: in Sections 2 and 3 we introduce the problem setting and review prior work in the literature on the linear Q* problem. In Section 4 we present our learning algorithm and demonstrate that efficient learning is possible assuming a ball structure action set. In particular, we present two generalizations of the assumption: in Section 4.1 we consider the case when every state in step h shares the same convex action set, and in Section 4.2 we allow the ball structure action set to vary by state. Note that the simpler ball structure assumption is a special case of both settings. We conclude in Section 5 with some discussion.

### 1.2 Notations

We will use ⟨·,·⟩ and ‖·‖₂ to denote the inner product and the 2-norm in ℝ^d, respectively. Let B₂(ρ) represent the ℓ₂-ball of radius ρ in ℝ^d. The expectation E_π will denote the expectation over all trajectories obtained according to the policy π and the underlying transition models and reward functions. We also follow standard big-Oh notation: we write f = O(g) if there exists some positive constant C such that f ≤ Cg, and write f = Õ(g) if f = O(g) up to logarithmic factors. Here, d is the dimension of the feature space, H the time horizon of an episode, δ the high-probability parameter, and ϵ the near-optimality parameter of the learned policy.

## 2 Background

### 2.1 Preliminaries

The Markov Decision Process (MDP) (sutton2018reinforcement; puterman2014markov) is a well-known model of the typical reinforcement learning environment. We consider finite-horizon MDPs, defined by the horizon H, the per-step state spaces S₁, …, S_H, the action space A(s) of each state s, the transition model P, the reward function r, and the initial state distribution. The horizon and the state spaces are known to the learner, but the action spaces, the transition model, the reward function, and the initial state distribution are not. To avoid confusion, without loss of generality we assume that the state spaces of different steps are disjoint.

For a given MDP, a policy π is a mapping from the state space to the action space, where π(s) ∈ A(s) for all s. For a given policy π, we define its value functions (V-functions) and Q-functions according to the following iterative equations:

$$V^\pi_{H+1}(s_{H+1})=0,\qquad Q^\pi_{H+1}(s_{H+1},a_{H+1})=0,\qquad\forall\,s_{H+1},a_{H+1},$$
$$Q^\pi_h(s_h,a_h)=r(s_h,a_h)+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\,V^\pi_{h+1}(s_{h+1}),$$
$$V^\pi_h(s_h)=Q^\pi_h(s_h,\pi(s_h)).$$

We further define the optimal V- and Q-functions as:

$$Q^*_h(s_h,a_h)=\max_\pi Q^\pi_h(s_h,a_h),\qquad V^*_h(s_h)=\max_\pi V^\pi_h(s_h).$$

For the optimal Q-function we have the optimal Bellman equations, so that for all h,

$$Q^*_h(s_h,a_h)=r(s_h,a_h)+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\max_{a_{h+1}}Q^*_{h+1}(s_{h+1},a_{h+1}). \tag{1}$$

A typical reinforcement learning problem objective is to determine an algorithm that recovers a policy that performs well relative to the unknown optimal policy ; performance is generally defined by comparing the learned policy’s and optimal policy’s value functions. In the following section we detail our specific problem setting and objective.
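On a small tabular MDP, the recursions above can be evaluated by backward induction. The following is a minimal sketch with hypothetical sizes and randomly generated dynamics (illustrative only, not tied to this paper's linear setting):

```python
import numpy as np

# Toy finite-horizon MDP: S states, A actions, horizon H (hypothetical sizes).
H, S, A = 3, 2, 2
rng = np.random.default_rng(0)
r = rng.random((S, A))                     # r[s, a]: mean reward in [0, 1)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)          # P[s, a, s']: transition probabilities

# Backward induction per equation (1): V_{H+1} = 0, then
# Q_h(s, a) = r(s, a) + sum_{s'} P(s'|s, a) V_{h+1}(s'), V_h(s) = max_a Q_h(s, a).
V = np.zeros(S)
for h in range(H, 0, -1):
    Q = r + P @ V                          # expected one-step reward plus continuation value
    V = Q.max(axis=1)                      # optimal value at step h

print(V)  # optimal values V*_1 for each initial state
```

Since each per-step reward lies in [0, 1), the resulting optimal values are bounded by the horizon H.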

### 2.2 Our problem setting

Due to the intractability of dealing with extremely high-dimensional state spaces, we make the standard linear Q* assumption for our problem; that is, the optimal Q-function is linear in a d-dimensional feature mapping of the state and action:

###### Assumption 1 (Non-Stationary Linear Q∗ Assumption)

For each state-action pair (s, a), there exists a known feature vector φ(s, a) ∈ ℝ^d. There are also unknown parameters θ*₁, …, θ*_H ∈ ℝ^d, such that the optimal Q-function of state-action pairs has the following parametrization:

$$Q^*_h(s_h,a_h)=\langle\varphi(s_h,a_h),\theta^*_h\rangle,\qquad\forall\,1\le h\le H,\ s_h\in\mathcal S_h,\ a_h\in\mathcal A(s_h).$$

While Assumption 1 appears to be a very strong statement about the optimal Q-function, it is known that, by itself, the assumption is not enough to guarantee efficient learning. Therefore, we present our ball structure assumption, which we will show allows for sample-efficient RL under the linear Q* assumption.

###### Assumption 2 (Ball Structure Action Set)

Define the ball with radius ρ as

$$B_2(\rho)\triangleq\big\{x\in\mathbb R^d\mid\|x\|_2\le\rho\big\}.$$

For each state s, there exist a feature vector φ(s) and a positive number ρ(s) such that

$$\{\varphi(s,a)\mid a\in\mathcal A(s)\}=\varphi(s)+B_2(\rho(s)).$$

Without loss of generality, we can assume that

$$\mathcal A(s)=B_2(\rho(s))\triangleq\big\{a\in\mathbb R^d\mid\|a\|_2\le\rho(s)\big\},$$

and also

$$\varphi(s,a)=\varphi(s)+a.$$

This is because if there exist two actions a, a′ such that φ(s, a) = φ(s, a′), then Assumption 1 implies that Q*(s, a) = Q*(s, a′). Hence if we remove a′ from the action set, the optimal value will remain the same. Therefore, if we remove these redundant actions and find a near-optimal policy of the reduced MDP, this policy must also be a near-optimal policy of the original MDP.

By the above, after removing redundant actions, we can assume that φ(s, ·) is an injection, i.e. a one-to-one mapping from A(s) to ℝ^d. Hence we can replace every action a with φ(s, a) − φ(s), after which the property φ(s, a) = φ(s) + a holds. Thus, without loss of generality, in the rest of the paper we will assume φ(s, a) = φ(s) + a always holds.

Because the transition model and reward function are unknown at the beginning, the learner will only be able to access samples, or realizations, of them by directly interacting with the environment. That is, the learner must execute a policy to actually observe the outcome of those actions. We will consider the following trajectory learning setting:

[Trajectory Learning] At every iteration, the learner first picks a policy π (a function mapping every state s to some action in A(s)), and then a trajectory (s₁, a₁, …, s_H, a_H) is sampled according to the true underlying MDP. Only the following two pieces of information are revealed to the learner:

1. A(s₁), …, A(s_H): the action sets of each state in the trajectory;

2. ∑_{h=1}^H R(s_h, a_h): the sum of rewards along the trajectory, where R(s_h, a_h) denotes the instant reward obtained by taking action a_h at state s_h, which satisfies E[R(s_h, a_h)] = r(s_h, a_h).

Note that our trajectory learning setting reveals strictly less information than the standard PAC learning setting in the literature, where it is assumed that all the information in the trajectory is revealed, including the states s_h and the instantaneous rewards R(s_h, a_h). Our algorithm also does not require the generative model that is standard in some works on linear Q*. Therefore, our algorithm applies to both the common PAC learning setting and the generative model setting. For more information, please refer to Section 3.

Finally, in order to measure the performance of our learner’s policy, we use the standard notion of an ϵ-optimal policy: [ϵ-optimal policy] If a policy π satisfies

$$|V^\pi(s_0)-V^*(s_0)|\le\epsilon,$$

then we call π an ϵ-optimal policy. Here, V^π is the value function obtained by following the policy π, and V^* is the value function of the true optimal policy.

Our objective in this work is to develop an algorithm which can find an ϵ-optimal policy with high probability, using a number of trajectory-learning iterations polynomial in d, H, and 1/ϵ.

## 3 Related Literature

The linear Q* problem is one of the simplest and most intuitive ways to describe reinforcement learning with parametrization. Many works have studied this setting with the goal of developing a sample-efficient algorithm that learns a near-optimal policy. However, in the most general case, recent work has yielded only pessimistic results. In weisz2021exponential; weisz2022tensorplan; du2019good; wang2021exponential; foster2021statistical, the linear Q* problem has been shown to be exponentially hard in the dimension d or the horizon H, even when the number of actions is small. Their main idea revolves around proving an exponential lower bound by constructing a needle-in-a-haystack-type MDP, i.e., among exponentially many actions there is only one action that induces rewards, hence in order to find the optimal action the learner must run policies an exponential number of times. Additionally, they adopt the Johnson-Lindenstrauss lemma to show that these actions can be chosen such that every two actions are sufficiently far apart, so that querying non-optimal actions gives limited information about the optimal action.

Apart from pessimistic results, there are many works which demonstrate that the linear Q* problem is polynomially solvable with additional assumptions. The assumptions are quite varied and numerous, and we attempt to give an overview of the different types that have allowed for efficient learning. If for all policies π the Q^π-function can be linearly parameterized, then the problem is polynomially solvable using approximate policy iteration (lattimore2020learning). If both the transition model and reward function are deterministic, then the problem is polynomially solvable by eliminating Q-functions that do not satisfy the linear Q* assumption (wen2013efficient). If a “core set” exists for the MDP (that is, the features of every state-action pair can be written as convex combinations of features in the core set), then the problem is polynomially solvable (zanette2019limiting; shariff2020efficient). In comparison, our algorithm has access to an orthogonal basis at first, which is similar to the idea of a core set. However, the core set cannot capture our setting, since the ball cannot be written as convex combinations of basis vectors, and simply adopting their algorithm would induce exponential sample complexity. Under the assumption that the action set is finite, the TensorPlan algorithm in weisz2021query obtains an ϵ-optimal policy with polynomial sample complexity when the number of actions is fixed. Alternatively, if we assume access to an expert oracle which returns the value of Q*(s, a) when queried at state s, the DELPHI algorithm solves the linear Q* problem in polynomial time using only polynomially many expert queries (amortila2022few).

Beyond the linear Q* problem, there are also several works which achieve polynomial sample complexity under more general assumptions on the MDP’s underlying properties. If the transition model can be linearly parametrized, then the problem becomes polynomially solvable, as shown in jin2020provably; yang2019sample; yang2020reinforcement; jia2020model. However, a linear transition model is a fairly strong and generally impractical assumption, as most systems do not behave this way. There are also works on generalized function approximation, e.g. Eluder dimension (ayoub2020model; wang2020reinforcement), Bellman rank (jiang2017contextual), Bellman-Eluder dimension (jin2021bellman), bilinear classes (du2021bilinear), and Bellman completeness (jin2021bellman; zanette2020learning). Again, these assumptions are either hard to verify in practice or generally do not hold in real-world systems, which makes the use of these algorithms difficult to justify in practice.

## 4 The BallRL Algorithm

We now present the main result of our paper, an algorithm that achieves polynomial sample complexity in the linear Q* setting under the assumption of a ball structure action space. Before proceeding with the details, we highlight two versions of our algorithm, Convex-BallRL and DiffR-BallRL, both of which are extensions of the standard ball structure assumption (Assumption 2). In the first case, we consider convex action sets that are identical across states. While every state necessarily has the same set of actions to take, the magnitude to which one can explore different directions is permitted to vary, so long as the overall action set is convex. This can be seen as a slightly more realistic version of the standard ball assumption, as in practice it may be difficult to guarantee that the magnitude of every feature direction is the same. In the second case, we consider the standard ball structure action set but allow the action set to vary depending on the current state. The motivation behind these two slightly different settings is to represent a more realistic generalization of the original ball structure presented earlier, as in practical settings action spaces may not always be uniformly a perfect ball.

### 4.1 Identical Convex Action Sets within One Step

In this section, we make the assumption that the action set A_h is identical for every state in step h, and moreover, we assume the action sets are regular convex sets, which is a generalization of the ball structure presented in Assumption 2. Intuitively, the action set is contained between a ball of smaller radius and a ball of larger radius. [Regular Convex Set] We call a set M ⊂ ℝ^d a regular convex set with parameter B if there exist ρ, η > 0 such that η/ρ ≤ B and

$$B_2(\rho)\subset M\subset B_2(\eta).$$

Regular convex sets include many different types of structures such as balls, cubes, ellipsoids, etc. Some specific examples include:

1. All balls are regular convex sets with parameter 1;

2. Cubes in dimension d are regular convex sets with parameter √d;

3. Ellipsoids are regular convex sets with parameter a/b, where a and b are the lengths of the longest and shortest axes.
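The cube example can be checked numerically: the cube [−c, c]^d contains the ball of radius c (touching the face centers) and is contained in the ball of radius c√d (through the corners), so its parameter is √d. A quick sketch with illustrative values:

```python
import numpy as np

# For the cube [-c, c]^d: inscribed radius = c, circumscribed radius = c * sqrt(d).
d, c = 3, 1.0
corner = np.full(d, c)                          # farthest boundary point from the origin
face_center = np.array([c] + [0.0] * (d - 1))   # closest boundary point from the origin
eta = np.linalg.norm(corner)                    # circumscribed ball radius: c * sqrt(d)
rho = np.linalg.norm(face_center)               # inscribed ball radius: c

print(eta / rho)  # -> sqrt(3) ≈ 1.732, the regular-convex-set parameter for d = 3
```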

Let us formally characterize our assumption for the setting with convex action sets.

###### Assumption 3 (Identical Convex Action Sets within One Step)

For every h, there exists a regular convex set A_h with parameter B, such that A(s) = A_h for all s ∈ S_h. Specifically, there exist ρ_h, η_h > 0 such that for every h we have

$$\frac{\eta_h}{\rho_h}\le B,\qquad B_2(\rho_h)\subset\mathcal A_h\subset B_2(\eta_h).$$

Without loss of generality, we also assume that the features still satisfy φ(s, a) = φ(s) + a.

We develop an algorithm, Convex-BallRL, that works in the trajectory learning setting (Definition 2) under Assumptions 1 and 3, and is guaranteed to find an ϵ-optimal policy using a polynomial number of trajectories.

#### 4.1.1 Intuition and key ideas

Before presenting the algorithm itself, we provide some intuition on the key ideas behind it. Without loss of generality, we assume that we know the values of ρ₁, …, ρ_H at the beginning. Otherwise, we can run one trajectory according to any policy; the action sets along it are then revealed to us, from which we can determine these values.

We start with the following equation, obtained by telescoping the Bellman equation (1):

$$\mathbb E\big[\langle\varphi(s_1),\theta^*_1\rangle\big]+\mathbb E\Big[\sum_{h=1}^H\langle a_h,\theta^*_h\rangle\Big]=\sum_{h=1}^H\mathbb E\big[R(s_h,a_h)\big]+\sum_{h=1}^H\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle. \tag{2}$$

Our next observation is that the first term on the left-hand side and the second term on the right-hand side of (2) are identical for every policy. Hence, if we compare (2) between two different policies, we can obtain information about θ*_h from the first term on the right-hand side, which can be estimated through sampled trajectories. Formally, we choose π₀ to be the all-zero policy:

$$\pi_0(s_h)=0\in\mathbb R^d,\qquad\forall\,1\le h\le H, \tag{3}$$

and π_{h,i} (for 1 ≤ h ≤ H and 1 ≤ i ≤ d) to be the following policy: for every step h′ and state s_{h′},

$$\pi_{h,i}(s_{h'})=\begin{cases}0 & \text{if } h'\ne h,\\ \rho_h e_i & \text{if } h'=h,\end{cases} \tag{4}$$

where e_i is the i-th standard basis vector in ℝ^d. Comparing (2) under policy π₀ and under policy π_{h,i}, we obtain that

$$\rho_h\langle e_i,\theta^*_h\rangle=\mathbb E_{\pi_{h,i}}\Big[\sum_{h=1}^H R(s_h,a_h)\Big]-\mathbb E_{\pi_0}\Big[\sum_{h=1}^H R(s_h,a_h)\Big].$$

The right-hand side can be estimated from trajectories sampled under policies π_{h,i} and π₀, which yields an estimate of the i-th component of θ*_h.
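This estimation step can be sketched in a few lines of Monte Carlo. Here the environment is replaced by a hypothetical noisy return generator with the structure of (2) (a fixed policy-independent baseline plus the probed term); all names and sizes are illustrative:

```python
import numpy as np

# Sketch: comparing average returns under the probing policy pi_{h,i}
# (play rho_h * e_i at step h, zero elsewhere) against the all-zero policy
# pi_0 isolates rho_h * <e_i, theta*_h>, as in the display above.
rng = np.random.default_rng(1)
d, H, M = 4, 5, 20000
rho = 0.5 + 0.5 * rng.random(H)                  # per-step radii rho_h
theta_star = rng.normal(size=(H, d)) / np.sqrt(d)
baseline = 3.0                                    # policy-independent part of the return

R0 = baseline + 0.1 * rng.normal(size=M)          # M noisy returns under pi_0
theta_hat = np.zeros((H, d))
for h in range(H):
    for i in range(d):
        # M noisy returns under pi_{h,i}; their mean exceeds pi_0's
        # mean return by exactly rho_h * theta*_{h,i}
        Rhi = baseline + rho[h] * theta_star[h, i] + 0.1 * rng.normal(size=M)
        theta_hat[h, i] = (Rhi.mean() - R0.mean()) / rho[h]

print(np.abs(theta_hat - theta_star).max())       # small estimation error
```

With M samples per probe, the per-component error shrinks at the usual 1/√M rate, scaled up by 1/ρ_h.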

Finally, after obtaining accurate enough estimates ^θ_h of θ*_h, we adopt the greedy policy, i.e.

$$\pi(s_h)=\operatorname*{argmax}_{a_h\in\mathcal A_h}\langle a_h,\hat\theta_h\rangle, \tag{5}$$

and then show that this policy is a nearly optimal policy.
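In the special case of a ball action set, the argmax in (5) has a closed form: the maximizer of ⟨a, ^θ_h⟩ over B₂(ρ) is ρ·^θ_h/‖^θ_h‖₂. A minimal sketch (function name is ours):

```python
import numpy as np

def greedy_action(theta_hat, rho):
    """Maximizer of <a, theta_hat> over the ball ||a||_2 <= rho."""
    n = np.linalg.norm(theta_hat)
    # Any action is optimal when theta_hat = 0; return the zero action.
    return np.zeros_like(theta_hat) if n == 0 else rho * theta_hat / n

a = greedy_action(np.array([3.0, 4.0]), rho=2.0)
print(a)  # -> [1.2 1.6]: norm 2, aligned with theta_hat
```

For general regular convex sets the argmax is instead a linear optimization over the set, which is still tractable for e.g. cubes and ellipsoids.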

#### 4.1.2 Algorithm and Sample Complexity

The pseudocode for BallRL with convex action sets is given in Algorithm 1. The main result of this section is the following theorem about its sample complexity; in particular, the complexity is polynomial. The complete proof details are provided in Appendix A. For any ϵ, δ > 0, if we choose

$$M=\frac{8H^2B^2d\log(2dH/\delta)}{\epsilon^2},$$

then with probability at least 1 − δ, the output policy of the above algorithm is an ϵ-optimal policy. The total number of trajectories used by this algorithm is

$$\frac{16H^3B^2d^2\log(2dH/\delta)}{\epsilon^2}.$$

If all action sets have ball structure, then they are regular convex sets with parameter B = 1. Hence the above algorithm is guaranteed to find an ϵ-optimal policy using Õ(H^3d^2/ϵ^2) trajectories.
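To get a feel for the constants, one can plug illustrative problem sizes into the trajectory bound above (the helper name and parameter values are ours, purely for illustration):

```python
import math

# Total trajectories: 16 H^3 B^2 d^2 log(2dH/delta) / eps^2.
def total_trajectories(H, B, d, eps, delta):
    return math.ceil(16 * H**3 * B**2 * d**2 * math.log(2 * d * H / delta) / eps**2)

n = total_trajectories(H=10, B=1, d=5, eps=0.1, delta=0.05)
print(n)  # polynomial in H and d, but with a large constant at eps = 0.1
```

Halving the target accuracy ϵ multiplies the count by four, reflecting the 1/ϵ² dependence.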

### 4.2 Ball Structure Action Sets with Differing Radii

In this section, we abandon the assumption that all states in step h share identical action sets, and allow the action set to vary depending on the state. However, we again assume that the action set corresponding to each state is a ball, as in Assumption 2. We further assume that the norms of θ*₁, …, θ*_H are all the same, and that the features, rewards, and radii are bounded:

###### Assumption 4 (Boundedness)

For each state s and action a, we have

$$\|\varphi(s,a)\|_2\le 1;$$

for some Θ > 0, we have

$$\|\theta^*_1\|_2=\cdots=\|\theta^*_H\|_2=\Theta;$$

and for every trajectory (s₁, a₁, …, s_H, a_H), we have

$$0\le\sum_{h=1}^H R(s_h,a_h)\le 1,\qquad 0\le\sum_{h=1}^H\rho(s_h)\le 1.$$

We again aim to develop an algorithm that works under Definition 2 (trajectory learning), but now under Assumptions 1 (linearity), 2 (ball structure), and 4 (boundedness).

#### 4.2.1 Intuition and key ideas

We begin by presenting the following key ideas of our algorithm:

##### Exploiting the Ball Structure Action Space

Similar to (2) in Convex-BallRL, our algorithm is again based on telescoping the Bellman equation (1), which exploits the ball structure of the action space:

$$\langle\varphi(s_1),\theta^*_1\rangle+\mathbb E_\pi\Big[\sum_{h=1}^H\langle a_h,\theta^*_h\rangle\Big]=\mathbb E_\pi\Big[\sum_{h=1}^H R(s_h,a_h)\Big]+\Theta\cdot\mathbb E_\pi\Big[\sum_{h=1}^H\rho(s_{h+1})\Big]. \tag{6}$$
##### Estimation of Norm by Grid Search

According to (6), we can estimate θ*_h on the left-hand side from the right-hand side. However, Θ, the norm of the unknown parameters θ*_h, is itself difficult to estimate. Hence our algorithm adopts a grid search over the value of Θ: we choose candidate values on a grid of resolution η, so that at least one candidate is η-close to the true Θ. Therefore, if we develop a policy based on each candidate, then at least one of these policies will necessarily be an ϵ-optimal policy.
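The grid-search idea can be sketched in a few lines: with candidates spaced η apart on (0, 1], one of them is guaranteed to be η-close to the true norm (assuming Θ ≤ 1 here; names and values are illustrative):

```python
import numpy as np

# Candidate grid for the unknown norm Theta, at resolution eta.
eta = 0.05
L = int(np.ceil(1 / eta))
candidates = eta * np.arange(1, L + 1)     # eta, 2*eta, ..., ~1

Theta_true = 0.6180                        # hypothetical true norm
best = candidates[np.argmin(np.abs(candidates - Theta_true))]
print(best)                                # within eta of Theta_true
```

The algorithm then runs its exploration-plus-greedy routine once per candidate and keeps the best-performing resulting policy.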

##### Hierarchical Exploration

For the exploration in our algorithm, we will choose actions to be ρ(s_h)e_i for 1 ≤ i ≤ d, in order to obtain information about the i-th component of θ*_h. However, one problem is that the accuracy of this estimation degrades with the expected radius at step h, and the error will explode as this expected radius goes to zero. To deal with this problem, we consider a hierarchical exploration method:

Suppose the policy we currently use for exploration is π_exp, and the greedy policy we calculate from it is π. Let ρ̄_h denote the expected radius at step h under π_exp. We can show that the exploration guarantees a fixed accuracy on ρ̄_h‖^θ_h − θ*_h‖₂ (up to logarithmic factors), and hence the error contributed by step h scales with E_π[ρ(s_h)]/ρ̄_h. Therefore, if for every h we have E_π[ρ(s_h)] ≤ 2ρ̄_h, then the error of the greedy policy can be bounded by choosing a proper grid resolution η. Otherwise, we use the greedy policy π to construct a new exploration policy as follows:

$$\pi_{h,0}(s_{h'})=\begin{cases}\pi(s_{h'}) & \text{if } 1\le h'<h,\\ 0 & \text{if } h'\ge h.\end{cases}$$

These new policies guarantee that the expected radius ρ̄_h at step h becomes at least twice its previous value. Therefore, this process ends after at most log₂(1/ε) rounds, provided that the initial value of ρ̄_h is at least ε.

We will also show that if, under a policy π, the expected radius E_π[ρ(s_h)] at step h is smaller than ε, then the effect of different actions within this step can be ignored, and we do not need to carry out the above exploration at this step.

#### 4.2.2 Algorithm and Sample Complexity

Combining these ideas, we construct Algorithm 2.

Finally, we arrive at our main result: DiffR-BallRL is sample-efficient. The proof details are provided in Appendix B. For any ϵ, δ > 0, with the choice

$$\varepsilon=\frac{\epsilon}{8H},\qquad\delta'=\frac{\delta}{(d+3HL)(1+H\log_2(1/\varepsilon))},\qquad\eta=\frac{\epsilon}{8Hd},$$
$$M_2=2\log(1/\delta')\cdot\frac{16(2+4H+2Hd)^2}{\epsilon^2},\qquad M_1=2\log(1/\delta')\cdot\frac{256H^2d^2}{\epsilon^2},\qquad L=\frac{1}{\eta},$$

Algorithm 2 will output an ϵ-optimal policy with probability at least 1 − δ. This algorithm will use at most

$$\tilde O\!\left(\frac{H^5d^3}{\epsilon^3}\right)$$

number of trajectories.

##### Proof Sketch of Theorem 4.2.2.

The first step of the proof is to use the Bellman equation to prove (6):

$$\langle\varphi(s_1),\theta^*_1\rangle+\mathbb E_\pi\Big[\sum_{h=1}^H\langle a_h,\theta^*_h\rangle\Big]=\mathbb E_\pi\Big[\sum_{h=1}^H R(s_h,a_h)\Big]+\Theta\cdot\mathbb E_\pi\Big[\sum_{h=1}^H\rho(s_{h+1})\Big],$$

which can be obtained by telescoping the following closed form of the Bellman equation at step h:

$$\mathbb E_\pi\big[\langle\varphi(s_h),\theta^*_h\rangle+\langle a_h,\theta^*_h\rangle\big]=\mathbb E_\pi\big[R(s_h,a_h)+\rho(s_{h+1})\cdot\|\theta^*_{h+1}\|_2+\langle\varphi(s_{h+1}),\theta^*_{h+1}\rangle\big].$$

Our second step is a result bounding the value function error of the greedy policies:

$$\mathbb E[V^*_1(s_1)]-\mathbb E[V^\pi_1(s_1)]\le 2\sum_{h=1}^H\mathbb E_\pi[\rho(s_h)]\cdot\big\|\hat\theta_h-\theta^*_h\big\|_2.$$

Therefore, if E_π[ρ(s_h)] is small for some h (say, less than ε), then we can ignore this term, since it will never make a big difference in the error. In the following, we assume that E_π[ρ(s_h)] ≥ ε for every h.

Next, we observe that there exists some candidate index l₀ such that Θ_{l₀} is η-close to the true Θ. For the iteration corresponding to l₀, Hoeffding’s inequality gives

$$\bar\rho_h\big\|\theta^*_h-\hat\theta^{l_0}_h\big\|_2\le\eta d+2d\sqrt{\frac{2\log(1/\delta')}{M_1}}+d\sqrt{\frac{2\log(1/\delta')}{M_2}}$$

with high probability, where ρ̄_h is the expectation of ρ(s_h) under the exploration policy.

Finally, if the greedy policy π satisfies E_π[ρ(s_h)] ≤ 2ρ̄_h for every h, then the above inequality guarantees that this greedy policy is near-optimal. Otherwise, the value of ρ̄_h will at least double according to our algorithm, and this process terminates within log₂(1/ε) iterations, since by our assumption the initial value of ρ̄_h is at least ε, and ρ̄_h cannot be larger than 1.

Combining these steps, we can show that the algorithm ends after a bounded number of iterations and, when it ends, outputs a near-optimal policy with high probability.

## 5 Conclusion

We presented the BallRL reinforcement learning algorithm, which provides sample-efficient learning guarantees when the optimal action-value function is linear and the actions exhibit a ball structure. We further generalized the ball structure to both convex action sets and ball radii that vary between states. Our techniques demonstrate that there is hope for efficient learning in linear RL when actions can sufficiently explore the feature space. The ball structure assumption itself is a sufficient, but not necessary, condition to ensure full exploration of the feature space. We believe that the idea of the action set allowing for sufficient exploration can be achieved (perhaps approximately) in many practical settings.

An interesting research direction is to dive deeper into the case where the actions lie in convex sets instead of a pure ball. While the problem can be solved when the action set is identical across all states within a step, it remains to be shown whether it can be solved when the convex sets vary between states. Additionally, with differing radii, our algorithm is polynomially efficient when the unknown parameters across the horizon share the same norm. It would be valuable to see whether this assumption can be removed and the parameters allowed to have different norms.

We thank Philip Amortila for helpful discussions.

## Appendix A Proof of Theorem 4.1.2

[Proof of Theorem 4.1.2]

First of all, according to Bellman Equation (1), we have

$$
\begin{aligned}
\langle\varphi(s_h),\theta^*_h\rangle+\langle a_h,\theta^*_h\rangle=Q^*_h(s_h,a_h)&=r(s_h,a_h)+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\,V^*_{h+1}(s_{h+1})\\
&=r(s_h,a_h)+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\max_{a_{h+1}\in\mathcal A_{h+1}}\langle\varphi(s_{h+1})+a_{h+1},\theta^*_{h+1}\rangle\\
&=r(s_h,a_h)+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\,\langle\varphi(s_{h+1}),\theta^*_{h+1}\rangle+\sum_{s_{h+1}}P(s_{h+1}\mid s_h,a_h)\,\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle\\
&=\mathbb E\Big[R(s_h,a_h)+\langle\varphi(s_{h+1}),\theta^*_{h+1}\rangle+\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle\,\Big|\,s_h,a_h\Big],
\end{aligned}
$$

where R(s_h, a_h) is the instant reward obtained after choosing action a_h at state s_h (it has mean r(s_h, a_h)). Hence, for a given fixed policy π, if a trajectory following this policy is (s₁, a₁, …, s_H, a_H), then we have

$$\mathbb E_\pi[\langle\varphi(s_h),\theta^*_h\rangle]+\mathbb E_\pi[\langle a_h,\theta^*_h\rangle]=\mathbb E_\pi[\langle\varphi(s_{h+1}),\theta^*_{h+1}\rangle]+\mathbb E_\pi[R(s_h,a_h)]+\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle.$$

Summing this up from h = 1 to H and noticing that the term at step H + 1 vanishes, we obtain that

$$\mathbb E_\pi[\langle\varphi(s_1),\theta^*_1\rangle]+\mathbb E_\pi\Big[\sum_{h=1}^H\langle a_h,\theta^*_h\rangle\Big]=\sum_{h=1}^H\mathbb E_\pi[R(s_h,a_h)]+\sum_{h=1}^H\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle.$$

With our choice of π₀ (the policy which chooses action 0 at every state and step), we obtain

$$\mathbb E_{\pi_0}[\langle\varphi(s_1),\theta^*_1\rangle]=\sum_{h=1}^H\mathbb E_{\pi_0}[R(s_h,a_h)]+\sum_{h=1}^H\rho_{h+1}\max_{a_{h+1}\in\mathcal A_{h+1}}\langle a_{h+1},\theta^*_{h+1}\rangle.$$

We notice that E_π[⟨φ(s₁), θ*₁⟩] is identical for every policy π, since s₁ is drawn from the fixed initial distribution. Hence, subtracting the above two equations, we get

$$\mathbb E_\pi\Big[\sum_{h=1}^H\langle a_h,\theta^*_h\rangle\Big]=\mathbb E_\pi\Big[\sum_{h=1}^H R(s_h,a_h)\Big]-\mathbb E_{\pi_0}\Big[\sum_{h=1}^H R(s_h,a_h)\Big]. \tag{8}$$

With our choice of policy π_{h,i}, the above equation indicates that

$$\rho_h\theta^*_{h,i}=\rho_h\langle e_i,\theta^*_h\rangle=\mathbb E_{\pi_{h,i}}\Big[\sum_{h=1}^H R(s_h,a_h)\Big]-\mathbb E_{\pi_0}\Big[\sum_{h=1}^H R(s_h,a_h)\Big],$$

where θ*_{h,i} is the i-th component of θ*_h. In our algorithm, ^θ_{h,i} is the corresponding empirical estimate of this quantity, and writing R̄_{h,i} and R̄₀ for the empirical average returns under π_{h,i} and π₀, we also have

$$\mathbb E\big[\bar R_{h,i}-\bar R_0\big]=\mathbb E_{\pi_{h,i}}\Big[\sum_{h=1}^H R(s_h,a_h)\Big]-\mathbb E_{\pi_0}\Big[\sum_{h=1}^H R(s_h,a_h)\Big]=\rho_h\theta^*_{h,i}.$$

Therefore, according to Hoeffding’s inequality, with probability at least 1 − δ′ we have

$$\rho_h\big|\hat\theta_{h,i}-\theta^*_{h,i}\big|\le 2\sqrt{\frac{\log(1/\delta')}{2M}}=\sqrt{\frac{2\log(1/\delta')}{M}},$$

where we use that the total reward of each trajectory always lies in [0, 1]. A union bound over all h and i then indicates that with high probability, for every h,

$$\rho_h\big\|\hat\theta_h-\theta^*_h\big\|_2\le\sqrt{\frac{2d\log(1/\delta')}{M}}.$$

Moreover, writing a_h = π(s_h) for the greedy action and a*_h = argmax_{a∈A_h}⟨a, θ*_h⟩ for the optimal action, we have

$$
\begin{aligned}
\langle a^*_h,\theta^*_h\rangle-\langle a_h,\theta^*_h\rangle&=\langle a^*_h,\theta^*_h-\hat\theta_h\rangle+\langle a^*_h,\hat\theta_h\rangle-\langle a_h,\hat\theta_h\rangle+\langle a_h,\hat\theta_h-\theta^*_h\rangle\\
&\le\langle a^*_h,\theta^*_h-\hat\theta_h\rangle+\langle a_h,\hat\theta_h-\theta^*_h\rangle\\
&\le 2\eta_h\cdot\big\|\theta^*_h-\hat\theta_h\big\|_2\\
&\le 2B\cdot\sqrt{\frac{2d\log(1/\delta')}{M}},
\end{aligned}
$$

where the first inequality is due to ⟨a_h, ^θ_h⟩ ≥ ⟨a*_h, ^θ_h⟩ (since a_h is greedy with respect to ^θ_h), the second is due to ‖a_h‖₂, ‖a*_h‖₂ ≤ η_h, and the last uses η_h ≤ Bρ_h together with the previous display. Therefore, according to (8) we have

$$V^*(s_1)-V^\pi(s_1)\le 2BH\sqrt{\frac{2d\log(1/\delta')}{M}}.$$

With our choice of M and δ′, we conclude that with probability at least 1 − δ, the output policy π satisfies

$$V^*(s_1)-V^\pi(s_1)\le\epsilon.$$
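As a sanity check, this last step can be verified numerically: with M = 8H²B²d·log(2dH/δ)/ϵ² and δ′ = δ/(2dH), the bound 2BH√(2d·log(1/δ′)/M) collapses to exactly ϵ (the parameter values below are hypothetical):

```python
import math

# Verify: 2BH * sqrt(2d log(1/delta') / M) == eps for the stated choices.
H, B, d, eps, delta = 7, 2.0, 5, 0.1, 0.05
delta_p = delta / (2 * d * H)                 # delta' = delta / (2dH)
M = 8 * H**2 * B**2 * d * math.log(2 * d * H / delta) / eps**2
bound = 2 * B * H * math.sqrt(2 * d * math.log(1 / delta_p) / M)
print(bound)  # equals eps up to floating-point error
```

The logarithms cancel because log(1/δ′) = log(2dH/δ), leaving 2BH·√(ϵ²/(4H²B²)) = ϵ.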