# Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime dγ² > 1, where d is the dimension of the feature vector and γ is the discount rate. In this regime, for any q ∈ [γ², 1], we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is q/d, and it requires Ω(d/(γ²(q − γ²)ε²) · exp(Θ(dγ²))) samples to approximate the value function up to an additive error ε. Note that this lower bound on the sample complexity is exponential in d. If q = γ², even infinite data do not suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most O(max{(‖θ^π‖₂⁴/ε⁴) log(d/δ), (1/ε²)(d + log(1/δ))}) samples (θ^π is the parameter of the policy in linear function approximation) and guarantees approximation of the value function up to an additive error of ε with probability at least 1 − δ.

## 1 Introduction

In offline reinforcement learning (also known as batch reinforcement learning) [15, 2, 9], we are interested in evaluating a strategy and making sequential decisions when the algorithm has access to a batch of offline data (for example, watching StarCraft game videos and reading click logs of users of Amazon) rather than interacting directly with the environment, which is modeled by a Markov decision process (MDP). Research on offline reinforcement learning has gained increasing interest for the following reasons. First, exploration can be expensive and even risky. For example, while a robot explores the environment, in addition to the time and economic costs, it can damage its own hardware and objects around it, and even hurt people. Second, we can use offline reinforcement learning to pre-train an agent efficiently using existing data and to evaluate the exploitation performance of an algorithm.

To handle large-scale and even continuous state spaces, researchers introduced function approximation to approximate the values of states and state-action pairs [19, 10, 22, 1, 26]. Linear function approximation assumes that every state-action pair is assigned a (hand-crafted or learned) feature vector and that the value function is the inner product of the feature vector and an unknown parameter that depends on the policy [5, 16, 17, 24, 11]. [13, 23] considered online and offline episodic finite-horizon reinforcement learning with linear function approximation, respectively. Our work considers infinite-horizon offline reinforcement learning with linear function approximation. We investigate the sample complexity of approximating the value function up to an additive error bound under a given policy (this problem is also known as off-policy evaluation). Our results consist of a lower bound and an upper bound. Throughout this paper, let d denote the dimension of the feature vector and γ the discount rate.

##### Lower Bound

Recall that the assumption of linear function approximation means that the value function is linear in the unknown policy-specific parameter θ^π. For the feature vectors of the state-action pairs in the dataset, we call their covariance matrix the feature covariance matrix. We identify a hard regime dγ² > 1. In this regime, inspired by [23, 3], we construct a hard instance whose value function satisfies the assumption of linear function approximation and whose feature covariance matrix is as well-conditioned as possible. To be precise, for any q ∈ [γ², 1], we can construct a hard instance whose feature covariance matrix has smallest eigenvalue q/d. To approximate the value of a state in this instance up to an additive error ε, with high probability we need Ω(d/(γ²(q − γ²)ε²) · exp(Θ(dγ²))) samples. We see that the sample complexity depends exponentially on d and suffers from the curse of dimensionality. In fact, q = 1 represents the best-conditioned feature covariance matrix because the smallest eigenvalue has a 1/d upper bound. If one chooses q = γ², even infinite data cannot guarantee a good approximation and we recover the result of [3]. We would like to remark that the result of [3] is a special case of ours. The smallest eigenvalue is γ²/d in the construction of [3]. We can make it q/d for any q ∈ (γ², 1] in our construction, at the cost of degrading the sample complexity lower bound from infinity to exponential in d. This agrees with our intuition that a problem with a better-conditioned feature covariance matrix (which indicates better feature coverage) is easier to solve. In addition, our result fills the gap from q = γ² to the best possible condition q = 1.

##### Upper Bound

Under the low distribution shift assumption, we show that the Least-Squares Policy Evaluation (LSPE) algorithm needs at most O(max{(‖θ^π‖₂⁴/ε⁴) log(d/δ), (1/ε²)(d + log(1/δ))}) samples (θ^π is the parameter of the policy in linear function approximation) and guarantees approximation of the value function up to an additive error of ε with probability at least 1 − δ. If we additionally bound ‖θ^π‖₂ as in [13, 23], the sample complexity bound simplifies accordingly. In addition, we show that our hard instance does not satisfy the low distribution shift assumption and therefore the upper bound does not contradict the lower bound.

##### Paper Organization

The rest of the paper is organized as follows. Section 2 presents related work. We introduce notation and preliminaries in Section 3. We show the lower bound in Section 4 and upper bound in Section 5. Section 6 concludes the paper.

## 2 Related Work

There is a large body of work on policy evaluation in offline reinforcement learning (also known as off-policy evaluation) with function approximation [8, 25, 27, 14, 6, 7, 28, 23, 3, 21]. The seminal work [6] studied offline infinite-horizon reinforcement learning whose value function is approximated by a finite function class. Assuming both low distribution shift and policy completeness, they showed an upper bound on the sample complexity. The upper bound depends polynomially on the effective horizon and the accuracy parameter (in their paper, the error bound is multiplicative), logarithmically on the size of the function class, and linearly on the concentratability coefficient that quantifies distribution shift. If low distribution shift is not assumed, they showed a lower bound that excludes polynomial sample complexity if the MDP dynamics are unrestricted. [7] studied episodic finite-horizon off-policy evaluation with linear function approximation. They assumed that the function class is closed under the conditional transition operator and that the data consist of i.i.d. episode samples, each being a trajectory generated by some policy. Under these two assumptions, they determined the minimax-optimal error of evaluating a policy.

The papers closest to ours are probably [28, 23, 3]. [23] studied offline episodic finite-horizon reinforcement learning with linear function approximation. They proved a sample complexity lower bound that is exponential in the planning horizon H in order to achieve a constant additive approximation error with high probability. In their hard instance, the feature covariance matrix is well-conditioned. Under the low distribution shift assumption, they proved an upper bound on the sample complexity. In particular, they showed that the squared additive error decays with the number of samples N, up to positive constants coming from their low distribution shift assumption. However, this additional assumption does not exclude their hard instance. It is possible that these constants are themselves large, in which case their upper bound can still be exponential in H. [3] considered the same problem as in this paper. They presented a hard instance such that the smallest eigenvalue of the feature covariance matrix is γ²/d and any algorithm must have a constant additive approximation error, even with infinite data. We have compared our work to [3] in Section 1. [28] investigated a different setting where data are obtained via policy-free queries and policy-induced queries. [28] did not consider the condition number of the feature covariance matrix.

## 3 Preliminaries

We use the shorthand notation [n] ≜ {1, …, n}. If X is a set, write Unif(X) for the uniform distribution on X. If A is a matrix, write ‖A‖ for its spectral norm, which equals its largest singular value. If v is a vector, ‖v‖ agrees with its Euclidean norm. For two square matrices A and B of the same size, we write A ⪯ B if B − A is a positive semidefinite matrix.

##### Infinite-Horizon Reinforcement Learning

We consider the infinite-horizon Markov decision process (MDP) [18]. It is defined by the tuple (S, A, P, R, γ), where S is the set of states, A is the set of actions that an agent can choose and play, P(· | s, a) and R(s, a) are respectively the transition distribution and the reward distribution given a state-action pair (s, a), and γ ∈ (0, 1) is the discount factor. We assume that the reward takes values from [−1, 1]. We will also denote this random variable by R(s, a), i.e., r ∼ R(s, a) (we overload the notation R). A policy π(· | s) is a probability distribution on A given a state s. If π is deterministic, we will abuse the notation and write π(s) = a if π(· | s) is a delta distribution at a. Given a policy π as well as an initial state s₀, it induces a random trajectory s₀, a₀, r₀, s₁, a₁, r₁, …, where aᵢ ∼ π(· | sᵢ), rᵢ ∼ R(sᵢ, aᵢ) and sᵢ₊₁ ∼ P(· | sᵢ, aᵢ). The value of a state and the Q-function of a state-action pair are given by

$$V^\pi(s)=\mathbb{E}\Big[\sum_{i\ge 0}\gamma^i r_i\ \Big|\ s_0=s\Big],\qquad Q^\pi(s,a)=\mathbb{E}\Big[\sum_{i\ge 0}\gamma^i r_i\ \Big|\ s_0=s,\ a_0=a\Big].$$

Since we assume that the absolute value of rewards is at most 1, we have |V^π(s)| ≤ 1/(1−γ) and |Q^π(s,a)| ≤ 1/(1−γ).
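The bound |V^π(s)| ≤ 1/(1−γ) follows from the geometric series. A minimal sketch of the discounted return, truncating the infinite sum at a finite horizon T (an approximation introduced here for illustration):

```python
import random

def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence: sum_i gamma^i * r_i."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With rewards in [-1, 1], any length-T prefix of the return is bounded
# in absolute value by (1 - gamma^T) / (1 - gamma) <= 1 / (1 - gamma).
gamma, T = 0.9, 200
rewards = [random.uniform(-1.0, 1.0) for _ in range(T)]
assert abs(discounted_return(rewards, gamma)) <= 1.0 / (1.0 - gamma)
```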

##### Linear Function Approximation

The following Assumption 1 assumes that the Q-function is the inner product of the feature vector ϕ(s, a) of a state-action pair and an unknown policy-specific parameter θ^π. This assumption was also made in [16, 3]. Although it was not directly assumed in [13], their linear MDP assumption (Assumption A) implies our Assumption 1 (see Proposition 2.3 in [13]), stated there in the context of episodic finite-horizon reinforcement learning.

###### Assumption 1 ([16, 13]).

For every state-action pair (s, a) and every policy π, there is a feature vector ϕ(s, a) ∈ ℝ^d and a parameter θ^π ∈ ℝ^d such that

$$Q^\pi(s,a)=\phi(s,a)^\top\theta^\pi.$$
###### Assumption 2 ([23]).

Since rescaling the feature vectors can be absorbed into the parameter, without loss of generality we assume ‖ϕ(s, a)‖₂ ≤ 1 for every (s, a).

In fact, if B ≜ sup_{s,a} ‖ϕ(s, a)‖₂ > 1, we can use the normalized feature vectors ϕ(s, a)/B, and the new parameter for the policy becomes Bθ^π, where θ^π is the original policy parameter.
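A two-line check of this rescaling (the feature and parameter values below are made up for illustration): dividing every feature vector by the largest norm B and multiplying the parameter by B leaves the inner product, and hence the Q-value, unchanged.

```python
import math

phi = [3.0, 4.0]          # hypothetical unnormalized feature, ||phi||_2 = 5
theta = [0.2, -0.1]       # hypothetical original policy parameter
B = math.hypot(*phi)      # here B is the largest feature norm

q = sum(p * t for p, t in zip(phi, theta))

# Normalized features phi / B with rescaled parameter B * theta
# represent the same Q-value.
phi_n = [p / B for p in phi]
theta_n = [B * t for t in theta]
q_n = sum(p * t for p, t in zip(phi_n, theta_n))

assert abs(q - q_n) < 1e-9
assert math.hypot(*phi_n) <= 1.0 + 1e-12
```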

##### Offline Reinforcement Learning

In offline reinforcement learning, rather than interacting with the MDP directly, the agent has access to a batch of samples {(sᵢ, aᵢ, rᵢ, s̄ᵢ)}ᵢ∈[N], where (sᵢ, aᵢ) are i.i.d. samples from a distribution μ on S × A, rᵢ ∼ R(sᵢ, aᵢ), and s̄ᵢ ∼ P(· | sᵢ, aᵢ). Given a policy π, we are interested in evaluating the value of a state under this policy approximately, using the samples from μ. If our problem satisfies Assumption 1, the feature covariance matrix of [23, 3] is defined by

$$\Lambda\triangleq\mathbb{E}_{(s,a)\sim\mu}\big[\phi(s,a)\phi(s,a)^\top\big].$$

We require that the feature covariance matrix be well-conditioned (the smallest eigenvalue of Λ is lower bounded), which indicates that μ has good feature coverage. In our hard instance to be presented in Section 4, the smallest eigenvalue satisfies λ_min(Λ) = q/d, where q can be any value in [γ², 1]. Note that under Assumption 2, λ_min(Λ) is at most 1/d. To see this, we compute the trace tr(Λ) = E_{(s,a)∼μ}‖ϕ(s, a)‖₂² ≤ 1. Since tr(Λ) ≥ d·λ_min(Λ), we get λ_min(Λ) ≤ 1/d. In other words, q = 1 is the best possible condition.
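The trace argument is easy to verify empirically. The sampling distribution below is an arbitrary stand-in satisfying Assumption 2 (unit-ball features), not a construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 10_000

# Draw feature vectors with norm at most 1 (Assumption 2):
# random directions scaled by random radii in [0, 1].
dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
phis = dirs * rng.uniform(0.0, 1.0, size=(n, 1))

# Empirical feature covariance matrix Lambda = E[phi phi^T].
Lam = phis.T @ phis / n

# tr(Lambda) = E||phi||^2 <= 1 and tr(Lambda) >= d * lambda_min,
# so the smallest eigenvalue is at most 1/d.
assert np.trace(Lam) <= 1.0 + 1e-9
assert np.linalg.eigvalsh(Lam).min() <= 1.0 / d + 1e-9
```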

## 4 Lower Bound

In this section, we present our lower bound on the sample complexity of infinite-horizon offline reinforcement learning with linear function approximation. Recall that d is the dimension in linear function approximation and γ is the discount rate. Inspired by [3, 23], we can construct a hard instance provided that dγ² > 1. In the assumption of our lower bound theorem below (Theorem 1), we require that d be a multiple of ⌈b/γ²⌉ for some constant b > 1. If dγ² > 1, there exists such a b so that the theorem applies. Then Theorem 1 gives an at least exponential, and potentially infinite, lower bound on the sample complexity, depending on the condition number (the smallest eigenvalue of the feature covariance matrix, i.e., q/d) that we would like to achieve. In other words, we suffer from the curse of dimensionality. Therefore, we can say that the regime where dγ² > 1 is a hard regime.

###### Theorem 1.

Let I_d denote the set of all infinite-horizon MDPs that satisfy Assumption 1 and Assumption 2, whose feature vectors have dimension d, and whose rewards lie in [−1, 1]. Let M_{q/d}(S, A) denote the set of all probability measures μ on S × A such that the feature covariance matrix has smallest eigenvalue at least q/d. Fix a constant b > 1 and a failure probability δ ∈ (0, 1/4). For any dimension d which is a multiple of ⌈b/γ²⌉ (thus d_{b,γ} ≜ d/⌈b/γ²⌉ is a positive integer), if q ∈ [γ², 1], we have

$$\sup_{\substack{(S,A,P,R,\gamma)\in I_d\\ s\in S,\ \mu\in M_{q/d}(S,A)}}\ \inf_{\hat V,\,\pi}\ \big|\hat V-V^\pi(s)\big|\ \ge\ \Omega\left(\sqrt{\frac{1+\gamma}{(q-\gamma^2)(1-\gamma)\gamma^2 N}\cdot d_{b,\gamma}\,b^{d_{b,\gamma}}\,\ln\frac{1}{8\delta(1-2\delta)}}\right) \tag{1}$$

with probability at least δ, where N is the number of samples from μ and V̂ is a real-valued function with the samples as input.

###### Remark 1 (Sample complexity).

In the proof of Theorem 1, we present a hard instance with only one action such that the smallest eigenvalue of the feature covariance matrix is q/d. For this instance, any algorithm requires

$$\Omega\left(\frac{1+\gamma}{(q-\gamma^2)(1-\gamma)\gamma^2\varepsilon^2}\cdot d_{b,\gamma}\,b^{d_{b,\gamma}}\,\ln\frac{1}{8\delta(1-2\delta)}\right)$$

samples in order to approximate the value of a state up to an additive error ε with probability at least 1 − δ. This lower bound on the sample complexity follows directly from Equation 1.

###### Remark 2.

Our result subsumes [3] as a special case. Recall that the smallest eigenvalue of the feature covariance matrix is at most 1/d. Therefore, the parameter q is at most 1. If q = γ², no algorithm can approximate the value of a state up to a constant additive error even provided with an arbitrarily large dataset. In this case, the smallest eigenvalue of the feature covariance matrix is γ²/d. We recover the impossibility result of [3].

###### Remark 3.

If q = 1 (the best possible condition) and the remaining conditions of Theorem 1 hold, we have

$$\sup_{\substack{(S,A,P,R,\gamma)\in I_d\\ s\in S,\ \mu\in M_{1/d}(S,A)}}\ \inf_{\hat V,\,\pi}\ \big|\hat V-V^\pi(s)\big|\ \ge\ \Omega\left(\frac{1}{\gamma(1-\gamma)}\sqrt{\frac{1}{N}\,d_{b,\gamma}\,b^{d_{b,\gamma}}\,\ln\frac{1}{8\delta(1-2\delta)}}\right).$$

The sample complexity lower bound becomes

$$\Omega\left(\frac{1}{\gamma^2(1-\gamma)^2\varepsilon^2}\,d_{b,\gamma}\,b^{d_{b,\gamma}}\,\ln\frac{1}{8\delta(1-2\delta)}\right).$$
###### Proof.

Fix integers m and L with mγ² > 1, and let d = mL. We will set r₀ to either ε/(γ^{L−1}m^{L/2}) or −ε/(γ^{L−1}m^{L/2}). Our hard instance has three groups of states. Each state has one single action. Therefore, we omit the action in Q(s, a) and ϕ(s, a) and write Q(s) and ϕ(s), respectively (in this case, Q(s) is the value of state s). All transitions are deterministic. Group A contains states s′_{l,i} for i ∈ [m] and 0 ≤ l ≤ L − 1. Group B contains states s_{l,i} for i ∈ [m] and 0 ≤ l ≤ L − 1. Group C contains states s_{l,0} for 0 ≤ l ≤ L − 1. The total number of states in all three groups is 2mL + L. Every state in group A transitions to the corresponding state in group B. All states in groups B and C on level l ≥ 1 transition to state s_{l−1,0}. All states in groups B and C on level 0 have a self-loop and transition to themselves. All states in group A have zero reward. All states in group B on level l ≥ 1 have zero reward and those in group C on level l ≥ 1 have reward R(s_{l,0}) = r₀(√m γ)^l(√m − 1). Moreover, define the reward of the state in group C on level 0 to be R(s_{0,0}) = r₀√m(1 − γ). The reward of the states in group B on level 0 is a random variable taking values from {−1, 1}:

$$R(s_{0,i})=\begin{cases}1 & \text{with probability } \dfrac{1+r_0(1-\gamma)}{2},\\[4pt] -1 & \text{with probability } \dfrac{1-r_0(1-\gamma)}{2}.\end{cases}$$

We illustrate our hard instance in Figure 1. We set the distribution μ to the mixture of the uniform distributions on group B and group A, i.e., μ = p · Unif(G_B) + (1 − p) · Unif(G_A), where p = (q − γ²)/(1 − γ²).

First, we check that all rewards lie in [−1, 1]. Recall that all states in group A have zero reward. In group B, the reward of s_{l,i} (i ∈ [m], l ≥ 1) is zero and the reward of s_{0,i} (i ∈ [m]) is either 1 or −1. In group C, if r₀ = 0, all rewards are zero. Otherwise, recalling |r₀| ≤ 1/(γ^{L−1}m^{L/2}) and considering without loss of generality r₀ > 0, we have

$$R(s_{0,0})=r_0\sqrt m(1-\gamma)\le \frac{\sqrt m(1-\gamma)}{\gamma^{L-1}m^{L/2}}=\frac{1-\gamma}{(m\gamma^2)^{(L-1)/2}}\le 1$$

and

$$R(s_{l,0})=r_0(\sqrt m\gamma)^l(\sqrt m-1)\le R(s_{L-1,0})\le \frac{(\sqrt m\gamma)^{L-1}}{\gamma^{L-1}m^{L/2}}(\sqrt m-1)\le 1.$$

The second step is to compute the value of each state. We will show Q(s_{l,0}) = r₀(√mγ)^l√m for 0 ≤ l ≤ L − 1 by induction. It holds for l = 0 because Q(s_{0,0}) = R(s_{0,0})/(1 − γ) = r₀√m. Assume that it holds for some l ≤ L − 2. We have

$$Q(s_{l+1,0})=R(s_{l+1,0})+\gamma Q(s_{l,0})=r_0(\sqrt m\gamma)^{l+1}(\sqrt m-1)+\gamma r_0(\sqrt m\gamma)^l\sqrt m=r_0(\sqrt m\gamma)^{l+1}\sqrt m.$$

Then for i ∈ [m] and 1 ≤ l ≤ L − 1, we have

$$Q(s_{l,i})=\gamma Q(s_{l-1,0})=\gamma r_0(\sqrt m\gamma)^{l-1}\sqrt m=r_0(\sqrt m\gamma)^l.$$

Finally, for i ∈ [m], we obtain Q(s_{0,i}) = E[R(s_{0,i})]/(1 − γ) = r₀. In group A, we have Q(s′_{l,i}) = γQ(s_{l,i}) = γr₀(√mγ)^l.

Let {e_{l,i} : 0 ≤ l ≤ L − 1, i ∈ [m]} be the standard basis vectors of ℝ^d, where d = mL. Recall Q(s_{l,i}) = r₀(√mγ)^l for i ∈ [m] and Q(s_{l,0}) = r₀(√mγ)^l√m. Define ϕ(s_{l,i}) = e_{l,i} and ϕ(s′_{l,i}) = γe_{l,i} for i ∈ [m], ϕ(s_{l,0}) = (1/√m)∑_{i∈[m]} e_{l,i}, and

$$\theta^\pi=\sum_{i\in[m]}\sum_{l=0}^{L-1}r_0(\sqrt m\gamma)^l e_{l,i}.$$

For i ∈ [m] and 0 ≤ l ≤ L − 1, we have

$$\begin{aligned}\phi(s_{l,i})^\top\theta^\pi&=r_0(\sqrt m\gamma)^l=Q(s_{l,i}),\\ \phi(s'_{l,i})^\top\theta^\pi&=\gamma r_0(\sqrt m\gamma)^l=Q(s'_{l,i}),\\ \phi(s_{l,0})^\top\theta^\pi&=\frac{1}{\sqrt m}\sum_{i\in[m]}r_0(\sqrt m\gamma)^l=r_0(\sqrt m\gamma)^l\sqrt m=Q(s_{l,0}).\end{aligned}$$

The feature vectors of the states in groups B and C have unit norm: ‖ϕ(s_{l,i})‖₂ = ‖ϕ(s_{l,0})‖₂ = 1 for i ∈ [m] and 0 ≤ l ≤ L − 1. Those in group A have norm γ < 1. We are in a position to compute the feature covariance matrix

$$\mathbb{E}_{s\sim\mu}[\phi(s)\phi(s)^\top]=p\Big(\frac{1}{d}\sum_{i\in[m]}\sum_{l=0}^{L-1}e_{l,i}e_{l,i}^\top\Big)+(1-p)\Big(\frac{\gamma^2}{d}\sum_{i\in[m]}\sum_{l=0}^{L-1}e_{l,i}e_{l,i}^\top\Big)=\frac{q}{d}I_d,$$

where I_d is the identity matrix.
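This identity can be checked numerically. In the sketch below, the basis features of group B and the γ-scaled copies in group A are mixed with weight p = (q − γ²)/(1 − γ²), which satisfies p + (1 − p)γ² = q; the sizes m, L and the values of γ, q are illustrative:

```python
import numpy as np

m, L, gamma, q = 3, 4, 0.7, 0.8   # illustrative parameters
d = m * L
p = (q - gamma**2) / (1 - gamma**2)  # mixture weight on group B

I = np.eye(d)
# Group B features: standard basis vectors e_{l,i}.
group_B = [I[l * m + i] for l in range(L) for i in range(m)]
# Group A features: gamma * e_{l,i}.
group_A = [gamma * e for e in group_B]

cov = p * np.mean([np.outer(e, e) for e in group_B], axis=0) \
    + (1 - p) * np.mean([np.outer(e, e) for e in group_A], axis=0)

# The mixture covariance equals (q/d) * I_d.
assert np.allclose(cov, (q / d) * I)
```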

Set m = ⌈b/γ²⌉ and L = d/m = d_{b,γ}. In this case, our requirement mγ² > 1 is satisfied because b > 1. Next, we consider an algorithm evaluating the value of s_{L−1,0}. If r₀ = ε/(γ^{L−1}m^{L/2}), then Q(s_{L−1,0}) = ε. If r₀ = −ε/(γ^{L−1}m^{L/2}), then Q(s_{L−1,0}) = −ε. To approximate the value of s_{L−1,0} up to an additive error of ε, the algorithm has to distinguish r₀ = ε/(γ^{L−1}m^{L/2}) and r₀ = −ε/(γ^{L−1}m^{L/2}). The only way that the algorithm obtains information about r₀ is to sample the rewards of s_{0,i} (i ∈ [m]), because the other states in the support of μ have zero reward. Recall the two possible reward distributions of s_{0,i} (i ∈ [m]):

Using Lemma 5.1 in [4], any algorithm outputs an incorrect r₀ from the two choices ε/(γ^{L−1}m^{L/2}) and −ε/(γ^{L−1}m^{L/2}) with probability at least (1/4)(1 − √(1 − exp(−Θ(n(2ε(1 − γ)/(γ^{L−1}m^{L/2}))²)))), where n is the number of samples of s_{0,i} (i ∈ [m]). Since only a p/L fraction of the samples from μ are of the form s_{0,i}, any algorithm outputs an incorrect r₀ with probability at least

$$\begin{aligned}&\frac14\left(1-\sqrt{1-\exp\Big(-\Theta\Big(\frac{pN}{L}\Big(\frac{2\varepsilon(1-\gamma)}{\gamma^{L-1}m^{L/2}}\Big)^2\Big)\Big)}\right)\\ &\quad=\frac14\left(1-\sqrt{1-\exp\Big(-\Theta\Big(\frac{pN\varepsilon^2(1-\gamma)^2\gamma^2}{L(m\gamma^2)^L}\Big)\Big)}\right)\\ &\quad\ge\frac14\left(1-\sqrt{1-\exp\Big(-\Theta\Big(\frac{pN\varepsilon^2(1-\gamma)^2\gamma^2}{d_{b,\gamma}\,b^{d_{b,\gamma}}}\Big)\Big)}\right)\\ &\quad=\frac14\left(1-\sqrt{1-\exp\Big(-\Theta\Big(\frac{(q-\gamma^2)(1-\gamma)\gamma^2}{1+\gamma}\cdot\frac{N\varepsilon^2}{d_{b,\gamma}\,b^{d_{b,\gamma}}}\Big)\Big)}\right),\end{aligned}$$

where N is the number of samples from μ and the inequality follows from m = ⌈b/γ²⌉ and L = d_{b,γ}. If this probability is at least δ, we can solve for ε and obtain

$$\varepsilon=\Theta\left(\sqrt{\frac{1+\gamma}{(q-\gamma^2)(1-\gamma)\gamma^2 N}\cdot d_{b,\gamma}\,b^{d_{b,\gamma}}\,\ln\frac{1}{8\delta(1-2\delta)}}\right). \qquad\square$$

## 5 Upper Bound

In this section, we show that under the low distribution shift assumption, Least-Squares Policy Evaluation (LSPE) approximates the value function up to any given additive error bound ε with polynomially many samples. Suppose that the samples that the agent has access to are {(sᵢ, aᵢ, rᵢ, s̄ᵢ)}ᵢ∈[N], where (sᵢ, aᵢ) are i.i.d. samples from μ, rᵢ ∼ R(sᵢ, aᵢ), and s̄ᵢ ∼ P(· | sᵢ, aᵢ). We would like to approximate the value of a state under a policy π. Recall the feature covariance matrix Λ = E_{(s,a)∼μ}[ϕ(s, a)ϕ(s, a)⊤], and define the next-state feature covariance matrix Λ̄ ≜ E_{(s,a)∼μ, s̄∼P(·|s,a)}[ϕ(s̄, π(s̄))ϕ(s̄, π(s̄))⊤].
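As a concrete illustration, one standard form of the least-squares policy evaluation iteration can be sketched as follows. The array shapes, the deterministic-policy interface, and the fixed iteration count are assumptions made here for the sketch, not necessarily the exact procedure analyzed below:

```python
import numpy as np

def lspe(phi_sa, rewards, phi_next, gamma, num_iters=200):
    """Least-squares policy evaluation (sketch).

    phi_sa:   (N, d) features phi(s_i, a_i)
    rewards:  (N,)   sampled rewards r_i
    phi_next: (N, d) features phi(s_bar_i, pi(s_bar_i))
    Iterates theta <- Lambda_hat^{-1} * (1/N) sum_i phi_i (r_i + gamma phi_bar_i^T theta).
    """
    n, d = phi_sa.shape
    lam_hat = phi_sa.T @ phi_sa / n          # empirical feature covariance
    theta = np.zeros(d)
    for _ in range(num_iters):
        target = rewards + gamma * phi_next @ theta   # bootstrapped targets
        theta = np.linalg.solve(lam_hat, phi_sa.T @ target / n)
    return theta

# Tiny deterministic chain: s0 -> s1 -> s1 with rewards 1 then 0.
# One-hot features make Q linear with theta = (Q(s0), Q(s1)) = (1, 0).
theta = lspe(np.eye(2), np.array([1.0, 0.0]),
             np.array([[0.0, 1.0], [0.0, 1.0]]), gamma=0.9)
assert np.allclose(theta, [1.0, 0.0])
```

Under low distribution shift, the bootstrapped regression contracts toward θ^π; without it, the iteration can diverge, which is consistent with the hard instance failing Assumption 3 below.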

###### Assumption 3 (Low distribution shift).

There exist λ > 0 and C with Cγ² < 1 such that Λ ⪰ λI_d and Λ̄ ⪯ CΛ.

###### Remark 4.

Assumption 3 rules out the hard instance in Theorem 1. Specifically, in the hard instance there is no C with Cγ² < 1 such that Λ̄ ⪯ CΛ.

###### Proof of Remark 4.

In the proof of Theorem 1, we show that Λ = (q/d)I_d. In the sequel, we compute the matrix Λ̄. Recall the data distribution μ = p·Unif(G_B) + (1 − p)·Unif(G_A), where p = (q − γ²)/(1 − γ²), and G_A and G_B denote groups A and B. Suppose that s̄ is the next state of s ∼ μ. Every state in group A transitions to the corresponding state in group B, i.e., s̄ = s_{l,i} if s = s′_{l,i}. Therefore, if s ∼ Unif(G_A), we have s̄ ∼ Unif(G_B) and

$$\mathbb{E}_{s\sim \mathrm{Unif}(G_A),\,\bar s\sim P(\cdot\mid s)}[\phi(\bar s)\phi(\bar s)^\top]=\frac{1}{mL}\sum_{i\in[m]}\sum_{l=0}^{L-1}e_{l,i}e_{l,i}^\top=\frac{1}{d}I_d\triangleq\bar\Lambda_A. \tag{2}$$

Every state in group B on level l ≥ 1 transitions to state s_{l−1,0} in group C. All states in group B on level 0 have a self-loop. As a result, if s ∼ Unif(G_B), we have s̄ = s_{l−1,0} (for s = s_{l,i} with l ≥ 1 and i ∈ [m]) and s̄ = s_{0,i} (for s = s_{0,i} with i ∈ [m]). Therefore, we deduce

$$\mathbb{E}_{s\sim \mathrm{Unif}(G_B),\,\bar s\sim P(\cdot\mid s)}[\phi(\bar s)\phi(\bar s)^\top]=\frac{1}{mL}\sum_{i\in[m]}e_{0,i}e_{0,i}^\top+\frac{1}{L}\sum_{l=0}^{L-2}\Big(\frac{1}{\sqrt m}\sum_{i\in[m]}e_{l,i}\Big)\Big(\frac{1}{\sqrt m}\sum_{i\in[m]}e_{l,i}\Big)^\top. \tag{3}$$

Let Λ̄_B denote the matrix in Equation 3. For an index in [d], we denote it by two indices (l, i). Then (Λ̄_B)_{(0,i),(0,j)} = (1 + δ_{ij})/(mL) for i, j ∈ [m] (δ_{ij} is the Kronecker delta such that δ_{ij} = 1 if i = j and it is zero otherwise) and (Λ̄_B)_{(l,i),(l,j)} = 1/(mL) for 1 ≤ l ≤ L − 2 and i, j ∈ [m]. We see that Λ̄_B is a block diagonal matrix. The matrix (1/(mL))J_m is one of the blocks, where J_m is an m × m all-one matrix. Recall that Λ̄_A in Equation 2 is (1/d)I_d. Then the matrix Λ̄ = pΛ̄_B + (1 − p)Λ̄_A is also a block diagonal matrix. The matrix (p/(mL))J_m + ((1 − p)/d)I_m is one of its blocks and its eigenvalues are (1 − p)/d (with multiplicity m − 1) and p/L + (1 − p)/d (with multiplicity 1). These eigenvalues are also eigenvalues of Λ̄. Therefore ‖Λ̄‖ ≥ p/L + (1 − p)/d = (pm + 1 − p)/d. Consider the function f(q) = γ²(pm + 1 − p) − q, where p = (q − γ²)/(1 − γ²). We will show that f(q) ≥ 0 for all q ∈ [γ², 1]. Notice that it is a linear function of q, so it suffices to check the endpoints. We have f(1) = γ²m − 1 > 0 because p = 1 at q = 1 (we use the assumption mγ² > 1). At q = γ², we have p = 0 and f(γ²) = 0. We conclude that ‖Λ̄‖ ≥ (pm + 1 − p)/d ≥ q/(γ²d). Hence there is no C with Cγ² < 1 such that Λ̄ ⪯ CΛ = C(q/d)I_d. ∎
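The eigenvalue computation for a block of the form aJ_m + bI_m is a standard linear algebra fact that can be checked numerically (the sizes and the value of p below are illustrative, not from the paper):

```python
import numpy as np

m, L = 4, 5
d = m * L
p = 0.3
a, b = p / (m * L), (1 - p) / d   # coefficient of J_m and of I_m

block = a * np.ones((m, m)) + b * np.eye(m)
eig = np.sort(np.linalg.eigvalsh(block))

# aJ_m + bI_m has eigenvalue b with multiplicity m-1 and am + b once;
# here am + b = p/L + (1-p)/d, the quantity used in the proof.
assert np.allclose(eig[:-1], b)
assert np.isclose(eig[-1], a * m + b)
```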

If Assumption 3 is fulfilled, the following theorem presents an upper bound on the sample complexity of approximating the value of a state up to an additive error bound ε. See our discussion in Remark 5. Recall that in Theorem 1, we show that there is an instance with λ_min(Λ) = γ²/d for which evaluating a state up to a constant additive error is impossible (see also [3]). This suggests that when the smallest eigenvalue of Λ is of order γ²/d or below, it is generally impossible to approximate the value of a state. If dγ² < q, we have Λ ⪰ (q/d)I_d and Λ̄ ⪯ I_d ⪯ (d/q)Λ with (d/q)γ² < 1, and therefore Assumption 3 holds (with C = d/q and λ = q/d). Thus our upper bound covers all well-conditioned cases in the regime dγ² < 1. Note that Assumption 3 may also cover some cases in the regime dγ² > 1.

###### Theorem 2.

Suppose that λ and C are constants. Let I_d denote the set of all infinite-horizon MDPs that satisfy Assumption 1 and Assumption 2, whose feature vectors have dimension d, and whose rewards lie in [−1, 1]. Let M_{λ,C}(S, A) denote the set of all probability measures μ on S × A such that Assumption 3 holds with constants λ and C. Let N be a sufficiently large number of samples. With probability at least 1 − δ, we have

 inf^Vsup(S,A,P,R,</