Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable. Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper. We compare these results and reconcile interpretations.

## 1 Introduction

Du, Kakade, Wang, and Yang [1] recently established intriguing lower bounds on the sample complexity of reinforcement learning with a misspecified representation. Versions of the lower bound apply to model learning, value function learning, and policy learning. The cornerstone of their analysis is a basic problem, embedded in each of their results, of bandit learning with a misspecified linear model. The problem is one of finding a needle in a haystack: an agent must identify among an exponentially large number of actions the only one that generates rewards. This obviously requires exponentially many trials. One might hope that with a suitable choice of features, by using a linearly parameterized approximation to generalize across actions, the agent can efficiently identify the rewarding action. However, as established in [1], even if the linear model can approximate rewards with uniform accuracy across actions, an exponentially large number of trials may be required.

Another line of work, which centers around a statistic called the eluder dimension [2, 3], offers additional insight into bandit learning. In particular, an analysis from [2, 3] suggests qualitatively different behavior, indicating that if the linear model can approximate rewards uniformly with sufficient accuracy, the agent can efficiently identify the rewarding action. In this technical note, we reconcile what may appear to be contradictory narratives stemming from these two lines of analysis.

What we find is that the example used to establish the lower bound of [1] violates assumptions imposed by the upper bound of [2, 3]. In essence, the latter requires features to be sufficiently informative. The example that establishes the lower bound makes use of features that are uninformative though they enable accurate approximation of rewards, in some sense. Upon sharing an early version of this technical note, we discovered that Lattimore and Szepesvári also arrived at a similar conclusion and are working toward a deeper analysis of this issue [4].

We begin by formulating in the next section a class of bandit learning problems. Then, in Section 3, we discuss the special case of finding a needle in a haystack and a lower bound that can be established via the analysis of [1]. We next establish an upper bound based on arguments developed in [2, 3]. Finally, we interpret these results in a manner that reconciles narratives.

## 2 A Bandit Learning Problem

Consider a bandit learning problem characterized by a pair $(\mathcal{X}, \mathcal{F})$, where $\mathcal{X}$ is a non-singleton finite set and $\mathcal{F}$ is a class of reward functions that each maps $\mathcal{X}$ to $\mathbb{R}$. Let $f^* \in \mathcal{F}$ denote the reward function that generates observed outcomes. An agent begins with knowledge of $\mathcal{F}$ but not $f^*$. The agent operates over time periods $t = 0, 1, 2, \ldots$, in each period selecting an action $x_t \in \mathcal{X}$ and observing a deterministic outcome $y_{t+1} = f^*(x_t)$.

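Suppose that before making its first decision, the agent is provided with a feature map $\phi : \mathcal{X} \to \mathbb{R}^d$, which assigns a feature vector $\phi(x) \in \mathbb{R}^d$ to each action $x \in \mathcal{X}$. Let $\tilde{f}_\theta(x) = \theta^\top \phi(x)$ denote a linear combination of features with coefficients $\theta \in \mathbb{R}^d$. Suppose the agent is also informed that $f^*$ can be closely approximated by a linear combination of features in the sense that

$$\min_{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1} \|f^* - \tilde{f}_\theta\|_\infty \le \epsilon, \tag{1}$$

for some known $\epsilon$.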

We consider assessing an agent based on the expected number of trials it requires to identify an $\epsilon'$-optimal action, for some tolerance parameter $\epsilon'$. Here we define an action $x \in \mathcal{X}$ to be $\epsilon'$-optimal if

$$f^*(x) \ge \max_{x' \in \mathcal{X}} f^*(x') - \epsilon'.$$

The agent's algorithm takes $\mathcal{F}$, $\phi$, $\epsilon$, and $\epsilon'$ as input. The expectation is over algorithmic randomness in the event that the agent uses a randomized algorithm.
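The quantities defined in this section can be made concrete with a small sketch. The three-action instance, the dictionaries `f_star` and `phi`, and the helper names below are all hypothetical, chosen only to illustrate the definitions. Note that `sup_error` evaluates the sup-norm gap for one given `theta`, so it only upper-bounds the minimum appearing in (1).

```python
from typing import Dict, Hashable, Sequence

Action = Hashable


def linear_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Evaluate theta^T phi(x) for one action's feature vector."""
    return sum(t * p for t, p in zip(theta, features))


def sup_error(f_star: Dict[Action, float],
              phi: Dict[Action, Sequence[float]],
              theta: Sequence[float]) -> float:
    """Sup-norm gap between f_star and the linear model with coefficients theta."""
    return max(abs(f_star[x] - linear_value(theta, phi[x])) for x in f_star)


def is_eps_optimal(f_star: Dict[Action, float], x: Action, eps_prime: float) -> bool:
    """Check f_star(x) >= max_x' f_star(x') - eps_prime."""
    return f_star[x] >= max(f_star.values()) - eps_prime


# Hypothetical three-action instance (values chosen only for illustration).
f_star = {"a": 0.9, "b": 0.5, "c": 0.1}
phi = {"a": (1.0, 0.0), "b": (0.5, 0.0), "c": (0.0, 1.0)}
theta = (0.8, 0.2)  # Euclidean norm is about 0.82, so the norm constraint in (1) holds

print(sup_error(f_star, phi, theta))               # gap of roughly 0.1 at every action
print(is_eps_optimal(f_star, "b", eps_prime=0.5))  # 0.5 >= 0.9 - 0.5
```

In this toy instance the linear model is off by about 0.1 at every action, so (1) would hold with any $\epsilon$ of at least that size.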

## 3 A Lower Bound

The analysis of [1] yields the following lower bound.

###### Theorem 1.

For all learning algorithms, , and , for , there exists , , and a feature map satisfying

$$\min_{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1} \|f^* - \tilde{f}_\theta\|_\infty \le \epsilon,$$

such that the expected number of trials required to identify an -optimal action is .

The expectation is over algorithmic randomness, in the event that the agent employs a randomized algorithm. This result indicates that an exponentially large number of trials can be required even if the agent knows features that can accurately approximate rewards. As demonstrated in [1], this can be established via a simple example which we will now discuss.

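Consider a function class $\mathcal{F}$ comprised of one-hot functions. In particular, $\mathcal{F} = \{f_z : z \in \mathcal{X}\}$, where, for each $z \in \mathcal{X}$, $f_z$ is the function for which $f_z(z) = 1$ and $f_z(x) = 0$ for all $x \neq z$. Let $f^*$ denote the unknown function of interest and $x^*$ be such that $f^*(x^*) = 1$. To produce coefficients $\theta_t$ such that

$$\|f^* - \tilde{f}_{\theta_t}\|_\infty < 0.5,$$

the agent must identify $x^*$. It is easy to see that this requires on the order of $|\mathcal{X}|$ trials.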
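With one-hot rewards, a trial that returns zero rules out only the action that was tried. A minimal sketch of the resulting search cost follows, assuming a fixed sweep order and a needle position averaged uniformly over actions. The function names are hypothetical, and this illustrates linear growth in the number of actions rather than proving a lower bound over all algorithms.

```python
from typing import Iterable


def trials_to_find(needle: int, order: Iterable[int]) -> int:
    """Count trials until the single rewarding action is played."""
    for count, action in enumerate(order, start=1):
        if action == needle:  # the only action whose reward is 1
            return count
    raise ValueError("the search order never visits the rewarding action")


def average_trials(n_actions: int) -> float:
    """Average trial count of a fixed sweep over all possible needle positions."""
    order = range(n_actions)
    total = sum(trials_to_find(z, order) for z in range(n_actions))
    return total / n_actions


print(average_trials(1000))  # 500.5, i.e. (n + 1) / 2
```

The average grows linearly with the number of actions, which is exponential in any quantity that is logarithmic in the size of the action set.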

Now suppose the agent knows features that can accurately approximate rewards, in the sense of (1). Lemma 5.1 of [1], restated here, allows us to select features that are uninformative while meeting such an accuracy requirement.

###### Lemma 1.

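For all non-singleton finite $\mathcal{X}$, $\epsilon > 0$, and $d \ge 8 \ln|\mathcal{X}| / \epsilon^2$, there exists $\phi : \mathcal{X} \to \mathbb{R}^d$ such that, for all $x, y \in \mathcal{X}$ with $x \neq y$, $\phi(x)^\top \phi(x) = 1$ and $|\phi(x)^\top \phi(y)| \le \epsilon$.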

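Fixing $\epsilon$ and choosing $d$ to satisfy the dimension requirement of the lemma, this lemma prescribes a feature map $\phi$ such that

$$\begin{aligned}
\max_{f \in \mathcal{F}} \min_{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1} \|f - \tilde{f}_\theta\|_\infty
&= \max_{f \in \mathcal{F}} \min_{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1} \max_{x \in \mathcal{X}} |f(x) - \theta^\top \phi(x)| \\
&\le \max_{z \in \mathcal{X}} \min_{y \in \mathcal{X}} \max_{x \in \mathcal{X}} |\mathbf{1}(x = z) - \phi(y)^\top \phi(x)| \\
&\le \epsilon.
\end{aligned}$$

Since this feature map does not depend on $f^*$, it does not offer any information that assists in identifying $x^*$. As such, given these features, the agent still requires on the order of $|\mathcal{X}|$ trials.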
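One way to see that such features carry no information about the rewarding action is to generate them without looking at the reward function at all. The sketch below assumes random unit vectors as a stand-in for the lemma's feature map. This is a common way to obtain nearly orthogonal vectors, not necessarily the construction used in [1], and the instance sizes and helper names are hypothetical.

```python
import math
import random
from typing import List


def random_unit_features(n_actions: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw one random unit vector per action; the reward function is never consulted."""
    rng = random.Random(seed)
    feats = []
    for _ in range(n_actions):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        feats.append([c / norm for c in v])
    return feats


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def one_hot_error(feats: List[List[float]], z: int) -> float:
    """Sup-norm error of approximating the one-hot function at z with theta = phi(z)."""
    return max(abs((1.0 if x == z else 0.0) - dot(feats[z], feats[x]))
               for x in range(len(feats)))


n_actions, eps = 50, 0.5
dim = math.ceil(8 * math.log(n_actions) / eps ** 2)  # dimension suggested by the lemma
feats = random_unit_features(n_actions, dim)
worst_error = max(one_hot_error(feats, z) for z in range(n_actions))
print(dim, worst_error)
```

Whichever action is rewarding, the same feature vectors approximate its one-hot reward function to within the measured error, so inspecting the features reveals nothing about which action it is.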

## 4 An Upper Bound

The following result offers an upper bound for an agent that selects actions that aim to quickly home in on $f^*$. The result is general, applying not only to the “needle in a haystack” instance in Section 3 but more broadly to the bandit learning problem in Section 2.

###### Theorem 2.

For all , , , feature maps such that

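$$\min_{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1} \|f^* - \tilde{f}_\theta\|_\infty \le \epsilon,$$

and

$$\epsilon' \ge 2\epsilon \left(1 + \sqrt{3 d \log\left(1 + \frac{1}{d \epsilon^2}\right)}\right), \tag{2}$$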

there exists a learning algorithm that identifies an -optimal action within trials.

The theorem can be established via an analysis developed in [2, 3] to bound the eluder dimension of linear function classes. For convenience, we provide a self-contained proof in the appendix, which adapts those provided in the papers.

Theorem 2 establishes a sense in which the agent learns efficiently, so long as (2) is satisfied. Our discussion in the next section offers some intuition motivating this constraint on $d$ and $\epsilon$, and aims to reconcile the efficiency result with Theorem 1.

## 5 Discussion

The lower bound established by Theorem 1 suggests that an accurate linear representation does not suffice for efficient learning, while the upper bound established by Theorem 2 suggests that it does. Reconciling the results requires careful examination of how examples that establish the lower bound violate assumptions under which the upper bound holds. Examples that establish the lower bound involve features of the kind identified by Lemma 1. The constraint on dimension required by this lemma can be written as

$$\epsilon \sqrt{d} \ge \sqrt{8 \ln(|\mathcal{X}|)}. \tag{3}$$

Letting to simplify the comparison, the requirement (2) of the upper bound established by Theorem 2 is satisfied if

$$\epsilon \sqrt{d} \le \frac{1}{100}. \tag{4}$$

Hence, the upper bound holds when $\epsilon \sqrt{d}$ is small while the lower bound holds when $\epsilon \sqrt{d}$ is large. These constraints can be viewed as complementarity conditions, requiring $\epsilon$ and $d$ to suitably offset one another.

Recall that $d$ is the number of features while $\epsilon$ is the error within which they can approximate $f^*$. When $\epsilon \sqrt{d}$ is large, the error is large relative to the number of features, or the number of features is large relative to the error, or both are large. The proof of the lower bound is constructive, and involves identifying features that achieve a particular level of error. These features can be generated without any information about $f^*$, so they cannot be helpful in learning $f^*$. As such, (3) captures levels of error that can be achieved when features offer no useful information. Clearly, as the number of features increases, even if they are uninformative, error should decrease. So we could also view this result as capturing a rate at which error can decrease as uninformative features are incorporated.

If we apply the upper bound to the hard instance of finding a needle in a haystack, the fact that the result guarantees efficient learning implies that the features must be required to offer useful information and must therefore depend on $f^*$. To ensure this, the error needs to be small relative to the number of features, or the number of features needs to be small relative to the error. This is intuitive: if few features lead to small error, the features must be informative. Figure 1 illustrates how the lower and upper bounds reflect different regimes in the space of $(d, \epsilon)$ pairs. The grey region represents pairs that satisfy neither (3) nor (4). The upper bound of Theorem 2 should apply within some of this grey region, as the constraint (4) is much stronger than (2) and was chosen to simplify it.

Note that the requirement (3) for the lower bound depends on the number of actions $|\mathcal{X}|$. This is because, as the number of actions grows, the number of uninformative features required to achieve error $\epsilon$ also grows. On the other hand, the requirement (4) does not exhibit any dependence on the number of actions. As $|\mathcal{X}|$ increases, the uninformative regime identified by the lower bound shrinks, and the grey region of Figure 1 grows.
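The two regimes can be checked numerically for particular values. The following is a minimal sketch that evaluates the right-hand side of (2) and tests conditions (3) and (4), assuming `log` and `ln` both denote the natural logarithm. The function names and the sample values of `d`, `eps`, and `n_actions` are hypothetical.

```python
import math


def upper_bound_rhs(d: int, eps: float) -> float:
    """Right-hand side of (2): the smallest tolerance covered by the upper bound."""
    return 2.0 * eps * (1.0 + math.sqrt(3.0 * d * math.log(1.0 + 1.0 / (d * eps ** 2))))


def in_lower_bound_regime(d: int, eps: float, n_actions: int) -> bool:
    """Condition (3): eps * sqrt(d) >= sqrt(8 ln |X|)."""
    return eps * math.sqrt(d) >= math.sqrt(8.0 * math.log(n_actions))


def in_upper_bound_regime(d: int, eps: float) -> bool:
    """Condition (4): eps * sqrt(d) <= 1/100."""
    return eps * math.sqrt(d) <= 1.0 / 100.0


# Small error relative to the number of features: condition (4) holds.
print(in_upper_bound_regime(100, 0.0005), upper_bound_rhs(100, 0.0005))

# Large error relative to the number of features: condition (3) holds for 50 actions.
print(in_lower_bound_regime(400, 0.5, 50), in_upper_bound_regime(400, 0.5))
```

For the first pair the right-hand side of (2) is below 0.1, so modest tolerances are covered by the upper bound. The second pair falls in the regime where uninformative features already achieve the stated error.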

## Appendix A Proof of Theorem 2

We will establish the result for an algorithm that selects actions according to

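$$x_t \in \operatorname*{argmax}_{x \in \mathcal{X}} \left( \max_{\theta \in \Theta_t} \tilde{f}_\theta(x) - \min_{\theta \in \Theta_t} \tilde{f}_\theta(x) \right),$$

where

$$\Theta_t = \left\{ \theta \in \mathbb{R}^d : \|\theta\|_2 \le 1, \ \left( \sum_{\tau=0}^{t-1} \left( y_{\tau+1} - \tilde{f}_\theta(x_\tau) \right)^2 \right)^{1/2} \le \epsilon \sqrt{t} \right\}.$$

Note that the set $\Theta_t$ is nonempty because $\theta^* \in \Theta_t$ for any $\theta^*$ attaining the minimum in (1), since $\|f^* - \tilde{f}_{\theta^*}\|_\infty \le \epsilon$.

Let

$$w_t = \max_{\theta \in \Theta_t} \tilde{f}_\theta(x_t) - \min_{\theta \in \Theta_t} \tilde{f}_\theta(x_t).$$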
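The selection rule can be sketched in code under one simplifying assumption: the confidence set is restricted to a finite list of candidate coefficient vectors, whereas the set in the proof is a continuum and computing the exact width would require a convex solver. The candidate list, the feature dictionary, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def consistent(theta: Sequence[float], history: List[Tuple[str, float]],
               phi: Dict[str, Sequence[float]], eps: float) -> bool:
    """Membership test: norm at most 1 and root-sum-square residual at most eps * sqrt(t)."""
    if math.sqrt(dot(theta, theta)) > 1.0 + 1e-12:
        return False
    sq_residual = sum((y - dot(theta, phi[x])) ** 2 for x, y in history)
    return math.sqrt(sq_residual) <= eps * math.sqrt(len(history)) + 1e-12


def widest_action(candidates: List[Sequence[float]], history: List[Tuple[str, float]],
                  phi: Dict[str, Sequence[float]], eps: float) -> Tuple[str, float]:
    """Pick the action whose predicted reward varies most across consistent candidates."""
    live = [th for th in candidates if consistent(th, history, phi, eps)]
    if not live:
        raise ValueError("assumption violated: no candidate is consistent with the data")

    def width(x: str) -> float:
        values = [dot(th, phi[x]) for th in live]
        return max(values) - min(values)

    best = max(phi, key=width)
    return best, width(best)


# Hypothetical two-action instance with one observation already made.
phi = {"a": (1.0, 0.0), "b": (0.0, 1.0)}
candidates = [(1.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.5, -0.5)]
history = [("a", 0.5)]  # (action, observed outcome)
print(widest_action(candidates, history, phi, eps=0.1))
```

After observing action `a`, all surviving candidates agree on its value, so the rule turns to action `b`, where they still disagree.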

To prove the result, we first bound the number of times can be larger than .

###### Lemma 2.

If , for , then

$$t \le 3 d \log\left(1 + \frac{1}{d \epsilon^2}\right).$$

Proof: Let for . For shorthand, let , , and . Let

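$$\tilde{\Theta}_\tau = \left\{ \rho \in \mathbb{R}^d : \|\rho\|_2 \le 2, \ \left( \sum_{k=0}^{\tau-1} \left( \rho^\top \phi(x_k) \right)^2 \right)^{1/2} \le 2 \epsilon \sqrt{\tau} \right\},$$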

and note that, for all , we have . Since ,

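$$\begin{aligned}
\tilde{\Theta}_\tau
&= \left\{ \rho \in \mathbb{R}^d : \|\rho\|_2 \le 2, \ \rho^\top \Phi_\tau \rho \le 4 \epsilon^2 \tau \right\} \\
&= \left\{ \rho \in \mathbb{R}^d : \rho^\top (\epsilon^2 \tau I) \rho \le 4 \epsilon^2 \tau, \ \rho^\top \Phi_\tau \rho \le 4 \epsilon^2 \tau \right\} \\
&\subseteq \left\{ \rho \in \mathbb{R}^d : \rho^\top \Psi_\tau \rho \le 8 \epsilon^2 \tau \right\}.
\end{aligned}$$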

Hence,

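$$\begin{aligned}
w_\tau
&= \max_{\theta \in \Theta_\tau} \tilde{f}_\theta(x_\tau) - \min_{\theta \in \Theta_\tau} \tilde{f}_\theta(x_\tau) \\
&= \max_{\theta, \theta' \in \Theta_\tau} (\theta - \theta')^\top \phi_\tau \\
&\le \max_{\rho \in \tilde{\Theta}_\tau} \rho^\top \phi_\tau \\
&\le \sup_{\rho : \rho^\top \Psi_\tau \rho \le 8 \epsilon^2 \tau} \rho^\top \phi_\tau \\
&= \sqrt{8 \epsilon^2 \tau \, \phi_\tau^\top \Psi_\tau^{-1} \phi_\tau}.
\end{aligned} \tag{5}$$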

Combining (5) and the fact that , we have that .

Note that . Let . The Matrix Determinant Lemma yields

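$$\begin{aligned}
\det \Psi_t
&= \left(1 + \phi_{t-1}^\top \Psi_{t-1}^{-1} \phi_{t-1}\right) \det \Psi_{t-1} \\
&\ge \frac{3}{2} \det \Psi_{t-1} \ge \cdots \\
&\ge \left(\frac{3}{2}\right)^t \det(\lambda I) = \left(\frac{3}{2}\right)^t \lambda^d.
\end{aligned}$$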

The determinant of a positive semidefinite matrix is the product of the eigenvalues, whereas the trace is their sum. As such, the inequality of arithmetic and geometric means yields

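$$\begin{aligned}
\det \Psi_t
&\le \left( \frac{\operatorname{trace}(\Psi_t)}{d} \right)^d \\
&= \left( \frac{\operatorname{trace}(\lambda I) + \sum_{\tau=0}^{t-1} \operatorname{trace}(\phi_\tau \phi_\tau^\top)}{d} \right)^d \\
&\le \left( \lambda + \frac{t}{d} \right)^d.
\end{aligned}$$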

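It follows that $\left(\frac{3}{2}\right)^t \lambda^d \le \left(\lambda + \frac{t}{d}\right)^d$ and therefore,

$$t \le d \log_{3/2}\left(1 + \frac{t}{\lambda d}\right) \le 3 d \log\left(1 + \frac{1}{d \epsilon^2}\right),$$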

as desired.
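The last inequality is a change of logarithm base. A short worked version follows, assuming that $\log$ denotes the natural logarithm and that $\lambda$ is chosen so that $t / (\lambda d) = 1 / (d \epsilon^2)$, that is, $\lambda = \epsilon^2 t$. The latter is inferred here from the two sides of the inequality rather than quoted from the text.

```latex
d \log_{3/2}\!\left(1 + \frac{t}{\lambda d}\right)
  = \frac{d}{\ln(3/2)} \, \ln\!\left(1 + \frac{1}{d \epsilon^2}\right)
  \approx 2.47 \, d \, \ln\!\left(1 + \frac{1}{d \epsilon^2}\right)
  \le 3 d \, \ln\!\left(1 + \frac{1}{d \epsilon^2}\right)
```

The constant 3 is therefore a convenient rounding of $1 / \ln(3/2) \approx 2.47$.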

Note that

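$$\begin{aligned}
\max_{x' \in \mathcal{X}} f^*(x') - f^*(x)
&\le \max_{x' \in \mathcal{X}} \tilde{f}_{\theta^*}(x') - \tilde{f}_{\theta^*}(x) + 2\epsilon \\
&\le \max_{\theta \in \Theta_t} \tilde{f}_\theta(x) - \min_{\theta \in \Theta_t} \tilde{f}_\theta(x) + 2\epsilon.
\end{aligned}$$

Hence, when , for any action , that action has been identified as an $\epsilon'$-optimal action. It follows that action has been identified as $\epsilon'$-optimal if .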

Recall that

$$\epsilon' \ge 2\epsilon \left(1 + \sqrt{3 d \log\left(1 + \frac{1}{d \epsilon^2}\right)}\right).$$

By Lemma 2, if , for , then

$$t \le 3 d \log\left(1 + \frac{1}{d \epsilon^2}\right).$$

This inequality implies that . It follows that an -optimal action is identified within trials.