Du, Kakade, Wang, and Yang [1] recently established intriguing lower bounds on the sample complexity of reinforcement learning with a misspecified representation. Versions of the lower bound apply to model learning, value function learning, and policy learning. The cornerstone of their analysis is a basic problem, embedded in each of their results, of bandit learning with a misspecified linear model. The problem is one of finding a needle in a haystack: an agent must identify, among an exponentially large number of actions, the only one that generates rewards. Absent further structure, this requires exponentially many trials. One might hope that with a suitable choice of features, by using a linearly parameterized approximation to generalize across actions, the agent can efficiently identify the rewarding action. However, as established in [1], even if the linear model can approximate rewards with uniform accuracy across actions, an exponentially large number of trials may be required.
Another line of work, which centers around a statistic called the eluder dimension [2, 3], offers additional insight into bandit learning. In particular, an analysis from [2, 3] suggests qualitatively different behavior, indicating that if the linear model can approximate rewards uniformly with sufficient accuracy, the agent can efficiently identify the rewarding action. In this technical note, we reconcile what may appear to be contradictory narratives stemming from these two lines of analysis.
What we find is that the example used to establish the lower bound of [1] violates assumptions imposed by the upper bound of [2, 3]. In essence, the latter requires features to be sufficiently informative. The example that establishes the lower bound makes use of features that are uninformative, even though they enable accurate approximation of rewards in some sense. Upon sharing an early version of this technical note, we discovered that Lattimore and Szepesvári had arrived at a similar conclusion and are working toward a deeper analysis of this issue [4].
We begin by formulating in the next section a class of bandit learning problems. Then, in Section 3, we discuss the special case of finding a needle in a haystack and a lower bound that can be established via the analysis of [1]. In Section 4, we establish an upper bound based on arguments developed in [2, 3]. Finally, we interpret these results in a manner that reconciles the two narratives.
2 A Bandit Learning Problem
Consider a bandit learning problem characterized by a pair , where is a non-singleton finite set and is a class of reward functions that each maps to . Let denote the reward function that generates observed outcomes. An agent begins with knowledge of but not . The agent operates over time periods , in each period selecting an action and observing a deterministic outcome .
Suppose that before making its first decision, the agent is provided with a feature map
, which assigns a feature vector to each action . Let denote a linear combination of features with coefficients . Suppose the agent is also informed that can be closely approximated by a linear combination of features in the sense that
for some known .
We consider assessing an agent based on the expected number of trials it requires to identify an -optimal action, for some tolerance parameter . Here we define an action to be -optimal if
The agent’s algorithm takes , , , and as input. The expectation is over algorithmic randomness in the event that the agent uses a randomized algorithm.
3 A Lower Bound
The analysis of [1] yields the following lower bound.
Theorem 1. For all learning algorithms, , and , for , there exists , , and a feature map satisfying
such that the expected number of trials required to identify an -optimal action is .
The expectation is over algorithmic randomness, in the event that the agent employs a randomized algorithm. This result indicates that an exponentially large number of trials can be required even if the agent knows features that can accurately approximate rewards. As demonstrated in [1], this can be established via a simple example, which we now discuss.
Consider a function class comprised of one-hot functions. In particular, and, for each , there is a function for which and for all . Let denote the unknown function of interest and be such that . To produce coefficients such that
the agent must identify . It is easy to see that this requires trials.
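The cost of searching without any useful generalization can be illustrated with a small simulation. The sketch below assumes a baseline agent that tries actions in a uniformly random order without repetition; this baseline and the function names are illustrative assumptions, not an algorithm taken from the analysis above. For such an agent, the expected number of trials until the rewarding action is first selected is (n + 1) / 2, where n is the number of actions, so it grows linearly in the size of the action set.

```python
import random


def trials_to_find_needle(num_actions, needle, rng):
    """Number of trials until the rewarding action is first selected when
    actions are tried in a uniformly random order without repetition."""
    order = list(range(num_actions))
    rng.shuffle(order)
    return order.index(needle) + 1


def average_trials(num_actions, num_runs=2000, seed=0):
    """Monte Carlo estimate of the expected number of trials, averaging over
    a uniformly random location of the rewarding action (an assumption)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_runs):
        needle = rng.randrange(num_actions)
        total += trials_to_find_needle(num_actions, needle, rng)
    return total / num_runs
```

Calling `average_trials(n)` for increasing `n` and comparing against `(n + 1) / 2` shows the linear growth; when the action set is exponentially large in some problem parameter, so is the number of trials.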
Now suppose the agent knows features that can accurately approximate rewards, in the sense of (1). Lemma 5.1 of [1], restated here, allows us to select features that are uninformative while meeting such an accuracy requirement.
Lemma 1. For all non-singleton finite , , and , there exists such that, for all with , and .
Fixing and letting , this lemma prescribes a feature map such that
Since this feature map does not depend on , it does not offer any information that assists in identifying . As such, given these features, the agent still requires trials.
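The idea that features can approximate every one-hot reward function while carrying no information about which one is active can be sketched numerically. The code below is a minimal illustration under stated assumptions: it draws random unit vectors as a stand-in for the feature map whose existence is guaranteed by Lemma 5.1 of Du et al. (the lemma itself is an existence result, and random Gaussian directions are our choice here), and it uses the feature vector of the rewarding action as the coefficient vector. All function names are hypothetical.

```python
import math
import random


def random_unit_features(num_actions, dim, seed=0):
    """Draw one random unit vector per action. The draw never looks at which
    action is rewarding, so the features cannot encode that information."""
    rng = random.Random(seed)
    features = []
    for _ in range(num_actions):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        features.append([x / norm for x in v])
    return features


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def one_hot_approximation_error(features, needle):
    """Uniform (max-norm) error of the linear model with coefficients equal to
    the needle's feature vector, measured against the one-hot reward."""
    theta = features[needle]
    worst = 0.0
    for action, phi in enumerate(features):
        target = 1.0 if action == needle else 0.0
        worst = max(worst, abs(dot(theta, phi) - target))
    return worst
```

With, say, 50 actions and 200 dimensions, the error returned for every choice of `needle` is governed by the largest inner product between distinct feature vectors, which is small with high probability. The same feature map therefore works for every candidate reward function, which is exactly why it does not help locate the rewarding action.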
4 An Upper Bound
The following result offers an upper bound for an agent that selects actions that aim to quickly home in on . The result is general, applying not only to the “needle in a haystack” instance in Section 3 but more broadly to the bandit learning problem in Section 2.
Theorem 2. For all , , , feature maps such that
there exists a learning algorithm that identifies an -optimal action within trials.
The theorem can be established via an analysis developed in [2, 3] to bound the eluder dimension of linear function classes. For convenience, we provide a self-contained proof in the appendix, which adapts those provided in the papers.
The lower bound established by Theorem 1 suggests that an accurate linear representation does not suffice for efficient learning, while the upper bound established by Theorem 2 suggests that it does. Reconciling the results requires careful examination of how examples that establish the lower bound violate assumptions under which the upper bound holds. Examples that establish the lower bound involve features of the kind identified by Lemma 1. The constraint on dimension required by this lemma can be written as
Hence, the upper bound holds when is small while the lower bound holds when is large. These constraints can be viewed as complementarity conditions, requiring and to suitably offset one another.
Recall that is the number of features while is the error within which they can approximate . When is large, the error is large relative to the number of features, or the number of features is large relative to the error, or both are large. The proof of the lower bound is constructive, and involves identifying features that achieve a particular level of error. These features can be generated without any information about , so they must not be helpful in learning . As such, (3) captures levels of error that can be achieved when features offer no useful information. Clearly, as the number of features increases, even if they are uninformative, error should decrease. So we could also view this result as capturing a rate at which error can decrease as uninformative features are incorporated.
If we apply the upper bound to the hard instance of finding a needle in a haystack, the fact that the result guarantees efficient learning implies that the features must be required to offer useful information and must therefore depend on . To ensure this, the error needs to be small relative to the number of features, or the number of features needs to be small relative to the error. This is intuitive: if few features lead to small error, the features must be informative. Figure 1 illustrates how the lower and upper bounds reflect different regimes in the space of pairs. The grey region represents pairs that satisfy neither (3) nor (4). The upper bound of Theorem 2 should apply within some of this grey region, as the constraint is much stronger than and chosen to simplify (2).
Note that the requirement (3) for the lower bound depends on the number of actions . This is because, as the number of actions grows, the number of uninformative features required to achieve error also grows. On the other hand, the requirement (4) does not exhibit any dependence on the number of actions. As increases, the uninformative regime identified by the lower bound shrinks, and the grey region of Figure 1 grows.
[1] Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? arXiv preprint arXiv:1910.03016, 2019.
[2] Daniel Russo and Benjamin Van Roy. Learning to Optimize via Posterior Sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[3] Daniel Russo and Benjamin Van Roy. Eluder Dimension and the Sample Complexity of Optimistic Exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
[4] Tor Lattimore and Csaba Szepesvári. Learning with Good Feature Representations in Bandits and in RL with a Generative Model. Working Paper, 2019.
Appendix A Proof of Theorem 2
We will establish the result for an algorithm that selects actions according to
Note that the set is nonempty because , since .
To prove the result, we first bound the number of times can be larger than .
Lemma 2. If , for , then
Proof: Let for . For shorthand, let , , and . Let
and note that, for all , we have . Since ,
Combining (5) and the fact that , we have that .
Note that . Let . The Matrix Determinant Lemma yields
It follows that and therefore,
Hence, when , for any action , that action has been identified as an -optimal action. It follows that action has been identified as -optimal if .
By Lemma 2, if , for , then
This inequality implies that . It follows that an -optimal action is identified within trials.
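For reference, the Matrix Determinant Lemma invoked in the proof of Lemma 2 states that, for an invertible matrix V and a vector x, det(V + x xᵀ) = det(V) (1 + xᵀ V⁻¹ x). The sketch below checks this identity numerically on a hand-picked 2×2 example; the matrix, the vector, and the helper names are illustrative assumptions and are not taken from the proof.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def rank_one_update(v, x):
    """Return V + x x^T for a 2x2 matrix V and a length-2 vector x."""
    return [[v[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]


def quadratic_form_inverse(v, x):
    """Return x^T V^{-1} x for an invertible 2x2 matrix V."""
    d = det2(v)
    inv = [[v[1][1] / d, -v[0][1] / d],
           [-v[1][0] / d, v[0][0] / d]]
    return sum(x[i] * inv[i][j] * x[j] for i in range(2) for j in range(2))


# Illustrative inputs (assumptions): V = diag(2, 3), x = (1, 1).
V = [[2.0, 0.0], [0.0, 3.0]]
x = [1.0, 1.0]
lhs = det2(rank_one_update(V, x))                 # det(V + x x^T)
rhs = det2(V) * (1.0 + quadratic_form_inverse(V, x))  # det(V)(1 + x^T V^{-1} x)
```

Both sides agree up to floating-point rounding. Applied repeatedly, the identity expresses the determinant after a sequence of rank-one updates as a product of factors of the form 1 + xᵀ V⁻¹ x, which is the mechanism that lets determinant growth bound the number of informative trials.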