1 Introduction
We study reinforcement learning (RL) in episodic environments with rich observations, such as images and texts. While many modern empirical RL algorithms are designed to handle such settings
(see, e.g., Mnih et al., 2015), relatively few works focus on the question of strategic exploration in this literature (Ostrovski et al., 2017; Osband et al., 2016), and the sample efficiency of these techniques is not theoretically understood.
From a theoretical perspective, strategic exploration algorithms for provably sample-efficient RL have long existed in the classical tabular setting (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002). However, these methods are difficult to adapt to rich observation spaces, because they all require a number of interactions polynomial in the number of observed states, and, without additional structural assumptions, such a dependency is unavoidable (see, e.g., Jaksch et al., 2010; Lattimore & Hutter, 2012). Consequently, treating the observations directly as unique states makes this class of methods unsuitable for most settings of practical interest.
In order to avoid the dependency on the observation space, one must exploit some inherent structure in the problem. The recent line of work on contextual decision processes (Krishnamurthy et al., 2016; Jiang et al., 2017; Dann et al., 2018) identified certain low-rank structures that enable exploration algorithms with sample complexity polynomial in the rank parameter. Such low-rank structure is crucial to circumventing information-theoretic hardness, and is typically found in problems where complex observations are emitted from a small number of latent states. Unlike tabular approaches, which require the number of states to be small and observed, these works are able to handle settings where the observation spaces are uncountably large or continuous and the underlying states are never observed during learning. They achieve this by exploiting the low-rank structure implicitly, operating only in the observation space. The resulting algorithms are sample-efficient, but either provably computationally intractable, or practically quite cumbersome even under strong assumptions (Dann et al., 2018).
In this work, we take an alternative route: we recover the latent-state structure explicitly by learning a decoding function (from a large set of candidates) that maps a rich observation to the corresponding latent state; note that if such a function is learned perfectly, the rich-observation problem is reduced to a tabular problem where exploration is tractable. We show that our algorithms are:
Provably sample-efficient: Under certain identifiability assumptions, we recover a mapping from the observations to underlying latent states as well as a good exploration policy using a number of samples which is polynomial in the number of latent states, horizon, and the complexity of the decoding function class, with no explicit dependence on the observation space size. Thus we significantly generalize beyond the works of Dann et al. (2018), who require deterministic dynamics, and Azizzadenesheli et al. (2016a), whose guarantees scale with the observation space size.
Computationally practical: Unlike many prior works in this vein, our algorithm is easy to implement and substantially outperforms naïve exploration in experiments, even when the baselines have cheating access to the latent states.
In the process, we introduce a formalism called block Markov decision process (also implicit in some prior works), and a new solution concept for exploration called $\epsilon$–policy cover.
The main challenge in learning the decoding function is that the hidden states are never directly observed. Our key novelty is the use of a backward conditional probability vector (Equation 1) as a representation for latent state, and learning the decoding function via conditional probability estimation, which can be solved using least squares regression. While learning low-dimensional representations of rich observations has been explored in recent empirical works (e.g., Silver et al., 2017; Oh et al., 2017; Pathak et al., 2017), our work provides a precise mathematical characterization of the structures needed for such approaches to succeed and comes with rigorous sample-complexity guarantees.
2 Setting and Task Definition
We begin by introducing some basic notation. We write $[n]$ to denote the set $\{1, \ldots, n\}$. For any finite set $S$, we write $U(S)$ to denote the uniform distribution over $S$. We write $\Delta_d$ for the simplex in $\mathbb{R}^d$. Finally, we write $\|\cdot\|$ and $\|\cdot\|_1$, respectively, for the Euclidean and the $\ell_1$ norms of a vector.
2.1 Block Markov Decision Process
In this paper we introduce and analyze a block Markov decision process or BMDP. It refers to an environment described by a finite, but unobservable latent state space $\mathcal{S}$, a finite action space $\mathcal{A}$, with $|\mathcal{A}| = K$, and a possibly infinite, but observable context space $\mathcal{X}$. The dynamics of a BMDP are described by the initial state $s_1 \in \mathcal{S}$ and two conditional probability functions: the state-transition function $p$ and context-emission function $q$, defining conditional probabilities $p(s' \mid s, a)$ and $q(x \mid s)$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$, $x \in \mathcal{X}$. (Footnote 1: For continuous context spaces, $q(\cdot \mid s)$ describes a density function relative to a suitable measure, e.g., Lebesgue measure.)
The model may further include a distribution of reward conditioned on context and action. However, rewards do not play a role in the central task of the paper, which is the exploration of all latent states. Therefore, we omit rewards from our formalism, but we discuss in a few places how our techniques apply in the presence of rewards (for a thorough discussion see Appendix B).
We consider episodic learning tasks with a finite horizon . In each episode, the environment starts in the state . In the step of an episode, the environment generates a context , the agent observes the context (but not the state ), takes an action , and the environment transitions to a new state . The sequence generated in an episode is called a trajectory. We emphasize that a learning agent does not observe components from the trajectory.
So far, our description resembles that of a partially observable Markov decision process (POMDP). To finish the definition of BMDP, and distinguish it from a POMDP, we make the following assumption:
Assumption 2.1 (Block structure).
Each context uniquely determines its generating state . That is, the context space can be partitioned into disjoint blocks , each containing the support of the conditional distribution .
The sets are unique up to the sets of measure zero under . In the paper, we say “for all ” to mean “for all up to a set of measure zero under .”
The block structure implies the existence of a perfect decoding function , which maps contexts into their generating states. This means that a BMDP is indeed an MDP with the transition operator . Hence the contexts observed by the agent form valid Markovian states, but the size of is too large, so only learning the MDP parameters in the smaller, latent space is tractable.
We note that Assumption 2.1 or similar MDP structures have been previously studied by Krishnamurthy et al. (2016) and Dann et al. (2018). It can naturally model visual grid-world-like environments often studied in empirical RL (e.g., Johnson et al., 2016), as well as noisy observations of the latent state due to imperfect sensors.
To streamline our analysis, we make a standard assumption for episodic settings. We assume that can be partitioned into disjoint sets , , such that is supported on whenever . We refer to as the level and assume that it is observable as part of the context, so the context space is also partitioned into sets . We use notation for the set of states up to level , and similarly define .
We assume that . We seek learning algorithms that scale polynomially in parameters , and , but do not explicitly depend on , which might be infinite.
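To make the interaction protocol concrete, the following Python sketch simulates one episode of a toy BMDP. Everything specific in it is an illustrative assumption rather than part of the formal model: the array layout `P[s, a, s']`, the helper names `emit`, `decode`, and `sample_trajectory`, and the emission process, in which state `s` only emits contexts from the interval `[s, s + 0.5)`. For brevity the toy ignores the partition of states into levels.

```python
import numpy as np

def emit(s, rng):
    # Assumed emission process: state s only emits contexts in [s, s + 0.5),
    # so the blocks of different states are disjoint.
    return s + 0.5 * rng.random()

def decode(x):
    # Perfect decoding function for this toy emission process.
    return int(np.floor(x))

def sample_trajectory(P, policy, H, rng, s1=0):
    """Roll out one episode; P[s, a] is a distribution over next latent states."""
    s = s1
    contexts, actions, states = [], [], []
    for h in range(H):
        x = emit(s, rng)              # the environment emits a context
        a = policy(x, h)              # the agent sees x, never s
        contexts.append(x)
        actions.append(a)
        states.append(s)
        s = int(rng.choice(P.shape[2], p=P[s, a]))  # latent transition
    return contexts, actions, states

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
contexts, actions, states = sample_trajectory(P, lambda x, h: h % 2, H=5, rng=rng)
```

Because the intervals are disjoint, `decode` recovers the hidden state from every context, which is what the block structure guarantees; a learning agent would only see `contexts` and `actions`.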
2.2 Solution Concept: Cover of Exploratory Policies
In this paper, we focus on the problem of exploration. Specifically, for each state , we seek an agent strategy for reaching that state . We formalize an agent strategy as an step policy, which is a map specifying which action to take in each context up to step . When executing an step policy with , an agent acts according to for steps and then arbitrarily until the end of the episode (e.g., according to a specific default policy).
For an step policy , we write to denote the probability distribution over step trajectories induced by . We write for the probability of an event . For example, is the probability of reaching the state when executing .
We also consider randomized strategies, which we formalize as policy mixtures. An step policy mixture is a distribution over step policies. When executing , an agent randomly draws a policy at the beginning of the episode, and then follows throughout the episode. The induced distribution over step trajectories is denoted .
Our algorithms create specific policies and policy mixtures via concatenation. Specifically, given an step policy , we write for the step policy that executes for steps and chooses action in step . Similarly, if is a policy mixture and a distribution over , we write for the policy mixture equivalent to first sampling and following a policy according to and then independently sampling and following an action according to .
We finally introduce two key concepts related to exploration: maximum reaching probability and policy cover.
Definition 2.1 (Maximum reaching probability.).
For any , its maximum reaching probability is
where the maximum is taken over all maps . The policy attaining the maximum for a given is denoted . (Footnote 2: It suffices to consider maps for .)
Without loss of generality, we assume that all the states are reachable, i.e., for all . We write for the value of the hardest-to-reach state. Since is finite and all states are reachable, .
Given maximum reaching probabilities, we formalize the task of finding policies that reach states as the task of finding an –policy cover in the following sense:
Definition 2.2 (Policy cover of the state space).
We say that a set of policies is an –policy cover of if for all there exists an step policy such that . A set of policies is an –policy cover of if it is an –policy cover of for all .
Intuitively, we seek a policy cover of a small size, typically , and with a small . Given such a cover, we can reach every state with the largest possible probability (up to ) by executing each policy from the cover in turn. This enables us to collect a dataset of observations and rewards at all (sufficiently) reachable states and further obtain a policy that maximizes any reward (details in Appendix B).
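When the latent transition probabilities are known, the maximum reaching probability of Definition 2.1 can be computed by backward dynamic programming, and the cover condition of Definition 2.2 becomes a simple inequality check. The sketch below assumes a layered latent MDP stored as a list `T`, where `T[j][s, a, s']` is the probability of moving from state `s` at level `j` to state `s'` at level `j + 1`, with a single start state at level 0. This layout and the names `max_reach_prob` and `is_policy_cover` are hypothetical, chosen only for illustration.

```python
import numpy as np

def max_reach_prob(T, level, target):
    """Maximum probability of reaching state `target` at level `level` (0-based)."""
    if level == 0:
        return 1.0  # the start state is reached with certainty
    v = np.zeros(T[level - 1].shape[2])
    v[target] = 1.0                      # terminal reward: indicator of the target
    for j in range(level - 1, -1, -1):
        q = T[j] @ v                     # q[s, a] = sum_s' T[j][s, a, s'] * v[s']
        v = q.max(axis=1)                # act greedily at level j
    return float(v[0])

def is_policy_cover(reach, mu, eps):
    """reach[i, s]: probability that policy i reaches state s; mu[s]: best possible."""
    return bool(np.all(reach.max(axis=0) >= mu - eps))

T = [np.array([[[0.8, 0.2], [0.3, 0.7]]]),            # level 0 -> level 1
     np.array([[[1.0, 0.0], [0.0, 1.0]],
               [[0.5, 0.5], [0.5, 0.5]]])]            # level 1 -> level 2

mu = np.array([max_reach_prob(T, 1, s) for s in range(2)])
reach_both = np.array([[0.8, 0.2], [0.3, 0.7]])       # "always action 0", "always action 1"
reach_one = reach_both[:1]                            # only the first policy
```

In this toy instance the two level-1 states have maximum reaching probabilities 0.8 and 0.7, so the two constant policies together form a cover with zero slack, while the first policy alone does not cover the second state.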
3 Embedding Approach
A key challenge in solving the BMDP exploration problem is the lack of access to the latent state . Our algorithms work by explicitly learning a decoding function which maps contexts to the corresponding latent states. This appears to be a hard unsupervised learning problem, even under the block-structure assumption, unless we make strong assumptions about the structure of or about the emission distributions . Here, instead of making assumptions about or , we make certain “separability” assumptions about the latent transition probabilities . Thus, we retain a broad flexibility to model rich context spaces, and also obtain the ability to efficiently learn a decoding function . In this section, we define key components of our approach and formally state the separability assumption.
3.1 Embeddings and Function Approximation
In order to construct the decoding function , we learn low-dimensional representations of contexts as well as latent states in a shared space, namely . We learn embedding functions for contexts and for states, with the goal that and should be close if and only if . Such embedding functions always exist due to the block structure: for any set of distinct vectors , it suffices to define for .
As we see later in this section, embedding functions and can be constructed via an essentially supervised approach, assuming separability. The state embedding is a lower complexity object (a tuple of at most points in ), whereas the context embedding has a high complexity for even moderately rich context spaces. Therefore, as is standard in supervised learning, we limit attention to functions from some class , such as generalized linear models, tree ensembles, or neural nets. This is a form of function approximation where the choice of includes any inductive biases about the structure of the contexts. By limiting the richness of , we can generalize across contexts as well as control the sample complexity of learning. At the same time, needs to include embedding functions that reflect the block structure. Allowing a separate for each level, we require realizability in the following sense:
Assumption 3.1 (Realizability).
For any and , there exists such that for all and .
In words, the class must be able to match any state-embedding function across all blocks . To satisfy this assumption, it is natural to consider classes obtained via a composition where is a decoding function from some class and is any mapping . Conceptually, first decodes the context to a state which is then embedded by into . The realizability assumption is satisfied as long as contains a perfect decoding function , for which whenever . The core representational power of is thus driven by , the class of candidate decoding functions .
Given such a class , our goal is to find a suitable context-embedding function in using a number of trajectories that is proportional to when is finite, or a more general notion of complexity such as a log covering number when is infinite. Throughout this paper, we assume that is finite as it serves to illustrate the key ideas, but our approach generalizes to the infinite case using standard techniques.
As we alluded to earlier, we learn context embeddings by solving supervised learning problems. In fact, we only require the ability to solve least squares problems. Specifically, we assume access to an algorithm for solving vector-valued least-squares regression over the class . We refer to such an algorithm as the ERM oracle:
Definition 3.1 (ERM Oracle).
Let be a function class that maps to . An empirical risk minimization oracle (ERM oracle) for is any algorithm that takes as input a data set with , , and computes .
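As an illustration of Definition 3.1, the sketch below implements two least-squares oracles in Python: an exhaustive search over a finite class, and a closed-form solver for linear maps based on `numpy.linalg.lstsq`. The function names `erm_oracle` and `linear_erm` and the toy data are assumptions made for this example; the definition itself only requires some algorithm that returns an empirical minimizer.

```python
import numpy as np

def erm_oracle(G, X, Y):
    """Return the function in the finite class G with the smallest squared loss."""
    def loss(g):
        preds = np.array([g(x) for x in X])
        return float(np.sum((preds - Y) ** 2))
    return min(G, key=loss)

def linear_erm(X, Y):
    """Least squares over linear maps x -> x @ W, solved in closed form."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda x: x @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # 50 contexts with 3 features
W_true = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
Y = X @ W_true                                     # noiseless vector-valued targets

g_lin = linear_erm(X, Y)
G = [lambda x: np.zeros(2), lambda x: x @ W_true]  # a tiny finite class
g_fin = erm_oracle(G, X, Y)
```

The exhaustive version costs time linear in the size of the class, so it is only practical for small finite classes; the linear version matches the ordinary least squares fit used for linear decoding functions in the experiments.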
3.2 Backward Probability Vectors and Separability
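For any distribution $P$ over trajectories, we define backward probabilities as the conditional probabilities of the form $P(s_{h-1}, a_{h-1} \mid s_h)$; note that the conditioning is in the opposite direction to the transitions in $p$. For the backward probabilities to be defined, we do not need to specify a full distribution over trajectories, only a distribution $\nu$ over $(s_{h-1}, a_{h-1})$. For any such distribution $\nu$, any $s \in \mathcal{S}_{h-1}$, $a \in \mathcal{A}$, and $s' \in \mathcal{S}_h$, the backward probability is defined as
$$ b_\nu(s, a \mid s') = \frac{p(s' \mid s, a)\,\nu(s, a)}{\sum_{\tilde{s}, \tilde{a}} p(s' \mid \tilde{s}, \tilde{a})\,\nu(\tilde{s}, \tilde{a})}. \tag{1} $$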
For a given , we collect the probabilities across all , into the backward probability vector , padding with zeros if . Backward probability vectors are at the core of our approach, because they correspond to the state embeddings approximated by our algorithms. Our algorithms require that for different states are sufficiently separated from one another for a suitable choice of :
Assumption 3.2 (Separability).
There exists such that for any and any distinct , the backward probability vectors with respect to the uniform distribution are separated by a margin of at least , i.e., , where .
In Appendix F we show that the uniform distribution above can be replaced with any distribution supported on , although the margins would be different.
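The backward probability vectors and the separability margin can be computed directly when the latent transition probabilities are known. The following sketch assumes an array `p[s, a, s2]` of transition probabilities between two consecutive levels and an array `nu[s, a]` for the distribution over state-action pairs; these layouts and the function names are illustrative assumptions. It applies Bayes' rule to obtain one vector per next state and then takes the smallest pairwise $\ell_1$ distance, the norm under which the deterministic case of Section 4.1 yields a distance of exactly two.

```python
import numpy as np

def backward_vectors(p, nu):
    """Row s2 of the result is the backward probability vector of next state s2."""
    joint = p * nu[:, :, None]            # joint[s, a, s2] = nu(s, a) * p(s2 | s, a)
    marg = joint.sum(axis=(0, 1))         # probability of landing in s2 (assumed > 0)
    return (joint / marg).reshape(-1, p.shape[2]).T

def margin(B):
    """Smallest l1 distance between the vectors of two distinct states."""
    n = B.shape[0]
    return min(np.abs(B[i] - B[j]).sum() for i in range(n) for j in range(i + 1, n))

nu = np.full((2, 2), 0.25)                # uniform over 2 states x 2 actions

p_sto = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
B_sto = backward_vectors(p_sto, nu)

# Deterministic transitions: action 0 always leads to state 0, action 1 to state 1.
p_det = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 0.0], [0.0, 1.0]]])
B_det = backward_vectors(p_det, nu)
```

Each row of `B_sto` and `B_det` is a probability vector over state-action pairs, and `margin(B_det)` attains the largest possible value because the two rows have disjoint supports.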
The key property that makes vectors algorithmically useful is that they arise as solutions to a specific least squares problem with respect to data generated by a policy whose marginal distribution over matches . Let denote the vector of the standard basis in corresponding to the coordinate indexed by . Then the following statement holds:
Theorem 3.1.
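Let $\nu$ be a distribution supported on $\mathcal{S}_{h-1} \times \mathcal{A}$ and let $\tilde{\nu}$ be a distribution over $(s, a, x')$ defined by sampling $(s, a) \sim \nu$, $s' \sim p(\cdot \mid s, a)$, and $x' \sim q(\cdot \mid s')$. Let
$$ g_h \in \operatorname*{argmin}_{g \in \mathcal{G}_h} \; \mathbb{E}_{(s, a, x') \sim \tilde{\nu}} \Big[ \big\| g(x') - e_{(s,a)} \big\|^2 \Big]. \tag{2} $$
Then, under Assumption 3.1, every minimizer $g_h$ satisfies $g_h(x') = b_\nu(s')$, the backward probability vector of $s'$, for all $x' \in \mathcal{X}_{s'}$ and $s' \in \mathcal{S}_h$.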
The distribution is exactly the marginal distribution induced by a policy whose marginal distribution over matches . Any minimizer yields context embeddings corresponding to state embeddings . Our algorithms build on Theorem 3.1: they replace the expectation by an empirical sample and obtain an approximate minimizer by invoking an ERM oracle.
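A small simulation can illustrate the regression view behind Theorem 3.1. The sketch below is a toy check under assumed details: two latent states per level, two actions, uniformly sampled state-action pairs, contexts made of a one-hot encoding of the next state followed by irrelevant random bits, and a linear function class fitted with `numpy.linalg.lstsq` as a stand-in for the ERM oracle. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, S2, n, d_noise = 2, 2, 2, 20000, 3
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])           # p[s, a, s2]

sa = rng.integers(S * A, size=n)                    # (s, a) drawn uniformly
s, a = sa // A, sa % A
s2 = (rng.random(n) < p[s, a, 1]).astype(int)       # hidden next state

# Context of s2: one-hot of the state followed by irrelevant random bits.
X = np.hstack([np.eye(S2)[s2], rng.integers(2, size=(n, d_noise))])
Y = np.eye(S * A)[sa]                               # regression targets e_(s, a)

W, *_ = np.linalg.lstsq(X, Y, rcond=None)           # empirical least squares
pred = X @ W                                        # learned context embeddings

# Exact backward probability vectors under the uniform distribution.
joint = p / (S * A)
exact = (joint / joint.sum(axis=(0, 1))).reshape(-1, S2).T

avg_pred = np.stack([pred[s2 == j].mean(axis=0) for j in range(S2)])
max_err = np.abs(avg_pred - exact).max()
```

Averaged over the contexts of each next state, the fitted embeddings should lie close to the exact backward probability vectors, with the gap shrinking as `n` grows. The hidden states `s2` are used here only to evaluate the fit, never as regression inputs.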
4 Algorithm for Separable BMDPs
With the main components defined, we can now derive our algorithm for learning a policy cover in a separable BMDP.
The algorithm proceeds inductively, level by level. On each level , we learn the following objects:

The set of discovered latent states and a decoding function , which allows us to identify latent states at level from observed contexts.

The estimated transition probabilities across all , , .

A set of step policies .
We establish a correspondence between the discovered states and true states via a bijection , under which the functions accurately decode contexts into states, the probability estimates are close to true probabilities, and is an –policy cover of . Specifically, we prove the following statement for suitable accuracy parameters , and :
Claim 4.1.
There exists a bijection such that the following conditions are satisfied for all , , , and , , where is the bijection for the previous level:
Accuracy of :  (3)  
Accuracy of :  
(4)  
Coverage by :  (5) 
Algorithm 1 constructs , , and level by level. Given these objects up to level , the construction for the next level proceeds in the following three steps, annotated with the lines in Algorithm 1 where they appear:
(1) Regression step: learn (lines 7–9). We collect a dataset of trajectories by repeatedly executing a specific policy mixture . We use to identify on each trajectory, obtaining samples from induced by . The context embedding is then obtained by solving the empirical version of (2).
Our specific choice of ensures that each state is reached with probability at least , which is bounded away from zero if is sufficiently small. The uniform choice of actions then guarantees that each state on the next level is also reached with sufficiently large probability.
(2) Clustering step: learn and (lines 10–12). Thanks to Theorem 3.1, we expect that for the distribution induced by . (Footnote 3: Theorem 3.1 uses distributions and over true states , but its analog also holds for distributions over , as long as decoding is approximately correct at the previous level.) Thus, all contexts generated by the same latent state have embedding vectors close to each other and to . Thanks to separability, we can therefore use clustering to identify all contexts generated by the same latent state, and this procedure is sample-efficient since the embeddings are low-dimensional vectors. (Footnote 4: Although Assumption 3.2 is stated w.r.t. the uniform distribution, in Appendix F we show that it automatically implies separability under any fully supported distribution.) Each cluster corresponds to some latent state and any vector from that cluster can be used to define the state embedding . The decoding function is defined to map any context to the state whose embedding is the closest to .
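The clustering and decoding logic can be sketched as follows. This is a simple greedy threshold procedure written for illustration, not Algorithm 2 of the paper, which is not reproduced in this section; the threshold `tau`, the use of the $\ell_1$ distance, and the function names are assumptions.

```python
import numpy as np

def greedy_cluster(Z, tau):
    """Open a new cluster whenever a point is farther than tau from every center."""
    centers = []
    for z in Z:
        if all(np.abs(z - c).sum() > tau for c in centers):
            centers.append(z)
    return np.array(centers)

def decode(z, centers):
    """Map an embedding to the index of the nearest center in l1 distance."""
    return int(np.abs(centers - z).sum(axis=1).argmin())

rng = np.random.default_rng(0)
true = np.array([[0.5, 0.0, 0.5, 0.0],
                 [0.0, 0.5, 0.0, 0.5]])           # two well-separated state embeddings
labels = rng.integers(2, size=200)                 # hidden latent states
Z = true[labels] + rng.uniform(-0.02, 0.02, size=(200, 4))  # noisy context embeddings

centers = greedy_cluster(Z, tau=0.5)
decoded = np.array([decode(z, centers) for z in Z])
perm = [decode(t, centers) for t in true]          # cluster index of each true state
```

With a margin of 2 between the true embeddings and noise far below the threshold, the procedure finds exactly two clusters, and the decoded indices agree with the hidden states up to the relabeling stored in `perm`.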
(3) Dynamic programming: construct (lines 13–19). Finally, with the ability to identify states at level via , we can use collected trajectories to learn an approximate transition model up to level . This allows us to use dynamic programming to find policies that (approximately) optimize the probability of reaching any specific state . The dynamic programming finds policies that act by directly observing decoded latent states. The policies are obtained by composing with the decoding functions .
The next theorem guarantees that with a polynomial number of samples, Algorithm 1 finds a small –policy cover. (Footnote 5: The , , and notation suppresses factors that are polynomial in , , and .)
Theorem 4.1 (Sample Complexity of Algorithm 1).
Fix any and a failure probability . Set , , , . Then with probability at least , Algorithm 1 returns an –policy cover of , with size at most .
In addition to dependence on the usual parameters like and , our sample complexity also scales inversely with the separability margin and the worst-case reaching probability . While the exact dependence on these parameters is potentially improvable, Appendix F suggests that some inverse dependence is unavoidable for our approach. Compared with Azizzadenesheli et al. (2016a), there is no explicit dependence on , although they make spectral assumptions instead of the explicit block structure.
4.1 Deterministic BMDPs
As a special case of general BMDPs, many prior works study the case of deterministic transitions, that is, for a unique state for each . Also, many simulation-based empirical RL benchmarks exhibit this property. We refer to these BMDPs as deterministic, but note that only the transitions are deterministic, not the emissions . In this special case, the algorithm and guarantees of the previous section can be improved, and we present this specialization here, both for a direct comparison with prior work and potential usability in deterministic environments.
To start, note that the worst-case reaching probability is $\mu_{\min} = 1$ and the separability margin is $\gamma = 2$ in any deterministic BMDP. The former holds as any reachable state is reached with probability one. For the latter, if transitions to , then cannot appear in the backward distribution of any other state . Consequently, the backward probabilities for distinct states must have disjoint support over , and thus their distance is exactly two.
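The claim about the distance can be spelled out in one line. Writing $b(s')$ for the backward probability vector of a state $s'$ and $b(s, a \mid s')$ for its entries (notation assumed here for illustration), suppose $s_1'$ and $s_2'$ are distinct states whose vectors have disjoint supports. At every pair $(s, a)$ at most one of the two entries is nonzero, so the absolute difference equals the sum, and each vector sums to one:

```latex
\| b(s_1') - b(s_2') \|_1
  = \sum_{(s,a)} \bigl| b(s, a \mid s_1') - b(s, a \mid s_2') \bigr|
  = \sum_{(s,a)} b(s, a \mid s_1') + \sum_{(s,a)} b(s, a \mid s_2')
  = 1 + 1 = 2 .
```

Since the $\ell_1$ distance between any two probability vectors is at most two, this is the largest margin a BMDP can have.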
Deterministic transitions allow us to obtain the policy cover with ; that is, we learn policies that are guaranteed to reach any given state with probability one. Moreover, it suffices to consider policies with simple structure: those that execute a fixed sequence of actions. Also, since we have access to policies reaching states in the prior level with probability one, there is no need for a decoding function when learning states and context embeddings on level . The final, more technical implication of determinism (which we explain below) is that it allows us to boost the accuracy of the context embedding in the clustering step, leading to improved sample complexity.
The details are presented in Algorithm 4. At each level , we construct the following objects:

A set of discovered states .

A set of step policies .
We proceed inductively and for each level prove that the following claim holds with a high probability:
Claim 4.2.
There exists a bijection such that reaches with probability one.
This implies that can be viewed as a latent state space, and is an –policy cover of with .
To construct these objects for the next level , Algorithm 4 proceeds in three steps similar to Algorithm 1 for the stochastic case. The regression step, that is, learning of (lines 6–8), is identical. The clustering step (lines 9–15) is slightly more complicated. We boost the accuracy of the learned context embedding by repeatedly sampling contexts that are guaranteed to be emitted from the same latent state (because they result from the same sequence of actions), and taking an average. This step allows us to get away with a lower accuracy of compared with Algorithm 1. Finally, the third step, learning of (line 16), is substantially simpler. Since any action sequence reaching a given cluster can be picked as a policy to reach the corresponding latent state, dynamic programming is not needed.
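The accuracy boost from averaging can be seen in a toy simulation. The sketch below assumes that the learned embedding of each context equals the true state embedding plus independent Gaussian noise, a modeling assumption made only for this illustration, and compares the typical error of a single embedding with the error of an average over contexts emitted from the same latent state.

```python
import numpy as np

rng = np.random.default_rng(0)
true_embedding = np.array([0.5, 0.0, 0.5, 0.0])   # hypothetical state embedding
noise_std, m = 0.1, 400

# m noisy embeddings of contexts reached by the same action sequence
noisy = true_embedding + noise_std * rng.normal(size=(m, 4))

per_sample_err = np.linalg.norm(noisy - true_embedding, axis=1).mean()
avg_err = np.linalg.norm(noisy.mean(axis=0) - true_embedding)
```

Under this noise model the error of the average shrinks roughly like one over the square root of `m`, so a coarser regression fit can still yield cluster centers that are accurate enough to separate states.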
The following theorem characterizes the sample complexity of Algorithm 4. It shows we only need samples to find a policy cover with .
Theorem 4.2 (Sample Complexity of Algorithm 4).
Set , and . Then with probability at least , Algorithm 4 returns an –policy cover of , with and size at most .
The policy cover we compute can be used within a PAC RL algorithm to optimize a reward. As one example, if the reward depends on the latent state, we can use the policy cover to reach each state-action pair and collect samples to estimate the expected reward for this state-action to accuracy . Thus, using at most samples in addition to those needed by Algorithm 4, we can find the trajectory with the largest expected reward within an error. To summarize (see also Appendix B):
Corollary 4.1.
With probability at least , Algorithm 4 can be used to find an suboptimal policy using at most trajectories from a deterministic BMDP.
We can now compare Corollary 4.1 with the related work of Dann et al. (2018). Our result significantly improves dependence on and compared with their bound, although their function-class complexity term is not directly comparable to ours, as their work approximates optimal value functions and policies, while we approximate ideal decoding functions.
5 Experiments
We perform an empirical evaluation of our decoding-based algorithms in six challenging RL environments (some meeting the BMDP assumptions and some not) with two choices for the function class . We compare our algorithm, which operates directly on rich observations, against two tabular algorithms that operate on the latent state.
The environments. All environments share the same latent structure, and are a form of “combination lock,” with levels, 3 states per level, and 4 actions. Nonzero reward is only achievable from states and . From and one action leads with probability to and with probability to , another has the flipped behavior, and the remaining two lead to . All actions from lead to . The “good” actions are randomly assigned for every state. From and , two actions receive reward; all others provide zero reward. The start state is . We consider deterministic variant () and stochastic variant (). (See Appendix C.)
The environments are designed to be difficult for exploration. For example, the deterministic variant has paths with nonzero reward, but paths in total, so random exploration requires exponentially many trajectories.
We also consider two observation processes, which we use only for our algorithm, while the baselines operate directly on the latent state space. In LockBernoulli, the observation space is where the first coordinates are reserved for one-hot encoding of the state and the last coordinates are drawn iid from . This space meets the BMDP assumptions and can be perfectly decoded via linear functions. Note that the space is not partitioned across time, which our algorithms track internally. In LockGaussian, the observation space is . As before, the first coordinates are reserved for one-hot encoding of the state, but this encoding is corrupted with Gaussian noise. Formally, if the agent is at state the observation is , where is one of the first three standard basis vectors and has entries. We consider . Note that these environments do not satisfy Assumption 3.1 since the emission distributions cannot be perfectly separated.
Baselines, hyperparameters. We compare our algorithms against two tabular approaches that cheat by directly accessing the latent states. The first, OracleQ, is the Optimistic Q-Learning algorithm of Jin et al. (2018), which has a near-optimal regret bound in tabular environments and serves as a skyline. (Footnote 6: We use the Hoeffding version, which is conceptually much simpler, but statistically slightly worse.) The second, QLearning, is tabular Q-learning with $\epsilon$-greedy exploration. This algorithm serves as a baseline: any algorithm with strategic exploration should vastly outperform QLearning, even though it is cheating.
Each algorithm has two hyperparameters that we tune. In our algorithm (PCID), we use $k$-means clustering instead of Algorithm 2, so one of the hyperparameters is the number of clusters $k$. The second one is the number of trajectories to collect in each outer iteration. For OracleQ, these are the learning rate and a confidence parameter . For QLearning, these are the learning rate and , a fraction of the 100K episodes over which to anneal the exploration probability linearly from 1 down to 0.01.
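The annealing schedule used for the QLearning baseline can be written down explicitly. The sketch below is one plausible reading of the description above: the exploration probability decreases linearly from 1 to 0.01 over the first `anneal_frac` share of the 100K episodes and stays at 0.01 afterwards. The function name and the behavior after annealing ends are assumptions.

```python
def exploration_prob(episode, anneal_frac, total_episodes=100_000,
                     start=1.0, end=0.01):
    """Linearly anneal the exploration probability, then hold it at `end`."""
    anneal_len = anneal_frac * total_episodes
    if episode >= anneal_len:
        return end
    return start + (end - start) * episode / anneal_len

# Example: anneal over the first 10% of the run.
schedule = [exploration_prob(t, 0.1) for t in (0, 5_000, 20_000)]
```

A small `anneal_frac` makes the baseline greedy early, while a large one keeps it close to uniform random exploration for most of the run.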
For both LockBernoulli and LockGaussian, we experiment with linear decoding functions, which we fit via ordinary least squares. For LockGaussian only, we also use two-layer neural networks. Specifically, these functions are of the form with the standard sigmoid activation, where the inner dimension is set to the clustering hyperparameter . These networks are trained using AdaGrad with a fixed learning rate of , for a maximum of 5K iterations. See Appendix C for more details on hyperparameters and training.
Experimental setup. We run the algorithms on all environments with varying , which also influences the dimension of the observation space. Each algorithm runs for 100K episodes and we say that it has solved the lock by episode if at round its running-average reward is . The time-to-solve is the smallest for which the algorithm has solved the lock. For each hyperparameter, we run 25 replicates with different randomizations of the environment and seeds, and we plot the median time-to-solve of the best hyperparameter setting (along with error bands corresponding to and percentiles) against the horizon .
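The time-to-solve metric can be computed from a sequence of per-episode rewards as sketched below. The solve threshold is left as a parameter because its value is not recoverable from this passage, and the function name and the convention of returning `None` for unsolved runs are assumptions.

```python
import numpy as np

def time_to_solve(rewards, threshold):
    """Smallest episode count t whose running-average reward reaches the threshold."""
    rewards = np.asarray(rewards, dtype=float)
    running_avg = np.cumsum(rewards) / np.arange(1, len(rewards) + 1)
    solved = np.nonzero(running_avg >= threshold)[0]
    return int(solved[0]) + 1 if solved.size else None

# Running averages here are 0, 0, 1/3, 1/2, 3/5, 2/3.
t_solved = time_to_solve([0, 0, 1, 1, 1, 1], threshold=0.5)
t_unsolved = time_to_solve([0, 0, 0], threshold=0.5)
```

Taking the median of this quantity over replicates, as in the experiments, requires a convention for unsolved runs, for example treating them as exceeding the episode budget.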
Results. The results are in Figure 1 in a log-linear plot. First, QLearning works well for small horizon problems but cannot solve problems with within 100K episodes, which is not surprising. (Footnote 7: We actually ran QLearning for 1M episodes and found it solves with 170K episodes.) The performance curve for QLearning is linear, revealing an exponential sample complexity, and demonstrating that these environments cannot be solved with naïve exploration. As a second observation, OracleQ performs extremely well, and as we verify in Appendix C demonstrates a linear scaling with . (Footnote 8: This is incomparable with the result in Jin et al. (2018) since we are not measuring regret here.)
In LockBernoulli, PCID is roughly a factor of 5 worse than the skyline OracleQ for all values of , but the curves have similar behavior. In Appendix C, we verify a near-linear scaling with , even better than predicted by our theory. Of course PCID is an exponential improvement over QLearning with $\epsilon$-greedy exploration here.
In LockGaussian with linear functions, the results are similar for the low-noise setting, but the performance of PCID degrades as the noise level increases. For example, with noise level , it fails to solve the stochastic problem with in 100K episodes. On the other hand, the performance is still quite good, and the scaling represents a dramatic improvement over QLearning.
Finally, PCID with neural networks is less robust to noise and stochasticity in LockGaussian. Here, with the algorithm is unable to solve the problem, both with and without stochasticity, but still does quite well with . The scaling with is still quite favorable.
Sensitivity analysis. Lastly, we perform a simple sensitivity analysis to assess how the hyperparameters influence the behavior of PCID. In Figure 2 we display a heatmap showing the running-average reward (taking median over 25 replicates) of the algorithm on the stochastic LockBernoulli environment with as we vary both and . The best parameter choice here is and . As we expect, if we underestimate either or the algorithm fails, either because it cannot identify all latent states, or it does not collect enough data to solve the induced regression problems. On the other hand, the algorithm is quite robust to overestimating both parameters, with a graceful degradation in performance.
Summary. We have shown on several rich-observation environments with both linear and nonlinear functions that PCID scales to large-horizon rich-observation problems. It dramatically outperforms tabular QLearning with $\epsilon$-greedy exploration, and is roughly a factor of 5 worse than OracleQ, an extremely effective tabular method, run on the corresponding tabular environment. Finally, the performance degrades gracefully as the assumptions are violated, and the algorithm is fairly robust to hyperparameter choices.
References
 Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.
 Azizzadenesheli et al. (2016a) Azizzadenesheli, K., Lazaric, A., and Anandkumar, A. Reinforcement learning of POMDPs using spectral methods. In Conference on Learning Theory, 2016a.
 Azizzadenesheli et al. (2016b) Azizzadenesheli, K., Lazaric, A., and Anandkumar, A. Reinforcement learning in rich-observation MDPs using spectral methods. arXiv:1611.03907, 2016b.
 Bagnell et al. (2004) Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, 2004.
 Brafman & Tennenholtz (2002) Brafman, R. I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.
 Dann et al. (2018) Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. On oracle-efficient PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, 2018.
 Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
 Givan et al. (2003) Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
 Hallak et al. (2013) Hallak, A., DiCastro, D., and Mannor, S. Model selection in Markovian processes. In International Conference on Knowledge Discovery and Data Mining, 2013.
 Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
 Jiang et al. (2015) Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, 2015.
 Jiang et al. (2017) Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.
 Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, 2018.
 Johnson et al. (2016) Johnson, M., Hofmann, K., Hutton, T., and Bignell, D. The Malmo Platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence, 2016.
 Kearns & Singh (2002) Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
 Krishnamurthy et al. (2016) Krishnamurthy, A., Agarwal, A., and Langford, J. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, 2016.
 Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, 2012.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Humanlevel control through deep reinforcement learning. Nature, 2015.
 Oh et al. (2017) Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, 2017.
 Ortner et al. (2014) Ortner, R., Maillard, O.-A., and Ryabko, D. Selecting near-optimal approximate state representations in reinforcement learning. In International Conference on Algorithmic Learning Theory, 2014.
 Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.
 Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. Count-based exploration with neural density models. In International Conference on Machine Learning, 2017.
 Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.
 Silver et al. (2017) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. In International Conference on Machine Learning, 2017.
 Weissman et al. (2003) Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
 Whitt (1978) Whitt, W. Approximations of dynamic programs, I. Mathematics of Operations Research, 1978.
Appendix A Comparison of BMDPs with other related frameworks
The problem setup in a BMDP is closely related to the literature on state abstractions, as our decoding function can be viewed as an abstraction over the rich context space. Since we learn the decoding function instead of assuming it is given, it is worth comparing to the literature on state abstraction learning. The most popular notion of abstraction in model-based RL is bisimulation (Whitt, 1978; Givan et al., 2003), which is more general than our setup since our context is sampled i.i.d. conditioned on the hidden state (the irrelevant factor discarded by a bisimulation may not be i.i.d.). Such generality comes at a cost, as learning good abstractions turns out to be very challenging. The very few results that come with finite sample guarantees can only handle a small number of candidate abstractions (Hallak et al., 2013; Ortner et al., 2014; Jiang et al., 2015). In contrast, we are able to learn a good decoding function from an exponentially large and unstructured family (that is, the decoding functions combined with the state encodings ).
The setup and algorithmic ideas in our paper are related to the work of Azizzadenesheli et al. (2016a, b), but we are able to handle continuous observation spaces with no direct dependence on the number of unique contexts due to the use of function approximation. The recent setup of Contextual Decision Processes (CDPs) with low Bellman rank, introduced by Jiang et al. (2017), is a strict generalization of BMDPs (the Bellman rank of any BMDP is at most ). The additional assumptions made in our work enable the development of a computationally efficient algorithm, unlike in their general setup. Most similar to our work, Dann et al. (2018) study a subclass of CDPs with low Bellman rank where the transition dynamics are deterministic. [Footnote 9: While not explicitly assumed in their work, the assumption of the optimal policy and value functions depending only on the current observation and not the hidden state is most reasonable when the observations are disjoint across hidden states, as in this work.] However, instead of the deterministic dynamics in Dann et al., we consider stochastic dynamics with certain reachability and separability conditions. As we note in Section 4, these assumptions are trivially valid under deterministic transitions. In terms of the realizability assumptions, Assumption 3.1 posits the realizability of a decoding function, while Dann et al. (2018) assume realizability of the optimal value function. These assumptions are not directly comparable, but are both reasonable if the decoding and value functions implicitly first map the contexts to hidden states, followed by a tabular function as discussed after Assumption 3.1. Finally, as noted by Dann et al., certain empirical RL benchmarks such as visual grid world are captured reasonably well in our setting.
On the empirical side, Pathak et al. (2017) learn an encoding function that compresses the rich observations to a low-dimensional representation, which serves a similar purpose as our decoding function, using prediction errors in the low-dimensional space to drive exploration. This approach has weaknesses, as it cannot cope with stochastic transition structures. Given this, our work can also be viewed as a rigorous fix for these types of empirical heuristics.
Appendix B Incorporating Rewards in BMDPs
At a high level, there are two natural choices for modeling rewards in a BMDP. In some cases, the rewards might only depend on the latent state. This is analogous to how rewards are typically modeled in the POMDP literature, for instance, and respects the semantics that is indeed a valid state to describe an optimal policy or value function. For such problems, finding a near-optimal policy or value function building on Algorithms 1 or 4 is relatively straightforward. Note that along with the policy cover, our algorithms implicitly construct an approximately correct dynamics model in the latent state space, as well as decoding functions which map contexts to the latent states generating them with a small error probability. While these objects are explicit in Algorithm 1, they are implicit in Algorithm 4, since each policy in the cover reaches a unique latent state with probability 1, so that we do not need any decoding function, as highlighted before. Indeed, for deterministic BMDPs, given the policy cover, we do not need the dynamics model at all to maximize a state-dependent reward, as shown in Corollary 4.1. For stochastic BMDPs, given any reward function, we can simply plan within the dynamics model over the latent states to obtain a near-optimal policy as a function of the latent state. We construct a policy as a function of contexts by first decoding the context using and then applying the near-optimal policy over latent states found above. As we show in the following sections, there are parameters and controlled by our algorithms, such that the policy found using the procedure described above is at most suboptimal.
In the second scenario, where the reward depends on contexts, the optimal policies and value functions cannot be constructed using the latent states alone. However, our policy cover can still be used to generate a good exploration dataset for subsequent use in off-policy RL algorithms, as it guarantees good coverage for each state-action pair. Concretely, if we use value-function approximation, then the dataset can be fed into an approximate dynamic programming (ADP) algorithm (e.g., FQI; Ernst et al., 2005). Given a good exploration dataset, these approaches succeed under certain representational assumptions on the value-function class (Antos et al., 2008). Similarly, one can use PSDP-style policy learning methods from such a dataset (Bagnell et al., 2004).
We conclude this subsection by observing that in reward maximization for RL, most works seek either a PAC or a regret guarantee. Our approach of first constructing a policy cover and then learning policies or value functions naturally aligns with the PAC criterion, but not with regret minimization. Nevertheless, as we see in our empirical evaluation, our approach still performs well in terms of regret on challenging RL benchmarks. We wrap up this section by discussing the relationship of the BMDP framework with related problem settings in the literature.
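The plan-then-decode procedure for state-dependent rewards can be illustrated with a minimal sketch, assuming a tabular, level-indexed latent model. The names `plan_latent`, `context_policy`, and `decoder`, as well as the toy two-state model, are hypothetical and introduced only for illustration; they are not the objects produced by Algorithm 1.

```python
def plan_latent(P, R, H):
    """Backward induction over a tabular latent model.

    P[h][s][a] is a dict {next_state: probability} at level h and
    R[h][s][a] is the reward for action a in latent state s at level h.
    Returns (policy, values), where policy[h][s] is a greedy action.
    """
    n_states = len(P[0])
    values = [[0.0] * n_states for _ in range(H + 1)]
    policy = [[0] * n_states for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(n_states):
            q = [R[h][s][a]
                 + sum(p * values[h + 1][s2] for s2, p in P[h][s][a].items())
                 for a in range(len(P[h][s]))]
            best = max(range(len(q)), key=q.__getitem__)
            policy[h][s] = best
            values[h][s] = q[best]
    return policy, values

def context_policy(decoder, latent_policy):
    """Compose a decoder (level, context) -> latent state with a latent policy."""
    return lambda h, x: latent_policy[h][decoder(h, x)]

# Toy model (an assumption): state 0 is rewarding at the last level, state 1 is absorbing.
P0 = [[{0: 0.9, 1: 0.1}, {1: 1.0}], [{1: 1.0}, {1: 1.0}]]
P = [P0, P0]
R = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]
policy, values = plan_latent(P, R, H=2)

# Toy decoder (an assumption): the context is (latent state, noise).
decoder = lambda h, x: x[0]
pi = context_policy(decoder, policy)
action = pi(0, (0, 0.37))
```

In this sketch the planner never sees contexts; only the final composition with the decoder turns the latent policy into a policy over contexts, which mirrors the two-stage construction described above.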
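As a rough illustration of feeding an exploration dataset into an ADP method, the following is a minimal finite-horizon fitted Q iteration sketch. It makes several assumptions that are not part of the paper: the dataset is organized by level as `(x, a, r, x_next)` tuples, `featurize` is a hypothetical feature map, and the regression step is a per-key mean (least squares over one-hot features) rather than the tree-based regressors of Ernst et al. (2005).

```python
from collections import defaultdict

def fit_mean_regressor(inputs, targets):
    """Least-squares fit over one-hot keys, i.e. the mean target per key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(inputs, targets):
        sums[key] += y
        counts[key] += 1
    means = {k: sums[k] / counts[k] for k in sums}
    return lambda key: means.get(key, 0.0)

def fitted_q_iteration(dataset, H, actions, featurize):
    """Finite-horizon FQI; dataset[h] is a list of (x, a, r, x_next) tuples."""
    q = [None] * (H + 1)
    q[H] = lambda key: 0.0  # no reward after the final level
    for h in reversed(range(H)):
        inputs, targets = [], []
        for x, a, r, x_next in dataset[h]:
            # Regression target: reward plus best fitted value at the next level.
            nxt = max(q[h + 1]((featurize(x_next), b)) for b in actions)
            inputs.append((featurize(x), a))
            targets.append(r + nxt)
        q[h] = fit_mean_regressor(inputs, targets)
    return q

# Toy dataset (an assumption): contexts are strings whose first letter is the feature.
dataset = [
    [("A1", 0, 0.0, "B7"), ("A2", 1, 0.0, "C3"), ("A9", 0, 0.0, "B2")],
    [("B7", 0, 1.0, "end"), ("B2", 1, 0.0, "end"),
     ("C3", 0, 0.0, "end"), ("C5", 1, 0.0, "end")],
]
q = fitted_q_iteration(dataset, H=2, actions=[0, 1], featurize=lambda x: x[0])
greedy_first_action = max([0, 1], key=lambda a: q[0](("A", a)))
```

The sketch depends on the dataset covering every relevant state-action pair at each level, which is the property the policy cover is meant to provide.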
Appendix C Experimental Details and Reproducibility Checklist
C.1 Implementation Details
Environment transition diagram.
The hidden state transition diagram for the Lock environment is displayed in Figure 3.