1 Introduction
Markov Decision Processes (MDPs) are natural models for a wide variety of sequential decision-making problems. It is well-known that the optimal control problem in MDPs can be solved, in principle, by standard algorithms such as value and policy iteration. These algorithms, however, are often not directly applicable to practical MDP problems, for several reasons. First, they do not scale computationally, as their complexity grows quickly with the size of the state space, and especially so for continuous state spaces. Second, in problems with complicated dynamics, the transition kernel of the underlying MDP is often unknown, or an accurate model thereof is lacking. To circumvent these difficulties, many model-free Reinforcement Learning (RL) algorithms have been proposed, in which one estimates the relevant quantities of the MDP (e.g., the value functions or the optimal policies) from observed data generated by simulating the MDP.
A popular model-free RL algorithm is the so-called Q-learning [47], which directly learns the optimal action-value function (or Q function) from observations of the system trajectories. A major advantage of Q-learning is that it can be implemented in an online, incremental fashion: Q-learning can be run as data is being sequentially collected from the system operated/simulated under some policy, and it continuously refines its estimates as new observations become available. The behavior of standard Q-learning in finite state-action problems is by now reasonably well understood; in particular, both asymptotic and finite-sample convergence guarantees have been established [43, 22, 41, 18].
In this paper, we consider the general setting with continuous state spaces. For such problems, existing algorithms typically make use of a parametric function approximation method, such as a linear approximation [27], to learn a compact representation of the action-value function. In many of the recently popularized applications of Q-learning, much more expressive function approximation methods, such as deep neural networks, have been utilized. Such approaches have enjoyed recent empirical success in game playing and robotics problems [38, 29, 14]. Parametric approaches typically require careful selection of the approximation method and parametrization (e.g., the architecture of the neural networks). Further, rigorous convergence guarantees of Q-learning with deep neural networks are relatively less understood. In comparison, non-parametric approaches are, by design, more flexible and versatile. However, in the context of model-free RL with continuous state spaces, the convergence behavior and finite-sample analysis of non-parametric approaches are less understood.

Summary of results. In this work, we consider a natural combination of Q-learning with kernel-based nearest neighbor regression for continuous state-space MDP problems, termed Nearest-Neighbor based Q-Learning (NNQL). As the main result, we provide a finite-sample analysis of NNQL for a single, arbitrary sequence of data, for any infinite-horizon discounted-reward MDP with continuous state space. In particular, we show that the algorithm outputs an accurate (with respect to the supremum norm) estimate of the optimal Q-function with high probability, using a number of observations that depends polynomially on the accuracy parameter $1/\varepsilon$, the model parameters, and the "cover time" of the sequence of data, or trajectory, utilized. For example, if the data is sampled per a completely random policy, then our generic bound suggests that the number of samples scales as $\tilde{O}(1/\varepsilon^{d+3})$, where $d$ is the dimension of the state space. We establish an effectively matching lower bound stating that, for any policy to learn the optimal Q-function within $\varepsilon$ approximation, the number of samples required must scale as $\Omega(1/\varepsilon^{d+2})$. In that sense, our method is nearly optimal.
Our analysis consists of viewing our algorithm as a special case of a general biased stochastic approximation procedure, for which we establish non-asymptotic convergence guarantees. Key to our analysis is a careful characterization of the bias effect induced by the nearest-neighbor approximation of the population Bellman operator, as well as of the statistical estimation error due to the variance of finite, dependent samples. Specifically, the resulting Bellman nearest-neighbor operator allows us to connect the update rule of NNQL to a class of stochastic approximation algorithms that have biased, noisy updates. Note that traditional results from stochastic approximation rely on unbiased updates and asymptotic analysis [35, 43]. A key step in our analysis involves decomposing the update into two sub-updates, which bears some similarity to the technique used by [22]. Our results improve upon this technique by characterizing the finite-sample convergence rates of the two sub-updates.

In summary, the salient features of our work are:


Unknown system dynamics: We assume that the transition kernel and reward function of the MDP are unknown. Consequently, we cannot exactly evaluate the expectations required in standard dynamic programming algorithms (e.g., value/policy iteration). Instead, we consider a sample-based approach, which learns the optimal value functions/policies by directly observing data generated by the MDP.

Single sample path: We are given a single, sequential sample path obtained from the MDP operated under an arbitrary policy. This in particular means that the observations used for learning are dependent. Existing work often studies the easier setting where samples can be generated at will; that is, one can sample any number of (independent) transitions from any given state, or reset the system to any initial state (for example, the Parallel Sampling model in [23]). We do not assume such capabilities, but instead deal with the realistic, challenging setting of a single path.

Online computation: We assume that data arrives sequentially rather than all at once. Estimates are updated in an online fashion upon observing each new sample. Moreover, as in standard Q-learning, our approach does not store old data. In particular, our approach differs from batch methods, which must wait for all data to be received before starting computation and which require multiple passes over the data. Therefore, our approach is space efficient, and hence can handle the data-rich scenario with a large, increasing number of samples.

Non-asymptotic, near-optimal guarantees: We characterize the finite-sample convergence rate of our algorithm; that is, how many samples are needed to achieve a given accuracy for estimating the optimal value function. Our analysis is nearly tight, in that we establish a lower bound that nearly matches our generic upper bound when the latter is specialized to the setting where data is generated per a random policy, or more generally, any policy with a random exploration component.
While there is a large and growing literature on Reinforcement Learning for MDPs, to the best of our knowledge, ours is the first result on Q-learning that simultaneously has all four of the above features. We summarize the comparison with relevant prior work in Table 1. A detailed discussion can be found in Appendix A.
2 Setup
In this section, we introduce the notation and definitions for the framework of Markov Decision Processes that will be used throughout the paper. We also precisely define the question of interest.
Notation. For a metric space $(E, \rho)$ endowed with metric $\rho$, we denote by $\mathcal{B}(E)$ the set of all bounded and measurable functions on $E$. For each $f \in \mathcal{B}(E)$, let $\|f\|_\infty := \sup_{x \in E} |f(x)|$ be the supremum norm, which turns $(\mathcal{B}(E), \|\cdot\|_\infty)$ into a Banach space. Let $\mathrm{Lip}(E, M)$ denote the set of Lipschitz continuous functions on $E$ with Lipschitz bound $M$, i.e.,
$$\mathrm{Lip}(E, M) := \big\{ f \in \mathcal{B}(E) : |f(x) - f(y)| \le M \rho(x, y), \ \forall x, y \in E \big\}.$$
The indicator function is denoted by $\mathbb{I}\{\cdot\}$. For each integer $K \ge 1$, let $[K] := \{1, 2, \ldots, K\}$.
Markov Decision Process. We consider a general setting where an agent interacts with a stochastic environment. This interaction is modeled as a discrete-time discounted Markov decision process (MDP). An MDP is described by a five-tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space, respectively. We shall utilize $t \in \mathbb{N}$ to denote time. Let $x_t$ be the state at time $t$, and let $a_t$ denote the action chosen at time $t$. The state evolution is Markovian, as per some transition probability kernel with density $p$ (with respect to the Lebesgue measure on $\mathcal{S}$). That is,
$$\Pr\big\{ x_{t+1} \in B \mid x_t = x,\ a_t = a \big\} = \int_B p(y \mid x, a)\, dy \qquad (1)$$
for any measurable set $B \subseteq \mathcal{S}$. The one-stage reward earned at time $t$ is a random variable $R_t$ with expectation $\mathbb{E}[R_t \mid x_t = x, a_t = a] = r(x, a)$, where $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the expected reward function. Finally, $\gamma \in (0, 1)$ is the discount factor, and the overall reward of interest is $\sum_{t=0}^{\infty} \gamma^t R_t$. The goal is to maximize the expected value of this reward. Here we consider a distance function $\rho: \mathcal{S} \times \mathcal{S} \to \mathbb{R}_+$ so that $(\mathcal{S}, \rho)$ forms a metric space. For ease of exposition, we write $\mathcal{Z} := \mathcal{S} \times \mathcal{A}$ for the joint state-action space.

We start with the following standard assumptions on the MDP:
Assumption 1 (MDP Regularity).
We assume that: (A1.) The continuous state space $\mathcal{S}$ is a compact subset of $\mathbb{R}^d$; (A2.) The action space $\mathcal{A}$ is a finite set of cardinality $|\mathcal{A}|$; (A3.) The one-stage reward $R_t$ is non-negative and uniformly bounded by $R_{\max}$, i.e., $0 \le R_t \le R_{\max}$ almost surely, and for each $a \in \mathcal{A}$, $r(\cdot, a) \in \mathrm{Lip}(\mathcal{S}, M_r)$ for some $M_r > 0$; (A4.) The transition probability kernel satisfies
$$|p(y \mid x, a) - p(y \mid x', a)| \le W(y)\, \rho(x, x'), \qquad \forall x, x', y \in \mathcal{S},\ a \in \mathcal{A},$$
where the function $W: \mathcal{S} \to \mathbb{R}_+$ satisfies $\int_{\mathcal{S}} W(y)\, dy \le M_p$ for some constant $M_p > 0$.
The first two assumptions state that the state space is compact and the action space is finite. The third and fourth stipulate that the reward and the transition kernel are Lipschitz continuous (as functions of the current state). Our Lipschitz assumptions are identical to (or less restrictive than) those used in the work of [36], [11], and [17]. In general, this type of Lipschitz continuity assumption is standard in the literature on MDPs with continuous state spaces; see, e.g., the work of [15, 16] and [6].
A Markov policy $\pi(\cdot \mid x)$ gives the probability of performing action $a \in \mathcal{A}$ given the current state $x$. A deterministic policy assigns to each state a unique action. The value function for each state $x$ under policy $\pi$, denoted by $V^{\pi}(x)$, is defined as the expected discounted sum of rewards received when following the policy $\pi$ from initial state $x$, i.e., $V^{\pi}(x) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid x_0 = x\big]$. The action-value function under policy $\pi$ is defined by $Q^{\pi}(x, a) := r(x, a) + \gamma\, \mathbb{E}\big[V^{\pi}(x_1) \mid x_0 = x, a_0 = a\big]$. The number $Q^{\pi}(x, a)$ is called the Q-value of the pair $(x, a)$, which is the return of initially performing action $a$ at state $x$ and then following policy $\pi$. Define $V_{\max} := R_{\max}/(1 - \gamma)$.

Since all the rewards are bounded by $R_{\max}$, it is easy to see that the value function of every policy is bounded by $V_{\max}$ [18, 40]. The goal is to find an optimal policy that maximizes the value from every start state. The optimal value function is defined as $V^*(x) := \sup_{\pi} V^{\pi}(x)$, and the optimal action-value function as $Q^*(x, a) := \sup_{\pi} Q^{\pi}(x, a)$. The Bellman optimality operator $F$ is defined as
$$(FQ)(x, a) := r(x, a) + \gamma \int_{\mathcal{S}} \max_{b \in \mathcal{A}} Q(y, b)\, p(y \mid x, a)\, dy.$$
It is well known that $F$ is a contraction with factor $\gamma$ on the Banach space $(\mathcal{B}(\mathcal{S} \times \mathcal{A}), \|\cdot\|_\infty)$ [7, Chap. 1]. The optimal action-value function $Q^*$ is the unique solution of the Bellman equation $Q = FQ$ in $\mathcal{B}(\mathcal{S} \times \mathcal{A})$. In fact, under our setting, it can be shown that $Q^*$ is bounded and Lipschitz. This is stated below and established in Appendix B.
Lemma 1.
Under Assumption 1, the optimal action-value function $Q^*$ satisfies $\|Q^*\|_\infty \le V_{\max}$ and $Q^*(\cdot, a) \in \mathrm{Lip}(\mathcal{S}, M_Q)$ for each $a \in \mathcal{A}$, where $M_Q := M_r + \gamma V_{\max} M_p$.
3 Reinforcement Learning Using Nearest Neighbors
In this section, we present the nearest-neighbor-based reinforcement learning algorithm. The algorithm is based on constructing a finite-state discretization of the original MDP, and combining Q-learning with nearest neighbor regression to estimate the Q-values over the discretized state space, which are then interpolated and extended to the original continuous state space. In what follows, we first describe several building blocks of the algorithm in Sections 3.1–3.4, and then summarize the algorithm in Section 3.5.

3.1 State Space Discretization
Let $h > 0$ be a pre-specified scalar parameter. Since the state space $\mathcal{S}$ is compact, one can find a finite set $X_h := \{c_1, c_2, \ldots, c_{N_h}\}$ of points in $\mathcal{S}$ such that
$$\min_{i \in [N_h]} \rho(x, c_i) < h, \qquad \forall x \in \mathcal{S}.$$
The finite grid $X_h$ is called an $h$-net of $\mathcal{S}$, and its cardinality $N_h$ can be chosen to be the $h$-covering number of the metric space $(\mathcal{S}, \rho)$. Define $\mathcal{Z}_h := X_h \times \mathcal{A}$, the discretized state-action space. Throughout this paper, we denote by $B_i$ the ball centered at $c_i$ with radius $h$; that is, $B_i := \{x \in \mathcal{S} : \rho(x, c_i) \le h\}$.
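To make the construction concrete, here is a minimal sketch that builds an $h$-net for the unit cube $[0, 1]^d$ under the sup-metric; the uniform-grid construction and the helper name build_h_net are illustrative assumptions, not part of the algorithm specification.

```python
import numpy as np

def build_h_net(d, h):
    """Build a finite h-net of [0, 1]^d under the sup-metric: every point
    of the cube lies within distance h of some center. A uniform grid
    with spacing 2h suffices, and its size N_h scales as (1/h)^d."""
    n = int(np.ceil(1.0 / (2 * h)))                         # centers per axis
    per_axis = np.minimum((2 * np.arange(n) + 1) * h, 1.0)  # h, 3h, 5h, ...
    grid = np.meshgrid(*([per_axis] * d), indexing="ij")
    return np.stack([g.ravel() for g in grid], axis=1)      # shape (N_h, d)

centers = build_h_net(d=2, h=0.1)
print(centers.shape[0])  # N_h = 25 centers for d = 2, h = 0.1
```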
3.2 Nearest Neighbor Regression
Suppose that we are given estimated Q-values for the finite subset of states $X_h$, denoted by $q = \{q(c_i, a)\}_{i \in [N_h], a \in \mathcal{A}}$. For each state-action pair $(x, a) \in \mathcal{S} \times \mathcal{A}$, we can predict its Q-value via a regression method. We focus on non-parametric regression operators that can be written as nearest-neighbor averaging in terms of the data $q$, of the form
$$(\Gamma_{\mathrm{NN}}\, q)(x, a) = \sum_{i=1}^{N_h} K(c_i, x)\, q(c_i, a), \qquad (2)$$
where $K(c_i, x) \ge 0$ is a weighting kernel function satisfying $\sum_{i=1}^{N_h} K(c_i, x) = 1$ for each $x \in \mathcal{S}$. Equation (2) defines the so-called Nearest Neighbor (NN) operator $\Gamma_{\mathrm{NN}}$, which maps the space $\mathcal{B}(X_h \times \mathcal{A})$ into the set of all bounded functions over $\mathcal{S} \times \mathcal{A}$. Intuitively, in (2) one assesses the Q-value of $(x, a)$ by looking at the training data where the action $a$ has been applied, and by averaging their values. It can be easily checked that the operator $\Gamma_{\mathrm{NN}}$ is non-expansive, in the following sense:
$$\|\Gamma_{\mathrm{NN}}\, q - \Gamma_{\mathrm{NN}}\, q'\|_\infty \le \|q - q'\|_\infty. \qquad (3)$$
This property will be used crucially for establishing our results. The kernel $K$ is assumed to satisfy
$$K(c_i, x) = 0 \quad \text{whenever} \quad \rho(c_i, x) > h, \qquad (4)$$
where $h$ is the discretization parameter defined in Section 3.1. (This assumption is not absolutely necessary, but is imposed to simplify the subsequent analysis. In general, our results hold as long as $K(c_i, x)$ decays sufficiently fast with the distance $\rho(c_i, x)$.) This means that the values of states located in the neighborhood of $x$ are more influential in the averaging procedure (2). There are many possible choices for $K$. In Appendix C we describe three representative choices, corresponding to $k$-Nearest-Neighbor Regression, Fixed-Radius Near Neighbor Regression, and Kernel Regression.
3.3 A Joint BellmanNN Operator
Now we define the joint Bellman-NN (Nearest Neighbor) operator. As will become clear subsequently, it is this operator that the algorithm aims to approximate, and hence it plays a crucial role in the subsequent analysis.

For a function $q \in \mathcal{B}(X_h \times \mathcal{A})$, we denote by $\Gamma_{\mathrm{NN}}\, q$ the nearest-neighbor average extension of $q$ to $\mathcal{S} \times \mathcal{A}$; that is, $(\Gamma_{\mathrm{NN}}\, q)(x, a)$ is given by Equation (2). The joint Bellman-NN operator $G$ on $\mathcal{B}(X_h \times \mathcal{A})$ is defined by composing the original Bellman operator $F$ with the NN operator $\Gamma_{\mathrm{NN}}$ and then restricting to $X_h \times \mathcal{A}$; that is, for each $(c_i, a) \in X_h \times \mathcal{A}$,
$$(G q)(c_i, a) := (F\, \Gamma_{\mathrm{NN}}\, q)(c_i, a) = r(c_i, a) + \gamma \int_{\mathcal{S}} \max_{b \in \mathcal{A}} (\Gamma_{\mathrm{NN}}\, q)(y, b)\, p(y \mid c_i, a)\, dy. \qquad (5)$$
It can be shown that $G$ is a contraction operator with modulus $\gamma$ mapping $\mathcal{B}(X_h \times \mathcal{A})$ to itself, thus admitting a unique fixed point, denoted by $q_h^*$; see Appendix E.2.
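For completeness, the contraction claim follows in one line from the contraction of $F$ and the non-expansiveness property (3):

```latex
\|Gq - Gq'\|_{\infty}
  \le \|F\,\Gamma_{\mathrm{NN}}\, q - F\,\Gamma_{\mathrm{NN}}\, q'\|_{\infty}
  \le \gamma\, \|\Gamma_{\mathrm{NN}}\, q - \Gamma_{\mathrm{NN}}\, q'\|_{\infty}
  \le \gamma\, \|q - q'\|_{\infty},
```

where the first inequality holds because restriction to $X_h \times \mathcal{A}$ cannot increase the supremum norm.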
3.4 Covering Time of Discretized MDP
As detailed in Section 3.5 to follow, our algorithm uses data generated by an arbitrary policy $\pi$ for the purpose of learning. The goal of our approach is to estimate the Q-values of every state. For there to be any hope of learning something about the value of a given state, this state (or its neighbors) must be visited at least once. Therefore, to study the convergence rate of the algorithm, we need a way to quantify how frequently the policy $\pi$ samples from different regions of the state-action space $\mathcal{S} \times \mathcal{A}$.

Following the approach taken by [18] and [3], we introduce the notion of the covering time of the MDP under a policy $\pi$. This notion is particularly suitable for our setting, as our algorithm is based on asynchronous Q-learning (that is, we are given a single, sequential trajectory of the MDP, where at each time step one state-action pair is observed and updated), and the policy $\pi$ may be non-stationary. In our continuous state space setting, the covering time is defined with respect to the discretized space $X_h$, as follows:
Definition 1 (Covering time of discretized MDP).
For each $i \in [N_h]$ and $a \in \mathcal{A}$, a ball-action pair $(B_i, a)$ is said to be visited at time $t$ if $x_t \in B_i$ and $a_t = a$. The discretized state-action space $\mathcal{Z}_h$ is covered by the policy $\pi$ if all the ball-action pairs are visited at least once under the policy $\pi$. Define $\tau_{\pi, h}(x, t)$, the covering time of the MDP under the policy $\pi$, as the minimum number of steps required to visit all ball-action pairs, starting from state $x$ at time step $t$. Formally, $\tau_{\pi, h}(x, t)$ is defined as
$$\tau_{\pi, h}(x, t) := \min\big\{ s \ge 0 : \text{every pair } (B_i, a),\ i \in [N_h],\ a \in \mathcal{A}, \text{ is visited during steps } t, \ldots, t + s \text{ under } \pi, \text{ given } x_t = x \big\},$$
with the convention that the minimum over an empty set is $\infty$.
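Definition 1 can be checked empirically on a recorded trajectory. The sketch below (with assumed array-based inputs and Euclidean $\rho$) returns the number of steps until every ball-action pair is visited, or None if the trajectory never covers them.

```python
import numpy as np

def covering_time(states, actions, centers, h, num_actions):
    """Empirical covering time per Definition 1: the pair (B_i, a) is
    visited at step t if rho(x_t, c_i) <= h and a_t = a; return the
    first step count at which no pair remains unvisited."""
    unseen = {(i, a) for i in range(len(centers)) for a in range(num_actions)}
    for t, (x, a) in enumerate(zip(states, actions)):
        dist = np.linalg.norm(centers - x, axis=1)
        for i in np.flatnonzero(dist <= h):   # a state may lie in several balls
            unseen.discard((int(i), int(a)))
        if not unseen:
            return t + 1
    return None  # minimum over an empty set: the trajectory never covers
```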
We shall assume that there exists a policy $\pi$ with bounded expected covering time, which guarantees that, asymptotically, all the ball-action pairs are visited infinitely many times under the policy $\pi$.
Assumption 2.
There exists an integer $L_h < \infty$ such that $\mathbb{E}[\tau_{\pi, h}(x, t)] \le L_h$ for all $x \in \mathcal{S}$ and all $t \ge 0$. Here the expectation is with respect to the randomness introduced by the Markov kernel of the MDP as well as by the policy $\pi$.
In general, the covering time can be large in the worst case. In fact, even with a finite state space, it is easy to find examples where the covering time is exponential in the number of states for every policy. For instance, consider an MDP with states $1, 2, \ldots, n$, where from any state the chain is reset to state 1 with probability $1/2$, regardless of the action taken, and otherwise moves one state forward. Then every policy takes time exponential in $n$ to reach state $n$ starting from state 1, leading to an exponential covering time.
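A quick Monte Carlo check of this example is given below; the move-forward-by-one dynamics are an assumed concrete instantiation of the reset chain (the text above only specifies the reset), so the script is purely illustrative.

```python
import random

def time_to_reach(n):
    """Steps for the reset chain to first reach state n from state 1:
    each step resets to state 1 w.p. 1/2 and otherwise advances by one."""
    state, steps = 1, 0
    while state < n:
        steps += 1
        state = 1 if random.random() < 0.5 else state + 1
    return steps

for n in (6, 10, 14):
    avg = sum(time_to_reach(n) for _ in range(100)) / 100
    print(n, round(avg))  # grows roughly like 2^n: exponential covering time
```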
To avoid such bad cases, some additional assumptions are needed to ensure that the MDP is well-behaved. For such MDPs, there is a variety of policies that have a small covering time. Below we focus on a class of MDPs satisfying a form of uniform ergodicity assumption, and show that the standard $\varepsilon$-greedy policy (which includes the purely random policy as a special case, by setting $\varepsilon = 1$) has a small covering time. This is done in the following two propositions. Proofs can be found in Appendix D.
Proposition 1.
Suppose that the MDP satisfies the following: there exist a probability measure $\nu$ on $\mathcal{S}$, a number $\varphi > 0$, and an integer $m \ge 1$ such that for all $x \in \mathcal{S}$, all $t \ge 0$, and all policies $\pi$,
$$\Pr\big\{ x_{t+m} \in \cdot \mid x_t = x \big\} \ge \varphi\, \nu(\cdot). \qquad (6)$$
Let $\nu_{\min} := \min_{i \in [N_h]} \nu(B_i)$, where we recall that $N_h$ is the cardinality of the discretized state space $X_h$. Then the expected covering time of the $\varepsilon$-greedy policy is upper bounded by $O\!\Big( \frac{m\, |\mathcal{A}| \log(N_h |\mathcal{A}|)}{\varepsilon\, \varphi\, \nu_{\min}} \Big)$.
Proposition 2.
Suppose that the MDP satisfies the following: there exist a probability measure $\nu$ on $\mathcal{S}$, a number $\varphi > 0$, and an integer $m \ge 1$ such that for all $x \in \mathcal{S}$ and all $t \ge 0$, there exists a sequence of actions $a^{(1)}, \ldots, a^{(m)}$ with
$$\Pr\big\{ x_{t+m} \in \cdot \mid x_t = x,\ a_t = a^{(1)}, \ldots, a_{t+m-1} = a^{(m)} \big\} \ge \varphi\, \nu(\cdot). \qquad (7)$$
Let $\nu_{\min} := \min_{i \in [N_h]} \nu(B_i)$, where we recall that $N_h$ is the cardinality of the discretized state space $X_h$. Then the expected covering time of the $\varepsilon$-greedy policy is upper bounded by $O\!\Big( \frac{m\, (|\mathcal{A}|/\varepsilon)^{m+1} \log(N_h |\mathcal{A}|)}{\varphi\, \nu_{\min}} \Big)$.
3.5 Qlearning using Nearest Neighbor
We now describe the nearest-neighbor Q-learning (NNQL) policy. Like Q-learning, it is a model-free policy for solving the MDP. Unlike standard Q-learning, it is (relatively) efficient to implement, as it does not require learning the Q-function over the entire space $\mathcal{S} \times \mathcal{A}$. Instead, we utilize the nearest-neighbor regressed Q-function, using the learned Q-values restricted to $X_h$. The policy assumes access to an existing policy $\pi$ (which is sometimes called the "exploration policy", and need not have any optimality properties) that is used to sample data points for learning.
The pseudo-code of NNQL is described in Policy 1. At each time step $t$, the action $a_t$ is performed from state $x_t$ as per the given (potentially non-optimal) policy $\pi$, and the next state $x_{t+1}$ is generated according to $p(\cdot \mid x_t, a_t)$. Note that the sequence of observed states takes continuous values in the state space $\mathcal{S}$.
The policy runs over iterations, with each iteration lasting for a number of time steps. Let $k$ denote the iteration count, and let $T_k$ denote the time when iteration $k$ starts, for $k \ge 0$. Initially, $k = 0$ and $T_0 = 0$; for $t \in [T_k, T_{k+1})$, the policy is in iteration $k$. The iteration count is updated from $k$ to $k + 1$ when, starting with $T_k$, all ball-action pairs have been visited at least once; that is, $T_{k+1} := \min\{ t > T_k : \text{all ball-action pairs are visited during } [T_k, t] \}$. In the policy description, the counter $N^t(i, a)$ records how many times the ball-action pair $(B_i, a)$ has been visited from the beginning of iteration $k$ up to the current time $t$; that is, $N^t(i, a) := \sum_{s = T_k}^{t} \mathbb{I}\{ x_s \in B_i,\ a_s = a \}$. By definition, the iteration ends at the first time step for which $\min_{i, a} N^t(i, a) \ge 1$.
During each iteration, the policy keeps track of the Q-function over the finite set $X_h \times \mathcal{A}$. Specifically, let $q^k$ denote the approximate Q-values on $X_h \times \mathcal{A}$ within iteration $k$. The policy also maintains $g^t$, which is a biased empirical estimate of the joint Bellman-NN operator $G$ applied to the current estimate $q^k$. At each time step $t$ within iteration $k$, if the current state $x_t$ falls in the ball $B_i$, then the corresponding value is updated as
$$g^{t+1}(c_i, a_t) = (1 - \eta_t)\, g^{t}(c_i, a_t) + \eta_t \Big( R_t + \gamma \max_{b \in \mathcal{A}} \big(\Gamma_{\mathrm{NN}}\, q^k\big)(x_{t+1}, b) \Big), \qquad (8)$$
where $\eta_t := 1 / N^{t}(i, a_t)$. We note that the above update rule computes, in an incremental fashion, an estimate of the joint Bellman-NN operator applied to the current $q^k$ for each discretized state-action pair $(c_i, a)$, using observations that fall into the neighborhood of $c_i$. This nearest-neighbor approximation causes the estimate to be biased.

At the end of iteration $k$, i.e., at time step $t = T_{k+1}$, a new $q^{k+1}$ is generated as follows: for each $(c_i, a) \in X_h \times \mathcal{A}$,
$$q^{k+1}(c_i, a) = (1 - \alpha_k)\, q^{k}(c_i, a) + \alpha_k\, g^{T_{k+1}}(c_i, a), \qquad (9)$$
where $\alpha_k \in (0, 1)$ is a step size. At a high level, this update is similar to the standard Q-learning update: the Q-values are updated by taking a weighted average of $q^k$, the previous estimate, and $g^{T_{k+1}}$, a one-step application of the Bellman operator estimated using newly observed data. There are two main differences from standard Q-learning: 1) the Q-value of each $(c_i, a)$ is estimated using all observations that lie in its neighborhood, a key ingredient of our approach; 2) we wait until all ball-action pairs are visited to update their Q-values, all at once.
Given the output $q^{K}$ of Policy 1, we obtain an approximate Q-value for each (continuous) state-action pair $(x, a)$ via the nearest-neighbor average operation, i.e., $Q^{T}(x, a) := (\Gamma_{\mathrm{NN}}\, q^{K})(x, a)$; here the superscript $T$ emphasizes that the algorithm is run for $T$ time steps, with a sample size of $T$.
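The following sketch assembles the pieces of this section, assuming fixed-radius NN weights, the illustrative step size $\alpha_k = 1/(k+1)$, and a minimal environment interface (env.reset, env.step); all of these are assumptions made for illustration rather than the exact specification of Policy 1.

```python
import numpy as np

def nnql(env, policy, centers, h, gamma, num_actions, T):
    """Sketch of NNQL: maintain Q-values q on the h-net, accumulate a
    running-average estimate g of the joint Bellman-NN operator within
    an iteration, and fold g into q via update (9) once every
    (ball, action) pair has been visited."""
    N_h = len(centers)
    q = np.zeros((N_h, num_actions))
    g = np.zeros((N_h, num_actions))
    counts = np.zeros((N_h, num_actions), dtype=int)
    k = 0                                          # iteration index
    x = env.reset()
    for _ in range(T):
        a = policy(x)
        x_next, reward = env.step(a)               # single sample path
        # Nearest-neighbor extension of q at the next state (Eq. (2)):
        w = (np.linalg.norm(centers - x_next, axis=1) <= h).astype(float)
        w /= w.sum()                               # h-net: at least one neighbor
        target = reward + gamma * np.max(w @ q)    # R_t + gamma * max_b (NN q)(x', b)
        # Incremental update (8) for every ball containing the current state:
        for i in np.flatnonzero(np.linalg.norm(centers - x, axis=1) <= h):
            counts[i, a] += 1
            eta = 1.0 / counts[i, a]
            g[i, a] = (1 - eta) * g[i, a] + eta * target
        if counts.min() >= 1:                      # all pairs visited: end iteration
            alpha = 1.0 / (k + 1)                  # illustrative step size
            q = (1 - alpha) * q + alpha * g        # update (9)
            g[:] = 0.0
            counts[:] = 0
            k += 1
        x = x_next
    return q
```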
4 Main Results
As the main result of this paper, we obtain a finite-sample analysis of the NNQL policy. Specifically, we find that NNQL converges to an accurate estimate of the optimal action-value function $Q^*$ in a number of time steps that has polynomial dependence on the model parameters. The proof can be found in Appendix E.
Theorem 1.
The theorem provides sufficient conditions for NNQL to achieve $\varepsilon$ accuracy (in sup norm) for estimating the optimal action-value function $Q^*$. The conditions involve the bandwidth parameter $h$ and the number of time steps $T$, both of which depend polynomially on the relevant problem parameters. Here an important parameter is the covering number $N_h$: it provides a measure of the "complexity" of the state space $\mathcal{S}$, replacing the role of the cardinality $|\mathcal{S}|$ in the context of discrete state spaces. For instance, for a unit-volume ball in $\mathbb{R}^d$, the corresponding covering number scales as $(C/h)^d$ for a constant $C$ (cf. Proposition 4.2.12 in [46]). We note several remarks on the implications of the theorem.
Sample complexity: The number of time steps $T$, which also equals the number of samples needed, scales linearly with the covering time $L_h$ of the underlying policy used to sample data for the given MDP. Note that $L_h$ depends implicitly on the complexities of the state and action spaces, as measured by $N_h$ and $|\mathcal{A}|$. In the best scenario, $L_h$, and hence $T$ as well, is linear in $N_h |\mathcal{A}|$ (up to logarithmic factors), in which case we achieve (near) optimal linear sample complexity. The sample complexity also depends polynomially on the desired accuracy $1/\varepsilon$ and the effective horizon $1/(1-\gamma)$ of the discounted MDP; optimizing the exponents of this polynomial dependence remains interesting future work.
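A rough accounting of where the exponent in the specialized bound comes from (a sketch, assuming the prescription $h \propto \varepsilon$ discussed under "Choice of $h$" below):

```latex
N_h = O\big((1/h)^d\big), \qquad h \propto \varepsilon
\quad \Longrightarrow \quad
N_h\, |\mathcal{A}| = O\big(\varepsilon^{-d}\big),
```

so a covering time that is linear in $N_h |\mathcal{A}|$ contributes a factor $\varepsilon^{-d}$, and the remaining $\mathrm{poly}(1/\varepsilon)$ factors from averaging out the estimation noise yield the overall $\tilde{O}(1/\varepsilon^{d+3})$ scaling discussed in Sections 1 and 4.1.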
Space complexity: The space complexity of NNQL is $O(N_h |\mathcal{A}|)$, which is necessary for storing the values of $q$ and $g$. Note that NNQL is a truly online algorithm: each data point is accessed only once upon observation and then discarded; no storage of the data is needed.
Computational complexity: In terms of computational complexity, the algorithm needs to compute the NN operator and a maximization over $\mathcal{A}$ in each time step, as well as to update the values of $q(c_i, a)$ for all $c_i \in X_h$ and $a \in \mathcal{A}$ in each iteration. Therefore, the worst-case computational complexity per time step is $O(N_h |\mathcal{A}|)$, with an overall complexity of $O(T N_h |\mathcal{A}|)$. The computation can potentially be sped up by using more efficient data structures and algorithms for finding (approximate) nearest neighbors, such as k-d trees [5], random projection trees [13], Locality Sensitive Hashing [21], and boundary trees [26]; see the sketch below.
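For instance, with SciPy's k-d tree the fixed-radius lookup used by the NN operator avoids scanning all $N_h$ centers at every step; the data and radius below are arbitrary stand-ins.

```python
import numpy as np
from scipy.spatial import cKDTree

centers = np.random.rand(10_000, 3)              # stand-in for an h-net, d = 3
tree = cKDTree(centers)                          # built once, queried every step
x = np.array([0.5, 0.5, 0.5])
neighbor_ids = tree.query_ball_point(x, r=0.05)  # indices of balls containing x
```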
Choice of $h$: NNQL requires as input a user-specified parameter $h$, which determines the discretization granularity of the state space as well as the bandwidth of the (kernel) nearest neighbor regression. Theorem 1 prescribes a desired value $h^* \propto \varepsilon / M_Q$, where we recall that $M_Q$ is the Lipschitz parameter of the optimal action-value function (see Lemma 1). Therefore, we need to use a small $h$ if we demand a small error $\varepsilon$, or if $Q^*$ fluctuates a lot with a large $M_Q$.
4.1 Special Cases and Lower Bounds
Theorem 1, combined with Proposition 1, immediately yields the following bound, which quantifies the number of samples required to obtain an $\varepsilon$-accurate estimate of the optimal action-value function with high probability, if the sample path is generated per the uniformly random policy. The proof is given in Appendix F.
Corollary 1.
Corollary 1 states that the sample complexity of NNQL scales as $\tilde{O}(1/\varepsilon^{d+3})$. We will show that this is effectively necessary by establishing a lower bound that holds for any algorithm under any sampling policy. The proof of Theorem 2 can be found in Appendix G.
Theorem 2.
For any reinforcement learning algorithm producing an estimate $\widehat{Q}^T$ from $T$ samples, there exists an MDP problem and some number $T_0$ such that for all $T \ge T_0$,
$$\mathbb{E}\big[\|\widehat{Q}^T - Q^*\|_\infty\big] \ge C\, T^{-\frac{1}{d+2}},$$
where $C > 0$ is a constant. Consequently, for any reinforcement learning algorithm and any sufficiently small $\varepsilon > 0$, there exists an MDP problem such that in order to achieve
$$\mathbb{E}\big[\|\widehat{Q}^T - Q^*\|_\infty\big] \le \varepsilon,$$
one must have
$$T \ge C' \left(\frac{1}{\varepsilon}\right)^{d+2},$$
where $C' > 0$ is a constant.
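The second statement follows from the first by inverting the rate: requiring the error bound to be at most $\varepsilon$ forces

```latex
C\, T^{-\frac{1}{d+2}} \le \varepsilon
\quad \Longleftrightarrow \quad
T \ge \left( \frac{C}{\varepsilon} \right)^{d+2},
```

which is the stated sample-complexity lower bound with $C' = C^{\,d+2}$.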
5 Conclusions
In this paper, we considered the reinforcement learning problem for infinite-horizon discounted MDPs with a continuous state space. We focused on a reinforcement learning algorithm, NNQL, that is based on kernelized nearest neighbor regression. We established nearly tight finite-sample convergence guarantees, showing that NNQL can accurately estimate the optimal Q-function using a nearly optimal number of samples. In particular, our results state that the sample, space, and computational complexities of NNQL scale polynomially (sometimes linearly) with the covering number of the state space, which is continuous and has uncountably infinite cardinality.
In this work, the sample complexity analysis with respect to the accuracy parameter $\varepsilon$ is nearly optimal, but its dependence on the other problem parameters is not optimized. This will be an important direction for future work. It is also of interest to generalize our approach to MDP settings beyond infinite-horizon discounted problems, such as finite-horizon or average-cost problems. Another possible direction for future work is to combine NNQL with a smart exploration policy, which may further improve its performance. It would also be of much interest to investigate whether our approach, specifically the idea of using nearest neighbor regression, can be extended to handle infinite or even continuous action spaces.
Acknowledgment
This work is supported in part by NSF projects CNS-1523546, CMMI-1462158 and CMMI-1634259, and MURI 1336685079809.
References
 [1] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 [2] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Reinforcement learning with a near optimal rate of convergence. Technical Report, 2011.
 [3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In NIPS, 2011.
 [4] A. Barreto, D. Precup, and J. Pineau. Practical kernel-based reinforcement learning. The Journal of Machine Learning Research, 17(1):2372–2441, 2016.
 [5] J. L. Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, (4):333–340, 1979.
 [6] D. Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
 [7] D. P. Bertsekas. Dynamic programming and optimal control, volume II. Athena Scientific, Belmont, MA, 3rd edition, 2007.
 [8] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Proceedings of the 31st Conference On Learning Theory, pages 1691–1692. PMLR, 2018.
 [9] N. Bhat, V. F. Farias, and C. C. Moallemi. Nonparametric approximate dynamic programming via the kernel method. In NIPS, 2012.
 [10] C.S. Chow and J. N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5(4):466–488, 1989.
 [11] C.S. Chow and J. N. Tsitsiklis. An optimal oneway multigrid algorithm for discretetime stochastic control. IEEE Transactions on Automatic Control, 36(8):898–914, 1991.
 [12] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analysis for TD(0) with linear function approximation. arXiv preprint arXiv:1704.01161, 2017.
 [13] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 537–546. ACM, 2008.
 [14] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 [15] F. Dufour and T. Prieto-Rumeau. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications, 388(2):1254–1267, 2012.
 [16] F. Dufour and T. Prieto-Rumeau. Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization, 51(2):1298–1324, 2013.
 [17] F. Dufour and T. Prieto-Rumeau. Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics: An International Journal of Probability and Stochastic Processes, 87(2):273–307, 2015.
 [18] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. JMLR, 5, December 2004.
 [19] W. B. Haskell, R. Jain, and D. Kalathil. Empirical dynamic programming. Mathematics of Operations Research, 41(2), 2016.
 [20] H. V. Hasselt. Double Q-learning. In NIPS, 2010.
 [21] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
 [22] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.
 [23] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In NIPS, 1999.
 [24] S. H. Lim and G. DeJong. Towards finite-sample convergence of direct reinforcement learning. In Proceedings of the 16th European Conference on Machine Learning, pages 230–241. Springer-Verlag, 2005.
 [25] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 504–513. AUAI Press, 2015.
 [26] C. Mathy, N. Derbinsky, J. Bento, J. Rosenthal, and J. S. Yedidia. The boundary forest algorithm for online supervised and unsupervised learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2864–2870, 2015.
 [27] F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671. ACM, 2008.
 [28] F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
 [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [30] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
 [31] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
 [32] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Trans. Automatic Control, 47(10), 2002.
 [33] D. Ormoneit and Ś. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3), 2002.
 [34] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 774–781. AAAI Press, 2013.
 [35] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
 [36] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3), 1997.
 [37] N. Saldi, S. Yuksel, and T. Linder. On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. of Operations Research, 42(4), 2017.
 [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [39] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.
 [40] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In ICML, 2006.
 [41] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In NIPS, 1997.
 [42] C. Szepesvári and W. D. Smart. Interpolation-based Q-learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 100. ACM, 2004.
 [43] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 1994.
 [44] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control, 42(5), 1997.
 [45] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
 [46] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2017.
 [47] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4), 1992.
 [48] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
Appendix A Related works
Given the large body of relevant literature, even surveying the work on Q-learning in a satisfactory manner is beyond the scope of this paper. Here we only mention the most relevant prior works, and compare them to ours in terms of the assumptions needed, the algorithmic approaches considered, and the performance guarantees provided. Table 1 lists key representative works from the literature and contrasts them with our result.
Q-learning has been studied extensively for finite-state MDPs. [43] and [22] are amongst the first to establish its asymptotic convergence. Both of them cast Q-learning as a stochastic approximation scheme; we utilize this abstraction as well. More recent work studies the non-asymptotic performance of Q-learning; see, e.g., [41], [18], and [24]. Many variants of Q-learning have also been proposed and analyzed, including Double Q-learning [20], Speedy Q-learning [3], Phased Q-learning [23], and Delayed Q-learning [40].
A standard approach for continuous-state MDPs with known transition kernels is to construct a reduced model by discretizing the state space, and to show that the new finite MDP approximates the original one. For example, Chow and Tsitsiklis establish approximation guarantees for a multigrid algorithm when the state space is compact [10, 11]. This result was recently extended to average-cost problems and to general Borel state and action spaces in [37]. To reduce the computational complexity, Rust proposes a randomized version of the multigrid algorithm and provides a bound on its approximation accuracy [36]. Our approach bears some similarities to this line of work: we also use state space discretization, and we impose similar continuity assumptions on the MDP model. However, we do not require the transition kernel to be known, nor do we construct a reduced model; rather, we learn the action-value function of the original MDP directly by observing its sample path.
The closest work to this paper is by Szepesvári and Smart [42], wherein they consider a variant of Q-learning combined with local function approximation methods. The algorithm approximates the optimal Q-values at a given set of sample points and interpolates them at each query point. Follow-up work considers combining Q-learning with linear function approximation [28]. Despite the algorithmic similarity, their results are distinct from ours: they establish asymptotic convergence of the algorithm, based on the assumption that the data-sampling policy is stationary and stochastic. In contrast, we provide finite-sample bounds, and our results apply to arbitrary sample paths (including those generated by non-stationary policies). Consequently, our analytical techniques are also different from theirs.
Some other closely related work is by Ormoneit and co-authors on model-free reinforcement learning for continuous state spaces with unknown transition kernels [33, 32]. Their approach, called KBRL, constructs a kernel-based approximation of the conditional expectation that appears in the Bellman operator. Value iteration can then be run using the approximate Bellman operator, and asymptotic consistency is established for the resulting fixed points. Subsequent work demonstrates the applicability of KBRL to practical large-scale problems [4]. Unlike our approach, KBRL is an offline, batch algorithm, in which data is sampled all at once and remains the same throughout the iterations of the algorithm. Moreover, the aforementioned work does not provide a convergence rate or finite-sample performance guarantee for KBRL. The idea of approximating the Bellman operator by an empirical estimate has also been used in the context of discrete state-space problems [19]. The approximate operator is used to develop Empirical Dynamic Programming (EDP) algorithms, including value and policy iterations, for which non-asymptotic error bounds are provided. EDP is again an offline batch algorithm; moreover, it requires multiple, independent transitions to be sampled for each state, and hence does not apply to our setting with a single sample path.
In terms of theoretical results, most relevant is the work in [30], which also obtains finite-sample performance guarantees for continuous-space problems with unknown transition kernels. Extension to the setting with a single sample path is considered in [1]. The algorithms considered therein, including fitted value iteration and Bellman-residual minimization based fitted policy iteration, are different from ours. In particular, these algorithms perform updates in a batch fashion and require storage of all the data throughout the iterations.
There are other papers that provide finite-sample guarantees, such as [25, 12]; however, their settings (availability of i.i.d. data), algorithms (TD learning), and proof techniques are very different from ours. The work by Bhandari et al. [8] also provides a finite-sample analysis of TD learning with linear function approximation, for both the i.i.d. data model and a single trajectory. We also note that the work on PAC-MDP methods [34] explores the impact of the exploration policy on learning performance. The focus of our work is the estimation of Q-functions rather than the problem of exploration; nevertheless, we believe it is an interesting future direction to study combining our algorithm with smart exploration strategies.
Appendix B Proof of Lemma 1
Proof.
Let $\mathcal{Q}_1$ be the set of all functions $Q \in \mathcal{B}(\mathcal{S} \times \mathcal{A})$ such that $\|Q\|_\infty \le V_{\max}$. Let $\mathcal{Q}_2$ be the set of all functions $Q \in \mathcal{B}(\mathcal{S} \times \mathcal{A})$ such that $Q(\cdot, a) \in \mathrm{Lip}(\mathcal{S}, M_Q)$ for each $a \in \mathcal{A}$, where $M_Q = M_r + \gamma V_{\max} M_p$. Take any $Q \in \mathcal{Q} := \mathcal{Q}_1 \cap \mathcal{Q}_2$, and fix an arbitrary $a \in \mathcal{A}$. For any $x \in \mathcal{S}$, we have
$$0 \le (FQ)(x, a) = r(x, a) + \gamma \int_{\mathcal{S}} \max_{b \in \mathcal{A}} Q(y, b)\, p(y \mid x, a)\, dy \le R_{\max} + \gamma V_{\max} = V_{\max},$$
where the last equality follows from the definition of $V_{\max}$. This means that $FQ \in \mathcal{Q}_1$. Also, for any $x, x' \in \mathcal{S}$, we have
$$|(FQ)(x, a) - (FQ)(x', a)| \le |r(x, a) - r(x', a)| + \gamma \int_{\mathcal{S}} \Big|\max_{b \in \mathcal{A}} Q(y, b)\Big| \cdot |p(y \mid x, a) - p(y \mid x', a)|\, dy \le \big( M_r + \gamma V_{\max} M_p \big)\, \rho(x, x').$$
This means that $FQ(\cdot, a) \in \mathrm{Lip}(\mathcal{S}, M_Q)$, so $FQ \in \mathcal{Q}_2$. Putting these together, we see that $F$ maps $\mathcal{Q}$ to $\mathcal{Q}$, which in particular implies that $F$ maps $\mathcal{Q}$ to itself. Since $\mathcal{Q}$ is closed and $F$ is a contraction, both with respect to $\|\cdot\|_\infty$, the Banach fixed point theorem guarantees that $F$ has a unique fixed point $Q^* \in \mathcal{Q}$. This completes the proof of the lemma. ∎
Appendix C Examples of Nearest Neighbor Regression Methods
Below we describe three representative nearest neighbor regression methods, each of which corresponds to a certain choice of the weighting function $K$ in the averaging procedure (2).

$k$-nearest neighbor ($k$-NN) regression: For each $x \in \mathcal{S}$, we find its $k$ nearest neighbors in the subset $X_h$ and average their Q-values, where $k \ge 1$ is a pre-specified number. Formally, let $c_{(j)}(x)$ denote the $j$-th closest point to $x$ amongst the set $X_h$. Thus, the distance of each state in $\mathcal{S}$ to its nearest neighbor satisfies $\rho(x, c_{(1)}(x)) < h$. Then the $k$-NN estimate for the Q-value of $(x, a)$ is $\frac{1}{k} \sum_{j=1}^{k} q\big(c_{(j)}(x), a\big)$. This corresponds to using in (2) the weighting function
$$K(c_i, x) = \frac{1}{k}\, \mathbb{I}\big\{ \rho(c_i, x) \le \rho\big(c_{(k)}(x), x\big) \big\}.$$
Under the definition of $X_h$ in Section 3.1, the assumption (4) is satisfied if we use $k = 1$. For other values of $k$, the assumption holds with a potentially different value of $h$.

Fixed-radius near neighbor regression: We find all neighbors of $x$ up to a threshold distance $h$ and average their Q-values. The definition of $X_h$ ensures that at least one point is within the threshold distance $h$, i.e., there exists $c_i \in X_h$ such that $\rho(c_i, x) < h$. We can then define the weighting function according to
$$K(c_i, x) = \frac{\mathbb{I}\{\rho(c_i, x) \le h\}}{\sum_{j=1}^{N_h} \mathbb{I}\{\rho(c_j, x) \le h\}}.$$

Kernel regression: Here the Q-values of the neighbors of $x$ are averaged in a weighted fashion according to some kernel function $\phi$ [31, 48]. The kernel function takes as input the distance $\rho(c_i, x)$ (normalized by the bandwidth parameter $h$) and outputs a similarity score between $0$ and $1$. The weighting function is then given by
$$K(c_i, x) = \frac{\phi\big(\rho(c_i, x)/h\big)}{\sum_{j=1}^{N_h} \phi\big(\rho(c_j, x)/h\big)}.$$
For example, a (truncated) Gaussian kernel corresponds to $\phi(u) = e^{-u^2/2}\, \mathbb{I}\{u \le 1\}$. Choosing $\phi(u) = \mathbb{I}\{u \le 1\}$ recovers the fixed-radius near neighbor regression described above.
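The three weighting functions above can be written compactly as follows; dists holds the distances $\rho(c_i, x)$ for all centers, and the helper names are illustrative.

```python
import numpy as np

def knn_weights(dists, k):
    """k-NN regression: uniform weights on the k closest centers."""
    w = np.zeros_like(dists)
    w[np.argsort(dists)[:k]] = 1.0 / k
    return w

def fixed_radius_weights(dists, h):
    """Fixed-radius NN regression: uniform weights within distance h."""
    w = (dists <= h).astype(float)
    return w / w.sum()

def kernel_weights(dists, h, phi=lambda u: np.exp(-u ** 2 / 2) * (u <= 1)):
    """Kernel regression; the default phi is the truncated Gaussian
    mentioned above, and phi(u) = 1{u <= 1} recovers the fixed-radius
    weights."""
    w = phi(dists / h)
    return w / w.sum()
```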
Appendix D Bounds on Covering time
D.1 Proof of Proposition 1
Proof.
Without loss of generality, we may assume that the balls $B_1, \ldots, B_{N_h}$ are disjoint, since the covering time only becomes smaller if they overlap with each other. Note that under the $\varepsilon$-greedy policy, equation (6) implies that for every ball-action pair $(B_i, a)$,
$$\Pr\big\{ x_{t+m} \in B_i,\ a_{t+m} = a \mid x_t \big\} \ge \frac{\varepsilon\, \varphi\, \nu(B_i)}{|\mathcal{A}|} \ge \frac{\varepsilon\, \varphi\, \nu_{\min}}{|\mathcal{A}|}. \qquad (10)$$
First assume that the above assumption holds with $m = 1$. Let $M := N_h |\mathcal{A}|$ be the total number of ball-action pairs, and fix an ordering of these pairs. For each integer $t$, let $n_t$ be the number of ball-action pairs visited up to time $t$. Let $\tau$ be the first time when all ball-action pairs are visited. For each $j \in [M]$, let $\tau_j$ be the first time when $j$ pairs are visited, and let $\theta_j := \tau_j - \tau_{j-1}$ be the time to visit the $j$-th pair after $j - 1$ pairs have been visited. We use the convention that $\tau_0 = 0$. By definition, we have
$$\tau = \sum_{j=1}^{M} \theta_j.$$
When $j - 1$ pairs have been visited, the probability of visiting a new pair is at least
$$(M - j + 1)\, \frac{\varepsilon\, \varphi\, \nu_{\min}}{|\mathcal{A}|},$$
where the inequality follows from Eq. (10). Therefore, $\theta_j$ is stochastically dominated by a geometric random variable with mean at most $\frac{|\mathcal{A}|}{(M - j + 1)\, \varepsilon\, \varphi\, \nu_{\min}}$. It follows that
$$\mathbb{E}[\tau] \le \sum_{j=1}^{M} \frac{|\mathcal{A}|}{(M - j + 1)\, \varepsilon\, \varphi\, \nu_{\min}} = \frac{|\mathcal{A}|}{\varepsilon\, \varphi\, \nu_{\min}} \sum_{k=1}^{M} \frac{1}{k} = O\!\left( \frac{|\mathcal{A}| \log(N_h |\mathcal{A}|)}{\varepsilon\, \varphi\, \nu_{\min}} \right).$$
This proves the proposition for $m = 1$.

For general values of $m$, the proposition follows from a similar argument by considering the MDP only at times $t, t + m, t + 2m, \ldots$, which inflates the bound by a factor of $m$. ∎
D.2 Proof of Proposition 2
Proof.
We shall use a line of argument similar to that in the proof of Proposition 1, again assuming that the balls are disjoint. Note that under the $\varepsilon$-greedy policy, the exploratory action at each step matches any prescribed action with probability at least $\varepsilon / |\mathcal{A}|$; hence, for all $t$ and all ball-action pairs $(B_i, a)$, condition (7) implies that
$$\Pr\big\{ x_{t+m} \in B_i,\ a_{t+m} = a \mid x_t \big\} \ge \varphi\, \nu(B_i) \left( \frac{\varepsilon}{|\mathcal{A}|} \right)^{m+1}.$$
The rest of the proof proceeds exactly as in Proposition 1, with the per-step visit probability $\frac{\varepsilon\, \varphi\, \nu_{\min}}{|\mathcal{A}|}$ replaced by $\varphi\, \nu_{\min} (\varepsilon / |\mathcal{A}|)^{m+1}$, yielding the claimed bound. ∎