1 Introduction
The problem of policy evaluation arises naturally in the context of reinforcement learning (RL) (Sutton & Barto, 2018) when one wants to evaluate the (action) values of a policy in a Markov decision process (MDP). In particular, policy iteration (Howard, 1960) is a classic algorithmic framework for solving MDPs that poses and solves a policy evaluation problem during each iteration. Being motivated by the setting of reinforcement learning, i.e., the underlying MDP parameters are unknown, we focus on solving the policy evaluation problem given only a single sample path.
Following a stationary Markov policy in an MDP, i.e., actions are determined based solely on the current state, gives rise to a Markov reward process (MRP) (Puterman, 1994). In this paper, we focus on MRPs and consider the problem of estimating the infinite-horizon discounted state values of an unknown MRP.
A straightforward approach to policy evaluation is to estimate the parameters of the MRP directly. Model-based methods (Bertsekas & Tsitsiklis, 1996) estimate the underlying Markov chain transitions and mean rewards, and then compute the discounted values using the estimated parameters. This approach provides excellent estimates of discounted values, as our numerical experiments show (Section 5). However, model-based estimators suffer from a space complexity of $O(S^2)$, where $S$ is the number of states in the MRP. In contrast, model-free methods enjoy a lower space complexity of $O(S)$ by not explicitly estimating the model parameters (Sutton, 1988) but tend to exhibit a greater estimation error.
A popular class of estimators, $k$-step bootstrapping temporal difference or TD(k),^{1} estimates a state's value based on the estimated values of other states, biasing its current estimates. This complicates the analysis of the algorithm's convergence in the single sample path setting. The fact that the estimate of a state's value is tied to the estimates of other states makes it hard to study the convergence of a specific state's value estimate in isolation.
^{1} An important variant is TD($\lambda$), but we do not include it in our experiments since there is not a canonical implementation of the idea of estimating the $\lambda$-return (Sutton & Barto, 2018). However, any implementation is expected to exhibit behavior similar to TD(k), with large $k$ corresponding to large $\lambda$ (Kearns & Singh, 2000).
In this work, we show that it is possible to achieve $O(1)$ space complexity with provably efficient sample complexity (in the form of a PAC-style bound). The key insight is that it is possible to circumvent the general difficulties of non-independent samples entirely by recognizing the embedded regenerative structure of an MRP. Regenerative processes are central to renewal theory (Ross, 1996), whose main approach is based on the independence between renewals in a stochastic process. We alleviate the reliance on estimates of other states by studying segments of the sample path that start and end in the same state, i.e., loops. This results in a novel and simple algorithm we call the loop estimator (Algorithm 1).
We first review the requisite definitions (Section 3) and then propose the loop estimator (Section 4.2). Unlike prior works that relied on a generative model and reduction arguments based on cover times, we directly establish the convergence of a single state's value estimate in isolation over a single sample path at a rate independent of the size of the state space. We then analyze the algorithm's rate of convergence over each visit to a specified state (Theorem 4.2). Using the exponential concentration of first return times (Lemma 4.3), we relate visits to their waiting times and establish the rate of convergence over steps (Theorem 4.5). Lastly, we obtain the convergence in $\ell_\infty$ norm of all states via the union bound. Besides theoretical analysis, we also compare the loop estimator to several other estimators numerically on a commonly used example (Section 5). Finally, we discuss an interesting observation concerning the loop estimator's model-based vs. model-free status (Section 6).
Our main contributions in this paper are threefold:

By recognizing the embedded regenerative structure in MRPs, we derive a new Bellman equation over loops, segments that start and end in the same state.

We introduce the loop estimator, a novel algorithm that efficiently estimates the discounted values of an MRP from a single sample path.

We formally analyze the algorithm's convergence with renewal theory and establish a rate of $\tilde{O}(\sqrt{\tau_s / T})$ over steps $T$ for a single state $s$, where $\tau_s$ is the maximal expected hitting time of $s$.
In the interest of a concise presentation, we defer detailed proofs to Appendix A with fully expanded logarithmic factors and constants.
2 Related works
Much work that formally studies the convergence of value estimators (particularly the TD estimators) relies on having access to independent trajectories that start in all the states (Dayan & Sejnowski, 1994; Even-Dar & Mansour, 2003; Jaakkola et al., 1994; Kearns & Singh, 2000). This is sometimes called a generative model. Operationally speaking, independent trajectories can be obtained if we additionally assume some reset mechanism, e.g., that the process restarts upon reaching known cost-free terminal states. Kearns & Singh (1999) consider how a set of independent trajectories can be obtained by following a single sample path via the stationary distribution. But naively speaking, the resulting reduction will scale with the cover time of the MRP, which can be quite large, cf. the maximal expected hitting time in this work. Common to this family of results is the notable absence of an explicit dependency on the structure of the underlying process. In contrast, such dependency shows up naturally in our results (in the form of $\tau_s$ in Theorem 4.5). We do not assume access to samples other than a single trajectory from the MRP in our analysis, similar to Antos et al. (2008). To ensure convergence is at all possible, we assume that the specific state to estimate is visited infinitely often with probability 1 (Assumption 3.1); otherwise we cannot hope to (significantly) improve its value estimate after the final visit. We think this assumption is reasonable, as recurrence is an important feature of many Markov chains and it connects naturally to the online (interactive) setting. A possible drawback is that it excludes transient states, which matters when most of the state space is transient.
Besides the interest in the RL community in studying the policy evaluation problem (or sometimes, the value prediction problem), operations researchers were also motivated to study estimation, primarily as a computational method to leverage simulations. Classic work by Fox & Glynn (1989) deals with estimating discounted value in a continuous time setting, and one of its estimators echoes our key insight of leveraging the regenerative structure of MRPs. However, their analysis also relies on independent trajectories. While this assumption is reasonable in the simulation setting, it is not immediately comparable to our setting.
Outside of the studies on reward processes, the regenerative structure of Markov chains has found application in the local computation of PageRank (Lee et al., 2013). We make use of a lemma (Lemma 4.3, whose proof is included in Appendix A.3 for completeness) from this work to establish an upper bound on waiting times (Corollary 4.4). Similar in spirit to the concept of locality studied by Lee et al. (2013), our loop estimator enables space-efficient estimation of a single state value with a space complexity of $O(1)$ and an error bound without explicit dependency on the size of the state space.
3 Preliminaries
3.1 Markov reward processes and Markov chains
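Consider a finite state space $\mathcal{S}$ whose size is $S = |\mathcal{S}|$, a transition probability matrix $P$ that specifies the transition probabilities between consecutive states $X_t$ and $X_{t+1}$, i.e., (strong) Markov property $\mathbb{P}[X_{t+1} = s' \mid X_t = s, X_{t-1}, \ldots, X_0] = P_{ss'}$, and a reward function $r$ where $R_t \sim r(X_t)$ takes values in $[0, r_{\max}]$, then $(X_t, R_t)_{t \ge 0}$ is called a discrete-time finite Markov reward process (MRP) (Puterman, 1994). Note that $(X_t)_{t \ge 0}$ is an embedded Markov chain with transition law $P$. Furthermore, we denote the mean rewards as $\bar{r}_s := \mathbb{E}[R_t \mid X_t = s]$. As conventions, we denote $\mathbb{E}_s[\cdot] := \mathbb{E}[\cdot \mid X_0 = s]$ and $\mathbb{P}_s[\cdot] := \mathbb{P}[\cdot \mid X_0 = s]$.
The first step when a Markov chain visits a state $s$ is called the hitting time to $s$, $H_s := \inf\{t \ge 0 : X_t = s\}$. Note that if a chain starts at $s$, then $H_s = 0$. We refer to the first time a chain returns to $s$ as the first return time to $s$
$$H_s^+ := \inf\{t > 0 : X_t = s\}. \tag{1}$$
Definition 3.1 (Expected recurrence time).
Given a Markov chain, we define the expected recurrence time of state $s$ as the expected first return time of $s$ starting in $s$
$$\rho_s := \mathbb{E}_s[H_s^+]. \tag{2}$$
A state $s$ is positive recurrent if its expected recurrence time is finite, i.e., $\rho_s < \infty$.
Definition 3.2 (Maximal expected hitting time (Lee et al., 2013)).
Given a Markov chain, we define the maximal expected hitting time of state $s$ as the maximal expected first return time over starting states
$$\tau_s := \max_{s' \in \mathcal{S}} \mathbb{E}_{s'}[H_s^+]. \tag{3}$$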
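As a concrete illustration of Definitions 3.1 and 3.2, the following minimal Python sketch computes expected first return times by first-step analysis, i.e., by iterating $h(s') = 1 + \sum_{x \ne s} P_{s'x} h(x)$ to its fixed point. The two-state chain, the function name `expected_return_times`, and the choice of fixed-point iteration are illustrative assumptions and not part of the original presentation.

```python
def expected_return_times(P, s, sweeps=2_000):
    """Approximate h[x] = E_x[H_s^+] for every start state x.

    First-step analysis: h(x) = 1 + sum over y != s of P[x][y] * h(y).
    The iteration converges when s is reachable from all states.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        h = [1.0 + sum(P[x][y] * h[y] for y in range(n) if y != s)
             for x in range(n)]
    return h


# Hypothetical two-state chain used only for illustration.
P = [[0.5, 0.5],
     [0.25, 0.75]]

h = expected_return_times(P, s=0)
rho_0 = h[0]      # expected recurrence time of state 0 (start in state 0)
tau_0 = max(h)    # maximal expected hitting time of state 0
print(rho_0, tau_0)
```

For this chain the stationary probability of state 0 is 1/3, so the expected recurrence time is 3, while starting from state 1 the expected time to reach state 0 is 4; hence the maximal expected hitting time is at least as large as the expected recurrence time.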
3.2 Discounted total rewards
In RL, we are generally interested in some expected long-term rewards that will be collected by following a policy. In the infinite-horizon discounted total reward setting, following a Markov policy on an MDP induces an MRP and the state value of state $s$ is
$$v(s) := \mathbb{E}_s\left[\sum_{t=0}^{\infty} \gamma^t R_t\right], \tag{4}$$
where $\gamma \in [0, 1)$ is the discount factor. Note that since the reward is bounded by $r_{\max}$, state values are also bounded by $r_{\max}/(1 - \gamma)$. A fundamental result relating values to the MRP parameters is the Bellman equation for each state $s$ (Sutton & Barto, 2018)
$$v(s) = \bar{r}_s + \gamma \sum_{s' \in \mathcal{S}} P_{ss'} \, v(s'). \tag{5}$$
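To make the Bellman equation concrete, the sketch below solves it by repeated substitution (value iteration for a fixed MRP) when the model is known. The two-state chain, the mean rewards, and the name `state_values` are hypothetical choices made for illustration only.

```python
def state_values(P, r_bar, gamma, sweeps=2_000):
    """Approximate v solving v(s) = r_bar[s] + gamma * sum_s' P[s][s'] v(s')."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r_bar[i] + gamma * sum(P[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


# Hypothetical MRP: reward 0 in state 0, reward 1 in state 1.
P = [[0.5, 0.5],
     [0.25, 0.75]]
r_bar = [0.0, 1.0]
gamma = 0.9

v = state_values(P, r_bar, gamma)

# Residual of the Bellman equation at the computed values.
residual = max(
    abs(v[i] - (r_bar[i] + gamma * sum(P[i][j] * v[j] for j in range(len(P)))))
    for i in range(len(P))
)
print(v, residual)
```

Because the update is a contraction with modulus $\gamma$, the iterates converge geometrically, and every value stays below $r_{\max}/(1-\gamma) = 10$ in this example.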
3.3 Problem statement
Suppose that we have a sample path $(X_t, R_t)_{0 \le t < T}$ of length $T$ from an MRP whose parameters are unknown. Given a state $s$ and discount factor $\gamma$, we want to estimate $v(s)$.
Assumption 3.1 (State $s$ is reachable).
We will assume that state $s$ is reachable from all states, i.e., $\tau_s < \infty$.
Otherwise there is some non-negligible probability that state $s$ will not be visited from some starting state. This will prevent the convergence in probability (in the form of a PAC-style error bound) that we seek. Assumption 3.1 can be weakened to the assumption that $s$ is positive recurrent and the MRP starts in the recurrent class containing $s$. However, we will adopt Assumption 3.1 in the rest of the article without much loss of generality,^{2} so we do not need to worry about where the MRP starts. Note that Assumption 3.1 implies the positive recurrence of $s$, i.e., $\rho_s \le \tau_s < \infty$, by definition, and that the MRP visits state $s$ infinitely many times with probability 1.
^{2} We can recover similar results by restricting $s'$ in the definition of $\tau_s$ to the recurrent class containing $s$.
3.4 Renewal theory and loops
Stochastic processes in general can exhibit complex dependencies between random variables at different steps, and thus often fall outside of the applicability of approaches that rely on independence assumptions. Renewal theory (Ross, 1996) focuses on a class of stochastic processes where the process restarts after a "renewal" event. Such regenerative structure allows us to apply results from the independent and identically distributed (IID) setting.
In particular, we consider the visits to state $s$ as renewal events and define waiting times $W_n(s)$ for $n = 1, 2, \ldots$, to be the number of steps before the $n$th visit
$$W_n(s) := \inf\left\{t \ge 0 : \sum_{u=0}^{t} \mathbb{1}[X_u = s] = n\right\}, \tag{6}$$
and the interarrival times $I_n(s)$ to be the steps between the $n$th and $(n+1)$th visit
$$I_n(s) := W_{n+1}(s) - W_n(s). \tag{7}$$
Remark 3.1.
The random times relate to each other in a few intuitive ways. The waiting time of the first visit is the same as the hitting time, $W_1(s) = H_s$. Waiting times relate to interarrival times via $W_{n+1}(s) = W_1(s) + \sum_{i=1}^{n} I_i(s)$.
To justify treating visits to $s$ as renewal events, consider the sub-processes starting at $W_n(s)$ and at $W_{n+1}(s)$ (both MRPs start in state $s$); due to the Markov property of the MRP, they are statistical replicas of each other. Since the segments start and end in the same state, we call them loops. It follows that loops are independent of each other and obey the same statistical law. Intuitively speaking, an MRP is (probabilistically) specified by its starting state.
Definition 3.3 (Loop discounted rewards).
Given a Markov reward process and a positive recurrent state $s$, we define the $n$th loop discounted rewards as the discounted total rewards over the $n$th loop
$$G_n(s) := \sum_{u=0}^{I_n(s) - 1} \gamma^{u} R_{W_n(s) + u}. \tag{8}$$
Definition 3.4 (Loop discount).
Given a Markov reward process and a positive recurrent state $s$, we define the $n$th loop discount as the total discounting over the $n$th loop
$$\Gamma_n(s) := \gamma^{I_n(s)}. \tag{9}$$
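$(I_n(s), G_n(s), \Gamma_n(s))_{n \ge 1}$ forms a regenerative process that has nice independence relations. Specifically, $I_m(s) \perp I_n(s)$, $G_m(s) \perp G_n(s)$, and $\Gamma_m(s) \perp \Gamma_n(s)$ when $m \ne n$. Furthermore, $I_n(s)$ are identically distributed the same as $H_s^+$ under $\mathbb{P}_s$. Similarly, $G_n(s)$ are identically distributed. Note however that $G_n(s) \not\perp \Gamma_n(s)$.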
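The definitions above can be illustrated on a short, hand-made sample path. The sketch below cuts a path into loops at the visits to a chosen state and computes, for each loop, its length, its discounted rewards, and its discount. The path, the function name `split_into_loops`, and the convention that time starts at step 0 are assumptions made for this example.

```python
def split_into_loops(path, s, gamma):
    """Cut a sample path into loops at the visits to state s.

    path is a list of (state, reward) pairs indexed by step t = 0, 1, ...
    Returns (visit_times, loop_lengths, loop_rewards, loop_discounts).
    """
    visit_times = [t for t, (x, _) in enumerate(path) if x == s]
    loop_lengths, loop_rewards, loop_discounts = [], [], []
    for start, end in zip(visit_times, visit_times[1:]):
        length = end - start
        loop_lengths.append(length)
        # Discounted rewards collected from the visit up to (not including) the return.
        loop_rewards.append(sum(gamma ** u * path[start + u][1] for u in range(length)))
        loop_discounts.append(gamma ** length)
    return visit_times, loop_lengths, loop_rewards, loop_discounts


# Hypothetical path over states {0, 1}; the reward equals the state index.
states = [0, 1, 1, 0, 0, 1, 0]
path = [(x, float(x)) for x in states]

visits, lengths, rewards, discounts = split_into_loops(path, s=0, gamma=0.5)
print(visits)     # steps at which state 0 is visited
print(lengths)    # interarrival times between consecutive visits
print(rewards)    # discounted rewards of each loop
print(discounts)  # discount accumulated over each loop
```

Note that a path with four visits to the state yields only three complete loops; the segment after the last visit is discarded.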
4 Main results
4.1 Bellman equations over loops
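Given the regenerative process $(I_n(s), G_n(s), \Gamma_n(s))_{n \ge 1}$, we derive a new Bellman equation over the loops for state value $v(s)$.
Theorem 4.1 (Loop Bellman equations).
Suppose the expected loop discount is $\alpha(s) := \mathbb{E}[\Gamma_1(s)]$ and the expected loop discounted rewards is $\beta(s) := \mathbb{E}[G_1(s)]$; then we can relate the state value $v(s)$ to itself
$$v(s) = \beta(s) + \alpha(s) \, v(s). \tag{10}$$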
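As a numerical sanity check of the loop Bellman equation, the sketch below computes the expected loop discount and the expected loop discounted rewards exactly (up to iteration error) from a known model via first-step analysis, and compares $\beta(s)/(1 - \alpha(s))$ with the state value obtained from the ordinary Bellman equation. The two-state MRP and all function names are hypothetical and serve only as an illustration.

```python
def loop_quantities(P, r_bar, s, gamma, sweeps=2_000):
    """Return (alpha, beta) for state s of a known MRP.

    a[x] approximates E_x[gamma ** H_s^+] and
    b[x] approximates E_x[sum of gamma**t * R_t for t < H_s^+].
    """
    n = len(P)
    a = [0.0] * n
    b = [0.0] * n
    for _ in range(sweeps):
        a = [gamma * (P[x][s] + sum(P[x][y] * a[y] for y in range(n) if y != s))
             for x in range(n)]
        b = [r_bar[x] + gamma * sum(P[x][y] * b[y] for y in range(n) if y != s)
             for x in range(n)]
    return a[s], b[s]


def state_values(P, r_bar, gamma, sweeps=2_000):
    """Solve the ordinary Bellman equation by repeated substitution."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r_bar[i] + gamma * sum(P[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


P = [[0.5, 0.5],
     [0.25, 0.75]]
r_bar = [0.0, 1.0]
gamma = 0.9

v = state_values(P, r_bar, gamma)
loop_values = []
for s in range(len(P)):
    alpha, beta = loop_quantities(P, r_bar, s, gamma)
    loop_values.append(beta / (1.0 - alpha))
print(v, loop_values)
```

Both routes agree on this example, which is what the theorem predicts: the value of a state is determined by the statistics of its own loops.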
4.2 Loop estimator
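We plug in the empirical means for the expected loop discount and the expected loop discounted rewards into the loop Bellman equation (10) and define the $n$th loop estimator for state value $v(s)$
$$\hat{v}_n(s) := \frac{\hat{\beta}_n(s)}{1 - \hat{\alpha}_n(s)}, \tag{11}$$
where
$$\hat{\alpha}_n(s) := \frac{1}{n} \sum_{i=1}^{n} \Gamma_i(s) \tag{12}$$
and
$$\hat{\beta}_n(s) := \frac{1}{n} \sum_{i=1}^{n} G_i(s). \tag{13}$$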
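Furthermore, we have visited state $s$ for $N_T(s) + 1$ times before step $T$, where $N_T(s)$ is a random variable that counts the number of loops before step $T$
$$N_T(s) := \max\{n \ge 0 : W_{n+1}(s) < T\}, \tag{14}$$
and the estimate $\hat{v}_{N_T(s)}(s)$ would be the last estimate before step $T$. Hence, with a slight abuse of notation, we define
$$\hat{v}_T(s) := \hat{v}_{N_T(s)}(s). \tag{15}$$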
By using incremental updates to keep track of empirical means, Algorithm 1 implements the loop estimator with a space complexity of $O(1)$. Running $S$ many copies of loop estimators, one for each state $s \in \mathcal{S}$, takes a space complexity of $O(S)$.
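A minimal Python sketch of such an incremental implementation follows. It is written from the description above rather than transcribed from Algorithm 1, so the class name `LoopEstimator`, its method names, and the streaming interface are assumptions. The estimator keeps only a handful of scalars regardless of the number of states.

```python
class LoopEstimator:
    """Streaming loop estimator for the discounted value of one state."""

    def __init__(self, s, gamma):
        self.s = s
        self.gamma = gamma
        self.n = 0            # number of completed loops
        self.alpha_hat = 0.0  # running mean of loop discounts
        self.beta_hat = 0.0   # running mean of loop discounted rewards
        self.in_loop = False  # becomes True at the first visit to s
        self.g = 0.0          # discounted rewards accumulated in the current loop
        self.d = 1.0          # gamma ** (steps elapsed in the current loop)

    def update(self, state, reward):
        """Consume one (state, reward) pair of the sample path."""
        if state == self.s:
            if self.in_loop:
                # A loop just closed: fold it into the running means.
                self.n += 1
                self.alpha_hat += (self.d - self.alpha_hat) / self.n
                self.beta_hat += (self.g - self.beta_hat) / self.n
            self.in_loop, self.g, self.d = True, 0.0, 1.0
        if self.in_loop:
            self.g += self.d * reward
            self.d *= self.gamma

    def value(self):
        """Current estimate, or None before the first loop completes."""
        if self.n == 0:
            return None
        return self.beta_hat / (1.0 - self.alpha_hat)


# Hypothetical path over states {0, 1}; the reward equals the state index.
states = [0, 1, 1, 0, 0, 1, 0]
estimator = LoopEstimator(s=0, gamma=0.5)
for x in states:
    estimator.update(x, float(x))
estimate = estimator.value()
print(estimator.n, estimate)
```

On this path the three completed loops have discounts 0.125, 0.5, 0.25 and discounted rewards 0.75, 0, 0.5, so the estimate equals (1.25/3) / (1 - 0.875/3) = 10/17. Since every loop has length at least one, the running mean of loop discounts never exceeds $\gamma < 1$ and the division is always well defined.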
4.3 Rates of convergence
Now we investigate the convergence of the loop estimator, first over visits, i.e., as $n \to \infty$, then over steps, i.e., as $T \to \infty$. By applying Hoeffding's inequality to the definition of the loop estimator (11), we obtain a PAC-style upper bound on the estimation error.
Theorem 4.2 (Convergence rate over visits).
Given a sample path from an MRP , a discount factor , and a positive recurrent state , with probability of at least , the loop estimator converges to
To determine the convergence rate over steps, we need to study the concentration of waiting times, which allows us to lower-bound the number of visits with high probability. As an intermediate step, we use the fact that the tail of the distribution of first return times is upper-bounded by an exponential distribution, which follows from the Markov property of the MRP and Markov's inequality (Lee et al., 2013; Aldous & Fill, 1999).
Lemma 4.3 (Exponential concentration of first return times (Lee et al., 2013; Aldous & Fill, 1999)).
Given a Markov chain defined on a finite state space $\mathcal{S}$, for any state $s \in \mathcal{S}$ and any $t \ge 0$, we have
$$\mathbb{P}[H_s^+ > t] \le 2 \cdot 2^{-t/(2\tau_s)}.$$
Secondly, since by Remark 3.1 we have $W_n(s) = W_1(s) + \sum_{i=1}^{n-1} I_i(s)$, we apply the union bound to upper-bound the tail of waiting times.
Corollary 4.4 (Upper bound on waiting times).
With probability of at least , .
Remark 4.2.
Note that the waiting time $W_n(s)$ is nearly linear in $n$ with a dependency on the Markov chain structure via the maximal expected hitting time of $s$, namely $\tau_s$. In contrast, the expected waiting time $\mathbb{E}[W_n(s)]$ scales with the expected recurrence time $\rho_s$.
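The exponential tail of first return times can be checked numerically on a small chain. The sketch below computes the exact tail probabilities by propagating the probability of avoiding the state, and compares them with a bound of the form $2 \cdot 2^{-t/(2\tau_s)}$. This form is an assumption of the example: it is what repeatedly applying Markov's inequality over blocks of $2\tau_s$ steps yields, and the constants in Lemma 4.3 may be stated differently. The chain and the function name `return_time_tails` are hypothetical.

```python
def return_time_tails(P, s, t_max):
    """tails[t][x] = probability, starting from x, of no visit to s in steps 1..t."""
    n = len(P)
    q = [1.0] * n
    tails = [q]
    for _ in range(t_max):
        # Move one step to some y != s, then keep avoiding s for the remaining steps.
        q = [sum(P[x][y] * q[y] for y in range(n) if y != s) for x in range(n)]
        tails.append(q)
    return tails


P = [[0.5, 0.5],
     [0.25, 0.75]]
s = 0
tau = 4.0  # maximal expected hitting time of state 0 for this chain (first-step analysis)
t_max = 50

tails = return_time_tails(P, s, t_max)
bound_holds = all(
    max(tails[t]) <= 2.0 * 2.0 ** (-t / (2.0 * tau)) for t in range(t_max + 1)
)
print(tails[1][0], tails[3][0], bound_holds)
```

For this chain the worst-case tail decays like $0.75^t$, comfortably below the assumed bound, which decays like $2^{-t/8} \approx 0.917^t$.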
Theorem 4.5 (Convergence rate over steps).
With probability of at least , for any , the MRP visits state for at least many times, and the last loop estimate converges to
Suppose we run a copy of the loop estimator to estimate each state's value in $\mathcal{S}$, and denote them with a vector $\hat{\mathbf{v}}_T$. Convergence of the estimation error in terms of the $\ell_\infty$ norm follows immediately by applying the union bound.
Corollary 4.6 (Convergence rate over all states).
With probability of at least , for any , the MRP visits each state for at least many times, and the last loop estimates converge to state values
5 Numerical experiments
We consider RiverSwim, an MDP proposed by Strehl & Littman (2008) that is often used to illustrate the challenge of exploration in RL. The MDP consists of six states and two actions . Executing the “swim upstream” action often fails due to the strong current, while there is a high reward for staying in the most upstream state . For our experiments, we use the MRP induced by always taking the “swim upstream” action (see Figure 1 for numerical details).
The most relevant aspect of the induced MRP is that the maximal expected hitting times are very different for different states: , , , , , . Figure 2 shows a plot of the estimation errors of the loop estimator for each state over the square root of the maximal expected hitting time of that state. The observed linear relationship between the two quantities (supported by a good linear fit) is consistent with the instance-dependence in our result of $\tilde{O}(\sqrt{\tau_s / T})$, cf. Theorem 4.5.
5.1 Alternative estimators
We define several alternative estimators for $v(s)$ and briefly mention their relevance for comparison.
Model-based. We compute add-1 smoothed maximum likelihood estimates (MLE) of the MRP parameters from the sample path
(16) 
and
(17) 
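We then solve for the discounted state values from the Bellman equation (5) for the MRP parameterized by $(\hat{P}, \hat{\mathbf{r}})$, i.e., the (column) vector of estimated state values
$$\hat{\mathbf{v}} := (I - \gamma \hat{P})^{-1} \hat{\mathbf{r}}, \tag{18}$$
where $I$ is the identity matrix.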
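The model-based estimator can be sketched as follows. The exact form of the add-1 smoothing is an assumption here: one pseudo-count is added to every transition, and mean rewards are plain empirical averages. The linear system is solved by repeated substitution rather than by a matrix inverse, and the path and the name `model_based_values` are hypothetical.

```python
def model_based_values(path, n_states, gamma, sweeps=2_000):
    """Estimate (P_hat, r_hat, v_hat) from a single sample path.

    path is a list of (state, reward) pairs. Transition counts start at 1
    (assumed add-1 smoothing); rewards are averaged per visited state.
    """
    counts = [[1.0] * n_states for _ in range(n_states)]
    visits = [0] * n_states
    reward_sums = [0.0] * n_states
    for (x, r), (x_next, _) in zip(path, path[1:]):
        counts[x][x_next] += 1.0
        visits[x] += 1
        reward_sums[x] += r
    P_hat = [[c / sum(row) for c in row] for row in counts]
    r_hat = [reward_sums[x] / max(visits[x], 1) for x in range(n_states)]
    # Solve v = r_hat + gamma * P_hat v by fixed-point iteration.
    v = [0.0] * n_states
    for _ in range(sweeps):
        v = [r_hat[i] + gamma * sum(P_hat[i][j] * v[j] for j in range(n_states))
             for i in range(n_states)]
    return P_hat, r_hat, v


# Hypothetical path over states {0, 1}; the reward equals the state index.
states = [0, 1, 1, 0, 0, 1, 0]
path = [(x, float(x)) for x in states]

P_hat, r_hat, v_hat = model_based_values(path, n_states=2, gamma=0.5)
print(P_hat, r_hat, v_hat)
```

The table of transition counts is what drives the $O(S^2)$ space requirement of this approach.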
TD(k). $k$-step temporal difference (or $k$-step backup) estimators are commonly defined recursively (Kearns & Singh, 2000), with TD(0) being a textbook classic (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018). Let
for all states . And for
where $\eta_n$ is the learning rate. A common choice is to set $\eta_n = 1/n$, which satisfies the Robbins-Monro conditions (Bertsekas & Tsitsiklis, 1996). But it has been shown to lead to slower convergence than $\eta_n = 1/n^{\omega}$ where $\omega \in (1/2, 1)$ (Even-Dar & Mansour, 2003).
It is more accurate to consider TD methods as a large family of estimators, each with different choices of $k$ and learning rates. Choosing these parameters can create extra work and sometimes confusion for practitioners. In contrast, the loop estimator, like the model-based estimator, has no parameters to tune. In any case, it is not our intention to compare with the TD family exhaustively (see Kearns & Singh (2000) and Even-Dar & Mansour (2003) for more results on TD). Instead, we will compare with , , both with , and with .
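For reference, the textbook TD(0) update with a polynomial learning rate can be sketched as below. Using a per-state visit counter for the learning rate, the default exponent, and the name `td0` are assumptions of this example and not necessarily the configuration used in the experiments.

```python
def td0(path, n_states, gamma, omega=1.0):
    """TD(0) value estimates from a single sample path.

    path is a list of (state, reward) pairs. The learning rate at the n-th
    update of a state is 1 / n ** omega (omega = 1 gives the 1/n schedule).
    """
    v = [0.0] * n_states
    n_updates = [0] * n_states
    for (x, r), (x_next, _) in zip(path, path[1:]):
        n_updates[x] += 1
        lr = 1.0 / n_updates[x] ** omega
        # Move v[x] toward the one-step bootstrapped target.
        v[x] += lr * (r + gamma * v[x_next] - v[x])
    return v


# Hypothetical path over states {0, 1}; the reward equals the state index.
states = [0, 1, 1, 0, 0, 1, 0]
path = [(x, float(x)) for x in states]

v_td = td0(path, n_states=2, gamma=0.5)
print(v_td)
```

The update of one state's estimate depends on the current estimate of the next state, which is the bootstrapping that ties the estimates together; the estimator stores one value and one counter per state, i.e., $O(S)$ space.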
5.2 Comparative experiments
We experiment with different values for the discount factor $\gamma$, because, roughly speaking, $1/(1 - \gamma)$ sets the horizon beyond which rewards are discounted too heavily to matter. We compare the estimation errors measured in the $\ell_\infty$ norm, which is important in RL. The results are shown in Figure 3.

The model-based estimator dominates all other estimators for every discount setting we tested.

TD(k) estimators perform well if .

The loop estimator performs worse than, but is competitive with, the model-based estimator. Furthermore, similar to the model-based estimator and unlike the TD(k) estimators, its performance seems to be less influenced by discounting.
6 Discussions
The elementary identity below relates the expected first return times to the transition probabilities for a finite Markov chain. Using matrix notation, suppose that the expected first return times are organized in a matrix $M$ with entries $M_{ss'} := \mathbb{E}_s[H_{s'}^+]$, and $P$ is the transition matrix of the Markov chain; then we have
$$M = P(M - M_d) + E,$$
where $M_d$ is a matrix with the same diagonal as $M$ and zero elsewhere, and $E$ is a matrix with all ones. Thus, knowing $M$ is equivalent to knowing the full model, as we can compute $P$ using this identity. Recall that by definition $\rho_s = M_{ss}$, which is exactly the diagonal of $M$. But only knowing the diagonal is not sufficient to determine the entire set of model parameters, namely $P$, so the loop estimator, being based on first return times to a single state, indeed falls short of being a model-based method. It may be considered a semi-model-based method as it estimates some but not all of the model parameters.
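The identity between expected first return times and transition probabilities can be verified numerically. The sketch below assumes the classical form $M = P(M - M_d) + E$, with $M_{ss'}$ the expected first return (or hitting) time from $s$ to $s'$, and checks it entry by entry on a hypothetical two-state chain; the function name is illustrative.

```python
def expected_return_times(P, target, sweeps=2_000):
    """h[x] approximates the expected first time (>= 1) to reach target from x."""
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        h = [1.0 + sum(P[x][y] * h[y] for y in range(n) if y != target)
             for x in range(n)]
    return h


P = [[0.5, 0.5],
     [0.25, 0.75]]
n = len(P)

# M[i][j] = expected first return/hitting time from i to j.
columns = [expected_return_times(P, j) for j in range(n)]
M = [[columns[j][i] for j in range(n)] for i in range(n)]

# M with its diagonal zeroed out, i.e., M - M_d.
M_off = [[M[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]

# Right-hand side P (M - M_d) + E, where E is the all-ones matrix.
rhs = [[sum(P[i][k] * M_off[k][j] for k in range(n)) + 1.0 for j in range(n)]
       for i in range(n)]

max_gap = max(abs(M[i][j] - rhs[i][j]) for i in range(n) for j in range(n))
print(M, max_gap)
```

The diagonal of the computed matrix holds the expected recurrence times (3 and 1.5 here), which are the reciprocals of the stationary probabilities 1/3 and 2/3.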
We believe that regenerative structure can be further exploited in RL (particularly in the form of the loop Bellman equation (10)) and we think this article provides the fundamental results for future study in this direction.
Practically, in many iterative algorithms, some iteration lengths have to be specified, and revisits provide a very natural choice, as suggested by this work in the setting of policy evaluation. Fundamentally, recurrence is a natural and important behavior of Markov chains (Levin et al., 2008; Aldous & Fill, 1999). Perhaps renewal theory has a larger role to play in RL theory for simpler (if not sharper) analyses and new algorithms.
Another possible extension is to investigate how a loop estimator-like algorithm can be scaled to large problems where there are few positive recurrent states. A natural extension is to consider recurrence of features instead of states, e.g., a video game screen might not repeat itself completely but the same items might reappear. After all, without repetition, exact or approximate, it would not be possible for an agent to learn and improve its decisions.
Acknowledgements
We thank Mesrob I. Ohannessian for a helpful discussion on Markov chains.
References
Aldous & Fill (1999) Aldous, D. and Fill, J. Reversible Markov chains and random walks on graphs, 1999. Book in preparation (available at http://www.stat.berkeley.edu/~aldous/RWG/Chap2.pdf).
Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.
Dayan & Sejnowski (1994) Dayan, P. and Sejnowski, T. J. TD($\lambda$) converges with probability 1. Machine Learning, 14(3):295–301, 1994.
Even-Dar & Mansour (2003) Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
Fox & Glynn (1989) Fox, B. L. and Glynn, P. W. Simulating discounted costs. Management Science, 35(11):1297–1315, 1989.
Howard (1960) Howard, R. A. Dynamic programming and Markov processes. John Wiley, 1960.
Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I., and Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pp. 703–710, 1994.
Kearns & Singh (1999) Kearns, M. J. and Singh, S. P. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pp. 996–1002, 1999.
Kearns & Singh (2000) Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In Conference on Learning Theory, pp. 142–147, 2000.
Lee et al. (2013) Lee, C. E., Ozdaglar, A., and Shah, D. Approximating the stationary probability of a single state in a Markov chain. arXiv preprint arXiv:1312.1986, 2013.
Levin et al. (2008) Levin, D. A., Peres, Y., and Wilmer, E. L. Markov chains and mixing times. American Mathematical Soc., 2008.
Puterman (1994) Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
Ross (1996) Ross, S. M. Stochastic processes. John Wiley, 2nd edition, 1996.
Strehl & Littman (2008) Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2nd edition, 2018.
Appendix A Detailed proofs
A.1 Proof of Theorem 4.1
Proof.
Note that since , we have and . Since only state appears here, we will suppress from the random variables below to simplify the notations. We use Assumption 3.1 or the weaker assumption that is positive recurrent, i.e., , to guarantee that with probability 1.
A.2 Proof of Theorem 4.2
Proof.
Since only state appears below, we will suppress it in the interest of conciseness. Consider
By the definition of MRP, we have and . Furthermore, implies that and . Hence the estimation error
Divide by  
With failure probability of at most , from Hoeffding’s inequality, we have
and similarly
Applying the union bound and we have
A.3 Proof of Lemma 4.3
This proof largely follows the proof by Lee et al. (2013) and is presented here for completeness.
Proof.
Suppose , consider the probability of the event that is not visited in the next steps given that it is not visited in the previous steps, that is
(19) 
Let , and apply the above many times to
Apply (19)  
Let  
A.4 Proof of Corollary 4.4
Proof.
For conciseness, we suppress in our notations here since only state appears. Suppose , by Remark 3.1, we have if and for . Note that and distribute identically as . Immediately from Lemma 4.3, we have with failure probability of at most , is bounded
Suppose each fail with probability of at most , then similarly, we have
Applying the union bound, and with probability of at least , we have
∎
A.5 Proof of Theorem 4.5
Proof.
First, we introduce the Lambert W function to invert Corollary 4.4. Recall that the Lambert W function is a transcendental function defined such that $W(x) e^{W(x)} = x$ and thus it is a monotonically increasing function. At step $T$, suppose
By the definition of  
Exponentiate both sides  
Use the fact that if , then . So given , we can lower bound the number of visits
A.6 Proof of Corollary 4.6
Proof.
We run $S$ many copies of loop estimators, one for each state $s \in \mathcal{S}$. Following Theorem 4.5, with failure probability of at most $\delta / S$, we can ensure that each estimator has an error of at most
The largest upper bound comes from the state with the largest maximal expected hitting time $\max_{s \in \mathcal{S}} \tau_s$. Apply the union bound and we have
∎