
Loop estimator for discounted values in Markov reward processes

by   Falcon Z. Dai, et al.

At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the values of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called loop estimator that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of O(1) when estimating the value of a single positive recurrent state s unlike TD (with O(S)) or model-based methods (with O(S^2)). Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of O(√(τ_s/T)) over steps T on a single sample path, where τ_s is the maximal expected hitting time to state s. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.




1 Introduction

The problem of policy evaluation arises naturally in the context of reinforcement learning (RL) (Sutton & Barto, 2018) when one wants to evaluate the (action) values of a policy in a Markov decision process (MDP). In particular, policy iteration (Howard, 1960) is a classic algorithmic framework for solving MDPs that poses and solves a policy evaluation problem during each iteration. Motivated by the setting of reinforcement learning, in which the underlying MDP parameters are unknown, we focus on solving the policy evaluation problem given only a single sample path.

Following a stationary Markov policy in an MDP, i.e., actions are determined based solely on the current state, gives rise to a Markov reward process (MRP) (Puterman, 1994). In this paper, we focus on MRPs and consider the problem of estimating the infinite-horizon discounted state values of an unknown MRP.

A straightforward approach to policy evaluation is to estimate the parameters of the MRP directly. Model-based methods (Bertsekas & Tsitsiklis, 1996) estimate the underlying Markov chain transitions and mean rewards, and then compute the discounted values using the estimated parameters. This approach provides excellent estimates of discounted values, as our numerical experiments show (Section 5). However, model-based estimators suffer from a space complexity of O(S^2), where S is the number of states in the MRP. In contrast, model-free methods enjoy a lower space complexity of O(S) by not explicitly estimating the model parameters (Sutton, 1988), but tend to exhibit a greater estimation error.

A popular class of estimators, k-step bootstrapping temporal difference or TD(k),¹ estimates a state's value based on the estimated values of other states, biasing its current estimates. This complicates the analysis of the algorithm's convergence in the single-sample-path setting: the fact that the estimate of a state's value is tied to the estimates of other states makes it hard to study the convergence of a specific state's value estimate in isolation.

¹An important variant is TD(λ), but we do not include it in our experiments since there is no canonical implementation of the idea of estimating the λ-return (Sutton & Barto, 2018). However, any implementation is expected to exhibit similar behaviors as TD(k), with large k corresponding to large λ (Kearns & Singh, 2000).

In this work, we show that it is possible to achieve O(1) space complexity with provably efficient sample complexity (in the form of a PAC-style bound). The key insight is that it is possible to circumvent the general difficulties of non-independent samples entirely by recognizing the embedded regenerative structure of an MRP. Regenerative processes are central to renewal theory (Ross, 1996), whose main approach is based on the independence between renewals in a stochastic process. We alleviate the reliance on estimates of other states by studying segments of the sample path that start and end in the same state, i.e., loops. This results in a novel and simple algorithm we call the loop estimator (Algorithm 1).

We first review the requisite definitions (Section 3) and then propose the loop estimator (Section 4.2). Unlike prior works that relied on a generative model and reduction arguments based on cover times, we directly establish the convergence of a single state's value estimate in isolation over a single sample path, at a rate independent of the size of the state space. We first analyze the algorithm's rate of convergence over visits to a specified state (Theorem 4.2). Using the exponential concentration of first return times (Lemma 4.3), we relate visits to their waiting times and establish the rate of convergence over steps (Theorem 4.5). Lastly, we obtain convergence in sup-norm over all states via the union bound. Besides the theoretical analysis, we also compare the loop estimator to several other estimators numerically on a commonly used example (Section 5). Finally, we discuss an interesting observation concerning the loop estimator's model-based vs. model-free status (Section 6).

Our main contributions in this paper are three-fold:

  • By recognizing the embedded regenerative structure in MRPs, we derive a new Bellman equation over loops, segments that start and end in the same state.

  • We introduce the loop estimator, a novel algorithm that efficiently estimates the discounted values of an MRP from a single sample path.

  • We formally analyze the algorithm's convergence with renewal theory and establish a rate of O(√(τ_s/T)) over steps T for a single state.

In the interest of a concise presentation, we defer detailed proofs to Appendix A with fully expanded logarithmic factors and constants.

2 Related work

Much work that formally studies the convergence of value estimators (particularly the TD estimators) relies on having access to independent trajectories that start in all the states (Dayan & Sejnowski, 1994; Even-Dar & Mansour, 2003; Jaakkola et al., 1994; Kearns & Singh, 2000). This is sometimes called a generative model. Operationally speaking, independent trajectories can be obtained if we additionally assume some reset mechanism, e.g., that the process restarts upon reaching known cost-free terminal states. Kearns & Singh (1999) consider how a set of independent trajectories can be obtained by following a single sample path via the stationary distribution, but the resulting reduction scales with the cover time of the MRP, which can be quite large, c.f., the maximal expected hitting time in this work. Common to this family of results is the notable absence of an explicit dependency on the structure of the underlying process. In contrast, such a dependency shows up naturally in our results (in the form of τ_s in Theorem 4.5). Similar to Antos et al. (2008), we do not assume access to samples other than a single trajectory from the MRP in our analysis. To ensure that convergence is at all possible, we assume that the specific state to estimate is visited infinitely often with probability 1 (Assumption 3.1); otherwise we cannot hope to (significantly) improve its value estimate after the final visit. We think this assumption is reasonable, as recurrence is an important feature of many Markov chains and it connects naturally to the online (interactive) setting. A possible drawback is the exclusion of transient states, which matters in problems where most of the state space is transient.

Besides the interest in the RL community in the policy evaluation problem (or sometimes, the value prediction problem), operations researchers have also studied estimation, primarily as a computational method to leverage simulations. Classic work by Fox & Glynn (1989) deals with estimating discounted values in a continuous-time setting, and one of its estimators echoes our key insight of leveraging the regenerative structure of MRPs. However, their analysis also relies on independent trajectories. While this assumption is reasonable in the simulation setting, it is not immediately comparable to our setting.

Outside of the studies on reward processes, the regenerative structure of Markov chains has found application in the local computation of PageRank (Lee et al., 2013). We make use of a lemma (Lemma 4.3, whose proof is included in Appendix A.3 for completeness) from this work to establish an upper bound on waiting times (Corollary 4.4). Similar in spirit to the concept of locality studied by Lee et al. (2013), our loop estimator enables space-efficient estimation of a single state's value with a space complexity of O(1) and an error bound without explicit dependency on the size of the state space.

3 Preliminaries

3.1 Markov reward processes and Markov chains

Consider a finite state space S whose size is S := |S|, a transition probability matrix p that specifies the transition probability p(s'|s) between consecutive states s and s', i.e., the (strong) Markov property holds, and a reward function r over states with bounded rewards; then (S, p, r) is called a discrete-time finite Markov reward process (MRP) (Puterman, 1994). Note that (S, p) is an embedded Markov chain with transition law p. Furthermore, we denote the mean reward at state s by r̄(s).
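To make these objects concrete, the following minimal sketch (our own illustration; the dictionary-based representation and function name are not from the paper) samples a path from a finite MRP with deterministic per-state rewards:

```python
import random

def sample_mrp_path(p, r, s0, length, rng=None):
    """Sample a path [(s_0, r_0), (s_1, r_1), ...] from a finite MRP.

    p: dict mapping each state to a list of (next_state, probability) pairs.
    r: dict mapping each state to its (deterministic) reward.
    """
    rng = rng or random.Random(0)
    path, s = [], s0
    for _ in range(length):
        path.append((s, r[s]))
        # Draw the next state according to the transition law p(. | s).
        u, acc = rng.random(), 0.0
        for s_next, prob in p[s]:
            acc += prob
            if u < acc:
                s = s_next
                break
    return path
```

For instance, a two-state chain that alternates deterministically between two states produces the path A, B, A, B, and so on.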

The first step at which a Markov chain visits a state s is called the hitting time to s. Note that if a chain starts at s, then its hitting time to s is 0. We refer to the first time a chain returns to s as the first return time to s.

Definition 3.1 (Expected recurrence time).

Given a Markov chain, we define the expected recurrence time of state s as the expected first return time to s when starting in s.

A state s is positive recurrent if its expected recurrence time is finite.

Definition 3.2 (Maximal expected hitting time (Lee et al., 2013)).

Given a Markov chain, we define the maximal expected hitting time τ_s of state s as the maximal expected first return time to s over starting states.


3.2 Discounted total rewards

In RL, we are generally interested in some expected long-term rewards that will be collected by following a policy. In the infinite-horizon discounted total reward setting, following a Markov policy on an MDP induces an MRP, and the state value of state s is

v(s) := E[ Σ_{t≥0} γ^t r_t | s_0 = s ],

where γ ∈ [0, 1) is the discount factor. Note that since the rewards are bounded, state values are also bounded (by the reward bound times 1/(1 − γ)). A fundamental result relating values to the MRP parameters is the Bellman equation, for each state s (Sutton & Barto, 2018):

v(s) = r̄(s) + γ Σ_{s'} p(s'|s) v(s').   (5)
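As a tiny sanity check of these definitions, consider a hypothetical two-state chain (our toy example, not from the paper) that alternates deterministically between states A and B, with rewards 1 at A and 0 at B. Summing the geometric series gives v(A) = 1/(1 − γ²), and the Bellman equation is consistent with it:

```python
gamma = 0.9

# Closed-form values for the alternating chain A -> B -> A -> ...
# with mean rewards 1 at A and 0 at B.
vA = 1.0 / (1.0 - gamma**2)   # v(A) = 1 + gamma^2 + gamma^4 + ...
vB = gamma * vA               # Bellman at B: v(B) = 0 + gamma * v(A)

# Bellman consistency at A: v(A) = 1 + gamma * v(B).
assert abs(vA - (1.0 + gamma * vB)) < 1e-12
```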


3.3 Problem statement

Suppose that we have a sample path of length T from an MRP whose parameters are unknown. Given a state s and a discount factor γ, we want to estimate v(s).

Assumption 3.1 (State s is reachable).

We will assume that state s is reachable from all states, i.e., the maximal expected hitting time τ_s is finite.

Otherwise there is some non-negligible probability that state s will not be visited from some starting state. This would prevent the convergence in probability (in the form of a PAC-style error bound) that we seek. Assumption 3.1 can be weakened to the assumption that s is positive recurrent and the MRP starts in the recurrent class containing s. However, we will adopt Assumption 3.1 in the rest of the article without much loss of generality,² so we do not need to worry about where the MRP starts. Note that Assumption 3.1 implies the positive recurrence of s, i.e., a finite expected recurrence time, by definition, and that the MRP visits state s infinitely many times with probability 1.

²We can recover similar results by restricting the maximum in the definition of τ_s to the recurrent class containing s.

3.4 Renewal theory and loops

Stochastic processes in general can exhibit complex dependencies between random variables at different steps, and thus often fall outside the applicability of approaches that rely on independence assumptions. Renewal theory (Ross, 1996) focuses on a class of stochastic processes in which the process restarts after each "renewal" event. Such regenerative structure allows us to apply results from the independent and identically distributed (IID) setting.

In particular, we consider the visits to state s as renewal events and define the waiting time, for k = 1, 2, …, to be the number of steps before the k-th visit, and the interarrival time to be the number of steps between the (k − 1)-th and the k-th visit.

Remark 3.1.

The random times relate to each other in a few intuitive ways. The waiting time of the first visit is the same as the hitting time to s. Waiting times relate to interarrival times: each waiting time is the hitting time plus the sum of the subsequent interarrival times.
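Concretely, the visit times and interarrival times can be read off a recorded path; a small sketch (the helper names are ours, not the paper's):

```python
def visit_times(states, s):
    """Steps t at which the path visits state s; the first entry is the
    hitting time, which equals the waiting time of the first visit."""
    return [t for t, x in enumerate(states) if x == s]

def interarrival_times(states, s):
    """Numbers of steps between consecutive visits to s. The waiting time
    of the k-th visit is the hitting time plus the first k-1 interarrival
    times, matching Remark 3.1."""
    v = visit_times(states, s)
    return [b - a for a, b in zip(v, v[1:])]
```

On the path A, B, C, A, B, A the visits to A occur at steps 0, 3, 5, so the interarrival times are 3 and 2, and the waiting time of the third visit is 0 + 3 + 2 = 5.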

To justify treating visits to s as renewal events, consider the sub-processes starting at two consecutive visits to s; both are MRPs starting in state s, and by the Markov property of the MRP they are statistical replicas of each other. Since these segments start and end in the same state, we call them loops. It follows that loops are independent of each other and obey the same statistical law. Intuitively speaking, an MRP is (probabilistically) specified by its starting state.

Definition 3.3 (Loop γ-discounted rewards).

Given a Markov reward process and a positive recurrent state s, we define the k-th loop γ-discounted rewards G_k as the discounted total rewards collected over the k-th loop.

Definition 3.4 (Loop γ-discount).

Given a Markov reward process and a positive recurrent state s, we define the k-th loop γ-discount D_k as the total discounting incurred over the k-th loop, i.e., γ raised to the length of that loop.

The sequence (G_k, D_k) forms a regenerative process with convenient independence relations. Specifically, quantities from distinct loops are independent: G_j is independent of G_k, D_j of D_k, and G_j of D_k whenever j ≠ k. Furthermore, the G_k are identically distributed, and similarly the D_k are identically distributed. Note, however, that G_k and D_k of the same loop are in general not independent.

4 Main results

4.1 Bellman equations over loops

Given the regenerative process of loop rewards and loop discounts, we derive a new Bellman equation over the loops for the state value v(s).

Theorem 4.1 (Loop Bellman equations).

Suppose the expected loop γ-discount is d and the expected loop γ-discounted rewards is g; then we can relate the state value v(s) to itself:

v(s) = g + d · v(s).   (10)

Remark 4.1.

The key difference between the loop Bellman equation (10) and the classic Bellman equation (5) is the state values involved: only the state value v(s) itself appears on the right-hand side of (10).
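The identity can be seen by splitting the discounted return at the first return time τ to s and applying the strong Markov property, under which the post-return process is an independent copy of the MRP started at s (a proof sketch; the full argument is in Appendix A.1):

```latex
v(s)
= \mathbb{E}\Big[\sum_{t=0}^{\tau-1}\gamma^{t} r_t\Big]
  + \mathbb{E}\Big[\gamma^{\tau}\sum_{t=\tau}^{\infty}\gamma^{t-\tau} r_t\Big]
= g + d\, v(s)
\quad\Longrightarrow\quad
v(s) = \frac{g}{1-d}.
```

The division is valid because τ ≥ 1 implies d ≤ γ < 1.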

4.2 Loop estimator

We plug the empirical means of the loop γ-discount, d̂_k, and of the loop γ-discounted rewards, ĝ_k, computed over the first k loops, into the loop Bellman equation (10), and define the k-th loop estimator for the state value

v̂_k(s) := ĝ_k / (1 − d̂_k).   (11)

Furthermore, we have visited state s some K_T times before step T, where K_T is a random variable that counts the number of completed loops before step T, and the estimate v̂_{K_T}(s) is the last estimate before step T. Hence, with a slight abuse of notation, we define

v̂_T(s) := v̂_{K_T}(s).
By using incremental updates to keep track of the empirical means, Algorithm 1 implements the loop estimator with a space complexity of O(1). Running S-many copies of the loop estimator, one for each state, takes a space complexity of O(S).

1:  Input: discount factor γ, state s, sample path of some length T.
2:  Return: an estimate of the discounted value v(s).
3:  Initialize the empirical mean of loop discounts d̂ ← 0.
4:  Initialize the empirical mean of loop discounted rewards ĝ ← 0.
5:  Initialize the loop count k ← 0.
6:  for each loop (a segment between consecutive visits to s) in the sample path do
7:     Increment the loop count k ← k + 1.
8:     Compute the length τ of the interarrival time.
9:     Compute the partial discounted sum of rewards over the loop, G ← Σ_{j=0}^{τ−1} γ^j r_j.
10:     Update the empirical means incrementally, d̂ ← d̂ + (γ^τ − d̂)/k, and ĝ ← ĝ + (G − ĝ)/k.
11:  end for
12:  return ĝ / (1 − d̂)
Algorithm 1 Loop estimator (for a specific state)
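The pseudocode above can be sketched as a single-pass Python function (our rendering of Algorithm 1; the names and path representation are ours): it keeps only the loop count and two running means, hence O(1) space for one state.

```python
def loop_estimator(states, rewards, s, gamma):
    """Estimate v(s) from a single sample path via the loop Bellman equation.

    Maintains only three scalars: the loop count k and the running means of
    the loop discounts (d_bar) and loop discounted rewards (g_bar).
    """
    d_bar, g_bar, k = 0.0, 0.0, 0
    inside = False     # have we reached s yet (i.e., are we inside a loop)?
    G, D = 0.0, 1.0    # discounted rewards and discount of the current loop
    for x, r in zip(states, rewards):
        if x == s:
            if inside:
                # A loop just closed: fold (G, D) into the running means.
                k += 1
                d_bar += (D - d_bar) / k
                g_bar += (G - g_bar) / k
            inside, G, D = True, 0.0, 1.0
        if inside:
            G += D * r     # accumulate the reward, discounted within the loop
            D *= gamma     # total discounting so far within the loop
    if k == 0:
        raise ValueError("state s must be visited at least twice")
    return g_bar / (1.0 - d_bar)
```

On the deterministic alternating two-state chain with reward 1 at A and 0 at B, every loop at A has discounted rewards 1 and discount γ², so the estimate is 1/(1 − γ²), the exact value.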

4.3 Rates of convergence

Now we investigate the convergence of the loop estimator, first over visits, i.e., as the number of loops grows, then over steps, i.e., as the path length T grows. By applying Hoeffding's inequality to the definition of the loop estimator (11), we obtain a PAC-style upper bound on the estimation error.

Theorem 4.2 (Convergence rate over visits).

Given a sample path from an MRP, a discount factor γ, and a positive recurrent state s, with probability of at least 1 − δ, the loop estimator v̂_k(s) converges to v(s) at a rate of O(√(1/k)) over k visits (the logarithmic factors and constants are fully expanded in Appendix A).

To determine the convergence rate over steps, we need to study the concentration of waiting times, which allows us to lower-bound the number of visits with high probability. As an intermediate step, we use the fact that the tail of the distribution of first return times is upper-bounded by an exponential distribution, which follows from the Markov property of the MRP together with Markov's inequality (Lee et al., 2013; Aldous & Fill, 1999).

Lemma 4.3 (Exponential concentration of first return times (Lee et al., 2013; Aldous & Fill, 1999)).

Given a Markov chain defined on a finite state space S, for any state s and any t ≥ 0, the probability that the first return time to s exceeds t decays exponentially in t, at a rate governed by the maximal expected hitting time τ_s.

Secondly, since by Remark 3.1 waiting times are sums of interarrival times (plus the hitting time), we apply the union bound to upper-bound the tail of waiting times.

Corollary 4.4 (Upper bound on waiting times).

With probability of at least 1 − δ, the waiting time of the k-th visit to s is bounded by a quantity nearly linear in k that scales with τ_s (see Appendix A.4 for the explicit bound).

Remark 4.2.

Note that the high-probability bound on the waiting time is nearly linear in k, with a dependency on the Markov chain structure via the maximal expected hitting time of s, namely τ_s. In contrast, the expected waiting time scales with the expected recurrence time of s.

Finally, we put Corollary 4.4 and Theorem 4.2 together to establish the convergence rate of v̂_T(s).

Theorem 4.5 (Convergence rate over steps).

With probability of at least 1 − δ, for any sufficiently large T, the MRP visits state s at least a guaranteed number of times, and the last loop estimate v̂_T(s) converges to v(s) at a rate of O(√(τ_s/T)) over steps T.

Suppose we run a copy of the loop estimator to estimate each state's value in S, and denote the estimates with a vector. Convergence of the estimation error in terms of the sup-norm follows immediately by applying the union bound.

Corollary 4.6 (Convergence rate over all states).

With probability of at least 1 − δ, for any sufficiently large T, the MRP visits each state the guaranteed number of times, and the last loop estimates converge to the state values at a rate of O(√(τ/T)) in sup-norm, where τ := max_s τ_s.

5 Numerical experiments

We consider RiverSwim, an MDP proposed by Strehl & Littman (2008) that is often used to illustrate the challenge of exploration in RL. The MDP consists of six states and two actions. Executing the "swim upstream" action often fails due to the strong current, while there is a high reward for staying in the most upstream state. For our experiments, we use the MRP induced by always taking the "swim upstream" action (see Figure 1 for numerical details).

The most relevant aspect of the induced MRP is that the maximal expected hitting times are very different for different states. Figure 2 shows a plot of the estimation errors of the loop estimator for each state over the square root of the maximal expected hitting time of that state. The observed linear relationship between the two quantities (supported by a good linear fit) is consistent with the instance-dependent τ_s in our result of O(√(τ_s/T)), c.f., Theorem 4.5.

Figure 1: The induced RiverSwim MRP. The arrows are labeled with transition probabilities. The rewards are all zero except at the most upstream state.
Figure 2: With a fixed discount factor, the estimation error of each state (normalized) is plotted over the square root of the maximal expected hitting time of that state. The error bars show the standard deviations over 200 runs.

5.1 Alternative estimators

We define several alternative estimators for v(s) and briefly mention their relevance for comparison.

Model-based. We compute add-1 smoothed maximum likelihood estimates (MLE) of the MRP parameters (p̂, r̂) from the sample path. We then solve for the discounted state values from the Bellman equation (5) for the MRP parameterized by (p̂, r̂), i.e., the (column) vector of estimated state values is

v̂ := (I − γ p̂)⁻¹ r̂,

where I is the identity matrix.
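A minimal sketch of this baseline (our own code, not the paper's; for simplicity it replaces the matrix inverse with fixed-point iteration of v ← r̂ + γ p̂ v, which converges since the update is a γ-contraction):

```python
def model_based_estimate(states, rewards, gamma, state_space, iters=2000):
    """Add-1 smoothed MLE of the MRP parameters from one path, followed by
    solving the Bellman equation by fixed-point iteration."""
    # Add-1 smoothed transition counts and empirical mean rewards.
    n = {s: {x: 1 for x in state_space} for s in state_space}
    r_sum = {s: 0.0 for s in state_space}
    r_cnt = {s: 0 for s in state_space}
    for t in range(len(states) - 1):
        n[states[t]][states[t + 1]] += 1
        r_sum[states[t]] += rewards[t]
        r_cnt[states[t]] += 1
    p = {s: {x: n[s][x] / sum(n[s].values()) for x in state_space}
         for s in state_space}
    r_bar = {s: (r_sum[s] / r_cnt[s] if r_cnt[s] else 0.0) for s in state_space}
    # Iterate v <- r_bar + gamma * p v; the error contracts by gamma per sweep.
    v = {s: 0.0 for s in state_space}
    for _ in range(iters):
        v = {s: r_bar[s] + gamma * sum(p[s][x] * v[x] for x in state_space)
             for s in state_space}
    return v
```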

TD(k). k-step temporal difference (or k-step backup) estimators are commonly defined recursively (Kearns & Singh, 2000), with TD(0) being a textbook classic (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018). Let the initial estimates be zero for all states. At each step t, the estimate of the current state is updated toward the k-step bootstrapped return,

v̂(s_t) ← (1 − α_t) v̂(s_t) + α_t (r_t + γ r_{t+1} + ⋯ + γ^k r_{t+k} + γ^{k+1} v̂(s_{t+k+1})),

where α_t is the learning rate. A common choice is to set α_t proportional to 1/t, which satisfies the Robbins–Monro conditions (Bertsekas & Tsitsiklis, 1996), but it has been shown to lead to slower convergence than polynomial learning rates α_t = 1/t^ω with ω ∈ (1/2, 1) (Even-Dar & Mansour, 2003).
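For reference, a sketch of the simplest member, TD(0), with the polynomial per-state learning rate discussed above (our own code; the per-state visit-count schedule is one common instantiation, not necessarily the paper's):

```python
def td0(states, rewards, gamma, state_space, omega=0.6):
    """TD(0) on a single sample path, with learning rate
    1 / (visit count)^omega for omega in (1/2, 1)."""
    v = {s: 0.0 for s in state_space}
    visits = {s: 0 for s in state_space}
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        visits[s] += 1
        alpha = 1.0 / visits[s] ** omega
        # Move v(s) toward the one-step bootstrapped target r_t + gamma * v(s_{t+1}).
        v[s] += alpha * (rewards[t] + gamma * v[s_next] - v[s])
    return v
```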

It is more accurate to consider TD methods as a large family of estimators, each member specified by choices of k and α_t. Choosing these parameters can create extra work and sometimes confusion for practitioners, whereas the loop estimator, like the model-based estimator, has no parameters to tune. In any case, it is not our intention to compare with the TD family exhaustively (see (Kearns & Singh, 2000; Even-Dar & Mansour, 2003) for more results on TD). Instead, we compare with a few representative choices of k and learning-rate schedules.

5.2 Comparative experiments

We experiment with different values of the discount factor γ because, roughly speaking, the discount factor sets the horizon beyond which rewards are discounted too heavily to matter. We compare the estimation errors measured in the sup-norm, which is important in RL. The results are shown in Figure 3.

  • The model-based estimator dominates all other estimators for every discount setting we tested.

  • TD(k) estimators perform well when the discount factor is small.

  • The loop estimator performs worse than, but is competitive with, the model-based estimator. Furthermore, similar to the model-based estimator and unlike the TD(k) estimators, its performance seems to be less influenced by discounting.

Figure 3: Estimation errors (normalized to be comparable across discount factors) of the different estimators at different discount factors γ. Shaded areas represent the standard deviations over 200 runs. Note the vertical log scale.

6 Discussions

The elementary identity below relates the expected first return times to the transition probabilities of a finite Markov chain. Using matrix notation, suppose that the expected first return times are organized in a matrix M, and let P denote the transition matrix of the Markov chain; then we have

M = J + P (M − Δ),

where Δ is a matrix with the same diagonal as M and zeros elsewhere, and J is a matrix of all ones. Thus, knowing M is equivalent to knowing the full model, as we can compute P using this identity. Recall that the expected recurrence times are, by definition, exactly the diagonal of M. But knowing only the diagonal is not sufficient to determine the entire set of model parameters, namely P, so the loop estimator indeed falls short of being a model-based method. It may be considered a semi-model-based method, as it estimates some but not all of the model parameters.
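The identity is easy to verify numerically; a small sketch (our own code) computes M for a two-state chain by fixed-point iteration of the first-step equations m_ij = 1 + Σ_{k≠j} P_ik m_kj, and then checks M = J + P(M − Δ):

```python
def first_return_matrix(P, iters=500):
    """Expected first passage/return times M for a finite irreducible chain,
    via fixed-point iteration of m[i][j] = 1 + sum_{k != j} P[i][k] m[k][j]."""
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        M = [[1.0 + sum(P[i][k] * M[k][j] for k in range(n) if k != j)
              for j in range(n)] for i in range(n)]
    return M

# Chain: from state 0, stay with prob 0.5 or move to 1; state 1 always moves to 0.
P = [[0.5, 0.5], [1.0, 0.0]]
M = first_return_matrix(P)
# Check the identity M = J + P (M - Delta) entrywise.
for i in range(2):
    for j in range(2):
        rhs = 1.0 + sum(P[i][k] * (M[k][j] - (M[j][j] if k == j else 0.0))
                        for k in range(2))
        assert abs(M[i][j] - rhs) < 1e-9
```

For this chain the diagonal of M gives the expected recurrence times 1.5 and 3, while the off-diagonal entries are the expected first passage times 2 and 1.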

We believe that the regenerative structure can be further exploited in RL (particularly in the form of the loop Bellman equation (10)), and we think this article provides the fundamental results for future study in this direction.

Practically, in many iterative algorithms some iteration lengths have to be specified, and revisits provide a very natural choice, as suggested by this work in the setting of policy evaluation. Fundamentally, recurrence is a natural and important behavior of Markov chains (Levin et al., 2008; Aldous & Fill, 1999). Perhaps renewal theory has a larger role to play in RL theory, both for simpler (if not sharper) analyses and for new algorithms.

Another possible extension is to investigate how a loop estimator-like algorithm can scale to large problems where there are few positive recurrent states. A natural extension is to consider the recurrence of features instead of states; e.g., a video game screen might not repeat itself completely, but the same items might reappear. After all, without repetition, exact or approximate, it would not be possible for an agent to learn and improve its decisions.


We thank Mesrob I. Ohannessian for a helpful discussion on Markov chains.


  • Aldous & Fill (1999) Aldous, D. and Fill, J. Reversible Markov chains and random walks on graphs, 1999. Book in preparation.
  • Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996.
  • Dayan & Sejnowski (1994) Dayan, P. and Sejnowski, T. J. TD(λ) converges with probability 1. Machine Learning, 14(3):295–301, 1994.
  • Even-Dar & Mansour (2003) Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
  • Fox & Glynn (1989) Fox, B. L. and Glynn, P. W. Simulating discounted costs. Management Science, 35(11):1297–1315, 1989.
  • Howard (1960) Howard, R. A. Dynamic programming and Markov processes. John Wiley, 1960.
  • Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I., and Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pp. 703–710, 1994.
  • Kearns & Singh (1999) Kearns, M. J. and Singh, S. P. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pp. 996–1002, 1999.
  • Kearns & Singh (2000) Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In Conference on Learning Theory, pp. 142–147, 2000.
  • Lee et al. (2013) Lee, C. E., Ozdaglar, A., and Shah, D. Approximating the stationary probability of a single state in a Markov chain. arXiv preprint arXiv:1312.1986, 2013.
  • Levin et al. (2008) Levin, D. A., Peres, Y., and Wilmer, E. L. Markov chains and mixing times. American Mathematical Soc., 2008.
  • Puterman (1994) Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
  • Ross (1996) Ross, S. M. Stochastic processes. John Wiley, 2nd edition, 1996.
  • Strehl & Littman (2008) Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2nd edition, 2018.

Appendix A Detailed proofs

A.1 Proof of Theorem 4.1


Note that since each loop takes at least one step, the loop γ-discount is at most γ < 1, and hence so is its expectation; dividing by one minus the expected loop γ-discount is therefore valid. Since only state s appears here, we will suppress s from the random variables below to simplify the notation. We use Assumption 3.1, or the weaker assumption that s is positive recurrent, i.e., has a finite expected recurrence time, to guarantee that every loop completes in finite time with probability 1.

A.2 Proof of Theorem 4.2


Since only state s appears below, we will suppress it in the interest of conciseness. Consider

By the definition of the MRP, the loop discounted rewards and the loop discounts are bounded. Hence the estimation error can be bounded by the deviations of the two empirical means from their expectations, after dividing by the denominator, which stays bounded away from zero since every loop discount is at most γ. With failure probability of at most δ/2, from Hoeffding's inequality, we have

and similarly

Applying the union bound, we have

A.3 Proof of Lemma 4.3

This proof largely follows the proof by Lee et al. (2013) and is presented here for completeness.


Suppose the chain has not visited s during some initial stretch of steps; consider the probability of the event that s is also not visited in the next block of steps, given that it was not visited in the previous steps, that is

Let the elapsed time be divided into such blocks, and apply the above bound once per block to

Apply (19)

A.4 Proof of Corollary 4.4


For conciseness, we suppress s in our notations here since only state s appears. By Remark 3.1, the waiting time decomposes into the hitting time plus the subsequent interarrival times. Note that these interarrival times are identically distributed as the first return time. Immediately from Lemma 4.3, we have that with failure probability of at most an allotted share, each of these terms is bounded

Suppose each fails with probability of at most its allotted share; then similarly, we have

Applying the union bound, with probability of at least 1 − δ, we have

A.5 Proof of Theorem 4.5


First, we introduce the Lambert W function to invert Corollary 4.4. Recall that the Lambert W function is a transcendental function defined such that W(x) e^{W(x)} = x, and thus it is monotonically increasing (on the principal branch). At step T, suppose

By the definition of the Lambert W function,
Exponentiate both sides

Using a standard lower bound on the Lambert W function, given the step count T, we can lower bound the number of visits

Plugging this into Theorem 4.2, we obtain the desired bound.

A.6 Proof of Corollary 4.6


We run S-many copies of the loop estimator, one for each state s in S. Following Theorem 4.5, with failure probability of at most δ/S per state, we can ensure that each estimator has an error of at most the stated bound.

The largest upper bound comes from the state with the largest maximal expected hitting time, τ := max_s τ_s. Applying the union bound, we obtain the desired result.