1 Introduction
Policy iteration is a key computational tool used in the study of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) problems. In traditional policy iteration for MDPs, at each iteration, the value function associated with a policy is computed exactly and a new policy is chosen greedily with respect to this value function [bertsekasvolI, bersekasvolII, bertsekastsitsiklis, suttonbarto]. It can be shown that, under policy iteration, the value function decreases with each iteration. In the case of a finite state and action space, the optimal policy is reached in a finite number of iterations. However, computing the exact value function corresponding to each policy can be computationally prohibitive or impossible, especially in an RL setting where the MDP is unknown.
To analyze these settings, optimistic policy iteration (OPI) methods have been studied, which assume that at each iteration, only a noisy estimate of the exact value function for the current policy is available. We consider the variant studied in [tsitsiklis2002convergence], where at each iteration, we only have access to a noisy, but unbiased, estimate of the value function associated with a policy. This estimate is obtained by simulation using a Monte Carlo approach. The Markov process corresponding to a particular policy is simulated and the corresponding value function is estimated by taking the infinite sum of discounted costs. The key idea in [tsitsiklis2002convergence] is to use stochastic approximation to update the value function using the noisy estimates. Their main results consider a synchronous version of OPI where the value functions of all states are updated simultaneously, but extensions to cases where an initial state is chosen randomly are discussed.
In this variant of OPI, we have a choice of updating the value associated with the initial state selected at each iteration or the values of all states visited in the Monte Carlo simulation at each iteration. In the former case, the results in [tsitsiklis2002convergence] apply almost directly. In this paper, we provide a convergence proof for the latter case under some structural assumptions about the MDP. We also extend the results to the following cases: (i) stochastic shortest-path problems (see [Yuanlong] for an extension of the work in [tsitsiklis2002convergence] to stochastic shortest-path problems), (ii) zero-sum games (see [patekthesis] for extensions of MDP tools to zero-sum games), and (iii) aggregation, when we know a priori which states have the same value functions.
2 Definitions and Assumptions
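Let $\mathcal{M}$ be a discounted Markov Decision Process (MDP) with discount factor $\alpha \in (0,1)$ and finite state space $\mathcal{S}$. Denote the finite action space associated with state $i \in \mathcal{S}$ by $\mathcal{A}(i)$. When action $u \in \mathcal{A}(i)$ is taken at state $i$, we let $P_{ij}(u)$ be the probability of transitioning from state $i$ to state $j$. For every state and action pair $(i,u)$, we are also given a finite, deterministic cost $g(i,u)$ of being in state $i$ and taking action $u$.
A policy $\mu$ is a mapping with $\mu(i) \in \mathcal{A}(i)$ for every $i \in \mathcal{S}$. Policy $\mu$ induces a Markov chain $X^\mu$ on $\mathcal{S}$ with transition probabilities
$$P(X^\mu_{k+1} = j \mid X^\mu_k = i) = P_{ij}(\mu(i)),$$
where $X^\mu_k$ is the state of the Markov chain after $k$ time steps.
We assume that the distribution for the initial state $X^\mu_0$ is $p$ for all policies $\mu$. The distribution $p$ and the transition probabilities $P_{ij}(\mu(i))$ determine $q_\mu(i)$, the probability of the Markov chain $X^\mu$ ever reaching state $i$ from the initial state $X^\mu_0$. In other words,
$$q_\mu(i) = P(X^\mu_k = i \text{ for some } k \geq 0).$$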
In order to ensure sufficient exploration of all of the states, we assume the following:
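Assumption 1.
Every state is reached with positive probability under every policy: $q_\mu(i) > 0$ for all policies $\mu$ and all states $i \in \mathcal{S}$.
Since there are finitely many policies, there exists $\delta > 0$ such that $q_\mu(i) \geq \delta$ for all $\mu$ and $i$. Furthermore, we make the following assumption about state transitions in our MDP: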
Assumption 2.
For any states $i, j \in \mathcal{S}$ and actions $u, v \in \mathcal{A}(i)$, $P_{ij}(u) > 0$ if and only if $P_{ij}(v) > 0$.
Thus, the set of states that can be reached from any state in one step is the same under any policy. The above assumptions are usually satisfied in practice since one explores all actions with at least some small probability in each state; examples of such exploration strategies include epsilon-greedy and Boltzmann exploration. Given this assumption, we can define a one-step reachability graph of our MDP independently of any policy. We define the reachability graph as the directed graph $G = (V, E)$ where $V = \mathcal{S}$ and $E = \{(i,j) : P_{ij}(u) > 0 \text{ for some } u \in \mathcal{A}(i)\}$.
We now further classify $\mathcal{S}$ into transient and recurrent classes as follows:
$$\mathcal{S} = \mathcal{T} \cup \mathcal{R}_1 \cup \cdots \cup \mathcal{R}_m.$$
Here, $\mathcal{T}$ is the set of transient states and $\mathcal{R}_1, \ldots, \mathcal{R}_m$ are disjoint, irreducible, closed recurrent classes. Assumption 2 allows us to drop the dependence on the policy in the decomposition.
We are now ready to state our third assumption, which is also illustrated in Figure 1.
Assumption 3.
The subgraph of the reachability graph induced by the set of transient states, which we denote by $G(\mathcal{T})$, is acyclic.
Although restrictive, this assumption arises naturally in some problems. For example, many existing works, such as [Jordan], assume a finite time horizon. They augment the state with a time-dependent parameter, naturally making the state transitions acyclic, as it is impossible to transition to an earlier point in time.
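The structural assumptions of this section can be checked mechanically on a small example. The following is a minimal sketch, not part of the original analysis: it assumes transition probabilities are stored as a hypothetical nested list `P[i][u][j]`, builds the one-step reachability graph, identifies transient states using the standard fact that in a finite chain a state is transient exactly when it can reach some state that cannot reach it back, and tests whether the transient part is acyclic. All function names are illustrative.

```python
def reachability_graph(P):
    """Edge (i, j) exists if some action at state i moves to j with positive probability.

    P[i][u][j] is an assumed layout: state i, action index u, next state j.
    """
    n = len(P)
    return {i: {j for row in P[i] for j in range(n) if row[j] > 0} for i in range(n)}


def reachable_from(graph, i):
    """All states reachable from i in one or more steps (depth-first search)."""
    seen, stack = set(), list(graph[i])
    while stack:
        j = stack.pop()
        if j not in seen:
            seen.add(j)
            stack.extend(graph[j])
    return seen


def transient_states(graph):
    """A state is transient if it can reach a state that cannot reach it back."""
    reach = {i: reachable_from(graph, i) for i in graph}
    return {i for i in graph if any(i not in reach[j] for j in reach[i])}


def transient_subgraph_is_acyclic(graph):
    """Any cycle through a transient state stays inside the transient set,
    so it suffices to check that no transient state can return to itself."""
    reach = {i: reachable_from(graph, i) for i in graph}
    return all(i not in reach[i] for i in transient_states(graph))
```

For instance, a four-state chain in which states 0, 1, 2 only move "forward" toward an absorbing state 3 passes the check, while adding an edge from state 1 back to state 0 makes it fail.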
3 Reinforcement Learning Preliminaries
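To define and analyze our algorithm, we will need several standard definitions and results from dynamic programming and reinforcement learning. Throughout, $\alpha$ denotes the discount factor, $P_{ij}(u)$ the transition probabilities, and $g(i,u)$ the cost of taking action $u$ in state $i$. First, we define the cost-to-go or value function $J^\mu$ as the expected cumulative discounted cost when following policy $\mu$, starting from state $i$:
$$J^\mu(i) = E\left[\sum_{k=0}^{\infty} \alpha^k g(X^\mu_k, \mu(X^\mu_k)) \,\Big|\, X^\mu_0 = i\right].$$
$J^\mu$ solves the Bellman equation:
$$J^\mu(i) = g(i,\mu(i)) + \alpha \sum_{j} P_{ij}(\mu(i)) J^\mu(j). \qquad (1)$$
Now, we define an optimal policy, $\mu^*$, to be a policy that attains $\min_{\mu} J^\mu(i)$ for every state $i$. Under our assumptions, $\mu^*$ always exists. $J^* = J^{\mu^*}$ is known as the optimal value function and satisfies the following Bellman equation:
$$J^*(i) = \min_{u \in \mathcal{A}(i)} \left[ g(i,u) + \alpha \sum_{j} P_{ij}(u) J^*(j) \right]. \qquad (2)$$
For an arbitrary vector $J$, we introduce the optimal Bellman operator $T$:
$$(TJ)(i) = \min_{u \in \mathcal{A}(i)} \left[ g(i,u) + \alpha \sum_{j} P_{ij}(u) J(j) \right]. \qquad (3)$$
Our primary goal is to find $J^*$ and $\mu^*$. Towards this objective, we introduce the Bellman operator $T_\mu$, where the $i$th component of $T_\mu J$ is
$$(T_\mu J)(i) = g(i,\mu(i)) + \alpha \sum_{j} P_{ij}(\mu(i)) J(j), \qquad (4)$$
so that (1) can be written as $J^\mu = T_\mu J^\mu$.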
Policy iteration is a basic iterative algorithm for finding $J^*$ and $\mu^*$. Each iteration starts with a value function $J_k$ and then performs “policy improvement” to produce a policy $\mu_k$ and “policy evaluation” to produce the next value function $J_{k+1}$. Policy improvement finds the greedy policy with respect to $J_k$ by solving $T_{\mu_k} J_k = T J_k$. Policy evaluation finds the value function $J^{\mu_k}$ of the current policy by solving the Bellman equation (1), and sets $J_{k+1} = J^{\mu_k}$. The key to convergence is that $J_k$ strictly improves at every step, in the sense that $J_{k+1} \leq J_k$, with equality if and only if $\mu_k = \mu^*$ and $J_k = J^*$. Since $\mu_k$ belongs to a finite set, policy iteration is guaranteed to converge in a finite number of iterations.
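The improvement and evaluation steps described above can be illustrated on a toy problem. This is a minimal sketch under stated assumptions rather than a reference implementation: the nested-list layouts `P[i][u][j]` and `g[i][u]`, the function names, and the tolerance are all hypothetical, and policy evaluation is approximated by repeatedly applying the one-policy Bellman update until it stops changing, instead of solving the linear system exactly. The loop also starts from a policy rather than from a value function.

```python
def q_value(P, g, alpha, J, i, u):
    """One-step lookahead cost: g(i,u) + alpha * sum_j P_ij(u) * J(j)."""
    return g[i][u] + alpha * sum(p * Jj for p, Jj in zip(P[i][u], J))


def evaluate_policy(P, g, alpha, mu, tol=1e-12):
    """Policy evaluation: iterate the fixed-policy Bellman update to a fixed point."""
    J = [0.0] * len(P)
    while True:
        new_J = [q_value(P, g, alpha, J, i, mu[i]) for i in range(len(P))]
        if max(abs(a - b) for a, b in zip(new_J, J)) < tol:
            return new_J
        J = new_J


def greedy_policy(P, g, alpha, J):
    """Policy improvement: choose a cost-minimizing action in every state."""
    return [min(range(len(P[i])), key=lambda u: q_value(P, g, alpha, J, i, u))
            for i in range(len(P))]


def policy_iteration(P, g, alpha, mu):
    """Alternate evaluation and improvement until the greedy policy is stable."""
    while True:
        J = evaluate_policy(P, g, alpha, mu)
        new_mu = greedy_policy(P, g, alpha, J)
        if new_mu == mu:
            return mu, J
        mu = new_mu
```

As a usage example, take two states where state 0 can either stay (cost 1) or pay 2 to jump to a zero-cost absorbing state 1: `policy_iteration([[[1, 0], [0, 1]], [[0, 1]]], [[1, 2], [0]], 0.9, [0, 0])` switches state 0 to the jump action after one improvement step.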
Calculating $J^{\mu_k}$ in each step of policy iteration can be computationally expensive, and the results of policy iteration cannot be easily extended when the transition probabilities and costs are not known. Optimistic policy iteration refers to a variant of policy iteration where some approximation of $J^{\mu_k}$ is used instead of calculating it directly. In [tsitsiklis2002convergence], assuming that $P_{ij}(u)$ are known for all $i, j$ and $u$ and that $g(i,u)$ are known for all $i$ and $u$, it was shown that an optimistic policy iteration algorithm using a Monte Carlo simulation for policy evaluation converges to $J^*$. Here, we consider a variant suggested in [tsitsiklis2002convergence] which can lead to faster convergence.
4 The Algorithm
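The algorithm we consider is as follows. Like policy iteration, we start with an initial vector $J_0$ and iteratively update $J_t$. For each update at time $t$, we take vector $J_t$ and obtain
$$\mu_t = \arg\min_{\mu} T_\mu J_t, \qquad (5)$$
the greedy policy with respect to $J_t$. Then, the algorithm independently selects a state $i_t$ according to nonuniform probabilities $p(i)$, $i \in \mathcal{S}$. We then simulate a trajectory $X_t$ that starts at state $i_t$ and follows policy $\mu_t$ at time $t$. The trajectory is a realization of a Markov chain $\{X_t^k\}_{k \geq 0}$ where $X_t^0 = i_t$ and $P(X_t^{k+1} = j \mid X_t^k = i) = P_{ij}(\mu_t(i))$.
Instead of using (1) to compute $J^{\mu_t}$, we use this trajectory to generate an unbiased estimate $\tilde{J}^{\mu_t}$ of $J^{\mu_t}$ using the tail costs of the first time each state is visited by the trajectory.
To formalize $\tilde{J}^{\mu_t}$, we introduce the hitting time $\tau_t(i)$ of state $i$ in the trajectory $X_t$ as follows:
$$\tau_t(i) = \min\{k \geq 0 : X_t^k = i\},$$
with $\tau_t(i) = \infty$ if state $i$ is never visited.
When $\tau_t(i)$ is finite, $\tilde{J}^{\mu_t}(i)$ can be defined in terms of $\tau_t(i)$ as
$$\tilde{J}^{\mu_t}(i) = \sum_{k=\tau_t(i)}^{\infty} \alpha^{k - \tau_t(i)} g(X_t^k, \mu_t(X_t^k)).$$
Otherwise, $\tilde{J}^{\mu_t}(i) = 0$. Then, for every state visited by the trajectory, $i \in X_t$, we update $J_t$ as follows:
$$J_{t+1}(i) = (1 - \gamma_t(i)) J_t(i) + \gamma_t(i) \tilde{J}^{\mu_t}(i), \qquad (6)$$
where $\gamma_t(i)$ is a component-dependent step size. In order to analyze this algorithm, it is helpful to rewrite it in a form similar to a stochastic approximation iteration. We introduce a random variable $w_t$ to capture the noise present in $\tilde{J}^{\mu_t}$. When $i \in X_t$, we define $w_t(i) = \tilde{J}^{\mu_t}(i) - J^{\mu_t}(i)$. Otherwise, we let $w_t(i) = 0$. With this choice, we can rewrite our iterates as
$$J_{t+1}(i) = \begin{cases} (1 - \gamma_t(i)) J_t(i) + \gamma_t(i) \left( J^{\mu_t}(i) + w_t(i) \right), & i \in X_t, \\ J_t(i), & i \notin X_t. \end{cases} \qquad (7)$$
We now introduce a random variable $v_t$ which incorporates the randomness present in the event $\{i \in X_t\}$, similar to the random variable used in [tsitsiklis2002convergence], and rewrite (7) as
$$J_{t+1}(i) = \left(1 - q_{\mu_t}(i) \gamma_t(i)\right) J_t(i) + q_{\mu_t}(i) \gamma_t(i) \left( J^{\mu_t}(i) + w_t(i) + v_t(i) \right), \qquad (8)$$
where
$$v_t(i) = \left( \frac{I_{\{i \in X_t\}}}{q_{\mu_t}(i)} - 1 \right) \left( J^{\mu_t}(i) + w_t(i) - J_t(i) \right).$$
Recall that $q_{\mu_t}(i)$ is the probability of ever reaching node $i$ using policy $\mu_t$.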
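One iteration of the trajectory-based update in this section can be sketched in code. The following is an illustrative sketch only, with several assumptions that are not in the original: the data layouts `P[i][u][j]` and `g[i][u]` and all function names are hypothetical, the initial state is drawn uniformly, a single scalar `step` replaces the component-dependent step size, and the infinite trajectory is truncated after `horizon` steps, which introduces a small bias that the analysis does not have.

```python
import random


def first_visit_tail_costs(states, costs, alpha):
    """Discounted tail cost measured from the FIRST visit to each state.

    Scanning the trajectory backwards, later visits are overwritten by
    earlier ones, so the value kept for a state is the first-visit tail.
    """
    tail, returns = 0.0, {}
    for s, c in zip(reversed(states), reversed(costs)):
        tail = c + alpha * tail
        returns[s] = tail
    return returns


def opi_step(P, g, alpha, J, step, rng, horizon=200):
    """One optimistic policy iteration update over every visited state."""
    n = len(P)

    def q(i, u):
        return g[i][u] + alpha * sum(p * Jj for p, Jj in zip(P[i][u], J))

    # Greedy policy with respect to the current estimate J.
    mu = [min(range(len(P[i])), key=lambda u: q(i, u)) for i in range(n)]
    # Simulate a (truncated) trajectory from a uniformly random start state.
    i = rng.randrange(n)
    states, costs = [], []
    for _ in range(horizon):
        states.append(i)
        costs.append(g[i][mu[i]])
        i = rng.choices(range(n), weights=P[i][mu[i]])[0]
    # Move each visited state's value toward its first-visit tail cost.
    new_J = list(J)
    for s, ret in first_visit_tail_costs(states, costs, alpha).items():
        new_J[s] = (1 - step) * J[s] + step * ret
    return new_J
```

Calling `opi_step` repeatedly with an `rng = random.Random(seed)` and a diminishing `step` mimics the iteration; states that the trajectory never visits keep their previous values.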
5 Main Result
The main result of our paper is establishing the convergence of the above algorithm. However, in order to establish convergence, we have to specify the step size $\gamma_t(i)$. We consider two choices of step sizes: deterministic, state-independent step sizes and state-dependent step sizes which decrease when state $i$ is visited. These step sizes are assumed to satisfy fairly standard assumptions for stochastic approximation algorithms. We assume there is some function $\beta : \mathbb{N} \to \mathbb{R}^+$ such that
$$\sum_{t=0}^{\infty} \beta(t) = \infty, \qquad \sum_{t=0}^{\infty} \beta^2(t) < \infty,$$
and we assume that there exists some constant $T$ such that $\beta(t)$ is nonincreasing for $t \geq T$. Then, our choices of step sizes are:

Deterministic step size $\gamma_t(i) = \beta(t)$: This choice is simple to implement and does not depend on state $i$, but may converge more slowly than necessary, since even states which are rarely updated can have a small step size as time progresses. The condition that $\beta(t)$ is nonincreasing for large $t$ can be relaxed for this case.

State-dependent step size $\gamma_t(i) = \beta(n_t(i))$: Here, $n_t(i)$ is the number of times state $i$ was ever reached before time $t$ ($n_t(i) = \sum_{\tau < t} I_{\{i \in X_\tau\}}$, with $X_\tau$ the trajectory simulated at time $\tau$), where $I$ represents the indicator function. Thus, we only change the step size for state $i$ when state $i$ is visited.
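The two step-size rules can be contrasted in a few lines of code. This is a hedged sketch: the harmonic schedule `beta(k) = 1 / (k + 1)` is only one assumed example of a sequence whose sum diverges while its sum of squares converges, and the names `deterministic_step` and `VisitCountStepSize` are hypothetical.

```python
from collections import Counter


def beta(k):
    """Assumed example schedule: sum diverges, sum of squares converges."""
    return 1.0 / (k + 1)


def deterministic_step(t):
    """State-independent rule: the step size depends only on the iteration index."""
    return beta(t)


class VisitCountStepSize:
    """State-dependent rule: the step size depends on how many earlier
    trajectories visited the state."""

    def __init__(self):
        self.visits = Counter()

    def __call__(self, state):
        # Step size for `state`, based on visits strictly before this iteration.
        return beta(self.visits[state])

    def record(self, visited_states):
        # Count each state at most once per trajectory.
        for s in set(visited_states):
            self.visits[s] += 1
```

Under the second rule a rarely visited state keeps a large step size until it is actually updated, which is the intuition behind its faster convergence in practice.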
Given either choice of step size, we will show that our algorithm converges:
Theorem 1.
If $J_t$ is defined as in (6) and $\gamma_t(i) = \beta(t)$ or $\gamma_t(i) = \beta(n_t(i))$, then $J_t$ converges almost surely to $J^*$.
It turns out that proving the convergence of the second type of step size is more challenging than the corresponding proof for the first type of step size. However, in practice, the second type of step size leads to much faster convergence and hence, it is important to study it. We observed in simulations that the first step size rule is infeasible for problems with a large number of states since the convergence rate is very slow. Therefore, in our simulations, we use the second type of step size rule to compare the advantages of updating the value function for each state visited along a trajectory over updating the value function for just the first state in the trajectory.
[tsitsiklis2002convergence] considers a case where the initial-state distribution is nonuniform and the value for only the initial state is updated in each iteration. Our algorithm discards less information than that of [tsitsiklis2002convergence], but we require stronger assumptions on the structure of the Markov chains.
6 Proof of the Main Result
The key ideas behind our proof are the following. Once a state in a recurrent class is reached in an iteration, every state in that class will be visited with probability one in that iteration. Thus, if there is a nonzero probability of reaching every recurrent class, then each recurrent class is visited infinitely many times, and the results in [tsitsiklis2002convergence] for the synchronous version of OPI can be applied to each recurrent class to show the convergence of the values of the states in each such class. Next, since the rest of the graph is acyclic, by a well-known property of such graphs, the nodes (states of the Markov chain) can be arranged in a hierarchy such that one can inductively show the convergence of the values of these nodes. At each iteration, we have to show that the conditions required for the convergence of stochastic approximation are satisfied. If the step sizes are chosen to be state-independent, then they immediately satisfy the assumptions required for stochastic approximation. If the step sizes are state-dependent, then a martingale argument shows that they satisfy the required conditions. We also verify that the noise sequence in the stochastic approximation algorithm satisfies the required conditions.
6.1 Convergence for recurrent states
Recall that our states can be decomposed as $\mathcal{S} = \mathcal{T} \cup \mathcal{R}_1 \cup \cdots \cup \mathcal{R}_m$, where the $\mathcal{R}_j$ are closed, irreducible recurrent classes under any policy. To show convergence of our algorithm, we will first show that the algorithm converges for each recurrent class $\mathcal{R}_j$, then use this fact to show convergence for the transient states $\mathcal{T}$. The proof will differ slightly for our two choices of the step size $\gamma_t(i)$, so we will consider each case separately.
6.1.1 Deterministic step size
Consider our iterative updates, restricted to the set of states . Since is a closed, irreducible recurrent class, once any state in is visited, so will every other state. Recall the version of our state update without given by (7) under policy . Using our choice of , the update has exactly the same step size for every state in . We define as the shared for each state , and then for states , (7) becomes:
Now, consider only the steps of the algorithm such that is visited by the trajectory , so . Given our choice of step size, the above update becomes
where the noise only depends on the evolution of in the recurrent class . This is identical to the algorithm considered by Tsitsiklis in [tsitsiklis2002convergence]. Noting that and by our assumptions on , by Proposition 1 from Tsitsiklis, we have that for all .
6.1.2 State-dependent step size
Again, consider our iterative updates restricted to . We define as the common probability of reaching any state in . Then, we adapt the version of the update containing the noise term from (8) into an update for each state in using our choice of :
The convergence of the above algorithm essentially follows from [tsitsiklis2002convergence] with a minor modification. Since we have assumed that is lower bounded, even though the step sizes are random here, the stochastic approximation results needed for the result in [tsitsiklis2002convergence] continue to hold.
6.2 Convergence for transient states
Since the reachability graph restricted to transient states is a directed acyclic graph, it admits a reverse topological sort of its vertices $(x_1, x_2, \ldots, x_L)$, such that for each $i, j$, if there is an edge from $x_i$ to $x_j$ then $i > j$ (for reference, see [topological]). We will inductively prove that $J_t(x_i) \to J^*(x_i)$ almost surely for all $i$.
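A reverse topological ordering of this kind is what a depth-first postorder produces: a vertex is emitted only after every vertex it can step to has been emitted. The sketch below is illustrative and assumes the reachability graph is a dictionary from each state to its set of one-step successors and that the transient subgraph is acyclic; the function name is hypothetical.

```python
def reverse_topological_order(graph, transient):
    """Order the transient states so that every edge points to an earlier state.

    graph: dict mapping each state to an iterable of one-step successors.
    transient: the states to order; edges leaving this set are ignored.
    Assumes the subgraph induced by `transient` has no cycles.
    """
    transient = set(transient)
    order, done = [], set()

    def visit(x):
        done.add(x)
        for y in graph[x]:
            if y in transient and y not in done:
                visit(y)
        order.append(x)  # emitted after all of its transient successors

    for x in sorted(transient):
        if x not in done:
            visit(x)
    return order
```

The first state in the returned list has no transient successors, so its only neighbors are recurrent states, which is the starting point an induction over this ordering needs.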
We begin our induction with $x_1$. Since $x_1$ is transient, it must have at least one neighbor, and because it is first in the topological sort, its only neighbors in the reachability graph are members of recurrent classes. From the previous section, we know that for all such neighbors $j$, $J_t(j) \to J^*(j)$ almost surely. Since these neighboring value functions converge to the optimal value, one can show that the greedy policy at state $x_1$ converges to an optimal policy. For convenience, we present this result as a lemma. A similar result is proved in Proposition 4.5 and Corollary 4.5.1 in [bertsekas1978stochastic].
Lemma 1.
For any state , let be the set of its neighbors in the reachability graph . Suppose that for all . Then, there exists a finite time T for which for all .
Now, using Lemma 1, let be the minimum time after which for any optimal policy . Now, let be the event that for . Since converges almost surely for all neighbors of , . We examine the probability that does not converge to . The method is similar to the method in the errata of [tsitsiklis2002convergence].
We now analyze . For each integer , define a sequence for such that and
(9) 
is now in a standard form for a stochastic approximation. We will use the following standard theorem adapted from Lemma 1 of [singh2000convergence] to prove convergence of (9) to :
Lemma 2.
Let and be three sequences of scalar random variables such that , , and are measurable. Consider the update
Assume the following conditions are met:

There exist finite constants such that for all .

for all .

.

w.p. 1.

w.p. 1.
Then, the sequence converges almost surely to :
To use Lemma 2, we define our . It is straightforward to establish the following result, which we state without proof:
Lemma 3.
and for some constant .
Finally, we need to demonstrate that for our step sizes and , the effective step size almost surely satisfies
(10) 
Towards this, we introduce the following:
Lemma 4.
For and , (10) holds almost surely for each state .
Proof.
Since , it is sufficient to show that and for all almost surely. This is true by definition for , so it remains to show this for .
First we show that almost surely. Observe that for all since represents the number of trajectories in the first trajectories where state was visited. For sufficiently large , is nonincreasing, so . Furthermore, since we have that
We now show that Recall that
so
We will apply the martingale convergence theorem to show that almost surely. Define sequences and as follows:
Clearly, Also, and for , so . Thus, is a martingale and satisfies the conditions of the martingale convergence theorem, and therefore converges almost surely to some well-defined random variable , i.e., Since
is finite almost surely, by Kronecker’s lemma, we have
almost surely. Since for all and , we almost surely have
This implies that for sufficiently large , . We have assumed that, for sufficiently large , is nonincreasing, so which implies Finally, using there is almost surely some (which may depend on the sample path), such that
The second inequality in the previous line follows from the fact that the value of changes only at . This implies that almost surely. ∎
Thus, the recurrence in (9) takes the form required by Lemma 2, with step size and noise term . Conditions 1 and 2 in Lemma 2 are satisfied by Lemma 3. Condition 3 is clearly satisfied, because . Conditions 4 and 5 are satisfied due to Lemma 4. Therefore, by Lemma 2, for all positive integers . Now, we are ready to complete the proof. Conditioned on , we have for all . Therefore:
(Lemma 2) 
This completes the proof that . We then only need to complete the induction. For any , suppose that for all . We define analogously to above, so and:
By the inductive assumption and because of convergence for every recurrent class, the for all converge almost surely. If we define in the same way as with , then with probability 1, is finite. By the same reasoning as the base case, then
7 Numerical Experiments
The primary difference between the algorithm we have analyzed and the variant previously analyzed in [tsitsiklis2002convergence] is the update step. In [tsitsiklis2002convergence], only the value of a single, randomly selected state is updated at each time step. In contrast, we update every state visited by the trajectory sampled at each time step. Because we update each visited state, we expect the variant we have analyzed to converge more quickly. In order to support this claim, we have performed two experiments which demonstrate faster convergence.
In the first experiment, we have a Markov chain with a single absorbing state shown in Figure LABEL:sub@fig:sub1a, where the absorbing state has label 0. Each edge in the figure represents a possible transition from node to . At each state , there is an action associated with edge out of state , such that taking action transitions to state with probability and transitions to a different random neighbor of node chosen uniformly at random with probability . If there is only one edge out of state , then the only action deterministically transitions along that edge. For all nonzero states in Figure LABEL:sub@fig:sub1a, the label of the state corresponds to the reward of taking any action in that state (equivalently, the cost is the negation of the reward). The red arrows correspond to the optimal action in each state. This example is similar to taking greedy actions in an MDP with deterministic state transitions.
We implement both our algorithm given in (7) and the variant studied in [tsitsiklis2002convergence] which only updates a single state each iteration, and compare the number of iterations required for convergence. The results over 100 trials, assuming a discount factor of and a step size of , can be found in Figure LABEL:sub@fig:sub1b. The distribution of the starting state for each iteration was assumed to be uniformly random for both algorithms. Each algorithm was run until the first time that , and we graphed the empirical distributions of the number of iterations required. On average, our algorithm (updating along the entire trajectory) required only about 854 iterations, compared to the algorithm from [tsitsiklis2002convergence], which required 7172 iterations on average when updating only the starting state of the trajectory each time step.
In the second example, we consider a different stochastic shortest path problem on an acyclic graph, shown in Figure LABEL:sub@fig:sub2a. In this example, there are two actions, and , associated with each edge . If action is taken, then the reward in the label for node is accrued and a transition occurs as in the previous example, where the edge is taken with probability 0.6 and a different uniformly random edge is taken with probability . The action allows for a more certain reward, at a cost: the probability of taking the chosen edge is increased to 0.8, but the reward is decreased by 1.
Again, we compare our algorithm to the variant studied in [tsitsiklis2002convergence] for this problem. The optimal policy is given by the red and yellow arrows in Figure LABEL:sub@fig:sub2a, where yellow arrows are associated with and red arrows with . The distribution of iterations required for convergence can be found in Figure LABEL:sub@fig:sub2b. Again, updating the entire trajectory (300 iterations on average) is more efficient than updating a single state (455 iterations on average).
8 Extensions
Thus far, we have presented a proof of convergence for a certain class of discounted MDPs with deterministic costs. However, the same ideas we have used can be easily extended to a number of related settings. In this section, we will discuss extensions to stochastic shortest path and game theoretic versions of the problem. We will also extend the results to a setting where we assume knowledge of clusters of states with the same value function.
8.1 Stochastic Shortest Path Problem
In a stochastic shortest path (SSP) problem, the goal is to minimize the cumulative cost over all policies. It is the undiscounted MDP problem, where the discount factor is set to 1 and the cost-to-go becomes
To account for the lack of a discount factor, we will need to adjust our assumptions accordingly. We again assume that the state and action spaces are finite and we assume that Assumptions 1 and 2 hold as in the discounted case. However, instead of allowing the cost to infinitely accumulate in one of several recurrent classes, we require a different structural assumption, which combines all recurrent classes into one absorbing state and guarantees that the cost remains finite under every policy:
Assumption 4.
There is a unique absorbing state 0, which incurs a cost of 0 under every action. For notational convenience, we will denote the state space for the SSP as , with as before. We assume the subgraph of the reachability graph induced by is acyclic.
We define our algorithm identically to the discounted case, but with $\alpha = 1$. The update proceeds using (6). This procedure can be shown to converge, similarly to the discounted case:
Theorem 2.
Proof.
The proof for this result follows the proof given in Section 6.2 of the convergence for transient states in the discounted case. Due to our assumptions, the nonzero states of the SSP form an acyclic graph, so they admit a reverse topological sort , where in the reachability graph , implies . Thus, state can only transition to the absorbing state 0, and for all time , we have . It is straightforward to show that Lemmas 3 and 4 continue to hold for the SSP problem. Therefore, by a simple stochastic approximation argument, .
The proof proceeds by induction in the same manner as in the discounted case. For any , assuming for all , we examine . It is straightforward to show that Lemma 1 holds for the SSP problem. By an argument analogous to the one used above for , then . ∎
8.2 Alternating Zero-Sum Game
We consider a finite-state stochastic shortest path game with two players: player 1 and player 2. Player 1 seeks to minimize the cumulative cost, while player 2 works to maximize the cost. In general, players 1 and 2 can take simultaneous actions and , respectively, in state . Accordingly, transitions and costs depend on both actions. These action spaces are often not finite, for example, to allow for mixed strategies for each player. Given a policy for player 1 and for player 2, we can define a cost function :
The goal in solving stochastic shortest path games is to find a Nash equilibrium solution , such that
When the value of a game exists, it can be found as the solution to the minimax Bellman equation , where is the minimax Bellman operator defined by
If such a solution exists, then is the optimal value function for the game. One category of games where an equilibrium always exists is alternating games, which we consider in this section (for more details, see Section 2.3.3 of [patekthesis]). In an alternating (also known as sequential) game, players take “turns” performing actions. The state space, outside of a single absorbing terminating state , can be partitioned into two sets of states and , where is the set of states where player 1 takes actions and is the set of states where player 2 acts. For states , the choice of action for player 2 is trivial and therefore . Similarly, for states , . Without loss of generality, we can combine states to assume if and are either both in or both in , so no player ever takes two turns in a row.
For the purposes of this section, we assume that the action spaces in each state are finite. In an alternating game, there is no need for mixed strategies, as at each step, the one-step minimax problem reduces to a simple minimum or maximum, depending on the current turn. Thus, we can combine the action pair into a single action and simplify the Bellman operator to a state-dependent min or max:
(11) 
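To make the state-dependent min or max concrete, the following sketch applies one step of such an operator to a small undiscounted alternating game. It is an illustration under assumptions, not the paper's formulation: the layouts `P[i][u][j]` and `g[i][u]`, the argument `min_states` (the states where the minimizing player acts), and the function name are all hypothetical.

```python
def alternating_bellman(P, g, J, min_states):
    """Apply a state-dependent min/max Bellman update once (no discounting).

    At states in `min_states` the minimizing player picks the action;
    at every other state the maximizing player does.
    """
    TJ = []
    for i in range(len(P)):
        q = [g[i][u] + sum(p * Jj for p, Jj in zip(P[i][u], J))
             for u in range(len(P[i]))]
        TJ.append(min(q) if i in min_states else max(q))
    return TJ
```

As a usage example, let state 0 be absorbing with zero cost, let the minimizer at state 1 choose between paying 1 to hand the turn to state 2 or paying 5 to terminate, and let the maximizer at state 2 choose between terminal costs 1 and 3. Iterating the operator from the zero vector settles at values 4 for state 1 and 3 for state 2.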
The following still holds:
for the operator in (11). Thus, we have the following:
(12) 
We define the following:
and
Substituting and