# On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

A simple and natural algorithm for reinforcement learning is Monte Carlo Exploring Starts (MCES), where the Q-function is estimated by averaging the Monte Carlo returns, and the policy is improved by choosing actions that maximize the current estimate of the Q-function. Exploration is performed by "exploring starts", that is, each episode begins with a randomly chosen state and action and then follows the current policy. Establishing convergence for this algorithm has been an open problem for more than 20 years. We make headway with this problem by proving convergence for Optimal Policy Feed-Forward MDPs, which are MDPs whose states are not revisited within any episode for an optimal policy. Such MDPs include all deterministic environments (including Cliff Walking and other gridworld examples) and a large class of stochastic environments (including Blackjack). The convergence results presented here make progress for this long-standing open problem in reinforcement learning.


## 1 Introduction

In the classic book on reinforcement learning by Sutton & Barto (2018), the authors describe Monte Carlo Exploring Starts (MCES), a Monte Carlo algorithm for finding optimal policies in (tabular) reinforcement learning problems. MCES is a simple and natural Monte Carlo algorithm for reinforcement learning: the Q-function is estimated by averaging the Monte Carlo returns, and the policy is improved by choosing actions that maximize the current estimate of the Q-function. Exploration is performed by “exploring starts,” that is, each episode begins with a randomly chosen state and action and then follows the current policy. Both for the case of non-discounted returns and for the case of a non-uniform exploring-starts distribution, it has not been proved that the sequence of policies produced by this algorithm converges to an optimal policy. Sutton and Barto write at the end of Section 5.3: “In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning”.

In this paper we make some headway with this problem by proving convergence for Optimal Policy Feed-Forward (OPFF) environments, where states are not revisited within any episode under optimal policies. Such MDPs include all deterministic MDPs (including Cliff Walking and other gridworld examples) and a large class of stochastic MDPs (which include Blackjack). Many of the example episodic environments in Sutton & Barto (2018) are OPFF. Our convergence results allow for discounting or no discounting, and allow for non-uniform exploring start distributions. Moreover, the proofs are relatively simple and straightforward, relying only on basic concepts from graph theory and probability theory, and do not employ more sophisticated mathematical tools such as contraction mappings, stochastic approximation, or martingales. The convergence results presented here make important progress for this long-standing open problem in reinforcement learning.

In this paper we provide two theorems. Theorem 1 establishes convergence for deterministic environments when the Q-function is approximated with the most recent or highest return. Theorem 2 establishes convergence for the more general OPFF environments when the Q-function is approximated with the average of all returns.

## 2 Related Work

Watkins & Dayan (1992) established convergence of Q-learning in expectation. By showing that Q-learning is a form of stochastic approximation, Tsitsiklis (1994) and Jaakkola et al. (1994) showed that Q-learning converges to the optimal Q-function when the step sizes satisfy the standard Robbins-Monro conditions (Robbins & Monro, 1951).

The MCES algorithm was introduced in Sutton et al. (1998) without a proof of convergence. Using stochastic approximation, Tsitsiklis (2002) proved convergence of MCES when both of the following conditions are satisfied: (i) the returns are discounted with a discount factor strictly less than one; (ii) every state-action pair is used to initialize the episodes with the same frequency. The original algorithm as stated in Sutton et al. (1998) requires neither of these conditions.

In this paper we take a graph-theoretic, proof-by-induction approach to show that the MCES algorithm converges. For initializing episodes, we only require that every state-action pair be chosen infinitely often. Furthermore, our proofs allow for discounting and no discounting. However, our results do require that the underlying MDP be “Optimal Policy Feed-Forward (OPFF),” which is satisfied by many environments, including many of the episodic example environments in Sutton et al. (1998).

## 3 Classes of MDPs and the MCES Algorithm

Following the notation of Sutton & Barto (2018), a finite Markov decision process is defined by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a finite reward space $\mathcal{R}$, and a dynamics function

$$\bar{p}(s', r \mid s, a) := P(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a) \tag{1}$$

The state-transition probability function is then given by

$$p(s' \mid s, a) := P(S_t = s' \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} \bar{p}(s', r \mid s, a) \tag{2}$$

A (deterministic and stationary) policy $\pi$ is a mapping from the state space to the action space. We denote $\pi(s)$ for the action selected under policy $\pi$ when in state $s$. For any given MDP, we define its MDP graph as follows. The nodes in the graph are the states in the MDP. There is a directed edge from node $s$ to node $s'$ if there is an action $a$ such that $p(s' \mid s, a) > 0$.

We briefly note that in the MDP literature there are two ways of defining the underlying sample space and probability measure for an MDP. One way defines a different probability measure over the sample space for each policy $\pi$; in this case, the state and action random variables at a given time $t$ are functions of the sample (from the sample space) and not of the policy. The other way is to use a sample space with a fixed probability measure, but define the state random variables as functions of both the sample and the policy. These two formulations are equivalent (Bertsekas, 2005; Bertsekas & Tsitsiklis, 1996). In this paper, as in Tsitsiklis (1994) and Tsitsiklis (2002), we take the latter approach in order to state convergence results with probability one using a fixed probability measure. Henceforth we write $S^\pi_t$, $A^\pi_t$, and $R^\pi_t$ for the state, action, and reward at time $t$ under policy $\pi$.

### 3.1 Optimizing the Episodic Return

As indicated in Chapters 4 and 5 of Sutton & Barto (2018), for RL algorithms based on Monte Carlo methods, we need to assume that the task is episodic, that is “experience is divided into episodes, and all episodes eventually terminate no matter what actions are selected.” Examples of episodic tasks include “plays of a game, trips through a maze, or any sort of repeated interaction”. Chapter 4 of Sutton & Barto (2018) further states: “Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states”.

The “Cliff Walking” example in Sutton & Barto (2018) is an example of an “episodic MDP”. Here the terminal state is the union of the goal state and the cliff state. Although the terminal state will not be reached by all policies due to cycling, it will clearly be reached by the optimal policy. Another example from Sutton and Barto of an episodic MDP is “Blackjack”. Here we can create a terminal state which is entered whenever the player sticks or goes bust. For Blackjack, the terminal state will be reached by all policies. Throughout this paper we assume that the task is episodic. Let $\tilde{s}$ denote the terminal state. (If there are multiple terminal states, without loss of generality they can be lumped into one state.)

When using policy $\pi$ to generate an episode, let

$$T^\pi = \min\{t : S^\pi_t = \tilde{s}\} \tag{3}$$

be the time when the episode ends. Our goal is to find a policy that maximizes the expected episodic return:

$$V^\pi(s) = E\left[\sum_{t=0}^{T^\pi - 1} \gamma^t R^\pi_{t+1} \,\Big|\, S^\pi_0 = s\right] \tag{4}$$

for all $s \in \mathcal{S}$.

We consider both the non-discounted case $\gamma = 1$ and the discounted case $\gamma < 1$. For the discounted case, the optimization criterion (4) is a special case of the standard infinite-horizon discounted criterion (with the reward in the terminal state set to zero), and thus there is an optimal policy that is both stationary and deterministic. For the non-discounted case, the optimization problem (4) corresponds to the stochastic shortest path problem, for which there also exists an optimal policy that is both stationary and deterministic (for example, see Proposition 2.2 of Bertsekas (2012); Bertsekas & Tsitsiklis (1996)). For both the discounted and non-discounted cases, the optimal policy may not be unique.
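To make criterion (4) concrete, the following is a minimal sketch (the function name is illustrative, not from the paper) of computing one sampled episodic return from an episode's reward sequence, working backward so discounting is applied incrementally:

```python
def episodic_return(rewards, gamma=1.0):
    """Sampled episodic return: sum_{t=0}^{T-1} gamma^t * R_{t+1},
    where rewards[t] holds R_{t+1} for one episode of length T."""
    g = 0.0
    for r in reversed(rewards):  # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g
```

With $\gamma = 1$ this is the plain sum of the episode's rewards; with $\gamma < 1$ later rewards are geometrically down-weighted.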

We note that it is possible to construct MDPs such that the terminal state is never reached under optimal policies. In such cases, Monte Carlo algorithms, such as Monte Carlo Exploring Starts, do not make sense. We therefore define an MDP to be episodic if under every optimal policy the terminal state is reached with probability one. Throughout this paper we assume that the underlying MDP is episodic. It is easily seen that the MDP graph of an episodic MDP has the following property: for any state $s$ there is a directed path in the graph from $s$ to the terminal state. (However, the reverse statement is not necessarily true.)
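The graph property just stated can be checked mechanically. The sketch below (the function name and the dict-of-sets graph encoding are illustrative assumptions, not from the paper) verifies that every state has a directed path in the MDP graph to the terminal state, via breadth-first search on the reversed graph:

```python
from collections import deque

def all_states_reach_terminal(edges, terminal):
    """edges: dict mapping each state to the set of its successor states
    (the MDP graph). Returns True iff every state has a directed path
    to `terminal`."""
    # Reverse the graph, then search backward from the terminal state.
    reverse = {s: set() for s in edges}
    reverse.setdefault(terminal, set())
    for s, succs in edges.items():
        for s2 in succs:
            reverse.setdefault(s2, set()).add(s)
    seen = {terminal}
    queue = deque([terminal])
    while queue:
        s = queue.popleft()
        for prev in reverse.get(s, ()):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return all(s in seen for s in edges)
```

A cycle that cannot exit toward the terminal state (the situation the episodic assumption rules out) makes this check fail.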

Let $Q^*$ denote the optimal action-value function, and let $V^*$ denote the optimal value function. Finally, let $A^*(s)$ denote the set of all optimal actions in state $s$.

### 3.2 Classes of MDPs

We will prove convergence results for important classes of MDPs. An environment is said to be a deterministic environment if for any state $s$ and chosen action $a$, the reward $r$ and subsequent state $s'$ are given by two (unknown) deterministic functions of the state-action pair $(s, a)$. Many natural environments are deterministic. For example, in Sutton & Barto (2018), the environments Tic-Tac-Toe, Gridworld, Golf, Windy Gridworld, and Cliff Walking are all deterministic. Moreover, many natural environments with continuous state and action spaces are deterministic, such as the MuJoCo robotic locomotion environments (Todorov et al., 2012). Thus the class of deterministic environments is a large and important special case.

We say an environment is Stochastic Feed-Forward (SFF) if a state cannot be revisited within any episode. More precisely, the MDP is Stochastic Feed-Forward (SFF) if its MDP graph has no cycles, that is, the MDP graph is a Directed Acyclic Graph (DAG). Note that transitions are permitted to be stochastic in an SFF MDP. SFF environments occur naturally in practice. For example, the Blackjack environment in Sutton & Barto (2018) is SFF.

We say an MDP is Optimal Policy Feed-Forward (OPFF) if the directed graph engendered by the optimal policies is acyclic. More precisely, construct a sub-graph of the MDP graph as follows: each state is a node in the graph, and there is a directed edge from node $s$ to node $s'$ if the MDP can go from state $s$ directly to $s'$ for some optimal action, i.e., if $p(s' \mid s, a^*) > 0$ for some action $a^* \in A^*(s)$. We refer to this graph as the optimal policy MDP graph. We say the MDP is Optimal Policy Feed-Forward (OPFF) if the optimal policy MDP graph is acyclic.
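Given the optimal actions, OPFF is a pure graph condition, so it can be checked with a standard cycle test. Below is a sketch under assumed encodings (per-state sets of optimal actions and a transition-support map; the names are illustrative and transitions are assumed to stay within `states`), using Kahn's algorithm for acyclicity:

```python
def is_opff(states, optimal_actions, transitions):
    """Check whether the optimal-policy MDP graph is acyclic.
    states: iterable of states; optimal_actions[s]: set of optimal actions;
    transitions[(s, a)]: set of states s' with p(s'|s,a) > 0."""
    # Build the optimal-policy sub-graph: keep only optimal-action edges.
    succ = {s: set() for s in states}
    for s in states:
        for a in optimal_actions.get(s, ()):
            succ[s] |= transitions.get((s, a), set())
    # Kahn's algorithm: the graph is a DAG iff every node can be removed
    # by repeatedly deleting nodes of in-degree zero.
    indeg = {s: 0 for s in succ}
    for s in succ:
        for s2 in succ[s]:
            indeg[s2] += 1
    stack = [s for s in indeg if indeg[s] == 0]
    removed = 0
    while stack:
        s = stack.pop()
        removed += 1
        for s2 in succ[s]:
            indeg[s2] -= 1
            if indeg[s2] == 0:
                stack.append(s2)
    return removed == len(indeg)
```

The same removal order also yields the topological re-ordering of states used in the proofs below.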

The following lemma shows that all deterministic and SFF MDPs are special cases of OPFF MDPs.

###### Lemma 1.

(a) All deterministic episodic MDPs are OPFF. (b) All SFF MDPs are OPFF.

###### Proof.

(a) Consider the optimal policy MDP graph. Since the MDP is deterministic, if the graph has cycles, then from some starting states, an optimal policy would not reach the terminal state. But this violates the episodic MDP assumption. Thus all deterministic episodic MDPs are OPFF. (b) If the original MDP graph does not have cycles, then clearly its optimal policy MDP sub-graph also does not have cycles. Thus all SFF MDPs are OPFF. ∎

### 3.3 Monte Carlo with Exploring Starts

We generalize slightly the MCES algorithm in Sutton and Barto. Specifically, instead of setting $Q(s,a)$ to the average of the returns, we consider three variants of the MCES algorithm, by using a function $\mathrm{Est}$ in line 12, which is a function of the list of returns. Define $\mathrm{Est} = \mathrm{recent}$ to be the function that takes in a list of returns and outputs the most recent return value in the list. Define $\mathrm{Est} = \mathrm{max}$ to be the function that takes in a list of returns and outputs the maximum value in the list. Define $\mathrm{Est} = \mathrm{avg}$ to be the function that takes in a list of returns and outputs the average of the values in the list. The original algorithm in Sutton and Barto uses $\mathrm{Est} = \mathrm{avg}$.

We also need to modify the algorithm to handle the case when the episode never reaches the terminal state for some policies (for example, due to cycling). To this end, let $M$ be some upper bound on the number of states in our MDP. We assume that the algorithm designer has access to such an upper bound (which may be very loose). With the introduction of the variable $M$, each episode is guaranteed to end in no more than $M$ steps.

In line 13, if there is more than one argument maximum, we set $\pi(s)$ to any one of them.
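For concreteness, here is a minimal, hedged sketch of the MCES variants described above. The function and parameter names are illustrative assumptions, not the paper's pseudocode: `simulate(s, a)` stands in for the unknown environment dynamics, `max_steps` plays the role of the episode cap, and `est` selects among the three return estimates (most recent, maximum, average), with first-visit returns:

```python
import random
from collections import defaultdict

def mces(states, actions, simulate, terminal, est="avg",
         num_iterations=2000, max_steps=100, gamma=1.0, seed=0):
    """Monte Carlo Exploring Starts (sketch, names illustrative).
    simulate(s, a) -> (reward, next_state) samples the MDP dynamics.
    est: "recent", "max", or "avg" -- the Est variants in the text.
    max_steps caps episode length in case a policy cycles."""
    rng = random.Random(seed)
    returns = defaultdict(list)           # (s, a) -> list of first-visit returns
    Q = defaultdict(float)
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    for _ in range(num_iterations):
        # Exploring start: random state-action pair, then follow pi.
        s, a = rng.choice(states), rng.choice(actions)
        episode = []
        for _ in range(max_steps):
            r, s2 = simulate(s, a)
            episode.append((s, a, r))
            if s2 == terminal:
                break
            s, a = s2, pi[s2]
        # Compute first-visit returns by sweeping the episode backward.
        g = 0.0
        first_visit = {}
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = r + gamma * g
            first_visit[(s, a)] = g       # earlier visits overwrite later ones
        for (s, a), g in first_visit.items():
            returns[(s, a)].append(g)
            lst = returns[(s, a)]
            Q[(s, a)] = {"recent": lst[-1],
                         "max": max(lst),
                         "avg": sum(lst) / len(lst)}[est]
            pi[s] = max(actions, key=lambda b: Q[(s, b)])  # greedy improvement
    return Q, pi
```

On a small deterministic chain this recovers an optimal policy; with `est="avg"` it matches the first-visit averaging variant, while `est="recent"` and `est="max"` are the two alternatives analyzed in Theorem 1.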

## 4 Proof of Convergence for Deterministic MDPs

In this section we consider deterministic MDPs for the cases $\mathrm{Est} = \mathrm{recent}$ and $\mathrm{Est} = \mathrm{max}$. In the subsequent section we will consider OPFF MDPs (which include deterministic MDPs) for $\mathrm{Est} = \mathrm{avg}$.

The following theorem implies that the sequence of policies and Q functions generated by the MCES algorithm for deterministic MDPs converges to the optimal policy and Q function, respectively.

###### Theorem 1.

Consider a deterministic MDP using either $\mathrm{Est} = \mathrm{recent}$ or $\mathrm{Est} = \mathrm{max}$. W.p.1, after a finite number of iterations of the MCES algorithm we have

$$Q(s,a) = Q^*(s,a), \quad a \in \mathcal{A},\ s \in \mathcal{S} \tag{5}$$

and

$$\pi(s) \in A^*(s), \quad s \in \mathcal{S} \tag{6}$$
###### Proof.

We provide the proof for $\mathrm{Est} = \mathrm{recent}$. The proof for $\mathrm{Est} = \mathrm{max}$ is similar.

From Lemma 1 we know that the optimal policy MDP graph is a DAG. For any DAG, we can re-order the states as $s_1, s_2, \ldots, s_n$ such that, in this graph, from state $s_j$ we can only transition to a state in $\{s_1, \ldots, s_{j-1}\}$, with $s_1 = \tilde{s}$ being the terminal state.

Now consider the MCES algorithm. Note that during each iteration of the algorithm there is a data generation phase, during which one episode is generated using exploring starts, and then an update phase, during which $Q$ and $\pi$ are updated for states and actions along the episode. Denote $Q_u$ and $\pi_u$ for the values of $Q$ and $\pi$ just after the $u$th iteration.

We first make the following claim: For every $k = 1, \ldots, n$, w.p.1 there is a finite $u_k$ such that after the $u_k$th iteration of the MCES algorithm we have, for all $j = 1, \ldots, k$ and all $u \geq u_k$:

$$Q_u(s_j, a) = V^*(s_j) \text{ for all } a \in A^*(s_j) \tag{7}$$

and

$$\pi_u(s_j) \in A^*(s_j) \tag{8}$$

We prove the above claim by backward induction: First we show the statement is true for the base case where $k = 1$. By providing the terminal state with one action that leads to itself with 0 reward, it is easily seen that the claim trivially holds for $k = 1$.

We now assume that the claim holds for $k - 1$ and show it continues to hold for $k$. By the inductive assumption there is a finite $u_{k-1}$ such that for any $u \geq u_{k-1}$ we have for all $j = 1, \ldots, k-1$:

$$Q_u(s_j, a) = V^*(s_j) \text{ for all } a \in A^*(s_j) \tag{9}$$

and

$$\pi_u(s_j) \in A^*(s_j) \tag{10}$$

Note that (10) implies that if the MDP enters a state in $\{s_1, \ldots, s_{k-1}\}$ within any episode occurring after the $u_{k-1}$th iteration, it will subsequently take optimal actions and eventually reach the terminal state $\tilde{s}$.

We first establish (7) for $j = k$. Let $a^*$ be an action in $A^*(s_k)$ and let $u$ be any iteration after iteration $u_{k-1}$ for which the episode includes $s_k$ followed by selecting action $a^*$.

Because $a^*$ is an optimal action for state $s_k$, after selecting $a^*$, the MDP will enter a state in $\{s_1, \ldots, s_{k-1}\}$ and then take optimal actions until reaching the terminal state. Thus the portion of the episode beginning at $s_k$ will have return $V^*(s_k)$. Thus during the update phase following this episode, $Q(s_k, a^*)$ will get updated to $V^*(s_k)$. Subsequently, $Q(s_k, a^*)$ will never get updated to anything other than $V^*(s_k)$. Thus

$$Q_u(s_k, a^*) = V^*(s_k) \tag{11}$$

for all subsequent iterations. For each $a^* \in A^*(s_k)$, let $u(a^*)$ denote the first iteration after $u_{k-1}$ in which the episode includes $s_k$ followed by $a^*$, and let $u_k' = \max u(a^*)$, where the maximum is taken over all $a^* \in A^*(s_k)$. Due to exploring starts, $u_k'$ is finite w.p.1. It follows that (11) holds for all $a^* \in A^*(s_k)$ whenever $u \geq u_k'$. This establishes (7) for $j = k$.

Now we establish (8) for $j = k$. For a deterministic MDP, during any iteration $u$, when starting in state $s_k$ and taking any action $a$, the resulting return $G$ satisfies $G \leq Q^*(s_k, a)$. Furthermore, if $a \notin A^*(s_k)$, we must have $Q^*(s_k, a) < V^*(s_k)$. Therefore, after any iteration $u$ we have

$$Q_u(s_k, a) \leq Q^*(s_k, a) \tag{12}$$

Combining (11) and (12) implies that for any $a \in \mathcal{A}$, $Q_u(s_k, a) \leq Q_u(s_k, a^*)$, with equality holding if and only if $a \in A^*(s_k)$. Since $\pi_u(s_k)$ maximizes $Q_u(s_k, \cdot)$, it follows that $\pi_u(s_k) \in A^*(s_k)$ for all $u \geq u_k'$. This establishes (8) for $j = k$. By induction the claim is true for all $k = 1, \ldots, n$.

So far we have proved that w.p.1 after a finite number of iterations we have

$$\pi(s) \in A^*(s), \quad s \in \mathcal{S} \tag{13}$$

and

$$Q(s,a) = Q^*(s,a), \quad a \in A^*(s),\ s \in \mathcal{S} \tag{14}$$

It remains to show that after a finite number of iterations

$$Q(s,a) = Q^*(s,a), \quad a \notin A^*(s),\ s \in \mathcal{S} \tag{15}$$

To this end, let $u'$ be such that after iteration $u'$ (13) holds. Consider an $(s, a)$ pair with $a \notin A^*(s)$. Due to exploring starts, this pair will occur at the beginning of some episode after iteration $u'$. After taking action $a$ and entering a new state $s'$, the MDP will follow the optimal policy. Thus the return for $(s, a)$ is $Q^*(s, a)$ by definition of $Q^*$. Thus $Q(s, a)$ will be updated to $Q^*(s, a)$ and will subsequently never change. Thus, if we let $u''$ be the first iteration by which all state-action pairs have been selected by exploring starts after iteration $u'$, then (15) also holds starting at iteration $u''$, completing the proof. ∎

## 5 Proof of Convergence for OPFF MDPs

In this section, for the case of $\mathrm{Est} = \mathrm{avg}$, we show that the MCES algorithm converges to an optimal policy for all OPFF MDPs, which by Lemma 1 include all deterministic MDPs and all stochastic feed-forward MDPs. For this case, as done in Sutton & Barto (2018), we modify the MCES algorithm so that only first-visit returns are used when estimating the action-value function, which is required when applying the law of large numbers in the subsequent proof.

###### Theorem 2.

Consider the MCES algorithm with average return, that is, $\mathrm{Est} = \mathrm{avg}$. Suppose the MDP is Optimal Policy Feed-Forward (OPFF). For any $\epsilon > 0$, w.p.1 after a finite number of iterations we have

$$|Q(s,a) - Q^*(s,a)| < \epsilon, \quad a \in \mathcal{A},\ s \in \mathcal{S} \tag{16}$$

and

$$\pi(s) \in A^*(s), \quad s \in \mathcal{S} \tag{17}$$

Consequently, the Q function and the policy converge to the optimal Q function and an optimal policy, respectively, w.p.1.

###### Proof.

The proof parallels the proof of Theorem 1 with some important differences. We only provide the portions of the proof that are different.

Because the MDP is OPFF, its optimal policy MDP graph is a DAG, so we can re-order the states as $s_1, s_2, \ldots, s_n$ such that from state $s_j$, upon selecting an optimal action $a^* \in A^*(s_j)$, we can only transition to a state in $\{s_1, \ldots, s_{j-1}\}$, with $s_1 = \tilde{s}$ being the terminal state.

We make the following claim: For every $k = 1, \ldots, n$ and every $\epsilon > 0$, w.p.1 there exists a $u_k$ such that after the $u_k$th iteration of the MCES algorithm we have, for all $j = 1, \ldots, k$ and all $u \geq u_k$:

$$|Q_u(s_j, a) - Q^*(s_j, a)| < \epsilon \text{ for all } a \in A^*(s_j) \tag{18}$$

and

$$\pi_u(s_j) \in A^*(s_j) \tag{19}$$

We again prove the claim by backward induction. As in the proof of Theorem 1, it holds trivially for $k = 1$. We now assume that the claim holds for $k - 1$ and show it continues to hold for $k$. By the inductive assumption we know that for every $\epsilon > 0$ there exists $u_{k-1}$ such that for any $u \geq u_{k-1}$ we have for all $j = 1, \ldots, k-1$:

$$|Q_u(s_j, a) - Q^*(s_j, a)| < \epsilon \text{ for all } a \in A^*(s_j) \tag{20}$$

and

$$\pi_u(s_j) \in A^*(s_j) \tag{21}$$

Note that (21) implies that if the MDP enters a state in $\{s_1, \ldots, s_{k-1}\}$ within any episode occurring after the $u_{k-1}$th iteration, it will subsequently take optimal actions and eventually reach the terminal state $\tilde{s}$.

We first establish (18) for $j = k$. Let $a^*$ be an action in $A^*(s_k)$ and consider any iteration after iteration $u_{k-1}$ for which the episode includes $s_k$ followed by selecting action $a^*$. Because the MDP is OPFF, after selecting the optimal action $a^*$, the MDP will enter a state in $\{s_1, \ldots, s_{k-1}\}$ and then follow optimal actions until reaching the terminal state. The portion of the episode beginning in state $s_k$ will therefore take a path in the DAG from $s_k$ to the terminal state. Let $G$ denote the (random) return for this portion of the episode. By definition of $Q^*$, we have $E[G] = Q^*(s_k, a^*)$. Because of exploring starts, we are ensured to visit the state-action pair $(s_k, a^*)$ infinitely often after iteration $u_{k-1}$, and by the law of large numbers we know that the average of the returns obtained during these visits converges to $Q^*(s_k, a^*)$ w.p.1. Combining this with the fact that $Q_u(s_k, a^*)$ is the average of all returns from $(s_k, a^*)$ up through the $u$th iteration, we know for any $\epsilon > 0$, there is a $u_k'$ such that for all iterations $u \geq u_k'$ we have:

$$|Q_u(s_k, a^*) - Q^*(s_k, a^*)| < \epsilon \text{ for all } a^* \in A^*(s_k) \tag{22}$$

This establishes (18) for $j = k$.

We now show (19) for $j = k$. Since all actions in $A^*(s_k)$ are optimal and the action space is finite, we have:

$$Q^*(s_k, a^*) \geq Q^*(s_k, a) + \epsilon' \quad \text{for some } \epsilon' > 0, \text{ for all } a^* \in A^*(s_k),\ a \notin A^*(s_k) \tag{23}$$

Consider state $s_k$ and an arbitrary action $a$ not in $A^*(s_k)$. From the MCES algorithm, we have $Q_u(s_k, a) = \frac{1}{L_u}\sum_{l=1}^{L_u} G_l$, where $L_u$ is the total number of returns used to compute $Q_u(s_k, a)$ up through the $u$th iteration, and $G_l$ is the value of the $l$th such return. Note that, due to the first-visit condition, the $G_l$'s are independent; however, since they can be generated with different policies, they may have different distributions. Let $\Pi$ denote the (finite) set of all deterministic policies, $L^\pi_u$ denote the number of returns used to compute $Q_u(s_k, a)$ up through the $u$th iteration when using policy $\pi$, and $G^\pi_l$, $l = 1, \ldots, L^\pi_u$, denote the $l$th return value when policy $\pi$ is used. We have

$$Q_u(s_k, a) = \frac{1}{L_u}\sum_{l=1}^{L_u} G_l \tag{24}$$

$$Q_u(s_k, a) = \sum_{\pi \in \Pi} \frac{L^\pi_u}{L_u}\left(\frac{1}{L^\pi_u}\sum_{l=1}^{L^\pi_u} G^\pi_l\right) \tag{25}$$

By the law of large numbers, we know that for any policy $\pi$ for which $L^\pi_u \to \infty$ we have

$$\lim_{u \to \infty} \frac{1}{L^\pi_u}\sum_{l=1}^{L^\pi_u} G^\pi_l = Q^\pi(s_k, a) \leq Q^*(s_k, a) \tag{26}$$

where $Q^\pi$ is the action-value function for policy $\pi$. The inequality in (26) follows from the definition of $Q^*$. Further, for any policy $\pi$ for which $L^\pi_u$ remains bounded we have

$$\lim_{u \to \infty} \frac{L^\pi_u}{L_u}\left(\frac{1}{L^\pi_u}\sum_{l=1}^{L^\pi_u} G^\pi_l\right) = 0 \tag{27}$$

It follows from (24)-(27) that for any $\epsilon > 0$, there is a $u_k''$ such that for all $u \geq u_k''$ we have

$$Q_u(s_k, a) \leq Q^*(s_k, a) + \epsilon \tag{28}$$

Combining (22), (23) and (28), we obtain

$$Q_u(s_k, a^*) \geq Q^*(s_k, a^*) - \epsilon \tag{29}$$

$$Q_u(s_k, a^*) \geq Q^*(s_k, a) + \epsilon' - \epsilon \tag{30}$$

$$Q_u(s_k, a^*) \geq Q_u(s_k, a) + \epsilon' - 2\epsilon \tag{31}$$

We can then choose $\epsilon < \epsilon'/2$, so that we have for all $u \geq \max(u_k', u_k'')$ and $a \notin A^*(s_k)$:

$$Q_u(s_k, a^*) > Q_u(s_k, a) \tag{32}$$

From the MCES algorithm, by definition $\pi_u(s_k)$ maximizes $Q_u(s_k, \cdot)$. Thus (32) implies $\pi_u(s_k) \in A^*(s_k)$, which establishes (19) for $j = k$. By induction the claim is true for all $k = 1, \ldots, n$.

It remains to show that after a finite number of iterations

$$|Q(s,a) - Q^*(s,a)| < \epsilon, \quad a \notin A^*(s),\ s \in \mathcal{S} \tag{33}$$

To this end, let $u'$ be such that after iteration $u'$ (17) holds. Let $s \in \mathcal{S}$ and consider an $(s, a)$ pair with $a \notin A^*(s)$. After taking action $a$ and entering a new state $s'$, the MDP will follow the optimal policy. Thus the expected value of the return for $(s, a)$ is $Q^*(s, a)$. Because of exploring starts, we are ensured to visit the $(s, a)$ pair infinitely often, and therefore by the law of large numbers we know that $Q_u(s, a)$ converges to $Q^*(s, a)$ w.p.1. That is, for any $\epsilon > 0$, there is a $u''$ such that if $u \geq u''$ then

$$|Q_u(s, a) - Q^*(s, a)| < \epsilon, \quad a \notin A^*(s),\ s \in \mathcal{S} \tag{34}$$

So (33) also holds, completing the proof. ∎

Note that the proofs of Theorems 1 and 2 only require that all state-action pairs be chosen infinitely often when starting the episodes.

## 6 Conclusion

Theorem 2 of this paper shows that if the episodic MDP is OPFF, then the MCES algorithm converges to an optimal policy. Many environments of practical interest are OPFF.

Combining the results of Tsitsiklis (2002) and the results here gives Figure 1, which summarizes what is now known about convergence of the MCES algorithm.

It still remains an open problem to prove convergence of the MCES algorithm for environments that are non-OPFF when either $\gamma = 1$ or non-uniform exploring starts are employed.

The results in this paper along with the paper of Tsitsiklis (2002) make significant progress in establishing the convergence of the MCES algorithm. Many cases of practical interest are covered by the conditions in these two papers.

## References

• Bertsekas (2005) Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 3rd edition, 2005.
• Bertsekas (2012) Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 4th edition, 2012.
• Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.
• Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I., and Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pp. 703–710, 1994.
• Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
• Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
• Sutton et al. (1998) Sutton, R. S., Barto, A. G., et al. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge, 1998.
• Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
• Tsitsiklis (1994) Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
• Tsitsiklis (2002) Tsitsiklis, J. N. On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3(Jul):59–72, 2002.
• Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.