
# On the convergence of cycle detection for navigational reinforcement learning

We consider a reinforcement learning framework where agents have to navigate from start states to goal states. We prove convergence of a cycle-detection learning algorithm on a class of tasks that we call reducible. Reducible tasks have an acyclic solution. We also syntactically characterize the form of the final policy. This characterization can be used to precisely detect the convergence point in a simulation. Our result demonstrates that even simple algorithms can be successful in learning a large class of nontrivial tasks. In addition, our framework is elementary in the sense that we only use basic concepts to formally prove convergence.


## 1 Introduction

Reinforcement Learning (RL) is the subfield of Artificial Intelligence concerned with agents that have to learn a task-solving policy by exploring state-action pairs and observing rewards (Sutton and Barto, 1998). Off-policy algorithms such as Q-learning, or on-policy algorithms such as Sarsa, are well-understood and can be shown to converge towards optimal policies under quite general assumptions. These algorithms do so by updating, for every state-action pair $(q, a)$, an estimate $Q(q, a)$ of the expected value of doing $a$ in $q$.

Our aim in this article is expressly not to propose a more efficient or more powerful new RL algorithm. In contrast, we want to show that convergence can occur already with very simplistic algorithms. The setting of our result is that of tasks where the agent has to reach a goal state in which a reward action can be performed. Actions can be nondeterministic. We like to refer to this setting as navigational learning.

The learning algorithm we consider is for a simplistic agent that can only remember the states it has already visited. The algorithm is on-policy; its only update rule is that, when a state is revisited, the policy is revised and updated with an arbitrary new action for that state. We refer to this algorithm as the cycle-detection algorithm. Our main result is that this algorithm converges for all tasks that we call reducible. Intuitively, a task is reducible if there exists a policy that is guaranteed to lead to reward. We also provide a test for convergence that an outside observer could apply to decide when convergence has happened, which can be used to detect convergence in a simulation. We note that the final policy is allowed to explore only a strict subset of the entire state space.

A first motivation for this work is to understand how biological organisms can be successful in learning navigational tasks. For example, animals can learn to navigate from their nest to foraging areas and back again (Geva-Sagiv et al., 2015). Reward could be related to finding food or returning home. As in standard RL, the learning process might initially exhibit exploration, after which eventually a policy is found that leads the animal more reliably to reward. In the context of biologically plausible learning, Frémaux et al. (2013) make the following interesting observations. First, navigational learning is not restricted to physical worlds, but can also be applied to more abstract state spaces. Second, the formed policy strongly depends on the experiences of the agent, and therefore the policy is not necessarily optimal. We elaborate these observations in our formal framework. We consider a general definition of tasks, which can be used to represent both physically-inspired tasks and more abstract tasks. Furthermore, we do not insist on finding (optimal) policies that generate the shortest path to reward, but we are satisfied with learning policies that avoid cycles.

A secondary motivation for this work is to contribute towards filling an apparent gap that exists between the field of Reinforcement Learning and the more logic-based fields of AI and computer science. Indeed, on the structural level, the notion of task as used in RL is very similar to the notion of interpretation in description logics (Baader et al., 2010), or the notion of transition system used in verification (Baier and Katoen, 2008). Yet, the methods used in RL to establish convergence are largely based on techniques from numerical mathematics and the theory of optimization. Our aim was to give proofs of convergence that are more elementary and are more in the discrete-mathematics style common in the above logic-based fields, as well as in traditional correctness proofs of algorithms (Cormen et al., 2009).

Standard RL convergence proofs assume the condition that state-action pairs are visited (and thus updated) infinitely often, see e.g. (Watkins and Dayan, 1992). Conditions of this kind are known as fairness conditions in the theory of concurrent processes (Francez, 1986). Also for our convergence proof we need an appropriate fairness assumption to the effect that when the agent repeats some policy-updating configuration infinitely often, it must also explore all possible updates infinitely often.

We note that the cycle-detection learning algorithm could be remotely related to biologically plausible mechanisms. In some models of biological learning (Potjans et al., 2011; Frémaux et al., 2013), a policy is represented by synaptic connections from neurons encoding (the perception of) states to neurons encoding actions. Connections are strengthened when pre-before-post synaptic activity is combined with reward (Schultz, 2013), causing an organism to remember action preferences for encountered states. If an organism would initially have a policy that frequently leads to cycles in the task, there is a (slow) way to still unlearn that policy, as follows. [Footnote 1: In this discussion, we purposely do not mention mechanisms of disappointment, i.e., the opposite of reward, because the framework in this article does not contain such mechanisms.] Consider a pair $(q, a)$ of a state $q$ and its preferred action $a$ in the policy. Due to noise (Maass, 2014), a neuron $x$ participating in the encoding of action $a$ could become activated just before state $q$ effectively occurs. Possibly, this post-before-pre synaptic activity leads to long-term-depression (Gerstner et al., 2014), i.e., connections from $q$ to $x$ are weakened. [Footnote 2: If neuron $x$ is activated by noise just before state $q$ occurs, refractoriness could prevent state $q$ from subsequently activating $x$ (Gerstner et al., 2014). The resulting absence of a postsynaptic spike at $x$ fails to elicit long-term-potentiation, i.e., connections from $q$ to $x$ are not strengthened. So, the mentioned weakening effect is not compensated.]

At some synapses, the weakening effect is aided by a longer time window for long-term-depression compared to long-term-potentiation (Markram et al., 2011). So, if reward would remain absent for longer periods, as in cycles without reward, noise could gradually unlearn action preferences for states. In absence of such preferences, noise could generate random actions for states. The unlearning phase followed by new random action proposals would resemble our cycle-detection algorithm.

##### Outline

This article is organized as follows. We discuss related work in Section 2. We formalize important concepts in Section 3. We present and prove our results in Section 4. We discuss examples and simulations in Section 5, and we conclude in Section 6.

## 2 Related Work

Some previous work on reinforcement learning algorithms is focused on learning a policy efficiently, say, using a polynomial number of steps in terms of certain input parameters of the task (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002; Strehl et al., 2009). There is also a line of work in reinforcement learning that is not necessarily aimed towards efficiently bounding the learning time. In that case, convergence of the learning process happens in the limit, by visiting task states infinitely often. Some notable examples are temporal-difference learning (Sutton, 1988; Sutton and Barto, 1998) and Q-learning (Watkins, 1989; Watkins and Dayan, 1992). Temporal-difference learning has become an attractive foundation for biological learning models (Potjans et al., 2011; Frémaux et al., 2013; Schultz, 2013, 2015).

Most previous works in numerical reinforcement learning try to find optimal policies, and their related optimal value functions (Sutton, 1988; Watkins and Dayan, 1992; Dayan, 1992; Dayan and Sejnowski, 1994; Jaakkola et al., 1994; Tsitsiklis, 1994); an optimal policy gives the highest reward in the long run. This has motivated the design of numerical learning techniques. The corresponding proof techniques do not always clearly illuminate how properties of the task state space interplay with a particular learning algorithm. With the framework introduced in this article, we hope to shed more light on properties of the task state space, in particular on the way that paths could be formed in the graph structure of the task. Although our graph-oriented framework has a different viewpoint compared to standard numerical reinforcement learning, we believe that our Theorem 4.1, showing that convergence always occurs on reducible tasks, and its proof contribute to making the fascinating idea of reinforcement learning more easily accessible to a wider audience. Our convergence result has a similar intent as previous results showing that numerical learning algorithms converge with probability one.

Various papers study models of reinforcement learning in the context of neuronal agents that learn to navigate a physical environment (Vasilaki et al., 2009; Potjans et al., 2011; Frémaux et al., 2013). Interestingly, Frémaux et al. (2013) study both physical and more abstract state spaces. As an example of a physical state space, they consider a navigation task in which a simulated mouse has to swim to a hidden platform where it can rest, where resting corresponds to reward; each state contains only the $x$ and $y$ coordinates. As an example of an abstract state space, they consider an acrobatic swinging task where reward is given when the tip of a double pendulum reaches a certain height; this space is abstract because each state contains two angles and two angular velocities, i.e., there are four dimensions. Conceptually it does not matter how many dimensions a state space has, because the agent is always just seeking paths in the graph structure of the task.

This idea of finding paths in the task state space is also explored by Bonet and Geffner (2006), in a framework based on depth-first search. Their framework has a more global perspective where learning operations have access to multiple states simultaneously and where the overall search is strongly embedded in a recursive algorithm with backtracking. Our algorithm acts from the local perspective of a single agent, where only one state can be observed at any time.

As remarked by Sutton and Barto (1998, p. 104), a repeated theme in reinforcement learning is to update the policy (and value estimation) while the agent visits states. This theme is also strongly present in the current article, because for each visited state the policy always remembers the most recently tried action for that state. The final aim for convergence, as studied in this article, is to eventually stop choosing new actions for the encountered states.

The notion of reducibility discussed in this article is related to the principles of (numerical) dynamic programming, upon which a large part of reinforcement learning literature is based (Sutton and Barto, 1998). Indeed, in reducibility, we defer the responsibility of obtaining reward from a given state to one of the successor states under a chosen action. This resembles the way in dynamic programming that reward prediction values for a given state can be estimated by looking at the reward prediction values of the successor states. In settings of standard numerical reinforcement learning, dynamic programming finds an optimal policy in a time that is worst-case polynomial in the number of states and actions. This time complexity is also applicable to our iterative reducibility procedure given in Section 3.1.

## 3 Navigational Reinforcement Learning

We formalize tasks and the notion of reducibility in Section 3.1. Next, in Section 3.2, we use an operational semantics to formalize the interaction between a task and our cycle-detection learning algorithm. In Section 3.3, we define convergence as the eventual stability of the policy. Lastly, in Section 3.4, we impose certain fairness restrictions on the operational semantics.

### 3.1 Tasks and Reducibility

To formalize tasks, we use nondeterministic transition systems where some transitions are labeled as being immediately rewarding, where reward is only an on-off flag. Formally, a task is a five-tuple

$$T = (Q, Q_0, A, \mathit{rewards}, \delta)$$

where $Q$, $Q_0$, and $A$ are nonempty finite sets; $Q_0 \subseteq Q$; $\mathit{rewards}$ is a nonempty subset of $Q \times A$; and $\delta$ is a function that maps each $(q, a) \in Q \times A$ to a nonempty subset of $Q$. The elements of $Q$, $Q_0$, and $A$ are called respectively states, start states, and actions. The set $\mathit{rewards}$ tells us which pairs of states and actions give immediate reward. Function $\delta$ describes the possible successor states of applying actions to states.
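As a rough illustration of the definition above, a task can be held in a small Python record whose `validate` method checks the stated side conditions. The class name, field names, and the two-state toy task are hypothetical and serve only to make the five-tuple concrete.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A task T = (Q, Q0, A, rewards, delta); all field names are illustrative."""
    states: frozenset        # Q
    start_states: frozenset  # Q0, a subset of Q
    actions: frozenset       # A
    rewards: frozenset       # subset of Q x A (on-off reward flag)
    delta: dict              # maps (q, a) to a nonempty set of successor states

    def validate(self) -> None:
        """Raise ValueError if a side condition of the definition fails."""
        if not (self.states and self.start_states and self.actions):
            raise ValueError("Q, Q0 and A must be nonempty")
        if not self.start_states <= self.states:
            raise ValueError("Q0 must be a subset of Q")
        pairs = {(q, a) for q in self.states for a in self.actions}
        if not self.rewards or not self.rewards <= pairs:
            raise ValueError("rewards must be a nonempty subset of Q x A")
        for pair in pairs:
            successors = self.delta.get(pair)
            if not successors or not set(successors) <= self.states:
                raise ValueError(f"delta{pair} must be a nonempty subset of Q")


# Hypothetical two-state task, used only to exercise the checks.
toy = Task(
    states=frozenset({"s", "g"}),
    start_states=frozenset({"s"}),
    actions=frozenset({"go"}),
    rewards=frozenset({("g", "go")}),
    delta={("s", "go"): {"g"}, ("g", "go"): {"g"}},
)
toy.validate()  # passes silently
```

Storing $\delta$ as a dictionary keyed by state-action pairs keeps the nondeterminism explicit: each entry is the full set of possible successors, with no probabilities attached.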

###### Remark 3.1.

Our formalization of tasks keeps only the graph structure of models previously studied in reinforcement learning; essentially, compared to finite Markov decision processes (Sutton and Barto, 1998), we omit transition probabilities and we simplify the numerical reward signals to boolean flags. We do not yet study negative feedback signals, so performed actions give either reward or no reward, i.e., the feedback is either positive or neutral. In our framework, the agent can observe states in an exact manner, which is a commonly used assumption (Sutton and Barto, 1998; Kearns and Singh, 2002; Brafman and Tennenholtz, 2002; Bonet and Geffner, 2006). We mention negative feedback signals and partial information as topics for further work in Section 6.

##### Reducibility

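Let $T = (Q, Q_0, A, \mathit{rewards}, \delta)$ be a task as above. We define the set

$$\mathit{goals}(T) = \{q \in Q \mid \exists a \in A \text{ with } (q, a) \in \mathit{rewards}\}.$$

We refer to the elements of $\mathit{goals}(T)$ as goal states. Intuitively, for a goal state there is an action that reliably gives immediate reward. Each task has at least one goal state because the set $\mathit{rewards}$ is always nonempty. The agent could learn a strategy to reduce all encountered states to goal states, and then perform a rewarding action at goal states. This intuition is formalized next.

Let $V \subseteq Q$. We formalize how states can be reduced to $V$. Let $\mathbb{N}_0$ denote the set of natural numbers without zero. First, we define the infinite sequence

$$L_1, L_2, \ldots$$

of sets where $L_1 = V$, and for each $i \geq 2$,

$$L_i = L_{i-1} \cup \{q \in Q \mid \exists a \in A \text{ with } \delta(q, a) \subseteq L_{i-1}\}.$$

We call $L_1$, $L_2$, etc., the (reducibility) layers. We define $\mathit{reduce}(T, V) = \bigcup_{i \in \mathbb{N}_0} L_i$. Note that $\mathit{reduce}(T, V) \subseteq Q$. Because $Q$ is finite, there is an index $n \in \mathbb{N}_0$ for which $L_n = L_{n+1}$, i.e., $L_n$ is a fixpoint. Letting $V' \subseteq Q$, we say that $V'$ is reducible to $V$ if $V' \subseteq \mathit{reduce}(T, V)$. Intuitively, each state in $V'$ can choose an action to come closer to $V$. We also say that a single state $q$ is reducible to $V$ if $q \in \mathit{reduce}(T, V)$.

Now, we say that task $T$ is reducible (to reward) if the state set $Q$ is reducible to $\mathit{goals}(T)$. We use the abbreviation $\mathit{reduce}(T) = \mathit{reduce}(T, V)$ where $V = \mathit{goals}(T)$. Reducibility formalizes a sense of solvability of tasks.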

We illustrate the notion of reducibility with the following example.

###### Example 3.2.

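We consider the task $T = (Q, Q_0, A, \mathit{rewards}, \delta)$ defined as follows: $Q = \{1, 2, 3\}$; $Q_0 \subseteq Q$; $A = \{a, b\}$; $\mathit{rewards}$ is a nonempty subset of $\{3\} \times A$; and, regarding $\delta$, we define

$$\delta(1, a) = \{1, 3\}, \quad \delta(1, b) = \{2\}, \quad \delta(2, a) = \{1, 3\}, \quad \delta(2, b) = \{3\}, \quad \delta(3, a) = \delta(3, b) = \{3\}.$$

Task $T$ is visualized in Figure 3.1. Note that the task is reducible, by assigning the action $b$ to both state $1$ and state $2$. The reducibility layers up to and including the fixpoint are:

$$L_1 = \mathit{goals}(T) = \{3\}, \quad L_2 = \{3, 2\}, \quad L_3 = \{3, 2, 1\}.$$

For simplicity, the assignments $1 \mapsto b$ and $2 \mapsto b$ form a deterministic strategy to reward. But we could easily extend task $T$ to a task in which the strategy to reward is always subjected to nondeterminism, by adding a new state together with new mappings for $\delta$.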
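The layer construction can be checked mechanically. The following is a minimal sketch that iterates the definition of $L_i$ until the fixpoint and applies it to the transition function of Example 3.2. The function name is hypothetical, and the concrete set `rewards = {(3, a), (3, b)}` is an assumption chosen only to agree with $\mathit{goals}(T) = \{3\}$.

```python
def reducibility_layers(states, actions, delta, targets):
    """Return [L1, L2, ..., Ln], stopping at the first fixpoint Ln."""
    layers = [frozenset(targets)]
    while True:
        prev = layers[-1]
        # A state joins the next layer if some action sends *all* of its
        # possible successors into the previous layer.
        added = {q for q in states
                 if any(delta[(q, a)] <= prev for a in actions)}
        nxt = prev | added
        if nxt == prev:          # fixpoint reached
            return layers
        layers.append(nxt)


# Transition function of Example 3.2.
states = {1, 2, 3}
actions = {"a", "b"}
delta = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3},    (3, "b"): {3},
}
rewards = {(3, "a"), (3, "b")}          # assumption, see the lead-in
goals = {q for (q, _action) in rewards}

layers = reducibility_layers(states, actions, delta, goals)
is_reducible = layers[-1] == states
print(layers, is_reducible)
```

Each pass through the loop either adds at least one state or stops, so the loop runs at most $|Q|$ times. This is one way to see the polynomial bound mentioned in Section 2.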

###### Remark 3.3.

Reducibility formalizes the intuition of an acyclic solution. This appears to be a natural notion of solvability, even in state graphs that contain cycles (Bonet and Geffner, 2006).

We would like to emphasize that reducibility is a notion of progress in the task transition graph, but it is not the same as determinism because each action application, i.e., transition, remains inherently nondeterministic. We may think of reducibility as onion layers in the state space: the core of the onion consists of the goal states, where immediate reward may be obtained, and, for states in outer layers there is an action that leads one step down to an inner layer, closer to reward. When traveling from an outer layer to an inner layer, the nondeterminism manifests itself as unpredictability on the exact state that is reached in the inner layer.

### 3.2 Cycle-detection Algorithm

We describe a cycle-detection learning algorithm that operates on tasks, by means of an operational semantics that describes the steps taken over time. We first give the intuition behind the cycle-detection algorithm, and then we proceed with the formal semantics.

#### 3.2.1 Intuition

We want to formally elaborate the intuition of path learning. Our aim therefore is not necessarily to design another efficient learning algorithm. It seems informative to seek only the bare ingredients necessary for navigational learning. What would such a simple algorithm look like?

As a first candidate, let us consider the algorithm that is given some random initial policy and that always follows the policy during execution. There would not be any exploration, and no learning, since the policy is always followed and never modified. In general, the policy might not even lead to any reward at all, and the agent might run around in cycles without obtaining reward.

At the opposite end of the spectrum, there could be a completely random process, that upon each visit to a task state always chooses some random action. If the agent is lucky then the random movement through the state space might occasionally, but unreliably, lead to reward. There is no sign of learning here either, because there is no storage of previously gained knowledge about where reward can be obtained.

Now we consider the following in-between strategy: the algorithm could only choose random actions when it detects a cycle in the state space before reaching reward. If the agent does not escape from the cycle then it might keep running around indefinitely without ever reaching reward. More concretely, we could consider a cycle-detection algorithm, constituted by the following directives:

• Starting from a given start state, we continuously remember all encountered states. Each time when reward is obtained, we again forget about which states we have seen.

• Whenever we encounter a state $q$ that we have already seen before, we perform some random action, and we store that action in the policy (for that state $q$).

The cycle-detection algorithm is arguably amongst the simplest learning algorithms that one could conceive. The algorithm might be able to gradually refine the policy to avoid cycles, causing the agent to eventually follow an acceptable path to reward. The working memory is a set containing all states that are visited before obtaining reward. The working memory is reset whenever reward is obtained.
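The two directives can be sketched in a few lines of Python. This is an informal illustration rather than the formal semantics of Section 3.2.2. The function name, the step budget, the seeded random generator, and the reward set `{(3, a), (3, b)}` for the task of Example 3.2 are all assumptions.

```python
import random


def run_trial(start, policy, actions, delta, rewards, rng, max_steps=10_000):
    """Run one trial, updating `policy` in place; return the visited path."""
    seen = set()                 # working memory, empty at the start of a trial
    q, path = start, [start]
    for _ in range(max_steps):
        if q in seen:
            # Revisit detected: store an arbitrary new action for this state.
            policy[q] = rng.choice(sorted(actions))
        seen.add(q)
        a = policy[q]
        q_next = rng.choice(sorted(delta[(q, a)]))   # nondeterministic outcome
        path.append(q_next)
        if (q, a) in rewards:
            return path          # reward obtained; `seen` is simply dropped
        q = q_next
    raise RuntimeError("step budget exhausted before reward")


# Task of Example 3.2, with an assumed reward set.
actions = {"a", "b"}
delta = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3},    (3, "b"): {3},
}
rewards = {(3, "a"), (3, "b")}
policy = {1: "a", 2: "a", 3: "a"}   # arbitrary initial policy
rng = random.Random(0)              # seeded only for reproducibility

paths = [run_trial(1, policy, actions, delta, rewards, rng) for _ in range(50)]
print(policy, paths[-1])
```

The policy is the only thing carried from one trial to the next. Once it maps both state $1$ and state $2$ to $b$, no state can be revisited before reward, so the policy no longer changes.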

#### 3.2.2 Operational Semantics

We now formalize the cycle-detection algorithm. In the following, let $T = (Q, Q_0, A, \mathit{rewards}, \delta)$ be a task.

##### Configurations

A configuration of $T$ is a triple $c = (q, \pi, w)$, where $q \in Q$; $\pi$ maps each $q' \in Q$ to an element of $A$; and $w \subseteq Q$. The function $\pi$ is called the policy. The set $w$ is called the working memory and it contains the states that are already visited during the execution, but we will reset $w$ to $\emptyset$ whenever reward is obtained. We refer to $q$ as the current state in the configuration, and we also say that $c$ contains the state $q$. Note that there are only a finite number of possible configurations. The aim of the learning algorithm is to refine the policy during trials, as we formalize below.

##### Transitions

We formalize how to go from one configuration to another, to represent the steps of the running algorithm. Let $c = (q, \pi, w)$ be a configuration. We say that $c$ is branching if $q \in w$; this means that configuration $c$ represents a revisit to state $q$, and that we want to generate a new action for the current state $q$. Next, we define the set $\mathit{opt}(c)$ as follows: letting $A' = A$ if $c$ is branching and $A' = \{\pi(q)\}$ otherwise, we define

$$\mathit{opt}(c) = \{(a, q') \mid a \in A' \text{ and } q' \in \delta(q, a)\}.$$

Intuitively, $\mathit{opt}(c)$ contains the options of actions and successor states that may be chosen directly after $c$. If $c$ is branching then all actions may be chosen, and otherwise we must restrict attention to the action stored in the policy for the current state. Note that the successor state depends on the chosen action.
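To make the option set concrete, the sketch below computes $\mathit{opt}(c)$ for a configuration encoded as a Python triple `(q, policy, memory)`. The encoding and the function names are illustrative assumptions. The transition function is the one from Example 3.2.

```python
def is_branching(config):
    """A configuration is branching if its current state is in working memory."""
    q, _policy, memory = config
    return q in memory


def opt(config, actions, delta):
    """Return the set of (action, successor) options available after `config`."""
    q, policy, _memory = config
    # All actions on a revisit; otherwise only the action stored in the policy.
    allowed = actions if is_branching(config) else {policy[q]}
    return {(a, q_next) for a in allowed for q_next in delta[(q, a)]}


actions = {"a", "b"}
delta = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3},    (3, "b"): {3},
}
policy = {1: "a", 2: "a", 3: "a"}

first_visit = (1, policy, frozenset())      # state 1 not yet in working memory
revisit = (1, policy, frozenset({1}))       # state 1 already visited

print(opt(first_visit, actions, delta))     # options under the stored action only
print(opt(revisit, actions, delta))         # options under every action
```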

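Next, for a configuration $c_1 = (q_1, \pi_1, w_1)$ and a pair $(a, q') \in \mathit{opt}(c_1)$, we define the successor configuration $c_2 = (q_2, \pi_2, w_2)$ that results from the application of $(a, q')$ to $c_1$, as follows:

• $q_2 = q'$;

• $\pi_2(q_1) = a$ and $\pi_2(r) = \pi_1(r)$ for all $r \in Q \setminus \{q_1\}$; and,

• $w_2 = w_1 \cup \{q_1\}$.

We emphasize that only the action and visited-status of the state $q_1$ is modified, where $q_1$ is the state that is departed from. We denote the successor configuration as $\mathit{apply}(c_1, a, q')$.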
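The three conditions translate directly into a pure function on configurations. As before, encoding a configuration as a `(q, policy, memory)` triple and the name `apply_option` are assumptions made for this sketch.

```python
def apply_option(config, a, q_next):
    """Return the successor configuration after choosing option (a, q_next)."""
    q, policy, memory = config
    new_policy = dict(policy)      # copy, so the source configuration is untouched
    new_policy[q] = a              # only the departed state gets a new action
    new_memory = memory | {q}      # only the departed state is marked as visited
    return (q_next, new_policy, new_memory)


c1 = (1, {1: "a", 2: "a", 3: "a"}, frozenset())
c2 = apply_option(c1, "b", 2)
print(c2)
```

Returning a fresh triple instead of mutating the input mirrors the definition, where $c_1$ and $c_2$ are distinct configurations linked by a transition.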

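A transition $t$ is a four-tuple $(c_1, a, q', c_2)$, also denoted as $c_1 \xrightarrow{a,\,q'} c_2$, where $c_1 = (q_1, \pi_1, w_1)$ and $c_2$ are two configurations, $(a, q') \in \mathit{opt}(c_1)$, and $c_2 = \mathit{apply}(c_1, a, q')$. We refer to $c_1$ and $c_2$ as the source configuration and target configuration, respectively. We say that $t$ is a reward transition if $(q_1, a) \in \mathit{rewards}$. Note that there are only a finite number of possible transitions because there are only a finite number of possible configurations.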

##### Trials and Runs

A chain $C$ is a sequence of transitions where for each pair $(t_1, t_2)$ of subsequent transitions, the target configuration of $t_1$ is the source configuration of $t_2$. Chains could be finite or infinite.

A trial is a chain $C$ where either (i) the chain is infinite and contains no reward transitions; or, (ii) the chain is finite, ends with a reward transition, and contains no other reward transitions. To rephrase, if a trial is finite then it ends at the first occurrence of reward; and, if there is no reward transition then the trial must be infinite.

In a trial, we say that an occurrence of a configuration is terminal if that occurrence is the last configuration of the trial, i.e., the occurrence is the target configuration of the only reward transition. Note that an infinite trial contains no terminal configurations.

A start configuration is any configuration $c = (q, \pi, w)$ where $q \in Q_0$ and $w = \emptyset$; no constraints are imposed on the policy $\pi$.

Now, a run $R$ on the task $T$ is a sequence of trials, where

1. the run is either an infinite sequence of finite trials, or the run consists of a finite prefix of finite trials followed by one infinite trial;

2. the first configuration of the first trial is a start configuration;

3. whenever one (finite) trial ends with a configuration $c_1 = (q_1, \pi_1, w_1)$ and the next trial starts with a configuration $c_2 = (q_2, \pi_2, w_2)$, we have (i) $q_2 \in Q_0$; (ii) $\pi_2 = \pi_1$; and, (iii) $w_2 = \emptyset$ [Footnote 3: Note that $c_2$ satisfies the definition of start configuration.]; and,

4. if the run contains infinitely many trials then each start state $q \in Q_0$ is used at the beginning of infinitely many trials.

We put condition (3) in words: when one trial ends, we start the next trial with a start state, we reuse the policy, and we again reset the working memory. By resetting the working memory, we forget which states were visited before obtaining the reward. The policy is the essential product of a trial. Condition (4), saying that each start state is used at the beginning of infinitely many trials, expresses that we want to learn the whole task, with all possible start states.

To refer to a precise occurrence of a trial in a run, we use the ordinal of that occurrence, which is a nonzero natural number.

###### Remark 3.4.

In the above operational semantics, the agent repeatedly navigates from start states to goal states. After obtaining immediate reward at a goal state, the agent’s location is always reset to a start state. One may call such a framework episodic (Sutton and Barto, 1998). We note that our framework can also be used to study more continuing operational processes, that do not always enforce a strong reset mechanism from goal states back to remote start states. Indeed, a task could define the set of start states simply as the set of all states. In that case, there are runs possible where some trials start at the last state reached by the previous trial, as if the agent is trying to obtain a sequence of rewards; but we still reset the working memory each time when we begin a new trial.

### 3.3 Convergence

We now define a convergence property to formalize when learning has stopped in a run. Consider a task $T$. Let $R$ be a run on $T$.

###### Definition 3.5.

We say that a state $q$ (eventually) becomes stable in $R$ if there are only finitely many non-terminal occurrences of branching configurations containing $q$.

An equivalent definition is to say that after a while there are no more branching configurations at non-terminal positions containing $q$. Intuitively, eventual stability of $q$ means that after a while there is no risk anymore that $q$ is paired with new actions, so $q$ will definitely stay connected to the same action. [Footnote 4: If a branching configuration $c$ is terminal in a trial, $c$ cannot influence the action of its contained state because there is no subsequent transition anymore.] Note that states appearing only a finite number of times in $R$ always become stable under this definition.

We say that the run $R$ converges if (i) all trials terminate (with reward), and (ii) eventually all states become stable. We say that the task $T$ is learnable if all runs on $T$ converge.

###### Remark 3.6.

In a run that converges, note that the policy will eventually become fixed because the only way to change the policy is through branching configurations at non-terminal positions. The last policy formed in a run is called the final policy, which is studied in more detail in Section 4.2. We emphasize that a converging run never stops, because runs are defined as being infinite; the final policy remains in use indefinitely, but it is not updated anymore.

We would also like to emphasize that in a converging run, eventually, the trials contain no cycles before reaching reward: the only moment in a trial where a state could be revisited, is in the terminal configuration, i.e., in the target configuration of the reward transition.
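The observation that converged trials contain no cycles before reward suggests a simple graph check on a fixed policy. The sketch below is an illustration only and is not the syntactic characterization of Section 4.2. It follows the policy from the given start states, treats rewarding state-action pairs as ends of a trial, and reports whether a state could be revisited beforehand. The function name, the choice of start states, and the reward set for Example 3.2 are assumptions.

```python
def policy_avoids_cycles(policy, start_states, delta, rewards):
    """True if no trial following `policy` can revisit a state before reward."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {}

    def visit(q):
        color[q] = GREY
        a = policy[q]
        if (q, a) not in rewards:     # a rewarding pair ends the trial here
            for q_next in delta[(q, a)]:
                status = color.get(q_next, WHITE)
                if status == GREY:    # back edge: a cycle before reward
                    return False
                if status == WHITE and not visit(q_next):
                    return False
        color[q] = BLACK
        return True

    return all(color.get(q, WHITE) != WHITE or visit(q) for q in start_states)


# Task of Example 3.2, with an assumed reward set and start state.
delta = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3},    (3, "b"): {3},
}
rewards = {(3, "a"), (3, "b")}

good = {1: "b", 2: "b", 3: "a"}   # 1 -> 2 -> 3, then reward
bad = {1: "a", 2: "b", 3: "a"}    # action a in state 1 may lead back to state 1

print(policy_avoids_cycles(good, {1}, delta, rewards))
print(policy_avoids_cycles(bad, {1}, delta, rewards))
```

Because every non-rewarding state has at least one successor, a policy that passes this check also guarantees that every trial from the given start states ends with reward.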

### 3.4 Fairness

There are two choice points in each transition of the operational semantics:

• if the source configuration of the transition is branching, i.e., the current state is revisited, then we choose a new random action for the current state; and,

• whenever we apply an action $a$ to a state $q$, we can in general choose among several possible successor states in $\delta(q, a)$.

Fairness assumptions are needed to give the learning algorithm sufficient opportunities to detect problems and try better policies (Francez, 1986). Intuitively, in both choice points, the choice should be independent of what the policy and working memory say about states other than the current state. This intuition is related to the Markov assumption, or independence of path assumption (Sutton and Barto, 1998). Below, we formalize this intuition as a fairness notion for the operational semantics of Section 3.2.2.

We say that a run $R$ is fair if for each configuration $c$ that occurs infinitely often at non-terminal positions, for each $(a, q') \in \mathit{opt}(c)$, the following transition occurs infinitely often:

$$c \xrightarrow{a,\,q'} \mathit{apply}(c, a, q').$$

We say that a task $T$ is learnable under fairness if all fair runs of $T$ converge.

###### Remark 3.7.

There is always a fair run for any task, as follows. For each possible configuration $c$, we could conceptually order the set $\mathit{opt}(c)$. During a run, we could also keep track for each occurrence $i$ of a configuration $c$ how many times we have already seen configuration $c$ in the run, excluding the current occurrence; we denote this number as $\mathit{count}(i)$.

We begin the first trial with a random start configuration $c$, i.e., we choose a random start state and a random policy. We next choose the option with the first ordinal in the now ordered set $\mathit{opt}(c)$. Now, for all the subsequent occurrences $i$ of a configuration $c$ in the run, we choose the option with ordinal $(\mathit{count}(i) \bmod |\mathit{opt}(c)|) + 1$ in the set $\mathit{opt}(c)$. So, if a configuration occurs infinitely often at non-terminal positions then we continually rotate through all its options. Naturally, trials end at the first occurrence of reward, and then we choose another start state, taking care to use all start states infinitely often.
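The rotation scheme of this remark can be sketched as a small chooser object that counts how often each configuration has been seen and cycles through its ordered options. The class name and the idea of passing a hashable key for each configuration are assumptions of this sketch. Sorting the options is one arbitrary way to fix the conceptual order.

```python
from collections import defaultdict


class RoundRobinChooser:
    """Rotate through the options of each configuration, one per occurrence."""

    def __init__(self):
        # Maps a configuration key to its number of earlier occurrences.
        self.seen = defaultdict(int)

    def choose(self, config_key, options):
        ordered = sorted(options)          # fix an order on opt(c)
        k = self.seen[config_key]          # occurrences before the current one
        self.seen[config_key] += 1
        return ordered[k % len(ordered)]   # rotate through all options


# Options of a branching configuration at state 1 in Example 3.2.
options = {("a", 1), ("a", 3), ("b", 2)}
key = (1, (("1", "a"),), frozenset({1}))   # any hashable stand-in for c

chooser = RoundRobinChooser()
picks = [chooser.choose(key, options) for _ in range(4)]
print(picks)
```

A configuration that occurs infinitely often is offered every option infinitely often under this scheme, which is the property the fairness definition asks for.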

## 4 Results

The cycle-detection learning algorithm formalized in Section 3.2.2 continually marks the encountered states as visited. At the end of trials, i.e., after obtaining reward, each state is again marked as unvisited. If the algorithm encounters a state $q$ that is already visited within the same trial, the algorithm proposes to generate a new action for $q$. Intuitively, if the same state $q$ is encountered in the same trial, the agent might be running around in cycles and some new action should be tried for $q$ to escape from the cycle. It is important to avoid cycles if we want to achieve an eventual upper bound on the length of a trial, i.e., an upper bound on the time it takes to reach reward from a given start state.

Repeatedly trying a new action for revisited states might eventually lead to reward, and thereby terminate the trial. In this learning process, the nondeterminism of the task can be both helpful and hindering: nondeterminism is helpful if transitions choose successor states that are closer to reward, but nondeterminism is hindering if transitions choose successor states that are further from reward or might lead to a cycle. Still, on some suitable tasks, like reducible tasks, the actions that are randomly tried upon revisits might eventually globally form a policy that will never get trapped in a cycle ever again (see Theorem 4.1 below).

The outline of this section is as follows. In Section 4.1, we present a sufficient condition for tasks to be learnable under fairness. In Section 4.2 we discuss how a simulator could detect that convergence has occurred in a fair run. In Section 4.3 we present necessary conditions for tasks to be learnable under fairness.

### 4.1 Sufficient Condition for Convergence

Intuitively, if a task is reducible then we might be able to obtain a policy that on each start state leads to reward without revisiting states in the same trial. As long as revisits occur, we keep searching for the acyclic flow of states that is implied by reducibility. We can imagine that states near the goal states, i.e., near immediate reward, tend to more quickly settle on an action that leads to reward. Subsequently, states that are farther removed from immediate reward can be reduced to states near goal states, and this growth process propagates through the entire state space. This intuition is confirmed by the following convergence result:

###### Theorem 4.1.

All reducible tasks are learnable under fairness.

###### Proof.

Let be a reducible task. Let be a fair run on . We show convergence of . In Part 1 of the proof, we show that all trials in terminate (with reward). In Part 2, we show that eventually all states become stable in .

Part 1: Trials terminate. Let

 L1,L2,…

be the reducibility layers for as defined in Section 3.1, where . Let be a trial in . To show finiteness of , and thus termination of , we show by induction on that the states in occur only finitely many times in . Because is reducible, there is an index for which , and therefore our inductive proof shows that every state only occurs a finite number of times in the trial ; hence, is finite.

Before we continue, we recall that a state is marked as visited after its first occurrence in the trial; any occurrence of after its first occurrence is therefore in a branching configuration.

Base case. Let $q \in L_1$. Towards a contradiction, suppose $q$ occurs infinitely often in the trial, making the trial infinite. Because there are only a finite number of possible configurations, there is a configuration $c$ containing $q$ that occurs infinitely often in the trial at non-terminal positions (because the trial is now infinite). Configuration $c$ is branching because it occurs more than once. (To see this, take for instance the second occurrence of $c$ in the trial. That occurrence represents a revisit to $q$, so $q$ is in the working memory set of $c$.) By definition of $L_1$, there is an action $a$ such that $(q, a) \in \mathit{rewards}$. Since always $\delta(q, a) \neq \emptyset$, we can choose some $q' \in \delta(q, a)$. The option $(a, q')$ is available in $c$ because $c$ is branching. By fairness, the following transition must occur infinitely often in the trial:

$$c \xrightarrow{a,\,q'} \mathit{apply}(c, a, q').$$

But this transition is a reward transition, so the trial would have already ended at the first occurrence of this transition. Hence $q$ cannot occur infinitely many times; this is the desired contradiction.

Inductive step. Let , and let us assume that states in occur only finitely many times in . Let . By definition of , there is some action such that . Towards a contradiction, suppose occurs infinitely often in , making infinite. Like in the base case, there must be a branching configuration containing that occurs infinitely often in the trial (at non-terminal positions). Since always , we can choose some . We have because is branching. By fairness, the following transition must occur infinitely often in the trial:

$$c \xrightarrow{a,\,q'} \mathit{apply}(c, a, q').$$

But then would appear infinitely often in trial . This is the desired contradiction, because the induction hypothesis says that all states in (including ) occur only finitely many times in .

Part 2: Stability of states. We now show that all states eventually become stable in the fair . Let

$$L_1, L_2, \ldots$$

again be the reducibility layers for as above, where . We show by induction on that states in become stable in . Since is reducible, there is an index such that , so our inductive proof shows that all states eventually become stable.

Before we continue, we recall that Part 1 of the proof has shown that all trials are finite. So, whenever we say that a configuration occurs infinitely often in the run, this means that the configuration occurs in infinitely many trials. Similarly, if a transition occurs infinitely often in the run, this means that the transition occurs in infinitely many trials.

Base case. Let $q \in L_1$. Towards a contradiction, suppose $q$ would not become stable. This means that there are infinitely many non-terminal occurrences of branching configurations containing $q$. (For completeness, we recall that if $q$ would occur only a finite number of times in the run then we can immediately see in the definition of stability that $q$ becomes stable.) Because there are only finitely many possible configurations, there must be a branching configuration $c$ containing $q$ that occurs infinitely often at non-terminal positions.

By definition of $L_1$, there is an action $a$ such that $(q, a) \in \mathit{rewards}$. Since always $\delta(q, a) \neq \emptyset$, we can choose some $q' \in \delta(q, a)$. The option $(a, q')$ is available in $c$ because $c$ is branching. By fairness, the following transition must occur infinitely often in the run:

$$c \xrightarrow{a,\,q'} \mathit{apply}(c, a, q').$$

Transition is a reward transition because . Let be the index of a trial containing transition ; this implies that is the last transition of trial . We now show that any non-terminal occurrences of after trial must be in a non-branching configuration. Hence, becomes stable; this is the desired contradiction.

Consider the first trial index after in which occurs again at a non-terminal position. Let configuration with be the first occurrence of in trial (which is at a non-terminal position). Note that because (i) trial ends with the assignment of action to (through transition ), and (ii) the trials between and could not have modified the action of . Further, configuration is not branching because is not yet flagged as visited at its first occurrence in trial . This means that at any occurrence of , trial must select an option , with action and , and perform the corresponding transition :

$$c_1 \xrightarrow{a,\,q''} \mathit{apply}(c_1, a, q'').$$

Again, since , trial ends directly after transition ; no branching configuration containing can occur in trial at a non-terminal position. (Although it is possible that is directly revisited from itself, it does not matter whether the terminal configuration of the trial is branching or not.) This reasoning can now be repeated for all following trials to see that there are no more non-terminal occurrences of branching configurations containing .

Inductive step. Let . We assume for each that eventually becomes stable. Now, let . By definition of , there is an action such that . Towards a contradiction, suppose that does not become stable. Our aim is to show that now also at least one does not become stable, which would contradict the induction hypothesis.

Regarding terminology, we say that a chain is a -chain if (i) the chain contains only non-reward transitions and (ii) the chain has the following desired form:

$$c_1 \xrightarrow{a_1,\,q_2} c_2 \xrightarrow{a_2,\,q_3} \cdots \xrightarrow{a_{n-1},\,q_n} c_n,$$

denoting for each , where and . Note that such a chain starts and ends with an occurrence of , so is revisited in the chain. Moreover, the first transition performs the action from above. Next, we say that a trial is a -trial if the trial contains a -chain. In principle, each -trial could embed a different -chain.

To see that there are infinitely many occurrences of -trials in , we distinguish between the following two cases.

• Suppose that in there are infinitely many occurrences of trials that end with a policy where , i.e., action is assigned to . Let be the index of such a trial occurrence. Because by assumption does not become stable, we can consider the first trial index after in which occurs in a branching configuration at a non-terminal position. Note that trials between trial and trial do not modify the action of . Now, the first occurrence of in trial is always non-branching, and thus we perform action there. The subsequence in trial starting at the first occurrence of and ending at some branching configuration of at a non-terminal position, is a -chain: the chain starts and ends with , its first transition performs action , and it contains only non-reward transitions because it ends at a non-terminal position. Hence, trial is a -trial.

• Conversely, suppose that in there are only finitely many occurrences of trials that end with a policy where . Let be an (infinite) suffix of in which no trial ends with action assigned to . Because by assumption does not become stable, and because the number of possible configurations is finite, there is a branching configuration containing that occurs infinitely often at non-terminal positions in . Choose some . We have because is branching. By fairness, the following transition occurs infinitely often in :

$$c \xrightarrow{a,\,q'} \mathit{apply}(c, a, q').$$

Let be the index of a trial occurrence in that contains transition ; there are infinitely many such indexes because all trials in are finite (see Part 1 of the proof). Since transition attaches action to , we know by definition of that any occurrence of in trial is followed by at least one other transition from state that attaches an action to with ; this implies that after each occurrence of transition in trial there is a branching configuration of at a non-terminal position. In trial , a subsequence starting at any occurrence of and ending with the first subsequent branching configuration of at a non-terminal position, is a -chain: the chain starts and ends with , its first transition performs action , and the chain contains only non-reward transitions because it ends at a non-terminal position. Hence, trial is a -trial.

We have seen above that there are infinitely many occurrences of -trials in . Because there are only a finite number of possible configurations, there is a configuration containing that is used in infinitely many occurrences of -trials as the last configuration of a -chain. Note that occurs infinitely often at non-terminal positions since -chains contain no reward transitions.

Next, we can choose from some occurrence of a -trial in the run some -chain where in particular the last configuration of is the configuration . Formally, we write as

$$c_1 \xrightarrow{a_1,\,q_2} c_2 \xrightarrow{a_2,\,q_3} \cdots \xrightarrow{a_{n-1},\,q_n} c_n,$$

where , and denoting for each , where and . We recall that all transitions of are non-reward transitions. Note that : we have because and .

In chain , we have certainly marked state as visited after its first occurrence, causing configuration to be branching. This implies , where is the same option as taken by the first transition of , since . Also, since , we have certainly marked state as visited after its first occurrence in ; this implies . Next, since the configuration occurs infinitely often at non-terminal positions (see above), the following transition also occurs infinitely often by fairness:

$$c_n \xrightarrow{a,\,q_2} c_{n+1},$$

where . Because and , configuration is branching. Moreover, we know that since no transition of is a reward transition, including the first transition. So, the branching configuration occurs infinitely often at non-terminal positions. Hence, would not become stable. Yet, , and the induction hypothesis on says that does become stable; this is the desired contradiction.

###### Remark 4.2.

By Theorem 4.1, the trials in a fair run on a reducible task eventually contain a number of non-terminal configurations that is at most the number of states; otherwise at least one state would never become stable. (If there were infinitely many trials that contain more non-terminal configurations than states, then in infinitely many trials there would be a revisit to a state, in a branching configuration, at a non-terminal position. Since there are only finitely many states, at least one state would occur in a branching configuration at a non-terminal position in infinitely many trials; this state does not become stable by definition.) So, we get a relatively good eventual upper bound on trial length. However, Theorem 4.1 provides no information on the waiting time before that upper bound will emerge, because that waiting time strongly depends on the choices made by the run regarding start states of trials, tried actions, and successor states (see also Section 6).

Because we seek a policy that avoids revisits to states in the same trial, an important intuition implied by Theorem 4.1 is that for reducible tasks eventually the trials of a run follow paths without cycles through the state space. The followed paths are still influenced by nondeterminism, but they never contain a cycle. Also, a path followed in a trial is not necessarily the shortest possible path to reward, because the discovery of paths depends on experience, i.e., on the order in which actions were tried during the learning process. The experience dependence was experimentally observed, e.g. by Frémaux et al. (2013).

###### Remark 4.3.

The order in which states become stable in a fair run does not necessarily have to follow the order of the reducibility layers of Section 3.1. In general, it seems possible that some states that are farther removed from goal states could become stable faster than some states nearer to goal states; but, to become stable, the farther removed states probably should first have some stable strategy to the goal states.

To see that simulations do not exactly follow the inductive reasoning of the proof of Theorem 4.1, one could compare, in the later Section 5, the canonical policy implied by reducibility in Figure 5.2 with an actual final policy in Figure 5.4.

The following example illustrates the necessity of the fairness assumption in Theorem 4.1. So, although the convergence result for reducible tasks appears natural, the example reveals that subtle notions, like the fairness assumption, should be taken into account to understand learning.

###### Example 4.4.

Consider again the task from Example 3.2, which is also visualized in Figure 3.1. In the following, for ease of notation, we denote configurations as triples: the first component is the current state; the second component is the action assigned by the policy to the specific state 1, with action $a$ assigned to all other states; and the third component is the set of visited states as before.

Consider now the following trial, in which the initial policy has assigned action $a$ to all states, including the start state 1:

$$(1, a, \{\,\}) \xrightarrow{a,\,1} (1, a, \{1\}) \xrightarrow{b,\,2} (2, b, \{1\}) \xrightarrow{a,\,3} (3, b, \{1,2\}) \xrightarrow{a,\,3} (3, b, \{1,2,3\}).$$

This is indeed a valid trial because the last transition is a reward transition. Note also that a revisit to state 1 occurs in the first transition. The configuration $(1, a, \{1\})$ is thus branching, which implies that the option $(b, 2)$ may be chosen there. At the end of this trial, action $b$ is assigned to state 1 and action $a$ is assigned to the other states.

Consider also the following trial, in which the initial policy has assigned action $b$ to state 1 and action $a$ to all other states:

$$(1, b, \{\,\}) \xrightarrow{b,\,2} (2, b, \{1\}) \xrightarrow{a,\,1} (1, b, \{1,2\}) \xrightarrow{a,\,3} (3, a, \{1,2\}) \xrightarrow{a,\,3} (3, a, \{1,2,3\}).$$

The last transition is again a reward transition. Note that a revisit occurs to state 1 in the second transition. The configuration $(1, b, \{1,2\})$ is therefore branching, which implies that the option $(a, 3)$ may be chosen there. At the end of this trial, action $a$ is assigned to all states, including state 1.

Now, consider the run that alternates between the two trials above and that starts with the first of them. State 1 never becomes stable in this run because we assign action $a$ and action $b$ to state 1 in an alternating fashion. So, the run does not converge because there are infinitely many non-terminal occurrences of branching configurations containing state 1.

Although run satisfies all requirements of a valid run, is not fair. For example, although the configuration occurs infinitely often (due to repeating trial ), this configuration is never extended with the valid option that could propagate revisits of state to revisits of state in the same trial; revisits to state could force state to use the other action , which in turn could aid state in becoming stable.

In conclusion, because task is reducible and yet the valid (but unfair) run does not converge, we see that Theorem 4.1 does not hold in absence of fairness.
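The role of fairness can be explored with a small simulation. The sketch below runs trials of cycle-detection learning as it can be read off from the transitions in Example 4.4: a state that is not yet visited follows the policy, a revisited state may try any action, the tried action is stored in the policy, and a reward transition ends the trial. Uniformly random choices are used as a stand-in for fairness, which is an assumption of this sketch rather than the paper's definition. The task, the names `options` and `run_trial`, and the step limit are likewise hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

State = int
Action = str

# Hypothetical three-state task (not necessarily the task of Example 3.2).
ACTIONS: List[Action] = ["a", "b"]
REWARDS: Set[Tuple[State, Action]] = {(3, "a")}
DELTA: Dict[Tuple[State, Action], Set[State]] = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3}, (3, "b"): {3},
}


def options(state, policy, visited) -> List[Tuple[Action, State]]:
    """Options (action, successor) available in configuration (state, policy, visited)."""
    # Branching configuration: the state is revisited, so every action may be tried.
    tried = ACTIONS if state in visited else [policy[state]]
    return [(a, q2) for a in tried for q2 in sorted(DELTA[(state, a)])]


def run_trial(start, policy, rng, max_steps=100_000):
    """Run one trial; return the updated policy and the number of branching
    configurations met at non-terminal positions."""
    policy = dict(policy)
    state, visited, branching = start, set(), 0
    for _ in range(max_steps):
        if state in visited:
            branching += 1
        action, successor = rng.choice(options(state, policy, visited))
        policy[state] = action  # remember the action that was just tried
        visited.add(state)      # mark the state as visited in this trial
        if (state, action) in REWARDS:
            return policy, branching  # a reward transition ends the trial
        state = successor
    raise RuntimeError("trial exceeded max_steps")


rng = random.Random(0)
policy = {1: "a", 2: "a", 3: "a"}
history = []
for _ in range(5000):
    policy, branching = run_trial(1, policy, rng)
    history.append(branching)
```

On this hypothetical task, the only policy that avoids all revisits from start state 1 assigns $b$ to states 1 and 2 and $a$ to state 3; once the random run reaches it, every later trial reports zero branching configurations. Replacing `rng.choice` with a fixed schedule of choices is one way to reproduce an unfair run in the spirit of the example above.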

### 4.2 Detecting the Final Policy

We refer to the last policy formed in a run as the final policy. For an increased understanding of what convergence means, it appears interesting to say something about the form of the final policy. In particular, we would like to understand what kind of paths are generated by the final policy. As an additional benefit, recognizing the form of the final policy allows us to detect the convergence point in a simulation. (Precise convergence detection is possible because our framework does not model reward numerically and thus there are no numerical instability issues near convergence.) The convergence detection enables some of the simulation experiments in Section 5.

We syntactically characterize the final policy in Theorem 4.5. In general, verifying the syntactical property of the final policy requires access to the entire set of task states. In this subsection, we do not require that tasks are reducible.

We first introduce the two key parts of the syntactical characterization, namely, the so-called forward and backward sets of states induced by a policy. As we will see below, the syntactical property says that the forward set should be contained in the backward set.

##### Forward and Backward

Let $T$ be a task that is learnable under fairness. To make the notations below easier to read, we omit the symbol $T$ from them. It will always be clear from the context which task is meant.

Let $\pi$ be a policy, i.e., each state $q \in Q$ is assigned an action $\pi(q)$. First, we define

$$\mathit{ground}(\pi) = \{\, q \in \mathit{goals}(T) \mid (q, \pi(q)) \in \mathit{rewards} \,\};$$

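this is the set of all goal states that are assigned a rewarding action by the policy. Next, we define two sets $\mathit{forward}(\pi)$ and $\mathit{backward}(\pi)$, as follows. For the set $\mathit{forward}(\pi)$, we consider the infinite sequence $F_1$, $F_2$, … of sets, where $F_1$ is the set of start states and for each $i \geq 2$,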

$$F_i = F_{i-1} \cup \bigcup_{q \in F_{i-1} \setminus \mathit{ground}(\pi)} \delta(q, \pi(q)).$$

We define $\mathit{forward}(\pi) = \bigcup_{i} F_i$. Note that . Intuitively, the set $\mathit{forward}(\pi)$ contains all states that are reachable from the start states by following the policy. In the definition of $F_i$ with $i \geq 2$, we remove $\mathit{ground}(\pi)$ from the extending states because we only want to add states to $\mathit{forward}(\pi)$ that can occur at non-terminal positions of trials. (Possibly, some states directly reachable from $\mathit{ground}(\pi)$ are still in $\mathit{forward}(\pi)$ because those states are also reachable from states outside $\mathit{ground}(\pi)$.)

For the set $\mathit{backward}(\pi)$, we consider the infinite sequence $B_1$, $B_2$, … of sets, where $B_1 = \mathit{ground}(\pi)$ and for each $i \geq 2$,

$$B_i = B_{i-1} \cup \{\, q \in Q \mid \delta(q, \pi(q)) \subseteq B_{i-1} \,\}.$$

We define $\mathit{backward}(\pi) = \bigcup_{i} B_i$. Note that . Intuitively, $\mathit{backward}(\pi)$ is the set of all states that are reduced to the goal states in $\mathit{ground}(\pi)$ by the policy.

For completeness, we remark that the infinite sequences , , …, and , , …, each have a fixpoint because is finite.
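Both sets can be computed directly by iterating each sequence until it stops growing. The following Python sketch assumes that the forward sequence starts from the set of start states and that the backward sequence starts from $\mathit{ground}(\pi)$, in line with the intuition given above. The function names, the dictionary encoding of $\delta$ and $\pi$, and the example task are hypothetical.

```python
from typing import Dict, Set, Tuple

State = int
Action = str
Policy = Dict[State, Action]
Delta = Dict[Tuple[State, Action], Set[State]]


def ground(policy: Policy, rewards: Set[Tuple[State, Action]]) -> Set[State]:
    """States whose policy action is rewarding (such states are goal states)."""
    return {q for q, a in policy.items() if (q, a) in rewards}


def forward(policy: Policy, starts: Set[State], rewards, delta: Delta) -> Set[State]:
    """Fixpoint of the forward sequence: states reachable from the start states."""
    grounded = ground(policy, rewards)
    reached = set(starts)  # assumed first set of the sequence
    while True:
        grown = set(reached)
        for q in reached - grounded:  # grounded states are not expanded
            grown |= delta[(q, policy[q])]
        if grown == reached:
            return reached
        reached = grown


def backward(policy: Policy, rewards, delta: Delta) -> Set[State]:
    """Fixpoint of the backward sequence: states reduced to grounded goal states."""
    reduced = ground(policy, rewards)  # assumed first set of the sequence
    while True:
        # The policy is total, so its keys enumerate all states.
        grown = reduced | {q for q in policy if delta[(q, policy[q])] <= reduced}
        if grown == reduced:
            return reduced
        reduced = grown


def forward_within_backward(policy, starts, rewards, delta) -> bool:
    """Containment test: is the forward set contained in the backward set?"""
    return forward(policy, starts, rewards, delta) <= backward(policy, rewards, delta)


# Hypothetical three-state task with start state 1 and goal state 3.
STARTS = {1}
REWARDS = {(3, "a")}
DELTA = {
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {3},
    (3, "a"): {3}, (3, "b"): {3},
}
acyclic_policy = {1: "b", 2: "b", 3: "a"}
cyclic_policy = {1: "a", 2: "a", 3: "a"}
```

In a simulation, a check such as `forward_within_backward` could be evaluated at the end of each trial; it needs the whole state set, since the backward set is built over all states.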

##### Final Policy

We formalize the final policy. Let $T$ be a task that is learnable under fairness, and consider a fair run on $T$, which implies that the run converges. We define the convergence-trial of the run as the smallest trial index for which the following holds: the trial with that index terminates, and after it there are no more branching configurations at non-terminal positions. (With this definition of convergence-trial, a run converges if and only if the run contains a convergence-trial.) This implies that after the convergence-trial the policy cannot change anymore, because to change the action assigned to a state $q$, the state $q$ would have to occur again in a branching configuration at a non-terminal position. We define the final policy of the run to be the policy at the end of the convergence-trial. In principle, different converging runs can have different final policies.

Now, we can recognize the final policy with the following property, that intuitively says that any states reachable by the policy are also safely reduced by the policy to reward:

###### Theorem 4.5.

Let $T$ be a task that is learnable under fairness. (In contrast to Theorem 4.1, we do not require that $T$ is reducible.) Consider a converging fair run on $T$. A policy $\pi$ occurring in the run at the end of a trial is the final policy of the run if and only if

$$\mathit{forward}(\pi) \subseteq \mathit{backward}(\pi).$$

###### Proof.

We show in two separate parts that $\mathit{forward}(\pi) \subseteq \mathit{backward}(\pi)$ is (i) a sufficient and (ii) a necessary condition for $\pi$ to be the final policy of the run.

Part 1: Sufficient condition. Let $\pi$ be a policy occurring in the run at the end of a trial. Assume that $\mathit{forward}(\pi) \subseteq \mathit{backward}(\pi)$. We show that $\pi$ is the final policy of the run.

Concretely, we show that any trial starting with policy $\pi$ will (i) use $\pi$ in all its configurations, including the terminal configuration; and (ii) contain no branching configurations at non-terminal positions. This implies that the first trial ending with $\pi$ is the convergence-trial, so $\pi$ is the final policy.

Consider a trial in the run that begins with policy $\pi$. We explicitly denote this trial as the following finite chain of transitions:

$$c_1 \xrightarrow{a_1,\,q_2} \cdots \xrightarrow{a_{n-1},\,q_n} c_n.$$

For each , we denote . Let , , … be the infinite sequence of sets previously defined for . We show by induction on that

(a) ;

(b) ;

(c) is non-branching.

At the end of the induction, we can also see that : first, we have because configuration is non-branching by property (c) (because configuration is non-branching, we have , which, combined with for each , gives ); second, by property (a).

Base case. Let . For property (a), we have because the trial starts with policy . For property (b), we see that . For property (c), we know that is non-branching because the first configuration in a trial still has an empty working memory of visited states.

Inductive step. Let , with . Assume that the induction properties are satisfied for the configurations , …, . We now show that the properties are also satisfied for .

Property (a)

By applying the induction hypothesis for property (c) to , namely that is non-branching, we know . By subsequently applying the induction hypothesis for property (a) to , namely , we know .

Property (b)

To start, we note that because is non-branching by the induction hypothesis for property (c). By subsequently applying the induction hypothesis for property (a) to , namely , we know . Moreover, since , the transition , where , is a non-reward transition. Hence, and thus .

Lastly, by applying the induction hypothesis for property (b) to , we overall obtain that . Combined with , we see that .

Property (c)

Towards a contradiction, suppose that configuration is branching. This means that state is revisited in . (Recall that, by definition of branching configuration, we have .) Let . Note that , which implies . By applying the induction hypothesis for property (b) to configurations , …, , we know that . We now show that , which would imply ; this is the desired contradiction.

Let , , …be the infinite sequence of sets defined for above. We show by induction on that , which then overall implies .

• Base case: . By definition, . Let . Let be the smallest index for which , i.e., configuration represents the first occurrence of in the trial. By applying the outer induction hypothesis for properties (a) and (c) to , we know that . But since , we know that transition is not a reward transition, implying . Hence, , and overall .

• Inductive step. Let . Assume . Towards a contradiction, suppose . Take some . If then we would immediately have a contradiction with the induction hypothesis. Henceforth, suppose , which, by definition of , means . We will now show that , which would give ; this is the desired contradiction.

Since , there is some smallest such that . Using a similar reasoning as in the base case (), by applying the outer induction hypothesis for properties (a) and (c) to configuration , we can see that . This implies . As a last step, we show that , which gives . We distinguish between the following cases:

• If then , and surely by definition of .

• If then we know because configuration revisits state (see above).

Part 2: Necessary condition. Let $\pi$ be the final policy of the run. We show that $\mathit{forward}(\pi) \subseteq \mathit{backward}(\pi)$. By definition of final policy, $\pi$ is the policy at the end of the convergence-trial. By definition of convergence-trial, after the convergence-trial there are no more branching configurations at non-terminal positions. Note in particular that the policy no longer changes after the convergence-trial.

Towards a contradiction, suppose that . Let . Note that . We show that there is a state that occurs at least once in a branching configuration at a non-terminal position after the convergence-trial ; this would be the desired contradiction.

We provide an outline of the rest of the proof. The reasoning proceeds in two steps. First, we show for each that . This means that if we are inside set , we have the option to stay longer inside if we follow the policy . Now, the second step of the reasoning is to show that we can stay arbitrarily long inside even after the convergence-trial , causing at least one state of to occur in a branching configuration at a non-terminal position after trial .

Step 1. Let . We show . Towards a contradiction, suppose that . Our strategy is to show that , which, by definition of , implies that there is some index such that . Therefore . But that is false because ; this is the desired contradiction.

We are left to show that . First, we show