Consider a simple game, some solitaire variation, for example, or a board game against a fixed opponent. Assume that neither draws nor unending matches are possible in this game and that the only payouts are on winning and on losing. If each new game position (state) depends stochastically only on the preceding one and the move (action) made, but not on the history of positions and moves before that, then the game can be modelled as a Markov decision process. Further, since all matches terminate, the process is episodic.
Learning the game, or solving the decision process, is equivalent to finding the best playing strategy (policy), that is, determining what move to make in each position in order to maximize the probability of winning / the expected payout. This is the type of problem commonly solved by cumulative-reward reinforcement learning methods.
Now, assume that, given the nature of this game, as is often the case, the winning probability is optimized by some cautious policy whose gameplay favours avoiding risks and hence results in relatively long matches. For example, assume that for this game a policy is known to win with certainty in 100 moves. On the other hand, some policies typically trade performance (winning probability) for speed (shorter episode lengths). Assume another known policy, obviously sub-optimal in the sense of expected payout, has a winning probability of just 0.6, but takes only 10 moves on average to terminate.
If one is going to play a single episode, the first strategy is undoubtedly the best available, since following it guarantees a win. However, over a sequence of games, the second policy may outperform the ‘optimal’ one in a very important sense: if each move costs the same (for instance, if all take the same amount of time to complete), whereas the policy that always wins receives on average $0.01/move, the other strategy earns twice as much. Thus, over an hour, or a day, or a lifetime of playing, the ostensibly sub-optimal game strategy will double the earnings of the apparent optimizer. This is a consequence of the fact that the second policy has a higher average reward, receiving a larger payout per action taken. Finding policies that are optimal in this sense is the problem solved by average-reward reinforcement learning.
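The arithmetic behind this comparison can be made concrete. The sketch below assumes, purely for illustration, that a win pays +$1 and a loss pays -$1 (the text only states the resulting per-move figures):

```python
def gain(win_prob, win_moves, lose_moves=None, win_pay=1.0, lose_pay=-1.0):
    """Expected payout per move (the average reward) of a policy.

    Payout values are illustrative assumptions, not taken from the paper.
    """
    lose_moves = win_moves if lose_moves is None else lose_moves
    expected_payout = win_prob * win_pay + (1 - win_prob) * lose_pay
    expected_moves = win_prob * win_moves + (1 - win_prob) * lose_moves
    return expected_payout / expected_moves

cautious = gain(win_prob=1.0, win_moves=100)  # $0.01 per move
fast = gain(win_prob=0.6, win_moves=10)       # $0.02 per move
```

Under these assumed payouts the fast policy indeed earns twice as much per move, matching the figures quoted above.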
In a more general case, if each move has a different associated cost, such as the time it takes to move a token on a board the number of steps dictated by a die, then the problem becomes average-reward semi-Markov, and the goal changes to finding a policy, possibly different from either of the two discussed above, that maximizes the expected payout received per unit of action cost.
Average-reward and semi-Markov tasks arise naturally in the areas of repeated episodic tasks, as in the example just discussed, queuing theory, autonomous robotics, and quality of service in communications, among many others.
This paper presents a new algorithm to solve average-reward and semi-Markov decision processes. The traditional solutions to this kind of problem require a large number of samples, where a sample is usually the observation of the effect of taking an action from a state: the cost of the action, the reward received, and the resulting next state. For each sample, the algorithms basically update the gain (average reward) of the task and the gain-adjusted value of taking that action from that state, that is, how good an idea it is.
Some of the methods in the literature that follow this solution template are R-learning (Schwartz, 1993), Algorithms 3 and 4 of Singh (1994), SMART (Das et al., 1999), and the “New Algorithm” of Gosavi (2004). Table 1 in Section 3.2 introduces a more comprehensive taxonomy of solution methods for average-reward and semi-Markov problems.
The method introduced in this paper, optimal nudging, operates differently. Instead of rushing to update the gain after each sample, the gain is temporarily fixed to some value, resulting in a cumulative-reward task that is solved (by any method); then, based on the solution found, the gain is updated in a way that minimizes the uncertainty range known to contain its optimum.
The main contribution of this paper is the introduction of a novel algorithm to solve semi-Markov (and simpler average-reward) decision processes by reducing them to a minimal sequence of cumulative-reward tasks, which can be solved by any of the fast, robust existing methods for that kind of problem. Hence, we refer to the method used to solve these tasks as a ‘black box’.
The central step of the optimal nudging algorithm is a rule for updating the gain between calls to the ‘black-box’ solver, in such a way that after solving the resulting cumulative-reward task, the worst case for the associated uncertainty around the value of the optimal gain will be the smallest possible.
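This outer loop can be sketched roughly as follows. A plain bisection stands in for the optimal-nudging update rule derived later in the paper, and `solve_cumulative` is a hypothetical black-box solver that returns the optimal value of the reference state when the gain is fixed at `rho`:

```python
def nudging_loop(solve_cumulative, rho_low, rho_high, tol=1e-6):
    """Shrink the uncertainty interval around the optimal gain rho*.

    Bisection is a simplified stand-in for the paper's optimal update;
    `solve_cumulative` is an assumed black-box cumulative-reward solver.
    """
    while rho_high - rho_low > tol:
        rho = 0.5 * (rho_low + rho_high)  # temporarily fix the gain
        v_ref = solve_cumulative(rho)     # solve the resulting episodic task
        if v_ref > 0:
            rho_low = rho    # gain was set too low: rho* lies above rho
        else:
            rho_high = rho   # gain too high (or exactly optimal)
    return 0.5 * (rho_low + rho_high)
```

For a single fixed policy collecting expected reward 2.0 at expected cost 10.0 per episode, the reference-state value is 2.0 - 10.0 * rho, and the loop converges to the gain 0.2.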
The update rule exploits what we have called a “Bertsekas split” of each task, as well as the geometry of a mapping of the policy space into the interior of a small triangle, in which the convergence of the solutions of the cumulative-reward tasks to the optimal solution of the average-reward problem can be easily intuited and visualized.
In addition to this, the derivation of the optimal nudging update rule yields an early termination condition, related to sign changes in the value of a reference state between successive iterations for which the same policy is optimal. This condition is unique to optimal nudging, and no equivalent is possible for the preceding algorithms in the literature.
The complexity of optimal nudging, understood as the number of calls to the “black-box” routine, is shown to be at worst logarithmic in the (inverse) desired final uncertainty and in an upper bound on the optimal gain. The number of samples required in each call is in principle inherited from the “black box”, but it also depends strongly on whether, for example, transfer learning is used and state values are not reset between iterations.
Among other advantages of the proposed algorithm over the other methods discussed, two key ones are requiring the adjustment of fewer parameters and having to perform fewer updates per sample.
Finally, the experimental results presented show that the performance of optimal nudging, even without fine tuning, is similar to or better than that of the best usual algorithms. The experiments also illuminate some particular features of the algorithm, especially the great advantage of having the early termination condition.
The rest of the paper is structured as follows. Section 2 formalizes the problem, defining the different types of Markov decision processes (cumulative- and average-reward and semi-Markov) under a unified notation, introducing the important unichain condition and describing why it is important to assume that it holds.
Section 3 presents a summary of solution methods for the three types of processes, emphasizing the distinctions between dynamic programming, model-based, and model-free reinforcement learning algorithms. This section also introduces a new taxonomy of the average-reward algorithms from the literature that allows us to propose a generic algorithm encompassing all of them. Special attention is given in this section to the family of stochastic shortest path methods, from which the concept of the Bertsekas split is extracted. Finally, a motivating example task is introduced to compare the performance of some of the traditional algorithms with that of optimal nudging.
In Section 4, the core derivation of the optimal nudging algorithm is presented, starting from the idea of nudging and the definition of the space and enclosing triangles. The early termination condition by zero crossing is presented as a special case of the reduction of enclosing triangles, and the exploration of optimal reduction leads to the main theorem and the final proposition of the algorithm.
Section 5 describes the complexity of the algorithm by showing that it outperforms a simpler version of nudging for which the computation of complexity is straightforward.
Finally, Section 6 presents results for a number of experimental set-ups and in Section 7 some conclusions and lines for future work are discussed.
2 Problem Definition
In this section, Markov decision processes are described and three different reward maximization problems are introduced: expected cumulative reinforcement, average-reward, and average-reward semi-Markov models. Average-reward problems are a subset of semi-Markov average-reward problems. This paper introduces a method to solve both kinds of average-reward problems as a minimal sequence of cumulative-reward, episodic processes. Two relevant common assumptions in average reward models, that the unichain condition holds (Ross, 1970) and that a recurrent state exists (Bertsekas, 1998; Abounadi et al., 2002), are described and discussed at the end of the section.
In all cases, an agent in an environment observes its current state and can take actions that, following some static distribution, lead it to a new state and result in a real-valued reinforcement/reward. It is assumed that the Markov property holds, so the next state depends only on the current state and action taken, but not on the history of previous states and actions.
2.1 Markov Decision Processes
An infinite-horizon Markov decision process (MDP; Sutton and Barto, 1998; Puterman, 1994) is defined minimally as a four-tuple $\langle\mathcal{S}, \mathcal{A}, P, R\rangle$. $\mathcal{S}$ is the set of states in the environment. $\mathcal{A}$ is the set of actions, with $\mathcal{A}_s$ equal to the subset of actions available to take from state $s$ and $\mathcal{A} = \bigcup_{s}\mathcal{A}_s$. We assume that both $\mathcal{S}$ and $\mathcal{A}$ are finite. The stationary function $P$ defines the transition probabilities of the system. After taking action $a$ from state $s$, the resulting state is $s'$ with probability $P(s' \mid s, a)$. Likewise, $R(s, a, s')$ denotes the real-valued reward observed after taking action $a$ and transitioning from state $s$ to $s'$. For notational simplicity, we define $r(s, a) = \mathbb{E}_{s'}\left[R(s, a, s')\right]$. At decision epoch $t$, the agent is in state $s_t$, takes action $a_t$, transitions to state $s_{t+1}$ and receives reinforcement $r_{t+1}$, which has expectation $r(s_t, a_t)$.
If the task is episodic, there must be a terminating state, defined as transitioning to itself with probability 1 and reward 0. Without loss of generality, multiple terminating states can be treated as a single one.
An element $\pi$ of the policy space $\Pi$ is a rule or strategy that dictates for each state which action to take, $\pi(s) \in \mathcal{A}_s$. We are only concerned with deterministic policies, in which each state has a single associated action, taken with probability one. This is not too restrictive, since Puterman (1994) has shown that, if an optimal policy exists, an optimal deterministic policy exists as well. Moreover, policies are assumed herein to be stationary. The value of a policy from a given state, $v^{\pi}(s)$, is the expected cumulative reward observed starting from $s$ and following $\pi$,
$$v^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s, \pi\right], \tag{1}$$
where $\gamma \in (0, 1]$ is a discount factor, with $\gamma = 1$ corresponding to no discount.
The goal is to find a policy that maximizes the expected reward. Thus, an optimal policy $\pi^{*}$ has maximum value for each state; that is,
$$v^{*}(s) = v^{\pi^{*}}(s) = \max_{\pi \in \Pi} v^{\pi}(s) \quad \text{for all } s \in \mathcal{S}.$$
The discount factor ensures convergence of the infinite sum in the policy values of Equation (1), so it is used to keep values bounded when rewards are bounded, even in problems and for policies for which episodes have infinite duration. Ostensibly, introducing it makes rewards received sooner more desirable than those received later, which would make it useful when the goal is to optimize a measure of immediate (or average) reward. However, for the purposes of this paper, the relevant policies for the resulting MDPs discussed will be assumed to terminate eventually with probability one from all states, so the infinite sum will converge even without discount. Furthermore, the perceived advantages of discounting are less sturdy than initially apparent (Mahadevan, 1994), and discounting is not guaranteed to lead to gain optimality (Uribe et al., 2011). Thus, no discount will be used in this paper ($\gamma = 1$).
2.2 Average Reward MDPs
The aim of the average reward model in infinite-horizon MDPs is to maximize the reward received per step (Puterman, 1994; Mahadevan, 1996). Without discount, all non-zero-valued policies would have signed infinite value, so the goal must change to obtaining the largest positive or the smallest negative rewards as frequently as possible. In this case, the gain of a policy from a state is defined as the average reward received per action taken following that policy from the state,
$$\rho^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\left[\sum_{t=0}^{N-1} r_{t+1} \,\middle|\, s_0 = s, \pi\right].$$
A gain-optimal policy, $\pi^{*}$, has maximum average reward, $\rho^{*}$, for all states,
$$\rho^{*}(s) = \max_{\pi \in \Pi} \rho^{\pi}(s) \quad \text{for all } s \in \mathcal{S}.$$
A finer typology of optimal policies in average-reward problems discriminates bias-optimal policies, which, besides being gain-optimal, also maximize the transient reward received before the observed average approaches $\rho^{*}$. For a discussion of the differences, see the book by Puterman (1994). This paper will focus on the problem of finding gain-optimal policies.
2.3 Discrete Time Semi-MDPs
In the average-reward model all state transitions weigh equally; equivalently, all actions from all states are considered to have the same, unity, duration or cost. In semi-Markov decision processes (SMDPs; Ross, 1970) the goal remains maximizing the average reward received per action, but actions are no longer required to have the same weight.
2.3.1 Transition Times
The usual description of SMDPs assumes that, after taking an action, the time to transition to a new state is not constant (Feinberg, 1994; Das et al., 1999; Baykal-Gürsoy and Gürsoy, 2007; Ghavamzadeh and Mahadevan, 2007). Formally, at decision epoch $t$ the agent is in state $s_t$, and only after an average of $\tau(s_t, a_t)$ seconds of having taken action $a_t$ does it evolve to state $s_{t+1}$ and observe reward $r_{t+1}$.
The transition time function, then, is $\tau : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{+}$ (where $\mathbb{R}^{+}$ is the set of positive real numbers). Also, since reward can possibly lump between decision epochs, its expectation, $r(s, a)$, is marginalized over expected transition times as well as new states. Consequently, the gain of a policy from a state becomes
$$\rho^{\pi}(s) = \lim_{N \to \infty} \frac{\mathbb{E}\left[\sum_{t=0}^{N-1} r_{t+1} \,\middle|\, s_0 = s, \pi\right]}{\mathbb{E}\left[\sum_{t=0}^{N-1} \tau(s_t, a_t) \,\middle|\, s_0 = s, \pi\right]}.$$
2.3.2 Action Costs
We propose an alternative interpretation of the SMDP framework, in which all actions yield constant-length transitions while consuming varying amounts of some resource (for example time, but also energy, money, or any combination thereof). This results in the agent observing a real-valued action cost $c_{t+1}$, not necessarily related to the reward $r_{t+1}$, after taking action $a_t$ from state $s_t$. As above, the assumption is that cost depends on the initial and final states and the action taken, and has an expectation of the form $c(s, a)$. In general, all costs are supposed to be positive, but for the purposes of this paper this is relaxed to requiring that all policies have positive expected cost from all states. Likewise, without loss of generality, it will be assumed that all action costs either are zero or have expected magnitude greater than or equal to one,
$$c(s, a) = 0 \quad \text{or} \quad \left|c(s, a)\right| \ge 1.$$
In this model, a policy has expected cost per step
$$c^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\left[\sum_{t=0}^{N-1} c_{t+1} \,\middle|\, s_0 = s, \pi\right].$$
Observe that both definitions are analytically equivalent: $\tau$ and $c$ play the same role in the gain. Although their definitions and interpretations vary, expected time to transition versus expected action cost, both give rise to identical problems with gain $\rho$.
Naturally, if all action costs or transition times are equal, the semi-Markov model reduces to the average-reward model, up to scale, and the two problems are identical if the costs/times are unity-valued. For notational simplicity, from now on we will refer to the gain in both problems simply as $\rho$.
2.3.3 Optimal policies
A policy $\pi^{*}$ with gain $\rho^{*}$ is gain-optimal if
$$\rho^{*}(s) \ge \rho^{\pi}(s) \quad \text{for all } \pi \in \Pi \text{ and all } s \in \mathcal{S},$$
similarly to the way it was defined for the average-reward problem.
Observe that the gain-optimal policy does not necessarily maximize expected reward, nor does it minimize expected cost; it only optimizes their ratio.
The following two sections discuss two technical assumptions that are commonly used in the average-reward and semi-Markov decision process literature to simplify analysis, guaranteeing that optimal stationary policies exist.
2.4 The Unichain Assumption
The transition probabilities of a fixed deterministic policy define an embedded Markov chain over the state space. In that embedded chain, a state is called transient if, after a visit, there is a non-zero probability of never returning to it. A state is recurrent if it is not transient; a recurrent state will be visited again in finite time with probability one. A recurrent class is a set of recurrent states such that no state outside the set can be reached from states inside it (Kemeny and Snell, 1960).
An MDP is called multichain if at least one policy has more than one recurrent class, and unichain if every policy has only one recurrent class. In a unichain problem, for all $\pi$, the state space can be partitioned as
$$\mathcal{S} = R^{\pi} \cup T^{\pi}, \tag{2}$$
where $R^{\pi}$ is the single recurrent class and $T^{\pi}$ is a (possibly empty) set of transient states. Observe that these partitions can be unique to each policy; the assumption is to have a single recurrent class per policy, not a single one for the whole MDP.
If the MDP is multichain, a single optimality expression may not suffice to describe the gain of the optimal policy, stationary optimal policies may not exist, and theory and algorithms are more complex. On the other hand, if it is unichain, clearly for any given $\pi$ all states will have the same gain, $\rho^{\pi}$, which simplifies the analysis and is a sufficient condition for the existence of stationary, gain-optimal policies (Puterman, 1994).
Consequently, most of the literature on average reward MDPs and SMDPs relies on the assumption that the underlying model is unichain (see for example Mahadevan, 1996; Ghavamzadeh and Mahadevan, 2007, and references thereon). Nevertheless, the problem of deciding whether a given MDP is unichain is not trivial. In fact, Kallenberg (2002) posed the problem of whether a polynomial algorithm exists to determine if an MDP is unichain, which was answered negatively by Tsitsiklis (2007), who proved that it is NP-hard.
2.5 Recurrent States
The term “recurrent state” is also used, somewhat confusingly, to describe a state of the decision process that belongs to a recurrent class of every policy. The expression will only be used in this sense from now on in this paper. Multi- and unichain processes may or may not have recurrent states. However, Feinberg and Yang (2008) proved that a recurrent state can be found, or shown not to exist, in time polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$, and they give methods for doing so. They also proved that, if a recurrent state exists, whether the unichain condition holds can be decided in polynomial time, and proposed an algorithm for doing so.
Instead of actually using those methods, which would require a full knowledge of the transition probabilities that we do not assume, we emphasize the central role of the recurrent states, when they exist or can be induced, in simplifying analysis.
Provisionally, for the derivation below we will require all problems to be unichain and to have a recurrent state. However, these two requirements will be further qualified in the experimental results.
3 Overview of Solution Methods and Related Work
This section summarizes the most relevant solution methods for cumulative-reward MDPs and average-reward SMDPs, with special emphasis on stochastic shortest path algorithms. Our proposed solution to average-reward SMDPs will use what we call a Bertsekas split from these algorithms, to convert the problem into a minimal sequence of MDPs, each of which can be solved by any of the existing cumulative-reward methods. At the end of the section, a simple experimental task from Sutton and Barto (1998) is presented to examine the performance of the discussed methods and to motivate our subsequent derivation.
3.1 Cumulative Rewards
Cumulative-reward MDPs have been widely studied. The survey of Kaelbling et al. (1996) and the books by Bertsekas and Tsitsiklis (1996), Sutton and Barto (1998), and Szepesvári (2010) include comprehensive reviews of approaches and algorithms to solve MDPs (also called reinforcement learning problems). We will present a brief summary and propose a taxonomy of methods that suit our approach of accessing a “black box” reinforcement learning solver.
In general, instead of trying to find policies that maximize the state value from Equation (1) directly, solution methods seek policies that optimize the state-action pair (or simply ‘action’) value function,
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s, a_0 = a, \pi\right], \tag{3}$$
which is defined as the expected cumulative reinforcement after taking action $a$ in state $s$ and following policy $\pi$ thereafter.
The action value of an optimal policy corresponds to the solution of the following, equivalent, versions of the Bellman optimality equation (Sutton and Barto, 1998),
$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}_{s'}} Q^{*}(s', a')\right] \tag{4}$$
$$Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}_{s'}} Q^{*}(s', a')\right]. \tag{5}$$
Dynamic programming methods assume complete knowledge of the transitions and rewards and seek to solve Equation (4) directly. An iteration of a generalized-policy-iteration algorithm finds (policy evaluation) or approximates (value iteration) the value of the current policy, and subsequently sets as current a policy that is greedy with respect to the values found. Puterman (1994) provides a very comprehensive summary of dynamic programming methods, including the use of linear programming to solve this kind of problem.
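As a minimal illustration of undiscounted value iteration on an episodic task, consider the toy MDP below (our own example, not one from the paper), with two non-terminal states, one terminal state, and two actions:

```python
import numpy as np

# Toy episodic MDP (illustrative assumption): states {0, 1}, terminal state 2.
P = np.array([  # P[a, s, s']: transition probabilities
    [[0.0, 0.8, 0.2], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 0
    [[0.0, 0.0, 1.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([  # R[a, s, s']: rewards (zero from the terminal state)
    [[0.0, 1.0, 2.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 5.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])

def value_iteration(P, R, iters=1000):
    """Repeatedly apply Q(s,a) <- E[R + max_a' Q(s',a')] with gamma = 1."""
    n_a, n_s, _ = P.shape
    q = np.zeros((n_s, n_a))
    for _ in range(iters):
        v = q.max(axis=1)
        v[-1] = 0.0  # the terminal state has value zero
        for a in range(n_a):
            q[:, a] = (P[a] * (R[a] + v)).sum(axis=1)
        q[-1, :] = 0.0
    return q

q = value_iteration(P, R)
policy = q[:-1].argmax(axis=1)  # greedy policy on non-terminal states
```

Since all policies of this toy task terminate with probability one, the undiscounted iteration converges; here the greedy policy takes action 1 from state 0, collecting the direct reward of 5.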
If the transition probabilities are unknown, in order to maximize action values it is necessary to sample actions, state transitions, and rewards in the environment. Model-based methods use these observations to approximate the transition and reward functions, and then use that approximation to find $Q^{*}$ and $\pi^{*}$ via dynamic programming. Methods in this family usually rely on complexity bounds guaranteeing performance after a number of samples (or sample complexity, Kakade, 2003) bounded by a polynomial (that is, efficient) on the sizes of the state and action sets, as well as other parameters. These other parameters often include the expression $1/(1-\gamma)$, where $\gamma$ is the discount factor, which is obviously problematic when, as we assume, $\gamma = 1$. However, polynomial bounds also exist for undiscounted cases.
The earliest and most studied model-based methods are PAC-MDP algorithms (efficiently probably approximately correct on Markov decision processes), which minimize, with high probability, the number of future steps on which the agent will not receive near-optimal reinforcements. E3 (Kearns and Singh, 1998, 2002), sparse sampling (Kearns et al., 2002), R-MAX (Brafman and Tennenholtz, 2003), MBIE (Strehl and Littman, 2005), and V-MAX (Rao and Whiteson, 2012) are notable examples of this family of algorithms. Kakade’s (2003) and Strehl’s (2007) dissertations, and the paper by Strehl et al. (2009), provide extensive theoretical discussions on a broad range of PAC-MDP algorithms.
Another learning framework for which model-based methods exist is KWIK (knows what it knows; Li et al., 2008; Walsh et al., 2010). In it, at any decision epoch, the agent must return an approximation of the transition probability corresponding to the observed state, action, and next state. This approximation must be arbitrarily precise with high probability. Alternatively, the agent can acknowledge its ignorance, produce a “don’t know” output, and learn from the observed, unknown transition. The goal in this framework is to bound the number of “don’t know” outputs, and for this bound to be polynomial on appropriate parameters, including the sizes of the state and action sets.
Model-free methods, on the other hand, use transition and reward observations to approximate action values, replacing expectations by samples in Equation (5). The two main algorithms, from which most variations in the literature derive, are SARSA and Q-learning (Sutton and Barto, 1998). Both belong to the class of temporal difference methods. SARSA is an on-policy algorithm that improves its approximation of the value of the current policy, using the update rule
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right], \tag{6}$$
where $a_{t+1}$ is an action selected from $\mathcal{A}_{s_{t+1}}$ for state $s_{t+1}$. Q-learning is an off-policy algorithm that, while following samples obtained acting from the current values of $Q$, approximates the value of the optimal policy, updating with the rule
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].$$
In both cases, $\alpha \in (0, 1]$ is a learning rate.
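In code, one Q-learning update and an ε-greedy behaviour policy look like the following sketch (a generic tabular implementation, not tied to any specific task):

```python
import random

def q_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0, terminal=False):
    """One Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behaviour policy: a uniformly random action with probability eps, else greedy."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])
```

Replacing the `max` in the target with the value of the action actually selected next turns the same step into the on-policy SARSA update.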
Q-learning has been widely studied and used for practical applications since its proposal by Watkins (1989). In general, for an appropriately decaying learning rate $\alpha_t$ such that
$$\sum_{t} \alpha_t = \infty \tag{7}$$
and
$$\sum_{t} \alpha_t^{2} < \infty, \tag{8}$$
and under the assumption that all states are visited and all actions taken infinitely often, it is proven to converge asymptotically to the optimal value with probability one (Watkins and Dayan, 1992). Furthermore, in discounted settings, PAC convergence bounds exist for the case in which every state–action pair keeps an independent, polynomially decaying learning rate (Szepesvári, 1998), and for updates in the case when a parallel sampler (Kearns and Singh, 1999), which on every call returns transition/reward observations for every state–action pair, is available (Even-Dar and Mansour, 2004; Azar et al., 2011).
An additional PAC-MDP, model-free version of Q-learning of interest is delayed Q-learning (Strehl et al., 2006). Although its main derivation is for discounted settings, as is usual for this kind of algorithm, building on the work of Kakade (2003) a variation is briefly discussed in which there is no discount but rather a hard horizon assumption, under which only the next $H$ action-choices of the agent contribute to the value function. In this case, the bound states that, with probability $1 - \delta$, the agent will follow an $\epsilon$-optimal policy, that is, a policy derived from an approximation of $Q^{*}$ that does not differ from the optimal values by more than $\epsilon$, on all but a polynomial number of steps, up to a factor logarithmic in the appropriate parameters. Usually, this logarithmic term is dropped and the bound is instead expressed in $\tilde{O}$ notation.
Our method of solving SMDPs will assume access to a learning method for finite-horizon, undiscounted MDPs. The requirements of the solution it should provide are discussed in the analysis of the algorithm.
3.2 Average Rewards
As mentioned above, average-reward MDP problems are the subset of average-reward SMDPs for which all transitions take one time unit, or all actions have equal, unity cost. Thus, it would be sufficient to consider the larger set. However, average-reward problems have been the subject of more research, and the resulting algorithms are easily extended to the semi-Markov framework by multiplying the gain by the action cost in the relevant optimality equations, so both will be presented jointly here.
In this section we introduce a novel taxonomy of the differing parameters in the update rules of the main published solution methods for average-reward (including semi-Markov) tasks. This allows us to propose and discuss a generic algorithm that covers all existing solutions and yields a compact summary of them, presented in Table 1 below.
In this setting, the value function of a policy is replaced by a relative value (or bias) function, $h^{\pi}(s)$, which measures “how good” the state $s$ is, under $\pi$, with respect to the average $\rho^{\pi}$. The corresponding Bellman equation, whose solutions include the gain-optimal policies, is
$$h^{*}(s) = \max_{a \in \mathcal{A}_s}\left[r(s, a) - \rho^{*} + \mathbb{E}_{s'}\left[h^{*}(s')\right]\right], \tag{9}$$
where the expectation on the right-hand side is over $s'$, following the optimal policy for any $s$.
Puterman (1994) and Mahadevan (1996) present comprehensive discussions of dynamic programming methods to solve average-reward problems. The solution principle is similar to the one used in cumulative reward tasks: value evaluation followed by policy iteration. However, an approximation of the average rewards of the policy being evaluated must be either computed or approximated from successive iterates. The parametric variation of average-reward value iteration due to Bertsekas (1998) is central to our method and will be discussed in depth below. For SMDPs, Das et al. (1999) discuss specific synchronous and asynchronous versions of the relative value iteration algorithm due to White (1963).
Among the model-based methods listed above for cumulative reward problems, E3 and R-MAX originally have definitions for average reward models, including in their PAC-MDP bounds polynomial terms on a parameter called the optimal $\epsilon$-mixing time, defined as the smallest time after which the observed average reward of the optimal policy actually becomes $\epsilon$-close to $\rho^{*}$.
In a related framework, also with probability $1 - \delta$ as in PAC-MDP, the UCRL2 algorithm of Jaksch et al. (2010) attempts to minimize the total regret (the difference with the accumulated rewards of a gain-optimal policy) over a $T$-step horizon. The regret of this algorithm is bounded by $\tilde{O}\!\left(D\,|\mathcal{S}|\sqrt{|\mathcal{A}|\,T}\right)$, where the diameter parameter $D$ of the MDP is defined as the maximum time it takes to move from any state to any other state using an appropriate policy. Observe that for $D$ to be finite, any state must be reachable from any other, so the problem must be communicating, which is a more rigid assumption than the unichain condition (Puterman, 1994). Similarly, the REGAL algorithm of Bartlett and Tewari (2009) has a regret bound $\tilde{O}\!\left(H\,|\mathcal{S}|\sqrt{|\mathcal{A}|\,T}\right)$, where $H$ is a bound on the span of the optimal bias vector. In this case, the underlying process is required to be weakly communicating; that is, the subsets $R^{\pi}$ and $T^{\pi}$ of recurrent and transient states in Equation (2) must be the same for all $\pi$. This is also a more rigid assumption than the unichain condition.
Regarding PAC-MDP methods, no “model free” algorithms similar to delayed-Q are known at present for average reward problems. Mahadevan (1996) discusses, without complexity analysis, a model-based approach due to Jalali and Ferguson (1989), and further refined by Tadepalli and Ok (1998) into the H-learning algorithm, in which relative value iteration is applied to transition probability matrices and gain is approximated from observed samples.
Model-free methods for average reward problems, with access to observations of state transitions and associated rewards, are based on the (gain-adjusted) Q-value update
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} - \rho\, c_{t+1} + \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right], \tag{10}$$
where $\alpha$ is a learning rate and $\rho$ is the current estimate of the average reward.
A close analysis of the literature reveals that H-learning and related model-based algorithms, as well as methods based on the update in Equation (10), can be described using the generic Algorithm 1. The “Act” step corresponds to the observation of (usually) one transition tuple following the current version of $Q$. A degree of exploration is commonly introduced at this stage; instead of taking a best-known action from $Q$, a suboptimal action is chosen. For instance, in the traditional $\epsilon$-greedy action selection method, with probability $\epsilon$ a random action is chosen uniformly. Sutton and Barto (1998) discuss a number of exploration strategies. The exploration/exploitation trade-off, that is, when and how to explore and learn and when to use the knowledge for reward maximization, is a very active research field in reinforcement learning. All of the PAC-MDP and related algorithms listed above are based on an “optimism in the face of uncertainty” scheme (Lai and Robbins, 1985), initializing $Q$ or $v$ to an upper bound on value for all states, to address more or less explicitly the problem of optimal exploration.
In H-learning, the learning stage of Algorithm 1 includes updating the approximate transition probabilities for the state and action just observed and then estimating the state value using a version of Equation (9) with the updated probabilities. In model-free methods, summarized in Table 1, the learning stage is usually the 1-step update of Equation (10).
The $\rho$ update is also commonly done after one step, but there are a number of different update rules. Algorithms in the literature vary in two dimensions. The first is when to update: some compute an updated approximation of $\rho$ after every action, while others do it only if a best-known (greedy) action was taken. The second dimension is the way the updates are done. A natural approach is to compute $\rho$ as the ratio of the sample rewards and the sample costs,
$$\rho \leftarrow \frac{\sum_{t \in T} r_{t+1}}{\sum_{t \in T} c_{t+1}}, \tag{11}$$
where $T$ may contain all decision epochs or only those on which greedy actions were taken, depending on when the algorithm updates. We refer to this as the ratio update.
Alternatively, the corrected update is of the form
whereas in the term-wise corrected update, separately
In the last two cases, a learning rate governs the gain update. In addition to when and how to perform the updates, algorithms in the literature also vary in the model used for the learning rates of the value and gain updates. The simplest models take both parameters to be constant over all epochs. As is the case for Q-learning, convergence is proved for sequences of learning rates for which the conditions in Equations (7) and (8) hold. We call these decaying learning rates. A simple choice of decaying learning rate can easily be shown to give rise to the ratio updates of Equation (11). Some methods require keeping an individual (decaying) learning rate for each state-action pair. A type of update for which the conditions in Equations (7) and (8) hold, together with the associated convergence guarantees, and which may have practical advantages, is the “search-then-converge” procedure of Darken et al. (1992), called DCM after its authors. A DCM update would be, for example,
where the two parameters are constants.
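As a concrete illustration, the three update rules and a search-then-converge rate can be sketched as follows. The symbol names (rho, beta, a, tau) and the exact functional forms are assumptions consistent with the surrounding discussion, not the paper's own (elided) equations:

```python
def ratio_update(total_reward, total_cost, r, c):
    """Ratio update: the gain estimate is the ratio of the rewards
    sampled so far to the costs sampled so far."""
    total_reward += r
    total_cost += c
    return total_reward, total_cost, total_reward / total_cost


def corrected_update(rho, r, c, beta):
    """Corrected update: move rho toward making the nudged sample
    reward r - rho*c vanish."""
    return rho + beta * (r - rho * c)


def termwise_corrected_update(r_bar, c_bar, r, c, beta):
    """Term-wise corrected update: reward and cost running estimates
    are corrected separately, and rho is their ratio."""
    r_bar += beta * (r - r_bar)
    c_bar += beta * (c - c_bar)
    return r_bar, c_bar, r_bar / c_bar


def dcm_rate(k, a=0.1, tau=100.0):
    """'Search-then-converge' (DCM) rate: roughly constant while
    k << tau, then decaying as O(1/k); a and tau are illustrative."""
    return a * tau / (tau + k)
```

Note how the DCM rate satisfies the usual stochastic-approximation conditions while staying near its initial value during an early "search" phase.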
[Table 1 (flattened in extraction): for each algorithm, that of Das et al. (1999), Jalali and Ferguson (1989), Tadepalli and Ok (1998), and Ghavamzadeh and Mahadevan (2001, 2007), the table lists its update rule, its learning rates, and whether the gain is updated only after greedy actions.]
Table 1 describes the updates and learning rates of the model-free average reward algorithms found in the literature, together, when applicable, with those for two model-based (AAC and H-learning) and two hierarchical algorithms (MAXQ and HAR).
3.3 Stochastic Shortest Path H and Q-Learning
We now focus on an additional model-free average-reward algorithm, due to Abounadi et al. (2002) and suggested by a dynamic programming method of Bertsekas (1998), which connects an average-reward problem with a parametrized family of (cumulative-reward) stochastic shortest path problems.
The fundamental observation is that, if the problem is unichain, the average reward of a stationary policy must equal the ratio of the expected total reward to the expected total cost between two visits to a reference (recurrent) state. Thus, the idea is to “separate” those two visits into the start and the end of an episodic task. This is achieved by splitting a recurrent state into an initial and a terminal state. Assuming the task is unichain and a recurrent state has been chosen as reference, we refer to a Bertsekas split as the resulting problem with
State space extended with an artificial terminal state that, as defined above, once reached transitions to itself with probability one, with reward zero and, for numerical stability, cost zero.
Action space .
In the restricted setting of average-reward MDPs, Bertsekas (1998) proved the convergence of the dynamic programming algorithm with coupled iterations
where the value of the terminal state is fixed at zero for all epochs (that is, it is terminal) and a decaying learning rate is used. The derivation of this algorithm includes a proof that, when the gain parameter equals the optimal gain, the corrected value of the initial state is zero. This is to be expected: the gain is subtracted from all rewards, and if it equals the expected average reward, the expected corrected value of the initial state vanishes. Observe that, when this is the case, the gain stops changing between iterations. We provide below an alternative derivation of this fact, from the perspective of fractional programming.
Abounadi et al. (2002) extended the ideas behind this algorithm to model-free methods with the (synchronous) stochastic shortest path Q-learning algorithm, SSPQ, with coupled action-value and gain updates after each action,
where the gain update includes a projection onto an interval known to contain the optimal gain.
Both SSP methods just described belong to the generic family described by Algorithm 1. Moreover, the action-value update of SSPQ is identical to the MDP version of the average-corrected Q-update, Equation (10).
The convergence proof of SSPQ makes the relationship between the two learning rates explicit, requiring that
making the gain update considerably slower than the value update. This is necessary so that the Q-update can approximate the value of the current policy well enough for there to be any improvement. If, in the short term, the Q-update sees the gain as (nearly) constant, then the update resembles that of a cumulative-reward problem with gain-corrected rewards. The method presented below uses a Bertsekas split of the SMDP and examines the extreme case in which the gain updates occur only when the value of the best policy for the current gain can be regarded as known.
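The two-timescale structure just described can be sketched on a toy problem. Everything concrete below (a two-state MDP with unit action costs, the rate exponents, the projection interval [0, 2], and state 0 as the reference state) is an illustrative assumption, not the algorithm's specification:

```python
import random


def clip(x, lo, hi):
    return min(max(x, lo), hi)


def sspq(steps=3000, seed=0):
    """Two-timescale SSPQ-style sketch: fast Q-updates with nudged
    rewards r - rho (unit costs), and a slower, projected rho-update
    applied at the reference state (state 0)."""
    rng = random.Random(seed)
    states, actions = 2, 2
    Q = [[0.0] * actions for _ in range(states)]
    rho = 0.0
    s = 0
    for k in range(1, steps + 1):
        # epsilon-greedy action selection
        if rng.random() > 0.1:
            a = max(range(actions), key=lambda i: Q[s][i])
        else:
            a = rng.randrange(actions)
        # toy dynamics: action 1 pays more on average, transitions uniform
        r = (1.0 if a == 1 else 0.5) + rng.uniform(-0.1, 0.1)
        s2 = rng.randrange(states)
        alpha = 1.0 / k ** 0.6   # fast timescale for the Q-update
        beta = 1.0 / k           # slower timescale for the gain update
        Q[s][a] += alpha * (r - rho + max(Q[s2]) - Q[s][a])
        if s2 == 0:              # gain update only at the reference state,
            rho = clip(rho + beta * max(Q[0]), 0.0, 2.0)  # with projection
        s = s2
    return rho
```

Because beta decays faster than alpha, the Q-values track the value of the current policy for an almost-frozen gain, which is precisely the regime the method below pushes to its extreme.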
3.4 A Motivating Example
We will use a simple average-reward example from the discussion of Schwartz’s R-learning by Sutton and Barto (1998, see Section 6.7) to study the behaviour of some of the algorithms just described and compare them with the method proposed in this paper.
In the access-control queuing task, customers of different priorities arrive with fixed probabilities at the head of a single queue that manages access to a set of servers. At each decision epoch, the customer at the head of the queue is either assigned to a free server (if any are available), with a pay-off equal to the customer’s priority, or rejected, with zero pay-off. Between decision epochs, busy servers free independently with a fixed probability. Naturally, the goal is to maximize the expected average reward. The states correspond to the combinations of the priority of the customer at the head of the queue and the number of free servers, and the actions are simply “accept” and “reject”. For simplicity, there is a single state corresponding to no free servers, for any priority, with only the “reject” action available and reward zero.
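A minimal simulator of this task can be sketched as follows. The numeric choices (priorities 1, 2, 4, and 8 arriving uniformly at random, ten servers, servers freeing with probability 0.06) follow Sutton and Barto's example and are assumptions as far as this paper's exact set-up is concerned:

```python
import random


class AccessControlQueue:
    """Toy access-control queuing simulator (assumed parameters)."""

    PRIORITIES = (1, 2, 4, 8)

    def __init__(self, n_servers=10, p_free=0.06, seed=0):
        self.n_servers = n_servers
        self.p_free = p_free
        self.rng = random.Random(seed)
        self.free = n_servers
        self.priority = self.rng.choice(self.PRIORITIES)

    def step(self, accept):
        """Apply 'accept' (True/False); return (reward, next state)."""
        reward = 0
        if accept and self.free > 0:
            reward = self.priority      # pay-off equals customer priority
            self.free -= 1
        # between decision epochs, each busy server frees independently
        busy = self.n_servers - self.free
        self.free += sum(self.rng.random() < self.p_free
                         for _ in range(busy))
        self.priority = self.rng.choice(self.PRIORITIES)
        return reward, (self.priority, self.free)
```

The state returned is exactly the (priority, free servers) pair described above.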
To ensure that our assumptions hold for this task, we use the following straightforward observation: In the access control queuing task, all optimal policies must accept customers of priority 8 whenever servers are available.
Thus, if we make “accept” the only action available for states with priority 8 and any number of free servers, the resulting task will have the same optimal policies as the original problem, and the state with all servers occupied will become recurrent. Indeed, from any state with free servers, and under any policy, there is a nonzero probability that all of the next customers will have priority 8 and that no servers will free during the current and following decision epochs. Since the only available action for customers with priority 8 is to accept them, all servers would then fill; so for any state and policy there is a nonzero probability of reaching the state with no free servers, making it recurrent. Moreover, since this recurrent state can be reached from any state, there must be a single recurrent class per policy, containing the all-occupied state and all other states reachable from it under the policy, so the unichain condition also holds for the resulting task.
The unique optimal policy for this task, its average-adjusted value, and its average reward can easily be found using dynamic programming, and will be used here to measure algorithm performance. Sutton and Barto (1998) show the result of applying R-learning to this task using epsilon-greedy action selection (which we will keep for all experiments) and the learning-rate parameters given in the book (the on-line errata of the book corrects the gain learning rate to 0.01, instead of the stated 0.1). We call this set-up “R-learning 1.” However, a single long run results in states with many free servers being grossly undersampled, so in order to study the convergence of action values over all states we reset the process to a uniformly selected random state every 10 steps in all experiments.
The other algorithms tested are: “R-learning 2,” with a smaller rate for the gain updates; “SMART” (Das et al., 1999), with a DCM update of the gain; the “New Algorithm” of Gosavi (2004), with individual decaying learning rates equal to the inverse of the number of past occurrences of each state-action pair; and “SSPQ”-learning with decaying rates. In all cases, the action values and the gain estimate are initialized to 0.
Figure 1 shows the results of one run (to avoid flattening the curves due to averaging) of the different methods for five million steps. All algorithms reach a neighbourhood of the optimal gain relatively quickly, but this does not guarantee an equally fast approximation of the optimal policy or its value function. The gain learning rate used in “R-learning 1” by Sutton and Barto causes a fast approximation followed by oscillation (shown in a separate plot for clarity). The approximations for the smaller rate and for the other approaches are more stable.
Overall, both R-learning set-ups, corresponding to the solid black and grey lines in the plots of Figure 1, achieve a policy closer to optimal, and a better approximation of its value, considerably faster than the other algorithms. SMART, Gosavi’s “New Algorithm”, and SSPQ almost immediately reach policies that differ from the optimal one in about 5 to 10 states but remain in that error range, whereas after a longer transient the R-learning variants find policies with fewer than 5 non-optimal actions and much better value approximations.
Remarkably, “R-learning 2” is the only set-up for which the value approximation error and the differences with the optimal policy increase at the start, before converging faster and closer to the optimal than the other methods. Only for that set-up is the unique optimal policy ever visited. This suggests that, for the slowest-updating gain, a different part of the policy space is visited during the early stages of training. Moreover, this appears to have a beneficial effect, for this particular task, on both the speed and the quality of final convergence.
Optimal nudging, the method introduced below, goes even further in this direction, freezing the value of the gain for whole runs of some reinforcement learning method, and updating it only when the value of the current best-known policy is closely approximated.
Figure 2 shows the result of applying a vanilla version of optimal nudging to the access-control queuing task, compared with the best results above, corresponding to “R-learning 2”. After an initial step, required to compute a parameter, which takes 500,000 samples, the optimal nudging steps proper are a series of blocks of 750,000 transition observations, each for a fixed gain. Each action is followed by a Q-learning update with the same learning-rate values as in the R-learning experiments. The edges of the plateaus in the black curve to the left of the plot mark the points at which the gain was updated to a new value following the optimal nudging update rule.
The figure shows that both algorithms have similar performance, reaching comparably good approximations to the value of the optimal policy in roughly the same number of iterations. Moreover, optimal nudging finds a similarly good approximation to the optimal policy, differing from it at the end of the run in only one state. However, optimal nudging has a number of further advantages. It has one parameter fewer to adjust, the gain learning rate, and consequently performs one update fewer per action taken; this can have a dramatic effect in cases, such as this one, where obtaining sample transitions is quick and the updates cannot be done in parallel with the transitions between states. Additionally, there is no fine tuning in our implementation of the underlying Q-learning method: whereas the learning rates of “R-learning 1” were adjusted by Sutton and Barto to yield the best possible results, and we did the same for “R-learning 2”, the implementation of optimal nudging simply inherits the relevant parameters, without any adjustment or guarantee of best performance. Even the number of samples between gain updates, set here to 750,000, could be improved: the plateaus to the left of the plot suggest that in the early stages of learning good value approximations could be found much faster, so an adaptive termination rule for each call to the reinforcement learning algorithm might accelerate learning or free later iterations for finer approximation of the optimal value.
4 Optimal Nudging.
In this section we present our main algorithm, optimal nudging. While belonging to the realm of the generic Algorithm 1, the philosophy of optimal nudging is to disentangle the gain updates from value learning, turning an average-reward or semi-Markov task into a sequence of cumulative-reward tasks. This has the dual advantage of letting us treat the reinforcement learning method used as a black box (inheriting its speed, convergence, and complexity features), while allowing for a very efficient update scheme for the gain.
Our method favours a more intuitive understanding of the gain term, not so much as an approximation to the optimal average reward (although it remains one), but rather as a punishment for taking actions, which must be compensated by the rewards obtained afterwards. It exploits the Bertsekas split to focus on the value and cost of successive visits to a reference state, and on their ratio. A two-dimensional space is introduced as an arena in which it is possible to update the gain and ensure convergence to the solution. Finally, we show that the updates can be performed optimally, in a way that requires only a small number of calls to the black-box reinforcement learning method.
Summary of Assumptions.
The algorithm derived below requires the average-reward semi-Markov decision process to be solved to have finite state and action sets, to contain at least one recurrent state, and to be unichain. Further, it is assumed that the expected cost of every policy is positive and (without loss of generality) larger than one, and that the magnitude of every non-zero action cost is also larger than one.
To avoid duplicating cases in the mathematical derivation without adding further insight, it will also be assumed that at least one policy has positive average reward. If this is the case, naturally, any optimal policy will have positive gain.
4.1 Fractional Programming.
Under our assumptions, a Bertsekas split is possible on the states, actions, and transition probabilities of the task. We will refer to the value of a policy as the expected reward from the induced initial state, following the policy,
to its cost as the expected cumulative action costs,
and, as customary, to its gain as the value/cost ratio,
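The display equations for these three quantities are elided above; a hedged reconstruction consistent with the surrounding text, with $V(\pi)$, $C(\pi)$, and $\rho(\pi)$ as assumed symbol names and $s_0$ the initial state induced by the split, is:

```latex
% Assumed notation: expectations are over trajectories from the induced
% initial state s_0 to the terminal state, following policy \pi.
\[
V(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} r_t \,\middle|\, s_0\right],
\qquad
C(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} c_t \,\middle|\, s_0\right],
\qquad
\rho(\pi) = \frac{V(\pi)}{C(\pi)} .
\]
```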
The following is a restatement of a known fractional programming result (Charnes and Cooper, 1962), linking, in our notation, the optimization of the gain ratio to a parametric family of linear problems on policy value and cost.
[Fractional programming] The following average-reward and linear-combination-of-rewards problems share the same optimal policies,
for an appropriate value of such that
The Lemma is proved by contradiction. Under the stated assumptions, for all policies, both the value and the cost are finite, and the cost is positive. Let a gain-optimal policy be
If and , let
Now, assume there existed some policy, with corresponding value and cost, that had a better fitness in the linear problem,
It must then follow (since all costs are positive) that
which would contradict the optimality of the gain-optimal policy.
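With $V(\pi)$ and $C(\pi)$ denoting the value and cost of a policy as defined above (the symbol names are our assumption, since the original displays are elided), the lemma can be summarized as:

```latex
\[
\arg\max_{\pi}\frac{V(\pi)}{C(\pi)}
\;=\;
\arg\max_{\pi}\bigl(V(\pi)-\rho^{*}\,C(\pi)\bigr),
\qquad
\rho^{*}=\max_{\pi}\frac{V(\pi)}{C(\pi)},
\]
% so that, at the optimum, the nudged value vanishes:
\[
\max_{\pi}\bigl(V(\pi)-\rho^{*}\,C(\pi)\bigr)=0 .
\]
```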
This result has deep implications. Assume the optimal gain were known. Then we would be interested in solving the problem
which is equivalent to a single cumulative-reward problem whose rewards are the original rewards nudged by the gain times the costs, where, as discussed above, the rewards and costs are functions of the split.
Naturally, the appropriate nudging value corresponds to the optimal gain and is unknown beforehand. In order to compute it, we propose separating the problem into two parts: finding, by reinforcement learning, the optimal policy and its value for some fixed gain, and independently updating the gain. Thus, value learning becomes method-agnostic, so any of the robust methods listed in Section 3.1 can be used for this stage. The original problem can then be turned into a sequence of MDPs, for a series of temporarily fixed gains. Hence, while remaining within the bounds of the generic Algorithm 1, we propose not to hurry to update the gain after every step or Q-update. Additionally, as a consequence of Lemma 4.1, the method comes with the same solid termination condition as the SSP algorithms: the current optimal policy is gain-optimal if its nudged value is zero.
This suggests the nudged version of the learning algorithm, Algorithm 2. The term nudged comes from the understanding of the gain as a measure of the punishment given to the agent after each action in order to promote receiving the largest rewards as soon as possible. The remaining problem is to describe a suitable gain-update rule.
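Under the assumption that each call to the black-box learner returns the policy maximizing the nudged value for the current gain, the outer loop reduces to a Dinkelbach-style iteration. This sketch, over a finite set of candidate (value, cost) pairs and with hypothetical names, illustrates the gain update and the zero-nudged-value termination condition:

```python
def nudged_loop(policies, tol=1e-9, max_iters=100):
    """policies: list of (value, cost) pairs for candidate policies.
    Repeatedly 'solve' the cumulative problem max_pi V - rho*C for a
    fixed rho (here by enumeration, standing in for the black-box RL
    method), then update rho to the gain of the best policy found."""
    rho = 0.0
    for _ in range(max_iters):
        # black-box stage: best policy for nudged rewards r - rho*c
        v, c = max(policies, key=lambda p: p[0] - rho * p[1])
        if abs(v - rho * c) < tol:   # termination: nudged value is zero
            return rho
        rho = v / c                   # gain update
    return rho
```

With the two policies of the introductory game example, winning 1 in 100 moves versus an expected 0.2 in 10 moves, the loop settles on the higher gain 0.02 rather than the higher single-episode value.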
4.3 The Space.
We will present a variation of the space originally introduced by Uribe et al. (2011) for the restricted setting of Markov decision process games, as the realm in which to describe uncertainty about the gain and to propose a method to reduce it optimally.
Under the assumptions stated above, the only additional requirement for the definition of the space is a bound on the value of all policies: let there be a bound on the unsigned, unnudged reward, such that
Observe that this is a, possibly very loose, bound on the optimal value. However, it becomes tight in the case of a task and gain-optimal policy for which termination from the initial state occurs in one step with a reward of maximum magnitude. Importantly, under our assumptions and Definition 4.3, all policies have finite real expected value and finite positive cost.
4.3.1 The Mapping.
We are now ready to propose a simple mapping of a policy, with its value and cost, to a 2-dimensional space using the transformation equations:
The following properties of this transformation can be easily checked from our assumptions:
[Properties of the space.] For all policies ,
If , then .
If , then .
4.3.2 Value and Cost in .
Cross-multiplying the equations in (13), it is easy to find an expression for the value in the new space,
All of the policies with the same value lie on a level set that is a line through the origin whose slope depends on that value. Thus, as stated in Proposition 4.3.1, policies of extreme value lie on the axes and, further, policies of expected value 0 lie on the diagonal line. Geometrically, the value-optimal policies must subtend the smallest angle with the axis, with vertex at the origin. Figure 4 (left) shows the level sets of Equation (14).
On the other hand, adding both equations in (13), the cost in the new space is immediately found to be
This function also has line level sets, in this case all of the same slope. The edge of the triangle corresponds to policies of expected cost one and, as stated in Proposition 4.3.1, policies in the limit of infinite cost would lie at the origin. Figure 4 (right) shows the cost level sets. An interesting question is whether the origin actually belongs to the space: since it would correspond to policies of infinite expected cost, which would contradict either the unichain assumption or the assumption that the reference state is recurrent, the origin is excluded from the space.
The following result derives from the fact that, even though the value and cost expressions in Equations (14) and (15) are not convex functions, both value and cost level sets are lines, each of which divides the triangle into two polygons:
[Value and cost-optimal policies] The policies of maximum or minimum value and cost map in the new space to vertices of the convex hull of the point cloud. All cases are proved using the same argument by contradiction. Consider, for instance, a policy of maximum value. Its level-set line splits the space triangle in two, with all points of higher value below, and of lower value above, the level set. If the mapping of this policy is an interior point of the convex hull of the policy cloud, then some points on an edge of the convex hull, and consequently at least one of its vertices, must lie in the region of higher value. Since all vertices of this cloud of points correspond to actual policies of the task, there is at least one policy with higher value, which contradicts the optimality of the original policy. The same argument extends to the cases of minimum value and of maximum and minimum cost.
4.3.3 Nudged Value in .
Recall, from the fractional programming Lemma 4.1, that for an appropriate gain these two problems are optimized by the same policies:
By substituting into the left-hand-side problem of Equation (16) the expressions for value and cost in the new space, the original average-reward semi-Markov problem becomes the simple linear problem
Figure 5 illustrates the slope-one level sets of this problem. Observe that, predictably, the upper bound of these level sets in the triangle corresponds to the top vertex, which, as discussed above, would correspond to a policy with maximum value and unity cost, that is, a policy that would receive the highest possible reinforcement from the recurrent state and would return to it in one step with probability one.
Conversely, for some fixed gain (not necessarily the optimal one), the problem on the right-hand side of Equation (16) becomes
where we opt not to drop the constant term. We refer to the nudged value of a policy as
All policies sharing the same nudged value lie on the level set
which, again, is a line whose slope, further, depends only on the common nudged value and not on the gain. Thus, for instance, for any gain, the level set corresponding to policies with zero nudged value, such as those of interest for the termination condition of the fractional programming Lemma 4.1 and Algorithm 2, has unity slope.
There is a further remarkable property of the line level sets corresponding to policies of the same nudged value, summarized in the following result:
[Intersection of nudged-value level sets] For a given nudging gain, the level sets of all possible nudged values share a common intersection point. This result is proved through straightforward algebra. Consider, for a fixed gain, the line level sets corresponding to two nudged values. Setting the right-hand sides of their expressions in the form of Equation (18) equal, and solving, yields the first component of their intersection:
Finally, replacing this in the level set for ,
Thus, for a set gain, all nudged-value level sets comprise what is called a pencil of lines: a parametric set of lines with a unique common intersection point. Figure 6 (right) shows an instance of such a pencil.
The termination condition states that the nudged and average-reward problems share the same solution space for a gain such that the nudged value of the optimal policy is 0. Figure 6 (left) illustrates this case: the same policy in the example cloud simultaneously solves the linear problem with level sets of slope one (corresponding to the original average-reward task of Equation 17) and the nudged problem with zero nudged value.
4.4 Minimizing Gain Uncertainty.
We now have all the elements to derive an algorithm for quickly enclosing the optimal gain in the smallest neighbourhood possible. Observe that, from the outset, since we assume that policies with non-negative gain exist, the bounds on the optimal gain are the largest possible,
Geometrically, this is equivalent to the fact that the vertex of the pencil of lines can be, a priori, anywhere on a boundary segment: at one end if the optimal policy has zero gain, and at the other if the optimal policy receives the maximum reinforcement and terminates in a unity-cost step. For our main result, we will propose a way to reduce the best known bounds on the optimal gain as much as possible every time a new gain is computed.
4.4.1 Enclosing Triangles, Left and Right Gain Uncertainty.
In this section we introduce the notion of an enclosing triangle. The method introduced below exploits the properties of shrinking enclosing triangles to optimally reduce gain uncertainty between iterations.
[Enclosing triangle] A triangle in the space with vertices , , and is an enclosing triangle if it is known to contain the mapping of the gain-optimal policy and, additionally,
; ; .
Figure 7 illustrates the geometry of enclosing triangles as defined. The first two conditions in Definition 4.4.1 ensure that the first vertex lies above the second and that the slope of the line joining them is unity. The third condition places all three points in the part of the space on or below the boundary line, corresponding to policies with non-negative gain.
In the fourth condition, one quantity is defined as the first component of the intersection of the slope-one line joining the first two vertices with the boundary line, and the other as the corresponding component for the slope-one line crossing the third vertex. Requiring the stated inequality is equivalent to forcing the third vertex to lie below the line that joins the first two.
The fifth and sixth conditions confine the possible location of the third vertex to the triangle formed by the first two vertices and the intersection of the line through one of them with slope zero and the line through the other with slope minus one. This triangle is pictured with thick dashed lines in Figure 7.
[Degenerate enclosing triangles] Observe that, in the definition of an enclosing triangle, some degenerate or indeterminate cases are possible. First, if the first two vertices coincide, then the third must coincide with them, so the “triangle” is in fact a point, and some terms in the slope conditions (2, 5, 6) of Definition 4.4.1 become indeterminate. Alternatively, if equality holds in condition 4, then the three vertices must be collinear. We admit both of these degenerate cases as valid enclosing triangles since, as discussed below, they correspond to instances in which the solution to the underlying average-reward task has been found.
Since we assume that positive-gain policies, and thus positive-gain optimal policies, exist, direct application of the definition of an enclosing triangle leads to the following proposition:
[Initial enclosing triangle] , ,