Reinforcement learning (RL) is a paradigm in which an agent learns to interact with an environment in order to maximize reward [1]. Many interesting RL problems concern ‘continuing’ tasks — the agent lives and learns over a single lifetime, rather than experiencing a sequence of distinct ‘episodes’. Examples of domains for continuing tasks include routing internet packets, managing the inventory in a warehouse, and controlling the temperature in a data center.
In continuing tasks, we often resort to temporally “discounting” rewards: immediate rewards are valued more highly than rewards farther in the future. One reason is that the sum of future rewards (which is finite for episodic tasks) can grow unboundedly if the agent-environment interaction continues indefinitely. With geometric discounting, for example, the sum of future rewards remains bounded even when the length of the interaction does not.¹

¹The notion of discounting discussed here is one that leads to a finite sum of future rewards. This occurs with geometric discounting, for instance, but not hyperbolic discounting [2], which is applicable only in finite-horizon episodic problems.
RL is often phrased as an optimization problem: find a policy that maximizes reward. A standard optimization problem is defined by a set of feasible solutions and an objective function that describes the quality of each feasible solution with a real number. The feasible solutions in RL are policies and, at least for episodic tasks, the natural objective function is the sum of rewards over an episode. For continuing tasks, we argue that discounted reinforcement learning is not an optimization problem — it does not maximize any objective function.
A policy is optimal in the discounted formulation if it achieves a higher discounted sum of future rewards in each state than any other policy. Mathematically, an optimal policy π* for a fixed discount rate γ ∈ [0, 1) satisfies

    v^γ_{π*}(s) ≥ v^γ_π(s)   for all states s and all policies π,   (1)

where v^γ_π(s) denotes the expected discounted sum of future rewards when starting in state s and following π.
Rather than an objective function, these inequalities produce a partial order on the set of policies. Importantly, some pairs of policies may be incomparable: each achieves a higher value in some states but a lower value in others. We will argue that this partial order fails to identify an optimal policy under function approximation.
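This incomparability is easy to exhibit. The following sketch (a two-state toy example constructed purely for illustration; the numbers are not from any figure in this paper) computes the discounted values of two policies, each of which is better in a different state:

```python
import numpy as np

# Two hypothetical Markov reward processes, each induced by one policy
# on the same two states (illustrative numbers of our own).
gamma = 0.9

# Policy A: earns reward 1 in state 0, then gets stuck in state 1.
P_A = np.array([[0.0, 1.0], [0.0, 1.0]])
r_A = np.array([1.0, 0.0])

# Policy B: earns reward 1 in state 1, then gets stuck in state 0.
P_B = np.array([[1.0, 0.0], [1.0, 0.0]])
r_B = np.array([0.0, 1.0])

# Discounted values: v = (I - gamma * P)^{-1} r
v_A = np.linalg.solve(np.eye(2) - gamma * P_A, r_A)
v_B = np.linalg.solve(np.eye(2) - gamma * P_B, r_B)

print(v_A)  # [1, 0]: policy A is better in state 0
print(v_B)  # [0, 1]: policy B is better in state 1
```

Neither policy dominates the other in every state, so the partial order of equation (1) cannot rank them.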
There is no optimal representable policy when using function approximation with discounting
In many RL problems the set of all policies is too large to represent. For example, in domains with extremely large state spaces, it is common to resort to a compact state representation that cannot uniquely represent every possible state. Consequently, representable policies must behave identically in states with the same representation. We are willing to accept that the optimal policy given by equation (1) is not representable and the agent cannot hope to learn it; we would settle for the “best” representable policy.
However, even this goal is meaningless: there may no longer be any representable policy that is unambiguously “better” than all the others. In general, every representable policy will have higher value in some states and lower value in others, so the partial order cannot identify any policy as “optimal”; the notion of “best representable policy” is not well-defined (see [3] for examples). This is not an issue for episodic tasks, because the sum of rewards is a natural objective function that can be used to compare any two policies.
An explicit objective function provides not just a partial order but a total order, capable of comparing any two policies. It is a simple matter (at least in principle) to restrict the feasible solutions to the set of representable policies while leaving the objective function unchanged: there will still be an optimal representable policy.
We are forced to conclude that discounting is fundamentally incompatible with function approximation for control in continuing tasks. We now argue that function approximation and continuing tasks are indispensable for reinforcement learning, and therefore discounting should be abandoned.
Continuing control with function approximation is the problem setting that matters
Reinforcement learning with tabular representations has served us well to build intuition, construct algorithms, and analyze their properties, but the world we live in is too large to be represented by tables of values or action probabilities. Furthermore, it is crucial to generalize across similar states and actions. Function approximation serves both these purposes and is necessary for agents to compactly represent complex worlds and make sense of their complexity.
Continuing tasks are those in which the agent-environment interaction does not naturally break down into episodes. Some of the most interesting RL tasks being studied today involve extremely long episodes and the challenges that come with them (including sparse rewards and credit assignment over long time horizons). Such tasks are episodic in name only, and meeting the challenges they present will require us to deeply understand continuing tasks.
Function approximation is necessary in large-scale reinforcement learning. Since the discounting formulation is fundamentally incompatible with function approximation, it is also incompatible with large-scale reinforcement learning. We believe that whole-heartedly adopting other optimization approaches for continuing task formulations (for example, average reward) is the most appropriate direction for our discipline.
After covering some background in Section 2, we describe an alternative continuing task formulation in Section 3 which has a well-defined objective — maximizing the average reward per step — making it suitable for use with function approximation. We discuss its equivalence with the time-average of the discounted value function, and the resulting misconception that common discounted algorithms maximize the average reward irrespective of the choice of the discount factor. We summarize the arguments in Section 4 and give pointers to the existing literature involving the average reward formulation.
A Markov decision process (MDP) consists of a finite set of states S and a finite set of actions A that can be performed in those states. When the agent takes an action in a state, the agent transitions to another state and receives a scalar reinforcement signal called the reward. Formally, an MDP can be represented by a 4-tuple (S, A, r, p), where upon performing action a in state s the agent receives reward r(s, a) in expectation and transitions to state s′ with probability p(s′ | s, a). Note that the transition probabilities and rewards are independent of the agent’s history before reaching state s (the Markov assumption).
A policy π gives, for every state in an MDP, a probability distribution π(· | s) over the actions to be taken in that state. A policy π acting in an MDP produces a Markov reward process (S, P_π, r_π), where P_π(s′ | s) = Σ_a π(a | s) p(s′ | s, a) is the transition probability and r_π(s) = Σ_a π(a | s) r(s, a) is the reward function.
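As a sketch of this construction (the array layout and the randomly generated MDP are our own illustrative choices), P_π and r_π can be formed from p, r, and π:

```python
import numpy as np

# Forming the Markov reward process (P_pi, r_pi) induced by a policy pi
# on an MDP given as arrays (illustrative random numbers).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# p[s, a, s'] = transition probability, r[s, a] = expected reward
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)          # normalize each (s, a) row
r = rng.random((n_states, n_actions))

# pi[s, a] = probability of taking action a in state s (uniform here)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# P_pi(s'|s) = sum_a pi(a|s) p(s'|s,a);  r_pi(s) = sum_a pi(a|s) r(s,a)
P_pi = np.einsum('sa,sat->st', pi, p)
r_pi = np.einsum('sa,sa->s', pi, r)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # rows are distributions
```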
The average reward of a policy is an intuitively straightforward quantity: it is the reward the policy receives on average every time step as it acts in an environment. We define the average reward more formally as follows. Under certain mild conditions, a Markov reward process has a so-called stationary distribution over states, d_π, that satisfies

    d_π(s′) = Σ_s d_π(s) P_π(s′ | s)   for all states s′.
In other words, if the agent’s state is distributed according to the stationary distribution and it acts according to its policy, then its next state will also follow the stationary distribution. Furthermore, under additional mild conditions, the state distribution converges to d_π over the long run regardless of the starting state s₀:

    lim_{t→∞} Pr(S_t = s | S₀ = s₀) = d_π(s).
In continuing tasks, d_π(s) measures the fraction of time the agent spends in state s. Frequently visited states have higher values of d_π(s), and if d_π(s) = 0 then s is called a transient state: it is visited at most finitely many times and then never again.
The average reward of a policy is simply defined as the average one-step reward, weighted by the stationary distribution:

    r(π) = Σ_s d_π(s) r_π(s).   (2)
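A minimal sketch, assuming a small irreducible Markov reward process with made-up numbers, of computing the stationary distribution and the average reward:

```python
import numpy as np

# Stationary distribution d_pi and average reward r(pi) for a toy
# two-state Markov reward process (illustrative numbers of our own).
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])

# d_pi is the left eigenvector of P_pi with eigenvalue 1: solve
# d P = d together with the constraint sum(d) = 1.
n = P_pi.shape[0]
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
d_pi, *_ = np.linalg.lstsq(A, b, rcond=None)

avg_reward = d_pi @ r_pi
print(d_pi, avg_reward)  # d_pi = [2/7, 5/7], average reward = 2/7
```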
For the purposes of this paper, we consider policies that depend only on states. For more details on MDPs and average reward, refer to Puterman [4].
3 An Optimization Objective for Continuing Tasks
We would like to find an objective for continuing tasks that is analogous to the sum of rewards objective in episodic tasks. Such an objective function could be used to evaluate and compare any policy with any other, so that our goal of finding the best policy amongst any class of policies becomes well-defined.
The founding principle of reinforcement learning is that an agent maximizes the rewards it receives over its lifetime (which is infinite in continuing problems). We could still contemplate evaluating policies using their discounted value from the start state, but this gives undue importance to the early part of an agent’s lifetime, going against the essence of a continuing task. Indeed, the start state may never be revisited in a continuing task, so it does not deserve special treatment.
The only factors that can influence how a reward is valued are the state s, the action a, and the time t at which S_t = s and A_t = a. If we want to weight rewards independently of when they are received, we must weight them based on where they are received. In other words, the objective function must have the form Σ_{s,a} w(s, a) r(s, a), with a weighting w that weights each reward based on the state and action but not the time t.
For policy optimization to be meaningful, w must depend on the policy π. Setting w(s, a) = d_π(s) π(a | s) (say) has the effect of weighting each state-action pair by how frequently it is visited, which brings us to the optimization problem

    max_{π ∈ Π} r(π) = max_{π ∈ Π} Σ_{s,a} d_π(s) π(a | s) r(s, a),
where Π is the set of representable policies. This objective function captures a core tenet of reinforcement learning: the agent’s actions should cause it to visit states where it can earn large rewards. In continuing tasks, the agent should find policies that cause it to frequently visit these highly rewarding states.
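To illustrate the resulting total order (with toy transition and reward numbers of our own), any two policies’ induced reward processes reduce to single comparable numbers:

```python
import numpy as np

# The average reward gives a total order: any two policies can be
# ranked by one scalar each (illustrative toy numbers).

def average_reward(P, r):
    """Average reward of a Markov reward process (P, r)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)   # stationary distribution
    return d @ r

# Two policies' induced reward processes on the same two states
r_X = average_reward(np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([1.0, 0.0]))
r_Y = average_reward(np.array([[0.2, 0.8], [0.6, 0.4]]), np.array([0.0, 0.9]))

print(r_X, r_Y)   # the larger number identifies the better policy
```

Unlike the state-wise comparison of equation (1), this comparison never leaves two policies incomparable.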
The average reward objective in terms of discounted values
Interestingly, the average reward has several equivalent formulations. While we defined it as an average of the one-step reward weighted by the stationary distribution, it is equivalent to a weighted average of the discounted values:

    r(π) = (1 − γ) Σ_s d_π(s) v^γ_π(s)   for any γ ∈ [0, 1).   (3)
Intuitively, if the initial state is sampled from d_π, then all future states will have the same distribution and the expected reward on every future time step will be r(π). Effectively, the stationary distribution allows any weighted average over time (in this case, the normalized value function (1 − γ) v^γ_π) to be replaced with a weighted average over states [3, 1].²

²Since the stationary distribution allows time-averages to be converted to state-averages, the average reward objective does not imply that agents have to average infinitely many rewards. Indeed, the same equivalence holds for any state-independent temporal discounting scheme (not just geometric) as long as the discount factors sum to a constant; the factor 1 − γ is replaced by the reciprocal of that constant. (This equivalence does not hold for hyperbolic discounting [2], for example, because the sum of the discount factors diverges.)
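The equivalence can be checked numerically on a toy Markov reward process (our own numbers), for several discount factors at once:

```python
import numpy as np

# Numeric check of equation (3): r(pi) = (1 - gamma) * sum_s d_pi(s) v_pi^gamma(s),
# on a hand-made two-state Markov reward process (illustrative numbers).
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

# Stationary distribution: left eigenvector of P with eigenvalue 1
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
d, *_ = np.linalg.lstsq(A, b, rcond=None)
avg_reward = d @ r                        # = 2/7 here

for gamma in [0.1, 0.5, 0.9, 0.99]:
    v = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted values
    assert np.isclose((1 - gamma) * (d @ v), avg_reward)
```

The identity holds exactly for every γ because d P = d implies d (I − γP)⁻¹ = d / (1 − γ).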
Theoretically, equation (3) implies that the policy that maximizes the average of future discounted values (with any discount factor γ) also maximizes the average reward. The discount rate thus becomes a hyper-parameter of the algorithm, rather than a parameter specifying the optimization objective [1].
Greedily maximizing discounted future value does not maximize average reward
At this point, it would be easy to conclude that any algorithm that maximizes discounted value must also maximize the average reward regardless of the discount factor γ, which seems absurd on the face of it. We would be right to be skeptical: common algorithms like Q-learning or Sarsa estimate an action-value function and behave (close to) greedily with respect to it. Such local greedification will not in general correspond to maximizing the average discounted value over time.
As an example, consider the two-choice MDP in Figure 1. A single choice state offers the only meaningful actions: left and right. Going left leads to an immediate reward, while going right leads to a larger but delayed reward. For small discount factors, the immediate reward from going left appears more appealing than the discounted delayed reward; when the discount factor is increased sufficiently, the right action becomes more appealing instead. For algorithms based on the local greedification operator, the discount factor is not just a hyper-parameter of the algorithm; it remains part of the problem specification.
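A sketch in the spirit of this example, with hypothetical rewards of our own choosing (+1 immediately versus +3 one step later; these numbers are not from Figure 1), shows the greedy preference flipping with the discount factor:

```python
# Toy two-choice MDP (illustrative numbers of our own): from the choice
# state, 'left' yields +1 and returns immediately; 'right' yields 0 now
# and +3 one step later, then returns.

def greedy_choice(gamma):
    v_left = 1.0 / (1.0 - gamma)                             # +1 every step
    v_right = 3.0 * gamma / ((1.0 - gamma) * (1.0 + gamma))  # +3 every 2 steps
    return 'left' if v_left > v_right else 'right'

print(greedy_choice(0.3))   # myopic: prefers the immediate +1
print(greedy_choice(0.9))   # far-sighted: prefers the delayed +3
```

In this toy instance the average-reward-optimal choice is always ‘right’ (1.5 reward per step versus 1), yet the discounted-greedy choice depends on γ.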
Increasing γ is not enough
Inspired by the previous example, one might consider using a discounted algorithm but taking the limit as γ → 1.
Consider the objective lim_{γ→1} (1 − γ) v^γ_π(s). This objective does indeed avoid the issues that come with a finite (discounted) time horizon. Furthermore, it is bounded, being a weighted average of bounded rewards. Finally, it does not depend on the initial state s. Unsurprisingly, this quantity turns out to be equivalent to r(π).
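This limit can be observed numerically (toy Markov reward process, our own numbers): the scaled discounted value of every state approaches the same average reward:

```python
import numpy as np

# Check that (1 - gamma) * v_pi^gamma(s) approaches r(pi) for every
# state s as gamma -> 1 (illustrative two-state process).
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])     # the average reward works out to 2/7 here

for gamma in [0.9, 0.99, 0.999]:
    v = np.linalg.solve(np.eye(2) - gamma * P, r)
    print(gamma, (1 - gamma) * v)   # both entries approach 2/7
```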
In practice, unfortunately, it is hard to evaluate an objective function containing a limit. Instead, algorithms find optimal policies for increasing discount factors; that is, they exchange the maximization with the limit, computing lim_{γ→1} argmax_π v^γ_π(s₀) for some start state s₀.
In theory, the limiting policy does indeed exist and is called the Blackwell-optimal policy π*. For every MDP there is a critical discount rate γ* < 1, and π* maximizes v^γ_π(s) for every γ ≥ γ* and every state s [4]. Intuitively, in any given MDP, making a decision that is optimal over a sufficiently long horizon is equivalent to optimizing the average reward. Crucially, this horizon depends on the MDP and on how long it takes for actions to have consequences (in terms of reward), which is not known beforehand. The critical discount rate is related to the maximum mixing time of the MDP [4].
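The critical discount rate can be located empirically in a toy example (hypothetical rewards of our own: +1 immediate versus +3 delayed by one step). Below γ*, the discounted-greedy choice disagrees with the average-reward-optimal one; at and above it, they agree:

```python
# Locating the critical discount factor gamma* empirically in a toy
# two-choice setting (illustrative numbers of our own).

def prefers_delayed(gamma):
    v_immediate = 1.0 / (1.0 - gamma)                           # +1 every step
    v_delayed = 3.0 * gamma / ((1.0 - gamma) * (1.0 + gamma))   # +3 every 2 steps
    return v_delayed > v_immediate

# Scan a grid of discount factors for the first far-sighted preference.
gammas = [g / 100 for g in range(1, 100)]
gamma_star = next(g for g in gammas if prefers_delayed(g))
print(gamma_star)   # 0.51 on this grid; analytically gamma* = 0.5 here
```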
There are two major obstacles to exploiting this theory to convert discounted-value algorithms into average-reward algorithms for real-world RL problems. First, most algorithms that learn discounted value functions become increasingly unstable as γ increases toward 1. Second, it is hard to estimate the critical discount factor, because it is intimately tied to the unknown dynamics of the environment. Thus, to ensure we find an optimal policy we cannot settle for any fixed γ, forcing us into the γ → 1 regime where our algorithms become unusable. Algorithms that optimize the average reward objective do not rely on a discount factor and automatically make decisions at the appropriate time horizon.
Maximizing average reward is algorithmically feasible
The average reward formulation has long been studied, and there are several dynamic programming algorithms for finding optimal average reward policies; see Puterman [4] or Bertsekas [5]. Mahadevan [6] has reviewed variants of the DP-based RL algorithms that maximize the average reward. Of note is RVI Q-learning [7], a simple Q-learning-like algorithm with convergence guarantees. Furthermore, policy gradient methods are especially promising, since the gradient of the average reward objective has an elegant closed-form expression [8]. Konda and Tsitsiklis [9] have proposed an actor-critic method for the average reward setting.
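As a sketch of how simple such algorithms can be, the following is a minimal tabular RVI Q-learning loop on a toy MDP; the MDP, step size, exploration rate, and reference state-action pair are our own illustrative choices, not taken from the cited papers:

```python
import random

# Minimal tabular RVI Q-learning on a toy two-choice MDP.
# State 0: action 0 ('left') gives +1 and stays; action 1 ('right')
# gives 0 and moves to state 1. State 1 has one action: +3, back to 0.
# The optimal average reward is 1.5 (the right loop earns 0 and 3 on
# alternating steps).

def step(s, a):
    if s == 0:
        return (1.0, 0) if a == 0 else (0.0, 1)
    return (3.0, 0)

def n_actions(s):
    return 2 if s == 0 else 1

random.seed(0)
Q = [[0.0, 0.0], [0.0]]
alpha, eps = 0.1, 0.2
ref_s, ref_a = 0, 1          # f(Q) = Q(ref_s, ref_a) anchors the values

s = 0
for _ in range(50000):
    if random.random() < eps:                      # epsilon-greedy exploration
        a = random.randrange(n_actions(s))
    else:
        a = max(range(n_actions(s)), key=lambda x: Q[s][x])
    r, s2 = step(s, a)
    # RVI Q-learning update: subtract the reference value in place of
    # an explicit average-reward estimate.
    target = r - Q[ref_s][ref_a] + max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
    s = s2

rho = Q[ref_s][ref_a]        # estimate of the optimal average reward
print(rho)
```

In this toy run the reference value approaches 1.5, the average reward of the better (delayed-reward) loop, without any discount factor appearing in the algorithm.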
Discounted reinforcement learning is not an optimization problem. In continuing tasks, it does not correspond to the maximization of any objective function over a set of policies.
Function approximation is incompatible with discounting. Because discounted RL is not an optimization problem, the “best” representable policy is not well-defined.
Average reward RL is an optimization problem even with function approximation. Several relatively simple algorithms exist with convergence guarantees.
Greedily maximizing discounted future value does not maximize average reward. Common algorithms like Sarsa or Q-learning do not optimize the average reward, and they find different policies depending on the discount factor.
Increasing γ is not enough. The discount factor must be increased beyond a critical point that is problem-specific and cannot be determined a priori. Moreover, most algorithms become increasingly unstable as γ approaches unity.
There are many open problems in average reward RL. Further advances are needed in areas such as multi-step learning, off-policy learning, model-based learning, and options. Many of the recent advances in optimization can be applied to improve algorithms for average reward.
This work was partially done during Abhishek Naik’s internship at Huawei Technologies in Edmonton. Roshan Shariff was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Alberta Innovates. The authors also wish to thank Muhammad Zaheer for valuable feedback on a preliminary draft of this work.
- (1) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
- (2) William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. CoRR, abs/1902.06865, 2019.
- (3) Satinder P. Singh, Tommi S. Jaakkola, and Michael I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In ICML, 1994.
- (4) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
- (5) Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, 2005.
- (6) Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3):159–195, 1996.
- (7) Jinane Abounadi, Dimitri P. Bertsekas, and Vivek S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM J. Control and Optimization, 40(3):681–698, 2001.
- (8) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000.
- (9) Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NeurIPS, 1999.