The goal of an agent is to maximise its expected utility; but how do we measure utility? One method is to assign an instantaneous reward to particular events, such as having a good meal, or a pleasant walk. It would be natural to measure the utility of a plan (policy) by simply summing the expected instantaneous rewards, but for immortal agents this may lead to infinite utility and also assumes rewards are equally valuable irrespective of the time at which they are received.
One solution, the discounted utility (DU) model introduced by Samuelson in [Sam37], is to take a weighted sum of the rewards with earlier rewards usually valued more than later ones.
There have been a number of criticisms of the DU model, which we will not discuss. For an excellent summary, see [FOO02]. Despite the criticisms, the DU model is widely used in both economics and computer science.
A discount function is time-inconsistent if plans chosen to maximise expected discounted utility change over time. For example, many people express a preference for $110 in 31 days over $100 in 30 days, but reverse that preference 30 days later when given a choice between $110 tomorrow or $100 today [GFM94]. This behavior can be caused by a rational agent with a time-inconsistent discount function.
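The preference reversal described above can be reproduced numerically. The following is a minimal sketch, assuming a hyperbolic weight of the form 1/(1 + κ·delay) with a hypothetical per-day parameter κ = 0.2, and a geometric weight with a hypothetical daily rate of 0.95 for comparison; neither parameter value comes from [GFM94].

```python
# Illustrative sketch: the weight functions and parameter values are assumptions.

def hyperbolic(delay_days, kappa=0.2):
    """Weight of a reward received `delay_days` from now (hyperbolic form)."""
    return 1.0 / (1.0 + kappa * delay_days)

def geometric(delay_days, gamma=0.95):
    """Weight of a reward received `delay_days` from now (geometric form)."""
    return gamma ** delay_days

def preferred(options, now, weight):
    """Return the (amount, day) option with the largest discounted value at day `now`."""
    return max(options, key=lambda o: o[0] * weight(o[1] - now))

options = [(100, 30), (110, 31)]  # $100 on day 30 versus $110 on day 31

hyper_day0 = preferred(options, now=0, weight=hyperbolic)
hyper_day30 = preferred(options, now=30, weight=hyperbolic)
geo_day0 = preferred(options, now=0, weight=geometric)
geo_day30 = preferred(options, now=30, weight=geometric)

print("hyperbolic:", hyper_day0, "then", hyper_day30)
print("geometric: ", geo_day0, "then", geo_day30)
```

Under these assumed parameters the hyperbolic agent switches from the later, larger reward to the earlier, smaller one as the date approaches, whereas the geometric agent makes the same choice on both days.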
Unfortunately, time-inconsistent discount functions can lead to extremely bad behavior, and so it becomes important to ask which discount functions are time-inconsistent.
Previous work has focussed on a continuous model where agents can take actions at any time in a continuous time-space. We consider a discrete model where agents act in finite time-steps. In general this is not a limitation since any continuous environment can be approximated arbitrarily well by a discrete one. The discrete setting has the advantage of easier analysis, which allows us to consider a very general setup where environments are arbitrary finite or infinite Markov decision processes.
Traditionally, the DU model has assumed a sliding discount function. Formally, a sequence of instantaneous utilities (rewards) $R = (r_k, r_{k+1}, r_{k+2}, \ldots)$ starting at time $k$ is given utility equal to $\sum_{t=k}^{\infty} d_{t-k}\, r_t$, where $\boldsymbol{d} \in [0,1]^\infty$. We generalise this model as in [Hut06] by allowing the discount function to depend on the age of the agent. The new utility is given by $\sum_{t=k}^{\infty} d^k_t\, r_t$. This generalisation is consistent with how some agents tend to behave; for example, humans become temporally less myopic as they grow older.
Strotz [Str55] showed that the only time-consistent sliding discount function is geometric discounting. We extend this result to a full characterisation of time-consistent discount functions where the discount function is permitted to change over time. We also show that discounting functions that are “nearly” time-consistent give rise to low regret in the anticipated future changes of the policy over time.
Another important question is what policy should be adopted by an agent that knows it is time-inconsistent. For example, if it knows it will become temporarily myopic in the near future then it may benefit from paying a price to pre-commit to following a particular policy. A number of authors have examined this question in special continuous cases, including [Gol80, PY73, Pol68, Str55]. We adapt their results to our general, but discrete, setting using game theory.
The paper is structured as follows. First the required notation is introduced (Section 2). Example discount functions and the consequences of time-inconsistent discount functions are then presented (Section 3). We next state and prove the main theorems, the complete classification of discount functions and the continuity result (Section 4). The game theoretic view of what an agent should do if it knows its discount function is changing is analyzed (Section 5). Finally we offer some discussion and concluding remarks (Section 6).
2 Notation and Problem Setup
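The general reinforcement learning (RL) setup involves an agent interacting sequentially with an environment where in each time-step $t$ the agent chooses some action $a_t \in \mathcal{A}$, whereupon it receives a reward $r_t \in \mathcal{R} \subseteq \mathbb{R}$ and observation $o_t \in \mathcal{O}$. The environment can be formally defined as a probability distribution $\mu$ where $\mu(r_t o_t \mid a_1 r_1 o_1 \cdots a_{t-1} r_{t-1} o_{t-1} a_t)$ is the probability of receiving reward $r_t$ and observation $o_t$ having taken action $a_t$ after history $h_{<t} := a_1 r_1 o_1 \cdots a_{t-1} r_{t-1} o_{t-1}$. For convenience, we assume that for a given history $h_{<t}$ and action $a_t$, the reward $r_t$ is fixed (not stochastic). We denote the set of all finite histories by $\mathcal{H} := (\mathcal{A} \times \mathcal{R} \times \mathcal{O})^*$ and write $h_{1:t}$ for a history of length $t$ and $h_{<t}$ for a history of length $t-1$. $a_k$, $r_k$, and $o_k$ are the $k$th action/reward/observation tuple of history $h$ and will be used without explicitly redefining them (there will always be only one history “in context”).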
A deterministic environment (where every value of is either 1 or 0) can be represented as a graph with edges for actions, rewards of each action attached to the corresponding edge, and observations in the nodes. For example, the deterministic environment on the right represents an environment where either pizza or pasta must be chosen at each time-step (evening). An action leading to an upper node is eat pizza while the ones leading to a lower node are eat pasta. The rewards are for a consumer who prefers pizza to pasta, but dislikes having the same food twice in a row. The starting node is marked as . This example, along with all those for the remainder of this paper, does not require observations.
The following assumption is required for clean results, but may be relaxed if an $\epsilon$ of slop is permitted in some results.
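Assumption 1.
We assume that the action set $\mathcal{A}$ and the observation set $\mathcal{O}$ are finite and that the reward set is $\mathcal{R} = [0,1]$.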
Definition 2 (Policy).
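A policy is a mapping $\pi : \mathcal{H} \to \mathcal{A}$ from finite histories to actions, giving an action for each history.
Given a policy $\pi$, a history $h_{1:t}$ and $s \le t$, the probability of reaching history $h_{1:t}$ when starting from history $h_{<s}$ is $P(h_{s:t} \mid h_{<s}, \pi)$, which is defined by
$$P(h_{s:t} \mid h_{<s}, \pi) := \prod_{k=s}^{t} \mu\big(r_k o_k \mid h_{<k}\, \pi(h_{<k})\big).$$
If $s = 1$ then we abbreviate and write $P(h_{1:t} \mid \pi) := P(h_{1:t} \mid h_{<1}, \pi)$.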
Definition 3 (Expected Rewards).
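When applying policy $\pi$ starting from history $h_{<t}$, the expected sequence of rewards $R^\pi(h_{<t}) \in [0,1]^\infty$ is defined by
$$R^\pi(h_{<t})_k := \sum_{h_{t:k}} P(h_{t:k} \mid h_{<t}, \pi)\, r_k.$$
If $k < t$ then $R^\pi(h_{<t})_k := 0$.
Note that while the set of all possible $h_{t:k}$ is uncountable due to the reward term, we sum only over the possible rewards, which are determined by the action and previous history, and so this is actually a finite sum.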
Definition 4 (Discount Vector).
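A discount vector $\boldsymbol{d}^k \in [0,1]^\infty$ is a vector $\boldsymbol{d}^k := [d^k_1, d^k_2, d^k_3, \ldots]$ satisfying $d^k_t > 0$ for at least one $t \ge k$.
The apparently superfluous superscript $k$ will be useful later when we allow the discount vector to change with time. We do not insist that the discount vector be summable, that is, that $\sum_{t=k}^{\infty} d^k_t < \infty$.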
Definition 5 (Expected Values).
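The expected discounted reward (or utility or value) when using policy $\pi$ starting in history $h_{<t}$ and discount vector $\boldsymbol{d}^k$ is
$$V^\pi_{\boldsymbol{d}^k}(h_{<t}) := R^\pi(h_{<t}) \cdot \boldsymbol{d}^k = \sum_{i=1}^{\infty} R^\pi(h_{<t})_i\, d^k_i.$$
The sum can be taken to start from $t$ since $R^\pi(h_{<t})_i = 0$ for $i < t$. This means that the value of $d^k_t$ for $t < k$ is unimportant for every result in this paper. As the scalar product is linear, a scaling of a discount vector has no effect on the ordering of the policies. Formally, if $V^{\pi_1}_{\boldsymbol{d}^k}(h_{<t}) \ge V^{\pi_2}_{\boldsymbol{d}^k}(h_{<t})$ then $V^{\pi_1}_{\alpha \boldsymbol{d}^k}(h_{<t}) \ge V^{\pi_2}_{\alpha \boldsymbol{d}^k}(h_{<t})$ for all $\alpha > 0$.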
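The scale invariance noted above is easy to check on truncated sequences. This is a small sketch under stated assumptions: the two expected reward sequences and the discount vector are made-up finite truncations chosen only for illustration, and the names `value` and `ranking` are hypothetical helpers.

```python
# Illustrative sketch: finite truncations of the infinite vectors (an assumption).

def value(expected_rewards, discount):
    """Scalar product of an expected reward sequence with a discount vector."""
    return sum(r * d for r, d in zip(expected_rewards, discount))

def ranking(policies, discount):
    """Policy names ordered from highest to lowest discounted value."""
    return sorted(policies, key=lambda name: value(policies[name], discount),
                  reverse=True)

# Hypothetical expected reward sequences of two policies.
policies = {
    "pi_1": [0.0, 1.0, 0.0, 0.5],
    "pi_2": [0.6, 0.0, 0.3, 0.0],
}

discount = [0.9 ** t for t in range(1, 5)]   # a truncated discount vector
scaled = [7.5 * d for d in discount]         # the same vector scaled by alpha = 7.5

original_order = ranking(policies, discount)
scaled_order = ranking(policies, scaled)
print(original_order, scaled_order)
```

Scaling multiplies every value by the same positive constant, so the ordering of the policies is unchanged.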
Definition 6 (Optimal Policy/Value).
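In general, our agent will try to choose a policy $\pi^*_{\boldsymbol{d}^k}$ to maximise $V^\pi_{\boldsymbol{d}^k}(h_{<t})$. This is defined as follows.
$$\pi^*_{\boldsymbol{d}^k}(h_{<t}) := \arg\max_{\pi} V^\pi_{\boldsymbol{d}^k}(h_{<t}), \qquad V^*_{\boldsymbol{d}^k}(h_{<t}) := V^{\pi^*_{\boldsymbol{d}^k}}_{\boldsymbol{d}^k}(h_{<t}).$$
If multiple policies are optimal then $\pi^*_{\boldsymbol{d}^k}$ is chosen using some arbitrary rule. Unfortunately, $\pi^*_{\boldsymbol{d}^k}$ need not exist without one further assumption.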
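Assumption 7.
For all $\pi$ and $k \ge 1$, $\lim_{t \to \infty} \sum_{h_{<t}} P(h_{<t} \mid \pi)\, V^\pi_{\boldsymbol{d}^k}(h_{<t}) = 0$.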
Assumption 7 appears somewhat arbitrary. We consider two cases:
For summable the assumption is true for all environments. With the exception of hyperbolic discounting, all frequently used discount vectors are summable.
For non-summable discount vectors the assumption implies a restriction on the possible environments. In particular, they must return asymptotically lower rewards in expectation. This restriction is necessary to guarantee the existence of the value function.
Theorem 8 (Existence of Optimal Policy).
The proof of the existence theorem is in the appendix.
An agent can use a different discount vector for each time . This motivates the following definition.
Definition 9 (Discount Matrix).
A discount matrix $\mathbf{d}$ is an $\infty \times \infty$ matrix with discount vector $\boldsymbol{d}^k$ for the $k$th column.
It is important that we distinguish between a discount matrix (written bold), a discount vector (bold and italics), and a particular value in a discount vector (just italics).
Definition 10 (Sliding Discount Matrix).
A discount matrix $\mathbf{d}$ is sliding if $d^k_{k+t} = d^1_{1+t}$ for all $k \ge 1$ and $t \ge 0$.
Definition 11 (Mixed Policy).
The mixed policy $\pi_{\mathbf{d}}$ is the policy where at each time step $t$, the agent acts according to the possibly different policy $\pi^*_{\boldsymbol{d}^t}$.
We do not denote the mixed policy by $\pi^*_{\mathbf{d}}$ as it is arguably not optimal, as discussed in Section 5. While non-unique optimal policies $\pi^*_{\boldsymbol{d}^k}$ at least result in equal discounted utilities, this is not the case for $\pi_{\mathbf{d}}$. All theorems are proved with respect to any choice of $\pi_{\mathbf{d}}$.
Definition 12 (Time Consistency).
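A discount matrix $\mathbf{d}$ is time consistent if and only if for all environments $\mu$, $\pi^*_{\boldsymbol{d}^k}(h_{<t}) = \pi^*_{\boldsymbol{d}^t}(h_{<t})$ for all $h_{<t}$ where $t \ge k$.
This means that a time-consistent agent taking action $\pi^*_{\boldsymbol{d}^t}(h_{<t})$ at each time $t$ will not change its plans. On the other hand, a time-inconsistent agent may at time 1 intend to take action $a$ should it reach history $h_{<t}$ (that is, $\pi^*_{\boldsymbol{d}^1}(h_{<t}) = a$). However, upon reaching $h_{<t}$, it need not be true that $\pi^*_{\boldsymbol{d}^t}(h_{<t}) = a$.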
In this section we review a number of common discount matrices and give an example where a time-inconsistent discount matrix causes very bad behavior.
Constant Horizon. Constant horizon discounting is where the agent only cares about the future up to $H$ time-steps away, defined by $d^k_t = [\![t - k < H]\!]$, where $[\![\mathit{expr}]\!] = 1$ if $\mathit{expr}$ is true and $0$ otherwise. Shortly we will see that the constant horizon discount matrix can lead to very bad behavior in some environments.
Fixed Lifetime. Fixed lifetime discounting is where an agent knows it will not care about any rewards past time-step $m$, defined by $d^k_t = [\![t \le m]\!]$. Unlike the constant horizon method, a fixed lifetime discount matrix is time-consistent. Unfortunately, it requires the lifetime of the agent to be known beforehand and also makes asymptotic analysis impossible.
Hyperbolic. $d^k_t = 1/(1 + \kappa(t - k))$. The parameter $\kappa$ determines how farsighted the agent is, with smaller values leading to more farsighted agents. Hyperbolic discounting is often used in economics, with some experimental studies explaining human time-inconsistent behavior by suggesting that we discount hyperbolically [Tha81]. The hyperbolic discount matrix is not summable, so it may be replaced by $d^k_t = 1/(1 + \kappa(t - k))^{\beta}$ with $\beta > 1$ (similar to [Hut04]), which has similar properties for $\beta$ close to $1$.
Geometric. $d^k_t = \gamma^t$ with $\gamma \in (0,1)$. Geometric discounting is the most commonly used discount matrix. Philosophically it can be justified by assuming an agent will die (and not care about the future after death) with probability $1 - \gamma$ at each time-step. Another justification for geometric discounting is its analytic simplicity: it is summable and leads to time-consistent policies. It also models fixed interest rates.
No Discounting. $d^k_t = 1$ for all $k$ and $t$. [LH07] and [Leg08] point out that discounting future rewards via an explicit discount matrix is unnecessary since the environment can capture both temporal preferences for early (or late) consumption, as well as the risk associated with delaying consumption. Of course, this “discount matrix” is not summable, but can be made to work by insisting that all environments satisfy Assumption 7. This approach is elegant in the sense that it eliminates the need for a discount matrix, essentially admitting far more complex preferences regarding inter-temporal rewards than a discount matrix allows. On the other hand, a discount matrix gives the “controller” an explicit way to adjust the myopia of the agent.
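The discount matrices listed above can be written as functions of the column index k and the time t. The sketch below is an illustration only: the closed forms are assumptions reconstructed from the prose descriptions, the function names are hypothetical, and the sliding test is run on a finite truncation of each column.

```python
# Illustrative sketch: each matrix is a function d(k, t); closed forms are assumed.

def constant_horizon(H):
    return lambda k, t: 1.0 if t - k < H else 0.0

def fixed_lifetime(m):
    return lambda k, t: 1.0 if t <= m else 0.0

def hyperbolic(kappa):
    return lambda k, t: 1.0 / (1.0 + kappa * (t - k))

def geometric(gamma):
    return lambda k, t: gamma ** t

def no_discounting():
    return lambda k, t: 1.0

def column(d, k, length):
    """Relevant entries of column k: d(k, k), ..., d(k, k + length - 1)."""
    return [d(k, t) for t in range(k, k + length)]

def is_sliding(d, max_k=5, length=10):
    """Check d(k, k + t) == d(1, 1 + t) on a finite window of columns and times."""
    base = column(d, 1, length)
    return all(column(d, k, length) == base for k in range(1, max_k + 1))

sliding_report = {
    "constant horizon": is_sliding(constant_horizon(3)),
    "fixed lifetime": is_sliding(fixed_lifetime(5)),
    "hyperbolic": is_sliding(hyperbolic(1.0)),
    "geometric": is_sliding(geometric(0.9)),
    "no discounting": is_sliding(no_discounting()),
}
print(sliding_report)
```

With the geometric form written as a power of the absolute time t, the columns are not literally sliding, but each is a scalar multiple of the sliding version with exponent t - k, which gives the same ordering of policies.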
To illustrate the potential consequences of time-inconsistent discount matrices we consider the policies of several agents acting in the following environment. Let agent A use a constant horizon discount matrix with $H = 2$ and agent B a geometric discount matrix with some discount rate $\gamma$.
In the first time-step agent A prefers to move right with the intention of moving up in the second time-step for a reward of . However, once in the second time-step, it will change its plan by moving right again. This continues indefinitely, so agent A will always delay moving up and receives zero reward forever.
Agent B acts very differently. Let be the policy in which the agent moves right until time-step , then up and right indefinitely. . This value does not depend on and so the agent will move right until when it will move up and receive a reward.
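The contrast between agents A and B can be simulated. The environment figure is not reproduced here, so the sketch below assumes a hypothetical chain in which moving right at time t yields reward 0, moving up at time t yields reward t/(t+1), and all rewards after moving up are 0. The horizon H = 2, the rate γ = 0.9 and the finite lookahead are likewise assumptions made for illustration.

```python
# Illustrative sketch: reward structure, H, gamma and lookahead are assumptions.

def up_reward(t):
    """Assumed reward for moving up at time-step t."""
    return t / (t + 1)

def planned_stop(d, k, lookahead=200):
    """Time-step at which the agent at time k currently plans to move up."""
    return max(range(k, k + lookahead), key=lambda s: d(k, s) * up_reward(s))

def simulate(d, steps=50):
    """Follow the mixed policy: re-plan with column k of the matrix at every time k."""
    for k in range(1, steps + 1):
        if planned_stop(d, k) == k:      # the current plan says "move up now"
            return k, up_reward(k)
    return None, 0.0                     # never moved up within `steps`

def agent_a(k, t):
    return 1.0 if t - k < 2 else 0.0     # constant horizon, H = 2

def agent_b(k, t):
    return 0.9 ** t                      # geometric, gamma = 0.9

result_a = simulate(agent_a)
result_b = simulate(agent_b)
print("agent A:", result_a)
print("agent B:", result_b)
```

Under these assumptions agent A always finds the reward one step ahead slightly better than the one available now and so never moves up, while agent B settles on a single stopping time and keeps to it.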
The actions of agent A are an example of the worst possible behavior arising from time-inconsistent discounting. Nevertheless, agents with a constant horizon discount matrix are used in all kinds of problems, in particular in zero-sum games, where fixed-depth minimax searches are common. In practice, serious time-inconsistent behavior for game-playing agents seems rare, presumably because most strategic games don’t have a reward structure similar to the example above.
The main theorem of this paper is a complete characterisation of time consistent discount matrices.
Theorem 13 (Characterisation).
Let $\mathbf{d}$ be a discount matrix, then the following are equivalent.
1. $\mathbf{d}$ is time-consistent (Definition 12).
2. For each $k$ there exists an $\alpha_k \in \mathbb{R}$ such that $d^k_t = \alpha_k d^1_t$ for all $t \ge k$.
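Recall that a discount matrix is sliding if $d^k_{k+t} = d^1_{1+t}$. Theorem 13 can be used to show that if a sliding discount matrix is used as in [Str55] then the only time-consistent discount matrix is geometric. Let $\mathbf{d}$ be a time-consistent sliding discount matrix. By Theorem 13 and the definition of sliding, $\alpha_2 d^1_{t+1} = d^2_{t+1} = d^1_t$. Therefore $d^1_2 = \frac{1}{\alpha_2} d^1_1$ and $d^1_3 = \frac{1}{\alpha_2} d^1_2 = \frac{1}{\alpha_2^2} d^1_1$ and similarly, $d^1_t = \gamma^{t-1} d^1_1$ with $\gamma = 1/\alpha_2$, which is geometric discounting. This is the analogue of the results of [Str55] converted to our setting.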
The theorem can also be used to construct time-consistent discount rates. Let be a discount vector, then the discount matrix defined by for all will always be time-consistent, for example, the fixed lifetime discount matrix with if for some horizon . Indeed, all time-consistent discount rates can be constructed in this way (up to scaling).
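The proportionality condition of Theorem 13 suggests a simple numerical test. The following is a sketch under stated assumptions: it checks, on a finite window of columns and times, whether each column is a scalar multiple of the first column from time k onwards. The function name, the window sizes, the tolerance and the example closed forms are all hypothetical choices, and a finite check can only refute time-consistency, not prove it.

```python
# Illustrative sketch: finite-window test of "column k = alpha_k * column 1 for t >= k".

def is_time_consistent(d, max_k=6, max_t=30, tol=1e-12):
    for k in range(1, max_k + 1):
        ts = range(k, max_t + 1)
        first = [d(1, t) for t in ts]    # entries of column 1 from time k
        kth = [d(k, t) for t in ts]      # entries of column k from time k
        pivot = next((i for i, x in enumerate(first) if x != 0), None)
        if pivot is None:
            # Column 1 vanishes on the window, so column k must vanish too.
            if any(abs(y) > tol for y in kth):
                return False
            continue
        alpha = kth[pivot] / first[pivot]
        if any(abs(y - alpha * x) > tol for x, y in zip(first, kth)):
            return False
    return True

consistency_report = {
    "geometric (gamma^t)": is_time_consistent(lambda k, t: 0.9 ** t),
    "geometric (gamma^(t-k))": is_time_consistent(lambda k, t: 0.9 ** (t - k)),
    "fixed lifetime": is_time_consistent(lambda k, t: 1.0 if t <= 5 else 0.0),
    "constant horizon": is_time_consistent(lambda k, t: 1.0 if t - k < 2 else 0.0),
    "hyperbolic": is_time_consistent(lambda k, t: 1.0 / (1.0 + (t - k))),
}
print(consistency_report)
```

The two geometric variants differ only by a per-column scaling, which is why both pass the check.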
Proof of Theorem 13.
$2 \Longrightarrow 1$: This direction follows easily from linearity of the scalar product.
as required. The last equality of (2) follows from the assumption that for all and because for all .
: Let and be the discount vectors used at times and respectively. Now let and consider the deterministic environment below where the agent has a choice between earning reward at time or at time . In this environment there are only two policies, and , where and with the infinite vector with all components zero except the th, which is .
Since is time-consistent, for all and we have:
Now if and only if . Therefore we have that,
Letting be the cosine of the angle between and then Equation (5) becomes . Choosing implies that and so . Therefore there exists such that
Let be a sequence for which . By the previous argument we have that, and . Therefore , and by induction, for all . Now if and then by equation (6). By symmetry, . Therefore for all as required. ∎
In Section 3 we saw an example where time-inconsistency led to very bad behavior. The discount matrix causing this was very time-inconsistent. Is it possible that an agent using a “nearly” time-consistent discount matrix can exhibit similar bad behavior? For example, could rounding errors when using a geometric discount matrix seriously affect the agent’s behavior? The following Theorem shows that this is not possible. First we require a measure of the cost of time-inconsistent behavior. The regret experienced by the agent at time zero from following policy rather than is . We also need a distance measure on the space of discount vectors.
Definition 14 (Distance Measure).
Let be discount vectors then define a distance measure by
Note that this is almost the taxicab metric, but the sum is restricted to .
Theorem 15 (Continuity).
Suppose and then
with , which for is guaranteed to exist by Assumption 7.
Theorem 15 implies that the regret of the agent at time zero in its future time-inconsistent actions is bounded by the sum of the differences between the discount vectors used at different times. If these differences are small then the regret is also small. For example, it implies that small perturbations (such as rounding errors) in a time-consistent discount matrix lead to minimal bad behavior.
The proof is omitted due to limitations in space. It relies on proving the result for finite horizon environments and showing that this extends to the infinite case by using the horizon, , after which the actions of the agent are no longer important. The bound in Theorem 15 is tight in the following sense.
For and and any sufficiently small there exists an environment and discount matrix such that
where and where for all .
Note that in the statement above is the same as that in the statement of Theorem 15. Theorem 16 shows that there exists a discount matrix, environment and where the regret due to time-inconsistency is nearly equal to the bound given by Theorem 15.
Proof of Theorem 16.
Observe that for all since for all except . Now consider the environment below.
For sufficiently small , the agent at time zero will plan to move right and then down leading to and .
To compute note that for all . Therefore the agent in time-step doesn’t care about the next instantaneous reward, so prefers to move right with the intention of moving down in the next time-step when the rewards are slightly better. This leads to . Therefore,
as required. ∎
5 Game Theoretic Approach
What should an agent do if it knows it is time inconsistent? One option is to treat its future selves as “opponents” in an extensive game. The game has one player per time-step who chooses the action for that time-step only. At the end of the game the agent will have received a reward sequence $R \in [0,1]^\infty$. The utility given to the $k$th player is then $R \cdot \boldsymbol{d}^k$. So each player in this game wishes to maximise the discounted reward with respect to a different discounting vector.
For example, let and and consider the environment on the right. Initially, the agent has two choices. It can either move down to guarantee a reward sequence of which has utility of or it can move right in which case it will receive a reward sequence of either with utility or with utility . Which of these two reward sequences it receives is determined by the action taken in the second time-step. However this action is chosen to maximise utility with respect to discount sequence and . This means that if at time the agent chooses to move right, the final reward sequence will be and the final utility with respect to will be . Therefore the rational thing to do in time-step 1 is to move down immediately for a utility of .
The technique above is known as backwards induction. A variant of Kuhn’s theorem proves that it can be used to find sub-game perfect equilibria in finite extensive games [OR94]. For arbitrary extensive games (possibly infinite) a sub-game perfect equilibrium need not exist, but we prove a theorem for our particular class of infinite games.
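Backwards induction over a small deterministic decision tree can be sketched as follows. The tree, the reward sequences and the two discount vectors are hypothetical values chosen for illustration and are not those of the example above. Each leaf is a complete reward sequence, and the player at depth i evaluates leaves with its own discount vector. For comparison, the sketch also computes the mixed policy, in which each player re-plans as if its own preferences would persist.

```python
# Illustrative sketch: tree, rewards and discount vectors are hypothetical.

def utility(rewards, d):
    """Discounted reward of a full reward sequence under discount vector d."""
    return sum(r * w for r, w in zip(rewards, d))

def backward_induction(node, depth, discounts):
    """Return (actions, rewards) when every later player acts in its own interest."""
    if isinstance(node, tuple):                      # leaf: a reward sequence
        return [], node
    best = None
    for action, child in node.items():
        plan, rewards = backward_induction(child, depth + 1, discounts)
        u = utility(rewards, discounts[depth])       # judged by the current player
        if best is None or u > best[0]:
            best = (u, [action] + plan, rewards)
    return best[1], best[2]

def best_plan(node, d):
    """Plan the whole subtree with one discount vector (ignores future selves)."""
    if isinstance(node, tuple):
        return [], node
    options = []
    for action, child in node.items():
        plan, rewards = best_plan(child, d)
        options.append((utility(rewards, d), [action] + plan, rewards))
    _, plan, rewards = max(options, key=lambda o: o[0])
    return plan, rewards

def mixed_policy(node, discounts):
    """Each player takes the first action of its own plan, then the next re-plans."""
    actions, depth = [], 0
    while not isinstance(node, tuple):
        plan, _ = best_plan(node, discounts[depth])
        actions.append(plan[0])
        node = node[plan[0]]
        depth += 1
    return actions, node

A = (0.5, 0.5, 0.0)
B = (0.0, 0.6, 0.0)
C = (0.0, 0.5, 1.0)
tree = {"down": A, "right": {"up": B, "down": C}}
discounts = [(1.0, 1.0, 1.0),    # player 1 weighs all three steps equally
             (0.0, 1.0, 0.0)]    # player 2 only cares about the second reward

equilibrium = backward_induction(tree, 0, discounts)
naive = mixed_policy(tree, discounts)
print("backwards induction:", equilibrium)
print("mixed policy:       ", naive)
```

In this hypothetical tree the first player, anticipating that the second player will move up, prefers to move down immediately, whereas the mixed policy moves right expecting a payoff that the second player then declines to deliver.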
A sub-game perfect equilibrium policy is one the players could agree to play, and subsequently have no incentive to renege on their agreement during play. It isn’t always philosophically clear that a sub-game perfect equilibrium policy should be played. For a deeper discussion, including a number of good examples, see [OR94].
Definition 17 (Sub-game Perfect Equilibria).
A policy is a sub-game perfect equilibrium policy if and only if for each , where is any policy satisfying where .
Theorem 18 (Existence of Sub-game Perfect Equilibrium Policy).
Many results in the literature of game theory almost prove this theorem. Our setting is more difficult than most because we have countably many players (one for each time-step) and exogenous uncertainty. Fortunately, it is made easier by the very particular conditions on the preferences of players for rewards that occur late in the game (Assumption 7). The closest related work appears to be that of Drew Fudenberg in [Fud83], but our proof (see appendix) is very different. The proof idea is to consider a sequence of environments identical to the original environment but with an increasing bounded horizon after which reward is zero. By Kuhn’s Theorem [OR94] a sub-game perfect equilibrium policy must exist in each of these finite games. However, the space of policies is compact (Lemma 23) and so this sequence of sub-game perfect equilibrium policies contains a convergent sub-sequence. It is not then hard to show that the limit policy is a sub-game perfect equilibrium policy in the original environment.
Proof of Theorem 18.
Add an action to and such that if is taken at any time in then returns zero reward. Essentially, once in the agent takes action , the agent receives zero reward forever. Now if is a sub-game perfect equilibrium policy in this modified environment then it is a sub-game perfect equilibrium policy in the original one.
For each choose to be a sub-game perfect equilibrium policy in the further modified environment obtained by setting if . That is, the environment which gives zero reward always after time . We can assume without loss of generality that for all . Since is compact, the sequence has a convergent subsequence converging to and satisfying
is a sub-game perfect equilibrium policy in the modified environment with reward if .
We write for the value function in the modified environment. It is now shown that is a sub-game perfect equilibrium policy in the original environment. Fix a and let be a policy with for all where . Now define policies by
By point 1 above, for all where . Now for all we have
where (7) follows from arithmetic. (8) since . (9) since is a sub-game perfect equilibrium policy. (10) by arithmetic. We now show that the absolute value terms in (10) converge to zero. Since is continuous in and and , we obtain . Now if , so . Therefore taking the limit as goes to infinity in (10) shows that as required. ∎
In general, need not be unique, and different sub-game equilibrium policies can lead to different utilities. This is a normal, but unfortunate, problem with the sub-game equilibrium solution concept. The policy is unique if for all players the value of any two arbitrary policies is different. Also, if is true then the non-unique sub-game equilibrium policies have the same values for all agents. Unfortunately, neither of these conditions is necessarily satisfied in our setup. The problem of how players might choose a sub-game perfect equilibrium policy appears surprisingly understudied. We feel it provides another reason to avoid the situation altogether by using time-consistent discount matrices. The following example illustrates the problem of non-unique sub-game equilibrium policies.
Consider the example in Section 3 with an agent using a constant horizon discount matrix with . There are exactly two sub-game perfect equilibrium policies, and defined by,
Note that the reward sequences (and values) generated by and are different with and . If the players choose to play a sub-game perfect equilibrium policy then the first player can choose between and since they have the first move. In that case it would be best to follow by moving right as it has a greater return for the agent at time than .
For time-consistent discount matrices we have the following proposition.
If is time-consistent then for all and choices of and and .
Is it possible that backwards induction is simply expected discounted reward maximisation in another form? The following theorem shows this is not the case and that sub-game perfect equilibrium policies are a rich and interesting class worthy of further study in this (and more general) settings.
The result is proven using a simple counter-example. The idea is to construct a stochastic environment where the first action leads the agent to one of two sub-environments, each with probability half. These environments are identical to the example at the start of this section, but one of them has the reward (rather than ) for the history . It is then easily shown that is not the result of an expectimax expression because it behaves differently in each sub-environment, while any expectimax search (irrespective of discounting) will behave the same in each.
Summary. Theorem 13 gives a characterisation of time-(in)consistent discount matrices and shows that all time-consistent discount matrices follow the simple form $d^k_t = \alpha_k d^1_t$ for $t \ge k$. Theorem 15 shows that using a discount matrix that is nearly time-consistent produces mixed policies with low regret. This is useful for a few reasons, including showing that small perturbations, such as rounding errors, in a discount matrix cannot cause major time-inconsistency problems. It also shows that “cutting off” time-consistent discount matrices after some fixed depth, which makes the agent potentially time-inconsistent, doesn’t affect the policies too much, provided the depth is large enough. When a discount matrix is very time-inconsistent then taking a game theoretic approach may dramatically decrease the regret in the change of policy over time.
Some comments on the three kinds of policy considered: the optimal policy $\pi^*_{\boldsymbol{d}^k}$ (policy maximising expected $\boldsymbol{d}^k$-discounted reward), the mixed policy $\pi_{\mathbf{d}}$ (using $\pi^*_{\boldsymbol{d}^t}$ at each time-step $t$) and the sub-game perfect equilibrium policy.
A time-consistent agent should play the optimal policy $\pi^*_{\boldsymbol{d}^k}$ for any $k$. In this case, every optimal policy is also a sub-game perfect equilibrium policy.
The mixed policy $\pi_{\mathbf{d}}$ will be played by an agent that believes it is time-consistent, but may not be. This can lead to very bad behavior as shown in Section 3.
An agent may play a sub-game perfect equilibrium policy if it knows it is time-inconsistent, and also knows exactly how (i.e., it knows $\boldsymbol{d}^k$ for all $k$ at every time-step). This policy is arguably rational, but comes with its own problems, especially non-uniqueness as discussed.
Assumptions. We made a number of assumptions about which we make some brief comments.
Assumption 1, which states that $\mathcal{A}$ and $\mathcal{O}$ are finite, guarantees the existence of an optimal policy. Removing the assumption would force us to use $\epsilon$-optimal policies, which shouldn’t be a problem for the theorems to go through with an additive slop term in some cases.
Assumption 7 only affects non-summable discount vectors. Without it, even $\epsilon$-optimal policies need not exist and all the machinery will break down.
The use of discrete time greatly reduced the complexity of the analysis. Given a sufficiently general model, the set of continuous environments should contain all discrete environments. For this reason the proof of Theorem 13 should go through essentially unmodified. The same may not be true for Theorems 15 and 18. The former may be fixable with substantial effort (and perhaps should be true intuitively). The latter has been partially addressed, with a positive result in [Gol80, PY73, Pol68, Str55].
- [FOO02] Shane Frederick, George Loewenstein, and Ted O’Donoghue. Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2), 2002.
- [Fud83] Drew Fudenberg. Subgame-perfect equilibria of finite and infinite-horizon games. Journal of Economic Theory, 31(2), 1983.
- [GFM94] Leonard Green, Nathanael Fristoe, and Joel Myerson. Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin & Review, 1(3):383–389, 1994.
- [Gol80] Steven M. Goldman. Consistent plans. The Review of Economic Studies, 47(3):533–537, 1980.
- [Hut04] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.
- [Hut06] Marcus Hutter. General discounting versus average reward. In Proc. 17th International Conf. on Algorithmic Learning Theory (ALT’06), volume 4264 of LNAI, pages 244–258, Barcelona, 2006. Springer, Berlin.
- [Leg08] Shane Legg. Machine Super Intelligence. PhD thesis, University of Lugano, 2008.
- [LH07] Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
- [OR94] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
- [Pol68] Robert A. Pollak. Consistent planning. The Review of Economic Studies, 35(2):201–208, 1968.
- [PY73] Bezalel Peleg and Menahem E. Yaari. On the existence of a consistent course of action when tastes are changing. The Review of Economic Studies, 40(3):391–401, 1973.
- [Sam37] Paul A. Samuelson. A note on measurement of utility. The Review of Economic Studies, 4(2):155–161, 1937.
- [Str55] Robert H. Strotz. Myopia and inconsistency in dynamic utility maximization. The Review of Economic Studies, 23(3):165–180, 1955.
- [Tha81] Richard Thaler. Some empirical evidence on dynamic inconsistency. Economics Letters, 8(3):201–207, 1981.
Appendix A Technical Proofs
Before the proof of Theorem 8 we require a definition and two lemmas.
Let be the set of all policies and define a metric on by