1 Introduction
The goal of an agent is to maximise its expected utility; but how do we measure utility? One method is to assign an instantaneous reward to particular events, such as having a good meal, or a pleasant walk. It would be natural to measure the utility of a plan (policy) by simply summing the expected instantaneous rewards, but for immortal agents this may lead to infinite utility and also assumes rewards are equally valuable irrespective of the time at which they are received.
One solution, the discounted utility (DU) model introduced by Samuelson in [Sam37], is to take a weighted sum of the rewards with earlier rewards usually valued more than later ones.
There have been a number of criticisms of the DU model, which we will not discuss. For an excellent summary, see [FOO02]. Despite the criticisms, the DU model is widely used in both economics and computer science.
A discount function is time-inconsistent if plans chosen to maximise expected discounted utility change over time. For example, many people express a preference for $110 in 31 days over $100 in 30 days, but reverse that preference 30 days later when given a choice between $110 tomorrow and $100 today [GFM94]. This behavior can arise from a rational agent with a time-inconsistent discount function.
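This reversal is easy to reproduce numerically with a hyperbolic discount function of the form $1/(1+\kappa k)$ (introduced in Section 3). The sketch below is our own illustration, with an arbitrary $\kappa = 1$ and the dollar amounts above:

```python
def hyperbolic(k, kappa=1.0):
    """Discount weight applied to a reward k timesteps in the future."""
    return 1.0 / (1.0 + kappa * k)

# Viewed from today: $100 in 30 days vs $110 in 31 days.
prefer_later_now = 110 * hyperbolic(31) > 100 * hyperbolic(30)

# Viewed 30 days later: $100 today vs $110 tomorrow.
prefer_later_then = 110 * hyperbolic(1) > 100 * hyperbolic(0)

assert prefer_later_now and not prefer_later_then  # the preference reverses
```

Any $\kappa > 0$ produces the same qualitative reversal: the ratio of the two discount weights drifts as both rewards move closer in time.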
Unfortunately, time-inconsistent discount functions can lead to extremely bad behavior, and so it becomes important to ask which discount functions are time-inconsistent.
Previous work has focussed on a continuous model where agents can act at any point in continuous time. We consider a discrete model where agents act at discrete timesteps. In general this is not a limitation, since any continuous environment can be approximated arbitrarily well by a discrete one. The discrete setting has the advantage of easier analysis, which allows us to consider a very general setup in which environments are arbitrary finite or infinite Markov decision processes.
Traditionally, the DU model has assumed a sliding discount function. Formally, a sequence of instantaneous utilities (rewards) $r_t, r_{t+1}, r_{t+2}, \dots$ starting at time $t$ is given utility equal to $\sum_{k=0}^{\infty} \gamma_k r_{t+k}$, where $\gamma_k \geq 0$. We generalise this model as in [Hut06] by allowing the discount function to depend on the age of the agent. The new utility is given by $\sum_{k=t}^{\infty} d^t_k r_k$, where $\boldsymbol{d}^t$ is the discount vector used at time $t$. This generalisation is consistent with how some agents tend to behave; for example, humans become temporally less myopic as they grow older.
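Concretely, the generalised utility is a dot product between the age-dependent discount vector and the reward sequence. A minimal sketch (our own illustrative numbers, not from the paper):

```python
def utility(discounts, rewards, t=0):
    """Discounted utility sum_{k >= t} d_k * r_k of a reward sequence."""
    return sum(d * r for d, r in zip(discounts[t:], rewards[t:]))

rewards = [1.0, 0.0, 1.0, 1.0]

# Sliding (geometric) discounting viewed from time 0: weight 0.5^k on reward r_k.
sliding = [0.5 ** k for k in range(4)]
assert utility(sliding, rewards) == 1.0 + 0.25 + 0.125

# An age-dependent agent may weight the same rewards differently at each age t;
# here the agent at t = 1 values all remaining rewards equally.
assert utility([0.0, 1.0, 1.0, 1.0], rewards, t=1) == 2.0
```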
Strotz [Str55] showed that the only time-consistent sliding discount function is geometric discounting. We extend this result to a full characterisation of time-consistent discount functions where the discount function is permitted to change over time. We also show that discount functions that are “nearly” time-consistent give rise to low regret from the anticipated future changes of the policy over time.
Another important question is what policy should be adopted by an agent that knows it is time-inconsistent. For example, if it knows it will become temporarily myopic in the near future, then it may benefit from paying a price to precommit to following a particular policy. A number of authors have examined this question in special continuous cases, including [Gol80, PY73, Pol68, Str55]. We adapt their results to our general, but discrete, setting using game theory.
The paper is structured as follows. First the required notation is introduced (Section 2). Example discount functions and the consequences of timeinconsistent discount functions are then presented (Section 3). We next state and prove the main theorems, the complete classification of discount functions and the continuity result (Section 4). The game theoretic view of what an agent should do if it knows its discount function is changing is analyzed (Section 5). Finally we offer some discussion and concluding remarks (Section 6).
2 Notation and Problem Setup
The general reinforcement learning (RL) setup involves an agent interacting sequentially with an environment where in each timestep $t$ the agent chooses some action $a_t \in \mathcal{A}$, whereupon it receives a reward $r_t \in \mathcal{R}$ and observation $o_t \in \mathcal{O}$. The environment can be formally defined as a probability distribution $\mu$, where $\mu(r, o \mid h, a)$ is the probability of receiving reward $r$ and observation $o$ having taken action $a$ after history $h$. For convenience, we assume that for a given history $h$ and action $a$, the reward is fixed (not stochastic). We denote the set of all finite histories by $H$ and write $h_t$ for a history of length $t$. $a_k$, $r_k$ and $o_k$ are the $k$th action/reward/observation tuple of history $h$ and will be used without explicitly redefining them (there will always be only one history “in context”).

A deterministic environment (where every value of $\mu$ is either 1 or 0) can be represented as a graph with edges for actions, rewards of each action attached to the corresponding edge, and observations in the nodes. For example, the deterministic environment on the right represents an environment where either pizza or pasta must be chosen at each timestep (evening). An action leading to an upper node is eat pizza while the ones leading to a lower node are eat pasta. The rewards are those of a consumer who prefers pizza to pasta, but dislikes having the same food twice in a row. The starting node is marked. This example, along with all those for the remainder of this paper, does not require observations.
The following assumption is required for clean results, but may be relaxed if an $\epsilon$ of slop is permitted in some results.
Assumption 1.
We assume that $\mathcal{A}$ and $\mathcal{O}$ are finite and that $\mathcal{R} \subseteq [0, 1]$.
Definition 2 (Policy).
A policy is a mapping $\pi : H \to \mathcal{A}$ giving an action for each history.
Given policy $\pi$ and histories $h_t$ and $h_n$ with $t \leq n$, the probability of reaching history $h_n$ when starting from history $h_t$ is $P^{\pi}(h_n \mid h_t)$, which is defined by,

(1) $P^{\pi}(h_n \mid h_t) := \prod_{k=t+1}^{n} \mu(r_k, o_k \mid h_{k-1}, a_k)$ if $a_k = \pi(h_{k-1})$ for all $t < k \leq n$, and $0$ otherwise.

If $t = 0$ then we abbreviate and write $P^{\pi}(h_n) := P^{\pi}(h_n \mid h_0)$.
Definition 3 (Expected Rewards).
When applying policy $\pi$ starting from history $h_t$, the expected sequence of rewards $\boldsymbol{R}(\pi, h_t) = (R_0, R_1, R_2, \dots)$ is defined by

$R_k := \sum_{h_k} P^{\pi}(h_k \mid h_t) \, r_k$ for $k \geq t$, and $R_k := 0$ for $k < t$.

If $t = 0$ then $\boldsymbol{R}(\pi) := \boldsymbol{R}(\pi, h_0)$.
Note that while the set of all possible $h_k$ is uncountable due to the reward term, we sum only over the possible rewards, which are determined by the action and previous history, and so this is actually a finite sum.
Definition 4 (Discount Vector).
A discount vector is a vector $\boldsymbol{d}^t = (d^t_0, d^t_1, d^t_2, \dots)$ with $d^t_k \geq 0$.
The apparently superfluous superscript $t$ will be useful later when we allow the discount vector to change with time. We do not insist that the discount vector be summable, $\sum_{k=0}^{\infty} d^t_k < \infty$.
Definition 5 (Expected Values).
The expected discounted reward (or utility or value) when using policy $\pi$ starting in history $h_t$ with discount vector $\boldsymbol{d}$ is

$V^{\boldsymbol{d}}(\pi, h_t) := \boldsymbol{d} \cdot \boldsymbol{R}(\pi, h_t) = \sum_{k=t}^{\infty} d_k R_k$

The sum can be taken to start from $k = t$ since $R_k = 0$ for $k < t$. This means that the value of $d_k$ for $k < t$ is unimportant, and never will be for any result in this paper. As the scalar product is linear, a scaling of a discount vector has no effect on the ordering of the policies. Formally, if $\boldsymbol{d}' = \lambda \boldsymbol{d}$ for some $\lambda > 0$ then $V^{\boldsymbol{d}}(\pi, h) \geq V^{\boldsymbol{d}}(\tilde{\pi}, h) \iff V^{\boldsymbol{d}'}(\pi, h) \geq V^{\boldsymbol{d}'}(\tilde{\pi}, h)$ for all $\pi$, $\tilde{\pi}$ and $h$.
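The scaling-invariance claim is easy to sanity-check numerically. In this sketch (our own toy numbers), three expected reward sequences are ranked identically under $\boldsymbol{d}$ and $3\boldsymbol{d}$:

```python
def value(d, rewards):
    # V = d . R, the discounted value of an expected reward sequence
    return sum(dk * rk for dk, rk in zip(d, rewards))

d = [1.0, 0.5, 0.25]
seqs = [[1, 0, 0], [0, 1, 1], [0.4, 0.4, 0.4]]

rank = sorted(range(len(seqs)), key=lambda i: value(d, seqs[i]))
rank_scaled = sorted(range(len(seqs)), key=lambda i: value([3 * x for x in d], seqs[i]))
assert rank == rank_scaled  # scaling d leaves the policy ordering unchanged
```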
Definition 6 (Optimal Policy/Value).
In general, our agent will try to choose a policy to maximise $V^{\boldsymbol{d}}(\pi, h)$. This is defined as follows.

$\pi^*_{\boldsymbol{d}} := \arg\max_{\pi} V^{\boldsymbol{d}}(\pi, h)$

If multiple policies are optimal then $\pi^*_{\boldsymbol{d}}$ is chosen using some arbitrary rule. Unfortunately, $\pi^*_{\boldsymbol{d}}$ need not exist without one further assumption.
Assumption 7.
For all $h_t$ and $\boldsymbol{d}$, $\lim_{m \to \infty} \sup_{\pi} \sum_{k=m}^{\infty} d_k R_k(\pi, h_t) = 0$.
Assumption 7 appears somewhat arbitrary. We consider:

1. For summable $\boldsymbol{d}$ the assumption is true for all environments. With the exception of hyperbolic discounting, all frequently used discount vectors are summable.

2. For non-summable discount vectors the assumption implies a restriction on the possible environments. In particular, they must return asymptotically lower rewards in expectation. This restriction is necessary to guarantee the existence of the value function.
From now on, including in theorem statements, we only consider environments/discount vectors satisfying Assumptions 1 and 7. The following theorem then guarantees the existence of $\pi^*_{\boldsymbol{d}}$.
Theorem 8 (Existence of Optimal Policy).
The proof of the existence theorem is in the appendix.
An agent can use a different discount vector $\boldsymbol{d}^t$ for each time $t$. This motivates the following definition.
Definition 9 (Discount Matrix).
A discount matrix $\mathbf{d}$ is a matrix with discount vector $\boldsymbol{d}^t$ for the $t$th column.
It is important that we distinguish between a discount matrix (written bold), a discount vector (bold and italics), and a particular value in a discount vector (just italics).
Definition 10 (Sliding Discount Matrix).
A discount matrix is sliding if $d^t_{t+k} = d^1_{1+k}$ for all $t$ and $k \geq 0$.
Definition 11 (Mixed Policy).
The mixed policy $\bar{\pi}$ is the policy where at each time step $t$, the agent acts according to the possibly different policy $\pi^*_{\boldsymbol{d}^t}$; that is, $\bar{\pi}(h_t) := \pi^*_{\boldsymbol{d}^t}(h_t)$.
We do not denote the mixed policy by $\pi^*$ as it is arguably not optimal, as discussed in Section 5. While non-unique optimal policies at least result in equal discounted utilities, this is not the case for $\bar{\pi}$. All theorems are proved with respect to any choice of $\pi^*_{\boldsymbol{d}^t}$.
Definition 12 (Time Consistency).
A discount matrix $\mathbf{d}$ is time-consistent if and only if for all environments $\mu$, $\pi^*_{\boldsymbol{d}^t}(h_t) = \pi^*_{\boldsymbol{d}^1}(h_t)$ for all $t$ and $h_t$ where $P^{\pi^*_{\boldsymbol{d}^1}}(h_t) > 0$.
This means that a time-consistent agent taking action $\bar{\pi}(h_t)$ at each time $t$ will not change its plans. On the other hand, a time-inconsistent agent may at time 1 intend to take action $a := \pi^*_{\boldsymbol{d}^1}(h_t)$ should it reach history $h_t$. However upon reaching $h_t$, it need not be true that $\pi^*_{\boldsymbol{d}^t}(h_t) = a$.
3 Examples
In this section we review a number of common discount matrices and give an example where a time-inconsistent discount matrix causes very bad behavior.
Constant Horizon. Constant horizon discounting is where the agent only cares about the future up to $H$ timesteps away, defined by $d^t_k := [\![ k - t < H ]\!]$, where $[\![ P ]\!] = 1$ if $P$ is true and $0$ otherwise. Shortly we will see that the constant horizon discount matrix can lead to very bad behavior in some environments.
Fixed Lifetime. Fixed lifetime discounting is where an agent knows it will not care about any rewards past timestep $m$, defined by $d^t_k := [\![ k \leq m ]\!]$. Unlike the constant horizon method, a fixed lifetime discount matrix is time-consistent. Unfortunately it requires knowing the lifetime of the agent beforehand and also makes asymptotic analysis impossible.
Hyperbolic. $d^t_k := \frac{1}{1 + \kappa(k - t)}$. The parameter $\kappa$ determines how farsighted the agent is, with smaller values leading to more farsighted agents. Hyperbolic discounting is often used in economics, with some experimental studies explaining human time-inconsistent behavior by suggesting that we discount hyperbolically [Tha81]. The hyperbolic discount matrix is not summable, so may be replaced by $d^t_k := \frac{1}{(1 + \kappa(k - t))^{1+\epsilon}}$ (similar to [Hut04]), which has similar properties for $\epsilon$ close to $0$.
Geometric. $d^t_k := \gamma^{k-t}$ with $\gamma \in (0, 1)$. Geometric discounting is the most commonly used discount matrix. Philosophically it can be justified by assuming an agent will die (and not care about the future after death) with probability $1 - \gamma$ at each timestep. Another justification for geometric discounting is its analytic simplicity: it is summable and leads to time-consistent policies. It also models fixed interest rates.
No Discounting. $d^t_k := 1$. [LH07] and [Leg08] point out that discounting future rewards via an explicit discount matrix is unnecessary, since the environment can capture both temporal preferences for early (or late) consumption, as well as the risk associated with delaying consumption. Of course, this “discount matrix” is not summable, but can be made to work by insisting that all environments satisfy Assumption 7. This approach is elegant in the sense that it eliminates the need for a discount matrix, essentially admitting far more complex preferences regarding intertemporal rewards than a discount matrix allows. On the other hand, a discount matrix gives the “controller” an explicit way to adjust the myopia of the agent.
To illustrate the potential consequences of time-inconsistent discount matrices we consider the policies of several agents acting in the following environment. Let agent A use a constant horizon discount matrix and agent B a geometric discount matrix with some discount rate $\gamma$.
In the first timestep agent A prefers to move right with the intention of moving up in the second timestep for a larger reward. However, once in the second timestep, it will change its plan by moving right again. This continues indefinitely, so agent A will always delay moving up and receives zero reward forever.
Agent B acts very differently. Let $\pi_t$ be the policy in which the agent moves right until timestep $t$, then up and right indefinitely. The value of $\pi_t$ does not depend on $t$, and so the agent will move right until some timestep at which it will move up and receive a reward.
The actions of agent A are an example of the worst possible behavior arising from time-inconsistent discounting. Nevertheless, agents with a constant horizon discount matrix are used in all kinds of problems, in particular agents in zero-sum games, where fixed-depth minimax searches are common. In practice, serious time-inconsistent behavior for game-playing agents seems rare, presumably because most strategic games don’t have a reward structure similar to the example above.
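Agent A's procrastination can be simulated. The environment below is our own stand-in for the figure (a hypothetical reward structure, not the paper's): at each timestep the agent may move up for an immediate reward of $2^t$, or move right for nothing, and a constant-horizon agent with $H = 2$ re-plans at every step:

```python
H = 2  # planning horizon: the agent only values rewards k with k - t < H

def reward_up(t):
    # moving up at time t pays 2^t immediately (hypothetical reward structure)
    return 2 ** t

collected = 0
for t in range(50):
    # the two plans visible inside the horizon at time t:
    up_now = reward_up(t)        # reward received at time t
    up_next = reward_up(t + 1)   # reward at time t + 1, still inside the horizon
    if up_now >= up_next:
        collected = up_now
        break
# the delayed plan always looks better within the horizon, so the agent
# re-plans the same delay at every step and never collects anything
assert collected == 0
```

A geometric agent in the same environment weighs the later reward by an extra factor of $\gamma$, so for $\gamma < 1/2$ it moves up without delay.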
4 Theorems
The main theorem of this paper is a complete characterisation of time-consistent discount matrices.
Theorem 13 (Characterisation).
Let $\mathbf{d}$ be a discount matrix; then the following are equivalent.

1. $\mathbf{d}$ is time-consistent (Definition 12).

2. For each $t$ there exists an $\alpha_t > 0$ such that $d^t_k = \alpha_t d^1_k$ for all $k \geq t$.
Recall that a discount matrix is sliding if $d^t_{t+k} = d^1_{1+k}$. Theorem 13 can be used to show that if a sliding discount matrix is used, as in [Str55], then the only time-consistent discount matrix is geometric. Let $\mathbf{d}$ be a time-consistent sliding discount matrix. By Theorem 13 and the definition of sliding, $\alpha_t d^1_{t+k} = d^t_{t+k} = d^1_{1+k}$ for all $k \geq 0$. Therefore $d^1_{t+k} = d^1_{1+k} / \alpha_t$ and, in particular with $t = 2$, $d^1_{k+2} = d^1_{k+1} / \alpha_2$. Hence $d^1_k = d^1_1 \gamma^{k-1}$ with $\gamma := 1/\alpha_2$, which is geometric discounting. This is the analogue of the results of [Str55] converted to our setting.
The theorem can also be used to construct time-consistent discount matrices. Let $\boldsymbol{d}$ be a discount vector; then the discount matrix defined by $\boldsymbol{d}^t := \boldsymbol{d}$ for all $t$ will always be time-consistent, for example, the fixed lifetime discount matrix with $d_k := [\![ k \leq m ]\!]$ for some horizon $m$. Indeed, all time-consistent discount matrices can be constructed in this way (up to scaling).
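The condition of Theorem 13 is mechanically checkable for truncated discount matrices. The sketch below is our own code (rows are the vectors $\boldsymbol{d}^t$, indexed from $t = 0$); it verifies that a geometric matrix passes while a constant-horizon matrix fails:

```python
def time_consistent(D, tol=1e-9):
    """Check the Theorem 13 condition: row t proportional to row 0 on columns k >= t."""
    base = D[0]
    for t, row in enumerate(D):
        cols = range(t, len(row))
        nz = [k for k in cols if abs(base[k]) > tol]
        if not nz:
            # base vanishes on k >= t, so row t must vanish there too
            if any(abs(row[k]) > tol for k in cols):
                return False
            continue
        alpha = row[nz[0]] / base[nz[0]]
        if any(abs(row[k] - alpha * base[k]) > tol for k in cols):
            return False
    return True

n = 8
# sliding geometric: d^t_k = 0.5^(k-t) for k >= t
geometric = [[0.5 ** (k - t) if k >= t else 0.0 for k in range(n)] for t in range(n)]
# constant horizon H = 2: d^t_k = 1 iff t <= k < t + 2
horizon = [[1.0 if t <= k < t + 2 else 0.0 for k in range(n)] for t in range(n)]
assert time_consistent(geometric) and not time_consistent(horizon)
```

The geometric rows are proportional with $\alpha_t = \gamma^{-t}$, exactly the form demanded by the theorem; the horizon matrix fails already at $t = 1$, where $d^1_2 = 1$ but $d^0_2 = 0$.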
Proof of Theorem 13.
$(2 \Rightarrow 1)$: This direction follows easily from linearity of the scalar product.

(2) $V^{\boldsymbol{d}^t}(\pi, h_t) = \sum_{k=t}^{\infty} d^t_k R_k(\pi, h_t) = \alpha_t \sum_{k=t}^{\infty} d^1_k R_k(\pi, h_t) = \alpha_t V^{\boldsymbol{d}^1}(\pi, h_t)$

as required. The equalities in (2) follow from the assumption that $d^t_k = \alpha_t d^1_k$ for all $k \geq t$ and because $R_k(\pi, h_t) = 0$ for $k < t$.
$(1 \Rightarrow 2)$: Let $\boldsymbol{d}^{t_1}$ and $\boldsymbol{d}^{t_2}$ be the discount vectors used at times $t_1$ and $t_2$ respectively. Now let $k_2 > k_1 \geq \max\{t_1, t_2\}$ and consider the deterministic environment below where the agent has a choice between earning reward $r_1$ at time $k_1$ or $r_2$ at time $k_2$. In this environment there are only two policies, $\pi_1$ and $\pi_2$, where $\boldsymbol{R}(\pi_1) = r_1 \boldsymbol{e}_{k_1}$ and $\boldsymbol{R}(\pi_2) = r_2 \boldsymbol{e}_{k_2}$, with $\boldsymbol{e}_k$ the infinite vector with all components zero except the $k$th, which is 1.
Since $\mathbf{d}$ is time-consistent, for all $r_1$ and $r_2$ we have:

(3) $d^{t_1}_{k_1} r_1 \geq d^{t_1}_{k_2} r_2$ if and only if
(4) $d^{t_2}_{k_1} r_1 \geq d^{t_2}_{k_2} r_2$

Now (3) holds if and only if (4) does, for every choice of $r_1$ and $r_2$. Therefore we have that,

(5) $d^{t_1}_{k_1} d^{t_2}_{k_2} = d^{t_1}_{k_2} d^{t_2}_{k_1}$

Letting $\theta$ be the cosine of the angle between $(d^{t_1}_{k_1}, d^{t_1}_{k_2})$ and $(d^{t_2}_{k_1}, d^{t_2}_{k_2})$, Equation (5) becomes $\theta = 1$, and so the two vectors are parallel. Therefore there exists $\alpha > 0$ such that

(6) $(d^{t_1}_{k_1}, d^{t_1}_{k_2}) = \alpha (d^{t_2}_{k_1}, d^{t_2}_{k_2})$

Let $k_1 < k_2 < k_3 < \cdots$ be the sequence of all $k \geq \max\{t_1, t_2\}$ for which $d^{t_2}_k > 0$. By the previous argument we have that $(d^{t_1}_{k_1}, d^{t_1}_{k_2}) = \alpha_1 (d^{t_2}_{k_1}, d^{t_2}_{k_2})$ and $(d^{t_1}_{k_2}, d^{t_1}_{k_3}) = \alpha_2 (d^{t_2}_{k_2}, d^{t_2}_{k_3})$. Therefore $\alpha_1 = \alpha_2$, and by induction, $d^{t_1}_{k_i} = \alpha d^{t_2}_{k_i}$ for all $i$. Now if $d^{t_2}_k = 0$ and $k \geq \max\{t_1, t_2\}$ then $d^{t_1}_k = 0$ by equation (6). By symmetry, $d^{t_2}_k = 0$ whenever $d^{t_1}_k = 0$. Therefore $d^{t_1}_k = \alpha d^{t_2}_k$ for all $k \geq \max\{t_1, t_2\}$ as required. ∎
In Section 3 we saw an example where time-inconsistency led to very bad behavior. The discount matrix causing this was very time-inconsistent. Is it possible that an agent using a “nearly” time-consistent discount matrix can exhibit similar bad behavior? For example, could rounding errors when using a geometric discount matrix seriously affect the agent’s behavior? The following theorem shows that this is not possible. First we require a measure of the cost of time-inconsistent behavior. The regret experienced by the agent at time zero from following policy $\bar{\pi}$ rather than $\pi^*_{\boldsymbol{d}^1}$ is $V^{\boldsymbol{d}^1}(\pi^*_{\boldsymbol{d}^1}) - V^{\boldsymbol{d}^1}(\bar{\pi})$. We also need a distance measure on the space of discount vectors.
Definition 14 (Distance Measure).
Let $\boldsymbol{d}$ and $\boldsymbol{d}'$ be discount vectors; then define a distance measure by

$D_t(\boldsymbol{d}, \boldsymbol{d}') := \sum_{k=t}^{\infty} |d_k - d'_k|$

Note that this is almost the taxicab metric, but the sum is restricted to $k \geq t$.
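A minimal implementation of this distance on truncated discount vectors (our own sketch):

```python
def dist(d1, d2, t=0):
    """Taxicab distance between discount vectors, restricted to indices k >= t."""
    return sum(abs(a - b) for a, b in zip(d1[t:], d2[t:]))

d = [1.0, 0.5, 0.25, 0.125]
d_rounded = [1.0, 0.5, 0.3, 0.1]  # e.g. a coarsely rounded copy of d

assert abs(dist(d, d_rounded) - 0.075) < 1e-12
assert abs(dist(d, d_rounded, t=3) - 0.025) < 1e-12
```

Restricting the sum to $k \geq t$ matches the fact (Definition 5) that the entries $d_k$ with $k < t$ never influence the agent's behavior at time $t$.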
Theorem 15 (Continuity).
Theorem 15 implies that the regret of the agent at time zero from its future time-inconsistent actions is bounded by the sum of the differences between the discount vectors used at different times. If these differences are small then the regret is also small. For example, it implies that small perturbations (such as rounding errors) in a time-consistent discount matrix lead to minimal bad behavior.
The proof is omitted due to space limitations. It relies on proving the result for finite-horizon environments and showing that this extends to the infinite case by using the horizon after which the actions of the agent are no longer important. The bound in Theorem 15 is tight in the following sense.
Theorem 16.
For and and any sufficiently small there exists an environment and discount matrix such that
where and where for all .
Note that $\epsilon$ in the statement above is the same as that in the statement of Theorem 15. Theorem 16 shows that there exists a discount matrix, environment and $\epsilon$ where the regret due to time-inconsistency is nearly equal to the bound given by Theorem 15.
Proof of Theorem 16.
Define by
Observe that for all since for all except . Now consider the environment below.
For sufficiently small , the agent at time zero will plan to move right and then down leading to and .
To compute note that for all . Therefore the agent in timestep doesn’t care about the next instantaneous reward, so prefers to move right with the intention of moving down in the next timestep when the rewards are slightly better. This leads to . Therefore,
as required. ∎
5 Game Theoretic Approach
What should an agent do if it knows it is time-inconsistent? One option is to treat its future selves as “opponents” in an extensive game. The game has one player per timestep, who chooses the action for that timestep only. At the end of the game the agent will have received a reward sequence $r_1, r_2, r_3, \dots$. The utility given to the $t$th player is then $\sum_{k=t}^{\infty} d^t_k r_k$. So each player in this game wishes to maximise the discounted reward with respect to a different discount vector.
For example, let and and consider the environment on the right. Initially, the agent has two choices. It can either move down to guarantee a reward sequence of which has utility of or it can move right in which case it will receive a reward sequence of either with utility or with utility . Which of these two reward sequences it receives is determined by the action taken in the second timestep. However this action is chosen to maximise utility with respect to discount sequence and . This means that if at time the agent chooses to move right, the final reward sequence will be and the final utility with respect to will be . Therefore the rational thing to do in timestep 1 is to move down immediately for a utility of .
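Backwards induction over a small deterministic game tree can be sketched as follows. The tree, the discount vectors and the reward numbers are our own illustrative stand-ins for the example above (players are indexed from 0; player $t$ values rewards from time $t$ onwards with weights $d^t$):

```python
def induct(tree, t, d):
    """Return the reward sequence realised from this node under subgame perfect play.

    tree: None (leaf) or a dict mapping action -> (immediate reward, subtree).
    d: list of discount vectors, d[t][k] weighting the reward at time k for player t.
    """
    if tree is None:
        return []
    best_u, best_seq = None, None
    for action, (r, subtree) in tree.items():
        seq = [r] + induct(subtree, t + 1, d)  # rewards from time t onwards
        u = sum(dk * rk for dk, rk in zip(d[t][t:], seq))
        if best_u is None or u > best_u:
            best_u, best_seq = u, seq
    return best_seq

# Player 1 is myopic (weight 0.1 on time 2); players 0 and 2 are patient.
d = [[1.0, 0.9, 0.81], [1.0, 1.0, 0.1], [1.0, 1.0, 1.0]]
tree = {
    'down': (1.9, None),
    'right': (0.0, {'now': (2.0, None),
                    'wait': (0.0, {'end': (3.0, None)})}),
}
# Player 0 would love the plan right-wait-end (utility 0.81 * 3 = 2.43), but
# knows player 1 would grab the reward at time 1 (2.0 > 0.1 * 3), so the
# subgame perfect play is to move down immediately.
assert induct(tree, 0, d) == [1.9]
```

If player 1 were patient (e.g. $d^1 = (1, 1, 1)$) the same induction would instead realise the sequence $(0, 0, 3)$, mirroring how precommitment value depends on the future selves' discounting.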
The technique above is known as backwards induction, which is used to find subgame perfect equilibria in finite extensive games; a variant of Kuhn’s theorem proves that backwards induction always finds such equilibria in finite extensive games [OR94]. For arbitrary (possibly infinite) extensive games a subgame perfect equilibrium need not exist, but we prove a theorem for our particular class of infinite games.
A subgame perfect equilibrium policy is one the players could agree to play, and subsequently have no incentive to renege on their agreement during play. It isn’t always philosophically clear that a subgame perfect equilibrium policy should be played. For a deeper discussion, including a number of good examples, see [OR94].
Definition 17 (Subgame Perfect Equilibria).
A policy is a subgame perfect equilibrium policy if and only if for each , where is any policy satisfying where .
Theorem 18 (Existence of Subgame Perfect Equilibrium Policy).
Many results in the literature of game theory almost prove this theorem. Our setting is more difficult than most because we have countably many players (one for each timestep) and exogenous uncertainty. Fortunately, it is made easier by the very particular conditions on the preferences of players for rewards that occur late in the game (Assumption 7). The closest related work appears to be that of Fudenberg [Fud83], but our proof (see appendix) is very different. The proof idea is to consider a sequence of environments identical to the original environment but with an increasing bounded horizon after which the reward is zero. By Kuhn’s Theorem [OR94] a subgame perfect equilibrium policy must exist in each of these finite games. However, the space of policies is compact (Lemma 23) and so this sequence of subgame perfect equilibrium policies contains a convergent subsequence converging to some policy $\pi$. It is not then hard to show that $\pi$ is a subgame perfect equilibrium policy in the original environment.
Proof of Theorem 18.
Add an action to the environment such that if it is taken at any time then the environment returns zero reward from then on. Essentially, once the agent takes this action, it receives zero reward forever. Now if $\pi$ is a subgame perfect equilibrium policy in this modified environment then it is a subgame perfect equilibrium policy in the original one.
For each choose to be a subgame perfect equilibrium policy in the further modified environment obtained by setting if . That is, the environment which gives zero reward always after time . We can assume without loss of generality that for all . Since is compact, the sequence has a convergent subsequence converging to and satisfying

where .

is a subgame perfect equilibrium policy in the modified environment with reward if .

.
We write for the value function in the modified environment. It is now shown that is a subgame perfect equilibrium policy in the original environment. Fix a and let be a policy with for all where . Now define policies by
By point 1 above, for all where . Now for all we have
(7)  
(8)  
(9)  
(10) 
where (7) follows from arithmetic. (8) since . (9) since is a subgame perfect equilibrium policy. (10) by arithmetic. We now show that the absolute value terms in (10) converge to zero. Since is continuous in and and , we obtain . Now if , so . Therefore taking the limit as goes to infinity in (10) shows that as required. ∎
In general, the subgame perfect equilibrium policy need not be unique, and different subgame perfect equilibrium policies can lead to different utilities. This is a normal, but unfortunate, problem with the subgame perfect equilibrium solution concept. The policy is unique if for all players the values of any two arbitrary policies are different. Also, if the discount matrix is time-consistent then the non-unique subgame perfect equilibrium policies have the same values for all agents. Unfortunately, neither of these conditions is necessarily satisfied in our setup. The problem of how players might choose among subgame perfect equilibrium policies appears surprisingly understudied. We feel it provides another reason to avoid the situation altogether by using time-consistent discount matrices. The following example illustrates the problem of non-unique subgame perfect equilibrium policies.
Example 19.
Consider the example in Section 3 with an agent using a constant horizon discount matrix with . There are exactly two subgame perfect equilibrium policies, and defined by,
Note that the reward sequences (and values) generated by and are different with and . If the players choose to play a subgame perfect equilibrium policy then the first player can choose between and since they have the first move. In that case it would be best to follow by moving right as it has a greater return for the agent at time than .
For time-consistent discount matrices we have the following proposition.
Proposition 20.
If $\mathbf{d}$ is time-consistent then $V^{\boldsymbol{d}^1}(\pi^*_{\boldsymbol{d}^1}, h_t) = V^{\boldsymbol{d}^1}(\pi_{sub}, h_t)$ for all $h_t$ and choices of $\pi^*_{\boldsymbol{d}^1}$ and $\pi_{sub}$.
Is it possible that backwards induction is simply expected discounted reward maximisation in another form? The following theorem shows this is not the case and that subgame perfect equilibrium policies are a rich and interesting class worthy of further study in this (and more general) setting.
Theorem 21.
.
The result is proven using a simple counterexample. The idea is to construct a stochastic environment where the first action leads the agent to one of two sub-environments, each with probability one half. These environments are identical to the example at the start of this section, but one of them has a different reward for one particular history. It is then easily shown that the subgame perfect equilibrium policy is not the result of an expectimax expression, because it behaves differently in each sub-environment, while any expectimax search (irrespective of discounting) will behave the same in each.
6 Discussion
Summary. Theorem 13 gives a characterisation of time-(in)consistent discount matrices and shows that all time-consistent discount matrices have the simple form $d^t_k = \alpha_t d^1_k$. Theorem 15 shows that using a discount matrix that is nearly time-consistent produces mixed policies with low regret. This is useful for a few reasons, including showing that small perturbations, such as rounding errors, in a discount matrix cannot cause major time-inconsistency problems. It also shows that “cutting off” time-consistent discount matrices after some fixed depth, which makes the agent potentially time-inconsistent, doesn’t affect the policies too much, provided the depth is large enough. When a discount matrix is very time-inconsistent, taking a game theoretic approach may dramatically decrease the regret from the change of policy over time.
Some comments on the policies $\pi^*_{\boldsymbol{d}}$ (policy maximising expected discounted reward), $\bar{\pi}$ (mixed policy using $\pi^*_{\boldsymbol{d}^t}$ at each timestep $t$) and $\pi_{sub}$ (subgame perfect equilibrium policy).

1. A time-consistent agent should play policy $\pi^*_{\boldsymbol{d}^t}$ for any $t$. In this case, every optimal policy is also a subgame perfect equilibrium policy.

2. $\bar{\pi}$ will be played by an agent that believes it is time-consistent, but may not be. This can lead to very bad behavior as shown in Section 3.

3. An agent may play $\pi_{sub}$ if it knows it is time-inconsistent, and also knows exactly how (i.e., it knows $\boldsymbol{d}^t$ for all $t$ at every timestep). This policy is arguably rational, but comes with its own problems, especially non-uniqueness as discussed.
Assumptions. We made a number of assumptions, on which we make some brief comments.

1. Assumption 1, which states that $\mathcal{A}$ and $\mathcal{O}$ are finite, guarantees the existence of an optimal policy. Removing the assumption would force us to use $\epsilon$-optimal policies, which shouldn’t be a problem for the theorems to go through with an additive slop term in some cases.

2. Assumption 7 only affects non-summable discount vectors. Without it, even $\epsilon$-optimal policies need not exist and all the machinery breaks down.

3. The use of discrete time greatly reduced the complexity of the analysis. Given a sufficiently general model, the set of continuous environments should contain all discrete environments. For this reason the proof of Theorem 13 should go through essentially unmodified. The same may not be true for Theorems 15 and 18. The former may be fixable with substantial effort (and perhaps should be true intuitively). The latter has been partially addressed, with positive results in [Gol80, PY73, Pol68, Str55].
References
 [FOO02] Shane Frederick, George Loewenstein, and Ted O’Donoghue. Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2):351–401, 2002.
 [Fud83] Drew Fudenberg. Subgame-perfect equilibria of finite- and infinite-horizon games. Journal of Economic Theory, 31(2), 1983.
 [GFM94] Leonard Green, Nathanael Fristoe, and Joel Myerson. Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin & Review, 1(3):383–389, 1994.
 [Gol80] Steven M. Goldman. Consistent plans. The Review of Economic Studies, 47(3):533–537, 1980.

 [Hut04] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.
 [Hut06] Marcus Hutter. General discounting versus average reward. In Proc. 17th International Conf. on Algorithmic Learning Theory (ALT’06), volume 4264 of LNAI, pages 244–258, Barcelona, 2006. Springer, Berlin.
 [Leg08] Shane Legg. Machine Super Intelligence. PhD thesis, University of Lugano, 2008.
 [LH07] Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
 [OR94] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
 [Pol68] Robert A. Pollak. Consistent planning. The Review of Economic Studies, 35(2):201–208, 1968.
 [PY73] Bezalel Peleg and Menahem E. Yaari. On the existence of a consistent course of action when tastes are changing. The Review of Economic Studies, 40(3):391–401, 1973.
 [Sam37] Paul A. Samuelson. A note on measurement of utility. The Review of Economic Studies, 4(2):155–161, 1937.
 [Str55] Robert H. Strotz. Myopia and inconsistency in dynamic utility maximization. The Review of Economic Studies, 23(3):165–180, 1955.
 [Tha81] Richard Thaler. Some empirical evidence on dynamic inconsistency. Economics Letters, 8(3):201–207, 1981.
Appendix A Technical Proofs
Before the proof of Theorem 8 we require a definition and two lemmas.
Definition 22.
Let be the set of all policies and define a metric on by