1 Introduction
The standard treatment of the reinforcement learning (RL) problem is the Markov Decision Process (MDP), which includes a discount factor $\gamma$ that exponentially reduces the present value of future rewards (Bellman, 1957; Sutton & Barto, 1998). A reward $r_t$ received in $t$ time steps is devalued to $\gamma^t r_t$, a discounted utility model introduced by Samuelson (1937). This establishes a time-preference for rewards realized sooner rather than later. The decision to exponentially discount future rewards by $\gamma$ leads to value functions that satisfy theoretical convergence properties (Bertsekas, 1995). The magnitude of $\gamma$ also plays a role in stabilizing learning dynamics of RL algorithms (Prokhorov & Wunsch, 1997; Bertsekas & Tsitsiklis, 1996) and has recently been treated as a hyperparameter of the optimization (OpenAI, 2018; Xu et al., 2018).
However, both the magnitude and the functional form of this discounting function implicitly establish priors over the solutions learned. The magnitude of $\gamma$ chosen establishes an effective horizon for the agent, far beyond which rewards are neglected (Kearns & Singh, 2002). This effectively imposes a timescale on the environment, which may not be accurate. Less well known, and expanded on later, is that the exponential discounting of future rewards is consistent with a prior belief that there exists a known constant risk to the agent in the environment (Sozou (1998), Section 3.1). This is a strong assumption that may not be supported in richer environments.
Additionally, discounting future values exponentially and according to a single discount factor $\gamma$ does not agree with the measured value preferences in humans and animals (Mazur, 1985, 1997; Ainslie, 1992; Green & Myerson, 2004; Maia, 2009). A wealth of empirical evidence has been amassed that humans, monkeys, rats and pigeons instead discount future returns hyperbolically, where the discount is $d(t) = \frac{1}{1 + kt}$ for some positive $k$ (Ainslie, 1975, 1992; Mazur, 1985, 1997; Frederick et al., 2002; Green et al., 1981; Green & Myerson, 2004).
As an example of hyperbolic time-preferences, consider the hypothetical: a stranger approaches with a simple proposition. He offers you $1M immediately with no risk, but if you can wait until tomorrow, he promises you $1.1M. With no further information, many are skeptical of this would-be benefactor and choose to receive $1M immediately. Most rightly believe the future promise holds risk. However, in an alternative proposition, he instead promises you $1M in 365 days or $1.1M in 366 days. Under these new terms many will instead choose the $1.1M offer. Effectively, the discount rate has decreased further out, indicating the belief that the promise is less likely to be reneged on the 366th day if it was not already broken on the 365th day. Note that discount rates in humans have been demonstrated to vary with the size of the reward, so this preference reversal might not emerge for $1 versus $1.1 (Myerson & Green, 1995; Green et al., 1997).
Hyperbolic discounting is consistent with these reversals in time-preferences (Green et al., 1994). Exponential discounting, on the other hand, always remains consistent between these choices and was shown in Strotz (1955) to be the only time-consistent sliding discount function. This discrepancy between the time-preferences of animals and the exponentially discounted measure of value might be presumed irrational. However, Sozou (1998) demonstrates that this behavior is mathematically consistent with the agent maintaining some uncertainty over the hazard rate in the environment. In this formulation, rewards are discounted based on the possibility that the agent will succumb to a risk and will thus not survive to collect them. Hazard rate, defined in Section 3, measures the per-time-step risk the agent incurs as it acts in the environment.
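The preference reversal in the stranger example can be checked numerically. The sketch below is only illustrative: the coefficient `K`, the discount factor `GAMMA`, and the helper names are assumptions chosen for the example, not values taken from the paper.

```python
def hyperbolic(t, k):
    """Hyperbolic discount 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def exponential(t, gamma):
    """Exponential discount gamma ** t."""
    return gamma ** t


def prefers_later(discount, t_soon, r_soon=1.0, r_late=1.1):
    """True if the larger reward, one day later, has the higher present value."""
    return r_late * discount(t_soon + 1) > r_soon * discount(t_soon)


K = 0.2      # hypothetical hyperbolic coefficient (per day)
GAMMA = 0.9  # hypothetical exponential discount factor (per day)

# Compare "$1M at day t vs $1.1M at day t + 1" for t = 0 and t = 365.
hyper_choices = [prefers_later(lambda t: hyperbolic(t, K), t) for t in (0, 365)]
exp_choices = [prefers_later(lambda t: exponential(t, GAMMA), t) for t in (0, 365)]

print("hyperbolic agent waits (today, in a year):", hyper_choices)
print("exponential agent waits (today, in a year):", exp_choices)
```

With these assumed parameters the hyperbolic agent takes the immediate reward today but waits when both rewards lie a year away, while the exponential agent makes the same choice at both delays.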
Hazard and its associated discount function. Common RL environments are also characterized by risk, but in a narrower sense. In deterministic environments like the original Arcade Learning Environment (ALE) (Bellemare et al., 2013), stochasticity is often introduced through techniques like no-ops (Mnih et al., 2015) and sticky actions (Machado et al., 2018), where the action execution is noisy. Physics simulators may have noise, and the randomness of the policy itself induces risk. But even with these stochastic injections the risk to reward emerges in a more restricted sense. Episode-to-episode risk may vary as the value function and resulting policy evolve. States once safely navigable may become dangerous through catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999), or through exploration the agent may venture to new dangerous areas of the state space. However, this is still a narrow manifestation of risk, as the environment is generally stable and repetitive. In Section 4 we show that a prior distribution reflecting the uncertainty over the hazard rate has an associated discount function, in the sense that an MDP with either this hazard distribution or the discount function has the same value function for all policies. This equivalence implies that learning policies with a discount function can be interpreted as making them robust to the associated hazard distribution. Thus, discounting serves as a tool to ensure that policies deployed in the real world perform well even under risks they were not trained under.
Hyperbolic discounting from TD-learning algorithms. We propose an algorithm that approximates hyperbolic discounting while building on successful Q-learning (Watkins & Dayan, 1992) tools and their associated theoretical guarantees. We show that learning many Q-values, each discounting exponentially with a different discount factor $\gamma$, can be aggregated to approximate hyperbolic (and other non-exponential) discount factors. We demonstrate the efficacy of our approximation scheme in our proposed Pathworld environment, which is characterized by an uncertain per-time-step risk to the agent. The agent must choose which risky path to follow, but it stands to gain a higher reward along the longer, riskier paths. A conceptually similar situation might arise for a foraging agent balancing easily realizable, small meals versus more distant, fruitful meals. The setup is described in further detail in Section 7. We then consider higher-dimensional RL agents in the ALE, where we measure the benefits of our technique. Our approximation mirrors the work of Kurth-Nelson & Redish (2009); Redish & Kurth-Nelson (2010), which empirically demonstrates that modeling a finite set of $\mu$Agents simultaneously can approximate the hyperbolic discounting function, which is consistent with fMRI studies (Tanaka et al., 2004; Schweighofer et al., 2008). Our method extends to other non-hyperbolic discount functions and uses deep neural networks to model the different Q-values from a shared representation.
Surprisingly, and in addition to enabling new discounting schemes, we observe that learning a set of Q-values is beneficial as an auxiliary task (Jaderberg et al., 2016). Adding this multi-horizon auxiliary task often improves over strong baselines including C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018) in the ALE (Bellemare et al., 2013).
The paper is organized as follows. Section 3 recounts how a prior belief of the risk in the environment can imply a specific discount function. Section 4 formalizes hazard in MDPs. In Section 5 we demonstrate that hyperbolic (and other) discounting rates can be computed by Q-learning (Watkins & Dayan, 1992) over multiple horizons, that is, multiple discount factors $\gamma$. We then provide a practical approach to approximating these alternative discount schemes in Section 6. We demonstrate the efficacy of our approximation scheme in the Pathworld environment in Section 7 and then go on to consider the high-dimensional ALE setting in Sections 8 and 9. We conclude with ablation studies, discussion and commentary on future research directions.
This work questions the RL paradigm of learning policies through a single discount function which exponentially discounts future rewards through two contributions:

Hyperbolic (and other non-exponential) agent. A practical approach for training an agent which discounts future rewards by a hyperbolic (or other non-exponential) discount function and acts according to this.

Multi-horizon auxiliary task. A demonstration of multi-horizon learning over many $\gamma$ simultaneously as an effective auxiliary task.
2 Related Work
Hyperbolic discounting in economics. Hyperbolic discounting is well-studied in the field of economics (Sozou, 1998; Dasgupta & Maskin, 2005). Dasgupta and Maskin (2005) proposes a softer interpretation than Sozou (1998) (which produces a per-time-step risk of death via the hazard rate) and demonstrates that uncertainty over the timing of rewards can also give rise to hyperbolic discounting and preference reversals, a hallmark of hyperbolic discounting. However, though alternative motivations for hyperbolic discounting exist, we build upon Sozou (1998) for its clarity and simplicity.
Hyperbolic discounting was initially presumed to not lend itself to TD-based solutions (Daw & Touretzky, 2000), but the field has evolved on this point. Maia (2009) proposes solution directions that find models that discount quasi-hyperbolically even though each learns with exponential discounting (Loewenstein, 1996), but reaffirms the difficulty. Finally, Alexander and Brown (2010) proposes hyperbolically discounted temporal difference (HDTD) learning by making connections to hazard. However, this approach introduces two additional free parameters to adjust for differences in reward level.
Behavior RL and hyperbolic discounting in neuroscience. TD-learning has long been used for modeling behavioral reinforcement learning (Montague et al., 1996; Schultz et al., 1997; Sutton & Barto, 1998). TD-learning computes the error as the difference between the expected value and actual value (Sutton & Barto, 1998; Daw, 2003), where the error signal emerges from unexpected rewards. However, these computations traditionally rely on exponential discounting as part of the estimate of the value, which disagrees with empirical evidence in humans and animals (Strotz, 1955; Mazur, 1985, 1997; Ainslie, 1975, 1992). Hyperbolic discounting has been proposed as an alternative to exponential discounting, though its accuracy as a model has been debated (Kacelnik, 1997; Frederick et al., 2002). Naive modifications to TD-learning to discount hyperbolically present issues since the simple forms are inconsistent (Daw & Touretzky, 2000; Redish & Kurth-Nelson, 2010). RL models have been proposed to explain behavioral effects of humans and animals (Fu & Anderson, 2006; Rangel et al., 2008), but Kurth-Nelson & Redish (2009) demonstrated that distributed exponential discount factors can directly model hyperbolic discounting. That work proposes the $\mu$Agent, an agent that models the value function with a specific discount factor $\gamma$. When the distributed set of $\mu$Agents votes on the action, this was shown to approximate hyperbolic discounting well in the adjusting-delay assay experiments (Mazur, 1987). Using the hazard formulation established in Sozou (1998), we demonstrate how to extend this to other non-hyperbolic discount functions and demonstrate the efficacy of using a deep neural network to model the different Q-values from a shared representation.
Towards more flexible discounting in reinforcement learning. RL researchers have recently adopted more flexible versions beyond a fixed discount factor (Feinberg & Shwartz, 1994; Sutton, 1995; Sutton et al., 2011; White, 2017). Optimal policies are studied in Feinberg & Shwartz (1994), where two value functions with different discount factors are used. Introducing the discount factor as an argument to be queried for a set of timescales is considered in both Horde (Sutton et al., 2011) and $\Gamma$-nets (Sherstan et al., 2018). Reinke et al. (2017) proposes the Average Reward Independent Gamma Ensemble framework, which imitates the average return estimator.
Lattimore and Hutter (2011) generalizes the original discounting model through discount functions that vary with the age of the agent, expressing time-inconsistent preferences as in hyperbolic discounting. The need to increase training stability via the effective horizon was addressed in François-Lavet, Fonteneau, and Ernst (2015), who proposed dynamic strategies for the discount factor $\gamma$. Meta-learning approaches to deal with the discount factor have been proposed in Xu, van Hasselt, and Silver (2018). Finally, Pitis (2019) characterizes rational decision making in sequential processes, formalizing a process that admits state-action dependent discount rates. This body of work suggests growing tension between the original MDP formulation with a fixed $\gamma$ and future research directions.
Operating over multiple time scales has a long history in RL. Sutton (1995) generalizes the work of Singh (1992) and Dayan and Hinton (1993) to formalize a multi-time scale TD learning model theory. Previous work has explored solving MDPs with multiple reward functions and multiple discount factors, though these relied on separate transition models (Feinberg & Shwartz, 1999; Dolgov & Durfee, 2005). Edwards, Littman, and Isbell (2015) considers decomposing a reward function into separate components, each with its own discount factor. In our work, we continue to model the same rewards, but now model the value over different horizons. Recent work in difficult exploration games demonstrates the efficacy of two different discount factors (Burda et al., 2018), one for intrinsic rewards and one for extrinsic rewards. Finally, and concurrent with this work, Romoff et al. (2019) proposes the TD($\Delta$)-algorithm, which breaks a value function into a series of value functions with smaller discount factors.
Auxiliary tasks in reinforcement learning. Finally, auxiliary tasks have been successfully employed and found to be of considerable benefit in RL. Suddarth and Kergosien (1990) used auxiliary tasks to facilitate representation learning. Building upon this, work in RL has consistently demonstrated benefits of auxiliary tasks to augment the low-information signal coming from the environment through extrinsic rewards (Lample & Chaplot, 2017; Mirowski et al., 2016; Jaderberg et al., 2016; Veeriah et al., 2018; Sutton et al., 2011).
3 Belief of Risk Implies a Discount Function
Sozou (1998) formalizes time preferences in which future rewards are discounted based on the probability that the agent will not survive to collect them due to an encountered risk or hazard.
Definition 3.1.
Survival $s(t)$ is the probability of the agent surviving until time $t$.
$$s(t) = P(\text{agent is alive at time } t) \tag{1}$$
A future reward $r_t$ is less valuable presently if the agent is unlikely to survive to collect it. If the agent is risk-neutral, the present value of a future reward $r_t$ received at time $t$ should be discounted by the probability that the agent will survive until time $t$ to collect it, $s(t)$. (Footnote 1: Note the difference in RL, where future rewards are discounted by time-delay, so the value is $v(r_t) = \gamma^t r_t$.)
$$v(r_t) = s(t)\, r_t \tag{2}$$
Consequently, if the agent is certain to survive, $s(t) = 1$, then the reward is not discounted per Equation 2. From this it is then convenient to define the hazard rate.
Definition 3.2.
Hazard rate $h(t)$ is the negative rate of change of the log-survival at time $t$
$$h(t) = -\frac{d}{dt} \ln s(t) \tag{3}$$
or equivalently expressed as $h(t) = -\frac{ds(t)}{dt} \frac{1}{s(t)}$. Therefore the environment is considered hazardous at time $t$ if the log-survival is decreasing sharply.
Sozou (1998) demonstrates that the prior belief of the risk in the environment implies a specific discounting function. When the risk occurs at a known constant rate, then the agent should discount future rewards exponentially. However, when the agent holds uncertainty over the hazard rate, then hyperbolic and alternative discounting rates arise.
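To make the hazard-rate definition concrete, the following sketch estimates the negative derivative of the log-survival numerically for two survival curves: an exponential one and a hyperbolic one of the form 1 / (1 + k t). The constants `LAM` and `K` and the helper name `hazard_rate` are illustrative assumptions.

```python
import math


def hazard_rate(survival, t, eps=1e-6):
    """Central-difference estimate of h(t) = -d/dt ln s(t)."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)


LAM = 0.1  # hypothetical constant hazard rate
K = 0.1    # hypothetical hyperbolic coefficient


def exp_survival(t):
    """Exponential survival s(t) = exp(-LAM * t)."""
    return math.exp(-LAM * t)


def hyp_survival(t):
    """Hyperbolic survival s(t) = 1 / (1 + K * t)."""
    return 1.0 / (1.0 + K * t)


for t in (1.0, 10.0, 100.0):
    print(t, round(hazard_rate(exp_survival, t), 4), round(hazard_rate(hyp_survival, t), 4))
```

The exponential curve has a constant hazard rate, whereas the hyperbolic curve has hazard k / (1 + k t), which decays with time. This matches the intuition from the introduction that a promise which has survived a year is less likely to be broken on the following day.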
3.1 Known Hazard Implies Exponential Discount
We recover the familiar exponential discount function in RL based on a prior assumption that the environment has a known constant hazard. Consider a known hazard rate of $h(t) = \lambda \ge 0$. Definition 3.2 sets a first-order differential equation $\lambda = -\frac{d}{dt} \ln s(t) = -\frac{ds(t)}{dt} \frac{1}{s(t)}$. The solution for the survival rate is $s(t) = e^{-\lambda t}$, which can be related to the RL discount factor $\gamma$
$$s(t) = e^{-\lambda t} = \gamma^t \tag{4}$$
This interprets $\gamma$ as the per-time-step probability of the episode continuing. This also allows us to connect the hazard rate $\lambda$ to the discount factor $\gamma$.
$$\gamma = e^{-\lambda} \tag{5}$$
As the hazard increases, $\lambda \to \infty$, the corresponding discount factor becomes increasingly myopic, $\gamma \to 0$. Conversely, as the environment hazard vanishes, $\lambda \to 0$, the corresponding agent becomes increasingly far-sighted, $\gamma \to 1$.
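As a small worked example of the correspondence between a constant hazard rate and an exponential discount factor (discount factor equal to the exponential of the negative hazard), the sketch below converts a few commonly used discount factors to their implied hazard rates. The function names are hypothetical.

```python
import math


def gamma_from_hazard(lam):
    """Discount factor implied by a constant hazard rate: gamma = exp(-lam)."""
    return math.exp(-lam)


def hazard_from_gamma(gamma):
    """Constant hazard rate implied by a discount factor: lam = -ln(gamma)."""
    return -math.log(gamma)


for gamma in (0.9, 0.99, 0.999):
    lam = hazard_from_gamma(gamma)
    survival_100 = math.exp(-lam * 100)  # probability of surviving 100 steps
    print(f"gamma={gamma}: hazard={lam:.5f}, survival to t=100: {survival_100:.4f}")
```

Under this reading, choosing a discount factor of 0.99 amounts to assuming a roughly 1% chance per step that the episode ends.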
In RL we commonly choose a single $\gamma$, which is consistent with the prior belief that there exists a known constant hazard rate $\lambda = -\ln \gamma$. We now relax the assumption that the agent holds this strong prior that it exactly knows the true hazard rate. From a Bayesian perspective, a looser prior allows for some uncertainty in the underlying hazard rate of the environment, which we will see in the following section.
3.2 Uncertain Hazard Implies Non-Exponential Discount
We may not always be so confident of the true risk in the environment and instead reflect this underlying uncertainty in the hazard rate through a hazard prior $p(\lambda)$. Our survival rate is then computed by weighting specific exponential survival rates defined by a given $\lambda$ over our prior $p(\lambda)$
$$s(t) = \int_{\lambda=0}^{\infty} p(\lambda)\, e^{-\lambda t}\, d\lambda \tag{6}$$
Sozou (1998) shows that under an exponential prior of hazard, $p(\lambda) = \frac{1}{k} \exp(-\lambda / k)$, the expected survival rate for the agent is hyperbolic
$$s(t) = \frac{1}{1 + kt} \equiv \Gamma_k(t) \tag{7}$$
We denote the hyperbolic discount by $\Gamma_k(t)$ to make the connection to $\gamma^t$ in reinforcement learning explicit. Further, Sozou (1998) shows that different priors over hazard correspond to different discount functions. We reproduce two figures in Figure 2 showing the correspondence between different hazard rate priors and the resultant discount functions. The common approach in RL is to maintain a delta-hazard (black line), which leads to exponential discounting of future rewards. Different priors lead to non-exponential discount functions.
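The claim that an exponential hazard prior yields hyperbolic survival can be checked by numerical quadrature. The sketch below assumes the prior p(lam) = (1/k) exp(-lam/k) and integrates it against exp(-lam t) with a simple midpoint rule. The truncation point `lam_max`, the step count, and the value of `K` are arbitrary choices made for this illustration.

```python
import math


def expected_survival(prior, t, lam_max=5.0, n=100_000):
    """Midpoint-rule estimate of the integral of prior(lam) * exp(-lam * t) over [0, lam_max]."""
    dlam = lam_max / n
    total = 0.0
    for i in range(n):
        lam = (i + 0.5) * dlam
        total += prior(lam) * math.exp(-lam * t)
    return total * dlam


K = 0.1  # hypothetical prior coefficient


def exp_prior(lam):
    """Exponential hazard prior p(lam) = (1 / K) * exp(-lam / K)."""
    return math.exp(-lam / K) / K


for t in (0, 1, 10, 100):
    print(t, expected_survival(exp_prior, t), 1.0 / (1.0 + K * t))
```

The two printed columns should agree to several decimal places. Swapping in a different prior (for example a uniform density) gives the survival curve for that prior under the same integral.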
4 Hazard in MDPs
To study MDPs with hazard distributions and general discount functions we introduce two modifications. The hazardous MDP is now defined by the tuple $\langle \mathcal{S}, \mathcal{A}, R, P, \mathcal{H}, d \rangle$. In standard form, the state space $\mathcal{S}$ and the action space $\mathcal{A}$ may be discrete or continuous. The learner observes samples from the environment transition probability $P(s_{t+1} \mid s_t, a_t)$ for going from $s_t \in \mathcal{S}$ to $s_{t+1} \in \mathcal{S}$ given $a_t \in \mathcal{A}$. We will consider the case where $P$ is a sub-stochastic transition function, which defines an episodic MDP. The environment emits a bounded reward $r: \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ on each transition. In this work we consider non-infinite episodic MDPs.
The first difference is that at the beginning of each episode, a hazard $\lambda \in [0, \infty)$ is sampled from the hazard distribution $\mathcal{H}$. This is equivalent to sampling a continuing probability $\gamma = e^{-\lambda}$. During the episode, the hazard-modified transition function will be $P_\lambda$, in that $P_\lambda(s' \mid s, a) = e^{-\lambda} P(s' \mid s, a)$. The second difference is that we now consider a general discount function $d(t)$. This differs from the standard approach of exponential discounting in RL with $\gamma$ according to $d(t) = \gamma^t$, which is a special case.
This setting makes a close connection to partially observable Markov Decision Processes (POMDPs) (Kaelbling et al., 1998), where one might consider $\lambda$ as an unobserved variable. However, the classic POMDP definition contains an explicit discount function $\gamma$ as part of its definition, which does not appear here.
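A policy $\pi: \mathcal{S} \to \mathcal{A}$ is a mapping from states to actions. The state-action value function $Q^{\mathcal{H},d}_\pi(s, a)$ is the expected discounted rewards after taking action $a$ in state $s$ and then following policy $\pi$ until termination.
$$Q^{\mathcal{H},d}_\pi(s, a) = \mathbb{E}_{\lambda}\, \mathbb{E}_{\pi, P_\lambda} \left[ \sum_{t=0}^{\infty} d(t)\, R(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right] \tag{8}$$
where $\lambda \sim \mathcal{H}$ and $\mathbb{E}_{\pi, P_\lambda}$ implies that $s_{t+1} \sim P_\lambda(\cdot \mid s_t, a_t)$ and $a_t \sim \pi(\cdot \mid s_t)$.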
4.1 Equivalence Between Hazard and Discounting
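In the hazardous MDP setting we observe the same connections between hazard and discount functions delineated in Section 3. This expresses an equivalence between the value function of an MDP with a discount and an MDP with a hazard distribution.
For example, there exists an equivalence between the exponential discount function $d(t) = \gamma^t$ and the undiscounted case where the agent is subject to a $(1 - \gamma)$ per-time-step risk of dying (Lattimore & Hutter, 2011). The typical Q-value (left side of Equation 9) is when the agent acts in an environment without hazard, $\lambda = 0$ or $\mathcal{H} = \delta(0)$, and discounts future rewards according to $d(t) = \gamma^t = e^{-\lambda t}$, which we denote as $Q^{\delta(0), \gamma^t}_\pi(s, a)$. The alternative Q-value (right side of Equation 9) is when the agent acts under hazard rate $\lambda = -\ln \gamma$ but does not discount future rewards, which we denote as $Q^{\delta(-\ln \gamma), 1}_\pi(s, a)$.
$$Q^{\delta(0), \gamma^t}_\pi(s, a) = Q^{\delta(-\ln \gamma), 1}_\pi(s, a) \quad \forall\, \pi, s, a \tag{9}$$
where $\delta(x)$ denotes the Dirac delta distribution at $x$. This follows from $P_\lambda(s' \mid s, a) = e^{-\lambda} P(s' \mid s, a)$: under hazard $\lambda = -\ln \gamma$, the probability of surviving to collect the reward at step $t$ is $e^{-\lambda t} = \gamma^t$, which is exactly the exponential discount applied in the hazard-free MDP.
Following Section 3 we also show a similar equivalence between hyperbolic discounting and the specific hazard distribution $p_k(\lambda) = \frac{1}{k} \exp(-\lambda / k)$, where again $\lambda \in [0, \infty)$, in Appendix A.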
For notational brevity later in the paper, we will omit the explicit hazard distribution superscript if the environment is not hazardous.
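The equivalence between exponential discounting without hazard and undiscounted returns under a constant hazard can be illustrated with a small Monte Carlo simulation. This is a sketch under simplifying assumptions: a fixed policy that produces a fixed (hypothetical) reward sequence, and a hazard that terminates the episode after each step with probability one minus the discount factor.

```python
import math
import random


def discounted_return(rewards, gamma):
    """Hazard-free return with exponential discounting: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def hazardous_return(rewards, lam, rng):
    """Undiscounted return of one episode where each transition survives w.p. exp(-lam)."""
    survive_prob = math.exp(-lam)
    total = 0.0
    for r in rewards:
        total += r  # the reward for the current step is collected
        if rng.random() > survive_prob:
            break   # the hazard terminates the episode
    return total


rng = random.Random(0)
rewards = [1.0, 0.0, 2.0, 0.5, 3.0, 1.0, 0.0, 4.0]  # hypothetical fixed-policy rewards
gamma = 0.9
lam = -math.log(gamma)

n_episodes = 100_000
mc_estimate = sum(hazardous_return(rewards, lam, rng) for _ in range(n_episodes)) / n_episodes
print("discounted, no hazard:", discounted_return(rewards, gamma))
print("undiscounted, hazard :", mc_estimate)
```

The two numbers should agree up to Monte Carlo noise, since the probability of still being alive at step t is the discount factor raised to the power t.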
5 Computing Hyperbolic Q-Values From Exponential Q-Values
We show how one can re-purpose exponentially-discounted Q-values to compute hyperbolic (and other non-exponential) discounted Q-values. The central challenge with using non-exponential discount strategies is that most RL algorithms use some form of TD learning (Sutton, 1988). This family of algorithms exploits the Bellman equation (Bellman, 1958) which, when using exponential discounting, relates the value function at one state with the value at the following state.
$$Q^{\gamma^t}_\pi(s, a) = \mathbb{E}_{\pi, P} \left[ R(s, a) + \gamma\, Q^{\gamma^t}_\pi(s', a') \right] \tag{10}$$
where the expectation $\mathbb{E}_{\pi, P}$ denotes sampling $a \sim \pi(\cdot \mid s)$, $s' \sim P(\cdot \mid s, a)$, and $a' \sim \pi(\cdot \mid s')$.
Being able to reuse the literature on TD methods without being constrained to exponential discounting is thus an important challenge.
5.1 Computing Hyperbolic Values
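Let us start with the case where we would like to estimate the value function where rewards are discounted hyperbolically instead of the common exponential scheme. We refer to the hyperbolic Q-values as $Q^\Gamma_\pi$ below in Equation 12
$$Q^\Gamma_\pi(s, a) = \mathbb{E}_\pi \left[ \Gamma_k(0) R(s_0, a_0) + \Gamma_k(1) R(s_1, a_1) + \cdots \,\Big|\, s_0 = s, a_0 = a \right] \tag{11}$$
$$= \mathbb{E}_\pi \left[ \sum_{t} \Gamma_k(t) R(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right] \tag{12}$$
We may relate the hyperbolic $Q^\Gamma_\pi$-value to the values learned through standard Q-learning. To do so, notice that the hyperbolic discount $\Gamma_k(t)$ can be expressed as the integral of a certain function $f(\gamma, t)$ for $\gamma \in [0, 1)$ in Equation 13.
$$\int_{\gamma=0}^{1} \gamma^{kt}\, d\gamma = \frac{1}{1 + kt} = \Gamma_k(t) \tag{13}$$
The integral over this specific function $f(\gamma, t) = \gamma^{kt}$ yields the desired hyperbolic discount factor $\Gamma_k(t)$ by considering an infinite set of exponential discount factors $\gamma$ over its domain $\gamma \in [0, 1)$. We visualize the hyperbolic discount factors $\frac{1}{1+t}$ (consider $k = 1$) for the first few time-steps $t$ in Figure 3.
Recognize that the integrand $\gamma^{kt}$ is the standard exponential discount factor, which suggests a connection to standard Q-learning (Watkins & Dayan, 1992). This suggests that if we could consider an infinite set of $\gamma$ then we can combine them to yield hyperbolic discounts for the corresponding time-step $t$. We build on this idea of modeling many $\gamma$ throughout this work.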
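We employ Equation 13 and return to the task of computing hyperbolic Q-values $Q^\Gamma_\pi(s, a)$. (Footnote 2: Hyperbolic Q-values can generally be infinite for bounded rewards. We consider non-infinite episodic MDPs only.)
$$Q^\Gamma_\pi(s, a) = \mathbb{E}_\pi \left[ \sum_t \Gamma_k(t) R(s_t, a_t) \,\Big|\, s, a \right] \tag{14}$$
$$= \mathbb{E}_\pi \left[ \sum_t \left( \int_{\gamma=0}^{1} \gamma^{kt}\, d\gamma \right) R(s_t, a_t) \,\Big|\, s, a \right] \tag{15}$$
$$= \int_{\gamma=0}^{1} \mathbb{E}_\pi \left[ \sum_t R(s_t, a_t) (\gamma^k)^t \,\Big|\, s, a \right] d\gamma \tag{16}$$
$$= \int_{\gamma=0}^{1} Q^{(\gamma^k)^t}_\pi(s, a)\, d\gamma \tag{17}$$
where $\Gamma_k(t)$ has been replaced in Equation 15 by $\int_{\gamma=0}^{1} \gamma^{kt}\, d\gamma$ and the exchange of sum and integral is valid if $\sum_{t=0}^{\infty} \gamma^{kt} r_t < \infty$. This shows us that we can compute the $Q^\Gamma_\pi$-value according to the hyperbolic discount factor by considering an infinite set of $Q^{\gamma^k}_\pi$-values computed through standard Q-learning. Examining further, each $\gamma \in [0, 1)$ results in TD-errors learned for a new discount factor $\gamma^k$. For values of $k < 1$, which extend the horizon of the hyperbolic discounting, this results in larger discount factors $\gamma^k$.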
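The identity derived above can be sanity-checked on a toy problem. The sketch below assumes a fixed, hypothetical reward sequence (so the expectation is trivial), computes the hyperbolically discounted return directly, and compares it with a midpoint-rule estimate of the integral over exponentially discounted returns with discount factor gamma^k. All names and constants are illustrative.

```python
def exp_q(rewards, gamma):
    """Exponentially discounted return sum_t gamma^t * r_t for a fixed reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def hyperbolic_q_exact(rewards, k):
    """Hyperbolically discounted return sum_t r_t / (1 + k * t)."""
    return sum(r / (1.0 + k * t) for t, r in enumerate(rewards))


def hyperbolic_q_integral(rewards, k, n=20_000):
    """Midpoint-rule estimate of the integral over gamma in [0, 1] of Q^{gamma^k}."""
    dg = 1.0 / n
    return sum(exp_q(rewards, ((i + 0.5) * dg) ** k) for i in range(n)) * dg


K = 0.1                                                        # hypothetical coefficient
rewards = [1.0, 0.0, 2.0, 0.5, 3.0, 1.0, 0.0, 4.0, 0.0, 1.0]   # hypothetical rewards

print(hyperbolic_q_exact(rewards, K), hyperbolic_q_integral(rewards, K))
```

The two values should agree closely, which is the finite-sample analogue of integrating exponentially discounted Q-values to obtain a hyperbolic one.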
5.2 Generalizing to Other Non-Exponential Q-Values
Equation 13 computes hyperbolic discount functions but its origin was not mathematically motivated. We consider here an alternative scheme to deduce ways to model hyperbolic as well as different discount schemes through integrals of $\gamma$.
Lemma 5.1.
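Let $Q^{\mathcal{H},\gamma}_\pi(s, a)$ be the state-action value function under exponential discounting in a hazardous MDP $\langle \mathcal{S}, \mathcal{A}, R, P, \mathcal{H}, \gamma^t \rangle$ and let $Q^{\mathcal{H},d}_\pi(s, a)$ refer to the value function in the same MDP except for new discounting, $\langle \mathcal{S}, \mathcal{A}, R, P, \mathcal{H}, d \rangle$. If there exists a function $w: [0, 1] \to \mathbb{R}$ such that
$$d(t) = \int_{0}^{1} w(\gamma)\, \gamma^t\, d\gamma \tag{18}$$
which we will refer to as the exponential weighting condition, then
$$Q^{\mathcal{H},d}_\pi(s, a) = \int_{0}^{1} w(\gamma)\, Q^{\mathcal{H},\gamma}_\pi(s, a)\, d\gamma \tag{19}$$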
Proof.
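Applying the condition on $d$,
$$Q^{\mathcal{H},d}_\pi(s, a) = \mathbb{E}_{\lambda}\, \mathbb{E}_{\pi, P_\lambda} \left[ \sum_{t=0}^{\infty} \left( \int_{0}^{1} w(\gamma)\, \gamma^t\, d\gamma \right) R(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right] \tag{20}$$
$$= \int_{0}^{1} \mathbb{E}_{\lambda}\, \mathbb{E}_{\pi, P_\lambda} \left[ \sum_{t=0}^{\infty} w(\gamma)\, \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right] d\gamma \tag{21}$$
$$= \int_{0}^{1} w(\gamma)\, Q^{\mathcal{H},\gamma}_\pi(s, a)\, d\gamma \tag{22}$$
where again the exchange is valid if $\sum_{t=0}^{\infty} \gamma^t r_t < \infty$. ∎
We can now see that the exponential weighting condition is satisfied for hyperbolic discounting and a list of other discounting schemes that we might want to consider.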
For instance, the hyperbolic discount can also be expressed as the integral of a different function $w(\gamma)$ for $\gamma \in [0, 1)$ in Equation 23.
$$\frac{1}{k} \int_{\gamma=0}^{1} \gamma^{1/k + t - 1}\, d\gamma = \frac{1}{1 + kt} \tag{23}$$
As before, an integral over a function $w(\gamma)\, \gamma^t = \frac{1}{k} \gamma^{1/k + t - 1}$ yields the desired hyperbolic discount factor $\Gamma_k(t)$. This integral can be derived by recognizing Equation 6 as the Laplace transform of the prior $\mathcal{H} = p(\lambda)$ and then applying a change of variables $\gamma = e^{-\lambda}$. Computing hyperbolic and other discount functions is demonstrated in detail in Appendix B. We summarize in Table 1 how a particular hazard prior can be computed via integrating over specific weightings and the corresponding discount function.
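Hazard prior | Density $p(\lambda)$ | Discount function $d(t)$ | Weighting $w(\gamma)$
Dirac Delta Prior | $\delta(\lambda - k)$ | $e^{-kt}$ | $\delta(\gamma - e^{-k})$
Exponential Prior | $\frac{1}{k} e^{-\lambda / k}$ | $\frac{1}{1 + kt}$ | $\frac{1}{k} \gamma^{1/k - 1}$
Uniform Prior | $\frac{1}{k}$ for $\lambda \in [0, k]$, else $0$ | $\frac{1}{kt} \left( 1 - e^{-kt} \right)$ | $\frac{1}{k \gamma}$ for $\gamma \in [e^{-k}, 1]$, else $0$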
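For the exponential prior, the change of variables mentioned above can be written out step by step. The following is a worked derivation under the assumption that the prior is $p(\lambda) = \frac{1}{k} e^{-\lambda/k}$ and that survival is the prior-weighted average of $e^{-\lambda t}$; it is included only to make the substitution explicit.

```latex
\begin{align*}
s(t) &= \int_{0}^{\infty} \frac{1}{k}\, e^{-\lambda/k}\, e^{-\lambda t}\, d\lambda
      && \text{(exponential prior in the survival integral)} \\
     &= \int_{0}^{1} \frac{1}{k}\, \gamma^{1/k}\, \gamma^{t}\, \frac{d\gamma}{\gamma}
      && (\gamma = e^{-\lambda},\quad d\lambda = -\,d\gamma / \gamma) \\
     &= \frac{1}{k} \int_{0}^{1} \gamma^{1/k + t - 1}\, d\gamma \\
     &= \frac{1}{k} \cdot \frac{1}{1/k + t} \;=\; \frac{1}{1 + kt}.
\end{align*}
```

Reading off the factor that multiplies $\gamma^t$ in the second line gives the weighting $w(\gamma) = \frac{1}{k} \gamma^{1/k - 1}$. The minus sign from the substitution is absorbed by swapping the integration limits, since $\lambda$ running from $0$ to $\infty$ corresponds to $\gamma$ running from $1$ down to $0$.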
6 Approximating Hyperbolic Values
Section 5 describes an equivalence between hyperbolically-discounted Q-values and integrals of exponentially-discounted Q-values, which requires evaluating an infinite set of value functions. We now present a practical approach to approximate hyperbolic discounting $\Gamma_k(t) = \frac{1}{1 + kt}$ using standard Q-learning (Watkins & Dayan, 1992).
6.1 Approximating the Discount Factor Integral
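To avoid estimating an infinite number of $Q^\gamma_\pi$-values we introduce a free hyperparameter ($n_\gamma$), which is the total number of $Q^\gamma_\pi$-values to consider, each with its own $\gamma$. We use a practically-minded approach to choose $\mathcal{G}$ that emphasizes evaluating larger values of $\gamma$ rather than uniformly choosing points, and that empirically performs well as seen in Section 7.
$$\mathcal{G} = [\gamma_0, \gamma_1, \cdots, \gamma_{n_\gamma}] \tag{24}$$
Our approach is described in Appendix C. Each $Q^{\gamma_i}_\pi$ computes the discounted sum of returns according to that specific discount factor, $Q^{\gamma_i}_\pi(s, a) = \mathbb{E}_\pi \left[ \sum_t (\gamma_i)^t r_t \mid s_0 = s, a_0 = a \right]$.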
We previously proposed two equivalent approaches for computing hyperbolic Q-values, but for simplicity we consider the one presented in Lemma 5.1. The set of $Q^{\gamma_i}_\pi$-values permits us to estimate the integral through a Riemann sum (Equation 26), which is described in further detail in Appendix D.
$$Q^\Gamma_\pi(s, a) = \int_{0}^{1} w(\gamma)\, Q^\gamma_\pi(s, a)\, d\gamma \tag{25}$$
$$\approx \sum_{\gamma_i \in \mathcal{G}} (\gamma_{i+1} - \gamma_i)\, w(\gamma_i)\, Q^{\gamma_i}_\pi(s, a) \tag{26}$$
where we estimate the integral through a lower bound. We consolidate this entire process in Figure 4, where we show the full process of rewriting the hyperbolic discount rate, the hyperbolically-discounted Q-value, the approximation and the instantiated agent. This approach is similar to that of Kurth-Nelson & Redish (2009), where each $\mu$Agent models a specific discount factor $\gamma$. However, ours differs in that our final agent computes a weighted average over each Q-value rather than a sampling operation of each agent based on a distribution.
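A minimal sketch of the Riemann-sum approximation follows. It assumes the hyperbolic weighting w(gamma) = (1/k) gamma^(1/k - 1), a fixed hypothetical reward sequence in place of learned Q-values, and a uniform grid of discount factors. The uniform grid is an assumption made for simplicity here; the paper's own choice of grid is described in its Appendix C and emphasizes larger discount factors.

```python
def exp_q(rewards, gamma):
    """Stand-in for a learned Q^{gamma}: exponentially discounted return of fixed rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def hyperbolic_weight(gamma, k):
    """Weighting w(gamma) = (1 / k) * gamma^(1/k - 1) for hyperbolic discounting."""
    return gamma ** (1.0 / k - 1.0) / k


def approx_hyperbolic_q(rewards, k, gammas):
    """Left Riemann sum over consecutive grid points (gammas[i], gammas[i + 1])."""
    total = 0.0
    for g_lo, g_hi in zip(gammas[:-1], gammas[1:]):
        total += (g_hi - g_lo) * hyperbolic_weight(g_lo, k) * exp_q(rewards, g_lo)
    return total


K = 0.1                # hypothetical hyperbolic coefficient
rewards = [1.0] * 20   # hypothetical reward of 1 per step for 20 steps
exact = sum(r / (1.0 + K * t) for t, r in enumerate(rewards))

for n_gamma in (10, 100, 1000):
    gammas = [i / n_gamma for i in range(n_gamma + 1)]  # uniform grid (assumption)
    print(n_gamma, approx_hyperbolic_q(rewards, K, gammas), exact)

approx = approx_hyperbolic_q(rewards, K, [i / 1000 for i in range(1001)])
```

Because the integrand is non-negative and increasing in the discount factor for this example, the left Riemann sum underestimates the integral, consistent with the lower-bound remark above. The coarse grids show why the placement of the grid points matters when only a handful of discount factors can be afforded.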
7 Pathworld Experiments
7.1 When to Discount Hyperbolically?
The benefits of hyperbolic discounting will be greatest under:

Uncertain hazard. The hazard rate characterizing the environment is not known. For instance, an unobserved hazard-rate variable $\lambda$ is drawn independently at the beginning of each episode from $\mathcal{H} = p(\lambda)$.

Non-trivial intertemporal decisions. The agent faces non-trivial intertemporal decisions. A non-trivial decision is one between smaller nearby rewards versus larger distant rewards. (Footnote 3: A trivial intertemporal decision is one between small distant rewards versus large close rewards.)
In the absence of either property we would not expect any advantage to discounting hyperbolically. As described before, if there is a single true hazard rate $\lambda_{\text{env}}$, then an optimal $\gamma^* = e^{-\lambda_{\text{env}}}$ exists and future rewards should be discounted exponentially according to it. Further, without non-trivial intertemporal trade-offs, which would be the case if there is one path through the environment with perfect alignment of short- and long-term objectives, all discounting schemes will yield the same optimal policy.
7.2 Pathworld Details
We note two sources for discounting rewards in the future: time delay and survival probability (Section 4). In the Pathworld of Figure 5, we train to maximize hyperbolically discounted returns ($\sum_t \Gamma_k(t) R(s_t, a_t)$) under no hazard ($\mathcal{H} = \delta(\lambda - 0)$) but then evaluate the undiscounted returns ($d(t) = 1$ for all $t$) with the paths subject to hazard $\mathcal{H} = \frac{1}{k} \exp(-\lambda / k)$. Through this procedure, we are able to train an agent that is robust to hazards in the environment.
The agent makes one decision in Pathworld (Figure 5): which of the $N$ paths to investigate. Once a path is chosen, the agent continues until it reaches the end or until it dies. This is similar to a multi-armed bandit, with each action subject to dynamic risk. The paths vary quadratically in length with the index, $d(i) = i^2$, but the rewards increase linearly with the path index, $r(i) = i$. This presents a non-trivial decision for the agent. At deployment, an unobserved hazard $\lambda \sim \mathcal{H}$ is drawn and the agent is subject to a per-time-step risk of dying of $(1 - e^{-\lambda})$. This environment differs from the adjusting-delay procedure presented by Mazur (1987) and then later modified by Kurth-Nelson & Redish (2009). Rather than determining time-preferences through variable timing of rewards, we determine time-preferences through risk to the reward.
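To see why the choice of discount function matters in this setting, the sketch below values each path under a hyperbolic discount and under several exponential discounts. It assumes that path i pays a single reward of i after i squared steps, that the number of paths is 8, and that the hyperbolic coefficient is 0.05; the path count and the single-terminal-reward simplification are assumptions for illustration, and the printed errors are not meant to reproduce the paper's reported numbers.

```python
def hyperbolic_discount(t, k):
    """Hyperbolic discount 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def path_values(n_paths, discount):
    """Value of each path i = 1..n_paths: reward i received after i**2 steps."""
    return [i * discount(i ** 2) for i in range(1, n_paths + 1)]


def mean_squared_error(xs, ys):
    """Mean squared difference between two equally long value lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


K = 0.05     # illustrative hazard-prior coefficient
N_PATHS = 8  # hypothetical number of paths

hyper = path_values(N_PATHS, lambda t: hyperbolic_discount(t, K))
print("hyperbolic values:", [round(v, 3) for v in hyper])

for gamma in (0.9, 0.975, 0.99):
    expo = path_values(N_PATHS, lambda t, g=gamma: g ** t)
    print(gamma, [round(v, 3) for v in expo], round(mean_squared_error(hyper, expo), 3))
```

Under these assumptions no single exponential discount factor traces the hyperbolic value curve across all paths, even when it happens to rank the best path correctly.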
7.3 Results in Pathworld
Figure 7 validates that our approach well-approximates the true hyperbolic value of each path when the hazard prior matches the true distribution. Agents that discount exponentially according to a single $\gamma$ (as is commonly the case in RL) incorrectly value the paths.
Discount function  MSE 

hyperbolic value  0.002 
$\gamma$=0.975  0.566 
$\gamma$=0.95  1.461 
$\gamma$=0.9  2.253 
$\gamma$=0.99  2.288 
$\gamma$=0.75  2.809 
We examine further the failure of exponential discounting in this hazardous setting. For this environment, the true hazard parameter in the prior was $k = 0.05$ (i.e. $\lambda \sim 20 \exp(-\lambda / 0.05)$). Therefore, at deployment, the agent must deal with dynamic levels of risk and faces a non-trivial decision of which path to follow. Even if we tune an agent's $\gamma$ such that it chooses the correct arg-max path, it still fails to capture the functional form (Figure 7) and it achieves a high error over all paths (Table 7). If the arg-max action were not available, or if the agent were asked to evaluate non-trivial intertemporal decisions, it would act sub-optimally.
In the next two experiments we consider the more realistic case where the agent's prior over hazard does not exactly match the environment's true hazard rate. In Figure 9 we consider the case where the agent still holds an exponential prior but has the wrong coefficient $k$, and in Figure 11 we consider the case where the agent still holds an exponential prior but the true hazard is actually drawn from a uniform distribution with the same mean.
Discount function  MSE 

k=0.05  0.002 
k=0.1  0.493 
k=0.025  0.814 
k=0.2  1.281 
Discount function  MSE 

hyperbolic value  0.235 
0.266  
0.470  
4.029 
Through these two validating experiments, we demonstrate the robustness of estimating hyperbolically discounted Q-values in the case when the environment presents dynamic levels of risk and the agent faces non-trivial decisions. Hyperbolic discounting is preferable to exponential discounting even when the agent's prior does not precisely match the true environment hazard rate distribution, by coefficient (Figure 9) or by functional form (Figure 11).
8 Atari 2600 Experiments
With our approach validated in Pathworld, we now move to the high-dimensional environment of Atari 2600, specifically, ALE. We use the Rainbow variant from Dopamine (Castro et al., 2018), which implements three of the six considered improvements from the original paper: distributional RL, predicting n-step returns and prioritized replay buffers.
The agent (Figure 12) maintains a shared representation $h(s)$ of state, but computes Q-value logits for each of the $n_\gamma$ discount factors $\gamma_i$ via $Q^{(i)}_\pi(s, a) = f(W_i h(s) + b_i)$, where $f(\cdot)$ is a ReLU nonlinearity (Nair & Hinton, 2010) and $W_i$ and $b_i$ are the learnable parameters of the affine transformation for that head.
We provide details on the hyperparameters in Appendix G. We consider the performance of the hyperbolic agent built on Rainbow (referred to as Hyper-Rainbow) on a random subset of Atari 2600 games in Figure 13.
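The multi-head structure can be sketched in a few lines. This is a deliberately simplified, dependency-free illustration: each head applies its own affine map and a ReLU to a shared feature vector and outputs one scalar per action, whereas the actual agent is built on a distributional Rainbow network. All sizes, names, and the random parameters are hypothetical.

```python
import random


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def head_q_values(h, weight, bias):
    """One head: ReLU(W_i h + b_i), giving one value per action."""
    return [relu(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(weight, bias)]


def multi_head_q(h, heads):
    """Apply every head (one per discount factor) to the same shared representation h."""
    return [head_q_values(h, weight, bias) for weight, bias in heads]


rng = random.Random(0)
FEATURES, ACTIONS, N_GAMMA = 4, 3, 5  # hypothetical sizes

# One (W_i, b_i) pair per discount factor; W_i has shape ACTIONS x FEATURES.
heads = [
    (
        [[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(ACTIONS)],
        [rng.uniform(-1, 1) for _ in range(ACTIONS)],
    )
    for _ in range(N_GAMMA)
]
h = [rng.uniform(0, 1) for _ in range(FEATURES)]  # stand-in for the shared features

q = multi_head_q(h, heads)
print(len(q), "heads x", len(q[0]), "actions")
```

The point of the design is that the expensive representation is computed once, and each additional discount factor costs only one extra affine head.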
We find that the Hyper-Rainbow agent (blue) performs very well, often improving over the strong baseline Rainbow agent. On this subset of 19 games, we find that it improves upon 14 games and, in some cases, by large margins. However, in Section 9 we seek a more complete understanding of the underlying driver of this improvement in ALE through an ablation study.
9 Multi-Horizon Auxiliary Task Results
To dissect the ALE improvements, recognize that Hyper-Rainbow changes two properties from the base Rainbow agent:

Behavior policy. The agent acts according to hyperbolic Q-values computed by our approximation described in Section 6.

Learn over multiple horizons. The agent simultaneously learns Q-values over many $\gamma$ rather than a Q-value for a single $\gamma$.
The second modification can be regarded as introducing an auxiliary task (Jaderberg et al., 2016). Therefore, to attribute the performance of each properly, we construct a Rainbow agent augmented with the multi-horizon auxiliary task (referred to as MultiRainbow and shown in orange) but have it still act according to the original policy. That is, MultiRainbow acts to maximize expected rewards discounted by a fixed $\gamma$ but now learns over multiple horizons as shown in Figure 12.
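The difference between the two behavior policies can be illustrated with a small sketch. It assumes that head $i$ is trained with discount `eval_gammas[i] ** k` and that the hyperbolic Q-value is approximated by a lower Riemann sum over the evaluation points. The function names and the toy Q-values are hypothetical and are not taken from the released code.

```python
import numpy as np

def hyperbolic_q(q_heads, eval_gammas):
    """Lower Riemann sum over heads; q_heads has shape (n_gamma, n_actions)."""
    # Width of each rectangle; the last one extends to gamma = 1.
    widths = np.diff(np.append(eval_gammas, 1.0))
    return widths @ q_heads

def act_hyperbolic(q_heads, eval_gammas):
    """HyperRainbow-style choice: maximize the approximate hyperbolic Q-value."""
    return int(np.argmax(hyperbolic_q(q_heads, eval_gammas)))

def act_largest_gamma(q_heads):
    """MultiRainbow-style choice: use only the head with the largest gamma."""
    return int(np.argmax(q_heads[-1]))

# Toy example: action 0 pays 1 immediately, action 1 pays 3 after 30 steps.
k = 0.1
eval_gammas = np.linspace(0.0, 1.0, 1000, endpoint=False)
head_discounts = eval_gammas ** k        # discount used by each head
q_heads = np.stack([np.ones_like(head_discounts),
                    3.0 * head_discounts ** 30], axis=1)
```

In this toy case the hyperbolic policy prefers the immediate reward (about 1 versus roughly 3/(1 + 0.1·30) = 0.75), while the largest-$\gamma$ head prefers the delayed one, so the two agents can disagree even though they share the same learned heads.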
We find that the MultiRainbow agent performs nearly as well on these games, suggesting the effectiveness of this as a stand-alone auxiliary task. This is not entirely unexpected given the rather special case of hazard exhibited in ALE through sticky actions (Machado et al., 2018).
We examine this further and investigate the performance of this auxiliary task across the full Arcade Learning Environment (Bellemare et al., 2017) using the evaluation recommended by Machado et al. (2018). Doing so, we find empirical benefits of the multi-horizon auxiliary task on the Rainbow agent as shown in Figure 14.
9.1 Analysis and Ablation Studies
To understand the interplay of the multi-horizon auxiliary task with other improvements in deep RL, we test a random subset of 10 Atari 2600 games against improvements in Rainbow (Hessel et al., 2018). On this set of games we measure a consistent improvement with multi-horizon C51 (MultiC51) over the base C51 agent (Bellemare et al., 2017) in 9 out of the 10 games (Figure 15).
Figure 15 indicates that the current implementation of MultiRainbow does not generally build successfully on the prioritized replay buffer. On the subset of ten games considered, we find that four (Pong, Venture, Gravitar and Zaxxon) are negatively impacted, even though Hessel et al. (2018) found prioritized replay to be of considerable benefit in general and specifically beneficial in three of these four games (Venture was not considered). The current prioritization scheme simply averages the temporal-difference errors over all $\gamma$ values to establish priority. Alternative prioritization schemes are offering encouraging preliminary results (Appendix E).
10 Discussion
This work builds on a body of work that questions one of the basic premises of RL: that one should maximize the exponentially discounted returns via a single discount factor. By learning over multiple horizons simultaneously, we have broadened the scope of our learning algorithms. Through this we have shown that we can enable acting according to new discounting schemes and that learning multiple horizons is a powerful stand-alone auxiliary task. Our method approximates hyperbolic discounting well and performs better in hazardous MDP distributions. This may be viewed as part of an algorithmic toolkit to model alternative discount functions.
11 Future Work
There is growing interest in the time-preferences of RL agents. Through this work we have considered models of a constant, albeit uncertain, hazard rate $\lambda$. This moves beyond the canonical RL approach of fixing a single $\gamma$, which implicitly holds no uncertainty on the value of $\lambda$, but it still does not fully capture all aspects of risk since the hazard rate may be a function of time. Further, hazard may not be an intrinsic property of the environment but a joint property of both the policy and the environment. If an agent pursues a policy leading to dangerous state distributions then it will naturally be subject to higher hazards, and vice versa. We would therefore expect an interplay between time-preferences and policy. This is not simple to deal with, but recent work proposing state-action dependent discounting (Pitis, 2019) may provide a formalism for more general time-preference schemes.
Acknowledgements
This research and its general framing drew upon the talents of many researchers at Google Brain, DeepMind and Mila. In particular, we’d like to thank Ryan Sepassi for framing of the paper, Utku Evci for last-minute Matplotlib help, and Audrey Durand, Margaret Li, Adrien Ali Taïga, Ofir Nachum, Doina Precup, Jacob Buckman, Marcin Moczulski, Nicolas Le Roux, Ben Eysenbach, Sherjil Ozair, Anirudh Goyal, Ryan Lowe, Robert Dadashi, Chelsea Finn, Sergey Levine, Graham Taylor and Irwan Bello for general discussions and revisions.
References
 Ainslie (1975) George Ainslie. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychological bulletin, 82(4):463, 1975.
 Ainslie (1992) George Ainslie. Picoeconomics: The strategic interaction of successive motivational states within the person. Cambridge University Press, 1992.
 Alexander & Brown (2010) William H Alexander and Joshua W Brown. Hyperbolically discounted temporal difference learning. Neural computation, 22(6):1511–1527, 2010.

 Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
 Bellman (1957) Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
 Bellman (1958) Richard Bellman. On a routing problem. Quarterly of applied mathematics, 16(1):87–90, 1958.
 Bertsekas (1995) Dimitri P Bertsekas. Neuro-dynamic programming: an overview. 1995.
 Bertsekas & Tsitsiklis (1996) Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.
 Burda et al. (2018) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 Castro et al. (2018) Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. CoRR, abs/1812.06110, 2018. URL http://arxiv.org/abs/1812.06110.
 Dasgupta & Maskin (2005) Partha Dasgupta and Eric Maskin. Uncertainty and hyperbolic discounting. American Economic Review, 95(4):1290–1299, 2005.
 Daw (2003) Nathaniel D Daw. Reinforcement learning models of the dopamine system and their behavioral implications. PhD thesis, Carnegie Mellon University, 2003.
 Daw & Touretzky (2000) Nathaniel D Daw and David S Touretzky. Behavioral considerations suggest an average reward td model of the dopamine system. Neurocomputing, 32:679–684, 2000.
 Dayan & Hinton (1993) Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278, 1993.
 Dolgov & Durfee (2005) Dmitri Dolgov and Edmund Durfee. Stationary deterministic policies for constrained mdps with multiple rewards, costs, and discount factors. Ann Arbor, 1001:48109, 2005.
 Edwards et al. (2015) Ashley Edwards, Michael L Littman, and Charles L Isbell. Expressing tasks robustly via multiple discount factors. 2015.
 Feinberg & Shwartz (1994) Eugene A Feinberg and Adam Shwartz. Markov decision models with weighted discounted criteria. Mathematics of Operations Research, 19(1):152–168, 1994.
 Feinberg & Shwartz (1999) Eugene A Feinberg and Adam Shwartz. Constrained dynamic programming with two discount factors: Applications and an algorithm. IEEE Transactions on Automatic Control, 44(3):628–631, 1999.
 François-Lavet et al. (2015) Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. How to discount deep reinforcement learning: Towards new dynamic strategies. arXiv preprint arXiv:1512.02011, 2015.
 Frederick et al. (2002) Shane Frederick, George Loewenstein, and Ted O’Donoghue. Time discounting and time preference: A critical review. Journal of economic literature, 40(2):351–401, 2002.
 French (1999) Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 Fu & Anderson (2006) Wai-Tat Fu and John R Anderson. From recurrent choice to skill learning: A reinforcement-learning model. Journal of experimental psychology: General, 135(2):184, 2006.
 Green & Myerson (2004) Leonard Green and Joel Myerson. A discounting framework for choice with delayed and probabilistic rewards. Psychological bulletin, 130(5):769, 2004.
 Green et al. (1981) Leonard Green, Ewin B Fisher, Steven Perlow, and Lisa Sherman. Preference reversal and self control: Choice as a function of reward amount and delay. Behaviour Analysis Letters, 1981.
 Green et al. (1994) Leonard Green, Nathanael Fristoe, and Joel Myerson. Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin & Review, 1(3):383–389, 1994.
 Green et al. (1997) Leonard Green, Joel Myerson, and Edward McFadden. Rate of temporal discounting decreases with amount of reward. Memory & cognition, 25(5):715–723, 1997.
 Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kacelnik (1997) Alex Kacelnik. Normative and descriptive models of decision making: time discounting and risk sensitivity. Characterizing human psychological adaptations, 208:51–66, 1997.
 Kaelbling et al. (1998) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1–2):99–134, 1998.
 Kearns & Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2–3):209–232, 2002.
 Kurth-Nelson & Redish (2009) Zeb Kurth-Nelson and A David Redish. Temporal-difference reinforcement learning with distributed representations. PLoS One, 4(10):e7362, 2009.
 Lample & Chaplot (2017) Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement learning. 2017.
 Lattimore & Hutter (2011) Tor Lattimore and Marcus Hutter. Time consistent discounting. In International Conference on Algorithmic Learning Theory, pp. 383–397. Springer, 2011.
 Loewenstein (1996) George Loewenstein. Out of control: Visceral influences on behavior. Organizational behavior and human decision processes, 65(3):272–292, 1996.
 Machado et al. (2018) Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018.
 Maia (2009) Tiago V Maia. Reinforcement learning, conditioning, and the brain: Successes and challenges. Cognitive, Affective, & Behavioral Neuroscience, 9(4):343–364, 2009.
 Mazur (1985) James E Mazur. Probability and delay of reinforcement as factors in discrete-trial choice. Journal of the Experimental Analysis of Behavior, 43(3):341–351, 1985.
 Mazur (1987) James E Mazur. An adjusting procedure for studying delayed reinforcement. 1987.
 Mazur (1997) James E Mazur. Choice, delay, probability, and conditioned reinforcement. Animal Learning & Behavior, 25(2):131–147, 1997.
 McCloskey & Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
 Mirowski et al. (2016) Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Montague et al. (1996) P Read Montague, Peter Dayan, and Terrence J Sejnowski. A framework for mesencephalic dopamine systems based on predictive hebbian learning. Journal of neuroscience, 16(5):1936–1947, 1996.
 Myerson & Green (1995) Joel Myerson and Leonard Green. Discounting of delayed rewards: Models of individual choice. Journal of the experimental analysis of behavior, 64(3):263–276, 1995.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
 OpenAI (2018) OpenAI. Openai five. https://blog.openai.com/openaifive/, 2018.
 Pitis (2019) Silviu Pitis. Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 2019.
 Prokhorov & Wunsch (1997) Danil V Prokhorov and Donald C Wunsch. Adaptive critic designs. IEEE transactions on Neural Networks, 8(5):997–1007, 1997.
 Rangel et al. (2008) Antonio Rangel, Colin Camerer, and P Read Montague. A framework for studying the neurobiology of value-based decision making. Nature reviews neuroscience, 9(7):545, 2008.
 Redish & Kurth-Nelson (2010) A David Redish and Zeb Kurth-Nelson. Neural models of temporal discounting. 2010.
 Reinke et al. (2017) Chris Reinke, Eiji Uchibe, and Kenji Doya. Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing, pp. 789–800. Springer, 2017.
 Romoff et al. (2019) Joshua Romoff, Peter Henderson, Ahmed Touati, Yann Ollivier, Emma Brunskill, and Joelle Pineau. Separating value functions across timescales. arXiv preprint arXiv:1902.01883, 2019.
 Samuelson (1937) Paul A Samuelson. A note on measurement of utility. The review of economic studies, 4(2):155–161, 1937.
 Schultz et al. (1997) Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
 Schweighofer et al. (2008) Nicolas Schweighofer, Mathieu Bertin, Kazuhiro Shishida, Yasumasa Okamoto, Saori C Tanaka, Shigeto Yamawaki, and Kenji Doya. Low-serotonin levels increase delayed reward discounting in humans. Journal of Neuroscience, 28(17):4528–4532, 2008.
 Sherstan et al. (2018) Craig Sherstan, James MacGlashan, and Patrick M. Pilarski. Generalizing value estimation over timescales. In FAIM Workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.
 Singh (1992) Satinder P Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Machine Learning Proceedings 1992, pp. 406–415. Elsevier, 1992.
 Sozou (1998) Peter D Sozou. On hyperbolic discounting and uncertain hazard rates. Proceedings of the Royal Society of London B: Biological Sciences, 265(1409):2015–2020, 1998.
 Strotz (1955) Robert Henry Strotz. Myopia and inconsistency in dynamic utility maximization. The Review of Economic Studies, 23(3):165–180, 1955.
 Suddarth & Kergosien (1990) Steven C Suddarth and YL Kergosien. Rule-injection hints as a means of improving network performance and learning time. In Neural Networks, pp. 120–129. Springer, 1990.
 Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 Sutton (1995) Richard S Sutton. Td models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pp. 531–539. Elsevier, 1995.
 Sutton & Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 1998.
 Sutton et al. (2011) Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
 Tanaka et al. (2004) Saori C Tanaka, Kenji Doya, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, and Shigeto Yamawaki. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neuroscience, 7:887, 2004. URL https://doi.org/10.1038/nn1279.
 Veeriah et al. (2018) Vivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.
 Watkins & Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3–4):279–292, 1992.
 White (2017) Martha White. Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3742–3750. JMLR.org, 2017.
 Xu et al. (2018) Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. arXiv preprint arXiv:1805.09801, 2018.
Appendix A Equivalence of Hyperbolic Discounting and Exponential Hazard
Appendix B Alternative Discount Functions
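We expand upon three special cases to see how different priors over the hazard rate $\lambda$ may be related to different discount functions $d(t)$.
Three cases:

Delta hazard prior: $p(\lambda) = \delta(\lambda - k)$

Exponential hazard prior: $p(\lambda) = \frac{1}{k}\exp(-\lambda/k)$

Uniform hazard prior: $p(\lambda) = \frac{1}{k}$ for $\lambda \in [0, k]$
For the three cases we begin with the Laplace transform of the prior $p(\lambda)$ and then change the variables according to the relation $\gamma = e^{-\lambda}$, Equation 5.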
B.1 Delta Hazard Prior
A delta prior on the hazard rate is consistent with exponential discounting.
Here $\delta(\lambda - k)$ is a Dirac delta function defined over variable $\lambda$ with value $k$. The change of variable $\gamma = e^{-\lambda}$ (equivalently $\lambda = -\ln\gamma$) yields differentials $d\lambda = -\frac{1}{\gamma}d\gamma$ and the limits $\lambda = 0 \to \gamma = 1$ and $\lambda = \infty \to \gamma = 0$. Additionally, the hazard rate value $\lambda = k$ is equivalent to $\gamma = e^{-k}$.
We define $\gamma_k = e^{-k}$ to make the connection to standard RL discounting explicit. Additionally and reiterating, the use of a single discount factor, in this case $\gamma_k$, is equivalent to the prior that a single hazard exists in the environment.
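As a worked version of this case, and assuming the discount function is the Laplace transform of the hazard prior, $d(t) = \int_0^\infty p(\lambda)e^{-\lambda t}\,d\lambda$, the delta prior collapses to a single exponential. The notation $\gamma_k$ follows the definition above; the intermediate steps are a sketch rather than the original derivation.

```latex
\begin{align*}
d(t) &= \int_{0}^{\infty} \delta(\lambda - k)\, e^{-\lambda t}\, d\lambda
      = e^{-kt}, \\
\intertext{and, after substituting $\gamma = e^{-\lambda}$ so that
$e^{-\lambda t} = \gamma^{t}$ and $d\lambda = -\gamma^{-1}\, d\gamma$,}
d(t) &= \int_{0}^{1} \delta(-\ln\gamma - k)\, \gamma^{t-1}\, d\gamma
      = \left(e^{-k}\right)^{t} = \gamma_k^{\,t}.
\end{align*}
```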
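B.2 Exponential Hazard Prior
Again, the change of variable $\gamma = e^{-\lambda}$ yields differentials $d\lambda = -\frac{1}{\gamma}d\gamma$ and the limits $\lambda = 0 \to \gamma = 1$ and $\lambda = \infty \to \gamma = 0$.
Here $p(\cdot)$ is the prior. With the exponential prior $p(\lambda) = \frac{1}{k}\exp(-\lambda/k)$ and by substituting $\lambda = -\ln\gamma$ we verify Equation 23.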
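The verification can be written out as follows. This is a sketch that assumes the exponential prior $p(\lambda) = \frac{1}{k}\exp(-\lambda/k)$ and the Laplace-transform form of the discount function; both routes give the hyperbolic discount $\frac{1}{1+kt}$.

```latex
\begin{align*}
d(t) &= \int_{0}^{\infty} \frac{1}{k}\, e^{-\lambda/k}\, e^{-\lambda t}\, d\lambda
      = \frac{1}{k} \cdot \frac{1}{\tfrac{1}{k} + t}
      = \frac{1}{1 + kt}, \\
\intertext{or equivalently, with $\gamma = e^{-\lambda}$,}
d(t) &= \frac{1}{k} \int_{0}^{1} \gamma^{\frac{1}{k} + t - 1}\, d\gamma
      = \frac{1}{k} \cdot \frac{1}{\tfrac{1}{k} + t}
      = \frac{1}{1 + kt}.
\end{align*}
```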
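B.3 Uniform Hazard Prior
Finally, if we hold a uniform prior over hazard, $p(\lambda) = \frac{1}{k}$ for $\lambda \in [0, k]$, then Sozou (1998) shows the Laplace transform yields
Use the same change of variables to relate this to $\gamma$. The bounds of the integral become $\lambda = 0 \to \gamma = 1$ and $\lambda = k \to \gamma = e^{-k}$.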
which recovers the discounting scheme.
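For reference, the uniform case can be worked through directly. The steps below are a sketch under the assumption that the prior is $p(\lambda) = \frac{1}{k}$ on $[0, k]$ and zero elsewhere.

```latex
\begin{align*}
d(t) &= \int_{0}^{k} \frac{1}{k}\, e^{-\lambda t}\, d\lambda
      = \frac{1 - e^{-kt}}{kt}, \\
\intertext{and, with $\gamma = e^{-\lambda}$ and the bounds
$\gamma \in [e^{-k}, 1]$,}
d(t) &= \frac{1}{k} \int_{e^{-k}}^{1} \gamma^{t-1}\, d\gamma
      = \frac{1}{k} \cdot \frac{1 - e^{-kt}}{t}
      = \frac{1 - e^{-kt}}{kt}.
\end{align*}
```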
Appendix C Determining the $\gamma$ Interval
We provide further detail on which $\gamma$ we choose to model and the motivation why. We choose a $\gamma_{max}$, which is the largest $\gamma$ to learn through Bellman updates. If we are using $k$ as the hyperbolic coefficient in Equation 7 and we are approximating the integral with $n_\gamma$ values of $\gamma$, our $\gamma_{max}$ would be
(27) 
However, allowing $\gamma_{max}$ to get arbitrarily close to 1 may result in learning instabilities (Bertsekas, 1995). Therefore we compute an exponentiation base $b$ that bounds our $\gamma_{max}$ at a known stable value. This induces an approximation error, which is described further in Appendix F.
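One way to realize such a bounded interval is sketched below. The spacing rule is an assumption made for illustration: evaluation points are taken as $1 - b^i$, the learned discounts are those points raised to the power $k$, and $b$ is chosen so that the largest learned discount equals $\gamma_{max}$ exactly. The function name `gamma_interval` is hypothetical, and the defaults mirror the values listed in Appendix G.

```python
import math

def gamma_interval(gamma_max=0.99, k=0.01, n_gamma=10):
    """Return (b, eval_gammas, learn_gammas) for an assumed spacing rule.

    eval_gammas[i]  = 1 - b**(i + 1)        points used in the Riemann sum
    learn_gammas[i] = eval_gammas[i] ** k   discounts used in Bellman updates
    """
    # Choose b so that (1 - b**n_gamma) ** k == gamma_max.
    b = math.exp(math.log(1.0 - gamma_max ** (1.0 / k)) / n_gamma)
    eval_gammas = [1.0 - b ** i for i in range(1, n_gamma + 1)]
    learn_gammas = [g ** k for g in eval_gammas]
    return b, eval_gammas, learn_gammas

b, eval_gammas, learn_gammas = gamma_interval()
```

Under this rule no head is trained with a discount above `gamma_max`, at the cost of truncating the upper end of the integral.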
Appendix D Estimating Hyperbolic Coefficients
As discussed, we can estimate the hyperbolic discount in two different ways. We illustrate both estimates and the resulting approximations here. We use lower-bound Riemann sums in both cases for simplicity, but more sophisticated integral estimates exist.
Figure 16(a): We approximate the integral of the function $\gamma^{kt}$ via a lower estimate of rectangles at specific $\gamma$-values. The sum of these rectangles approximates the hyperbolic discounting scheme $\frac{1}{1+kt}$ for time $t$.
Figure 16(b): Alternative form for approximating hyperbolic coefficients, which is sharply peaked as $\gamma \to 1$ and led to larger errors in estimation under our initial techniques.
As noted earlier, we considered two different integrals for computing the hyperbolic coefficients. Under the form derived by the Laplace transform, the integrals are sharply peaked as $\gamma \to 1$. The difference between the integrals is visually apparent when comparing the two panels of Figure 16.
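The first form can be checked numerically with a short sketch. It assumes the integrand is $\gamma^{kt}$ on $[0, 1]$, whose exact integral is $\frac{1}{1+kt}$, and uses evenly spaced left endpoints; the function name and the grid are illustrative choices, not the spacing used in the experiments.

```python
import numpy as np

def hyperbolic_discount_lower_sum(t, k, n_points):
    """Lower Riemann sum of gamma**(k*t) over [0, 1] with left endpoints."""
    gammas = np.linspace(0.0, 1.0, n_points, endpoint=False)
    width = 1.0 / n_points
    # The integrand is non-decreasing in gamma, so left endpoints
    # give a lower estimate of the integral.
    return float(np.sum(width * gammas ** (k * t)))

k, t = 0.1, 10.0
exact = 1.0 / (1.0 + k * t)                        # hyperbolic discount
coarse = hyperbolic_discount_lower_sum(t, k, 10)   # few rectangles
fine = hyperbolic_discount_lower_sum(t, k, 10000)  # many rectangles
```

With only a handful of rectangles the estimate undershoots the hyperbolic discount noticeably, and the gap shrinks as more $\gamma$-values are used.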
Appendix E Performance of Different Replay Buffer Prioritization Schemes
As found through our ablation study in Figure 15, the MultiRainbow auxiliary task interacted poorly with the prioritized replay buffer when the TD-errors were averaged evenly across all heads. As an alternative scheme, we considered prioritizing according to the largest $\gamma$, which is also the $\gamma$ defining the Q-values by which the agent acts.
The preliminary results of this new prioritization scheme are shown in Figure 17. (Footnote 4: these runs have been computed over approximately 100 out of 200 iterations and will be updated for the final version.)
To this point, there is evidence that prioritizing according to the TD-errors generated by the largest $\gamma$ is a better strategy than averaging.
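The two prioritization schemes can be contrasted in a few lines. This is a hypothetical sketch: it assumes a per-head error array of shape `(n_gamma, batch)` with heads ordered by increasing $\gamma$, and it uses absolute error magnitudes as priorities. The actual agent is distributional, so its error signal may be defined differently.

```python
import numpy as np

def priority_mean(td_errors):
    """Average absolute TD-error over all heads; input is (n_gamma, batch)."""
    return np.mean(np.abs(td_errors), axis=0)

def priority_largest_gamma(td_errors):
    """Absolute TD-error of the last head, assumed to hold the largest gamma."""
    return np.abs(td_errors[-1])

# Toy batch of two transitions and three heads.
td_errors = np.array([[0.1, -2.0],
                      [0.2, -1.0],
                      [3.0,  0.0]])
mean_priority = priority_mean(td_errors)
acting_priority = priority_largest_gamma(td_errors)
```

In the toy batch, averaging ranks the two transitions almost equally, whereas the largest-$\gamma$ scheme strongly favors the first transition, the one with the largest error in the head that drives behavior.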
Appendix F Approximation Errors
Instead of evaluating the upper bound of Equation 23 at 1, we evaluate it at $\gamma_{max}$. This truncation induces an error in the approximation of the hyperbolic discount.
Discount function  MSE
$\gamma_{max}$ = 0.999  0.002
$\gamma_{max}$ = 0.9999  0.003
$\gamma_{max}$ = 0.99  0.233
$\gamma_{max}$ = 0.95  1.638
$\gamma_{max}$ = 0.9  2.281
Appendix G Hyperparameters
For all our experiments in DQN (Mnih et al., 2015), C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018), we benchmark against the baselines set by Castro et al. (2018) and we use the default hyperparameters for each of the respective algorithms. That is, our Multi-agent uses the same optimization, learning rates, and hyperparameters as its base class.
Hyperparameter  Value 
Runner.sticky_actions  Sticky actions prob 0.25 
Runner.num_iterations  200 
Runner.training_steps  250000 
Runner.evaluation_steps  125000 
Runner.max_steps_per_episode  27000 
WrappedPrioritizedReplayBuffer.replay_capacity  1000000 
WrappedPrioritizedReplayBuffer.batch_size  32 
RainbowAgent.num_atoms  51 
RainbowAgent.vmax  10. 
RainbowAgent.update_horizon  3 
RainbowAgent.min_replay_history  20000 
RainbowAgent.update_period  4 
RainbowAgent.target_update_period  8000 
RainbowAgent.epsilon_train  0.01 
RainbowAgent.epsilon_eval  0.001 
RainbowAgent.epsilon_decay_period  250000 
RainbowAgent.replay_scheme  ’prioritized’ 
RainbowAgent.tf_device  ’/gpu:0’ 
RainbowAgent.optimizer  @tf.train.AdamOptimizer() 
tf.train.AdamOptimizer.learning_rate  0.0000625 
tf.train.AdamOptimizer.epsilon  0.00015 
HyperRainbowAgent.number_of_gamma  10 
HyperRainbowAgent.gamma_max  0.99 
HyperRainbowAgent.hyp_exponent  0.01 
HyperRainbowAgent.acting_policy  ’largest_gamma’ 
Appendix H Auxiliary Task Results
Final results of the multi-horizon auxiliary task on Rainbow (MultiRainbow) are given in Table 3.
Game Name  DQN  C51  Rainbow  MultiRainbow 

AirRaid  8190.3  9191.2  16941.2  12659.5 
Alien  2666.0  2611.4  3858.9  3917.2 
Amidar  1306.0  1488.2  2805.7  2477.0 
Assault  1661.6  2079.0  3815.9  3415.1 
Asterix  3772.5  15289.5  19789.2  24385.6 
Asteroids  844.7  1241.5  1524.1  1654.5 
Atlantis  935784.0  894862.0  890592.0  923276.7 
BankHeist  723.5  863.4  1209.0  1132.0 
BattleZone  20508.5  28323.2  42911.1  38827.1 
BeamRider  6326.4  6070.6  7026.7  7610.9 
Berzerk  590.3  538.3  864.0  879.1 
Bowling  40.3  49.8  68.8  62.9 
Boxing  83.3  83.5  98.8  99.3 
Breakout  146.6  254.1  123.9  162.5 
Carnival  4967.9  4917.1  5211.8  5072.2 
Centipede  3419.9  8068.9  6878.0  6946.6 
ChopperCommand  3084.5  6230.4  13415.1  13942.9 
CrazyClimber  113992.2  146072.3  151454.9  160161.0 
DemonAttack  7229.2  8485.1  19738.0  14780.9 
DoubleDunk  4.5  2.7  22.6  21.9 
ElevatorAction  2434.3  73416.0  81958.0  85633.3 
Enduro  895.0  1652.9  2290.1  2337.5 
FishingDerby  12.4  16.6  44.5  45.1 
Freeway  26.3  33.8  33.8  33.8 
Frostbite  1609.6  4522.8  8988.5  7929.7 
Gopher  6685.8  8301.1  11749.6  13664.6 
Gravitar  339.1  709.8  1293.0  1638.7 
Hero  17548.5  34117.8  47545.4  50141.8 
IceHockey  5.0  3.3  2.6  6.3 
Jamesbond  618.3  816.5  1263.8  773.4 
JourneyEscape  2604.2  1759.1  818.1  1002.9 
Kangaroo  13118.1  9419.7  13794.0  13930.6 
Krull  6558.0  7232.3  6292.5  6645.7 
KungFuMaster  26161.2  27089.5  30169.6  31635.2 
MontezumaRevenge  2.6  1087.5  501.3  800.3 
MsPacman  3664.0  3986.2  4254.2  4707.3 
NameThisGame  7808.1  12934.0  9658.9  11045.9 
Phoenix  5893.4  6577.3  8979.0  23720.3 
Pitfall  11.8  5.3  0.0  0.0 
Pong  17.4  19.7  20.3  20.6 
Pooyan  3800.8  3771.2  6347.7  4670.0 
PrivateEye  2051.8  19868.5  21591.4  888.9 
Qbert  11011.4  11616.6  19733.2  20817.4 
Riverraid  12502.4  13780.4  21624.2  21421.2 
RoadRunner  40903.3  49039.8  56527.4  55613.0 
Robotank  62.5  64.7  67.9  67.2 
Seaquest  2512.4  38242.7  11791.5  64985.0 
Skiing  15314.9  17996.7  17792.9  15603.3 
Solaris  2062.7  2788.0  3061.9  3139.9 
SpaceInvaders  1976.0  4781.9  4927.9  8802.1 
StarGunner  47174.3  35812.4  58630.5  72943.2 
Tennis  0.0  22.2  0.0  0.0 
TimePilot  3862.5  8562.7  12486.1  14421.7 
Tutankham  141.1  253.1  255.6  264.9 
UpNDown  10977.6  9844.8  42572.5  50862.3 
Venture  88.0  1430.7  1612.4  1639.9 
VideoPinball  222710.4  594468.5  651413.1  650701.1 
WizardOfWor  3150.8  3633.8  8992.3  9318.9 
YarsRevenge  25372.0  12534.2  47183.8  49929.4 
Zaxxon  5199.9  7509.8  15906.2  21921.3 