Our aim is to propose a model of subjective time based on information theory and to investigate its implications for two phenomena:
Why does time appear to slow down when you visit a new place, and speed up once you become familiar with it? Recent findings in psychology, neuroscience, and ethology suggest that perceived duration does not coincide with physical duration, but rather depends on the statistical properties of stimuli. Psychophysics experiments have shown that, when presented with a train of repeated stimuli at constant time intervals (e.g., a letter, word, object, or face), subjects perceive them as decreasing in duration [70, 44]. On the other hand, the opposite effect is reported whenever the properties of a train of stimuli change suddenly: brighter [10, 66], bigger [41, 73], dynamic [11, 31], or more complex stimuli [54, 49]
appear to last longer. Measurements of brain activity have found that longer perceived durations correlate with increased neuronal firing rates and stronger fMRI or EEG signals [2, 20, 16, 35, 45, 40]. Combined with ideas from information theory, these observations have led to the hypothesis that the subjective duration of a stimulus is proportional to the amount of neural energy required to represent the stimulus, and that this energy is a signature of the coding efficiency.
Why does $100 today feel like more than $100 tomorrow? Intertemporal choices lie at the heart of economic decision-making. The economic literature proposed early on that rational decision-makers prefer immediate rewards over similar rewards in the future because they discount time
—that is, future moments in time weigh less in their assessment of utility. Historically, the first and most commonly used mathematical model of temporal discounting, namely exponential discounting (also known as geometric discounting in the artificial intelligence literature), proposes that decision-makers discount future utilities using a single, constant discount rate that compresses distinct psychological motives [52, 24]. The second most influential type of model is hyperbolic discounting, which postulates that discount rates decline over longer time horizons, i.e. future values decrease less rapidly than exponentially. Structurally, hyperbolic discounting lacks many of the elegant properties of exponential discounting (such as dynamic consistency); however, it has significantly more empirical support [8, 34, 4, 59].
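The contrast between the two families can be sketched numerically; the parameter values below (a discount factor of 0.9 and a hyperbolic constant of 0.1) are illustrative choices, not values taken from any fit:

```python
import math

def exponential_discount(t, gamma=0.9):
    """Constant-rate discounting: d(t) = gamma ** t."""
    return gamma ** t

def hyperbolic_discount(t, k=0.1):
    """Declining-rate discounting: d(t) = 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)

# The per-step log discount rate -log(d(t + 1) / d(t)) is constant for the
# exponential model but declines toward zero for the hyperbolic one.
for t in (1, 10, 100):
    rate_exp = -math.log(exponential_discount(t + 1) / exponential_discount(t))
    rate_hyp = -math.log(hyperbolic_discount(t + 1) / hyperbolic_discount(t))
    print(f"t={t:3d}  exp rate={rate_exp:.4f}  hyp rate={rate_hyp:.4f}")
```

The declining per-step rate is precisely what produces preference reversals over time in the hyperbolic model.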
Temporal discounting was originally conceived as a property of the decision-maker’s preferences (e.g. the curvature of the utility function). However, relatively recent studies have investigated the role of subjective time as the underlying cause of intertemporal preferences. For instance, prior work proposed that hyperbolic discounting arises from a sub-additive perception of duration; and recent experimental findings [65, 74, 8] suggest that intertemporal choice patterns are well captured by treating time as a perceptual modality subject to classical psychophysical laws (e.g. the Weber-Fechner law).
Similar to prior proposals, the question we address here is: how does an agent’s memory affect time perception and intertemporal preferences? We propose a model of time perception in terms of an agent’s memory requirements for encoding (temporal) sensorimotor dependencies. This approach does not touch upon the cognitive processes that implement the sense of time (such as internal-clock and attentional-counter models [27, 39, 17, 38]), nor does it attempt to provide a phenomenological account [63, 72, 15, 18]. Instead, by restricting our attention to the purely statistical properties of behavior, we obtain a model of time perception that is agnostic to implementation details and substrate. In this abstraction, a range of tools becomes available that enables the quantitative investigation of representational limitations. Our main finding is that a system’s memory simultaneously shapes its perception of time and its intertemporal preferences, consistent with previous experimental findings.
2 Predictive Information and Free Energy
We consider adaptive agents that interact with an environment, e.g. a mouse chasing food in a laboratory maze, a vacuum-cleaning robot, or a ribosomal complex synthesizing proteins in the cytoplasm (see Fig. 1). In artificial intelligence, such agent-environment systems are often approximated using discrete-time stochastic processes in which the agent and the environment take turns exchanging actions and observations drawn from appropriately defined finite sets $\mathcal{A}$ and $\mathcal{O}$, respectively. In each time step, the agent acts following a policy that specifies the probability of generating an action $a \in \mathcal{A}$ given its current state. Similarly, the environment replies with an observation $o \in \mathcal{O}$ drawn from its stochastic dynamics. The agent generates actions so as to bring about its goals, such as optimizing a feedback signal or maintaining a homeostatic equilibrium.
We center our analysis on the stochastic process describing the interaction dynamics as seen from the perspective of the agent. An adaptive agent has to solve two learning tasks simultaneously: predicting the environment and learning to act optimally. In Bayesian reinforcement learning, the standard approach is to model this as an agent that uses past data to learn a parameter $\theta$ that encapsulates both the properties of the environment and its corresponding optimal policy. As the agent gains knowledge about $\theta$, it improves its predictions about the environment and, simultaneously, its ability to choose better actions (see Section 3 for concrete examples). Accordingly, the stochastic process represents the agent’s beliefs about the interaction dynamics, and the state is given by the agent’s information state or memory state. (Notice that the agent’s memory is not localized within its body: rather, it is distributed between the body and the surroundings. For instance, a vacuum-cleaning robot with a handful of internal states can use the objects in its environment as an external memory device—opening the possibility of Turing-complete behavior. Hence, a comprehensive analysis of the agent’s behavior must be based on the memory of the combined agent-environment system.) We will assume that the state is the minimal sufficient statistic, i.e. the minimal information shared between the past experience and the parameter $\theta$, as any redundant information contained in the past does not improve the agent’s predictive abilities and can thus be discarded.
In order to establish a link between the agent’s memory state, its reward function, and its temporal perception, we first need to clarify what we mean by memory and reward, and how to infer these from the stochastic process.
2.1 Predictive Information
We first focus on the memory of the stochastic process. As is customary in information theory, the information content of a given finite sequence $x$ is assessed in terms of its binary codeword length $\ell(x)$. The binary length is a standardized proxy for the complexity of a sequence, as it characterizes both the number of two-state storage units required for remembering it and the difficulty of generating it from fair coin flips (Fig. 1b). Shannon proved that the optimal expected codeword length is given by the entropy, implying that a minimal codeword has length $\ell(x) = -\log_2 P(x)$, where $P(x)$ is the probability of the sequence $x$. Many techniques exist to construct near-optimal lossless codewords for data streams; a particularly elegant one is arithmetic coding [48, 61].
What is memory? Following Bialek et al., imagine that we have measured the first few interactions, and call this past $p$. Given $p$, we want to predict a finite number of future interactions $f$. Even before looking at the data, we know that some futures are more likely than others, and this knowledge is summarized by a prior distribution $P(f)$. If in addition we take into account the information contained in the past, then we obtain a more tightly concentrated posterior distribution over futures, $P(f \mid p)$. The number of bits in the past that are used to improve the prediction of a future is quantified as the difference between the prior and posterior codeword lengths:
$$\ell(f) - \ell(f \mid p) = \log_2 \frac{P(f \mid p)}{P(f)}.$$
This difference measures the minimal memory that a system must possess to enable this prediction. (The difference can be negative; in this case, it can be interpreted as the amount of information contained in the past that is inconsistent with the future $f$.) If we average over all realizations, we obtain the mutual information between the past and the future:
$$I(\text{past};\text{future}) = \mathbb{E}\!\left[\log_2 \frac{P(f \mid p)}{P(f)}\right].$$
In the literature, this quantity is known as the predictive information [5, 6]. Intuitively, the predictive information quantifies the amount of memory taken up by the patterns, rules, or correlations that relate the past to the future. An important property of the predictive information is that it is subextensive if the stochastic process is stationary; in other words, it grows sublinearly in the length of a realization. This is in stark contrast to the entropy of the process, which grows linearly with the length of a realization. Consequently, only a vanishing fraction of the dynamics of the process is governed by patterns; most of it is driven by pure noise.
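The subextensivity of the predictive information can be illustrated with a minimal plug-in estimator; the binary Markov chain below is a stand-in process of our own choosing, not one of the agents studied later:

```python
import math
import random
from collections import Counter

def sample_chain(length, p_stay=0.9, rng=random):
    """Sample a binary Markov chain that repeats its last symbol w.p. p_stay."""
    x = [rng.randint(0, 1)]
    for _ in range(length - 1):
        x.append(x[-1] if rng.random() < p_stay else 1 - x[-1])
    return tuple(x)

def predictive_information(n_past, n_future, n_samples=50_000, seed=0):
    """Plug-in Monte-Carlo estimate of I(past; future)
    = H(past) + H(future) - H(past, future), in bits."""
    rng = random.Random(seed)
    joint = Counter()
    for _ in range(n_samples):
        x = sample_chain(n_past + n_future, rng=rng)
        joint[(x[:n_past], x[n_past:])] += 1

    def entropy(counts):
        n = sum(counts.values())
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    past, future = Counter(), Counter()
    for (p, f), c in joint.items():
        past[p] += c
        future[f] += c
    return entropy(past) + entropy(future) - entropy(joint)

# For a Markov chain only the last past symbol predicts the future, so the
# estimate stays bounded (subextensive) as the windows grow, while the
# entropy of the windows themselves keeps growing linearly.
```

For this chain the predictive information saturates at roughly half a bit regardless of window size, a simple instance of sublinear growth.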
2.2 Free Energy
Our second step is to establish a firm link between the statistics of stochastic processes and their implicit rewards. The evolution of the stochastic process can be analyzed using decision-theoretic tools by adopting a view complementary to the previous one. In this interpretation, conditioning amounts to imposing constraints on an otherwise free evolution of the stochastic process. The assumption is that, if uncontrolled, the future interactions would follow the stochastic dynamics described by the prior distribution $P(f)$. However, when the agent experiences the past interactions $p$, it acquires knowledge that leads it to steer the process in a more desired direction, resulting in the dynamics given by the posterior distribution $P(f \mid p)$. This transformation can be characterized as the result of maximizing expected rewards subject to constraints on the memory capacity of the process [68, 42]. The associated objective function is the free energy functional
$$F[q] = \mathbb{E}_q\!\big[r(f \mid p) + r_T(f)\big] - \frac{1}{\beta}\, D_{\mathrm{KL}}\!\big(q(f \mid p)\,\big\|\,P(f)\big),$$
which is to be maximized w.r.t. the distribution $q(f \mid p)$ over futures conditioned on the past. In the first expectation, $r(f \mid p)$ is the real reward of the future sequence $f$ conditioned on the past $p$ (we assume that rewards are additive across interactions) and $r_T(f)$ is the terminal reward of the sequence. The second term is a penalization that measures the memory cost of changing the probability of $f$. The parameter $\beta$ is the inverse temperature, and it encapsulates the trade-off between rewards and information costs: larger values correspond to cheaper memory costs and therefore more control. The posterior distribution over the future is then defined as the maximizer of the free energy functional, given by the Gibbs distribution
$$P(f \mid p) = \frac{1}{Z(p)}\, P(f)\, \exp\!\big\{\beta\,[r(f \mid p) + r_T(f)]\big\},$$
where $Z(p)$ is a normalizing constant.
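A minimal sketch of this optimization, assuming discrete futures and natural logarithms (the choice of base only rescales $\beta$):

```python
import math

def gibbs_posterior(prior, reward, beta):
    """Maximizer of F[q] = E_q[r] - (1/beta) * KL(q || prior):
    q*(f) is proportional to prior(f) * exp(beta * reward(f)).
    Returns the posterior and the optimal free energy (1/beta) * log Z."""
    weights = {f: p * math.exp(beta * reward[f]) for f, p in prior.items()}
    z = sum(weights.values())
    return {f: w / z for f, w in weights.items()}, math.log(z) / beta

def free_energy(q, prior, reward, beta):
    """Evaluate F[q] directly, for comparison with the optimum."""
    expected_reward = sum(q[f] * reward[f] for f in q)
    kl = sum(q[f] * math.log(q[f] / prior[f]) for f in q if q[f] > 0)
    return expected_reward - kl / beta

prior = {"f1": 0.5, "f2": 0.3, "f3": 0.2}
reward = {"f1": 0.0, "f2": 1.0, "f3": 2.0}
q_star, f_star = gibbs_posterior(prior, reward, beta=1.0)
# The Gibbs posterior attains the optimal free energy; any other distribution
# over futures (e.g. the prior itself) achieves a strictly lower value.
```

The futures and rewards above are illustrative toy values; the point of the sketch is only that the Gibbs distribution is the variational maximizer.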
Since the free energy functional and its Gibbs maximizer must hold for any past-future window, two consequences follow. The first is that the free energy functional has a recursive structure given by the equality between optimal free energies and terminal rewards. (From an economic point of view, the optimal free energy turns out to be the certainty-equivalent value of knowing $p$; in other words, it is the net worth the agent attributes to the future when it has experienced the past $p$.)
In particular, the terminal rewards that appear in the free energy functional are themselves the result of optimizing free energy functionals over the distant futures that occur after $f$. If we average over the pasts, we get
$$\mathbb{E}\big[F\big] = \mathbb{E}\big[r(f \mid p) + r_T(f)\big] - \frac{1}{\beta}\, I(\text{past};\text{future}),$$
which reveals that the KL-penalization term is a constraint on the predictive information, imposing a limit on the memory of the stochastic process during the maximization of the expected rewards. The second consequence is that rewards and log-likelihoods are related via an affine transformation,
$$r(f \mid p) + r_T(f) = \frac{1}{\beta} \log \frac{P(f \mid p)}{P(f)} + C,$$
where $C$ is a constant. Intuitively, this means that “the future $f$ has more reward given the past $p$” and “the future $f$ is more likely given the past $p$” are two equivalent statements. (Similar points regarding the equivalence between rewards and likelihoods have been made previously, see e.g. [25, 55]. It is also worth clarifying that we treat rewards as internal, subjective quantities. This is consistent with expected utility theory [71, 53], but unlike the more recent interpretation in reinforcement learning, where rewards are treated as externally supplied, objective quantities.) In addition, this relation provides a simple formula for estimating the rewards from the realizations of a given stochastic process.
We have conducted simulations of agent-environment systems to achieve two goals: first, to measure the agents’ subjective time frame (made precise later in this section); and second, to calculate their implicit discount functions. The simulation results (samples from the stochastic processes) provided us with the necessary data to subsequently estimate (using the tools reviewed in the previous section) the implicit memory constraints and rewards contained in the agent-environment interactions.
For simplicity, our simulations are based on the standard framework of multi-armed bandit problems. In these problems, an agent gambles on a slot machine with multiple arms. When played, an arm provides the agent with a Bernoulli-distributed nominal reward whose bias is initially unknown to the agent. The objective of the game is to play a sequence of arms so as to maximize the sum of rewards. Although simple, bandit problems pose many of the core challenges of sequential decision-making. For instance, an agent has to balance greedy choices against choices intended to acquire new knowledge—a trade-off known as the exploration-exploitation dilemma.
Throughout all our simulations we used two-armed bandits with arms labeled “a” and “b” (Fig. 2e). To investigate how learning ability impacts time perception, we simulated agents with probabilistic models of increasing complexity (Materials & Methods). To do so, we used four different types of parameter spaces $\Theta$: a singleton set, representing an informed agent that already knows the dynamics of the environment and the optimal policy; a finite set; a finite-dimensional parameter space giving rise to a parametric probabilistic model; and an infinite-dimensional parameter space. We call these agents informed, finite, parametric, and nonparametric, respectively. Note that only the last three are adaptive. Furthermore, the agents use a probability-matching strategy known as Thompson sampling to pick their actions. Accordingly, in each turn the agent samples one bias for each arm from the posterior distribution over $\Theta$ and then plays the arm with the largest sampled bias. Fig. 2 compares the predictions made by the four probabilistic models.
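A minimal implementation of the Thompson-sampling loop for a Beta-Bernoulli learner of the kind used by the parametric agent; the arm biases and step count below are illustrative:

```python
import random

def thompson_two_armed(true_biases, n_steps=2000, seed=0):
    """Thompson sampling on a two-armed Bernoulli bandit with Beta(1, 1)
    priors. The hyperparameters (alpha, beta) of each arm count the observed
    rewards and losses, i.e. they constitute the agent's memory state."""
    rng = random.Random(seed)
    alpha = {"a": 1, "b": 1}
    beta = {"a": 1, "b": 1}
    history = []
    for _ in range(n_steps):
        # Probability matching: sample one bias per arm from the posterior
        # and play the arm whose sampled bias is largest.
        sampled = {arm: rng.betavariate(alpha[arm], beta[arm]) for arm in "ab"}
        arm = max(sampled, key=sampled.get)
        observation = 1 if rng.random() < true_biases[arm] else 0
        alpha[arm] += observation
        beta[arm] += 1 - observation
        history.append((arm, observation))
    return history

history = thompson_two_armed({"a": 0.3, "b": 0.7})
# After an exploratory burn-in the agent plays the better arm most of the time.
```

Because exploration is driven by posterior uncertainty, the sampling distribution narrows onto the better arm as the hyperparameter counts grow.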
The nominal rewards issued by the bandits should not be confused with the real rewards defined in the previous section. Although nominal rewards feature in the description of the multi-armed bandit problem setup, finding a policy that maximizes them is in general intractable. Therefore, the resulting policies will not be optimal with respect to those nominal rewards. However, we can use the free energy functional to infer the real rewards optimized by the stochastic process defined by the interactions between the constructed agents and the bandits.
3.1 Present Scope and Perceived Durations
We measure the passage of time using clocks, e.g. hourglasses, wristwatches, planetary movements, and atomic clocks, all of which register changes in the physical state of the world. Analogously, an agent tracks the passage of time through the changes in its memory state triggered by the interactions. Maintaining a memory state induces temporal correlations between the past and the future that are a signature of the underlying adaptive mechanisms [5, 56, 62]. Conversely, the lack of temporal correlations indicates the absence of memory.
At any given point during the realization of the stochastic process, it is natural to define the “present” as the minimal information contained in the memory state that can be confirmed empirically, i.e. the minimal sufficient statistic of the past (Fig. 3a,b). Imagine that the agent experiences the past $p$ and enters state $s$. In this state, the amount of information the agent possesses about any particular future $f$ is equal to
$$\log_2 \frac{P(f \mid s)}{P(f)}$$
bits. Similarly, only $\log_2 \frac{P(p \mid s)}{P(p)}$ bits about the past are remembered. Both of these quantities are upper bounded by $\ell(p) = -\log_2 P(p)$, the total information contained in the past. The scope of the agent’s present is the part of the past/future that is remembered/predicted by the agent’s memory state (Fig. 3d–f). In the special case of the informed agent, the present state has zero span (Fig. 3d) because the agent does not need to maintain any memory in order to behave optimally. In contrast, the parametric agent possesses an extensive scope (Fig. 3e,f) that grows with experience. In particular, the agent can only remember its past up to a permutation of the interactions, because its distribution over observations is exchangeable. Furthermore, it predicts futures that are consistent with the past experience; for instance, in the illustrated case the agent’s most likely future repeatedly pulls arm “b” and observes “1”, in accordance with the past. Deviations from the past, such as those caused by oddballs, contradict the memory state (i.e. they share a negative number of bits with it). Note that the possession of a present scope is a property shared by all adaptive agents.
Given this definition of the present scope, we propose to model the passage of time relative to the stochastic process as the number of bits in the memory that change during the experience of an interaction (Fig. 3c). This definition rests upon the assumption that the stochastic process is implemented on a computational model with a fixed bandwidth per operation—such as a (probabilistic) Turing machine, which can only modify a limited number of bits per cycle [58, 43]. (In the case of a probabilistic Turing machine, the only bits that can change within a cycle are those needed to represent the state of the machine, the movement of the head, and the content of the current cell on the tape.) We argue that this provides a more plausible time metric than the time index or even the entropy of the stochastic process, because using the index of the stochastic process would yield a non-homogeneous complexity per time unit. (Since an individual interaction can be arbitrarily complex, forcing it to be computed in one time unit would require a machine that can operate at unbounded speed.) Formally, given a finite past $p$ and future $f$, consider a transition that extends the past from $p$ to $p' = px$ and shortens the future from $f = xf'$ to $f'$, where $x$ is the experienced interaction, as depicted in Fig. 3c. The perceived duration of $x$ relative to this limited window is equal to the difference between the number of bits of predictive information held before and after the transition.
These durations are illustrated in Fig. 3g–i. The informed agent operates in the equilibrium regime and thus does not experience time (Fig. 3g). In contrast, the parametric agent perceives durations that vary with its current knowledge. For instance, Fig. 3h–i show that predictable interactions decrease in duration (e.g. “b1,b1,b1”), whereas a mistake (e.g. accidentally playing “a” instead of “b”) and some of the rewards result in longer durations. Oddballs, such as deviations from “b1,b1,b1”, do not necessarily entail longer durations.
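One way to make this concrete is to measure memory change as the KL divergence between successive Beta posteriors of a Beta-Bernoulli learner; this is a simplified proxy for the bit-count defined above (and our own choice of illustration), not the exact estimator used in the figures:

```python
import math

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x plus an
    asymptotic series; adequate for x >= 1."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)) in bits."""
    kl = (log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
          + (a1 - a2) * digamma(a1)
          + (b1 - b2) * digamma(b1)
          + (a2 - a1 + b2 - b1) * digamma(a1 + b1))
    return kl / math.log(2)

def perceived_durations(observations):
    """Bits of memory change per interaction for a Beta-Bernoulli learner:
    the KL divergence between the posterior after and before each observation."""
    a, b = 1.0, 1.0
    durations = []
    for o in observations:
        a2, b2 = a + o, b + (1 - o)
        durations.append(kl_beta(a2, b2, a, b))
        a, b = a2, b2
    return durations

d = perceived_durations([1, 1, 1, 1, 1, 1, 1, 0])
# Repeated, predictable observations move the posterior less and less, so the
# durations shrink; the final oddball shifts the posterior by far more bits.
```

The shrinking durations for the repeated “1”s and the jump at the oddball mirror the qualitative pattern of Fig. 3h–i.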
3.2 Temporal Discounting
When an agent has a limited capacity to predict the future, it can only exploit a fraction of the rewards that lie ahead. The precise amount can be inferred by inspecting how much the rewards affect the behavior of the stochastic process. If the probability of choosing a future $f$ given a past $p$ is the result of optimizing the free energy functional, then the change in log-probability can be written as (Materials & Methods)
$$\log \frac{P(f \mid p)}{P(f)} = \beta\,\big[r(f \mid p) + r_T(f) - V(p)\big].$$
Using this identity, we identify the l.h.s. with the information about the future predicted by the present. The r.h.s. is proportional to the difference between two terms: $r(f \mid p) + r_T(f)$, the cumulative and terminal rewards of $f$; and $V(p)$, the certainty-equivalent value of all the potential futures. This result states that the change in a choice’s probability depends exclusively on how much the choice improves upon the summarized value of the choice set. In accordance with decision theory, we refer to this excess as the rejoice (i.e. negative regret, see Materials & Methods). The rejoice scales proportionally with the agent’s memory and is therefore determined by the learning model.
In the economic literature, an agent that reacts only to a fraction of the reward is explained through a discount function. Typically, a discount function gives smaller weights to rewards that lie farther in the future. Following the same rationale, we next show how to derive discount functions that re-weight the rewards so as to equate them with the rejoices.
We conducted Monte-Carlo simulations of the four agents to study the evolution of their predictive performance (Materials & Methods). To isolate the effects of the adaptive policy from the effects of learning the environment, we also ran each agent a second time using a (non-adaptive) uniform random policy. All estimates were obtained by averaging over simulated trajectories of even length that were split in the middle into a past $p$ and a future $f$. The results are shown in Fig. 4. The first panel (Fig. 4a) shows that the conditional entropies of the past given the future, obtained by averaging the negative log-likelihoods, grow linearly with the size of the window. As discussed in Sec. 2.2, log-likelihoods are affine transformations of the real rewards. In contrast, the mutual information between the past and the future grows sublinearly, and the growth rate increases with model complexity (Fig. 4b–c). The mutual information curves for non-adaptive policies provide a lower bound for their adaptive counterparts. However, adaptive policies converge to the optimal policies in the limit, and so we expect their mutual information curves to converge to the lower bound. The simulations confirm this for the finite and parametric agents. Fig. 4
c also shows that the parametric model has a logarithmic growth in mutual information. In comparison, the nonparametric model is super-logarithmic and the finite model is upper-bounded. Finally, we also inspected the sum of the nominal rewards contained in the futures. Fig. 4d shows that all the agents predict nominal rewards that grow proportionally with the length of the past-future window. The slopes of these curves vary, and agents with adaptive policies predict better rewards.
We computed linear fits of the nominal reward (NR) curves of Fig. 4d and nonlinear fits of the mutual information (MI) curves of Fig. 4b for the agents with uniform policies, which serve as lower bounds for the agents with adaptive policies. Specifically, the nonlinear models have the following asymptotic behavior in the length of the interval of past interactions: exponential decay for the finite agent; logarithmic growth for the parametric agent; and power-law growth for the nonparametric agent. The last two have been justified analytically in previous work: Bialek et al. have shown that the predictive information of models based on finite- and infinite-dimensional parameter vectors has logarithmic and power-law growth, respectively. Notice that since nominal and real rewards are both asymptotically linear, choosing one or the other for our fits would lead to the same asymptotic conclusions. The results of this fit are listed in Table 1. Combining these fits, we calculated the predicted rewards relative to the agent’s perceived duration (Fig. 4e). The plots show that agents predict a superlinear growth relative to their perceived duration. Given that the rejoice is proportional to the duration, agents must assign decaying weights to the predicted rewards; these weights are given by the discount functions listed in Table 2 and shown in Fig. 4
f (Materials & Methods). The discount functions can be classified according to their asymptotic decay, resulting in infinite, linear, exponential, and hyperbolic discounting for the informed, finite, parametric, and nonparametric agents, respectively. These functions provide lower bounds for the discount functions of the agents using an adaptive policy.
Our theoretical model relates the change of an agent’s memory state to the perceived duration of interactions and to the rejoice. Concisely stated, the number of bits by which the memory state changes during an interaction equals the interaction’s perceived duration and is proportional to its rejoice.
Specifically, we have taken the memory to be synonymous with the minimal sufficient statistic (i.e. encoding all past-future dependencies and only those) of the agent’s probabilistic model. As suggested by the agent-environment setup, we expect the memory substrate in animals to encompass every modulator of behavior, ranging from the synaptic weights in the brain to the immediate surroundings. Furthermore, an increase in the information-processing capacity per interaction can be related to physiological factors such as a decrease in body size or an increase in metabolic rate. Our agent-environment simulations illustrate that while complex reactive behavior can emerge from limited to no memory (as in the informed agent), adaptive behavior depends on the availability of sufficient memory resources.
Consistent with previous findings in the field of time perception, the model links the shortening durations of repeated stimuli to their increased predictability. More specifically, however, it distinguishes between three cases. Interactions that are (a) well predicted or (b) novel but irrelevant for future behavior imply fewer changes to the agent’s memory when experienced; therefore, their perceived duration is shorter. In contrast, (c) oddballs that are relevant for future behavior (i.e. eliciting larger rejoice) induce adaptation and are thus perceived as longer in duration. This connection between the perception of time and rewards is supported by empirical findings in the literature on attention. For instance, in a recent study in which participants performed a prospective timing task, only oddballs signaling relatively high reward compared to the standards were perceived to last longer, whereas oddballs with no or little reward remained unaffected.
Within this context, it is instructive to examine two limiting cases. Consider an agent with a clock that ticks once per interaction. If the agent has little to no memory (like the informed agent in our experiments), then the temporal resolution vanishes and the clock spins infinitely fast from the agent’s point of view. In contrast, if the agent possesses no capacity constraints, as is assumed in the perfect rationality paradigm, then the clock slows down to the point where it appears frozen. In this sense, memory capacity acts as a form of temporal inertia that quantifies the amount of information required to move the agent one clock tick in time.
The memory constraints limit the agent’s ability to react to distant rewards, giving rise to the phenomenon of temporal discounting. The simulation results have shown how model complexity qualitatively affects the asymptotic behavior of discount rates; in particular, exponential and hyperbolic discounting were shown to arise from parametric and nonparametric model classes, respectively. Given that humans have been shown to display hyperbolic discounting, our results suggest that human intertemporal value judgments may arise from a memory formation rate comparable to that of nonparametric models.
The model of time perception can also account for some effects of general memory manipulation. For instance, an increase in memory plasticity will correlate positively with perceived durations. This is consistent with experimental findings, e.g. those in which the administration of dopamine led to the overestimation of durations and the attenuation of impulsivity (steep discounting) [32, 30].
Finally, it is worth remarking that the relation between memory changes, duration, and rejoice that we have laid out here is not specific to these variables, but is rather a general property of quantities that are extensive/additive in the interactions. In other words, the limitations imposed by the memory growth rate appear to be a general property of perception. Verifying this property for other perceptual modalities (beyond time and reward) is a task for future work.
Materials and Methods
Multi-Armed Bandit Processes.
We considered simple agents and environments where the sets of actions and observations were chosen as $\mathcal{A} = \{\mathrm{a}, \mathrm{b}\}$ and $\mathcal{O} = \{0, 1\}$, respectively. For the actions, the symbols “a” and “b” encode the left and the right arm; for the observations, 0 and 1 correspond to a nominal loss and a nominal reward. The environment is a two-armed bandit characterized by a bias vector $(\theta_{\mathrm{a}}, \theta_{\mathrm{b}})$, where $\theta_i \in [0, 1]$ for $i \in \{\mathrm{a}, \mathrm{b}\}$. When the agent pulls arm $i$, the bandit replies with a reward drawn from a Bernoulli distribution with bias $\theta_i$. When the biases are known, as is the case for the informed agent, the optimal strategy consists in always playing the arm with the highest bias. In our simulations of the informed agent, we chose a fixed bias vector; hence, the optimal strategy was to pick the arm with the higher bias in every turn.
When the biases are unknown, the agent’s uncertainty is modeled by placing a prior distribution over the bias vector [19, 29]. In the case of the finite agent, the hypothesis class is given by a set of four bias vectors, and the agent places a prior pmf over each arm’s bias. The hyperparameters $\alpha_i$ and $\beta_i$ keep track of the total number of times that the bandit responded to arm $i$ with rewards and losses, respectively [9, 13]. The resulting prior pmf is a product distribution over the two arms.
The parametric agent enriches the finite case by extending the hypothesis class to all bias vectors in the unit square $[0, 1]^2$, placing an independent Beta pdf over each vector component. Because both agents have priors that are conjugate to the Bernoulli distribution, their posteriors are obtained by simply updating the four hyperparameters $\alpha_{\mathrm{a}}$, $\beta_{\mathrm{a}}$, $\alpha_{\mathrm{b}}$, and $\beta_{\mathrm{b}}$. For example, if the agent plays arm “a” and the environment replies with a reward, then the hyperparameter $\alpha_{\mathrm{a}}$ is incremented by one count and the others are kept equal.
The nonparametric agent is based on a more flexible model. Rather than assuming a single bandit, the model considers a sequence of two-armed bandits labeled $n = 0, 1, 2, \ldots$, each having its own vector of biases. Starting at bandit $n = 0$, the agent moves to the next bandit whenever it receives a reward, and returns to the first bandit ($n = 0$) whenever it receives a loss. In each interaction, only the hyperparameters of the current bandit are updated. The resulting hypothesis class is given by the set of all maps from bandit indices to bias vectors.
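A sketch of this chain-of-bandits model (the class and method names are ours, chosen for illustration):

```python
import random

class ChainOfBandits:
    """Sketch of the nonparametric model: a sequence of two-armed bandits,
    each with its own Beta(1, 1)-initialized hyperparameters. A reward (1)
    advances to the next bandit; a loss (0) resets to the first one."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hyper = {}   # bandit index -> {arm: [alpha, beta]}
        self.current = 0

    def _params(self, n):
        # Lazily create hyperparameters for a bandit the first time it is visited.
        if n not in self.hyper:
            self.hyper[n] = {"a": [1, 1], "b": [1, 1]}
        return self.hyper[n]

    def act(self):
        """Thompson sampling within the current bandit only."""
        params = self._params(self.current)
        sampled = {arm: self.rng.betavariate(*params[arm]) for arm in "ab"}
        return max(sampled, key=sampled.get)

    def observe(self, action, observation):
        """Update only the current bandit, then move along the chain."""
        a, b = self._params(self.current)[action]
        self._params(self.current)[action] = [a + observation, b + (1 - observation)]
        self.current = self.current + 1 if observation == 1 else 0
```

Because a fresh bandit can always be reached, the effective parameter count grows without bound with experience, which is what makes the model nonparametric.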
To generate actions, all the agents employ either a uniform random strategy or Thompson sampling. In the latter case, an agent generates Monte-Carlo samples of the biases of the current bandit from its posterior distribution. Subsequently, it plays the arm associated with the largest sampled bias.
At the beginning of each simulation, all the hyperparameters of an agent’s probabilistic model were initialized to one, i.e. $\alpha_i = \beta_i = 1$ for $i \in \{\mathrm{a}, \mathrm{b}\}$ in the finite and parametric agents, and likewise for every bandit in the case of the nonparametric agent. This corresponds to a uniform distribution over every bias. Furthermore, the true (unknown) biases were sampled uniformly, in accordance with the agents’ priors.
The sufficient statistics of the four agents follow directly from their probabilistic models. The informed agent’s stochastic process is memoryless; therefore, its sufficient statistic can be modeled as a constant function of the past. The predictions made by the finite, parametric, and nonparametric agents depend upon the total counts of rewards and losses (i.e. the hyperparameters) observed so far, the index of the current bandit, and the action issued during the current interaction, if available.
The main difference between expected utility theory [71, 53] and regret theory [23, 3, 36] is that in the former decision-makers maximize expected utility, whereas in the latter they minimize regret, i.e. they choose an action $a$ that minimizes a function $r(a, a^\dagger)$, where $a^\dagger$ is a reference action and $U$ is a utility function. The regret function quantifies how much the utility of $a$ is affected by what would have happened had $a^\dagger$ been chosen instead of $a$. Arguably, the simplest regret function is given by the difference $r(a, a^\dagger) = U(a^\dagger) - U(a)$.
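With the difference regret, minimizing regret against a fixed reference action recovers the utility-maximizing choice. A toy check with hypothetical utilities (all values below are illustrative):

```python
def utility(a):
    """Hypothetical utilities for three actions."""
    return {"a": 1.0, "b": 3.0, "c": 2.0}[a]

def regret(a, ref):
    """Simplest regret: how much better the reference would have been."""
    return utility(ref) - utility(a)

actions = ["a", "b", "c"]
best_ref = max(actions, key=utility)                 # reference = best action
choice = min(actions, key=lambda a: regret(a, best_ref))
assert choice == "b" and regret(choice, best_ref) == 0.0
```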
Decision-making based on the free energy functional can be related to regret theory. The solution to the average free energy functional is given by the Gibbs distribution
$$P^\ast(x) = \frac{1}{Z}\, P(x)\, e^{\alpha R(x)},$$
where $x$ ranges over future realizations, $R(x)$ denotes their cumulative rewards, $P(x)$ is the prior, $\alpha > 0$ is the inverse temperature, and the normalizing constant $Z = \sum_x P(x)\, e^{\alpha R(x)}$ is the partition function. It is well known that the optimal free energy is equal to
$$F^\ast = \frac{1}{\alpha} \log Z = \frac{1}{\alpha} \log \sum_x P(x)\, e^{\alpha R(x)}.$$
In the economic literature, this quantity is known as the certainty-equivalent: it is a function of the set of future rewards that measures the agent’s subjective worth of the cumulative rewards that lie in the future. The value $F^\ast$ is bounded as
$$\sum_x P(x)\, R(x) \;\le\; F^\ast \;\le\; \max_x R(x),$$
where the lower and upper bounds are attained in the limits $\alpha \to 0$ and $\alpha \to \infty$ respectively. Rearranging (12) as
$$P^\ast(x) = P(x)\, e^{\alpha \left( R(x) - F^\ast \right)}$$
reveals that the changes in choice probabilities are governed by a rejoice (negative regret) function that contrasts the rewards of future realizations against the certainty-equivalent:
$$\rho(x) = R(x) - F^\ast.$$
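The certainty-equivalent, its bounds, and the rejoice form of the Gibbs distribution can be verified numerically. A minimal sketch (the prior, rewards, and inverse temperature below are illustrative):

```python
import math

P0 = [0.25, 0.25, 0.5]   # prior over three future realizations (assumed)
U  = [0.0, 1.0, 2.0]     # cumulative future rewards (assumed)
alpha = 2.0              # inverse temperature (assumed)

Z = sum(p * math.exp(alpha * u) for p, u in zip(P0, U))   # partition function
F = math.log(Z) / alpha                                   # certainty-equivalent

# Bounds: expected reward (alpha -> 0) and maximum reward (alpha -> inf).
assert sum(p * u for p, u in zip(P0, U)) <= F <= max(U)

# Gibbs probabilities written via the rejoice (negative regret) U - F.
P = [p * math.exp(alpha * (u - F)) for p, u in zip(P0, U)]
assert abs(sum(P) - 1.0) < 1e-9   # a proper distribution
```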
The curves for the conditional entropy (Fig. 4a), the mutual information (Fig. 4b & c), and the predicted rewards (Fig. 4d) were obtained from Monte-Carlo averages computed at equally spaced locations $t$. To calculate an estimate at location $t$, we averaged the log-probabilities and rewards of $N$ interaction sequences of length $2t$ generated from the agent’s stochastic process described in the previous section. Given the $i$-th simulated interaction sequence, let $x^{(i)}_{\le t}$ and $x^{(i)}_{> t}$ be its first and second half respectively. Furthermore, let $R^{(i)}$ denote the nominal rewards over the second half, calculated as the sum of the observations in $x^{(i)}_{> t}$. The entropies $H(x_{> t} \mid x_{\le t})$, $H(x_{> t})$, $H(x_{\le t})$, and the expected nominal rewards were estimated as Monte-Carlo averages of the corresponding negative log-probabilities and of $R^{(i)}$ respectively.
We used the difference between (15) and (13) as an estimate for the mutual information $I(x_{\le t}; x_{> t})$. In particular, note that the estimate of the marginal entropy $H(x_{> t})$ was obtained using a doubly-stochastic Monte-Carlo average over samples in which the second half was kept fixed during blocks of size $M$. The conditional entropy of the past given the future, which serves as a proxy for the real rewards, was estimated using the identity $H(x_{\le t} \mid x_{> t}) = H(x_{\le t}) - I(x_{\le t}; x_{> t})$. The numbers of samples $N$ and $M$ were held fixed across all estimates.
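The entropy identity used for this proxy can be sanity-checked on a toy joint distribution over (past, future) pairs (the joint probabilities below are purely illustrative):

```python
import math

def H(dist):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Toy joint distribution over binary (past, future) pairs.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
past   = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
future = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

# Mutual information and the identity H(past|future) = H(past) - I.
I = H(past) + H(future) - H(list(joint.values()))
H_past_given_future = H(past) - I
assert I > 0
assert abs((H(list(joint.values())) - H(future)) - H_past_given_future) < 1e-12
```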
In the cases of the parametric and nonparametric agents, we made additional approximations to the Thompson sampling strategy. Specifically, we used the moment-matched normal approximation to the Beta distribution:
$$\mathrm{Beta}(\alpha, \beta) \approx \mathcal{N}(\mu, \sigma^2), \qquad \mu = \frac{\alpha}{\alpha + \beta}, \qquad \sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$
The approximation holds well for large values of $\alpha$ and $\beta$, and has the advantage of keeping both the generation and evaluation of action samples computationally tractable. The probability of choosing arm “a” then becomes equal to $\Phi(z)$, where $z = (\mu_a - \mu_b)/\sqrt{\sigma_a^2 + \sigma_b^2}$ and $\Phi$ is the standard normal cumulative distribution function.
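The moment matching and the resulting closed-form choice probability can be sketched as follows (function names are ours; the closed form follows from the difference of two independent normals):

```python
import math

def beta_normal_approx(a, b):
    """Moment-matched normal approximation to Beta(a, b)."""
    mu = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mu, var

def p_choose_a(hyper_a, hyper_b):
    """P(theta_a > theta_b) under independent normal approximations."""
    mu_a, var_a = beta_normal_approx(*hyper_a)
    mu_b, var_b = beta_normal_approx(*hyper_b)
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF

assert p_choose_a((1, 1), (1, 1)) == 0.5        # symmetric posteriors
assert p_choose_a((80, 20), (20, 80)) > 0.99    # arm "a" is clearly better
```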
The discount functions were derived using the following procedure. The choice of the functional forms of the model classes listed in Table 1 was motivated by prior studies of the long-term behavior of the entropy and the predictive information. The exception is the functional form of the finite model’s mutual information, which was chosen through inspection of the curve. The parameters were fit (least-squares regression) to the data obtained in the Monte-Carlo simulations. This yielded two functions per agent: the expected nominal reward function $R(t)$ and the mutual information $I(t)$, both as a function of the number of interactions $t$ of the past and the future window. For analytical convenience, we extended the number of interactions from the discrete to the continuous domain. The two functions $R(t)$ and $I(t)$ were then connected via the inverse $t(I)$. Fig. 4e is obtained by plotting $R(t(I))$, where the inverse $t(I)$ gives the number of interactions as a function of the mutual information $I$, now interpreted as a temporal coordinate. These inverses take a different functional form for each of the finite, parametric, and nonparametric models. The respective nominal reward functions are just rescaled versions of $t(I)$. To obtain the discount functions, we must make sure that the discounted future grows proportionally in $I$; that is, the discounted cumulative reward must scale linearly with the mutual information,
because the rejoice is proportional to the mutual information. Assuming w.l.o.g. that the proportionality constant equals one, this is achieved when the discount function matches the rate at which the mutual information accrues with further interactions.
Thus, the resulting discount functions take a distinct shape in the finite, parametric, and nonparametric cases, and the constants appearing in them are positive in each case.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
P.A. Ortega would like to thank D. Balduzzi, D.A. Braun, J.R. Donoso, K.-E. Kim and D. Polani for helpful comments and suggestions. This study was funded by the Israel Science Foundation Center of Excellence, the DARPA MSEE project, and the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).
- Ashby  W.R. Ashby. An introduction to cybernetics. Chapman and Hall Ltd., 1956.
- Barlow et al.  R. B. Barlow, D. M. Snodderly, and H. A. Swadlow. Intensity coding in primate visual system. Experimental Brain Research, 31(2):163–77, 1978.
- Bell  D.E. Bell. Regret in decision making under uncertainty. Operations Research, 30(5):961–981, 1982.
- Berns et al.  G.S. Berns, D. Laibson, and G. Loewenstein. Intertemporal choice–toward an integrative framework. Trends in Cognitive Science (Regul. Ed.), 11(11):482–8, 2007.
- Bialek et al. [2001a] W. Bialek, I. Nemenman, and N. Tishby. Predictability, Complexity, and Learning. Neural Computation, 13:2409–2463, 2001a.
- Bialek et al. [2001b] W. Bialek, I. Nemenman, and N. Tishby. Complexity through nonextensivity. Physica A: Statistical Mechanics and its Applications, 302(1-4):89–99, December 2001b.
- Bleichrodt and Wakker  H. Bleichrodt and P. P. Wakker. Regret theory: A bold alternative to the alternatives. The Economic Journal, 125(583):493–532, 2015.
- Bradford et al.  W.D. Bradford, P. Dolan, and M.M. Galizzi. Looking Ahead: Subjective Time Perception and Individual Time Discounting. Technical Report CEP Discussion Papers 1255, Centre for Economic Performance, LSE, 2014.
- Braun and Ortega  D.A. Braun and P.A. Ortega. A minimum relative entropy principle for adaptive control in linear quadratic regulators. In The 7th conference on informatics in control, automation and robotics, volume 3, pages 103–108, 2010.
- Brigner  W.L. Brigner. Effect of perceived brightness on perceived time. Perceptual and Motor Skills, 63(2):427–430, October 1986.
- Brown  J. F. Brown. Motion expands perceived time. Psychologische Forschung, (14):233–248, 1931.
- Brown  S. W. Brown. Timing, resources, and interference: Attentional modulation of time perception. Attention and time, pages 107–121, 2010.
- Chapelle and Li  O. Chapelle and L. Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257, 2011.
- Cover and Thomas  T.M. Cover and J.A. Thomas. Elements of information theory. Wiley New York, 1991.
- D’Argembeau and Van der Linden  A. D’Argembeau and M. Van der Linden. Individual differences in the phenomenology of mental time travel: The effect of vivid visual imagery and emotion regulation strategies. Consciousness and Cognition, 15(2):342–350, 2006.
- de Jong et al.  B. M. de Jong, S. Shipp, B. Skidmore, R. S. Frackowiak, and S. Zeki. The cerebral activity related to the visual perception of forward motion in depth. Brain, 117 (Pt 5):1039–54, 1994.
- Dragoi et al.  V. Dragoi, J. E. R. Staddon, R. G. Palmer, and C. V. Buhusi. Interval Timing as an Emergent Learning Property. Psychological Review, 110:126–144, 2003.
- Droit-Volet and Gil  S. Droit-Volet and S. Gil. The Time-Emotion Paradox. Philosophical Transactions of the Royal Society B: Biological Sciences, 364:1943–1953, 2009.
- Duff  M.O. Duff. Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002. Director: Andrew Barto.
- Dupont et al.  P. Dupont, G. A. Orban, B. De Bruyn, A. Verbruggen, and L. Mortelmans. Many areas in the human brain respond to visual motion. Journal of Neurophysiology, 72(3):1420–1424, 1994.
- Eagleman and Pariyadath  D.M. Eagleman and V. Pariyadath. Is subjective duration a signature of coding efficiency? Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 364(1525):1841–1851, July 2009.
- Failing and Theeuwes  M. Failing and J. Theeuwes. Reward alters the perception of time. Cognition, 148:19–26, 2016.
- Fishburn  P.C. Fishburn. The Foundations of Expected Utility. D. Reidel Publishing, Dordrecht, 1982.
- Frederick et al.  S. Frederick, G. Loewenstein, and T. O’Donoghue. Time Discounting and Time Preference: A Critical Review. Journal of Economic Literature, 40(2):351–401, June 2002.
- Friston et al.  K. Friston, R. Adams, and R. Montague. What is value-accumulated reward or evidence? Frontiers in Neurorobotics, 6:11, 2012.
- Ghahramani  Z. Ghahramani. Bayesian non-parametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society A, 371(20110553), 2013.
- Gibbon et al.  J. Gibbon, R. M. Church, and W. H. Meck. Scalar timing in memory. Annals of the New York Academy of Sciences, 423:52–77, 1984.
- Healy et al.  K. Healy, L. McNally, G.D. Ruxton, N. Cooper, and A.L. Jackson. Metabolic rate and body size are linked with perception of temporal information. Animal Behaviour, 86(4):685–696, 2013.
- Hutter  M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.
- Joutsa et al.  J. Joutsa, V. Voon, J. Johansson, S. Niemelä, J. Bergman, and V. Kaasinen. Dopaminergic function and intertemporal choice. Translational Psychiatry, 5(1), 2015.
- Kanai et al.  R. Kanai, C. L. Paffen, H. Hogendoorn, and F. A. Verstraten. Time dilation in dynamic visual display. Journal of vision, 6(12):1421–1430, 2006.
- Kayser et al.  A.S. Kayser, D.C. Allen, A. Navarro-Cebrian, J.M. Mitchell, and H.L. Fields. Dopamine, corticostriatal connectivity, and intertemporal choice. The Journal of Neuroscience, 32(27):9402–9409, 2012.
- Lai and Robbins  T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Laibson  D. Laibson. Golden Eggs and Hyperbolic Discounting. Quarterly Journal of Economics, 112(2):443–477, 1997.
- Linden et al.  D. E. Linden, D. Prvulovic, E. Formisano, M. Vollinger, F. E. Zanella, R. Goebel, and T. Dierks. The functional neuroanatomy of target detection: An fMRI study of visual and auditory oddball tasks. Cerebral Cortex, 9:815–823, 1999.
- Loomes and Sugden  G. Loomes and R. Sugden. Regret theory: An alternative approach to rational choice under uncertainty. Economic Journal, 92:805–824, 1982.
- MacKay  D.J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
- Maniadakis and Trahanias  M. Maniadakis and P. Trahanias. Time models and cognitive processes: a review. Frontiers in Neurorobotics, 8:7, 2014.
- Matell and Meck  M. S. Matell and W. H. Meck. Neuropsychological mechanisms of interval timing behavior. Bioessays, 22(1):94–103, 2000.
- Murray et al.  S. O. Murray, H. Boyaci, and D. Kersten. The representation of perceived angular size in human primary visual cortex. Nature Neuroscience, 9(3):429–34, 2006.
- Ono and Kawahara  F. Ono and J. Kawahara. The subjective size of visual stimuli affects the perceived duration of their presentation. Perception & psychophysics, 69(6):952–957, August 2007.
- Ortega and Braun  P.A. Ortega and D.A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 469(2153), 2013.
- Papadimitriou  C. M. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. ISBN 0201530821.
- Pariyadath and Eagleman  V. Pariyadath and D. M. Eagleman. Brief subjective durations contract with repetition. Journal of vision, 8(16), 2008. ISSN 1534-7362.
- Ranganath and Rainer  C. Ranganath and G. Rainer. Neural mechanisms for detecting and remembering novel events. Nature Reviews Neuroscience, 4:193–202, 2003.
- Read  D. Read. Is time-discounting hyperbolic or subadditive? Journal of Risk and Uncertainty, 23(1):5–32, 2001.
- Riggs  P. J. Riggs. Contemporary Concepts of Time in Western Science and Philosophy. In Long History, Deep Time: Deepening Histories of Place, pages 47–66. ANU Press, 2015.
- Rissanen  J. Rissanen. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and Development, 20(3):198–203, 1976.
- Roelofs and Zeeman  C. O. Z. Roelofs and W. P. C. Zeeman. Influence of different sequences of optical stimuli on the estimation of duration of a given interval of time. Acta Psychologica, 8:89–128, 1951.
- Rubinstein  A. Rubinstein. Modeling bounded rationality. MIT Press, 1998.
- Russell and Norvig  S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 3rd edition, 2009.
- Samuelson  P.A. Samuelson. A Note on Measurement of Utility. Review of Economic Studies, 4(2):155–161, February 1937.
- Savage  L.J. Savage. The Foundations of Statistics. John Wiley and Sons, New York, 1954.
- Schiffman and Bobko  H. R. Schiffman and D. J. Bobko. Effects of stimulus complexity on the perception of brief temporal intervals. Journal of Experimental Psychology, 103(1):156–9, 1974.
- Schwartenbeck et al.  P. Schwartenbeck, T. H. B. FitzGerald, C. Mathys, R. Dolan, M. Kronbichler, and K. Friston. Evidence for surprise minimization over value maximization in choice behavior. Scientific Reports, 5(16575), 2015.
- Shalizi and Crutchfield  C.R. Shalizi and J.P. Crutchfield. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. Journal of Statistical Physics, 104:817–879, 2001.
- Shannon  C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, Jul and Oct 1948.
- Sipser  M. Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1996.
- Soman et al.  D. Soman, G. Ainslie, S. Frederick, X. Li, J. Lynch, P. Moreau, A. Mitchell, D. Read, A. Sawyer, Y. Trope, K. Wertenbroch, and G. Zauberman. The psychology of intertemporal discounting: Why are distant events valued differently from proximal ones? Marketing Letters, 16(3–4):347–360, 2005.
- Staddon  J. E. R. Staddon. Interval timing: memory, not a clock. Trends in Cognitive Sciences (Regul. Ed.), 9(7):312–4, 2005.
- Steinruecken  C. Steinruecken. Compressing structured objects. PhD thesis, University of Cambridge, 2014.
- Still et al.  S. Still, D.A. Sivak, A.J. Bell, and G.E. Crooks. Thermodynamics of Prediction. Phys. Rev. Lett., 109(12):120604, September 2012.
- Suddendorf and Corballis  T. Suddendorf and M. C. Corballis. Mental time travel and the evolution of the human mind. Genetic, Social, and General Psychology Monographs, 123(2):133–67, 1997.
- Sutton and Barto  R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- Takahashi et al.  T. Takahashi, H. Oono, and M. H. B. Radford. Psychophysics of time perception and intertemporal choice models. Physica A: Statistical Mechanics and its Applications, 387(8):2066–2074, 2008.
- Terao et al.  M. Terao, J. Watanabe, A. Yagi, and S. Nishida. Reduction of stimulus visibility compresses apparent time intervals. Nat. Neurosci., 11:541–542, 2008.
- Thompson  W.R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285–294, 1933.
- Tishby and Polani  N. Tishby and D. Polani. Information Theory of Decisions and Actions. In Hussain Taylor Vassilis, editor, Perception-reason-action cycle: Models, algorithms and systems. Springer, Berlin, 2011.
- Todorov  E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences U.S.A., 106:11478–11483, 2009.
- Tse et al.  P.U. Tse, J. Intriligator, J. Rivest, and P. Cavanagh. Attention and the subjective expansion of time. Perception & Psychophysics, 66(7):1171–1189, October 2004.
- Von Neumann and Morgenstern  J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, 1944.
- Wheeler et al.  M.A. Wheeler, D.T. Stuss, and E. Tulving. Toward a theory of episodic memory: The frontal lobes and autonoetic consciousness. Psychological Bulletin, 121:331–354, 1997.
- Xuan et al.  B. Xuan, D. Zhang, S. He, and X. Chen. Larger stimuli are judged to last longer. Journal of Vision, 7:1–5, 2007.
- Zauberman et al.  G. Zauberman, B. K. Kim, S. A. Malkoc, and J. R. Bettman. Discounting time and time discounting: Subjective time perception and intertemporal preferences. Journal of Marketing Research, 46(4):543–556, 2009.