The asymmetric progression of time has a profound effect on how we, as agents, perceive, process and manipulate our environment. Given a sequence of observations of our familiar surroundings (e.g. as photographs or video frames), we possess the innate ability to predict whether said observations are ordered correctly. We use this ability not just to perceive, but also to act: for instance, we know to be cautious about dropping a vase, guided by the intuition that the act of breaking a vase cannot be undone. It is manifest that this profound intuition reflects some fundamental properties of the world in which we dwell, and in this work, we ask whether and how these properties can be exploited to learn a representation that functionally mimics our understanding of the asymmetric nature of time.
In his book The Nature of the Physical World (Eddington, 1929), British astronomer Sir Arthur Stanley Eddington coined the term Arrow of Time to denote this inherent asymmetry. He attributed it to the non-decreasing nature of the total thermodynamic entropy of an isolated system, as required by the second law of thermodynamics. However, the mathematical groundwork required for its description had already been laid by Lyapunov (1892) in the context of dynamical systems. Since then, the notion of an arrow of time has been formalized and explored in various contexts, spanning not only physics (Prigogine, 1978; Jordan et al., 1998; Crooks, 1999) but also algorithmic information theory (Zurek, 1989, 1998), causal inference (Janzing et al., 2016) and time-series analysis (Janzing, 2010; Bauer et al., 2016).
Expectedly, the notion of irreversibility plays a central role in the discourse. In his Nobel lecture, Prigogine (1978) posits that irreversible processes induce the arrow of time (even for systems that are reversible at the microscopic scale, the unified integral fluctuation theorem (Seifert, 2012) shows that the ratio of the probability of a trajectory to that of its time-reversed counterpart grows exponentially with the amount of entropy the former produces).
At the same time, the matter of reversibility has received considerable attention in reinforcement learning, especially in the context of safe exploration (Hans et al.; Moldovan and Abbeel, 2012; Eysenbach et al., 2017), learning backtracking models (Goyal et al., 2018; Nair et al., 2018) and AI safety (Amodei et al., 2016; Krakovna et al., 2018). In these applications, learning a notion of (ir)reversibility is of paramount importance: for instance, the central premise of safe exploration is to avoid states that prematurely and irreversibly terminate the agent and/or damage the environment. This is related (but not identical) to the problem of detecting and avoiding side effects, in particular those that adversely affect the environment. Amodei et al. (2016) consider the example of a cleaning robot tasked with moving a box across a room. The optimal way of successfully completing the task might involve the robot doing something disruptive, like knocking over a vase. Such disruptions might be difficult to recover from; in the extreme case, they might be virtually irreversible – say, when the vase is broken.
The scope of this work includes detecting and quantifying such disruptions by learning the arrow of time of an environment (detecting the arrow of time in videos has also been studied; Wei et al., 2018; Pickup et al., 2014). Concretely, we aim to learn a potential (scalar) function on the state space. This function must keep track of the passage of time, in the sense that states that tend to occur in the future – states with a larger number of broken vases, for instance – should be assigned larger values. To that end, we first introduce a general objective functional (Section 2.1) and study it analytically for toy problems (Section 2.2). We continue by interpreting the solution to the objective (Section 2.3) and highlight its applications (Section 3). To tackle more complex problems, we parameterize the potential function by a neural network and present a stochastic training algorithm (Section 4). Subsequently, we demonstrate results on a selection of discrete and continuous environments and discuss them critically, highlighting both the strengths and shortcomings of our method (Section 4). Finally, we place our method in a broader context by empirically elucidating connections to the theory of stochastic processes and the variational Fokker-Planck equation (Section 4.1).
2 The h-Potential
Preliminaries. In this section, we represent the arrow of time as a scalar function that increases (in expectation) over time. Given a Markov Decision Process (environment) M, let S and A be its state and action space, respectively. A policy π is a function mapping a state s ∈ S to a distribution over the action space, π(a|s). Given π and some distribution p_0 over the states, we call the sequence τ = (s_0, s_1, ..., s_T) a state-transition trajectory, where we have a_t ∼ π(a|s_t) and s_{t+1} ∼ p(s_{t+1}|s_t, a_t), with s_0 ∼ p_0 for some initial state distribution p_0. In this sense, τ can be thought of as an instantiation of the Markov (stochastic) process with transitions characterized by π.
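As a concrete illustration, a state-transition trajectory τ can be sampled as sketched below; `policy` and `transition` are hypothetical stand-ins for samplers of π(a|s) and p(s'|s, a), not part of the formalism above.

```python
import numpy as np

def rollout(p0, policy, transition, T, rng):
    """Sample a state-transition trajectory tau = (s_0, ..., s_T):
    s_0 ~ p0, a_t ~ policy(s_t), s_{t+1} ~ transition(s_t, a_t)."""
    s = int(rng.choice(len(p0), p=p0))
    trajectory = [s]
    for _ in range(T):
        a = policy(s, rng)            # a_t ~ pi(a | s_t)
        s = transition(s, a, rng)     # s_{t+1} ~ p(. | s_t, a_t)
        trajectory.append(s)
    return trajectory
```

With a deterministic toy chain (every action advances the state until it is absorbed), `rollout` simply enumerates the chain.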
Methods. Now, for any given function h : S → R, one may define the following functional:

J_π[h] = E_{τ, t} [ h(s_{t+1}) − h(s_t) ]    (1)

where the expectation is over the state transitions of the Markov process and the time-step t. We now define:

h*_π = argmax_h { J_π[h] − λ T_π[h] }    (2)
where T_π[h] implements some regularizer (e.g. an L2 penalty on h, etc.) weighted by a coefficient λ. The term J_π[h] is maximized if the quantity h(s_t) increases in expectation (note that while Eqn 2 requires h to increase along trajectories in expectation, it does not guarantee that h increases along all trajectories) with increasing t, whereas the regularizer ensures that h is well-behaved and does not diverge to infinity in a finite domain (under certain conditions, h resembles the negative stochastic discrete-time Lyapunov function (Li et al., 2013) of the Markov process). In what follows, we simplify notation by using h and h*_π interchangeably.
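For intuition, the objective above can be estimated directly from sampled trajectories. The sketch below assumes a tabular h (a vector indexed by state) and an L2 regularizer, one of several possible choices; it is an illustration, not the training procedure of Section 4.

```python
import numpy as np

def objective(h, trajectories, lam=0.1):
    """Monte-Carlo estimate of J[h] - lam * T[h] for a tabular h.

    J[h] averages the increments h(s_{t+1}) - h(s_t) over all sampled
    transitions; T[h] is taken here to be an L2 penalty on h."""
    increments = [h[t[i + 1]] - h[t[i]]
                  for t in trajectories for i in range(len(t) - 1)]
    J = float(np.mean(increments))
    T = float(np.sum(h ** 2))
    return J - lam * T
```

For h = (0, 1, 2) on the single trajectory 0 → 1 → 2, every increment equals 1, so J = 1 while the L2 term tempers the value.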
2.2 Theoretical Analysis
The optimization problem formulated in Eqn 2 can be studied analytically: in Appendix A, we derive the analytic solutions for Markov processes with discrete state spaces and known transition matrices. The key result of our analysis is a characterization of how the optimal h must behave for the considered regularization schemes. Further, we evaluate the solutions for the illustrative toy Markov chains in Figs 1 and 2, and learn the following.
First, consider the two-variable Markov chain in Fig 1, where the initial state is either s_1 or s_2 with equal probability. If the asymmetry parameter ε > 0, the transition from s_1 to s_2 is more likely than the reverse transition from s_2 to s_1. In this case, one would expect that h(s_2) > h(s_1) and that the gap h(s_2) − h(s_1) increases with ε, which is indeed what we find (cf. Examples 1 and 3 in Appendix A) given an appropriate regularizer. Conversely, if ε = 0, the transition between s_1 and s_2 is equally likely in either direction (i.e. it is fully reversible), and we obtain h(s_1) = h(s_2). Second, consider the four-variable Markov chain in Fig 2, where the initial state is s_1 and all state transitions are irreversible. Intuitively, one should expect that h increases along the chain with h(s_2) − h(s_1) = h(s_3) − h(s_2) = h(s_4) − h(s_3), given that all state transitions are equally irreversible. We obtain this behaviour with an appropriate regularizer (cf. Example 4 in Appendix A).
While this serves to show that the optimization problem defined in Eqn 2 can indeed lead to interesting solutions, an analytical treatment is not always feasible for complex environments with a large number of states and/or undetermined state-transition rules. In such cases, as we shall see in later sections, one may resort to parameterizing h as a function approximator and solving the optimization problem in Eqn 2 with stochastic gradient methods.
2.3 Interpretation and Subtleties
Having defined and analyzed the h-potential, we turn to the task of interpreting it. Based on the analytical results presented in Section 2.2, it seems reasonable to expect that even in interesting environments, h should remain constant (in expectation) along reversible trajectories. Further, along trajectories with irreversible transitions, one may hope that h not only increases, but also quantifies the irreversibility in some sense. In Section 4, we empirically investigate whether this is indeed the case. But before that, there are two conceptual aspects that warrant closer scrutiny.
The first is rooted in the observation that the states are collected by a given but arbitrary policy π. In particular, there may exist demonic (this is indeed an allusion to Maxwell's Demon, cf. Thomson (1874)) policies for which the resulting arrow of time is unnatural, perhaps even misleading. Consider for instance the actions of a practitioner of Kintsugi, the ancient Japanese art of repairing broken pottery. The corresponding policy might cause the environment to transition from a state where the vase is broken to one where it is not. If we learn h on such demonic (or expert) trajectories (by doing so, we solve an inverse RL problem; Ng and Russell, 2000), it might be the case that, counter to our intuition, states with a larger number of broken vases are assigned smaller values (and vice versa). Now, we may choose to resolve this conundrum by defining

h* = argmax_h E_{π ∼ U(Π)} { J_π[h] − λ T_π[h] }    (3)
where Π is the set of all policies defined on M, and U(Π) denotes uniform sampling from it. The resulting function would characterize the arrow of time w.r.t. all possible policies, and one would expect that for the vast majority of such policies, the transition from a broken vase to an intact vase is rather unlikely and/or requires highly specialized policies.
Unfortunately, determining h* is not feasible for most interesting applications, given the outer expectation over all possible policies; we therefore settle for a (uniformly) random policy, which we denote by π̃ (with corresponding potential h_π̃). The simplicity (or rather, clumsiness) of π̃ justifies its adoption, since one would expect a demonic policy to be rather complex and not implementable with random actions. In this sense, we ensure that the arrow of time characterizes the underlying dynamics of the environment, and not the peculiarities of a particular agent. However, the price we pay for our choice is the lack of adequate exploration in complex enough environments, although this problem plagues most model-based reinforcement learning approaches (cf. Ha and Schmidhuber (2018); while this is a fundamental problem, powerful methods for off-policy learning exist, cf. Munos et al. (2016) and references therein, but a full analysis is beyond the scope of the current work).
The second aspect concerns what we require of environments in which the arrow of time is informative. To illustrate the matter, we consider a class of Hamiltonian systems (systems where Liouville's theorem holds; further, the Hamiltonian is assumed to be time-independent), a typical instance of which could be a billiard ball moving on a frictionless arena and bouncing (elastically) off the edges (this is well studied in the context of dynamical systems and chaos theory under the keyword dynamical billiards; see Bunimovich (2007) and references therein). The state space comprises the ball's velocity and its position constrained to a billiard table (without holes!), where the ball is initialized at a random position on the table. For such a system, it can be seen by time-reversal symmetry that, averaged over a large number of trajectories, the state transition s → s′ is just as likely as the reverse transition s′ → s. In this case, one should expect the arrow of time to be constant (Prigogine (1978), page 783 et seq., provides a more physical treatment; see Eqn 2). A similar argument can be made for systems that identically follow closed trajectories in their respective state space (e.g. a frictionless and undriven pendulum): it follows that h must remain constant along the trajectory and that the arrow of time is uninformative. However, for so-called dissipative systems, the notion of an arrow of time is pronounced and well studied (Prigogine, 1978; Willems, 1972). In MDPs, dissipative behaviour may arise in situations where certain transitions are irreversible by design (e.g. bricks disappearing in Atari Breakout), or due to partial observability (e.g. for a damped pendulum, the state space does not track the microscopic processes that give rise to friction; note that while a damped pendulum can be expressed as a Hamiltonian system (McDonald, 2015), the Hamiltonian is time-dependent).
Therefore, a central premise underlying the practical utility of learning the arrow of time is that the considered MDP is indeed dissipative. Operating under this assumption, we now discuss a few applications of the arrow of time and experimentally demonstrate its learnability on non-trivial environments.
3 Applications with Related Work
3.1 Measuring Reachability
Given two states s and s′ in S, the reachability of s′ from s measures how difficult it is for an agent at state s to reach state s′. The prospect of learning reachability from state-transition trajectories has been explored: in Savinov et al. (2018), the approach taken involves training a logistic regressor network to predict the probability of states s and s′ being reachable from one another within a certain number of steps (of a random policy). However, the model is not directed: it does not learn whether s′ is more likely to follow s, or vice versa. Instead, we propose to learn a function h such that the difference δ_π(s → s′) = h(s′) − h(s) measures the directed reachability of state s′ from state s by following some reference policy π. In the following, we take the reference policy as given (e.g. a random policy) and drop the subscript π for notational clarity. Now, δ has the following properties.
First, consider the case where the transition between states s and s′ is fully reversible, i.e. when state s′ is exactly as reachable from state s as s is from s′. In expectation, we obtain h(s) = h(s′) and consequently, δ(s → s′) = 0. We denote such reversible transitions with s ↔ s′. Now, if instead the state s′ is more likely to occur after state s than the state s after s′, we say s′ is more reachable (with respect to the reference (random) policy, which is implicit in our notation) from s than s is from s′. It follows in expectation that h(s′) > h(s), and consequently, δ(s → s′) > 0 along with δ(s′ → s) < 0. Second, it can easily be seen that the reachability measure implemented by δ is additive by design: given three states s_1, s_2, s_3, we have that δ(s_1 → s_2) + δ(s_2 → s_3) = δ(s_1 → s_3). As a special case, consider δ(s_1 → s_2) = 0 and δ(s_2 → s_3) = 0: it follows that δ(s_1 → s_3) = 0. In words, if both transitions, from s_1 to s_2 and from s_2 to s_3, are fully reversible, it automatically follows that the transition from s_1 to s_3 is also fully reversible. Third, δ allows for a soft measure of reachability. As we shall see in Section 4, it measures not only whether a state s′ is reachable from another state s, but also quantifies how reachable the former is from the latter. For instance: if the state s_1 is one with all vases intact, s_2 with one vase broken, and s_3 with a hundred vases broken, we find that δ(s_1 → s_3) > δ(s_1 → s_2) > 0. This behaviour is sought after in the context of AI safety (Krakovna et al., 2018; Leike et al., 2017).
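A minimal sketch of the directed reachability measure h(s′) − h(s) (called `delta` below); the toy dictionary h stands in for a learned potential, with hypothetical state names and values chosen only to illustrate additivity and softness.

```python
def delta(h, s, s_next):
    """Directed reachability delta(s -> s') = h(s') - h(s)."""
    return h[s_next] - h[s]

# Toy potential: larger values for states that tend to occur later.
h = {"intact": 0.0, "one_broken": 1.0, "hundred_broken": 3.5}
```

Additivity holds by construction: delta(s1 → s2) + delta(s2 → s3) equals delta(s1 → s3), and the measure is soft in that more irreversible transitions yield larger values.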
While these properties are satisfactory, the following aspect should be considered to prevent potential confusion: while we expect δ(s → s′) = 0 if the transition between states s and s′ is fully reversible, the converse is not guaranteed, especially for non-ergodic environments. For instance, if a Markov chain does not admit a trajectory between states s and s′, it might still be the case that h(s) = h(s′), and consequently, δ(s → s′) = 0.
3.2 Detecting and Penalizing Side Effects for Safe Exploration
The problem of detecting and avoiding side effects is well known and crucially important for safe exploration (Moldovan and Abbeel, 2012; Eysenbach et al., 2017; Krakovna et al., 2018; Armstrong and Levinstein, 2017). Broadly, the problem involves detecting and avoiding state transitions that permanently and irreversibly damage the agent or the environment. As such, it is not surprising that it is fundamentally related to reachability, in that the agent is prohibited from taking actions that drastically reduce the reachability between the resulting state and some predefined safe state. In Eysenbach et al. (2017), the authors learn a reset policy responsible for resetting the environment to some initial state after the agent has completed its trajectory. The resulting value function of the reset policy indicates when the actual (forward) policy executes an irreversible state transition. In contrast, Krakovna et al. (2018) propose to attack the problem by measuring reachability relative to a baseline state. However, determining it requires counterfactual reasoning, which in turn requires a known causal model.
We propose to directly use the reachability measure defined in Section 3.1 to derive a Lagrangian for safe exploration. Let r_t be the reward (potentially including an exploration bonus) at time-step t. The augmented reward r̂_t is given by:

r̂_t = r_t − β δ(s_t → s_{t+1})    (4)
where β is a scaling coefficient. In practice, one may replace δ with f(δ), where f is a monotonically increasing transfer function (e.g. a step function).
Intuitively, transitions that are less reversible cause the h-potential, and thereby the resulting reachability measure δ, to increase in expectation. This in turn incurs a penalty, which is reflected in the value function of the agent. Conversely, transitions that are reversible should have the property that δ = 0 (also in expectation), thereby incurring no penalty.
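A sketch of the augmented reward with a step transfer function f applied to h(s′) − h(s); the potential values, threshold and coefficient below are illustrative choices, not values from the paper.

```python
def augmented_reward(r, h, s, s_next, beta=1.0, threshold=0.5):
    """r_hat = r - beta * f(delta), with f a step function: transitions
    whose delta = h(s') - h(s) exceeds the threshold are penalized."""
    d = h[s_next] - h[s]
    return r - beta * (1.0 if d > threshold else 0.0)

# Toy potential: breaking the vase is irreversible, moving it is not.
h = {"intact": 0.0, "moved": 0.0, "broken": 5.0}
```

An irreversible transition (intact → broken) is penalized, while a reversible one (intact → moved) passes the environment reward through unchanged.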
3.3 Rewarding Curious Behaviour
In many environments where reinforcement learning methods shine, the reward function is assumed to be given; however, shaping a good reward function can often prove to be a challenging endeavour. It is in this context that the notion of curiosity comes to play an important role (Schmidhuber, 2010; Chentanez et al., 2005; Pathak et al., 2017; Burda et al., 2018; Savinov et al., 2018). One typical approach towards encouraging curious behaviour is to seek novel states that surprise the agent (Schmidhuber, 2010; Pathak et al., 2017; Burda et al., 2018), using the error in the agent's prediction of future states as a curiosity reward. This approach is, however, susceptible to the so-called noisy-TV problem, wherein an uninteresting source of entropy like a noisy TV can induce a large curiosity bonus because the agent cannot predict its future state. Savinov et al. (2018) propose to define novelty in terms of (undirected) reachability: states that are easily reachable from the current state are considered less novel.
The h-potential and the corresponding reachability measure afford another way of defining a curiosity reward: namely, states that are difficult to access by a simple reference policy (e.g. a random policy) should incur a larger reward. In other words, the reward encourages an agent to do hard things, i.e. to seek states that are otherwise difficult to reach just by chance. The general form of the corresponding reward is given by: r̂_t = r_t + β f(δ(s_t → s_{t+1})), for a monotonically increasing transfer function f.
Despite the above being independent of the reward function defined by the environment, the two might often align: in many environments, the task at hand is to reach the least reachable state. This is readily recognized in classical control tasks like Pendulum, Cartpole and Mountain-Car, where the goal state is often the least reachable one. However, if the environment's specified task requires the agent to execute irreversible trajectories along the way, we expect our proposed reward to be less applicable.
4 Algorithm and Experiments
In this section, we introduce a learning algorithm for the parameterized h-potential and empirically validate it on a selection of discrete and continuous environments (more experiments can be found in Appendix C).
For interesting MDPs with a large number of states and unknown state-transition models, an analytic solution like in Section 2.2 is not feasible. In such cases, the h-potential can be parameterized by a neural network h_θ with parameters θ, reducing the optimization problem in Eqn 2 to:

θ* = argmax_θ { J_π[h_θ] − λ T_π[h_θ] }    (5)
For stochastic training, the expectation in Eqn 1 can be replaced by its Monte-Carlo estimate, and the optimization problem in Eqn 5 can be solved via stochastic gradient descent – the details are given in Algorithm 1 (Appendix B).
Now, we turn to the question of which regularizer to use. Perhaps the simplest candidate is early stopping, wherein the network is simply not trained to convergence. In combination with weight decay and/or gradient clipping, we find it to work surprisingly well in practice. Another good regularizer is the so-called trajectory regularizer (cf. Eqn 17 in Appendix A):

T_π[h] = E_{τ, t} [ (h(s_{t+1}) − h(s_t))² ]    (6)

In words, the trajectory regularizer penalizes all changes in h along a trajectory, whereas the primary objective encourages h to increase; for an appropriate coefficient λ, a balance is found.
2D World with Vases. (Experimental details and additional plots can be found in Appendix C.1.1.) The environment considered is a 2D world, where cells can be occupied by the agent, the goal and/or a vase (their respective positions are randomly sampled in each episode). If the agent enters a cell with a vase in it, the vase disappears without compromising the agent. We use a random policy to generate state-transition trajectories, which we then use to train the h-potential. In Fig 9 (in Appendix C.1.1), we plot the h-potential along a trajectory (parameterized by the time-step t) generated by a random policy. We find that h increases step-wise when the agent breaks a vase, but remains constant as it moves around – consequently, we observe that the breaking of a vase corresponds to a spike in the δ signal (Fig 3). Indeed, the latter action is reversible whereas the former irreversibly changes the environment. Moreover, we find that the spikes in Fig 3 are of roughly similar heights, indicating that the model has learned to measure the number of vases broken, i.e. it has learned to quantify irreversibility instead of merely detecting it (the trained h-potential can also be utilized to derive a safety reward, as elaborated in Section 3.2; cf. Fig 11 in Appendix C.1.1).
Now, to study the robustness of the h-potential to noise, we carry out the following two experiments. In the first, we append a uniformly random, temporally uncorrelated noise signal to the state, which serves as an entropy source (i.e. a noisy TV). In the second, we append a clock to the state, i.e. a temporally correlated signal that increases in constant intervals as the trajectory progresses. Figs 7(a) and 7(b) (Appendix C.1.1) show the respective plots for the corresponding reachability δ. While the former adds noise to the background of the δ signal, the spikes remain clearly visible, suggesting that the h-potential is fairly robust to temporally uncorrelated sources of entropy. The latter has a more interesting effect: the h-potential latches on to the clock signal, which results in the baseline shifting up by a constant. While the spikes remain visible for the most part, this experiment shows that the model might be susceptible to spurious causal signals in the environment.
Sokoban ("warehouse-keeper"; experimental details and additional plots can be found in Appendix C.1.3) is a classic puzzle video game, where an agent must push a number of boxes to set goal locations placed on a map. We use a 2D-world-like implementation (adapted from Schrader, 2018), where each cell can be occupied by a wall, the agent or a box. Additionally, a goal marker may co-occupy a cell with all sprites except a wall. The agent may only push boxes (and not pull), rendering certain moves irreversible – for instance, when a box is pushed against a wall. Solving Sokoban requires long-term planning, precisely due to the existence of such irreversible moves. To further exacerbate the problem, even determining whether a move is irreversible might be non-trivial.
We train the h-potential on trajectories generated by a random policy, wherein we generate a random (solvable) map for each trajectory. Fig 4 shows the evolution of h with time-steps for a randomly sampled (validation) map. We find that h increases sharply when a box is pushed against a wall, but remains constant as the agent moves about (potentially pushing a box around). Indeed, the latter is reversible whereas the former is not. Further, we confirm that h does not necessarily increase along all trajectories, but only in expectation (Fig 15). We therefore learn that the h-potential can be used to extract useful information from the environment, all without any external supervision (via rewards) or specialized policies.
Mountain-Car with Friction. (Experimental details and additional plots can be found in Appendix C.2.2.) The environment considered shares its dynamics with the well-known (continuous) Mountain-Car environment (Sutton and Barto, 2011), but with a crucial amendment: the car is subject to friction (technically, this is achieved by subtracting a velocity-dependent term from the acceleration). Friction is required to make the environment dissipative and thereby induce an arrow of time (cf. Section 2.3). Moreover, we initialize the system in a uniform-randomly sampled state to avoid exploration issues.
Fig 5(a) shows the output of the h-potential trained with trajectory regularization overlaid with random trajectories, whereas Fig 5 plots the negative potential −h at zero velocity together with the height of the mountain. We not only find that h is at its maximum around the valley, but also that −h at zero velocity largely recovers the terrain just from random trajectories. In addition, we also train the h-potential under identical conditions but without friction. The resulting environment is not dissipative, and in Fig 5(b) we accordingly find that the corresponding h-potential is not informative (and an order of magnitude smaller), highlighting the practical importance of dissipation.
Appendices C.1.2 and C.2.1 present additional experiments. The former shows, for a discrete environment, that the h-potential can be used to derive a reward signal that correlates well with what one might engineer; in the latter, we learn the h-potential for an under-damped pendulum.
4.1 Connection to Stochastic Processes
In this section, we empirically study the link between our method and the theory of stochastic processes. Concretely: our goal is to investigate whether a learned arrow of time behaves as expected by comparing it with a known notion of an arrow of time due to Jordan, Kinderlehrer, and Otto (1998). Experimental details are provided in Appendix C.3.
Following the notation of Jordan et al. (1998), we consider the spatial distribution ρ(x, t) at time t of a particle undergoing Brownian motion in the presence of a potential Ψ(x). The Ito stochastic differential equation (associated with the Fokker-Planck equation) is given by:

dX(t) = −∇Ψ(X(t)) dt + √(2 β⁻¹) dW(t)    (7)

where X(t) is the random variable distributed according to ρ(·, t), the initial spatial distribution ρ(x, 0) is fixed, β is a parameter and W(t) is the standard Wiener process (or equivalently, dW/dt is uncorrelated white noise). The celebrated Jordan-Kinderlehrer-Otto result (Jordan et al., 1998) shows that the dynamics of a distribution ρ satisfying the Ito SDE (Eqn 7) has the property that the following free-energy functional can only decrease with time (in other words, F is a Lyapunov functional):

F(ρ) = E(ρ) + β⁻¹ S(ρ)    (8)

where E(ρ) = ∫ Ψ(x) ρ(x) dx is the energy functional and S(ρ) = ∫ ρ(x) log ρ(x) dx is the (negative Gibbs-Boltzmann) entropy functional. It follows that the free-energy functional induces an arrow of time for the stochastic process generated by Eqn 7.
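To build intuition for Eqn 7, one can integrate it with the Euler-Maruyama scheme and track a Monte-Carlo estimate of the energy term E(ρ). The quadratic Ψ below is an illustrative choice (the full free energy would additionally require the entropy term, which is estimated non-parametrically in our experiments).

```python
import numpy as np

def simulate_energy(n=5000, dt=0.01, t_end=2.0, beta=1.0, seed=0):
    """Euler-Maruyama for dX = -grad(Psi(X)) dt + sqrt(2 / beta) dW with
    Psi(x) = |x|^2 / 2, starting all particles off-equilibrium at (3, 3).
    Returns Monte-Carlo estimates of E(rho) = E[Psi(X(t))] over time."""
    rng = np.random.default_rng(seed)
    x = np.full((n, 2), 3.0)
    energies = []
    for _ in range(int(t_end / dt)):
        energies.append(float(0.5 * (x ** 2).sum(axis=1).mean()))
        # grad Psi(x) = x for the quadratic potential
        x += -x * dt + np.sqrt(2.0 * dt / beta) * rng.standard_normal(x.shape)
    return energies
```

For this choice of Ψ the energy estimate relaxes from its initial value of 9.0 towards equilibrium as the particle cloud settles into the Gibbs distribution.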
Given that background, the question we now ask is the following: given just samples from the stochastic process X(t), how well does a learned h-potential agree with the true free-energy functional? To answer that, we first define the h-functional as:

H(t) = E_{x ∼ ρ(·, t)} [ h(x) ]    (9)

This allows us to compare H(t) with F(ρ(·, t)), modulo a linear scaling and shift. To that end, we train h with realizations of two-dimensional random walks under an elliptic paraboloid potential Ψ. Further, E(ρ) is estimated via Monte-Carlo sampling, the differential entropy S(ρ) via a non-parametric estimator (Kozachenko and Leonenko, 1987; Kraskov et al., 2004; Gao et al., 2015), and the coefficients of the linear transform mapping H to F via linear regression. Fig 16 (in Appendix C.3) plots h as a function of the state x, whereas Fig 7 shows that after appropriate (linear) scaling, the learned H largely agrees with the true F.
5 Conclusion

Over the course of the paper, we addressed the problem of learning the arrow of time in Markov (Decision) Processes. Having formulated an objective (Eqn 2) and analyzed the corresponding optimization problem for discrete state spaces (Section 2.2 and Appendix A), we laid out the fundamental challenges that arise – namely the presence of demonic policies and the requirement that the environment be dissipative (Section 2.3). Under appropriate assumptions, we discussed how the arrow of time can be used to measure reachability, detect side effects and define a curiosity reward (Section 3). Subsequently, we demonstrated the process of learning the arrow of time on a selection of discrete and continuous environments (Section 4). Finally, we showed for random walks that the learned arrow of time agrees well with the Free-Energy functional, which acts as the true arrow of time. Future work could draw connections to algorithmic independence of cause and mechanism (Janzing et al., 2016) and explore applications in causal inference (Janzing, 2010; Peters et al., 2017).
The authors would like to acknowledge Min Lin for the initial discussions, Simon Ramstedt, Zaf Ahmed and Maximilian Puelma Touzel for their feedback on the draft.
- Eddington (1929) Arthur Stanley Eddington. The nature of the physical world. Cambridge University Press, Cambridge, England, 1st edition, 1929.
- Lyapunov (1892) Aleksandr Mikhailovich Lyapunov. The general problem of the stability of motion. Kharkov Mathematical Society, 1892.
- Prigogine (1978) Ilya Prigogine. Time, structure, and fluctuations. Science, 201(4358):777–785, 1978.
- Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
- Crooks (1999) Gavin E. Crooks. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Phys. Rev. E, 60:2721–2726, Sep 1999. doi: 10.1103/PhysRevE.60.2721. URL https://link.aps.org/doi/10.1103/PhysRevE.60.2721.
- Zurek (1989) Wojciech H Zurek. Algorithmic randomness and physical entropy. Physical Review A, 40(8):4731, 1989.
- Zurek (1998) Wojciech H Zurek. Decoherence, chaos, quantum-classical correspondence, and the algorithmic arrow of time. Physica Scripta, 1998(T76):186, 1998.
- Janzing et al. (2016) Dominik Janzing, Rafael Chaves, and Bernhard Schölkopf. Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference. New Journal of Physics, 18(9):093052, 2016.
- Janzing (2010) Dominik Janzing. On the entropy production of time series with unidirectional linearity. Journal of Statistical Physics, 138(4-5):767–779, 2010.
- Bauer et al. (2016) Stefan Bauer, Bernhard Schölkopf, and Jonas Peters. The arrow of time in multivariate time series. In International Conference on Machine Learning, pages 2043–2051, 2016.
- Seifert (2012) Udo Seifert. Stochastic thermodynamics, fluctuation theorems and molecular machines. Reports on progress in physics, 75(12):126001, 2012.
- Hans et al. Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning.
- Moldovan and Abbeel (2012) Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810, 2012.
- Eysenbach et al. (2017) Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. arXiv preprint arXiv:1711.06782, 2017.
- Goyal et al. (2018) Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379, 2018.
- Nair et al. (2018) Suraj Nair, Mohammad Babaeizadeh, Chelsea Finn, Sergey Levine, and Vikash Kumar. Time reversal as self-supervision. arXiv preprint arXiv:1810.01128, 2018.
- Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- Krakovna et al. (2018) Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. CoRR, abs/1806.01186, 2018. URL http://arxiv.org/abs/1806.01186.
- Wei et al. (2018) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Pickup et al. (2014) Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2035–2042, 2014.
- Li et al. (2013) Yan Li, Weihai Zhang, and Xikui Liu. Stability of nonlinear stochastic discrete-time systems. Journal of Applied Mathematics, 2013, 2013.
- Thomson (1874) William Thomson. Kinetic theory of the dissipation of energy, 1874.
- Ng and Russell (2000) Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning. Citeseer, 2000.
- Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
- Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- Bunimovich (2007) L. Bunimovich. Dynamical billiards. Scholarpedia, 2(8):1813, 2007. doi: 10.4249/scholarpedia.1813. revision #91212.
- Willems (1972) Jan C Willems. Dissipative dynamical systems, Part I: General theory. Archive for Rational Mechanics and Analysis, 45(5):321–351, 1972.
- McDonald (2015) Kirk T McDonald. A damped oscillator as a hamiltonian system. 2015.
- Savinov et al. (2018) Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
- Leike et al. (2017) Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
- Armstrong and Levinstein (2017) Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720, 2017.
- Schmidhuber (2010) Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- Chentanez et al. (2005) Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
- Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
- Burda et al. (2018) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
- Schrader (2018) Max-Philipp B. Schrader. gym-sokoban. https://github.com/mpSchrader/gym-sokoban, 2018.
- Sutton and Barto (2011) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2011.
- Kozachenko and Leonenko (1987) LF Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
- Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
- Gao et al. (2015) Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.
- Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
- Shangtong (2018) Shangtong Zhang. Modularized implementation of deep rl algorithms in pytorch. https://github.com/ShangtongZhang/DeepRL, 2018.
- Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Wang et al. (2015) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- Savitzky and Golay (1964) Abraham Savitzky and Marcel JE Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry, 36(8):1627–1639, 1964.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Appendix A Theoretical Analysis
In this section, we present a theoretical analysis of the optimization problem formulated in Eqn 2, and analytically evaluate the result for a few toy Markov processes to validate that the resulting solutions are indeed consistent with intuition. To simplify the exposition, we consider the discrete case where the state space of the MDP is finite.
Consider a discrete Markov chain with enumerable states $s_1, s_2, \ldots$. At an arbitrary (but given) time-step $t$, we let $u_t(i)$ denote the probability that the Markov chain is in state $s_i$, and $u_t$ the corresponding vector (over states). With $T_{ij}$ we denote the probability of the Markov chain transitioning from state $s_j$ to state $s_i$ under some policy $\pi$, i.e. $T_{ij} = p(s_{t+1} = s_i \,|\, s_t = s_j; \pi)$. One has the transition rule:
$$u_{t+1} = T u_t = T^{t+1} u_0,$$
where $T^t$ is the $t$-th matrix power of $T$.
Now, we let $h_i$ denote the value $h$ takes at state $s_i$, i.e. $h_i = h(s_i)$, and the corresponding vector (over states) becomes $h$. This reduces the expectation of the function $h$ (now a vector) w.r.t. any state distribution $u$ (now also a vector) to the scalar product $\langle h, u \rangle$. In matrix notation, the optimization problem in Eqn 2 simplifies to:
$$\hat h = \arg\max_h J(h), \qquad J(h) = \sum_{t=0}^{N-1} \langle h, u_{t+1} - u_t \rangle - R(h), \tag{11}$$
where $R$ is a regularizer and $N$ is the trajectory length.
For certain $R$, the discrete problem in Eqn 11 can be handled analytically. We consider two candidates for $R$: the first being the $L_2$ norm of $h$, and the second being the $L_2$ norm of the change in $h$, in expectation along a trajectory.
Proposition 1. If $R(h) = \lambda \langle h, h \rangle$, the solution to the optimization problem in Eqn 11 is given by:
$$\hat h = \frac{1}{2\lambda}(u_N - u_0), \tag{12}$$
where $N$ denotes the trajectory length.
Proof. First, note that the objective in Eqn 11 becomes (the sum over $t$ telescopes):
$$J(h) = \langle h, u_N - u_0 \rangle - \lambda \langle h, h \rangle.$$
To solve the maximization problem, we differentiate $J$ w.r.t. its argument $h$, and set the resulting expression to zero. This yields:
$$\nabla_h J = u_N - u_0 - 2\lambda \hat h = 0,$$
from which Eqn 12 follows. ∎
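As a quick numeric sanity check of Proposition 1, the closed form can be compared against random perturbations for an arbitrary two-state chain; the transition matrix, $\epsilon$, $\lambda$ and horizon below are illustrative choices, not values taken from this paper:

```python
import numpy as np

# Two-state chain: from either state, move to s2 with probability eps
# and to s1 with probability 1 - eps (columns of Tmat are "from"-states).
eps, lam, N = 0.8, 0.1, 5
Tmat = np.array([[1 - eps, 1 - eps],
                 [eps,     eps]])
u0 = np.array([0.5, 0.5])                 # uniform initial distribution

# State distribution after N steps: u_N = Tmat^N u_0.
uN = np.linalg.matrix_power(Tmat, N) @ u0

def J(h):
    # Telescoped objective: <h, u_N - u_0> - lam * <h, h>.
    return h @ (uN - u0) - lam * h @ h

h_star = (uN - u0) / (2 * lam)            # closed form of Eqn 12

# The closed form should beat random perturbations of itself.
rng = np.random.default_rng(0)
assert all(J(h_star) >= J(h_star + 0.1 * rng.standard_normal(2))
           for _ in range(100))
```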
Proposition 1 has an interesting implication: if the Markov chain is initialized at equilibrium, i.e. if $T u_0 = u_0$, we obtain that $\hat h = 0$ identically. Given the above, we may now consider simple Markov chains to explore the implications of Eqn 12.
The above example (a two-state Markov chain that transitions to $s_2$ with probability $\epsilon$ and to $s_1$ with probability $1 - \epsilon$, initialized with the uniform distribution) illustrates two things. On the one hand, if $\epsilon = 1/2$, one obtains a Markov chain with perfect reversibility, i.e. the transition $s_1 \to s_2$ is equally as likely as the transition $s_2 \to s_1$. In this case, one indeed obtains $\hat h = 0$, as mentioned above. On the other hand, if one sets $\epsilon = 1$, the transition from $s_2$ to $s_1$ is never sampled, and that from $s_1$ to $s_2$ is irreversible; consequently, $\hat h_2 - \hat h_1$ takes the largest value possible. Now, while this aligns well with our intuition, the following example exposes a weakness of the $L_2$-norm penalty used in Proposition 1.
Consider two Markov chains over four states, both always initialized at $s_1$. For the first Markov chain, the dynamics admits the (deterministic) transitions $s_1 \to s_2 \to s_3 \to s_4$, whereas for the second chain, one has $s_1 \to s_3 \to s_2 \to s_4$, with $s_4$ absorbing in both cases. Now, for both chains, it's easy to see that $u_t = e_4$ if $t \geq 3$, but $u_t \neq e_4$ otherwise (where $e_i$ denotes the indicator vector of state $s_i$). From Eqn 12, one obtains:
$$\hat h = \frac{1}{2\lambda}(e_4 - e_1). \tag{16}$$
The solution for $\hat h$ given by Eqn 16 indeed increases (non-strictly) monotonously with the timestep. However, we obtain $\hat h_2 = \hat h_3 = 0$ for both Markov chains. In particular, $\hat h$ does not increase between the transition $s_2 \to s_3$ in the former and the transition $s_3 \to s_2$ in the latter, even though both transitions are irreversible. It is in general apparent from Proposition 1 that the solution for $\hat h$ depends only on the initial and final state distribution, and not the intermediate trajectory.
Now, consider the following regularizer that penalizes not just the function norm, but also the change in $h$ in expectation along trajectories:
$$R(h) = \sum_{t=0}^{N-1} \mathbb{E}_{s_t, s_{t+1}} \left[ \left( h(s_{t+1}) - h(s_t) \right)^2 \right] + \lambda \langle h, h \rangle, \tag{17}$$
where $\lambda$ is the relative weight of the $L_2$ regularizer and $N$ is the trajectory length. This leads to the result:
$$\hat h = \frac{1}{2} \Big[ \lambda I + \sum_{t=0}^{N-1} M_t \Big]^{-1} (u_N - u_0), \qquad M_t = \mathrm{diag}(u_{t+1}) + \mathrm{diag}(u_t) - T\,\mathrm{diag}(u_t) - \mathrm{diag}(u_t)\,T^\top. \tag{18}$$
Like in Proposition 1, we maximize $J$ by setting its gradient w.r.t. $h$ to zero. This yields:
$$\nabla_h R(\hat h) = \sum_{t=0}^{N-1} (u_{t+1} - u_t).$$
The RHS is again a telescoping sum; it evaluates to $u_N - u_0$ (cf. proof of Proposition 1). The LHS can be expressed as (with $I$ as the identity matrix):
$$\nabla_h R(\hat h) = 2 \Big[ \lambda I + \sum_{t=0}^{N-1} M_t \Big] \hat h,$$
with $M_t$ as in Eqn 18. Solving the resulting linear system for $\hat h$ yields Eqn 18. ∎
Consider the two-state Markov chain in Example 1 and the associated transition matrix $T$ (with $T_{11} = T_{12} = 1 - \epsilon$ and $T_{21} = T_{22} = \epsilon$) and initial state distribution $u_0 = (1/2, 1/2)^\top$. Using the regularization scheme in Eqn 17 and the associated solution in Eqn 18, one obtains:
$$\hat h = \frac{1 - 2\epsilon}{4(\lambda + 1)} \, (1, -1)^\top.$$
To obtain this result (interested readers may refer to the attached SymPy computation), we use that $u_{t+1} = u_t$ for all $t \geq 1$ and truncate the sum without loss of generality at $N = 1$.
Like in Example 1, we observe $\hat h = 0$ if $u_{t+1} = u_t$ for all $t$ (i.e. at equilibrium). In addition, if $\epsilon > 1/2$, it can be shown that $\hat h_2 - \hat h_1$ increases monotonously with $\epsilon$ and takes the largest possible value at $\epsilon = 1$. We therefore find that for the simple two-state Markov chain of Example 1, the regularization in Eqn 17 indeed leads to intuitive behaviour for the respective solution $\hat h$. Now:
Consider the four-state Markov chain with transitions $s_1 \to s_2 \to s_3 \to s_4$ and the corresponding transition matrix $T$, where $T_{21} = T_{32} = T_{43} = T_{44} = 1$ and $T_{ij} = 0$ for all other $i, j$. Set $u_0 = (1, 0, 0, 0)^\top$, i.e. the chain is always initialized at $s_1$. Now, the summation over $t$ in Eqn 18 can be truncated w.l.o.g. at $N = 3$, given that $u_{t+1} = u_t$ for all $t \geq 3$. At $\lambda = 0$, one solution is:
$$\hat h = \frac{1}{2}(0, 1, 2, 3)^\top$$
(up to an additive constant).
Further, for all finite $\lambda > 0$, one obtains $\hat h_2 < \hat h_3$, where the inequality is strict. This is unlike Eqn 16, where $\hat h_2 = \hat h_3$, and consistent with the intuitive expectation that the arrow of time must increase along irreversible transitions.
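Under our reading of this example (deterministic walk $s_1 \to s_2 \to s_3 \to s_4$ with $s_4$ absorbing, whose per-step curvature terms in Eqn 18 sum to the Laplacian of the path graph), the solution can be checked numerically; this is a sketch of that reconstruction, not code from the paper:

```python
import numpy as np

# Path-graph Laplacian: the sum over t of the per-step terms M_t for the
# deterministic chain s1 -> s2 -> s3 -> s4, initialized at s1.
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
rhs = 0.5 * np.array([-1., 0., 0., 1.])   # (u_3 - u_0) / 2

# lam = 0: L is singular, so take the minimum-norm solution.
h0, *_ = np.linalg.lstsq(L, rhs, rcond=None)
assert np.allclose(np.diff(h0), 0.5)      # h rises by 1/2 per transition

# Any finite lam > 0: the solution is unique and still strictly increasing.
for lam in (0.01, 0.1, 1.0):
    h = np.linalg.solve(lam * np.eye(4) + L, rhs)
    assert np.all(np.diff(h) > 0)
```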
In conclusion: we find that the functional objective defined in Eqn 2 may indeed lead to analytical solutions that are consistent with the notion of an arrow of time in certain toy Markov chains. However, in most interesting real-world environments, the transition model is not known and/or the number of states is infeasibly large, rendering an analytic solution intractable. In such cases, as we see in Section 4, it is possible to parameterize $h$ as a neural network and train the resulting model with stochastic gradient descent to optimize the functional objective defined in Eqn 2.
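To illustrate the gradient-based route in the simplest (tabular) setting, one can check that plain gradient ascent on the telescoped objective recovers the closed-form solution of Proposition 1; all numbers below are illustrative, and a neural parameterization of $h$ would replace the table:

```python
import numpy as np

# Small column-stochastic chain and its distribution after N steps.
lam = 0.1
Tmat = np.array([[0.9, 0.5],
                 [0.1, 0.5]])
u0 = np.array([1.0, 0.0])
uN = np.linalg.matrix_power(Tmat, 10) @ u0

# Gradient ascent on J(h) = <h, u_N - u_0> - lam <h, h>, with h a table.
h = np.zeros(2)
for _ in range(2000):
    grad = (uN - u0) - 2 * lam * h
    h += 0.05 * grad

# Converges to the closed form of Eqn 12.
assert np.allclose(h, (uN - u0) / (2 * lam), atol=1e-6)
```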
Appendix B Algorithm
Appendix C Experimental Details
All experiments were run on a workstation with 40 cores, 256 GB RAM and two NVIDIA GTX 1080 Ti GPUs.
C.1 Discrete Environments
C.1.1 2D World with Vases
The environment state comprises three binary images (corresponding to the agent, the vases and the goal), and the vases appear in a different arrangement every time the environment is reset. The probability of sampling a vase at any given position is set to a fixed constant.
We use a two-layer deep and 256-unit wide ReLU network to parameterize the $h$-potential. It is trained on 4096 trajectories of length 128 for 10000 iterations of stochastic gradient descent with the Adam optimizer (learning rate: 0.0001). The batch-size is set to 128, and we use a weight decay of 0.005 to regularize the model. We use a validation trajectory to generate the plots in Figs 9 and 3. Moreover, Fig 10 shows histograms of the values taken by $\hat h$ at various time-steps along the trajectory. We learn that $\hat h$ takes on larger values (on average) as $t$ increases.
To test the robustness of our method, we conduct experiments where the environment state is augmented with one of: (a) an image with uniform-randomly sampled pixel values (TV noise) and (b) an image where every pixel takes a value determined by the time-step $t$ of the corresponding state (causal noise; recall that the trajectory length is set to 128). Figs 7(a) and 7(b) plot the corresponding $\hat h$ along randomly sampled trajectories.
Given a learned arrow of time, we now present an experiment where we use it to derive a safe-exploration penalty (in addition to the environment reward). To that end, we consider the situation where the agent's policy is not random, but specialized to reach the goal state (from its current state). For both the baseline and the safe agents, every action is rewarded with the (negated) change in the Manhattan distance of the agent's position to that of the goal – i.e. an action that moves the agent closer to the goal is rewarded (+1), one that moves it farther away from the goal is penalized (−1), and one that keeps the distance unchanged is neither penalized nor rewarded (0). Further, every step incurs a small penalty (so as to keep the trajectories short), and exceeding the available time limit incurs a termination penalty. In addition, the reward function of the safe agent is augmented with the reachability penalty, i.e. it takes the form described in Eqn 4. We use a transfer function $f$ such that $f(x) = x$ if $x \geq 0$ (cf. Fig 3), and $f(x) = 0$ otherwise.
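The reward computation above can be sketched as follows; `beta`, `step_penalty` and the exact transfer function are illustrative stand-ins, not the values used in our experiments:

```python
def transfer(delta_h):
    # Penalize only increases in the learned arrow of time (cf. Fig 3).
    return max(delta_h, 0.0)

def safe_reward(dist_before, dist_after, h_before, h_after,
                beta=1.0, step_penalty=0.01):
    # +1 for moving closer to the goal, -1 for moving away, 0 otherwise.
    r_env = float(dist_before > dist_after) - float(dist_after > dist_before)
    # Every step incurs a small penalty; the h-term discourages
    # irreversible side effects such as breaking vases.
    return r_env - step_penalty - beta * transfer(h_after - h_before)
```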
The policy is parameterized by a 3-layer deep, 256-unit wide (fully connected) ReLU network and trained via Duelling Double Deep Q-Learning (Van Hasselt et al., 2016; Wang et al., 2015); we adapt the implementation due to Shangtong (2018). The discount factor is held fixed, and the target network is updated once every 200 iterations. For exploration, we use an $\epsilon$-greedy policy, where $\epsilon$ is decayed linearly over the initial iterations. The replay buffer stores past experiences, from which training batches are sampled. Fig 10(a) shows the probability of reaching the goal (in an episode of 30 steps) over the training iterations (sample size: 100 episodes), whereas Fig 10(b) shows the expected number of vases broken per episode (over the same 100 episodes). Both curves are smoothed by a Savitzky–Golay filter (Savitzky and Golay, 1964); the original, unsmoothed curves are shaded. As expected, we find that using the safety penalty does indeed result in fewer vases broken, but also makes the task of reaching the goal more difficult (we do not ensure that the goal is reachable without breaking vases).
C.1.2 2D World with Drying Tomatoes
The environment considered comprises a 2D world where each cell is initially occupied by a watered tomato plant (we draw inspiration from the tomato-watering environment described in Leike et al. (2017)). The agent waters the cell it occupies, restoring the moisture level of the plant in the said cell to the maximum. However, for each step a plant is not watered, it loses a fixed fraction of its maximum moisture. If a plant loses all moisture, it is considered dead and no amount of watering can resurrect it. The state-space comprises two images: the first image is an indicator of the agent's position, whereas the pixel values of the second image quantify the amount of moisture held by the plant at the corresponding location (this is a strong causal signal which may distract the model; we include it nonetheless to make the task more challenging).
We show that it is possible to recover an intrinsic reward signal that coincides well with one that one might engineer. To that end, we parameterize the $h$-potential as a two-layer deep, 256-unit wide ReLU network and train it on 4096 trajectories (generated by a random policy) of length 128 for 10000 iterations of Adam (learning rate: 0.0001). The batch-size is set to 128 and the model is regularized with the trajectory regularizer.
Unsurprisingly, we find that $\hat h$ increases as the plants lose moisture. But conversely, when the agent waters a plant, it causes the $h$-potential to decrease by an amount that strongly correlates with the amount of moisture the watered plant gains. This can be used to define a dense reward signal for the agent, obtained by negating the change in $\hat h$ and centering it with a running average, where we use a momentum-based estimate to evaluate the running average.
In Fig 12, we plot, for a random trajectory, the intrinsic reward against a reference reward, which in this case is the moisture gain of the plant the agent just watered. Further, we observe the reward signal dropping significantly at around the 90-th step – this is precisely when all plants have died. This demonstrates that the $h$-potential can indeed be useful for defining intrinsic rewards.
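The running-average bookkeeping for such an intrinsic reward can be sketched as follows; the momentum value here is an illustrative placeholder, not the one used in our experiments:

```python
class IntrinsicReward:
    """Dense intrinsic reward derived from an h-potential: the agent is
    rewarded for decreasing h relative to a running average of its change."""

    def __init__(self, momentum=0.5):
        self.momentum = momentum
        self.avg_dh = 0.0            # running average of h(s_{t+1}) - h(s_t)

    def __call__(self, h_prev, h_next):
        dh = h_next - h_prev
        self.avg_dh = self.momentum * self.avg_dh + (1 - self.momentum) * dh
        return -(dh - self.avg_dh)   # reward drops in h below the baseline

reward = IntrinsicReward(momentum=0.5)
```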
C.1.3 Sokoban
The environment state comprises five binary images, where the pixel value at each location indicates the presence of the agent, a box, a goal, a wall or empty space. The layout of all sprites is randomized at each environment reset, under the constraint that the game is still solvable (Schrader, 2018). The $h$-potential is parameterized by a two-layer deep and 512-unit wide network, which is trained on 4096 trajectories of length 512 for 20000 steps of Adam (learning rate: 0.0001). The batch-size is set to 256 and we use the trajectory regularizer to regularize our model.
C.2 Continuous Environments
C.2.1 Under-damped Pendulum
The environment considered simulates an under-damped pendulum, where the state space comprises the angle $\theta$ (commonly represented as the tuple $(\cos\theta, \sin\theta)$ instead of a scalar) and the angular velocity $\dot\theta$ of the pendulum. The dynamics are governed by the following differential equation, where $u(t)$ is the (time-dependent) torque applied by the agent and $m$, $l$, $g$, $\eta$ are constants:
$$m l^2 \ddot\theta = -\eta \dot\theta - m g l \sin\theta + u(t).$$
We adapt the implementation in OpenAI Gym (Brockman et al., 2016) to add the extra friction term $-\eta \dot\theta$ to the dynamics. In our experiments, the constants $m$, $l$, $g$ and $\eta$ are held fixed, and the torque is sampled iid uniformly from a bounded interval.
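For concreteness, a minimal (semi-implicit Euler) integration of these dynamics; the constants and step size below are illustrative, not the values used in our experiments:

```python
import numpy as np

def pendulum_rollout(theta0, omega0, torque, dt=0.05,
                     m=1.0, l=1.0, g=10.0, eta=0.5):
    """Integrate m l^2 theta'' = -eta theta' - m g l sin(theta) + u(t)."""
    theta, omega = theta0, omega0
    traj = [(theta, omega)]
    for u in torque:
        domega = (-eta * omega - m * g * l * np.sin(theta) + u) / (m * l**2)
        omega += dt * domega          # semi-implicit Euler update
        theta += dt * omega
        traj.append((theta, omega))
    return np.array(traj)

# With zero torque, friction dissipates energy, so the oscillation decays;
# this is the irreversibility the h-potential picks up on.
traj = pendulum_rollout(np.pi / 3, 0.0, torque=np.zeros(500))
energy = 0.5 * traj[:, 1]**2 + 10.0 * (1 - np.cos(traj[:, 0]))
assert energy[-1] < 0.01 * energy[0]
```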
The $h$-potential is parameterized by a two-layer, 256-unit wide ReLU network, which is trained on 4096 trajectories of length 256 for 20000 steps of stochastic gradient descent with Adam (learning rate: 0.0001). The batch-size is set to 1024 and we use the trajectory regularizer to regularize the network. Fig 13(a) plots the learned $h$-potential (trained with the trajectory regularizer) as a function of the state, whereas Fig 13(b) shows the negative potential for all angles at zero angular velocity, i.e. $-\hat h(\theta, \dot\theta = 0)$. We indeed find that states in the vicinity of the resting state have a larger $h$-potential, owing to the fact that all trajectories converge to it for large $t$ due to the dissipative action of friction.
C.2.2 Continuous Mountain Car
The environment considered is a variation of Mountain Car (Sutton and Barto, 2011); we adapt the implementation due to Brockman et al. (2016), available here: github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py. The state-space is a tuple $(x, v)$ of the position and velocity of a vehicle on a mountainous terrain. The action space is the interval $[-1, 1]$, and the action $F$ denotes the force applied by the vehicle. The dynamics of the modified environment are given by the following equation of motion:
$$\dot v = p F - g \cos(3x) - \eta v,$$
where $p$ and $g$ are constants set to 0.0015 and 0.0025 respectively, and the velocity is clamped to the interval $[-0.07, 0.07]$. Our modification is the last term, $-\eta v$, which simulates friction. Further, the initial state is sampled uniformly from the state space. This can potentially be avoided if an exploratory policy is used (instead of the random policy) to gather trajectories, but we leave this for future work.
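A single step of these modified dynamics can be sketched as below; the friction coefficient `eta` is our addition, and its value here is illustrative:

```python
import numpy as np

def step(x, v, force, power=0.0015, gravity=0.0025, eta=0.01, v_max=0.07):
    """One step of the mountain-car dynamics with an added friction term."""
    v = v + force * power - gravity * np.cos(3 * x) - eta * v
    v = np.clip(v, -v_max, v_max)     # velocity is clamped to [-v_max, v_max]
    x = x + v
    return x, v

# The clamp keeps the velocity bounded no matter how long we push.
x, v = -0.5, 0.0
for _ in range(100):
    x, v = step(x, v, force=1.0)
assert -0.07 <= v <= 0.07
```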
The $h$-potential is parameterized by a two-layer, 256-unit wide ReLU network, which is trained on 4096 trajectories of length 256 for 20000 steps of stochastic gradient descent with Adam (learning rate: 0.0001). The batch-size is set to 1024 and we use the trajectory regularizer.
C.3 Stochastic Processes
The environment state comprises two scalars, the $x$ and $y$ coordinates of the particle's position $r = (x, y)$. The potential is given by:
$$U(r) = \frac{k}{2} \lVert r \rVert^2,$$
corresponding to a two-dimensional Ornstein–Uhlenbeck process, where $k$ is a positive constant.
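Such a process can be simulated with Euler–Maruyama integration; $k$, the noise scale and the step size below are illustrative choices, not values from the paper:

```python
import numpy as np

def simulate_ou(n_steps, k=1.0, sigma=0.5, dt=0.01, seed=0):
    """2D Ornstein-Uhlenbeck trajectory: dr = -grad U(r) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    r = np.array([2.0, 2.0])                  # start away from the origin
    traj = [r.copy()]
    for _ in range(n_steps):
        drift = -k * r                        # -grad U(r) for U = k/2 |r|^2
        r = r + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(2)
        traj.append(r.copy())
    return np.array(traj)

# The drift contracts the particle toward the origin: late positions are
# much closer than the initial one, in mean-norm terms.
traj = simulate_ou(5000)
assert np.linalg.norm(traj[-1000:], axis=1).mean() < np.linalg.norm(traj[0])
```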
We train a two-layer deep, 512-unit wide network on 8092 trajectories of length 64 for 20000 steps of stochastic gradient descent with Adam (learning rate: 0.0001). The batch-size is set to 1024 and the network is regularized by weight decay. Fig 16 shows the learned $h$-potential as a function of the position $r$. Fig 7 compares the free-energy functional with the learnt arrow of time given by the linearly scaled $\hat h$-functional. To obtain the linear scaling parameters, we find $a$ and $b$ such that the $L_2$ distance between $a \hat h + b$ and the free-energy functional is minimized (constraining $a$ to be positive).