A Theoretical Connection Between Statistical Physics and Reinforcement Learning

Sequential decision making in the presence of uncertainty and stochastic dynamics gives rise to distributions over state/action trajectories in reinforcement learning (RL) and optimal control problems. This observation has led to a variety of connections between RL and inference in probabilistic graphical models (PGMs). Here we explore a different dimension to this relationship, examining reinforcement learning using the tools and abstractions of statistical physics. The central object in the statistical physics abstraction is the idea of a partition function Z, and here we construct a partition function from the ensemble of possible trajectories that an agent might take in a Markov decision process. Although value functions and Q-functions can be derived from this partition function and interpreted via average energies, the Z-function provides an object with its own Bellman equation that can form the basis of alternative dynamic programming approaches. Moreover, when the MDP dynamics are deterministic, the Bellman equation for Z is linear, allowing direct solutions that are unavailable for the nonlinear equations associated with traditional value functions. The policies learned via these Z-based Bellman updates are tightly linked to Boltzmann-like policy parameterizations. In addition to sampling actions proportionally to the exponential of the expected cumulative reward as Boltzmann policies would, these policies take entropy into account, favoring states from which many outcomes are possible.




1 Introduction

One of the central challenges in the pursuit of machine intelligence is robust sequential decision making. In a stochastic and uncertain environment, an agent must capture information about the distribution over ways it may act and move through the state space. Indeed, the algorithmic process of planning and learning itself can lead to a well-defined distribution over state/action trajectories. This observation has led to a variety of connections between reinforcement learning (RL) and inference in probabilistic graphical models (PGMs) (Levine, 2018). In some ways this connection is unsurprising: belief propagation (and its relatives such as the sum-product algorithm) is understood to be an example of dynamic programming (Koller and Friedman, 2009), and dynamic programming was developed to solve control problems (Bellman, 1966; Bertsekas, 1995). Nevertheless, the exploration of the connection between control and inference has yielded fruitful insights into sequential decision making algorithms (Kalman, 1960; Attias, 2003; Ziebart, 2010; Kappen, 2011; Levine, 2018).

In this work, we present another point of view on reinforcement learning as a distribution over trajectories, one in which we draw upon useful abstractions from statistical physics. This view is in some ways a natural continuation of the agenda of connecting control to inference, as many insights in probabilistic graphical models have deep connections to, e.g., spin glass systems (Hopfield, 1982; Yedidia et al., 2001; Zdeborová and Krzakala, 2016). More generally, physics has often been a source of inspiration for ideas in machine learning (MacKay, 2003; Mezard and Montanari, 2009). Boltzmann machines (Ackley et al., 1985), Hamiltonian Monte Carlo (Duane et al., 1987; Neal et al., 2011; Betancourt, 2017) and, more recently, tensor networks (Stoudenmire and Schwab, 2016) are a few examples. In addition to direct inspiration, physics provides a compelling framework for reasoning about certain problems. The terms momentum, energy, entropy, and phase transition are ubiquitous in machine learning. However, abstractions from physics have so far not been as helpful for understanding reinforcement learning models and algorithms. That is not to say there is a lack of interaction: RL is being used in some experimental physics domains, but physics has not yet informed RL as directly as it has, e.g., graphical models (Carleo et al., 2019).

Nevertheless, we should expect deep connections between reinforcement learning and physics: an RL agent is trying to find a policy that maximizes expected reward, and many natural phenomena can be viewed through a minimization principle. For example, in classical mechanics or electrodynamics, a mass or light will follow a path that minimizes a physical quantity called the action, a property known as the principle of least action. Similarly, in thermodynamics, a system with many degrees of freedom, such as a gas, will explore its configuration space in search of a configuration that minimizes its free energy. In reinforcement learning, rewards and value functions have a very similar flavor to energies: they are extensive quantities, and the agent is trying to find a path that maximizes them. In RL, however, value functions are often treated as the central object of study. This stands in contrast to statistical physics formulations of such problems, in which the partition function is the primary abstraction, from which all the relevant thermodynamic quantities (average energy, entropy, heat capacity) can be derived. It is natural to ask, then: is there a theoretical framework for reinforcement learning that is centered on a partition function, in which value functions can be interpreted via average energies?

In this work, we show how to construct a partition function for a reinforcement learning problem. In a deterministic environment (Section 2), the construction is elementary and very natural. We explicitly identify the link between the underlying average energies associated with these partition functions and value functions of Boltzmann-like stochastic policies. As in the inference-based view on RL, moving from deterministic to stochastic environments introduces complications. In Section 3.2, we propose a construction for stochastic environments that results in realistic policies. Finally, in Section 4, we show how the partition function approach leads to an alternative model-free reinforcement learning algorithm that does not explicitly represent value functions.

We model the agent’s sequential decision-making task as a Markov decision process (MDP), as is typical. The agent selects actions in order to maximize its cumulative expected reward until a final state is reached. The MDP is defined by the objects (S, A, p, r). S and A are the sets of states and actions, respectively. p(s′ | s, a) is the probability of landing in state s′ after taking action a from state s, and r(s, a, s′) is the reward resulting from this transition. We also make the following additional assumptions: 1) S is finite, 2) all rewards r(s, a, s′) are bounded from above by R_max and deterministic, and 3) the number of available actions is uniformly bounded over all states by A_max. We also allow terminal states to have rewards even though there are no further actions and transitions; we denote these final-state rewards by r_f. By shifting all rewards by R_max we can assume without loss of generality that R_max = 0, making all transition rewards r(s, a, s′) non-positive. The final-state rewards r_f are still allowed to be positive, however.

2 Partition Functions for Deterministic MDPs

Our starting point is to consider deterministic Markov decision processes, those in which each transition probability distribution assigns all of its mass to a single state. Deterministic MDPs are a widely studied special case (Madani, 2002; Wen and Van Roy, 2013; Dekel and Hazan, 2013), and they are realistic models of many practical control problems, such as robotic manipulation and locomotion, drone maneuvering, or machine-controlled scientific experimentation. For the deterministic setting, we will use s_a to denote the state that follows the taking of action a in state s. Similarly, we will denote the reward more concisely as r(s, a).

2.1 Construction of State-Dependent Partition Functions

To construct a partition function, two ingredients are needed: a statistical ensemble, and an energy function E on that ensemble. We construct our ensembles from trajectories through the MDP; a trajectory τ is a sequence of tuples (s_t, a_t, r_t) whose last state is a terminal state. We use the notation s_t(τ), a_t(τ), and r_t(τ) to indicate the state, action, and reward, respectively, of trajectory τ at step t. Each state-dependent ensemble T_s is then the set of all trajectories that start at s, i.e., for which s_0(τ) = s. We will use these ensembles to construct a partition function for each state s. Taking L(τ) to be the length of the trajectory, we write the energy function as

  E(τ) = −∑_{t=0}^{L(τ)−1} r_t(τ) − r_f(s_{L(τ)}) = −∑_{t=0}^{L(τ)} r_t(τ).

The form on the right takes the notational shortcut of writing r_{L(τ)} for the reward of the terminal state. Since the agent is trying to maximize its cumulative reward, E(τ) is a reasonable measure of the agent’s preference for a trajectory, in the sense that lower-energy trajectories accumulate higher rewards. Note in particular that the ground-state configurations are the most rewarding trajectories for the agent. With the ingredients T_s and E defined, we get the following partition function:

  Z_β(s) = ∑_{τ ∈ T_s} e^{−β E(τ)}.    (2)

In this expression, β > 0 is a hyperparameter that can be interpreted as an inverse temperature. (This interpretation comes from statistical physics, where β = 1/(k_B T) and k_B is the Boltzmann constant.) This partition function does not distinguish between two trajectories that have identical cumulative rewards but different lengths. However, among equally rewarding trajectories, it seems natural to prefer shorter ones. One way to encode this preference is to add an explicit penalty μ > 0 on the length L(τ) of a trajectory, leading to the partition function

  Z_{β,μ}(s) = ∑_{τ ∈ T_s} e^{−β E(τ) − μ L(τ)}.    (3)

In statistical physics, μ is called a chemical potential; it measures the tendency of a system (such as a gas) to accept new particles. Since it is sometimes inconvenient to reason about systems with a fixed number of particles, adding a chemical potential offers a way to relax that constraint, allowing a system to have a varying number of particles while keeping the average fixed.

Note that since MDPs can allow for both infinitely long trajectories and infinite sets of finite trajectories, Z_{β,μ}(s) can be infinite even in relatively simple settings. In Appendix A.1, we find that a sufficient condition for Z_{β,μ} to be well defined is taking μ > log A_max. As written, the partition function in Eq. 3 is ambiguous for final states; for clarity we define Z_{β,μ}(s_f) = e^{β r_f(s_f)} for a terminal state s_f. We will refer to these values as the boundary conditions.

Mathematically, the parameter μ plays a similar role to that of γ, the discount factor commonly used in reinforcement learning problems. Both make infinite series convergent in an infinite-horizon setting, and both ensure that the Bellman operators are contractions in their respective frameworks (Appendices A.3 and B.3). However, when using γ, the order in which the rewards are observed can have an impact on the learned policy, which does not happen when μ is used. This can be a desirable property for some problems, as it decouples rewards from the preference for shorter paths.

2.2 A Bellman Equation for Z

As we have defined an ensemble T_s for each state s, there is a partition function Z_{β,μ}(s) defined for each state. These partition functions are all related through a Bellman-like recursion:

  Z_{β,μ}(s) = ∑_a e^{β r(s,a) − μ} Z_{β,μ}(s_a),    (4)

where, as before, s_a indicates the state deterministically following from taking action a in state s. This Bellman equation can be easily derived by decomposing each trajectory τ ∈ T_s into two parts: the first transition, resulting from taking initial action a, and the remainder of the trajectory τ′, which is a member of T_{s_a}. The total energy and length can also be decomposed in the same way, E(τ) = −r(s, a) + E(τ′) and L(τ) = 1 + L(τ′), so that:

  Z_{β,μ}(s) = ∑_a ∑_{τ′ ∈ T_{s_a}} e^{β r(s,a) − μ} e^{−β E(τ′) − μ L(τ′)} = ∑_a e^{β r(s,a) − μ} Z_{β,μ}(s_a).

Note in particular that this Bellman recursion is linear in Z.
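To make the recursion concrete, here is a minimal sketch in Python on a hypothetical acyclic deterministic MDP; the states, actions, and rewards are our own illustrative choices, not an example from the paper. Because the MDP is acyclic, the linear Bellman recursion can be evaluated by straightforward recursion from the boundary conditions:

```python
import math

beta, mu = 1.0, 2.0

# Hypothetical acyclic deterministic MDP (our own toy example):
# next_state[s][a] and reward[s][a] encode the dynamics; "T" is terminal.
next_state = {"s0": {"a": "s1", "b": "s2"}, "s1": {"a": "T"}, "s2": {"a": "T"}}
reward     = {"s0": {"a": -1.0, "b": -2.0}, "s1": {"a": 0.0}, "s2": {"a": 0.0}}
r_final = {"T": 0.0}

def Z(s):
    """Partition function via the linear Bellman recursion (Eq. 4)."""
    if s in r_final:                  # boundary condition: Z(s_f) = exp(beta * r_f)
        return math.exp(beta * r_final[s])
    return sum(math.exp(beta * reward[s][a] - mu) * Z(next_state[s][a])
               for a in next_state[s])

def policy(s):
    """Induced policy pi(a|s) = exp(beta*r(s,a) - mu) * Z(s_a) / Z(s)."""
    w = {a: math.exp(beta * reward[s][a] - mu) * Z(next_state[s][a])
         for a in next_state[s]}
    total = sum(w.values())
    return {a: p / total for a, p in w.items()}

print(Z("s0"), policy("s0"))
```

The recursion bottoms out at the boundary conditions, exactly as value iteration would on an acyclic MDP, but every operation here is a sum or product of positive numbers rather than a max.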

2.3 The Underlying Value Function and Policy

The partition function can be used to compute an average energy, which sheds light on the behavior of the system. This average is computed under the Boltzmann (Gibbs) distribution induced by the energy on the ensemble of trajectories T_s:

  p_s(τ) = e^{−β E(τ) − μ L(τ)} / Z_{β,μ}(s).

In probabilistic machine learning, this is usually how one sees the partition function: as the normalizer for an energy-based learning model or an undirected graphical model (see, e.g., Murray and Ghahramani (2004)). Under this probability distribution, high-reward trajectories are the most likely, but sub-optimal ones can still be sampled. This approach is closely related to the soft-optimality approach to RL (Levine, 2018). This distribution over trajectories allows us to compute an average energy for state s, either as an explicit expectation or as the partial derivative of the log partition function with respect to the inverse temperature:

  ⟨E⟩_s = ∑_{τ ∈ T_s} E(τ) p_s(τ) = −∂ log Z_{β,μ}(s) / ∂β.

The negative of the average energy is the value function: V(s) = −⟨E⟩_s. This is an intuitive result: recall that the energy E(τ) is low when the trajectory τ accumulates greater rewards, so lower average energy indicates that the expected cumulative reward, i.e., the value, is greater. Since the partition functions Z_{β,μ}(s) are connected by a Bellman equation, we expect that the underlying value functions V(s) should be connected in a similar way, and there is indeed a nonlinear Bellman recursion.

The derivative rule for the natural log gives us ∂ log Z/∂β = (1/Z) ∂Z/∂β, so differentiating the Bellman equation (4) yields:

  V(s) = (1 / Z_{β,μ}(s)) ∑_a e^{β r(s,a) − μ} Z_{β,μ}(s_a) [ r(s,a) + V(s_a) ].    (7)

Note that the quantities e^{β r(s,a) − μ} Z_{β,μ}(s_a) / Z_{β,μ}(s) inside the summation of Eq. 7 are positive and sum to 1 due to the Bellman recursion for Z from Eq. 4. Thus we can view this Bellman equation for V as an expectation under a distribution on actions, i.e., a policy:

  V(s) = ∑_a π_{β,μ}(a | s) [ r(s,a) + V(s_a) ],   where   π_{β,μ}(a | s) = e^{β r(s,a) − μ} Z_{β,μ}(s_a) / Z_{β,μ}(s).
The policy π_{β,μ} resembles a Boltzmann policy, but strictly speaking it is not one. A Boltzmann policy π_B selects actions proportionally to the exponential of their expected optimal cumulative reward: π_B(a | s) ∝ e^{β Q*(s,a)}. In particular, π_B does not take entropy into account: if two actions have the same expected optimal value, they will be picked with equal probability, regardless of the possibility that one of them could achieve this optimality in a larger number of ways. In the partition function view, π_{β,μ} does take entropy into account, and to clarify this difference we will look at the two extreme cases β → 0 and β → ∞. When β → 0, so that the temperature of the system is infinite, rewards become irrelevant and we find that π(a | s) ∝ Z_{0,μ}(s_a). This means that the policy picks action a proportionally to the number of trajectories that begin with a. Here the counting of trajectories happens in a weighted way: longer trajectories contribute less than shorter ones. This is different from a Boltzmann policy, which would pick actions uniformly at random.


Figure 1: Decision Tree MDP

When β → ∞, the low-temperature limit, we find in Appendix A.2 that Z_{β,μ}(s) ≈ N_μ(s) e^{β V*(s)}, where N_μ(s) is a weighted count of the number of optimal trajectories that begin at the state s. Boltzmann policies completely ignore the N_μ entropic factor.

To illustrate this difference more clearly, we consider the deterministic decision-tree MDP shown in Figure 1, where the root is the initial state and the leaves are the final states. The arrows represent the actions available at each state; there are no transition rewards, and the boundary conditions are set by the final-state rewards of the leaves. Computing the Z-functions at the intermediate states and propagating them back to the root gives the policy for picking the first action.

When β → 0, the resulting policy is biased towards the heavier subtree, while a Boltzmann policy would pick the three actions with equal probability. When β → ∞, the policy prefers actions from which many optimal trajectories are possible, while a Boltzmann policy would split its probability equally among the optimal actions.
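Since Figure 1's specific labels and values are not reproduced here, the following sketch builds a hypothetical decision tree of our own: the root's first action enters a subtree with two leaves, while the other two actions lead directly to single leaves. With no rewards, a small length penalty, and β = 0, the induced policy favors the heavier subtree, whereas a Boltzmann policy would be uniform:

```python
import math

beta, mu = 0.0, 0.1   # beta -> 0 limit: rewards are irrelevant

# Hypothetical decision tree (our own, not Figure 1's exact instance): the
# root's first action enters a two-leaf subtree; the other two actions lead
# directly to single leaves.  All rewards, including final ones, are zero.
children = {"s0": ["n1", "leaf2", "leaf3"], "n1": ["leaf0", "leaf1"]}
leaves = {"leaf0", "leaf1", "leaf2", "leaf3"}

def Z(s):
    if s in leaves:
        return 1.0                                 # Z(leaf) = exp(beta * 0) = 1
    return sum(math.exp(-mu) * Z(c) for c in children[s])  # zero rewards, beta = 0

# pi(first action) = exp(-mu) * Z(child) / Z(root)
probs = [math.exp(-mu) * Z(c) / Z("s0") for c in children["s0"]]
print(probs)  # the first entry (heavier subtree) is the largest
```

Note the role of μ here: with a large length penalty the longer paths through the subtree would be discounted away, so the entropic preference for the heavier subtree is most visible when μ is small.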

2.4 A Planning Algorithm

When the dynamics of the environment are known, it is possible to learn Z_{β,μ} by exploiting the Bellman equation (4). We write s → s′ to denote the property that there exists an action that takes an agent from state s to state s′, and we denote the reward associated with this transition by r(s, s′). Let z = (Z_{β,μ}(s))_{s ∈ S} be the vector of all partition functions and M be the matrix with entries

  M_{s,s′} = e^{β r(s,s′) − μ} if s → s′, M_{f,f} = 1 for each final state f, and 0 otherwise.

M is a matrix representation of the Bellman operator in Eq. 4. With these notations, the Bellman equations in (4) can be compactly written as z = M z, highlighting the fact that z is a fixed point of the map F: x ↦ M x. In Appendix A.3, we show that F is a contraction, which makes it possible to learn z by starting with an initial vector having compatible boundary conditions and successively iterating the map F. We could also interpret z as an eigenvector of M with eigenvalue 1; in this context, the algorithm is simply doing a power method.

Interestingly, we can also learn z by solving the underdetermined linear system (M − I) z = 0 subject to the boundary conditions. We show in Appendix A.2 that the learned policies are related to Boltzmann policies, which produce nonlinear Bellman equations at the value-function level:

  V(s) = ∑_a [ e^{β (r(s,a) + γ V(s_a))} / N(s) ] ( r(s,a) + γ V(s_a) ),   N(s) = ∑_a e^{β (r(s,a) + γ V(s_a))},

where γ is the discount factor and the normalization constant N is different from Z_{β,μ}. By working with partition functions we have transformed a nonlinear problem into a linear one. This remarkable property is reminiscent of linearly solvable MDPs (Todorov, 2007).

Once Z_{β,μ} is learned, the agent’s policy is given by π(a | s) = e^{β r(s,a) − μ} Z_{β,μ}(s_a) / Z_{β,μ}(s).
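A small sketch of this planning algorithm, on a hypothetical deterministic MDP of our own that contains a cycle (so the backward recursion used earlier would not terminate, but fixed-point iteration of z ← Mz does). Terminal entries are held at their boundary values between iterations:

```python
import numpy as np

beta, mu = 1.0, 2.0

# Hypothetical deterministic MDP with a cycle: s0 -> s1, and from s1 either
# back to s0 or out to the terminal state T.  (Our own toy example.)
states = ["s0", "s1", "T"]
idx = {s: i for i, s in enumerate(states)}
edges = [("s0", "s1", -1.0), ("s1", "s0", -1.0), ("s1", "T", 0.0)]  # (s, s', r)

# Matrix representation of the Bellman operator: M[s, s'] = exp(beta*r - mu)
M = np.zeros((3, 3))
for s, s2, r in edges:
    M[idx[s], idx[s2]] += np.exp(beta * r - mu)

z = np.ones(3)
z[idx["T"]] = np.exp(beta * 0.0)    # boundary condition: Z(T) = exp(beta * r_f(T))
for _ in range(200):                # iterate the contraction z <- M z
    z_next = M @ z
    z_next[idx["T"]] = z[idx["T"]]  # terminal entries stay at their boundary values
    z = z_next

print(z[idx["s0"]], z[idx["s1"]])
```

With non-positive rewards and μ > log A_max, each non-terminal row of M sums to less than 1, so the iteration converges geometrically, in line with the contraction argument of Appendix A.3.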

3 Partition Functions for Stochastic MDPs

We now move to the more general MDP setting, in which the dynamics of the environment can be stochastic. However, as mentioned at the end of the introduction, we still assume that given an initial state s, an action a, and a landing state s′, the reward r(s, a, s′) is deterministic.

3.1 A First Attempt: Averaging the Bellman Equation

A first approach to incorporating the stochasticity of the environment is to average the right-hand side of the Bellman equation (4) and define Z̃_{β,μ} as the solution of:

  Z̃_{β,μ}(s) = ∑_a ∑_{s′} p(s′ | s, a) e^{β r(s,a,s′) − μ} Z̃_{β,μ}(s′).    (12)

Interestingly, the solution of this equation can be constructed in the same spirit as Section 2.1 by summing a functional over the set of trajectories. If we define ℓ(τ) to be the log-likelihood of a trajectory, ℓ(τ) = ∑_t log p(s_{t+1} | s_t, a_t), then the function

  Z̃_{β,μ}(s) = ∑_{τ ∈ T_s} e^{−β E(τ) − μ L(τ) + ℓ(τ)}    (13)

satisfies the Bellman equation (12). The proof can be found in Appendix B.1. In Appendix B.2 we derive the Bellman equation satisfied by the underlying value function, and we find:


This Bellman equation does not correspond to a realistic policy: the policy depends on the landing state s′, which is a random variable, so the agent’s policy and the environment’s transitions cannot be decoupled. This is not surprising; from Eq. 13 we see that Z̃_{β,μ} puts rewards and transition probabilities on an equal footing. As a result, an agent behaves as if it could choose any available transition as long as it is willing to pay the price in log probability. This encourages risky behavior: the agent is encouraged to bet on highly unlikely but beneficial transitions. These observations were also noted in Levine (2018).

3.2 A Variational Approach

Constructing a partition function for a stochastic MDP is not straightforward because there are two sources of randomness: the first comes from the agent’s policy and the second from the stochasticity of the environment. Mixing these two sources of randomness can lead to unrealistic policies, as we saw in Section 3.1. A more principled approach is needed.

We construct a new deterministic MDP M̃ from M. We take its state space to be the space of probability distributions ρ over S, similar to belief-state representations for partially observable MDPs (Astrom, 1965; Sondik, 1978; Kaelbling et al., 1998). We make the assumption that the available actions are the same for all states, and take the action set of M̃ to be A. For a distribution ρ and an action a we define the deterministic transition ρ ↦ ρ P_a, where P_a is the transition matrix corresponding to choosing action a in the original MDP. We define the associated reward as the expected reward r̃(ρ, a) = ∑_{s,s′} ρ(s) p(s′ | s, a) r(s, a, s′).

S being finite, it has a finite number n_f of final states, which we denote f_1, …, f_{n_f}. The final states of M̃ are of the form ρ_f = ∑_i λ_i δ_{f_i}, where the λ_i ≥ 0 satisfy ∑_i λ_i = 1 and δ_{f_i} is a Dirac delta function at state f_i. The intrinsic value of such a final state is then given by r̃_f(ρ_f) = ∑_i λ_i r_f(f_i). This leads to the boundary conditions:

  Z̃_{β,μ}(ρ_f) = e^{β ∑_i λ_i r_f(f_i)}.    (15)
This new MDP M̃ is deterministic, so we can follow the same approach as in Section 2 and construct a partition function Z̃_{β,μ} on M̃; the original Z_{β,μ}(s) can be recovered by evaluating Z̃_{β,μ} at the Dirac distribution δ_s. From this construction we also get that Z̃_{β,μ} satisfies the following Bellman equation:

  Z̃_{β,μ}(ρ) = ∑_a e^{β r̃(ρ, a) − μ} Z̃_{β,μ}(ρ P_a).    (16)

Just as is the case for deterministic MDPs, the Bellman operator associated with this equation is a contraction; this is proved in Appendix B.3. However, the state space of M̃ is now infinite, which makes solving Eq. 16 intractable. We adopt a variational approach, which consists in finding the best approximation of Z̃_{β,μ} within a parametric family. We measure the fitness of a candidate through the following loss function:


For illustration purposes, and inspired by the form of the boundary conditions (15), we consider a simple parametric family given by the partition functions of the form Z_w(ρ) = e^{β ⟨w, ρ⟩}, where w ∈ R^{|S|}. The optimal w can be found using standard optimization techniques such as gradient descent. By evaluating Z_w at the final distributions ρ_f we see that we must have w(f_i) = r_f(f_i), and consequently the boundary conditions are satisfied. The optimal solution satisfies the following Bellman equation:

  e^{β w(s)} = ∑_a e^{β r̃(δ_s, a) − μ} e^{β ⟨w, δ_s P_a⟩}.

The underlying value function satisfies a Bellman equation in which actions are drawn from a policy of the same form as in Section 2.3, with Z_w in place of Z_{β,μ}. This approach leads to a realistic policy: its only dependency is on the current state, not a future one, unlike the policies arising from Eq. 14.
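As a rough illustration, the sketch below fits a family assumed to have the exponential form Z_w(ρ) = e^{β⟨w,ρ⟩}; instead of running gradient descent on the variational loss, it iterates the induced Bellman equation at the one-hot beliefs δ_s, which is a cruder but simpler fit. The three-state stochastic MDP and its expected rewards are our own illustrative choices:

```python
import numpy as np

beta, mu = 1.0, 2.0

# Hypothetical stochastic MDP (our own): states 0 and 1, absorbing terminal 2.
# P[a] is the transition matrix of action a; R[a][s] is the expected reward
# r~(delta_s, a) of taking action a from state s.
P = [np.array([[0.0, 0.8, 0.2], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),
     np.array([[0.0, 0.2, 0.8], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])]
R = [np.array([-1.0, -0.5, 0.0]),
     np.array([-2.0, -0.5, 0.0])]

def bellman_map(w):
    """Apply w_s <- (1/beta) log sum_a exp(beta*(R[a][s] + <w, P_a[s]>) - mu)
    at the one-hot beliefs; the terminal coordinate stays pinned at r_f = 0."""
    new = w.copy()
    for s in range(2):  # states 0 and 1 are non-terminal
        vals = [beta * (R[a][s] + P[a][s] @ w) - mu for a in range(2)]
        new[s] = np.logaddexp(vals[0], vals[1]) / beta
    return new

w = np.zeros(3)  # w[2] = r_f(terminal) = 0 is the boundary condition
for _ in range(50):
    w = bellman_map(w)

print(w)  # w plays a value-like role; Z_w(rho) = exp(beta * rho @ w)
```

In log space the iteration is a softmax-style value update, which makes visible why the fitted w behaves like a (soft) value function over the original states.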

4 The Model-Free Case

4.1 Construction of State-Action-Dependent Partition Functions

In a model-free setting, where the transition dynamics are unknown, state-only value functions such as V(s) are less useful than state-action value functions such as Q(s, a). Consequently, we extend our construction to state-action partition functions Z_{β,μ}(s, a). For a deterministic environment, we extend the construction in Section 2 and define

  Z_{β,μ}(s, a) = ∑_{τ ∈ T_{s,a}} e^{−β E(τ) − μ L(τ)},

where T_{s,a} denotes the set of trajectories having s_0(τ) = s and a_0(τ) = a. Since T_s = ∪_a T_{s,a}, we have Z_{β,μ}(s) = ∑_a Z_{β,μ}(s, a). As a consequence of this construction, Z_{β,μ}(s, a) satisfies the following linear Bellman equation:

  Z_{β,μ}(s, a) = e^{β r(s,a) − μ} ∑_{a′} Z_{β,μ}(s_a, a′).

This Bellman equation can be easily derived by decomposing each trajectory τ ∈ T_{s,a} into two parts: the first transition, resulting from taking the initial action a, and the remainder of the trajectory τ′, which is a member of T_{s_a, a′} for some action a′. The total energy and length can then be decomposed in the same way as in Section 2.2.
In the same spirit as Section 2.3, one can show that the underlying average-energy value function satisfies a Bellman equation in which it can be reinterpreted as the Q-function of the policy π(a | s) = Z_{β,μ}(s, a) / Z_{β,μ}(s). Similarly to the results of Section 2.3 and Appendix A.2, this policy can be thought of as a Boltzmann policy of parameter β that takes entropy into account. The construction can be extended to stochastic environments by following the same approach used in Section 3.2.
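A quick numerical check of this construction on a hypothetical deterministic MDP of our own: computing Z(s, a) through its linear Bellman equation and verifying that summing over first actions recovers the state partition function Z(s):

```python
import math

beta, mu = 1.0, 2.0

# Hypothetical deterministic MDP (our own): two non-terminal states, terminal T.
next_state = {"s0": {"a": "s1", "b": "T"}, "s1": {"a": "T"}}
reward     = {"s0": {"a": -1.0, "b": -3.0}, "s1": {"a": 0.0}}
r_final = {"T": 0.0}

def Z_sa(s, a):
    """State-action partition function via its linear Bellman equation."""
    s2 = next_state[s][a]
    factor = math.exp(beta * reward[s][a] - mu)
    if s2 in r_final:                    # boundary: no further actions after s2
        return factor * math.exp(beta * r_final[s2])
    return factor * sum(Z_sa(s2, a2) for a2 in next_state[s2])

def Z(s):
    """State partition function of Section 2, for comparison."""
    if s in r_final:
        return math.exp(beta * r_final[s])
    return sum(math.exp(beta * reward[s][a] - mu) * Z(next_state[s][a])
               for a in next_state[s])

# Consistency check: Z(s) = sum_a Z(s, a); the policy is pi(a|s) = Z(s,a)/Z(s).
total = sum(Z_sa("s0", a) for a in next_state["s0"])
print(total, Z("s0"))
```

The identity Z(s) = Σ_a Z(s, a) is what makes π(a | s) = Z(s, a)/Z(s) a proper probability distribution over actions.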

In the following, we show how learning the state-action partition function Z_{β,μ}(s, a) leads to an alternative approach to model-free reinforcement learning that does not explicitly represent value functions.

4.2 A Learning Algorithm


In Q-learning, the update rule typically consists of a linear interpolation between the current value estimate and the one arising a posteriori:

  Q(s, a) ← (1 − α) Q(s, a) + α ( r(s, a) + γ max_{a′} Q(s′, a′) ),

where α is the learning rate and γ is the discount factor. For Z-functions, we replace the linear interpolation with a geometric one. We take the update rule for Z-functions to be the following:

  Z_{β,μ}(s, a) ← Z_{β,μ}(s, a)^{1−α} ( e^{β r(s,a) − μ} ∑_{a′} Z_{β,μ}(s′, a′) )^{α}.
To understand what this update rule is doing, it is insightful to look at how the underlying Q-function is updated. Taking logarithms, the geometric update is a linear interpolation in log space:

  log Z_{β,μ}(s, a) ← (1 − α) log Z_{β,μ}(s, a) + α ( β r(s, a) − μ + log ∑_{a′} Z_{β,μ}(s′, a′) ).
We see that we recover a weighted version of the SARSA update rule, referred to as expected SARSA. Expected SARSA is known to reduce the variance of the updates by exploiting knowledge about the stochasticity of the behavior policy, and hence is considered an improvement over vanilla SARSA (Van Seijen et al., 2009).

Since the underlying update rule is equivalent to the expected SARSA update rule, we can use any exploration strategy that works for expected SARSA. One exploration strategy is ε-greedy, which consists in taking the action with the largest Z_{β,μ}(s, a) with probability 1 − ε and picking an action uniformly at random with probability ε. Another possibility is Boltzmann-like exploration, which consists in taking action a with probability proportional to Z_{β,μ}(s, a).

We would like to emphasize that even though the expected SARSA update is not novel, the policies learned through this update rule are particular to the partition-function approach. In particular, the learned policies are Boltzmann-like policies with the entropic preference properties described in Section 2.3 and Appendix A.2.
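Putting the pieces together, here is a sketch of tabular Z-learning with the geometric update and ε-greedy exploration on a hypothetical three-state chain; the states, rewards, and hyperparameters are our own choices:

```python
import math
import random

random.seed(0)
beta, mu, alpha, eps = 1.0, 0.5, 0.1, 0.2

# Hypothetical deterministic chain (our own): a two-step path to the goal G
# and a costly one-step shortcut.
next_state = {"s0": {"step": "s1", "jump": "G"}, "s1": {"step": "G"}}
reward     = {"s0": {"step": -1.0, "jump": -5.0}, "s1": {"step": -1.0}}
terminal_reward = {"G": 0.0}

Z = {(s, a): 1.0 for s in next_state for a in next_state[s]}

def z_state(s):
    """Z(s) = sum_a Z(s, a); boundary condition at terminal states."""
    if s in terminal_reward:
        return math.exp(beta * terminal_reward[s])
    return sum(Z[(s, a)] for a in next_state[s])

def act(s):
    """Epsilon-greedy exploration on Z(s, .)."""
    if random.random() < eps:
        return random.choice(list(next_state[s]))
    return max(next_state[s], key=lambda a: Z[(s, a)])

for episode in range(500):
    s = "s0"
    while s not in terminal_reward:
        a = act(s)
        s2, r = next_state[s][a], reward[s][a]
        target = math.exp(beta * r - mu) * z_state(s2)
        # geometric interpolation in place of Q-learning's linear one
        Z[(s, a)] = Z[(s, a)] ** (1 - alpha) * target ** alpha
        s = s2

policy = {a: Z[("s0", a)] / z_state("s0") for a in next_state["s0"]}
print(policy)
```

Because the geometric update is a linear interpolation of log Z, the estimates converge to the fixed points of the Z-function Bellman equation, and the derived policy π(a | s) = Z(s, a)/Z(s) concentrates on the cheaper two-step route.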

5 Conclusion

In this article we discussed how planning and reinforcement learning problems can be approached through the tools and abstractions of statistical physics. We started by constructing partition functions for each state of a deterministic MDP and then showed how to extend that definition to the more general stochastic MDP setting through a variational approach. Interestingly, these partition functions have their own Bellman equation, making it possible to solve planning and model-free RL problems without explicit reference to value functions. Nevertheless, conventional value functions can be derived from our partition function and interpreted via average energies. Computing the implied value functions can also shed some light on the policies arising from these algorithms. We found that the learned policies are closely related to Boltzmann policies, with the additional interesting feature that they take entropy into consideration by favoring states from which many trajectories are possible. Finally, we observed that working with partition functions is more natural in some settings: in a deterministic environment, for example, the Bellman equations of these Boltzmann-like policies become linear, which is not the case in a value-function-centric approach.

6 Acknowledgments

We would like to thank Alex Beatson, Weinan E, Karthik Narasimhan and Geoffrey Roeder for helpful discussions and feedback. This work was funded by a Princeton SEAS Innovation Grant and the Alfred P. Sloan Foundation.


  • Ackley et al. (1985) D Ackley, G Hinton, and T Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
  • Astrom (1965) Karl J Astrom. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.
  • Attias (2003) Hagai Attias. Planning by probabilistic inference. In International Conference on Artificial Intelligence and Statistics, 2003.
  • Bellman (1966) Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Bertsekas (1995) Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.
  • Betancourt (2017) Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
  • Carleo et al. (2019) Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences, 2019.
  • Dekel and Hazan (2013) Ofer Dekel and Elad Hazan. Better rates for any adversarial deterministic MDP. In International Conference on Machine Learning, pages 675–683, 2013.
  • Duane et al. (1987) Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • Hopfield (1982) John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
  • Kaelbling et al. (1998) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • Kalman (1960) Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  • Kappen (2011) Hilbert J Kappen. Optimal control theory and the linear Bellman equation. In D Barber, A T Cemgil, and S Chiappa, editors, Bayesian Time Series Models. Cambridge University Press, 2011.
  • Koller and Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018.
  • MacKay (2003) David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
  • Madani (2002) Omid Madani. Polynomial value iteration algorithms for deterministic MDPs. In Proceedings of the Eighteenth conference on Uncertainty in Artificial Intelligence, pages 311–318. Morgan Kaufmann Publishers Inc., 2002.
  • Mezard and Montanari (2009) Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
  • Murray and Ghahramani (2004) Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 392–399. AUAI Press, 2004.
  • Neal et al. (2011) Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • Sondik (1978) Edward J Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
  • Stoudenmire and Schwab (2016) Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.
  • Todorov (2007) Emanuel Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.
  • Van Seijen et al. (2009) Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected SARSA. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184. IEEE, 2009.
  • Wen and Van Roy (2013) Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • Yedidia et al. (2001) Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2001.
  • Zdeborová and Krzakala (2016) Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
  • Ziebart (2010) Brian D Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, University of Washington, 2010.

Appendix A Deterministic MDPs

A.1 Z_{β,μ} is well defined

Proposition 1.

Z_{β,μ} is well defined for μ > log A_max.


The MDP being finite, it has a finite number of final states; we can then find a constant C such that, for all final states f, we have Z_{β,μ}(f) ≤ C. Bounding each trajectory's contribution, and using the fact that all rewards r(s, a) are non-positive and that the number of available actions at each state is bounded by A_max, the partition function is dominated by a geometric series with ratio A_max e^{−μ}. When μ > log A_max, this sum becomes convergent and Z_{β,μ} is well defined. ∎

Remark 1.

μ > log A_max is a sufficient condition, but not a necessary one. Z_{β,μ} could be well defined for all values of μ. This happens, for instance, when T_s is finite for all s.

A.2 The underlying policy is Boltzmann-like

For high values of β, the sum defining Z_{β,μ}(s) becomes dominated by the contribution of a few of its terms. As β → ∞, the sum is dominated by the contribution of the paths with the largest cumulative reward, so that (1/β) log Z_{β,μ}(s) → V*(s).

Since the MDP is finite and deterministic, it has a finite number of transitions and rewards. Consequently, the set of achievable cumulative rewards takes discrete values; in particular, there is a finite gap Δ between the maximum value of this set and its second-largest value. Denote by T*_s the set of trajectories that achieve this maximum, and by N_μ(s) its weighted size.

N_μ(s) counts the trajectories of T*_s in a weighted way: longer trajectories contribute less than shorter ones. It is a measure of the size of T*_s that takes into account our preference for shorter trajectories. Putting everything together, we get Z_{β,μ}(s) ≈ N_μ(s) e^{β V*(s)} as β → ∞, which results in the following policy:

  π(a | s) ∝ N_μ(s_a) e^{β ( r(s,a) + V*(s_a) )}.

This differs from a traditional Boltzmann policy in the following way: if we have two actions a_1 and a_2 such that r(s, a_1) + V*(s_{a_1}) = r(s, a_2) + V*(s_{a_2}), but there are twice as many optimal trajectories spanning from s_{a_1} as there are from s_{a_2}, then action a_1 will be chosen twice as often as a_2. This is in contrast to the usual Boltzmann policy, which would pick a_1 and a_2 with equal probability. When N_μ(s) is the same for all s, we recover a Boltzmann policy. When β → ∞, the policy converges to an optimal policy and (1/β) log Z_{β,μ} converges to the optimal value function.

A.3 The Bellman operator is a contraction

Proposition 2.

Let μ > log A_max and let Ω be the set of all partition-function vectors with compatible boundary conditions. The map F, defined on Ω by applying the right-hand side of the Bellman equation (4) at non-terminal states while keeping the boundary values fixed, is a contraction for the sup-norm: ‖F(z) − F(z′)‖_∞ ≤ A_max e^{−μ} ‖z − z′‖_∞.


Here Ω is the set of all possible partition functions with compatible boundary conditions, and the matrix M is defined entrywise by M_{s,s′} = e^{β r(s,s′) − μ} when s → s′ and 0 otherwise. Because the boundary values are held fixed when s is a final state, the map F is well defined (i.e., F(Ω) ⊆ Ω). Since the MDP is finite, it has a finite number of final states, so there exists a constant C such that, for all final states f, we have