1 Introduction
One of the central challenges in the pursuit of machine intelligence is robust sequential decision making. In a stochastic and uncertain environment, an agent must capture information about the distribution over ways they may act and move through the state space. Indeed, the algorithmic process of planning and learning itself can lead to a well-defined distribution over state/action trajectories. This observation has led to a variety of connections between reinforcement learning (RL) and inference in probabilistic graphical models (PGMs) (Levine, 2018). In some ways this connection is unsurprising: belief propagation (and its relatives such as the sum-product algorithm) is understood to be an example of dynamic programming (Koller and Friedman, 2009) and dynamic programming was developed to solve control problems (Bellman, 1966; Bertsekas, 1995). Nevertheless, the exploration of the connection between control and inference has yielded fruitful insights into sequential decision making algorithms (Kalman, 1960; Attias, 2003; Ziebart, 2010; Kappen, 2011; Levine, 2018).
In this work, we present another point of view on reinforcement learning as a distribution over trajectories, one in which we draw upon useful abstractions from statistical physics. This view is in some ways a natural continuation of the agenda of connecting control to inference, as many insights in probabilistic graphical models have deep connections to, e.g., spin glass systems (Hopfield, 1982; Yedidia et al., 2001; Zdeborová and Krzakala, 2016). More generally, physics has often been a source of inspiration for ideas in machine learning (MacKay, 2003; Mezard and Montanari, 2009). Boltzmann machines (Ackley et al., 1985), Hamiltonian Monte Carlo (Duane et al., 1987; Neal et al., 2011; Betancourt, 2017) and, more recently, tensor networks (Stoudenmire and Schwab, 2016) are a few examples. In addition to direct inspiration, physics provides a compelling framework to reason about certain problems. The terms momentum, energy, entropy, and phase transition are ubiquitous in machine learning. However, abstractions from physics have so far generally not been helpful for understanding reinforcement learning models and algorithms. That is not to say there is a lack of interaction; RL is being used in some experimental physics domains, but physics has not yet informed RL as directly as it has, e.g., graphical models (Carleo et al., 2019).
Nevertheless, we should expect deep connections between reinforcement learning and physics: an RL agent is trying to find a policy that maximizes expected reward, and many natural phenomena can be viewed through a minimization principle. For example, in classical mechanics or electrodynamics, a mass or light will follow a path that minimizes a physical quantity called the action, a property known as the principle of least action. Similarly, in thermodynamics, a system with many degrees of freedom, such as a gas, will explore its configuration space in search of a configuration that minimizes its free energy. In reinforcement learning, rewards and value functions have a very similar flavor to energies, as they are extensive quantities and the agent is trying to find a path that maximizes them. In RL, however, value functions are often treated as the central object of study. This stands in contrast to statistical physics formulations of such problems, in which the partition function is the primary abstraction from which all the relevant thermodynamic quantities (average energy, entropy, heat capacity) can be derived. It is natural to ask, then: is there a theoretical framework for reinforcement learning that is centered on a partition function, in which value functions can be interpreted via average energies?
In this work, we show how to construct a partition function for a reinforcement learning problem. In a deterministic environment (Section 2), the construction is elementary and very natural. We explicitly identify the link between the underlying average energies associated with these partition functions and value functions of Boltzmann-like stochastic policies. As in the inference-based view on RL, moving from deterministic to stochastic environments introduces complications. In Section 3.2, we propose a construction for stochastic environments that results in realistic policies. Finally, in Section 4, we show how the partition function approach leads to an alternative model-free reinforcement learning algorithm that does not explicitly represent value functions.
We model the agent's sequential decision-making task as a Markov decision process (MDP), as is typical. The agent selects actions in order to maximize its cumulative expected reward until a final state is reached. The MDP is defined by the objects $(\mathcal{S}, \mathcal{A}, R, P)$. $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, respectively. $P(s, a, s')$ is the probability of landing in state $s'$ after taking action $a$ from state $s$. $R(s, a, s')$ is the reward resulting from this transition. We also make the following additional assumptions: 1) $\mathcal{S}$ is finite, 2) all rewards $R(s, a, s')$ are deterministic and bounded from above by $R_{\max}$, and 3) the number of available actions is uniformly bounded over all states by $d$. We also allow terminal states to have rewards even though there are no further actions and transitions. We denote these final-state rewards by $R(s_f)$. By shifting all rewards by $R_{\max}$ we can assume without loss of generality that $R_{\max} = 0$, making all transition rewards non-positive. The final-state rewards are still allowed to be positive, however.
2 Partition Functions for Deterministic MDPs
Our starting point is to consider deterministic Markov decision processes. Deterministic MDPs are those in which the transition probability distributions assign all their mass to one state. Deterministic MDPs are a widely studied special case (Madani, 2002; Wen and Van Roy, 2013; Dekel and Hazan, 2013) and they are realistic for many practical control problems, such as robotic manipulation and locomotion, drone maneuvering, or machine-controlled scientific experimentation. For the deterministic setting, we will use $s + a$ to denote the state that follows the taking of action $a$ in state $s$. Similarly, we will denote the reward more concisely as $r(s, a)$.
2.1 Construction of State-Dependent Partition Functions
To construct a partition function, two ingredients are needed: a statistical ensemble, and an energy function on that ensemble. We will construct our ensembles from trajectories through the MDP; a trajectory $\omega$ is a sequence of tuples $(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_{T-1}, a_{T-1}, r_{T-1})$ such that state $s_T = s_{T-1} + a_{T-1}$ is a terminal state. We use the notation $s_t(\omega)$, $a_t(\omega)$, and $r_t(\omega)$ to indicate the state, action, and reward, respectively, of trajectory $\omega$ at step $t$. Each state-dependent ensemble $\Omega(s)$ is then the set of all trajectories that start at $s$, i.e., for which $s_0(\omega) = s$. We will use these ensembles to construct a partition function for each state $s$. Taking $|\omega|$ to be the length of the trajectory, we write the energy function as
$$E(\omega) = -\sum_{t=0}^{|\omega|-1} r_t(\omega) - R\big(s_{|\omega|}(\omega)\big) = -\sum_{t=0}^{|\omega|} r_t(\omega). \tag{1}$$
The form on the right takes a notational shortcut of defining $r_{|\omega|}(\omega) = R(s_{|\omega|}(\omega))$ for the reward of the terminal state. Since the agent is trying to maximize their cumulative reward, $E(\omega)$ is a reasonable measure of the agent's preference for a trajectory in the sense that lower energy solutions accumulate higher rewards. Note in particular that the ground state configurations are the most rewarding trajectories for the agent. With the ingredients $\Omega(s)$ and $E(\omega)$ defined, we get the following partition function
$$Z(s, \beta) = \sum_{\omega \in \Omega(s)} e^{-\beta E(\omega)}. \tag{2}$$
In this expression, $\beta$ is a hyperparameter that can be interpreted as the inverse of a temperature. (This interpretation comes from statistical physics where $\beta = 1/(k_B T)$, where $k_B$ is the Boltzmann constant.) This partition function does not distinguish between two trajectories having identical cumulative rewards but different lengths. However, among equivalently rewarding trajectories, it seems natural to prefer shorter trajectories. One way to encode this preference is to add an explicit penalty $\mu$ on the length of a trajectory, leading to a partition function
$$Z(s, \beta) = \sum_{\omega \in \Omega(s)} e^{-\beta E(\omega) + \mu |\omega|}. \tag{3}$$
In statistical physics, $\mu$ is called a chemical potential and it measures the tendency of a system (such as a gas) to accept new particles. It is sometimes inconvenient to reason about systems with a fixed number of particles; adding a chemical potential offers a way to relax that constraint, allowing a system to have a varying number of particles while keeping the average fixed.
Note that since MDPs can allow for both infinitely long trajectories and infinite sets of finite trajectories, $Z(s, \beta)$ can be infinite even in relatively simple settings. In Appendix A.1, we find that a sufficient condition for $Z(s, \beta)$ to be well defined is taking $\mu < -\log d$. As written, the partition function in Eq. 3 is ambiguous for final states. For clarity we define $Z(s_f, \beta) = e^{\beta R(s_f)}$ for a terminal state $s_f$. We will refer to these as the boundary conditions.
Mathematically, the parameter $\mu$ has a similar role to the one played by $\gamma$, the discount rate commonly used in reinforcement learning problems. They both make infinite series convergent in an infinite horizon setting, and ensure that the Bellman operators are contractions in their respective frameworks (Appendices A.3 and B.3). However, when using $\gamma$, the order in which the rewards are observed can have an impact on the learned policy, which does not happen when $\mu$ is used. This could be a desirable property for some problems as it uncouples rewards from preferences for shorter paths.
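2.2 A Bellman Equation for $Z$
As we have defined an ensemble for each state $s$, there is a partition function $Z(s, \beta)$ defined for each state. These partition functions are all related through a Bellman-like recursion:
$$Z(s, \beta) = \sum_{a} e^{\beta r(s, a) + \mu} \, Z(s + a, \beta), \tag{4}$$
where, as before, $s + a$ indicates the state deterministically following from taking action $a$ in state $s$. This Bellman equation can be easily derived by decomposing each trajectory $\omega \in \Omega(s)$ into two parts: the first transition resulting from taking initial action $a$ and the remainder of the trajectory $\omega'$, which is a member of $\Omega(s + a)$. The total energy and length can also be decomposed in the same way, so that:
$$Z(s, \beta) = \sum_{\omega \in \Omega(s)} e^{-\beta E(\omega) + \mu |\omega|} = \sum_{a} \sum_{\omega' \in \Omega(s + a)} e^{\beta r(s, a) + \mu} \, e^{-\beta E(\omega') + \mu |\omega'|} = \sum_{a} e^{\beta r(s, a) + \mu} \, Z(s + a, \beta).$$
Note in particular that this Bellman recursion is linear in $Z$.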
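The recursion can be checked numerically on a small example. The following is a minimal sketch, assuming the partition function has the form $\sum_{\omega} e^{-\beta E(\omega) + \mu|\omega|}$ with boundary condition $e^{\beta R(s_f)}$ at terminal states; the toy MDP and the names `transitions`, `final_reward`, `z_enumerate`, and `z_bellman` are hypothetical and do not come from the paper.

```python
import math

# Hypothetical toy deterministic MDP (a small DAG), chosen only for illustration.
# transitions[s] maps action -> (next_state, reward); terminal states have no entry.
transitions = {
    "s0": {"a": ("s1", -1.0), "b": ("s2", -2.0)},
    "s1": {"a": ("t1", -1.0), "b": ("t2", 0.0)},
    "s2": {"a": ("t2", 0.0)},
}
final_reward = {"t1": 2.0, "t2": 0.5}  # R(s_f) for the terminal states


def trajectories(s):
    """Yield (cumulative_reward, length) for every trajectory starting at s."""
    if s in final_reward:
        yield final_reward[s], 0
        return
    for next_state, reward in transitions[s].values():
        for tail_reward, tail_length in trajectories(next_state):
            yield reward + tail_reward, tail_length + 1


def z_enumerate(s, beta, mu):
    """Partition function as an explicit sum over the ensemble of trajectories."""
    # The energy is minus the cumulative reward, so exp(-beta * E) = exp(beta * reward).
    return sum(math.exp(beta * g + mu * n) for g, n in trajectories(s))


def z_bellman(s, beta, mu):
    """The same quantity computed through the linear Bellman recursion."""
    if s in final_reward:
        return math.exp(beta * final_reward[s])  # boundary condition
    return sum(
        math.exp(beta * r + mu) * z_bellman(s_next, beta, mu)
        for s_next, r in transitions[s].values()
    )
```

Both functions should agree up to floating-point error for any state, which is a quick way to see that the recursion is just a regrouping of the sum over trajectories by their first action.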
2.3 The Underlying Value Function and Policy
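The partition function can be used to compute an average energy to shed light on the behavior of the system. This average is computed under the Boltzmann (Gibbs) distribution induced by the energy on the ensemble of trajectories $\Omega(s)$:
$$P(\omega \mid s) = \frac{e^{-\beta E(\omega) + \mu |\omega|}}{Z(s, \beta)}, \qquad \omega \in \Omega(s). \tag{5}$$
In probabilistic machine learning, this is usually how one sees the partition function: as the normalizer for an energy-based learning model or an undirected graphical model (see, e.g., Murray and Ghahramani (2004)). Under this probability distribution, high-reward trajectories are the most likely but suboptimal ones could still be sampled. This approach is closely related to the soft-optimality approach to RL (Levine, 2018). This distribution over trajectories allows us to compute an average energy for state $s$ either as an explicit expectation or as the partial derivative of the log partition function with respect to the inverse temperature:
$$\langle E \rangle (s, \beta) = \sum_{\omega \in \Omega(s)} E(\omega) \, P(\omega \mid s) = -\frac{\partial}{\partial \beta} \log Z(s, \beta). \tag{6}$$
The negative of the average energy is the value function: $V(s, \beta) = -\langle E \rangle (s, \beta) = \partial_\beta \log Z(s, \beta)$. This is an intuitive result: recall that the energy is low when the trajectory accumulates greater rewards, so lower average energy indicates that the expected cumulative reward (the value) is greater. Since the partition functions are connected by a Bellman equation, we expect that the underlying value functions would be connected in a similar way, and there is indeed a nonlinear Bellman recursion. Differentiating Eq. 4 with respect to $\beta$ and using $\partial_\beta Z(s + a, \beta) = Z(s + a, \beta) \, V(s + a, \beta)$ gives
$$\partial_\beta Z(s, \beta) = \sum_{a} e^{\beta r(s, a) + \mu} \, Z(s + a, \beta) \, \big[ r(s, a) + V(s + a, \beta) \big].$$
The derivative rule for natural log gives us $V(s, \beta) = \partial_\beta \log Z(s, \beta) = \partial_\beta Z(s, \beta) / Z(s, \beta)$, so:
$$V(s, \beta) = \sum_{a} \frac{e^{\beta r(s, a) + \mu} \, Z(s + a, \beta)}{Z(s, \beta)} \, \big[ r(s, a) + V(s + a, \beta) \big]. \tag{7}$$
Note that the weights $e^{\beta r(s, a) + \mu} Z(s + a, \beta) / Z(s, \beta)$ inside the summation of Eq. 7 are positive and sum to $1$ due to the Bellman recursion for $Z$ from Eq. 4. Thus we can view this Bellman equation for $V$ as an expectation under a distribution on actions, i.e., a policy:
$$V(s, \beta) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ r(s, a) + V(s + a, \beta) \big], \qquad \pi(a \mid s) = \frac{e^{\beta r(s, a) + \mu} \, Z(s + a, \beta)}{Z(s, \beta)}. \tag{8}$$
The policy $\pi$ resembles a Boltzmann policy but strictly speaking it is not. A Boltzmann policy $\pi_B$ selects actions proportionally to the exponential of their expected cumulative reward: $\pi_B(a \mid s) \propto e^{\beta (r(s, a) + V^*(s + a))}$, where $V^*$ denotes the optimal value function. In particular, $\pi_B$ does not take entropy into account: if two actions have the same expected optimal value, they will be picked with equal probability regardless of the possibility that one of them could achieve this optimality in a larger number of ways. In the partition function view, $\pi$ does take entropy into account, and to clarify this difference we will look at the two extreme cases $\beta \to 0$ and $\beta \to \infty$. When $\beta \to 0$, where the temperature of the system is infinite, rewards become irrelevant and we find that $\pi(a \mid s) \propto Z(s + a, 0) = \sum_{\omega \in \Omega(s + a)} e^{\mu |\omega|}$. This means that $\pi$ is picking action $a$ proportionally to the number of trajectories that begin with $a$. Here the counting of trajectories happens in a weighted way: longer trajectories contribute less than shorter ones. This is different from a Boltzmann policy, which would pick actions uniformly at random.
When $\beta \to \infty$, the low-temperature limit, we find in Appendix A.2 that $\pi(a \mid s) \propto N(s + a) \, e^{\beta (r(s, a) + V^*(s + a))}$, where $N(s)$ is a weighted count of the number of optimal trajectories that begin at the state $s$. Boltzmann policies completely ignore the entropic factor $N(s + a)$.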
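The relationship between the partition function, the induced policy, and the value function can be illustrated numerically. The sketch below is not the authors' code; it assumes the form $Z(s,\beta) = \sum_a e^{\beta r(s,a) + \mu} Z(s+a,\beta)$ with $Z(s_f,\beta) = e^{\beta R(s_f)}$, and uses a hypothetical toy MDP. It approximates the value as a finite-difference derivative of $\log Z$ and compares it with the expectation under the induced policy.

```python
import math

# Hypothetical toy deterministic MDP: action "a" at s0 leads to a subtree with
# two trajectories, action "b" to a subtree with a single trajectory.
transitions = {
    "s0": {"a": ("s1", -1.0), "b": ("s2", -2.0)},
    "s1": {"a": ("t1", -1.0), "b": ("t2", 0.0)},
    "s2": {"a": ("t2", 0.0)},
}
final_reward = {"t1": 2.0, "t2": 0.5}


def z(s, beta, mu):
    """Partition function via the linear Bellman recursion."""
    if s in final_reward:
        return math.exp(beta * final_reward[s])
    return sum(math.exp(beta * r + mu) * z(s2, beta, mu)
               for s2, r in transitions[s].values())


def policy(s, beta, mu):
    """Policy induced by Z: pi(a|s) = exp(beta*r + mu) * Z(s+a) / Z(s)."""
    zs = z(s, beta, mu)
    return {a: math.exp(beta * r + mu) * z(s2, beta, mu) / zs
            for a, (s2, r) in transitions[s].items()}


def value_fd(s, beta, mu, eps=1e-6):
    """Value as d/dbeta log Z, approximated by a central finite difference."""
    return (math.log(z(s, beta + eps, mu)) - math.log(z(s, beta - eps, mu))) / (2 * eps)


def value_bellman(s, beta, mu):
    """Value as the expectation of r + V(s+a) under the induced policy."""
    if s in final_reward:
        return final_reward[s]
    pi = policy(s, beta, mu)
    return sum(pi[a] * (r + value_bellman(s2, beta, mu))
               for a, (s2, r) in transitions[s].items())
```

At $\beta = 0$ the rewards drop out and `policy("s0", 0.0, mu)` assigns probability $2/3$ to action `"a"`, because twice as many (equal-length) trajectories start with it; a Boltzmann policy would assign $1/2$ to each action in this limit.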
To illustrate this difference more clearly, we consider the deterministic decision tree MDP shown in Figure 1 where is the initial state and the leafs , , , and are the final states. The arrows represent the actions available at each state. There are no rewards and the boundary conditions are: and . This gives us the boundary condition: and . Computing the functions at the intermediate states and we find: , and . Finally we have . The underlying policy for picking the first action is given by:
(9) 
When , we get: .
A Boltzmann policy would pick these three actions with equal probability. The policy is biased towards the heavier subtree.
When we get: . A Boltzmann policy would pick action and with a probability of . prefers states from which many possible optimal trajectories are possible.
2.4 A Planning Algorithm
When the dynamics of the environment are known, it is possible to learn $Z$ by exploiting the Bellman equation (4). We denote by $s \to s'$ the property that there exists an action that takes an agent from state $s$ to state $s'$. The reward associated with this transition will be denoted $r_{s \to s'}$. Let $Z(\beta) = [Z(s, \beta)]_{s \in \mathcal{S}}$ be the vector of all partition functions and $C(\beta)$ be the matrix:
$$C(\beta)_{s, s'} = \mathbb{1}_{s \to s'} \, e^{\beta r_{s \to s'} + \mu} \ \text{ for non-terminal } s, \qquad C(\beta)_{s_f, s'} = \delta_{s_f, s'} \ \text{ for terminal } s_f. \tag{10}$$
$C(\beta)$ is a matrix representation of the Bellman operator in Eq. 4. With these notations, the Bellman equations in (4) can be compactly written as $Z(\beta) = C(\beta) Z(\beta)$, highlighting the fact that $Z(\beta)$ is a fixed point of the map $T: Z \mapsto C(\beta) Z$. In Appendix A.3, we show that $T$ is a contraction, which makes it possible to learn $Z(\beta)$ by starting with an initial vector $Z_0$ having compatible boundary conditions and successively iterating the map $T$: $Z_{n+1} = C(\beta) Z_n$. We could also interpret $Z(\beta)$ as an eigenvector of $C(\beta)$. In this context, this algorithm is simply doing a power method.
Interestingly, we can also learn $Z(\beta)$ by solving the underdetermined linear system $(I - C(\beta)) Z = 0$ with the right boundary conditions. We show in Appendix A.2 that the policies learned are related to Boltzmann policies, which produce nonlinear Bellman equations at the value function level:
(11) 
where is the discount factor and is a normalization constant different from .
By working with partition functions we transformed a nonlinear problem into a linear one. This remarkable result is reminiscent of linearly solvable MDPs (Todorov, 2007).
Once $Z$ is learned, the agent's policy is given by: $\pi(a \mid s) \propto e^{\beta r(s, a)} \, Z(s + a, \beta)$.
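The fixed-point iteration described in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Bellman matrix has entries $e^{\beta r + \mu}$ for existing transitions, terminal states get identity rows so that their boundary values stay fixed, and the toy MDP (which contains a cycle, so the ensemble of trajectories is infinite) and the names `build_matrix` and `plan` are hypothetical.

```python
import math

# Hypothetical toy deterministic MDP with a cycle (s0 -> s0 and s1 -> s0).
states = ["s0", "s1", "t"]
transitions = {
    "s0": {"stay": ("s0", -1.0), "go": ("s1", 0.0)},
    "s1": {"end": ("t", -0.5), "back": ("s0", -1.0)},
}
final_reward = {"t": 1.0}


def build_matrix(states, transitions, beta, mu):
    """Matrix form of the Bellman operator, stored as a dict of rows."""
    c = {s: {s2: 0.0 for s2 in states} for s in states}
    for s in states:
        if s not in transitions:
            c[s][s] = 1.0  # terminal state: identity row keeps the boundary value
            continue
        for s_next, r in transitions[s].values():
            c[s][s_next] += math.exp(beta * r + mu)
    return c


def plan(states, transitions, final_reward, beta, mu, n_iter=200):
    """Iterate Z <- C Z from a vector with the right boundary conditions."""
    c = build_matrix(states, transitions, beta, mu)
    z = {s: math.exp(beta * final_reward[s]) if s in final_reward else 1.0
         for s in states}
    for _ in range(n_iter):
        z = {s: sum(c[s][s2] * z[s2] for s2 in states) for s in states}
    return z


z = plan(states, transitions, final_reward, beta=1.0, mu=-1.0)
# Policy at s0, proportional to exp(beta * r) * Z(s + a).
weights = {a: math.exp(1.0 * r) * z[s2] for a, (s2, r) in transitions["s0"].items()}
total = sum(weights.values())
pi_s0 = {a: w / total for a, w in weights.items()}
```

Here $\mu = -1$ satisfies $\mu < -\log 2$ for the two available actions, so the iteration converges even though arbitrarily long trajectories exist. The fixed number of iterations is a simplification; a tolerance-based stopping rule would be a natural alternative.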
3 Partition Functions for Stochastic MDPs
We now move to the more general MDP setting. The dynamics of the environment can now be stochastic. However, as mentioned at the end of the introduction, we still assume that given an initial state $s$, an action $a$, and a landing state $s'$, the reward $R(s, a, s')$ is deterministic.
3.1 A First Attempt: Averaging the Bellman Equation
A first approach to incorporating the stochasticity of the environment is to average the right-hand side of the Bellman equation (4) and define $Z(s, \beta)$ as the solution of:
$$Z(s, \beta) = \sum_{a} \sum_{s'} P(s, a, s') \, e^{\beta R(s, a, s') + \mu} \, Z(s', \beta). \tag{12}$$
Interestingly, the solution of this equation can be constructed in the same spirit as Section 2.1 by summing a functional over the set of trajectories. If we define $L(\omega)$ to be the log likelihood of a trajectory, $L(\omega) = \sum_{t=0}^{|\omega|-1} \log P\big(s_t(\omega), a_t(\omega), s_{t+1}(\omega)\big)$, then $Z$ defined by
$$Z(s, \beta) = \sum_{\omega \in \Omega(s)} e^{-\beta E(\omega) + \mu |\omega| + L(\omega)} \tag{13}$$
satisfies the Bellman equation (12). The proof can be found in Appendix B.1. In Appendix B.2 we derive the Bellman equation satisfied by the underlying value function $V(s, \beta) = \partial_\beta \log Z(s, \beta)$ and we find:
$$V(s, \beta) = \sum_{a} \sum_{s'} \frac{P(s, a, s') \, e^{\beta R(s, a, s') + \mu} \, Z(s', \beta)}{Z(s, \beta)} \, \big[ R(s, a, s') + V(s', \beta) \big]. \tag{14}$$
This Bellman equation does not correspond to a realistic policy; the policy depends on the landing state $s'$, which is a random variable. The agent's policy and the environment's transitions cannot be decoupled. This is not surprising: from Eq. 13 we see that $Z$ puts rewards and transition probabilities on an equal footing. As a result, an agent believes they can choose any available transition as long as they are willing to pay the price in log probability. This encourages risky behavior: the agent is encouraged to bet on highly unlikely but beneficial transitions. These observations were also noted in Levine (2018).
3.2 A Variational Approach
Constructing a partition function for a stochastic MDP is not straightforward because there are two types of randomness: the first comes from the agent’s policy and the second from stochasticity of the environment. Mixing these two sources of randomness can lead to unrealistic policies as we saw in Section 3.1. A more principled approach is needed.
We construct a new deterministic MDP from . We take to be the space of probability distributions over , similar to belief state representations for partiallyobservable MDPs (Astrom, 1965; Sondik, 1978; Kaelbling et al., 1998). We make the assumption that the actions are the same for all states and take . For and we define where is the transition matrix corresponding to choosing action in the original MDP. We define .
being finite, it has a finite number of final states which we denote . The final states of are of the form where verify and is a Dirac delta function at state . The intrinsic value of such a final state is then given by . This leads to the boundary conditions:
(15) 
This new MDP is deterministic, and we can follow the same approach of Section 2 and construct a partition function on . can be recovered by evaluating . From this construction we also get that satisfies the following Bellman equation:
(16) 
Just as it is the case for deterministic MDPs, the Bellman operator associated with this equation is a contraction. This is proved in Appendix B.3. However is now infinite which makes solving Eq. 16 intractable. We adopt a variational approach which consists in finding the best approximation of within a parametric family
. We measure the fitness of a candidate through the following loss function:
.For illustration purposes, and inspired by the form of the boundary conditions (15), we consider a simple parametric family given by the partition functions of the form , where . The optimal can be found using usual optimization techniques such as gradient descent. By evaluation of at we see that we must have and consequently we have . The optimal solution satisfies the following Bellman equation:
(17) 
The underlying value function verifies where the policy is given by . This approach leads to a realistic policy as its only dependency is on the current state, not a future one, unlike the policies arising from Eq. 14.
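4 The Model-Free Case
4.1 Construction of State-Action-Dependent Partition Functions
In a model-free setting, where the transition dynamics are unknown, state-only value functions such as $V(s)$ are less useful than state-action value functions such as $Q(s, a)$. Consequently, we will extend our construction to state-action partition functions $Z(s, a, \beta)$. For a deterministic environment, we extend the construction in Section 2 and define $Z(s, a, \beta)$ by
$$Z(s, a, \beta) = \sum_{\omega \in \Omega(s, a)} e^{-\beta E(\omega) + \mu |\omega|}, \tag{18}$$
where $\Omega(s, a)$ denotes the set of trajectories having $s_0(\omega) = s$ and $a_0(\omega) = a$. Since $\Omega(s) = \bigcup_{a} \Omega(s, a)$, we have $Z(s, \beta) = \sum_{a} Z(s, a, \beta)$. As a consequence of this construction, $Z(s, a, \beta)$ satisfies the following linear Bellman equation:
$$Z(s, a, \beta) = e^{\beta r(s, a) + \mu} \sum_{a'} Z(s + a, a', \beta). \tag{19}$$
This Bellman equation can be easily derived by decomposing each trajectory into two parts: the first transition resulting from taking initial action $a$ and the remainder of the trajectory, which is a member of $\Omega(s + a, a')$ for some action $a'$. The total energy and length can also be decomposed in the same way, so that:
$$Z(s, a, \beta) = \sum_{a'} \sum_{\omega' \in \Omega(s + a, a')} e^{\beta r(s, a) + \mu} \, e^{-\beta E(\omega') + \mu |\omega'|} = e^{\beta r(s, a) + \mu} \sum_{a'} Z(s + a, a', \beta),$$
where the inner sum is understood as $Z(s_f, \beta) = e^{\beta R(s_f)}$ when $s + a = s_f$ is a terminal state.
In the same spirit as Section 2.3, one can show that the average underlying value function $Q(s, a, \beta) = \partial_\beta \log Z(s, a, \beta)$ satisfies a Bellman equation:
$$Q(s, a, \beta) = r(s, a) + \sum_{a'} \frac{Z(s + a, a', \beta)}{\sum_{a''} Z(s + a, a'', \beta)} \, Q(s + a, a', \beta). \tag{20}$$
$Q$ can then be reinterpreted as the $Q$-function of the policy $\pi(a \mid s) = Z(s, a, \beta) / \sum_{a'} Z(s, a', \beta)$. Similarly to the results of Section 2.3 and Appendix A.2, the policy $\pi$ can be thought of as a Boltzmann policy of parameter $\beta$ that takes entropy into account.
This construction can be extended to stochastic environments by following the same approach used in Section 3.2.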
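As a numerical illustration of the state-action construction, the sketch below computes $Z(s, a, \beta)$ on a hypothetical toy deterministic MDP and recovers the underlying $Q$ values as finite-difference derivatives of $\log Z(s, a, \beta)$. It assumes the recursion $Z(s,a,\beta) = e^{\beta r(s,a) + \mu} \sum_{a'} Z(s+a, a', \beta)$ with the terminal convention $Z(s_f, \beta) = e^{\beta R(s_f)}$; all function names are illustrative.

```python
import math

# Hypothetical toy deterministic MDP, for illustration only.
transitions = {
    "s0": {"a": ("s1", -1.0), "b": ("s2", -2.0)},
    "s1": {"a": ("t1", -1.0), "b": ("t2", 0.0)},
    "s2": {"a": ("t2", 0.0)},
}
final_reward = {"t1": 2.0, "t2": 0.5}


def z_state(s, beta, mu):
    """Z(s) = sum_a Z(s, a), or the boundary value at a terminal state."""
    if s in final_reward:
        return math.exp(beta * final_reward[s])
    return sum(z_state_action(s, a, beta, mu) for a in transitions[s])


def z_state_action(s, a, beta, mu):
    """Z(s, a) = exp(beta * r(s, a) + mu) * sum_a' Z(s + a, a')."""
    s_next, r = transitions[s][a]
    return math.exp(beta * r + mu) * z_state(s_next, beta, mu)


def policy(s, beta, mu):
    """pi(a|s) = Z(s, a) / sum_a' Z(s, a')."""
    zs = z_state(s, beta, mu)
    return {a: z_state_action(s, a, beta, mu) / zs for a in transitions[s]}


def q_fd(s, a, beta, mu, eps=1e-6):
    """Q(s, a) = d/dbeta log Z(s, a), via a central finite difference."""
    hi = math.log(z_state_action(s, a, beta + eps, mu))
    lo = math.log(z_state_action(s, a, beta - eps, mu))
    return (hi - lo) / (2 * eps)
```

With these definitions, `q_fd("s0", "a", beta, mu)` should match `r + sum(pi[a2] * q_fd("s1", a2, beta, mu))` with `pi = policy("s1", beta, mu)`, which is the expected-value form of the Bellman equation for $Q$.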
In the following we show how learning the state-action partition function leads to an alternative approach to model-free reinforcement learning that does not explicitly represent value functions.
4.2 A Learning Algorithm
In $Q$-Learning, the update rule typically consists of a linear interpolation between the current value estimate and the one arising a posteriori:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) \big], \tag{21}$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor. For $Z$-functions we will replace the linear interpolation with a geometric one. We take the update rule for $Z$-functions to be the following:
$$Z(s_t, a_t, \beta) \leftarrow Z(s_t, a_t, \beta)^{1 - \alpha} \Big[ e^{\beta r_t + \mu} \sum_{a} Z(s_{t+1}, a, \beta) \Big]^{\alpha}. \tag{22}$$
To understand what this update rule is doing, it is insightful to look at how the underlying $Q$-function, $Q(s, a, \beta) = \partial_\beta \log Z(s, a, \beta)$, is updated. Taking the logarithm of Eq. 22 and differentiating with respect to $\beta$, we find:
$$Q(s_t, a_t, \beta) \leftarrow (1 - \alpha) \, Q(s_t, a_t, \beta) + \alpha \Big[ r_t + \sum_{a} \pi(a \mid s_{t+1}) \, Q(s_{t+1}, a, \beta) \Big], \qquad \pi(a \mid s) = \frac{Z(s, a, \beta)}{\sum_{a'} Z(s, a', \beta)}. \tag{23}$$
We see that we recover a weighted version of the SARSA update rule. This update rule is referred to as expected SARSA. Expected SARSA is known to reduce the variance in the updates by exploiting knowledge about stochasticity in the behavior policy and hence is considered an improvement over vanilla SARSA (Van Seijen et al., 2009).
Since the underlying update rule is equivalent to the expected SARSA update rule, we can use any exploration strategy that works for expected SARSA. One exploration strategy could be $\epsilon$-greedy, which consists in taking action $a = \arg\max_{a'} Z(s, a', \beta)$ with probability $1 - \epsilon$ and picking an action uniformly at random with probability $\epsilon$. Another possibility would be a Boltzmann-like exploration, which consists in taking action $a$ with probability $\pi(a \mid s) \propto Z(s, a, \beta)$.
We would like to emphasize that even though the expected SARSA update is not novel, the policies learned through this update rule are specific to the partition-function approach. In particular, the learned policies are Boltzmann-like policies with some entropic preference properties as described in Section 2.3 and Appendix A.2.
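The geometric update can be sketched in log space, where it becomes a linear interpolation of $\log Z(s, a, \beta)$ towards $\beta r + \mu + \log \sum_{a'} Z(s', a', \beta)$. The following is a minimal sketch, not the authors' implementation: the toy deterministic MDP is hypothetical, and for simplicity the code sweeps over all state-action pairs instead of sampling transitions with an exploration policy.

```python
import math

# Hypothetical toy deterministic MDP, for illustration only.
transitions = {
    "s0": {"a": ("s1", -1.0), "b": ("s2", -2.0)},
    "s1": {"a": ("t1", -1.0), "b": ("t2", 0.0)},
    "s2": {"a": ("t2", 0.0)},
}
final_reward = {"t1": 2.0, "t2": 0.5}


def log_sum_z(log_z, s, beta):
    """log of sum_a Z(s, a); at a terminal state this is beta * R(s)."""
    if s in final_reward:
        return beta * final_reward[s]
    vals = [log_z[(s, a)] for a in transitions[s]]
    m = max(vals)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals))


def learn(beta, mu, alpha=0.5, sweeps=200):
    """Tabular geometric update, written as interpolation of log Z(s, a)."""
    log_z = {(s, a): 0.0 for s in transitions for a in transitions[s]}
    for _ in range(sweeps):
        for s in transitions:
            for a, (s_next, r) in transitions[s].items():
                target = beta * r + mu + log_sum_z(log_z, s_next, beta)
                log_z[(s, a)] = (1 - alpha) * log_z[(s, a)] + alpha * target
    return log_z


def exact_log_z(s, a, beta, mu):
    """Reference value of log Z(s, a) from the linear Bellman recursion."""
    s_next, r = transitions[s][a]
    if s_next in final_reward:
        tail = beta * final_reward[s_next]
    else:
        tail = math.log(sum(math.exp(exact_log_z(s_next, a2, beta, mu))
                            for a2 in transitions[s_next]))
    return beta * r + mu + tail
```

Working with $\log Z$ avoids overflow for large $\beta$ and makes the link to the interpolation form of the $Q$ update visible: the learned table should approach `exact_log_z` for every state-action pair, and a Boltzmann-like exploration policy can be read off as a softmax over `log_z[(s, a)]`.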
5 Conclusion
In this article we discussed how planning and reinforcement learning problems can be approached through the tools and abstractions of statistical physics. We started by constructing partition functions for each state of a deterministic MDP and then showed how to extend that definition to the more general stochastic MDP setting through a variational approach. Interestingly, these partition functions have their own Bellman equation, making it possible to solve planning and model-free RL problems without explicit reference to value functions. Nevertheless, conventional value functions can be derived from our partition function and interpreted via average energies. Computing the implied value functions can also shed some light on the policies arising from these algorithms. We found that the learned policies are closely related to Boltzmann policies with the additional interesting feature that they take entropy into consideration by favoring states from which many trajectories are possible. Finally, we observed that working with partition functions is more natural in some settings. In a deterministic environment, for example, near-optimal Bellman equations become linear, which is not the case in a value-function-centric approach.
6 Acknowledgments
We would like to thank Alex Beatson, Weinan E, Karthik Narasimhan and Geoffrey Roeder for helpful discussions and feedback. This work was funded by a Princeton SEAS Innovation Grant and the Alfred P. Sloan Foundation.
References
 Ackley et al. (1985) D Ackley, G Hinton, and T Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
 Astrom (1965) Karl J Astrom. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.

 Attias (2003) Hagai Attias. Planning by probabilistic inference. In International Conference on Artificial Intelligence and Statistics, 2003.
 Bellman (1966) Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
 Bertsekas (1995) Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.
 Betancourt (2017) Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
 Carleo et al. (2019) Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences, 2019.
 Dekel and Hazan (2013) Ofer Dekel and Elad Hazan. Better rates for any adversarial deterministic MDP. In International Conference on Machine Learning, pages 675–683, 2013.
 Duane et al. (1987) Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
 Hopfield (1982) John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
 Kaelbling et al. (1998) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
 Kalman (1960) Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
 Kappen (2011) Hilbert J Kappen. Optimal control theory and the linear Bellman equation. In D Barber, A T Cemgil, and S Chiappa, editors, Bayesian Time Series Models. Cambridge University Press, 2011.
 Koller and Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018.
 MacKay (2003) David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
 Madani (2002) Omid Madani. Polynomial value iteration algorithms for deterministic MDPs. In Proceedings of the Eighteenth conference on Uncertainty in Artificial Intelligence, pages 311–318. Morgan Kaufmann Publishers Inc., 2002.
 Mezard and Montanari (2009) Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
 Murray and Ghahramani (2004) Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 392–399. AUAI Press, 2004.

 Neal et al. (2011) Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
 Sondik (1978) Edward J Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
 Stoudenmire and Schwab (2016) Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.
 Todorov (2007) Emanuel Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.
 Van Seijen et al. (2009) Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected SARSA. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184. IEEE, 2009.
 Wen and Van Roy (2013) Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
 Yedidia et al. (2001) Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2001.
 Zdeborová and Krzakala (2016) Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
 Ziebart (2010) Brian D Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, University of Washington, 2010.
Appendix A Deterministic MDPs
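A.1 $Z(s, \beta)$ is well defined
Proposition 1.
$Z(s, \beta)$ is well defined for $\mu < -\log d$.
Proof.
The MDP being finite, it has a finite number of final states; we can then find a constant $K$ such that, for all final states $s_f$, we have $R(s_f) \le K$. For $\beta \ge 0$ this gives
$$Z(s, \beta) = \sum_{\omega \in \Omega(s)} e^{-\beta E(\omega) + \mu |\omega|} \le e^{\beta K} \sum_{\omega \in \Omega(s)} e^{\mu |\omega|} \le e^{\beta K} \sum_{n=0}^{\infty} d^{\,n} e^{\mu n},$$
where we used the fact that all transition rewards are non-positive and that the number of available actions at each state is bounded by $d$, so that there are at most $d^{\,n}$ trajectories of length $n$. When $\mu < -\log d$, the sum becomes a convergent geometric series and $Z(s, \beta)$ is well defined. ∎
Remark 1.
$\mu < -\log d$ is a sufficient condition, but not a necessary one. $Z(s, \beta)$ could be well defined for all values of $\mu$. This happens for instance when $\Omega(s)$ is finite for all $s$.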
A.2 The underlying policy is Boltzmann-like
For high values of , the sum will become dominated by the contribution of few of its terms. As , the sum will be dominated by the contribution of the paths with the biggest reward. We have
We see that .
Since the MDP is finite and deterministic, it has a finite number of transitions and rewards. Consequently, the set takes discrete values, in particular, there is a finite gap between the maximum value and the second biggest value of this set. Let’s denote by the set of trajectories that achieve this maximum and by .
counts the number of trajectories in a weighted way: longer trajectories contribute less than shorter ones. It is a measure of the size of that takes into account our preference for shorter trajectories. Putting everything together we get:
This shows that , which results in the following policy for :
The policy $\pi$ differs from a traditional Boltzmann policy in the following way: if two actions $a_1$ and $a_2$ lead to the same optimal cumulative reward, but there are twice as many optimal trajectories starting from $s + a_1$ as from $s + a_2$, then action $a_1$ will be chosen twice as often as $a_2$. This is in contrast with the usual Boltzmann policy, which will pick $a_1$ and $a_2$ with equal probability. When the weighted count of optimal trajectories is the same for all actions, we recover a Boltzmann policy. When $\beta \to \infty$, the policy converges to an optimal policy and $V$ converges to the optimal value function.
A.3 The Bellman operator is a contraction
Proposition 2.
Let and let . The map defined by
is a contraction for the supnorm: .
Proof.
is the set of all possible partition functions with compatible boundary conditions. The matrix is more explicitly defined by:
Because when is a final state, the map is well defined (i.e. ). Since the MDP is finite, it has a finite number of final state so there exists a constant such that, for all final states we have