Reinforcement learning (RL) is a promising field of current AI research that has seen tremendous accomplishments in recent years, including mastery of board games such as Go and chess [1, 2]. An important advance in RL is the development of the maximum entropy reinforcement learning (MaxEnt RL) approach, which has many appealing features such as improved exploration and robustness [3, 4, 5]. The approach is based on the addition of an entropy-based regularization term that generalizes the optimal control problem in RL and allows it to be recast as a problem in Bayesian inference. The process involves the introduction of optimality variables such that the posterior trajectory distribution, conditioned on optimality, provides the solution to the optimal control problem in MaxEnt RL [7, 8, 9, 6]. While this control-as-inference framework has led to several advances, there are open questions relating to analytical results that characterize the optimal dynamics in the general case.
The combination of RL approaches with insights from statistical physics has led to significant advances in the field. While the connections to equilibrium statistical mechanics are well established, the interface with non-equilibrium statistical mechanics (NESM) is less explored. One of the contributions of this paper is to show how insights from NESM can lead to an analytical framework for MaxEnt RL that addresses open questions and reveals multiple new avenues of research.
Recent research in NESM using large deviation theory has developed a framework for analyzing Markovian processes conditioned on rare events [11, 12, 13, 14, 15]. In this framework, a generalization of the Doob h-transform is used to determine the driven process: a conditioning-free Markovian process that has the same statistics as the original Markovian process conditioned on a rare event. The connection to MaxEnt RL can be seen by noting that the goal in MaxEnt RL is to derive the posterior trajectory distribution conditioned on optimality and that, in the long-time limit, optimality of the trajectory is a rare event for the original dynamics. This commonality of conditioning on rare events suggests that approaches and insights from NESM can be used to characterize the optimal control policy for MaxEnt RL problems. While previous work has established connections between NESM and stochastic optimal control, an explicit characterization of the optimal controlled processes for general MaxEnt RL problems has not been derived to date.
Contributions: The main contributions of this work are:
- A mapping between MaxEnt RL and Markovian processes conditioned on rare events in the long-time limit.
- Analytical closed-form expressions characterizing trajectory distributions conditioned on optimality. In particular, we derive closed-form results for the optimal policy, dynamics and initial state distributions for the general case of stochastic dynamics in MaxEnt RL.
- Extension of previous approaches for linearly solvable Markov Decision Process (MDP) problems to a more general class of MDPs in MaxEnt RL. Our approach leads to expressions for soft-Q value functions in terms of the dominant eigenvalue and corresponding left eigenvector of a regular, non-negative matrix derived from the MDP.
- Novel approaches for model-based and model-free RL, which we validate using simulations.
2 Prior relevant work
Our approach is primarily based on the probabilistic inference approach to reinforcement learning and control, which has been reviewed recently. There is a long series of works on the "Control as Inference" framework, wherein the problem of control is formulated as a Bayesian inference problem [7, 9, 17, 8, 6]. This formulation brings the powerful arsenal of inference methods to the realm of control. Specifically, the Baum-Welch algorithm has been demonstrated on planning problems. Another inference method, known as expectation propagation, was adapted for approximate message passing during planning. This approach was further developed in multiple works [8, 9]. Even though this perspective on control as inference has contributed to the development of new methods, its high computational complexity hampers a broader application of inference-based methods to control problems. Analytical closed-form results for "control as inference" and/or a reduction to problems with lower complexity would facilitate the development of new control methods.
A promising step in this direction was the proposal of a particular class of linearly solvable MDPs (LSMDPs). The LSMDP method greatly simplifies reinforcement learning by reducing this particular class of MDPs to an eigenvalue problem. LSMDPs have been shown to be effective not only for RL; they also lead to improved computational complexity for other important problems, such as the shortest path problem. Moreover, insights from LSMDPs led to a new model-free reinforcement learning algorithm, denoted Z-learning, which can be more efficient than standard Q-learning. This efficiency is due mostly to the closed-form solutions available for this particular class of MDPs. In this work, we develop analytical closed-form solutions for a general class of MDPs in the long-time limit.
In this section we provide the necessary background for three core components of this work: (i) Markov decision processes (MDPs), (ii) maximum entropy reinforcement learning (MaxEnt RL), and (iii) Perron-Frobenius theory (PFT).
3.1 Markov Decision Processes
We consider the standard MDP, given by a tuple comprising the state space, the action space, the transition function, and the reward function. The agent's policy is to be optimized with regard to the expected accumulated reward. We focus on an arbitrarily large but finite time horizon, without discounting.
For a fixed policy, consider a Markov chain whose states are state-action tuples, each consisting of the agent's current state and the action taken while in that state. The transition function gives the probability that the agent transitions to a given next state after taking an action in its current state. The action is drawn from a policy distribution conditioned on the agent's current state, and the agent collects the reward associated with the corresponding state-action pair.
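With the above representation, the prior probability distribution for a trajectory factorizes into the initial state distribution and, at each time step, the product of the policy probability of the chosen action and the transition probability to the next state.
The goal of standard RL is to find the optimal policy that maximizes the total expected reward collected by the agent over the trajectory.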
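As a compact reference, the trajectory distribution and the standard RL objective described above can be written out as follows. The notation is an assumption introduced here for illustration ($s_t$ and $a_t$ for states and actions, $p$ for the dynamics and the initial state distribution, $\pi$ for the policy, $r$ for the reward, $T$ for the horizon) and may differ from the symbols used elsewhere.

```latex
% Assumed notation: trajectory \tau = (s_1, a_1, \dots, s_T, a_T)
p(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t),
\qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big].
```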
3.2 Maximum Entropy Reinforcement Learning
In MaxEnt RL, the cumulative reward objective of standard RL is modified and the optimization is carried out over trajectory distributions. The objective is augmented to include an entropy term, specifically the Kullback-Leibler divergence between the controlled trajectory distribution and the prior trajectory distribution, which allows us to recast the modified optimal control problem as an inference problem. This control-as-inference approach involves the introduction of binary optimality variables, defined such that the probability of optimality at a given time step, conditioned on the current state-action pair, is the exponential of the reward scaled by an inverse temperature parameter. Each optimality variable indicates whether the trajectory is optimal at the corresponding time step. The statistical dependencies between states, actions and optimality variables are shown in Figure 1.
With this choice, the posterior trajectory distribution, obtained by conditioning on optimality at every time step, corresponds exactly to the trajectory distribution generated by optimal control. The optimal control problem in MaxEnt RL thus becomes equivalent to a problem in Bayesian inference.
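Define the accumulated energetic cost of a trajectory as the negative of the total reward collected along it. From Bayes' rule, it follows that the posterior probability distribution over trajectories, conditioned on optimality at every time step, is the prior trajectory distribution reweighted by the Boltzmann factor of the energetic cost and normalized by the corresponding partition sum.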
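For concreteness, the optimality variables, the energetic cost and the resulting posterior can be sketched in symbols as below. The symbols ($O_t$ for the optimality variable, $\beta$ for the inverse temperature, $\mathcal{E}$ for the energetic cost) are assumptions made for this illustration; rewards are taken to be non-positive so that the first expression is a valid probability.

```latex
% Assumed notation: O_t binary optimality variable, \beta inverse temperature, r \le 0
p(O_t = 1 \mid s_t, a_t) = \exp\big(\beta\, r(s_t, a_t)\big),
\qquad
\mathcal{E}(\tau) = -\sum_{t=1}^{T} r(s_t, a_t),
\qquad
p(\tau \mid O_{1:T} = 1)
  = \frac{p(\tau)\, e^{-\beta \mathcal{E}(\tau)}}{\sum_{\tau'} p(\tau')\, e^{-\beta \mathcal{E}(\tau')}} .
```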
One of the central problems in MaxEnt RL is to determine the policy, dynamics and initial state distributions, conditioned on optimality. Note that, in many practical RL problems, control of system dynamics is not feasible. In these cases, the posterior dynamics and initial state distributions must be constrained to exactly match the prior dynamics and initial state distributions and the optimization is carried out by varying the policy distribution alone. In this work, we consider the unconstrained problem and obtain expressions for the optimal dynamics and initial state distributions along with the optimal policy distribution.
3.3 Perron-Frobenius theorem
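According to the Perron-Frobenius theorem, an irreducible, non-negative matrix has a unique dominant eigenvalue, which is real and positive, with a corresponding unique right eigenvector and a unique left eigenvector, both with strictly positive components. The eigenvectors are normalized by two conditions that fix their scales. In addition, if the matrix is sub-stochastic, the dominant eigenvalue is at most one, so it can be written as the exponential of a non-positive quantity. For regular matrices, all other eigenvalues are strictly smaller in magnitude than the dominant eigenvalue. Consequently, for large powers, the matrix power is dominated by the contribution of the dominant eigenvalue and its left and right eigenvectors.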
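The large-power statement can be checked numerically. The following is a minimal sketch, assuming a hypothetical random positive matrix `A` (not taken from any MDP) and the normalization `sum(v) = 1`, `u @ v = 1`, which is a choice made for this example. Under that normalization, the powers of `A / rho` should approach the outer product of the right and left eigenvectors.

```python
import numpy as np

def perron_triple(A):
    """Dominant eigenvalue rho with left (u) and right (v) eigenvectors of a positive matrix.

    Normalization (an assumption of this sketch): sum(v) = 1 and u @ v = 1.
    """
    w, V = np.linalg.eig(A)
    k = np.argmax(w.real)                 # the Perron root is real and has the largest real part
    rho = w[k].real
    v = np.abs(V[:, k].real)
    v /= v.sum()
    wl, U = np.linalg.eig(A.T)            # left eigenvectors of A are right eigenvectors of A.T
    u = np.abs(U[:, np.argmax(wl.real)].real)
    u /= u @ v
    return rho, u, v

rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 0.1              # strictly positive entries, hence regular
A /= 1.25 * A.sum(axis=0).max()           # rescale so every column sums to less than one
rho, u, v = perron_triple(A)

N = 200
approx_error = np.abs(np.linalg.matrix_power(A / rho, N) - np.outer(v, u)).max()
print(rho, approx_error)
```

Dividing by `rho` before taking the power keeps the entries of order one, which avoids underflow for long horizons.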
4 Methodology and results
Without loss of generality, we consider reward functions shifted so that the maximum reward is zero, so that all rewards are non-positive. In this case, Equation (3) indicates that, in the long-time limit, optimality of the entire trajectory is a rare event and the problem of determining the posterior policy and dynamics corresponds to conditioning on a rare event. Recent research in NESM [12, 14] has developed a framework for characterizing Markovian processes conditioned on rare events. In the following, we show how some of the key results obtained in NESM can be derived in the MaxEnt RL context.
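Consider two consecutive state-action tuples. We can combine the system dynamics with the fixed prior policy to compose the transition matrix of the corresponding discrete-time Markov chain over state-action tuples, whose entries are the products of the transition probability to the next state and the prior policy probability of the next action.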
4.1 Twisted and driven transition matrices
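The twisted transition matrix is obtained from the prior transition matrix by weighting each transition with the exponentiated reward, scaled by the inverse temperature. It is readily seen that the twisted transition matrix generates a distribution over trajectories that corresponds to the term in the numerator of Equation (4). However, the twisted matrix is not a stochastic matrix and thus cannot be interpreted as a transition matrix for a Markov chain that conserves probability. To address this issue, we introduce an additional absorbing state for the agent such that the extended transition matrix is a stochastic matrix. The probability of transitioning to the absorbing state from each state-action pair is defined so that the total outgoing probability from that pair sums to one.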
The extended model now provides an interpretation for the optimality variable introduced in Equation (3). Consider the evolution of the system for a given number of time steps under the extended transition matrix. Imposing optimality at every time step is equivalent to conditioning on non-absorption for all of these time steps. Thus the optimal trajectory distribution is the distribution over trajectories generated by the extended transition matrix, conditioned on no transitions to the absorbing state for the entire trajectory. This interpretation allows us to make connections to the theory of quasi-stationary distributions [21, 22], which can be used to analyze Markovian processes conditioned on non-absorption.
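The construction of the prior, twisted and extended matrices can be sketched as follows. This is an illustrative example on a hypothetical random toy MDP. The array names (`p`, `pi`, `r`), the column convention `P[(s', a'), (s, a)]`, and the choice to attach the reward weight to the originating state-action pair are assumptions of this sketch. The last two lines compare the probability of not being absorbed in the extended chain with the column sums of the powers of the twisted matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 3, 2, 1.0

# Hypothetical toy MDP: dynamics p[s, a, s'], uniform prior policy pi[s, a], rewards r[s, a] <= 0.
p = rng.random((n_states, n_actions, n_states)) + 0.1
p /= p.sum(axis=2, keepdims=True)
pi = np.full((n_states, n_actions), 1.0 / n_actions)
r = -rng.random((n_states, n_actions))
r -= r.max()                                   # shift so that the maximum reward is 0

n = n_states * n_actions                       # number of state-action pairs
# Prior transition matrix, column convention: P[(s', a'), (s, a)] = p(s'|s, a) * pi(a'|s').
P = np.einsum("sat,tb->tbsa", p, pi).reshape(n, n)
# Twisted matrix: each column is weighted by exp(beta * r) of the originating pair.
T = P * np.exp(beta * r.reshape(-1))[None, :]

# Extended matrix with one extra absorbing state that collects the missing probability.
E = np.zeros((n + 1, n + 1))
E[:n, :n] = T
E[n, :n] = 1.0 - T.sum(axis=0)
E[n, n] = 1.0

N = 10
survival_extended = np.linalg.matrix_power(E, N)[:n, :n].sum(axis=0)  # not absorbed after N steps
survival_twisted = np.linalg.matrix_power(T, N).sum(axis=0)           # optimal for N steps
```

The two survival vectors agree because the absorbing state never feeds probability back into the original state-action pairs.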
For the dynamics generated by the twisted matrix, given an initial state-action pair, the probability of reaching another state-action pair after a given number of steps is given by the corresponding entry of the matrix power. In the following, we assume that the twisted matrix is a regular non-negative matrix, so that the corresponding dynamics is irreducible. To ensure irreducibility, we can, for example, add transitions to the initial state from states that would otherwise be considered terminal states. Consequently, the conditions of the Perron-Frobenius theorem are satisfied for the twisted matrix and, for a large number of steps, its powers are dominated by the dominant eigenvalue and the corresponding eigenvectors.
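The probability that a trajectory starting from a given state-action pair is optimal for a given number of steps is obtained by summing the corresponding entries of the matrix power over all final state-action pairs. For a large number of steps, this probability is proportional to the corresponding component of the left eigenvector, multiplied by the dominant eigenvalue raised to the number of steps.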
This result can be used to derive the posterior distribution over trajectories conditioned on optimality. Typically, the difficulty in deriving expressions for the posterior distribution stems from estimating the partition sum in the denominator of Equation (4). However, we note that the partition sum is the average of the optimality probability above over the prior distribution of the initial state-action pair, and thus it can be estimated using the results derived.
To derive expressions for the posterior state and dynamics distributions conditioned on optimality, we define, consistent with the terminology in NESM, the driven transition matrix as the transition probability between two consecutive state-action tuples conditioned on optimality of the entire trajectory.
4.2 Main Results
The following lemma connects the twisted transition matrix with the driven transition matrix, which is an essential step required to derive the closed-form solution for MaxEnt RL in the long-time limit.
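Lemma (Driven Transition Matrix). In the long-time limit, the driven transition matrix is obtained from the twisted transition matrix by multiplying each entry by the left eigenvector component of the destination state-action pair and dividing by the product of the dominant eigenvalue and the left eigenvector component of the originating pair.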
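Proof. Let us divide the trajectory into two parts, separated at a given time step, and focus on the limit in which the part following that time step is long. Using the definition of the driven matrix together with Bayes' rule, the transition probability conditioned on optimality can be expressed as the corresponding entry of the twisted matrix multiplied by the ratio of two optimality probabilities: that of the remaining trajectory starting from the next state-action pair, and that of the remaining trajectory starting from the current state-action pair. In the long-time limit, each of these probabilities is proportional to the corresponding component of the left eigenvector, and the two horizons differ by one factor of the dominant eigenvalue. Substituting these long-time forms yields the stated result.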
Equation (13) recovers the expression for the driven model as a generalized Doob h-transform obtained in recent work in NESM [12, 13, 14, 15]. It is interesting to note that our analysis recovers this result based on Bayesian inference of the posterior trajectory distribution.
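The lemma can be illustrated numerically. The sketch below assumes a hypothetical strictly positive, sub-stochastic matrix `T` standing in for a twisted matrix, with the column convention `T[j, i]` for a transition from pair `i` to pair `j` (both are assumptions of this example). It compares the Doob-type transform built from the left eigenvector with the one-step transition probabilities obtained by explicit conditioning on optimality over a long but finite horizon.

```python
import numpy as np

def driven_matrix(T):
    """Doob-type transform D[j, i] = u[j] * T[j, i] / (rho * u[i]), column convention."""
    w, U = np.linalg.eig(T.T)             # left eigenvectors of T
    k = np.argmax(w.real)                 # Perron root: real, largest real part
    rho = w[k].real
    u = np.abs(U[:, k].real)
    return u[:, None] * T / (rho * u[None, :])

rng = np.random.default_rng(0)
T = rng.random((6, 6)) + 0.1              # hypothetical strictly positive twisted matrix
T /= 1.25 * T.sum(axis=0).max()           # make it sub-stochastic (column sums below one)
D = driven_matrix(T)

# Conditioned one-step transition probabilities for a finite horizon N, via Bayes' rule:
# P(x2 = j | x1 = i, optimal for N steps) = T[j, i] * b_{N-1}(j) / b_N(i),
# where b_k(i) is the probability of staying optimal for k steps starting from i.
N = 200
b_prev = np.ones(6) @ np.linalg.matrix_power(T, N - 1)   # b_{N-1}
b_next = b_prev @ T                                      # b_N
conditioned = T * b_prev[:, None] / b_next[None, :]
```

The driven matrix is stochastic by construction, and the explicitly conditioned transition probabilities converge to it as the horizon grows.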
The Lemma (Driven Transition Matrix) allows us to derive the following expressions for the optimal dynamics, policy and initial state-action pair distributions.
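Theorem (Closed-Form Solution for MaxEnt RL). In the long-time limit, the optimal dynamics, the optimal policy and the optimal initial state-action distribution are given in closed form in terms of the dominant eigenvalue and the corresponding left eigenvector of the twisted matrix.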
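Proof. By definition, the posterior distribution of the initial state-action pair is proportional to its prior probability multiplied by the probability that a trajectory starting from that pair is optimal.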
Using the approximation in Eq. 18, Eq. (23) reduces to the result stated in Eq. (22). The results for the optimal policy and dynamics can be obtained by decomposing the result for the driven transition matrix in Eq. (13). The details of the proof are provided in the Supplementary Materials.
Equations (20), (21) and (22) show that, in the long-time limit, the optimal dynamics can be completely characterized by the dominant eigenvalue and the corresponding left eigenvector of the twisted matrix. While previous work has shown how a special class of MDPs is linearly solvable [20, 23], our results show that linear solutions can be obtained for general MDP models in the long-time limit. We elaborate on this generalization in Section (4.5).
The significance of this result is that it provides the complete solution for the central problem of MaxEnt RL (stated in Equation (5)). For the case of deterministic dynamics, the results show that the optimal dynamics is unchanged from the original dynamics and the optimal policy is determined by the left eigenvector. For the case of stochastic dynamics, the results determine how the original dynamics must be controlled to obtain the optimal dynamics.
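One way to see how the driven matrix decomposes into a policy and a dynamics factor is the sketch below. It assumes a hypothetical random toy MDP and the column convention `T[(s', a'), (s, a)]`; the expressions for `pi_star` and `p_star` are obtained here by factoring the Doob-type transform and are meant as an illustration, not as a transcription of the closed-form equations of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 3, 2, 1.0
p = rng.random((n_states, n_actions, n_states)) + 0.1    # hypothetical dynamics p[s, a, s']
p /= p.sum(axis=2, keepdims=True)
pi = np.full((n_states, n_actions), 1.0 / n_actions)     # uniform prior policy
r = -rng.random((n_states, n_actions))
r -= r.max()                                             # maximum reward shifted to 0

n = n_states * n_actions
T = np.einsum("sat,tb->tbsa", p, pi).reshape(n, n) * np.exp(beta * r.reshape(-1))[None, :]

w, U = np.linalg.eig(T.T)                                # left eigenvectors of T
k = np.argmax(w.real)
rho = w[k].real
u = np.abs(U[:, k].real).reshape(n_states, n_actions)    # u[s, a]

u_bar = (pi * u).sum(axis=1)                             # sum_a pi(a|s) u(s, a)
pi_star = pi * u / u_bar[:, None]                        # posterior policy
p_star = p * u_bar[None, None, :] * np.exp(beta * r)[:, :, None] / (rho * u)[:, :, None]

# The product p_star(s'|s, a) * pi_star(a'|s') should reproduce the driven matrix.
u_flat = u.reshape(-1)
driven = u_flat[:, None] * T / (rho * u_flat[None, :])
recomposed = np.einsum("sat,tb->tbsa", p_star, pi_star).reshape(n, n)
```

Because `p_star` is proportional to `p` along the next-state axis, deterministic prior dynamics are left unchanged after normalization, in line with the statement above.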
4.3 Solution properties & connections to soft value functions
The results derived for the optimal dynamics can be used to derive analytical expressions for value functions in MaxEnt RL and to make further connections to statistical mechanics. In the following, unless otherwise stated, we consider trajectories starting from a fixed initial state. Using the relations between the backward messages and the soft value functions, in combination with Equation (11), we obtain expressions for the soft value functions in terms of the dominant eigenvalue and the left eigenvector.
Thus the value functions in MaxEnt RL can be obtained using the dominant eigenvalue and the left eigenvector of the twisted matrix. The results derived show that, in the limit of a large number of time steps, the soft value functions per time step are primarily determined by the Perron-Frobenius eigenvalue. The small fluctuations around this constant offset, however, are the key determinants of the optimal policy (see Fig. 2). These results have been validated by comparison with the dynamic programming solution for MaxEnt RL.
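A comparison of this kind can be sketched as follows, on a hypothetical random toy MDP. The sketch assumes that the soft Q-function is the logarithm of the backward message divided by the inverse temperature, that the backward message is propagated by the twisted matrix, and that the eigenvectors are normalized as `sum(v) = 1`, `u @ v = 1`; these conventions are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 3, 2, 1.0
p = rng.random((n_states, n_actions, n_states)) + 0.1    # hypothetical dynamics p[s, a, s']
p /= p.sum(axis=2, keepdims=True)
pi = np.full((n_states, n_actions), 1.0 / n_actions)     # uniform prior policy
r = -rng.random((n_states, n_actions))
r -= r.max()                                             # maximum reward shifted to 0

n = n_states * n_actions
T = np.einsum("sat,tb->tbsa", p, pi).reshape(n, n) * np.exp(beta * r.reshape(-1))[None, :]

w, V = np.linalg.eig(T)
k = np.argmax(w.real)
rho = w[k].real
v = np.abs(V[:, k].real)
v /= v.sum()                                             # assumed normalization: sum(v) = 1
wl, U = np.linalg.eig(T.T)
u = np.abs(U[:, np.argmax(wl.real)].real)
u /= u @ v                                               # assumed normalization: u . v = 1

# Dynamic programming: backward messages b_k(i) = probability of staying optimal for k steps.
N = 200
b = np.ones(n)
for _ in range(N):
    b = b @ T
q_dp = (np.log(b) / beta).reshape(n_states, n_actions)

# Eigenvalue-based expression: a constant offset per time step plus a log-eigenvector term.
q_pf = ((N * np.log(rho) + np.log(u)) / beta).reshape(n_states, n_actions)
```

The term `N * log(rho) / beta` is the same for every state-action pair, so the ranking of actions, and hence the policy, depends only on `log(u)`.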
Besides the value functions, other quantities of interest in RL can also be obtained using the Perron-Frobenius eigenvalue and corresponding eigenvectors. For example, it can be shown that the forward messages can be directly obtained from the corresponding right eigenvector in the long-time limit. Furthermore, for time steps far from both the beginning and the end of a long trajectory, the posterior marginal distribution over state-action pairs is given by the product of the corresponding components of the left and right eigenvectors (see Supplementary Materials).
We note that these products of left and right eigenvector components are the components of the dominant right eigenvector of the driven matrix, i.e. the components of the steady-state distribution over state-action pairs generated by the driven dynamics. Given that the steady-state distribution over state-action pairs is a key component of related approaches such as inverse RL and extensions of soft Q-learning, the result obtained in Equation (26) can significantly impact the computations involved in these approaches.
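The steady-state property can be verified in one line. The derivation below is a sketch under assumed notation: $\tilde{P}$ for the twisted matrix with column convention, $\rho$ for its dominant eigenvalue, $u$ and $v$ for the left and right eigenvectors normalized so that $u^{\top} v = 1$, and $P^{d}$ for the driven matrix written as a Doob-type transform.

```latex
% Assumed notation: P^{d}_{ji} = u_j \tilde{P}_{ji} / (\rho\, u_i), \quad \tilde{P} v = \rho v, \quad u^{\top} v = 1
\sum_i P^{d}_{ji}\, (u_i v_i)
  = \frac{u_j}{\rho} \sum_i \tilde{P}_{ji}\, v_i
  = \frac{u_j}{\rho}\, \rho\, v_j
  = u_j v_j ,
\qquad
\sum_i u_i v_i = 1 .
```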
The formalism developed also provides insights into the analogs in statistical mechanics for quantities of interest in MaxEnt RL. The Boltzmann form of the optimal trajectory distribution indicates that optimization of the objective function by the inference approach is equivalent to minimizing a free energy functional. This free energy functional attains its minimum value for the optimal trajectory distribution (see Supplementary Materials).
The preceding equation indicates that the optimal controlled dynamics minimizes the collected energetic costs, regularized by the entropic costs corresponding to the Kullback-Leibler divergence between controlled and prior trajectory distributions.
Furthermore, from statistical mechanics at the trajectory level, the minimum free energy is determined by the logarithm of the partition sum, which leads to an identity relating the minimum free energy to the optimal soft value function.
In other words, minimizing the combined energetic and entropic costs is equivalent to obtaining the optimal soft value function.
4.4 Model-Based Learning - General Linearly Solvable MDP
Based on the theoretical derivations presented above, we obtain a way to linearly solve a general class of MDPs by determining the dominant eigenvalue and corresponding left eigenvector of the twisted transition matrix. Our approach is consistent with previous work on a special class of linearly solvable MDPs. This previous approach considers the case of state-to-state transitions, with state-dependent rewards and reducible dynamics (i.e. with absorbing goal states that have maximum reward), and cannot be applied to problems without absorbing states and with irreducible dynamics. Our method represents an extension of this approach that explicitly considers state-action pairs and applies to problems with irreducible dynamics. We have verified that our approach can also be applied to models with reducible dynamics and absorbing goal states that have the maximum reward. This is done by transforming the reducible model into an irreducible model by adding a small parameter that represents the probability of transitioning from the absorbing state back to the initial state. This additional parameter makes the dynamics irreducible, and thus our formalism can be applied to determine the left eigenvector, which in turn determines the optimal policy. As this parameter tends to zero, the limiting value of the left eigenvector determines the optimal policy for the reducible case. Working in this limit, we can show that the equations derived in this work, specifically Eq. 20, reduce to the previously obtained result for the restricted class of MDPs with reducible dynamics (see Supplementary Materials).
The above procedure was applied to develop a novel approach to the shortest path problem. Consider a maze with an absorbing goal state and a reward function that assigns a constant penalty to every state-action pair whose state is not the goal state, and zero reward to the absorbing goal state. With this setup, determining the trajectory that minimizes the energetic costs is equivalent to determining the shortest path. For the case of a uniform prior policy and deterministic dynamics, all allowed trajectories are equally likely under the prior. From Eq. (4) we see that the trajectory obtained by following the greedy version of the optimal policy corresponds to the trajectory that minimizes the energetic costs, i.e. the shortest path. It is interesting to note that this procedure for identifying shortest paths does not depend on the temperature parameter. The proposed approach for determining shortest paths has been validated in Figure (3). In this experiment, we have a reducible MDP where the goal state is an absorbing state. An optimal stochastic policy is obtained through the left eigenvector of the corresponding twisted matrix. The figure shows the greedy version of this policy which, in this setting, is independent of the inverse temperature used for the computations. The result shown matches the ground truth for this problem.
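The procedure can be sketched on a deliberately small example. The code below uses a hypothetical one-dimensional corridor instead of a maze, with the goal in the last cell, a step penalty of one, and a small return probability `eps` from the goal to the first cell to make the dynamics irreducible. The function name, the corridor layout and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def corridor_greedy_actions(n_states=4, beta=1.0, eps=1e-3):
    """Greedy actions (0 = left, 1 = right) in a corridor whose last cell is the goal."""
    n_actions, goal = 2, n_states - 1
    p = np.zeros((n_states, n_actions, n_states))       # deterministic dynamics p[s, a, s']
    for s in range(goal):
        p[s, 0, max(s - 1, 0)] = 1.0                    # left (wall at cell 0)
        p[s, 1, s + 1] = 1.0                            # right
    p[goal, :, goal] = 1.0 - eps                        # the goal is almost absorbing
    p[goal, :, 0] = eps                                 # small return probability: irreducible
    pi = np.full((n_states, n_actions), 0.5)            # uniform prior policy
    r = -np.ones((n_states, n_actions))                 # constant step penalty
    r[goal] = 0.0                                       # maximum reward at the goal

    n = n_states * n_actions
    P = np.einsum("sat,tb->tbsa", p, pi).reshape(n, n)  # P[(s', a'), (s, a)]
    T = P * np.exp(beta * r.reshape(-1))[None, :]       # twisted matrix

    w, U = np.linalg.eig(T.T)                           # left eigenvectors of T
    u = np.abs(U[:, np.argmax(w.real)].real).reshape(n_states, n_actions)
    # With a uniform prior, the greedy posterior action maximizes u(s, a) in each state.
    return [int(a) for a in np.argmax(u[:goal], axis=1)]

print(corridor_greedy_actions(beta=0.5), corridor_greedy_actions(beta=2.0))
```

In this corridor the shortest path is to move right in every cell, and the greedy actions are the same for both inverse temperatures.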
4.5 Model-Free Learning: -Learning Algorithm
The framework developed shows how several quantities of interest in MaxEnt RL can be obtained using the dominant eigenvalue and the corresponding left eigenvector of the twisted matrix. In the following, we show how these quantities can be obtained in a model-free setting by system exploration using the original transition dynamics. By taking the sum over the columns of the driven matrix in equation (13), we note that the left eigenvector elements can be written as an expectation value over the original transition dynamics. Correspondingly the dominant eigenvalue and left eigenvector can be obtained through a learning process based on the following equation
The corresponding update equations for learning the dominant eigenvalue and the left eigenvector are
Here the two step sizes appearing in the updates are learning rates. Note that the prior policy is used for sampling actions during the training process (see Equation (29)). Thus this model-free approach to RL, which we term -learning, is fundamentally an off-policy approach wherein the optimal policy is obtained via system exploration using the prior policy. Figure (4) shows that optimal policies obtained using Equations (30) and (31) are in excellent agreement with the results obtained using dynamic programming.
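The structure of such a scheme can be sketched in symbols. The first line below is the left eigenvector equation of the twisted matrix rewritten as an expectation over the original dynamics and the prior policy, which follows from requiring the driven matrix to be stochastic. The two update rules are one plausible stochastic-approximation form consistent with that relation; they are an assumption of this sketch, as are the symbols ($u$, $\rho$, $\alpha_u$, $\alpha_\rho$), and they should not be read as the exact updates used in the experiments.

```latex
% Assumed notation: u left eigenvector, \rho dominant eigenvalue, \pi prior policy,
% \alpha_u, \alpha_\rho learning rates; samples (s, a, r, s', a') are gathered under the prior policy.
\begin{aligned}
u(s,a) &= \frac{e^{\beta r(s,a)}}{\rho}\;
          \mathbb{E}_{s' \sim p(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')} \big[ u(s',a') \big], \\
u(s,a) &\leftarrow (1-\alpha_u)\, u(s,a) + \alpha_u\, \frac{e^{\beta r}}{\rho}\, u(s',a'), \\
\rho &\leftarrow (1-\alpha_\rho)\, \rho + \alpha_\rho\, \frac{e^{\beta r}\, u(s',a')}{u(s,a)} .
\end{aligned}
```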
In conclusion, we have established a mapping between MaxEnt RL and recent research in NESM involving Markov processes conditioned on rare events. The results derived include analytical expressions that address open problems in the field and lead to linear solutions for a general class of MDPs. While previous linear approaches have focused on reducible dynamics with goal states that are absorbing states with maximum reward, our approach considers the case of irreducible dynamics and can thus be readily applied to exploratory problems in, e.g., robotics. The results derived also lead to a novel algorithm for model-free RL which we term learning. The main limitation of the current results is the restriction to discrete states and actions. We will explore the continuous case in our subsequent work. Another essential research direction is to generalize the current calculation to determine the optimal policy for the case of stochastic dynamics with the constraint that the posterior probabilities of the initial state and dynamics are kept fixed to the original probabilities. In this case, the challenge is to perform structured variational inference and it is of interest to determine if the framework developed in the current work can also be applied to this constrained optimization problem. The results obtained have thus established a new framework for analyzing optimization problems using MaxEnt RL and generalizations of this approach hold great promise for obtaining solutions to a broader range of problems in stochastic optimal control.
6 Implementation and code availability
We used OpenAI’s gym framework, which is an open source project under the MIT license. We are making our code available at https://github.com/argearriojas/maxent-rl-mdp-scripts. Additional details regarding the implementation are provided in the Supplementary Materials.
AA and RK would like to acknowledge funding from the NSF through Award DMS-1854350. ST would like to acknowledge funding from Berkeley Deep Drive. AA thanks Oracle for funding this project through the CMS Doctoral Fellowship at UMass Boston.
-  David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, Dec 2018.
-  Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, Dec 2020.
-  Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning, pages 1352–1361. PMLR, Jul 2017.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
-  Benjamin Eysenbach and Sergey Levine. Maximum Entropy RL (Provably) Solves Some Robust RL Problems. arXiv, Mar 2021.
-  Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv, May 2018.
-  Emanuel Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pages 4286–4292. IEEE, 2008.
-  Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012.
-  Hilbert J. Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Mach. Learn., 87(2):159–182, May 2012.
-  Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre GR Day, Clint Richardson, Charles K Fisher, and David J Schwab. A high-bias, low-variance introduction to machine learning for physicists. Physics Reports, 810:1–124, 2019.
-  Juan P Garrahan, Robert L Jack, Vivien Lecomte, Estelle Pitard, Kristina van Duijvendijk, and Frédéric van Wijland. First-order dynamical phase transition in models of glasses: an approach based on ensembles of histories. Journal of Physics A: Mathematical and Theoretical, 42(7):075007, 2009.
-  Robert L Jack and Peter Sollich. Large deviations and ensembles of trajectories in stochastic models. Progress of Theoretical Physics Supplement, 184:304–317, 2010.
-  Raphaël Chetrite and Hugo Touchette. Nonequilibrium Microcanonical and Canonical Ensembles and Their Equivalence. Phys. Rev. Lett., 111(12):120601, Sep 2013.
-  Raphaël Chetrite and Hugo Touchette. Nonequilibrium Markov Processes Conditioned on Large Deviations. Annales Henri Poincaré, 16(9):2005–2057, Sep 2015.
-  Raphaël Chetrite and Hugo Touchette. Variational and optimal control representations of conditioned and driven processes. Journal of Statistical Mechanics: Theory and Experiment, 2015(12):P12001, 2015.
-  Vladimir Y Chernyak, Michael Chertkov, Joris Bierkens, and Hilbert J Kappen. Stochastic optimal control as non-equilibrium statistical mechanics: Calculus of variations over density and current. Journal of Physics A: Mathematical and Theoretical, 47(2):022001, 2013.
-  Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pages 1049–1056, 2009.
-  Hagai Attias. Planning by probabilistic inference. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, volume R4 of Proceedings of Machine Learning Research, pages 9–16. PMLR, 03–06 Jan 2003.
-  Thomas Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), 2001.
-  Emanuel Todorov. Linearly-solvable markov decision problems. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
-  Sylvie Méléard, Denis Villemonais, et al. Quasi-stationary distributions and population processes. Probability Surveys, 9:340–410, 2012.
-  Erik A van Doorn and Philip K Pollett. Quasi-stationary distributions for discrete-state models. European journal of operational research, 230(1):1–14, 2013.
-  Emanuel Todorov. Efficient computation of optimal actions. Proc. Natl. Acad. Sci. U.S.A., 106(28):11478–11483, Jul 2009.
-  Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
-  Jordi Grau-Moya, Felix Leibfried, and Peter Vrancx. Soft q-learning with mutual-information regularization. In International Conference on Learning Representations, 2018.
-  Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018.
-  Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv, May 2020.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, Jun 2016.