 # On Solving a Stochastic Shortest-Path Markov Decision Process as Probabilistic Inference

Previous work on planning as active inference addresses finite-horizon problems and solutions valid only for online planning. We propose solving the general Stochastic Shortest-Path Markov Decision Process (SSP MDP) as probabilistic inference. Furthermore, we discuss online and offline methods for planning under uncertainty. In an SSP MDP, the horizon is indefinite and unknown a priori. SSP MDPs generalize finite and infinite horizon MDPs and are widely used in the artificial intelligence community. Additionally, we highlight some of the differences between solving an MDP using dynamic programming approaches widely used in the artificial intelligence community and approaches used in the active inference community.


## 1 Introduction

A core problem in the field of artificial intelligence (AI) is building agents capable of automated planning under uncertainty. Problems involving planning under uncertainty are typically formulated as an instance of a Markov Decision Process (MDP). At a high level, an MDP comprises 1) a set of world states, 2) a set of actions, 3) a transition model describing the probability of transitioning to a new state when taking an action in the current state, and 4) an objective function (e.g. minimizing costs over a sequence of time steps). An MDP solution determines the agent's actions at each decision point. An optimal MDP solution is one that optimizes the objective function. These are typically obtained using dynamic programming algorithms [17, 26].[^1] Additionally, other methods exist in the reinforcement learning community, such as policy gradient methods [28, 14, 27].

[^1]: Linear programming approaches are also popular methods for solving MDPs [2, 12, 22, 7].
Recent work based on the active inference framework poses the planning problem as a probabilistic inference problem. Several papers have been published showing connections between active inference and dynamic programming for solving an MDP [15, 9, 8]. However, the planning problem being solved in the two communities is not equivalent.

First, dynamic programming approaches used to solve an MDP, such as policy iteration, are valid for finite, infinite and indefinite horizons. Indefinite horizons are finite but of unknown length a priori. For instance, consider an agent navigating from a starting state to a goal state in a grid world where the outcome is uncertain (e.g. the grid world in Figure 1 shown in the appendix). Before starting to act in the environment, there is no way for the agent to know how many time steps it will take to reach the goal. Algorithms based on dynamic programming, such as policy iteration, are valid for such settings. They can solve the Stochastic Shortest-Path Markov Decision Process (SSP MDP) [17, 2]. However, work from active inference is only formulated for finite horizons.

Second, the optimal solution to an SSP MDP is a stationary deterministic policy: a mapping from states to actions that is independent of time. Computing this optimal policy can be done offline (without interaction with the environment) or online (while interacting). In the active inference literature, however, solving the planning problem is performed by computing a stochastic plan (a sequence of actions given the current state). This is only valid during online planning. Additionally, the solution is only optimal given a certain horizon, which is specified a priori. If the chosen horizon is too short, the agent will not find a solution to reach the target. If it is too long, the solution will be sub-optimal.

The main contribution of this paper is a novel algorithm for solving a Stochastic Shortest-Path Markov Decision Process, an MDP with an indefinite horizon, using probabilistic inference. Additionally, we highlight several gaps between solving an MDP in the AI community and in the active inference community.

Section 2 discusses the SSP MDP and Section 3 presents an approach for solving an SSP MDP as probabilistic inference. The equivalence between the two methods is shown in Section 4.1. Furthermore, the difference between world states and temporal states is highlighted in Section 4.2. Policies, plans and probabilistic plans are discussed in Section 4.3. Finally, a discussion of online vs. offline planning can be found in Section 5.

## 2 Stochastic Shortest Path MDP

An SSP MDP is defined as a tuple $\langle S, A, C, T, G \rangle$. $S$ is the set of states, $A$ is the set of actions, $C$ is the cost function, and $T$ is the transition function. $G \subseteq S$ is the set of goal states. Each goal state $s_g \in G$ is absorbing and incurs zero cost. The expected cost of applying action $a$ in state $s$ is $C(s, a) = \sum_{s' \in S} T(s, a, s')\, C(s, a, s')$. The minimum expected cost at state $s$ is $V^*(s)$. A policy $\pi$ maps a state to a distribution over action choices. A policy is deterministic if it chooses a single action at each step. A policy is proper if it reaches $G$ starting from any $s \in S$ with probability 1. In an SSP MDP, the following assumptions are made: a) there exists a proper policy, and b) every improper policy incurs infinite cost at all states where it is improper.

The goal is to find an optimal policy $\pi^*$ with the minimum expected cost, which can be computed as

$$\pi^*(s) = \operatorname*{argmin}_{a \in A} \Big[ \sum_{s' \in S} T(s, a, s') \big[ C(s, a, s') + V^*(s') \big] \Big].$$

$V^*(s)$ is referred to as the optimal value for a state and is defined as:

$$V^*(s) = \min_{a \in A} \Big[ \sum_{s' \in S} T(s, a, s') \big[ C(s, a, s') + V^*(s') \big] \Big].$$

Crucially, the optimal policy corresponding to the optimal value function is Markovian (dependent only on the current state) and deterministic. Solving an SSP MDP means finding a policy that minimizes expected cost, as opposed to one that maximizes reward. This difference is purely semantic as the two problems are dual: we can define a reward function $R = -C$ and move to a reward-maximization formulation. A more fundamental distinction is the presence of a special set of (terminal) goal states, in which staying forever incurs no cost.

Solving an SSP MDP can be done using standard dynamic programming algorithms such as policy iteration. Policy iteration alternates between two steps, policy evaluation and policy improvement. In policy evaluation, for a policy $\pi$, the value function is evaluated recursively until convergence as

$$V^\pi(s) \leftarrow \sum_{s' \in S} T(s, \pi(s), s') \big[ C(s, \pi(s), s') + V^\pi(s') \big].$$

In the policy improvement step, the state-action value function is computed as:

$$Q(s, a) = \sum_{s' \in S} T(s, a, s') \big[ C(s, a, s') + V(s') \big].$$

Then we compute a new policy as $\pi'(s) = \operatorname*{argmin}_{a \in A} Q(s, a)$ for every state $s$ in $S$. Iterating between these two steps guarantees convergence to an optimal policy.
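The two steps above can be sketched in a few lines of NumPy. The 4-state chain below is a hypothetical stand-in for the grid world (state 3 is the absorbing, cost-free goal); `policy_evaluation` and `policy_iteration` are illustrative names, not code from the paper.

```python
import numpy as np

# Hypothetical SSP MDP: T[s, a, s'] transition probabilities, C[s, a, s'] costs.
n_states, n_actions, goal = 4, 2, 3
T = np.zeros((n_states, n_actions, n_states))
C = np.ones((n_states, n_actions, n_states))
for s in range(n_states - 1):
    T[s, 0, s + 1] = 0.9          # action 0: move towards the goal, may slip
    T[s, 0, s] = 0.1
    T[s, 1, s] = 1.0              # action 1: stay put (improper on its own)
T[goal, :, goal] = 1.0            # the goal is absorbing...
C[goal, :, :] = 0.0               # ...and incurs zero cost

def policy_evaluation(pi, V, eps=1e-10):
    # Sweep V(s) <- sum_s' T(s, pi(s), s') [C(s, pi(s), s') + V(s')] to convergence.
    while True:
        V_new = np.array([T[s, pi[s]] @ (C[s, pi[s]] + V) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)   # start from a proper policy
    V = np.zeros(n_states)
    while True:
        V = policy_evaluation(pi, V)
        # Q(s, a) = sum_s' T(s, a, s') [C(s, a, s') + V(s')]
        Q = np.einsum('sap,sap->sa', T, C + V)
        pi_new = Q.argmin(axis=1)        # greedy improvement step
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```

For this chain, the resulting optimal policy moves towards the goal in every non-goal state, and the goal state has value zero.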

#### 2.0.1 Properties of an SSP MDP.

An SSP MDP can be shown to generalize finite, infinite and indefinite horizon MDPs [3, 17]. Thus algorithms valid for an SSP MDP are also valid for finite and infinite horizon MDPs. Additionally, it can be proven that each SSP MDP has an optimal deterministic policy independent of time. Therefore, the claims made about active inference being more general because it computes stochastic policies are unjustified when solving an MDP. However, these results do not hold in partially observable settings or in the presence of uncertain models. But in the SSP MDP defined above (which is commonly used in the AI community), there always exists a deterministic optimal policy. Note that there is an infinite number of stochastic policies but only a finite number of deterministic policies ($|A|^{|S|}$). Restricting attention to deterministic policies greatly speeds up the algorithms while still guaranteeing optimality.

## 3 Solving an SSP MDP as probabilistic inference

In this section we discuss a novel approach for solving an SSP MDP as probabilistic inference. We use an inference algorithm that exactly solves an SSP MDP as defined in the previous section. This approach is inspired by work from [32, 31] which solves an MDP with an indefinite horizon. This approach has been successfully applied to solve problems of planning under uncertainty, e.g. [18, 30].

### 3.1 Definitions

The definition of an SSP MDP includes a set of world states $S$ and actions $A$. In probabilistic inference we instead reason about temporal states and actions. A temporal state $s_t$ is a random variable defined over all world states. Conceptually it represents the state that the agent will visit at time-step $t$. For the grid world in Figure 1, there are 16 world states, but the number of temporal states is unknown a priori since the horizon is unknown.

The transition probability is defined as a probability distribution over temporal states and actions as $P(s_{t+1} \mid s_t, a_t)$. If the random variables are fixed to specific world states $s$ and $s'$ and an action $a$, the transition probability is equivalent to the transition function $T(s, a, s')$ defined for an SSP MDP. The probability of taking a certain action in a state is parameterized by a policy as $P(a_t = a \mid s_t = s; \pi)$. This policy is defined exactly as in the case of an SSP MDP.

The cost function is defined differently. The temporal cost variables are defined as binary random variables $c_t \in \{0, 1\}$. Translating an arbitrary cost function to temporal costs can be done by scaling the cost function (as defined in the previous section) between the minimum cost $\min(C)$ and maximum cost $\max(C)$ as:

$$P(c_t = 1 \mid a_t = a, s_t = s) = \frac{C(a, s) - \min(C)}{\max(C) - \min(C)}.$$

Any expression involving $c_t = 1$ can be thought of as 'the probability of the cost being maximal'. Thus, the probability of the cost being maximal given a state and an action is $P(c_t = 1 \mid a_t, s_t)$. We can then treat the highest possible cost for a state and action as one where $P(c_t = 1 \mid a_t, s_t) = 1$ and the lowest possible cost as one where $P(c_t = 1 \mid a_t, s_t) = 0$. Any other cost has a probability in between, according to its magnitude.
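A minimal sketch of this scaling, using hypothetical cost values:

```python
import numpy as np

# Hypothetical costs C[s, a] for 2 states and 2 actions.
C = np.array([[4.0, 2.0],
              [0.0, 6.0]])
# P(c_t = 1 | a_t = a, s_t = s): scale the costs into [0, 1].
P_cost = (C - C.min()) / (C.max() - C.min())
```

The maximal cost maps to probability 1, the minimal cost to 0, and every other cost falls in between according to its magnitude.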

Finally, we model the horizon as a random variable. The temporal states and actions are considered up to the end of the horizon $T$. However, the horizon is generally unknown; we thus model $T$ itself as a random variable. Combining all this information, we can define the SSP MDP as a probabilistic model.

### 3.2 Mixture of finite MDPs

In this section we define the SSP MDP in terms of a mixture of finite MDPs with only a final cost variable. For every horizon $T$, the finite MDP is given by the joint $P(c, s_{0:T}, a_{0:T} \mid T; \pi)$. Note that we dropped the time index for $c$ since there is only one cost variable now. This model can be factorized as

$$P(c, s_{0:T}, a_{0:T} \mid T; \pi) = P(c \mid a_T, s_T)\, P(a_0 \mid s_0; \pi)\, P(s_0) \prod_{t=1}^{T} P(a_t \mid s_t; \pi)\, P(s_t \mid a_{t-1}, s_{t-1}).$$

To reason about the full MDP, we consider the mixture model given by the joint probability distribution

$$P(c, s_{0:T}, a_{0:T}, T; \pi) = P(c, s_{0:T}, a_{0:T} \mid T; \pi)\, P(T)$$

where $P(T)$ is a prior over the total time, which we choose to be a flat (uniform) prior.
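The factorization above can be sampled ancestrally: draw the horizon from the prior, then unroll one finite MDP of that length. The sketch below assumes a hypothetical 4-state chain, a uniform prior over $T$ up to a maximum, and illustrative names throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, goal, T_max = 4, 2, 3, 10
T_prob = np.zeros((n_states, n_actions, n_states))
for s in range(n_states - 1):
    T_prob[s, 0, s + 1] = 0.9     # action 0: move towards the goal, may slip
    T_prob[s, 0, s] = 0.1
    T_prob[s, 1, s] = 1.0         # action 1: stay put
T_prob[goal, :, goal] = 1.0
pi = np.full((n_states, n_actions), 0.5)               # stochastic policy pi(a | s)
P_c = np.ones((n_states, n_actions)); P_c[goal] = 0.0  # P(c = 1 | s, a)

def sample_mixture_component():
    horizon = int(rng.integers(1, T_max + 1))     # T ~ flat prior P(T)
    s = 0                                         # s_0 ~ P(s_0): start in state 0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])        # a_t ~ P(a_t | s_t; pi)
        s = rng.choice(n_states, p=T_prob[s, a])  # s_t ~ P(s_t | a_{t-1}, s_{t-1})
    a = rng.choice(n_actions, p=pi[s])            # a_T at the final state
    c = bool(rng.random() < P_c[s, a])            # c ~ P(c | a_T, s_T)
    return horizon, s, c
```

With these cost probabilities, the final cost variable is 1 exactly when the sampled trajectory ends outside the goal state.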

### 3.3 Computing an optimal policy

Our objective is to find a policy that minimizes the expected cost. As in policy iteration, we do not assume any knowledge about the initial state. Expectation-Maximization (EM)[^2] can be used to find the optimal parameters of our model: the policy $\pi$. The E-step will, for a given $\pi$, compute a posterior over state-action sequences. The M-step then adapts the model parameters to optimize the expected likelihood with respect to the quantities calculated in the E-step.

[^2]: An expectation-maximization algorithm can be viewed as performing free-energy minimization [21, 16]. In the E-step, the free-energy is computed, and the M-step updates the parameters to minimize the free-energy.

#### 3.3.1 E-step: a backwards pass in all finite MDPs.

We use the shorthand $p(j \mid i, a) = P(s_{t+1} = j \mid s_t = i, a_t = a)$ and $\pi_{ai} = P(a_t = a \mid s_t = i; \pi)$. Further, as a 'base' for backward propagation, we define

$$\beta_0(i) = P(c = 1 \mid s_T = i; \pi) = \sum_a P(c = 1 \mid a_T = a, s_T = i)\, \pi_{ai}.$$

This is the immediate cost when following a policy $\pi$: the expected cost if there is only one time-step remaining. Then, we can recursively compute all the other backward messages. We use the index $\tau$ to indicate a backwards counter, meaning that $\tau = T - t$, where $T$ is the total (unknown) horizon length. The messages are computed as

$$\beta_\tau(i) = P(c = 1 \mid s_{T - \tau} = i; \pi) = \sum_j p(j \mid i; \pi)\, \beta_{\tau - 1}(j).$$

where $p(j \mid i; \pi) = \sum_a p(j \mid i, a)\, \pi_{ai}$. Intuitively, the backward message $\beta_\tau(i)$ is the expected cost if a cost is incurred at the last time step only. So $\beta_2(i)$ is the expected cost if the agent follows the policy for two time-steps and only incurs a cost after that. Using these messages, we can compute a value function dependent on time, actions and states, given as:

$$q_\tau(a, i) = P(c = 1 \mid a_t = a, s_t = i, T = t + \tau; \pi) = \begin{cases} \sum_j p(j \mid i, a)\, \beta_{\tau - 1}(j) & \tau \geq 1 \\ P(c = 1 \mid a_T = a, s_T = i) & \tau = 0. \end{cases}$$

Marginalizing out time, we get the state-action value-function

$$P(c = 1 \mid a_t = a, s_t = i; \pi) = \frac{1}{Z} \sum_\tau P(T = t + \tau)\, q_\tau(a, i)$$

where $Z$ is a normalization constant. This quantity is the probability of incurring a maximal cost given a state and action. It is similar to the $Q$-function computed in policy iteration.
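This backward pass can be sketched as follows, on a hypothetical 4-state chain with a uniform policy and messages truncated at `K`; all names are illustrative, and the flat prior over $T$ reduces the time marginalization to a plain average.

```python
import numpy as np

n_states, n_actions, goal, K = 4, 2, 3, 10  # K: number of messages to compute
T = np.zeros((n_states, n_actions, n_states))
for s in range(n_states - 1):
    T[s, 0, s + 1] = 0.9; T[s, 0, s] = 0.1  # move towards the goal, may slip
    T[s, 1, s] = 1.0                        # stay put
T[goal, :, goal] = 1.0
P_c = np.ones((n_states, n_actions)); P_c[goal] = 0.0  # P(c = 1 | s, a)
pi = np.full((n_states, n_actions), 0.5)               # uniform policy pi(a | i)

# beta_0(i) = sum_a P(c = 1 | a, i) pi(a | i)
beta = [np.einsum('ia,ia->i', P_c, pi)]
p_pi = np.einsum('iaj,ia->ij', T, pi)       # p(j | i; pi)
q = [P_c.T]                                 # q_0(a, i) = P(c = 1 | a_T = a, s_T = i)
for tau in range(1, K):
    beta.append(p_pi @ beta[-1])                        # beta_tau(i)
    q.append(np.einsum('iaj,j->ai', T, beta[tau - 1]))  # q_tau(a, i)

# Marginalize out time under a flat prior over T (up to normalization):
Q_prob = np.mean(q, axis=0)  # ~ P(c = 1 | a_t = a, s_t = i; pi)
```

Because the goal is absorbing and cost-free, all messages and q-values at the goal state stay exactly zero.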

#### 3.3.2 M-step: the policy improvement step.

The standard M-step in an EM-algorithm maximizes the expected complete log-likelihood with respect to the new parameters $\pi'$. Given that the optimal policy for an MDP is deterministic, a greedy M-step can be used. However, our goal here is to minimize the likelihood, as it refers to the probability of receiving a maximal cost. This can be done as

$$\pi'(i) = \operatorname*{argmin}_{a}\, P(c = 1 \mid a_t = a, s_t = i; \pi). \tag{1}$$

This update converges much faster than a standard M-step. A standard M-step could be used here to obtain a stochastic policy; however, this is unnecessary since the optimal policy is deterministic. Note that there is an infinite number of stochastic policies but a finite number of deterministic ones. In conclusion, a greedy M-step converges faster while still guaranteeing an optimal policy.
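The greedy update in Eq. (1) then amounts to a single argmin per state. A sketch with hypothetical E-step output:

```python
import numpy as np

# Hypothetical E-step output: Q_prob[a, s] ~ P(c = 1 | a_t = a, s_t = s; pi)
Q_prob = np.array([[0.2, 0.6],
                   [0.9, 0.4]])
best = Q_prob.argmin(axis=0)           # greedy action per state (Eq. 1)
pi_new = np.zeros(Q_prob.T.shape)      # deterministic policy pi'(a | s)...
pi_new[np.arange(Q_prob.shape[1]), best] = 1.0  # ...as a one-hot distribution
```

Writing the deterministic policy as a one-hot distribution keeps it compatible with the stochastic-policy parameterization used in the E-step.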

## 4 Connections between the two views

### 4.1 Exact relationship between policy iteration and planning as probabilistic inference

The messages $\beta_\tau$ computed during backward propagation are exactly equal to the value functions for a single MDP of finite time. The full value function can therefore be written as the sum of the $\beta$s,

$$V^\pi(i) = \sum_T \beta_T(i)$$

since the prior over time is uniform. If $P(T)$ is not a uniform distribution, this results in a weighted mixture rather than a plain sum. The same applies to the Q-value function:

$$Q^\pi(a, i) = \sum_T q_T(a, i).$$

Hence, the E-step essentially performs a policy evaluation which yields the classical value function. Given this relation to policy evaluation, the M-step performs an operation exactly equivalent to standard policy improvement. Thus, the EM-algorithm using exact inference is equivalent to Policy Iteration but computes the necessary quantities differently.

One unanswered question is when to stop computing the backward messages. In prior work, messages are computed only up to a fixed maximum number. From this perspective, the planning-as-inference algorithm presented here is equivalent to the so-called truncated policy iteration algorithm, as opposed to the more common $\epsilon$-greedy version.

In the policy evaluation step, one iterates through the state space to update the value of every state until a termination criterion is met. An $\epsilon$-greedy criterion means that we stop iterating through the state space once the maximum change in $V(s)$ for any $s$ is smaller than a small positive number $\epsilon$. In truncated policy iteration, however, we iterate through the state space a fixed number of times and then perform the policy improvement step. The probabilistic inference algorithm presented in this paper is equivalent to truncated policy iteration if we restrict the maximum number of messages to be computed.
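The two stopping rules can be contrasted on a toy evaluation problem: a hypothetical 2-state chain under a fixed policy, where state 0 reaches the absorbing, cost-free goal state 1 with probability 0.5 per step at unit cost (all names are illustrative).

```python
import numpy as np

def backup(V):
    # One sweep of V(s) <- sum_s' T(s, pi(s), s') [C + V(s')] for this chain.
    return np.array([0.5 * (1.0 + V[1]) + 0.5 * (1.0 + V[0]), 0.0])

def evaluate_eps(eps=1e-6):
    # epsilon criterion: sweep until the largest change drops below eps.
    V = np.zeros(2)
    while True:
        V_new = backup(V)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

def evaluate_truncated(k):
    # truncated criterion: stop after exactly k sweeps.
    V = np.zeros(2)
    for _ in range(k):
        V = backup(V)
    return V

V_eps = evaluate_eps()           # converges to the true value V(0) = 2
V_trunc = evaluate_truncated(5)  # coarser estimate after a fixed budget
```

The truncated estimate undershoots the converged one; restricting the number of backward messages has exactly this effect on the inference algorithm.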

### 4.2 World states vs temporal states

In dynamic programming, one reasons over the world states. In the grid world example in Figure 1, a world state corresponds to a grid cell; this grid world has 16 world states. In probabilistic inference, one reasons about a temporal state: a random variable over all world states. The number of temporal states depends on how many time-steps the agent acts in the environment, which is often unknown beforehand. An illustration of the difference is given in Fig. 2.

### 4.3 Policies, plans and probabilistic plans

A classical planning algorithm computes a plan: a sequence of actions. An algorithm like A* or Dijkstra's algorithm can be used to find the optimal path from a starting state to a goal state, given a deterministic world. Crucially, this solution can be computed offline (without interaction with the environment). In a stochastic world, this does not suffice, since the agent cannot predict which states it will end up in. However, one can use deterministic planning algorithms in stochastic environments if the path is re-planned online at every time-step. Such determinization-based methods have found success in solving planning-under-uncertainty problems; the well-known FF-replan algorithm is one example.

Active inference approaches compute a probabilistic plan. The active inference literature calls this a policy; however, we use a different term to avoid confusion.[^3] In active inference, the agent computes a finite plan while interacting with the environment. However, rather than assuming a deterministic world (like FF-replan), the transition probabilities are taken into account. This can be shown to compute the optimal solution to an MDP (when planning online). We thus refer to it as a probabilistic plan: a plan computed while taking the transition probabilities into account.

[^3]: The distinction between a plan and a policy when using active inference has been briefly discussed in [20]. Additionally, other methods computing plans as probabilistic inference were proposed before active inference [1, 33].

Finally, a policy is a mapping from states to actions, i.e. the agent has a preferred action to take for every state. Policies can be stochastic or time-dependent; however, for an SSP MDP the optimal policy is deterministic and independent of time. An agent can compute a policy offline and use it online without needing any additional computation while interacting with the environment. The difference between a plan and a policy is illustrated in Fig. 1.

To summarize, a plan or a probabilistic plan can only be used for online planning, since the outcome of an action is inherently uncertain. Probabilistic plans (as used in active inference) find an optimal solution when used to plan online. A policy also provides an optimal solution and can be computed offline or online.

## 5 Discussion

In this paper we present a novel approach to solve a stochastic shortest path Markov decision process (SSP MDP) as probabilistic inference. The SSP MDP generalizes many models, including finite- and infinite-horizon MDPs. Crucially, the dynamic programming algorithms (such as policy iteration) classically used to solve an SSP MDP are valid for indefinite horizons (finite but of unknown length); this is not the case for active inference approaches.

The exact connections between solving an MDP using policy iteration and the presented algorithm are discussed. Afterwards, we discussed the gap between solving an MDP in active inference and the approaches in the artificial intelligence community. This included the difference between world states and temporal states, the difference between plans, probabilistic plans and policies. An interesting question now is, which approach is more appropriate? This depends on the problem at hand and whether it can be solved online or offline.

Online and offline planning. As discussed in Section 4.3, a policy is a mapping from states to actions and can be used for offline and online planning. Computing a policy is computationally expensive; however, a look-up is very cheap. Thus, if one operates in an environment where the transition and cost functions do not change, it is best to compute an optimal policy offline and then use it online (while interacting with the environment). This is the case for many planning and scheduling problems, such as a set of elevators operating in sync, task-level planning in robotics, multi-objective planning [23, 11] and playing games [4, 25]. The challenge in these problems is often that the state space is incredibly large, so approximations are needed. However, the problem is fully observable and the cost and transition models are static; the rules of chess do not change halfway through, for instance.

If the transition or cost functions vary while interacting with the environment (e.g. [24, 10]), an offline solution is not optimal. In this case, the agent can plan online by re-evaluating a policy or computing probabilistic plans (as done in active inference). Computing the latter is cheaper and requires less memory: a probabilistic plan is a distribution over actions up to a time horizon, while a (finite) policy is a conditional distribution over all world states in $S$. For any time-step, the posterior over the action is directly related to the policy.

Consider the work in [29, 24]. In both cases a robot operates in an environment susceptible to change. If the environment changes, the agent can easily construct a new model by varying the cost or transition function, but it needs to recompute a solution. In [29] the authors recompute a policy at every time-step, while in [24] a probabilistic plan is computed using active inference. Since in both cases the solution is recomputed at every time-step, active inference is preferred, as it requires less memory and can be computationally cheaper. On the other hand, if the environment changes only occasionally, computing a policy might remain preferable.

To conclude, if the transition and cost functions ($T$ and $C$) are static, it is preferable to compute a policy offline. If $T$ and $C$ change occasionally, one may still compute an offline policy and recompute it only when a change occurs. However, if the environment is highly dynamic, computing a probabilistic plan (using active inference) is preferable to recomputing a policy at every time-step.

## References

•  H. Attias (2003) Planning by probabilistic inference.. In AISTATS, Cited by: footnote 3.
•  D. P. Bertsekas and J. N. Tsitsiklis (1991) An analysis of stochastic shortest path problems. Mathematics of Operations Research 16 (3), pp. 580–595. Cited by: §1, footnote 1.
•  D. P. Bertsekas and J. N. Tsitsiklis (1995) Neuro-dynamic programming: an overview. In Proceedings of 1995 34th IEEE conference on decision and control, Vol. 1, pp. 560–564. Cited by: §2.0.1.
•  M. Campbell, A. J. Hoane, and F. Hsu (2002) Deep blue. Artif. Intell. 134, pp. 57–83. Cited by: §5.
•  T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to algorithms. MIT press. Cited by: §4.3.
•  R. H. Crites, A. G. Barto, et al. (1996) Improving elevator performance using reinforcement learning. Advances in neural information processing systems, pp. 1017–1023. Cited by: §5.
•  F. d’Epenoux (1963) A probabilistic production and inventory problem. Management Science 10 (1), pp. 98–108. Cited by: footnote 1.
•  L. Da Costa, T. Parr, N. Sajid, S. Veselic, V. Neacsu, and K. Friston (2020) Active inference on discrete state-spaces: a synthesis. arXiv preprint arXiv:2001.07203. Cited by: §1.
•  L. Da Costa, N. Sajid, T. Parr, K. Friston, and R. Smith (2020) The relationship between dynamic programming and active inference: the discrete, finite-horizon case. arXiv preprint arXiv:2009.08111. Cited by: §1, §1, §2.0.1.
•  P. Duckworth, B. Lacerda, and N. Hawes (2021) Time-bounded mission planning in time-varying domains with semi-mdps and gaussian processes. Cited by: §5.
•  K. Etessami, M. Kwiatkowska, M. Y. Vardi, and M. Yannakakis (2007) Multi-objective model checking of markov decision processes. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 50–65. Cited by: §5.
•  V. Forejt, M. Kwiatkowska, G. Norman, and D. Parker (2011) Automated verification techniques for probabilistic systems. In International school on formal methods for the design of computer, communication and software systems, pp. 53–113. Cited by: footnote 1.
•  K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo (2017) Active inference: a process theory. Neural computation 29 (1), pp. 1–49. Cited by: §1.
•  I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska (2012) A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307. Cited by: footnote 1.
•  R. Kaplan and K. J. Friston (2018) Planning and navigation as active inference. Biological cybernetics 112 (4), pp. 323–343. Cited by: §1.
•  D. Koller and N. Friedman (2009) Probabilistic graphical models: principles and techniques. MIT press. Cited by: footnote 2.
•  A. Kolobov (2012) Planning with markov decision processes: an ai perspective. Vol. 6, Morgan & Claypool Publishers. Cited by: §1, §1, §1, §2.0.1, §2, §2.
•  A. Kumar, S. Zilberstein, and M. Toussaint (2015) Probabilistic inference techniques for scalable multiagent decision making. Journal of Artificial Intelligence Research 53, pp. 223–270. Cited by: §3.
•  B. Lacerda, F. Faruq, D. Parker, and N. Hawes (2019) Probabilistic planning with formal performance guarantees for mobile service robots. The International Journal of Robotics Research 38 (9), pp. 1098–1123. Cited by: §5.
•  B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley (2020) On the relationship between active inference and control as inference. In International Workshop on Active Inference, pp. 3–11. Cited by: footnote 3.
•  K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: footnote 2.
•  J. L. Nazareth and R. B. Kulkarni (1986) Linear programming formulations of markov decision processes. Operations research letters 5 (1), pp. 13–16. Cited by: footnote 1.
•  M. Painter, B. Lacerda, and N. Hawes (2020) Convex hull monte-carlo tree-search. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30, pp. 217–225. Cited by: §5.
•  C. Pezzato, C. Hernandez, and M. Wisse (2020) Active inference and behavior trees for reactive action planning and execution in robotics. arXiv preprint arXiv:2011.09756. Cited by: §5, §5.
•  D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §5.
•  R. S. Sutton, A. G. Barto, et al. (1998) Introduction to reinforcement learning. Vol. 135, MIT press Cambridge. Cited by: §1.
•  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: footnote 1.
•  P. S. Thomas and E. Brunskill (2017) Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643. Cited by: footnote 1.
•  M. Tomy, B. Lacerda, N. Hawes, and J. L. Wyatt (2020) Battery charge scheduling in long-life autonomous mobile robots via multi-objective decision making under uncertainty. Robotics and Autonomous Systems 133, pp. 103629. Cited by: §5.
•  M. Toussaint, L. Charlin, and P. Poupart (2008) Hierarchical pomdp controller optimization by likelihood maximization.. In UAI, Vol. 24, pp. 562–570. Cited by: §3.
•  M. Toussaint, S. Harmeling, and A. Storkey (2006) Probabilistic inference for solving (po) mdps. University of Edinburgh, School of Informatics Research Report EDI-INF-RR-0934. Cited by: §3.
•  M. Toussaint and A. Storkey (2006) Probabilistic inference for solving discrete and continuous state markov decision processes. In Proceedings of the 23rd international conference on Machine learning, pp. 945–952. Cited by: §3, §4.1.
•  D. Verma and R. P. Rao (2006) Goal-based imitation as probabilistic inference over graphical models. In Advances in neural information processing systems, pp. 1393–1400. Cited by: footnote 3.
•  S. W. Yoon, A. Fern, and R. Givan (2007) FF-replan: a baseline for probabilistic planning.. In ICAPS, Vol. 7, pp. 352–359. Cited by: §4.3, §4.3.

## Appendix 0.A Appendix: Illustrations

Figure 1: An illustration of a 4×4 grid world (right). The initial state is blue and the goal state is green. An illustration of a policy (left) and a plan (middle).

Figure 2: Annotated world states (left) and a posterior over a temporal state (right).