Markov decision processes (MDPs) model sequential decision-making in stochastic systems with nondeterministic choices. A policy, i.e., a decision strategy, resolves the nondeterminism in an MDP and induces a stochastic process. In this regard, an MDP represents an (infinite) family of stochastic processes. In this paper, for a given MDP, we aim to synthesize a policy that induces a process with maximum entropy among the ones whose paths satisfy a temporal logic specification.
Entropy, as an information-theoretic quantity, measures the unpredictability of outcomes in a random variable. Considering a stochastic process as an infinite sequence of (dependent) random variables, we define the entropy of a stochastic process as the joint entropy of these random variables, following the existing literature. Therefore, intuitively, our objective is to obtain a process whose paths satisfy a temporal logic specification in the most unpredictable way to an observer.
Typically, in an MDP, a decision-maker is interested in satisfying certain properties or accomplishing a task. Linear temporal logic (LTL) is a formal specification language that has been widely used to check the reliability of software, describe tasks for autonomous robots [8, 9], and verify the correctness of communication protocols. For example, in a robot navigation scenario, it allows one to specify tasks such as safety (never visit the region A), liveness (eventually visit the region A) and priority (first visit the region A, then B).
The entropy of paths of a (Markovian) stochastic process, a notion introduced in earlier work, quantifies the randomness of realizations with fixed initial and final states. We first extend the definition of the entropy of paths to realizations that reach a certain set of states, rather than a fixed final state. Then, we show that the entropy of a stochastic process is equal to the entropy of paths of the process, if the process has a finite entropy. The established relation provides a mathematical basis for the intuitive idea that maximizing the entropy of an MDP minimizes the predictability of paths.
We observe that the maximum entropy of an MDP under stationary policies may not exist, i.e., for any given level of entropy, using stationary policies, one can induce a process whose entropy is greater than that level. In this case, we say that the maximum entropy of the MDP is unbounded. Additionally, if there exists a process with the maximum entropy, the entropy of such a process can be finite or infinite. Hence, before attempting to synthesize a policy that maximizes the entropy of an MDP, we first verify whether there exists a policy that attains the maximum entropy.
The contributions of this paper are fourfold. First, we provide necessary and sufficient conditions on the structure of the MDP under which the maximum entropy of the MDP is finite, infinite or unbounded. We also present a polynomial-time algorithm to check whether the maximum entropy of an MDP is finite, infinite or unbounded. Second, we present a polynomial-time algorithm based on a convex optimization problem to synthesize a policy that maximizes the entropy of an MDP. Third, we show that maximizing the entropy of an MDP with non-infinite maximum entropy is equivalent to maximizing the entropy of paths of the MDP. Lastly, we provide a procedure to obtain a policy that maximizes the entropy of an MDP subject to a general LTL specification.
The applications of this theoretical framework range from motion planning and stochastic traffic assignments to software security. In a motion planning scenario, for security purposes, an autonomous robot might need to randomize its paths while carrying out a mission [12, 13]. In such a scenario, a policy synthesized by the proposed methods both provides probabilistic guarantees on the completion of the mission and minimizes the predictability of the robot’s paths through the use of online randomization mechanisms. Additionally, such a policy allows the robot to explore different parts of the environment and behave robustly against uncertainties in the environment. The proposed methods can also be used to distribute traffic assignments over a network, which is known as stochastic traffic assignment, as they promote the use of different paths. Finally, as shown in the literature, the maximum information that an adversary can leak from (deterministic) software, which is modeled as an MDP, can be quantified by computing the maximum entropy of the MDP.
Related Work. A preliminary version of this paper considered the entropy maximization problem for MDPs subject to expected reward constraints. This considerably extended version includes an additional section establishing the relation between the maximum entropy of an MDP and the entropy of paths of the MDP, detailed proofs for all theoretical results, and additional numerical examples.
The computation of the maximum entropy of an MDP was first considered in earlier work, where the authors present a robust optimization problem to compute the maximum entropy for an MDP with finite maximum entropy. However, due to the formulation of the problem, their approach does not allow additional constraints to be incorporated. Other references compute the maximum entropy of an MDP for special cases without providing a general algorithm.
Earlier work provides the necessary and sufficient conditions for an interval Markov chain (MC) to have a finite maximum entropy. Therefore, some of the results provided in this paper, e.g., the necessary and sufficient conditions for an MDP to have finite, unbounded or infinite maximum entropy, can be seen as an extension of those results.
In [19, 20], the authors study the problem of synthesizing a transition matrix with maximum entropy for an irreducible MC subject to graph constraints. The problem studied in this paper is considerably different from that problem since MDPs represent a more general model than MCs, and an MC induced from an MDP by a policy is not necessarily irreducible.
In related work, the authors maximize the entropy of a policy while keeping the expected total reward above a threshold. They claim that the entropy maximization problem is not convex. Their formulation is a special case of the convex optimization problem that we provide in this paper. Therefore, here, we also prove the convexity of their formulation.
The entropy of paths of absorbing MCs has been discussed in the literature, including a reference that establishes the equivalence between the entropy of paths and the entropy of an absorbing MC. We establish this relation for a general MC and show the connections to the maximum entropy of an MDP.
We also note that none of the above works discusses unbounded or infinite maximum entropy for an MDP or considers LTL to specify desired system properties.
Organization. We provide the preliminary definitions and formal problem statement in Sections II and III, respectively. We analyze the properties of the maximum entropy of an MDP and present an algorithm to synthesize a policy that maximizes the entropy of an MDP in Section IV. The relation between the maximum entropy of an MDP and the entropy of paths is established in Section V. We present a procedure to synthesize a policy that maximizes the entropy of an MDP subject to an LTL specification in Section VI. We provide numerical examples in Section VII and conclude with suggestions for future work in Section VIII. Proofs for all results are provided in Appendix A, and a procedure to synthesize a policy that maximizes the entropy of an MDP with infinite maximum entropy is presented in Appendix B.
Notation: For a set , we denote its power set and cardinality by and , respectively. For a matrix , we use and to denote the k-th power of and the -th component of the k-th power of , respectively. All logarithms are to the base 2 and the set denotes .
II-A Markov chains and Markov decision processes
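A Markov decision process (MDP) is a tuple $\mathbf{M} = (S, s_0, \mathcal{A}, \mathbb{P}, \mathcal{AP}, \mathcal{L})$ where $S$ is a finite set of states, $s_0 \in S$ is the initial state, $\mathcal{A}$ is a finite set of actions, $\mathbb{P} : S \times \mathcal{A} \times S \to [0,1]$ is a transition function such that $\sum_{t \in S} \mathbb{P}(s,a,t) = 1$ for all $s \in S$ and $a \in \mathcal{A}(s)$, $\mathcal{AP}$ is a set of atomic propositions, and $\mathcal{L} : S \to 2^{\mathcal{AP}}$ is a function that labels each state with a subset of atomic propositions.
We denote the transition probability $\mathbb{P}(s,a,t)$ by $\mathcal{P}_{s,a,t}$, and all available actions in a state $s \in S$ by $\mathcal{A}(s)$. The set of successor states for a state-action pair $(s,a)$ is defined as $Succ(s,a) := \{t \in S : \mathcal{P}_{s,a,t} > 0,\ a \in \mathcal{A}(s)\}$. The size of an MDP is the number of triples $(s,a,t)$ such that $\mathcal{P}_{s,a,t} > 0$.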
A Markov chain (MC) is an MDP such that $|\mathcal{A}| = 1$. We denote the transition function (matrix) for an MC by $\mathcal{P}$, and the set of successor states for a state $s \in S$ by $Succ(s)$. The expected residence time in a state $s \in S$ for an MC is defined as
$$\xi_s := \sum_{k=0}^{\infty} \mathcal{P}^k_{s_0,s}.$$
The expected residence time $\xi_s$ represents the expected number of visits to state $s$ starting from the initial state $s_0$. A state $s \in S$ is recurrent for an MC if and only if $\xi_s = \infty$, and is transient otherwise; it is stochastic if and only if it satisfies $|Succ(s)| > 1$, and is deterministic otherwise; and it is reachable if and only if $\xi_s > 0$, and is unreachable otherwise.
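A policy for an MDP $\mathbf{M}$ is a sequence $\pi = (d_1, d_2, d_3, \ldots)$ where each $d_k : S \times \mathcal{A} \to [0,1]$ is a function such that $\sum_{a \in \mathcal{A}(s)} d_k(s,a) = 1$ for all $s \in S$. A stationary policy is a policy of the form $\pi = (d, d, d, \ldots)$. For an MDP $\mathbf{M}$, we denote the set of all policies and all stationary policies by $\Pi(\mathbf{M})$ and $\Pi^S(\mathbf{M})$, respectively.
We denote the probability of choosing an action $a \in \mathcal{A}(s)$ in a state $s \in S$ under a stationary policy $\pi$ by $\pi_s(a)$. For an MDP $\mathbf{M}$, a stationary policy $\pi$ induces an MC denoted by $\mathbf{M}^{\pi}$. We refer to $\mathbf{M}^{\pi}$ as the induced MC and specify the transition matrix for $\mathbf{M}^{\pi}$ by $\mathcal{P}^{\pi}$, whose $(s,t)$-th component is given by
$$\mathcal{P}^{\pi}_{s,t} = \sum_{a \in \mathcal{A}(s)} \pi_s(a)\, \mathcal{P}_{s,a,t},$$
where $\mathcal{P}_{s,a,t}$ is the probability of moving from $s$ to $t$ under the action $a$.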
Throughout the paper, we assume that for a given MDP , for any state there exists an induced MC for which the state is reachable. This is a standard assumption for MDPs , which ensures that each state in the MDP is reachable under some policy.
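To illustrate how a stationary policy induces an MC, the following minimal Python sketch averages the transition probabilities of each state over the action probabilities of the policy. The nested-dictionary layout, the function name `induced_chain`, and the two-state MDP are assumptions made for this example and do not come from the paper.

```python
def induced_chain(transitions, policy):
    """Transition probabilities of the MC induced by a stationary policy.

    transitions[s][a] maps each successor state to its probability, and
    policy[s] maps each action available in s to its selection probability.
    """
    chain = {}
    for s, actions in transitions.items():
        row = {}
        for a, prob_a in policy[s].items():
            for t, p in actions[a].items():
                # Weight the action's transition by the probability of choosing it.
                row[t] = row.get(t, 0.0) + prob_a * p
        chain[s] = row
    return chain


# Hypothetical two-state MDP, used only for illustration.
transitions = {
    "s0": {"a1": {"s0": 0.5, "s1": 0.5}, "a2": {"s1": 1.0}},
    "s1": {"a1": {"s1": 1.0}},
}
policy = {"s0": {"a1": 0.4, "a2": 0.6}, "s1": {"a1": 1.0}}
chain = induced_chain(transitions, policy)
```

Each row of `chain` is again a probability distribution, because it is a convex combination of the rows selected by the policy.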
An infinite sequence of states generated in under a policy is called a path, starting from the initial state and satisfies for all . Any finite prefix of that ends in a state is a finite path fragment. We define the set of all paths and finite path fragments in under the policy by and , respectively.
We use the standard probability measure over the outcome set . For a path , let the sequence be the finite path fragment of length , and let denote the set of all paths in starting with the prefix . The probability measure defined on the smallest -algebra over that contains for all is the unique measure that satisfies
II-B The entropy of stochastic processes
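For a (discrete) random variable $X$, its support $\mathcal{X}$ defines a countable sample space from which $X$ takes a value $x \in \mathcal{X}$ according to a probability mass function (pmf) $p(x) := \Pr(X = x)$. The entropy of a random variable $X$ with countable support $\mathcal{X}$ and pmf $p(x)$ is defined as
$$H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
We use the convention that $0 \log 0 = 0$. Let $(X_0, X_1)$ be a pair of random variables with the joint pmf $p(x_0, x_1)$ and the support $\mathcal{X} \times \mathcal{X}$. The joint entropy of $(X_0, X_1)$ is
$$H(X_0, X_1) := -\sum_{x_0 \in \mathcal{X}} \sum_{x_1 \in \mathcal{X}} p(x_0, x_1) \log p(x_0, x_1),$$
and the conditional entropy of $X_1$ given $X_0$ is
$$H(X_1 \mid X_0) := -\sum_{x_0 \in \mathcal{X}} \sum_{x_1 \in \mathcal{X}} p(x_0, x_1) \log p(x_1 \mid x_0).$$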
The definitions of the joint and conditional entropies extend to collections of random variables. A discrete stochastic process $\mathbb{X}$ is a discrete time-indexed sequence of random variables, i.e., $\mathbb{X} = \{X_k : k = 0, 1, 2, \ldots\}$.
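The joint and conditional entropies are linked by the chain rule, which states that the joint entropy of a pair equals the entropy of the first variable plus the conditional entropy of the second given the first. The short Python sketch below checks this numerically on a hypothetical joint pmf; the pmf values and the variable names are illustrative assumptions.

```python
from math import log2


def entropy(pmf):
    """Entropy in bits of a pmf given as a dict, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


# Hypothetical joint pmf of a pair of random variables (first, second).
joint = {("x1", "y1"): 0.5, ("x1", "y2"): 0.25, ("x2", "y1"): 0.25}

# Marginal pmf of the first variable.
marginal = {}
for (x, _), p in joint.items():
    marginal[x] = marginal.get(x, 0.0) + p

h_joint = entropy(joint)
h_first = entropy(marginal)
# Conditional entropy of the second variable given the first, from the definition.
h_second_given_first = -sum(
    p * log2(p / marginal[x]) for (x, _), p in joint.items() if p > 0
)
```

Here `h_joint` agrees with `h_first + h_second_given_first` up to floating-point error.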
(Entropy of a stochastic process) The entropy of a stochastic process $\mathbb{X}$ is defined as
$$H(\mathbb{X}) := \lim_{k \to \infty} H(X_0, X_1, \ldots, X_k). \tag{7}$$
Note that this definition is different from the entropy rate of a stochastic process, which is defined as $\lim_{k \to \infty} \frac{1}{k} H(X_0, X_1, \ldots, X_k)$ when the limit exists. The limit in (7) either converges to a non-negative real number or diverges to positive infinity.
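An MC $\mathbf{M}$ is equipped with a discrete stochastic process $\{X_k\}$ where each $X_k$ is a random variable over the state space $S$. For a given k-dimensional pmf $p(x_0, x_1, \ldots, x_k)$, this process respects the Markov property, i.e., $p(x_k \mid x_{k-1}, \ldots, x_0) = p(x_k \mid x_{k-1})$ for all $k \geq 1$. Then, by the chain rule for joint entropy, the entropy of a Markov chain $\mathbf{M}$ is given by
$$H(\mathbf{M}) = H(X_0) + \sum_{k=1}^{\infty} H(X_k \mid X_{k-1}). \tag{8}$$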
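For an MDP $\mathbf{M}$, a policy $\pi$ induces a discrete stochastic process. We denote the entropy of an MDP $\mathbf{M}$ under a policy $\pi$ by $H(\mathbf{M}, \pi)$. Using the next proposition, we restrict our attention to stationary policies for maximizing the entropy of an MDP.
The following equality holds:
$$\sup_{\pi \in \Pi(\mathbf{M})} H(\mathbf{M}, \pi) = \sup_{\pi \in \Pi^S(\mathbf{M})} H(\mathbf{M}, \pi), \tag{9}$$
where $\Pi(\mathbf{M})$ and $\Pi^S(\mathbf{M})$ denote the sets of all policies and all stationary policies, respectively.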
If the supremum in (9) is infinite, the set of stationary policies may not be sufficient to attain the supremum while a non-stationary policy can attain it. In particular, there exists a family of distributions that are defined over a countable support and have infinite entropy (see equation (7) in the cited reference). It can be shown that for some MDPs, there exists a non-stationary policy that induces a stochastic process with such a probability distribution, and hence has infinite entropy, while stationary policies can only induce stochastic processes with finite entropies.
Footnote 1: A preliminary version of this paper relied on Proposition 36 from the cited reference. This proposition is not valid in general. Here, we provide the corrected results by defining the maximum entropy of an MDP over stationary policies.
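(Maximum entropy of an MDP) The maximum entropy of an MDP $\mathbf{M}$ is
$$H(\mathbf{M}) := \sup_{\pi \in \Pi^S(\mathbf{M})} H(\mathbf{M}, \pi).$$
A policy $\pi^\star \in \Pi^S(\mathbf{M})$ maximizes the entropy of an MDP $\mathbf{M}$ if $H(\mathbf{M}, \pi^\star) = H(\mathbf{M})$. Finally, we define the properties of the maximum entropy of an MDP as follows.
(The properties of the maximum entropy) The maximum entropy of an MDP $\mathbf{M}$ is
finite, if and only if $H(\mathbf{M}) = \max_{\pi \in \Pi^S(\mathbf{M})} H(\mathbf{M}, \pi) < \infty$;
infinite, if and only if $H(\mathbf{M}) = \max_{\pi \in \Pi^S(\mathbf{M})} H(\mathbf{M}, \pi) = \infty$;
unbounded, if and only if the following two conditions hold:
$$H(\mathbf{M}) = \sup_{\pi \in \Pi^S(\mathbf{M})} H(\mathbf{M}, \pi) = \infty, \tag{13}$$
$$H(\mathbf{M}, \pi) < \infty \quad \text{for all } \pi \in \Pi^S(\mathbf{M}). \tag{14}$$
Although it is not defined here, there is a fourth possible property which is unachievable finite maximum entropy, i.e., $H(\mathbf{M}) < \infty$ and $H(\mathbf{M}, \pi) < H(\mathbf{M})$ for all $\pi \in \Pi^S(\mathbf{M})$. In Theorem 1, we show that it is not possible for the maximum entropy of an MDP to have this property.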
II-C Linear temporal logic
We employ linear temporal logic (LTL) to specify tasks and refer the reader to  for the syntax and semantics of LTL.
An LTL formula is built up from a set $\mathcal{AP}$ of atomic propositions, logical connectives such as conjunction ($\land$) and negation ($\lnot$), and temporal modal operators such as always ($\square$) and eventually ($\lozenge$). An infinite sequence of subsets of $\mathcal{AP}$ defines an infinite word, and an LTL formula is interpreted over infinite words on $2^{\mathcal{AP}}$. We denote by $w \models \varphi$ that a word $w$ satisfies an LTL formula $\varphi$.
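A deterministic Rabin automaton (DRA) is a tuple $A = (Q, q_0, \Sigma, \delta, Acc)$ where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to Q$ is the transition relation, and $Acc \subseteq 2^Q \times 2^Q$ is the set of accepting state pairs.
A run of a DRA $A$, denoted by $\sigma = q_0 q_1 q_2 \ldots$, is an infinite sequence of states in $A$ such that for each $i \geq 0$, $q_{i+1} = \delta(q_i, p)$ for some $p \in \Sigma$. A run $\sigma$ is accepting if there exists a pair $(J, K) \in Acc$ and an $n \geq 0$ such that (i) for all $m \geq n$ we have $q_m \notin J$, and (ii) there exist infinitely many $k$ such that $q_k \in K$.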
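The Rabin acceptance condition can be checked mechanically for ultimately periodic words, i.e., words consisting of a finite prefix followed by a cycle repeated forever. The Python sketch below is one way to do so under stated assumptions: the function names, the representation of each accepting pair as (states to avoid eventually, states to visit infinitely often), and the two-state automaton over the stand-in letters "a" and "b" are all hypothetical.

```python
def infinitely_visited(delta, q0, prefix, cycle):
    """States visited infinitely often on the word prefix . cycle^omega.

    delta maps (state, letter) to the next state; cycle must be non-empty.
    """
    q = q0
    for letter in prefix:
        q = delta[(q, letter)]
    start_index = {}  # state at the start of a cycle pass -> index of that pass
    visited_per_pass = []
    while q not in start_index:
        start_index[q] = len(visited_per_pass)
        visited = set()
        for letter in cycle:
            q = delta[(q, letter)]
            visited.add(q)
        visited_per_pass.append(visited)
    # From the first repeated start state onward, the passes repeat forever.
    return set().union(*visited_per_pass[start_index[q]:])


def rabin_accepts(inf_states, accepting_pairs):
    """True if some pair (avoid, visit) has no avoid state and some visit state in inf_states."""
    return any(
        not (inf_states & avoid) and bool(inf_states & visit)
        for avoid, visit in accepting_pairs
    )


# Hypothetical DRA that tracks whether the last letter read was "a".
delta = {("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "a"): "q1", ("q1", "b"): "q0"}
pairs = [({"q0"}, {"q1"})]  # eventually stay out of q0 while visiting q1 forever

accepted = rabin_accepts(infinitely_visited(delta, "q0", ["b"], ["a"]), pairs)
rejected = rabin_accepts(infinitely_visited(delta, "q0", [], ["a", "b"]), pairs)
```

With this automaton, the word that eventually reads only "a" is accepted, while the word alternating "a" and "b" is rejected because it returns to `q0` infinitely often.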
For any LTL formula built up from , a DRA can be constructed with input alphabet that accepts all and only words over that satisfy .
For an MDP under a policy , a path generates a word where for all . With a slight abuse of notation, we use to denote the word generated by . For an LTL formula , the set is measurable . We define
as the probability of satisfying the LTL formula for an MDP under the policy .
III Problem Statement
The first problem we study concerns the synthesis of a policy that maximizes the entropy of an MDP.
(Entropy Maximization) For a given MDP , provide an algorithm to verify whether there exists a policy such that . If such a policy exists, provide an algorithm to synthesize it. If it does not exist, provide a procedure to synthesize a policy such that for a given constant .
For an MDP , the synthesis of a policy such that allows one to induce a stochastic process with the desired level of entropy, even if there exists no stationary policy that maximizes the entropy of .
In the second problem, we introduce linear temporal logic (LTL) specifications to the framework. In particular, we consider the problem of synthesizing a policy that induces a stochastic process with maximum entropy whose paths satisfy a given LTL formula with desired probability. The formal statement of the second problem is deferred to Section VI since it requires the introduction of additional notations.
IV Entropy maximization for MDPs
In this section, we focus on the entropy maximization problem. We refer to a policy as an optimal policy for an MDP if it maximizes the entropy of the MDP.
IV-A The entropy of MCs versus MDPs
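For an MC $\mathbf{M}$, the local entropy of a state $s \in S$ is defined as
$$L(s) := -\sum_{t \in S} \mathcal{P}_{s,t} \log \mathcal{P}_{s,t},$$
where $\mathcal{P}_{s,t}$ is the probability of moving from state $s$ to state $t$.
The following proposition characterizes the relationship between the local entropy of states and the entropy of an MC.
(Theorem 1 in the cited reference) For an MC $\mathbf{M}$ with expected residence times $\xi_s$,
$$H(\mathbf{M}) = \sum_{s \in S} \xi_s L(s). \tag{16}$$
An MC has a finite entropy if and only if all of its recurrent states have zero local entropy. That is, if and only if for all states $s \in S$, $\xi_s = \infty$ implies $L(s) = 0$. If the entropy of an MC is finite, each recurrent state has a single successor state, i.e., $|Succ(s)| = 1$. Consequently, recurrent states have no contribution to the sum in (8). In this case, we take the sum in (16) only over the transient states.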
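The relation between residence times and local entropies can be checked numerically. The Python sketch below accumulates, step by step, the probability of being in each state multiplied by the local entropy of that state, which by the chain rule and the Markov property gives the joint entropy of the first `horizon` transitions. For a chain with finite entropy this truncated sum approaches the total entropy as the horizon grows. The function names and the two-state chain are assumptions made for illustration.

```python
from math import log2


def local_entropy(row):
    """Local entropy in bits of a state whose outgoing probabilities are in row."""
    return -sum(p * log2(p) for p in row.values() if p > 0)


def chain_entropy(chain, initial_state, horizon):
    """Joint entropy in bits of the first `horizon` transitions of a Markov chain.

    chain[s] maps each successor of s to its transition probability.
    """
    dist = {initial_state: 1.0}  # distribution of the current state
    total = 0.0
    for _ in range(horizon):
        # Each step adds the expected local entropy under the current distribution.
        total += sum(mass * local_entropy(chain[s]) for s, mass in dist.items())
        nxt = {}
        for s, mass in dist.items():
            for t, p in chain[s].items():
                nxt[t] = nxt.get(t, 0.0) + mass * p
        dist = nxt
    return total


# Hypothetical chain: s0 stays or moves to the absorbing state s1 with equal probability.
chain = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s1": 1.0}}
approx_entropy = chain_entropy(chain, "s0", 60)
```

In this example the expected residence time in `s0` is 2 and its local entropy is 1 bit, so the truncated sum approaches 2 bits, while the absorbing state contributes nothing.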
For an MDP, different policies may induce stochastic processes with different entropies. For example, consider the MDP given in Fig. 1(a) and suppose that the action at state is taken with probability . If we let range over , then the entropy of the resulting stochastic processes ranges over . The optimal policy for this MDP is , which uniformly randomizes actions.
Unlike the MDP given in Fig. 1(a), the maximum entropy of an MDP is not generally achieved by a policy that chooses available actions at each state uniformly. For example, consider the MDP given in Fig. 1(b). The optimal policy for this MDP is , .
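A small numerical experiment shows why uniform randomization need not be optimal. The Python sketch below uses a hypothetical MDP, not necessarily the one in the figure: from the initial state, one action leads to an absorbing state, and the other leads to an intermediate state that then branches uniformly into two further absorbing states. If `p` is the probability of taking the branching action, the entropy of the induced process is the binary entropy of `p` plus `p` times one bit.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    return -sum(q * log2(q) for q in (p, 1.0 - p) if q > 0)


def process_entropy(p):
    """Entropy of the hypothetical MDP when the branching action has probability p.

    The initial state is visited once and contributes binary_entropy(p); the
    intermediate state is visited with probability p and contributes one bit.
    """
    return binary_entropy(p) + p * 1.0


# Grid search over the probability of the branching action.
grid = [i / 300 for i in range(301)]
best_p = max(grid, key=process_entropy)
best_value = process_entropy(best_p)
uniform_value = process_entropy(0.5)
```

The search selects `best_p` close to 2/3 rather than 1/2, which makes the three possible paths equally likely and yields an entropy of about log2(3) bits.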
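Examples given in Fig. 1 show that finding an optimal policy for an MDP may not be trivial. To analyze the maximum entropy of an MDP, we first obtain a compact representation of the maximum entropy as follows. For an MC $\mathbf{M}^{\pi}$ induced from an MDP $\mathbf{M}$ by a stationary policy $\pi$, with transition matrix $\mathcal{P}^{\pi}$, let the expected residence time in a state $s \in S$ be
$$\xi^{\pi}_s := \sum_{k=0}^{\infty} (\mathcal{P}^{\pi})^k_{s_0,s}.$$
Additionally, let the local entropy of a state $s \in S$ in $\mathbf{M}^{\pi}$ be $L^{\pi}(s) := -\sum_{t \in S} \mathcal{P}^{\pi}_{s,t} \log \mathcal{P}^{\pi}_{s,t}$. Then, the maximum entropy of $\mathbf{M}$ can be written as
$$H(\mathbf{M}) = \sup_{\pi \in \Pi^S(\mathbf{M})} \sum_{s \in S} \xi^{\pi}_s L^{\pi}(s), \tag{18}$$
where the supremum is taken over all stationary policies.
Note that the right-hand side of (18) can still be infinite or unbounded. We analyze the properties of the maximum entropy of MDPs in the next section.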
IV-B Properties of the maximum entropy of MDPs
The maximum entropy of an MDP can be infinite or unbounded even for simple cases. For example, consider MDPs given in Fig. 2. For the MDP shown in Fig. 2(a), let the action be taken with probability in state . Then, the expected residence time in state is equal to , and the entropy of the induced MC is given by
which satisfies as . Note also that if , the entropy of the induced MC is zero due to (16). Hence, the maximum entropy is unbounded, and there is no optimal stationary policy for this MDP.
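The unbounded case can be reproduced numerically with a minimal sketch. It assumes a hypothetical two-state MDP in which the initial state has a self-loop action and a leaving action that moves to an absorbing state, and it parameterizes the policy by the probability `delta` of the leaving action; this parameterization and the function names are assumptions for illustration. Under these assumptions the expected residence time in the initial state is `1 / delta` and its local entropy is the binary entropy of `delta`.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    return -sum(q * log2(q) for q in (p, 1.0 - p) if q > 0)


def escape_entropy(delta):
    """Entropy of the induced MC when the leaving action has probability delta."""
    if delta == 0.0:
        # The initial state is then recurrent and deterministic.
        return 0.0
    # Residence time (1 / delta) times local entropy of the initial state.
    return binary_entropy(delta) / delta


# Entropy for leaving probabilities 0.1, 0.01, ..., 0.00001.
values = [escape_entropy(10.0 ** -k) for k in range(1, 6)]
```

The entries of `values` grow without bound as `delta` shrinks, yet the entropy drops to zero at `delta = 0`, so no stationary policy attains the supremum.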
For the MDP given in Fig. 2(b), choosing a policy such that for , yields and , . Then, the maximum entropy of this MDP is infinite, and the maximum can be attained by any randomized policy.
Examples in Fig. 2 show that we should first verify the existence of optimal policies before attempting to synthesize them. We need the following definitions about the structure of MDPs to state the conditions that cause an MDP to have finite, infinite or unbounded maximum entropy.
A directed graph (digraph) is a tuple $G = (V, E)$ where $V$ is a set of vertices and $E$ is a set of ordered pairs of vertices. For a digraph $G$, a path from vertex $u$ to $v$ is a sequence of vertices $u = v_1, v_2, \ldots, v_n = v$ such that $(v_k, v_{k+1}) \in E$ for all $1 \leq k < n$. A digraph $G$ is strongly connected if for every pair of vertices $u, v \in V$, there is a path from $u$ to $v$, and from $v$ to $u$.
A sub-MDP of an MDP is a pair where and is a function such that (i) is non-empty for all , and (ii) and imply that . An end component is a sub-MDP such that the digraph induced by is strongly connected.
A maximal end component (MEC) in an MDP is an end component such that there is no end component with , and and for all .
A MEC in an MDP is bottom strongly connected (BSC) if for all , . For a given state , we define the set of all actions under which the MDP can leave the MEC as . Note that in a BSC MEC , for all .
For an MDP with MECs , let and . Then, there exists an induced MC for which a state is both stochastic and recurrent if and only if .
For an MDP with MECs , let and . Then, the following statements hold.
(i) is infinite if and only if there exists an induced MC for which a state is both stochastic and recurrent.
(ii) is unbounded if and only if for all , and there exists a MEC that is not bottom strongly connected.
(iii) is finite if and only if it is not infinite and not unbounded.
Proofs of the above results can be found in Appendix A. Informally, Theorem 1 states that for an MDP to have finite maximum entropy, all recurrent states of all MCs that are induced from the MDP by a stationary policy should be deterministic. Although the necessary conditions for the finiteness of the maximum entropy are quite restrictive, there are some special cases, such as stochastic shortest path (SSP) problems, where MDP structures actually satisfy the necessary conditions. Specifically, since all proper policies in SSP problems are guaranteed to reach an absorbing target state in finitely many time steps with probability 1, the problem of synthesizing a proper policy with maximum entropy has a finite solution.
If , then we have
We present Algorithm 1 which, for an MDP , verifies whether is finite, infinite or unbounded by checking the necessary conditions in Theorem 1. For , its MECs can be found in time , can be found in time, and the necessary conditions can be verified in time since no state can belong to more than one MEC. Hence, Algorithm 1 runs in polynomial-time in the size of .
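The kind of structural check described above can be sketched in Python as follows. This is not the authors' Algorithm 1 but an illustrative reconstruction from the statements of Theorem 1. It computes the MECs by repeatedly pruning actions that can leave a candidate set and splitting the candidate into strongly connected pieces, then reports "infinite" if some MEC state has more than one successor under its MEC actions, "unbounded" if instead some MEC state has an action that leaves its MEC, and "finite" otherwise. The data layout, the function names, and the use of a simple reachability-based decomposition are assumptions; only positive-probability successors are assumed to be listed.

```python
def reachable(start, succ):
    """States reachable from start (including start) in the graph succ."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def maximal_end_components(transitions):
    """MECs as dicts mapping each state to the set of actions kept in the MEC."""
    candidates = [{s: set(acts) for s, acts in transitions.items()}]
    mecs = []
    while candidates:
        sub = candidates.pop()
        changed = True
        while changed:  # drop actions that can leave sub, then states without actions
            changed = False
            for s in list(sub):
                keep = {a for a in sub[s] if all(t in sub for t in transitions[s][a])}
                if not keep:
                    del sub[s]
                    changed = True
                elif keep != sub[s]:
                    sub[s] = keep
                    changed = True
        if not sub:
            continue
        succ = {s: {t for a in acts for t in transitions[s][a]} for s, acts in sub.items()}
        reach = {s: reachable(s, succ) for s in sub}
        components, unassigned = [], set(sub)
        while unassigned:
            s = next(iter(unassigned))
            comp = {t for t in reach[s] if s in reach[t]}  # mutually reachable states
            components.append(comp)
            unassigned -= comp
        if len(components) == 1:
            mecs.append(sub)  # closed and strongly connected
        else:
            candidates.extend({s: set(sub[s]) for s in comp} for comp in components)
    return mecs


def classify_maximum_entropy(transitions):
    """Return 'finite', 'infinite' or 'unbounded' for the maximum entropy."""
    mecs = maximal_end_components(transitions)
    for mec in mecs:
        for s, acts in mec.items():
            if len({t for a in acts for t in transitions[s][a]}) > 1:
                return "infinite"  # a recurrent state can be made stochastic
    for mec in mecs:
        if any(set(transitions[s]) - acts for s, acts in mec.items()):
            return "unbounded"  # some MEC is not bottom strongly connected
    return "finite"


# Hypothetical MDPs, one for each outcome.
unbounded_mdp = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}}, "s1": {"stay": {"s1": 1.0}}}
infinite_mdp = {"s0": {"a": {"s0": 1.0}, "b": {"s1": 1.0}}, "s1": {"a": {"s0": 1.0}}}
finite_mdp = {
    "s0": {"a": {"s1": 1.0}, "b": {"s2": 1.0}},
    "s1": {"stay": {"s1": 1.0}},
    "s2": {"stay": {"s2": 1.0}},
}
```

The sketch relies on the assumption stated earlier that every state is reachable under some policy, and it favors brevity over the tighter running times that a dedicated MEC decomposition can achieve.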
IV-C Policy synthesis
We now provide algorithms to synthesize policies that solve the entropy maximization problem.
IV-C1 Finite maximum entropy
We first modify a given MDP by making all states in its MECs absorbing.
Let be an MDP such that , be MECs in , , and be the modified MDP that is obtained from by making all states absorbing, i.e., if , then for all in . Then, we have .
There is a one-to-one correspondence between the paths of and since all states in the set must have a single successor state in an MDP with finite maximum entropy due to Theorem 1. Moreover, for a given policy on , the policy induced by on is the same policy with , i.e. . Therefore, we synthesize an optimal policy for by synthesizing an optimal policy for .
The constraints (21b)-(21c) represent the balance between the “inflow” to and “outflow” from states. The constraints (21d) and (21e) are used to simplify the notation and define the variables and , respectively. The constraints (21f) and (21g) ensure that the expected residence time in the state-action pair and the probability of reaching the state are non-negative, respectively. We refer the reader to the cited references for further details about the constraints.
The above result indicates that a global maximum for the problem in (21a)-(21g) can be computed efficiently. We now introduce Algorithm 2 to synthesize an optimal policy for a given MDP with finite maximum entropy.
Let be an MDP such that , be MECs in , and . For the input (, Algorithm 2 returns an optimal policy for , i.e. .
Proofs of the above results can be found in Appendix A. Computationally, the most expensive step of Algorithm 2 is to solve the convex optimization problem (21a)-(21g). A solution whose objective value is arbitrarily close to the optimal value of (21a) can be computed in time polynomial in the size of the MDP via interior-point methods. Hence, the time complexity of Algorithm 2 is polynomial in the size of the MDP.
IV-C2 Unbounded maximum entropy
There is no optimal policy for this case due to (13)-(14). Therefore, for a given MDP and a constant , we synthesize a policy such that . Let be the union of all states in BSC MECs of , which can be found by using Algorithm 1. We modify the MDP by making all states absorbing and denote the modified MDP by . It can be shown that by using arguments similar to the ones used in the proof of Proposition 3. As the first approach, we solve a convex feasibility problem. Specifically, we remove the objective in (21a) and add the constraint
Recall from Theorem 1 that the unboundedness of the maximum entropy is caused by the existence of non-BSC MECs in . In particular, we can induce MCs with arbitrarily large entropy by making the expected residence time in states contained in non-BSC MECs arbitrarily large. As the second approach, we bound the expected residence time in states in and relax this bound according to the desired level of entropy. Specifically, we add the constraint
to the problem in (21a)-(21g). For the constraint (23), is a predefined value and limits the expected residence time in states . Let denote the maximum entropy of subject to the constraint (23). Then, we have
for , and for . Therefore, by choosing an arbitrarily large value, we can synthesize a policy that induces an MC with arbitrarily large entropy.
IV-C3 Infinite maximum entropy
V Relating the maximum entropy of an MDP with the probability distribution of paths
In this section, we establish a link between the maximum entropy of an MDP and the entropy of paths in an MC induced from by a stationary policy .
We begin with an example demonstrating the probability distribution of paths in an MC induced by a policy that maximizes the entropy of an MDP. Consider the MDP shown in Fig. 3(a) which is used in . The policy that maximizes the entropy of the MDP is given by , , . The MC induced by this policy is shown in Fig. 3(b). There are three paths that reach the MECs, i.e., and , of the MDP, each of which is followed with probability $1/3$ in the induced MC, i.e., the probability distribution of paths is uniform.
Note that for the example given in Fig. 3(a), the optimal policy that maximizes the entropy of the MDP is randomized, and action selection at each state is performed in an online manner. In particular, an agent that follows the optimal policy chooses its action at each stage according to the outcomes of an online randomization mechanism. Therefore, it does not commit to following a specific path at any state.
To rigorously establish the relation, illustrated in Fig. 3(a), between the maximum entropy of an MDP and the entropy of paths in an induced MC, we need the following definitions.
A strongly connected component (SCC) in an MC induced by a policy is a maximal set of states in such that for any ,, for some . A bottom strongly connected component (BSCC) in is an SCC such that for all , for all and for all .
In this section, for an induced MC