Given the importance of MDPs, it is hardly surprising that they have attracted significant interest in the theory community. Past research on MDPs included the study of complexity issues as well as the design and analysis of algorithms for solving MDPs [22, 24, 38, 39]. In this paper, we take a fresh look at one of the most familiar algorithms for MDPs: value iteration (VI). Introduced by Bellman in the 1950s, VI makes use of the optimality principle: the maximal $n$-step reward achievable from a state $s$, which we denote by $v_n(s)$, satisfies the recurrence
$$v_n(s) = \max_{a \in A} \Big( R(s,a) + \gamma \cdot \sum_{s' \in S} P(s,a)(s') \cdot v_{n-1}(s') \Big) \qquad (1)$$
with $v_0(s) = 0$. Consequently, a finite-horizon policy is optimal if and only if it chooses, in a situation when the current state is $s$ and $n$ steps are remaining, an action maximizing the right-hand side (RHS) of (1). Thus, to solve an MDP with a finite horizon $H$, the VI algorithm computes the values $v_n(s)$ for all $0 \le n \le H$ and all states $s$, by iterating the recurrence (1). Using these values, VI then outputs (using some tie-breaking rule) some policy satisfying the aforementioned optimality characterization. VI can be deployed also for infinite-horizon MDPs: one can effectively compute a horizon $H$ such that action $a$ is optimal in state $s$ for an infinite horizon if it maximizes the RHS of (1) for $n = H$. (In infinite-horizon MDPs, there is always an optimal stationary policy, which makes decisions based only on the current state.) This $H$ has a bit-size which is polynomial in the size of the original MDP, but the magnitude of $H$ can be exponential in the size of the MDP if the discount factor is given in binary.
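The finite-horizon procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary encoding (`P[s][a]` as a successor distribution, `R[s][a]` as a reward), the name `value_iteration`, the toy two-state MDP, and the tie-breaking rule (the first maximizing action wins) are hypothetical choices made here, not taken from the paper. Exact rationals are used so that comparisons between actions are not affected by rounding.

```python
from fractions import Fraction

def value_iteration(states, actions, P, R, gamma, horizon):
    """Iterate the Bellman recurrence `horizon` times.

    Returns (v, policy): v[n][s] is the maximal n-step reward from s, and
    policy[n][s] is an action maximizing the right-hand side with n steps left.
    """
    v = [{s: Fraction(0) for s in states}]   # zero steps left: value 0
    policy = [{}]                            # no decision with zero steps left
    for _ in range(horizon):
        prev, cur, choice = v[-1], {}, {}
        for s in states:
            best_a, best_val = None, None
            for a in actions:
                val = R[s][a] + gamma * sum(p * prev[t] for t, p in P[s][a].items())
                if best_val is None or val > best_val:   # ties: first action wins
                    best_a, best_val = a, val
            cur[s], choice[s] = best_val, best_a
        v.append(cur)
        policy.append(choice)
    return v, policy

# Hypothetical toy MDP: in s0, action "a" pays 1 and stays; "b" pays 0 and moves to s1.
states, actions = ["s0", "s1"], ["a", "b"]
P = {"s0": {"a": {"s0": Fraction(1)}, "b": {"s1": Fraction(1)}},
     "s1": {"a": {"s1": Fraction(1)}, "b": {"s1": Fraction(1)}}}
R = {"s0": {"a": 1, "b": 0}, "s1": {"a": 4, "b": 4}}
v, policy = value_iteration(states, actions, P, R, Fraction(1, 2), 3)
```

In this toy instance the optimal first action in `s0` depends on the number of remaining steps, which is exactly the information the decision problem asks about. The loop runs `horizon` times, so its running time is exponential in the length of a binary-encoded horizon.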
VI is one of the most popular MDP-solving algorithms due to its versatility (as shown above, it can be used for several MDP-related problems) and conceptual simplicity, which makes it easy to implement within different programming paradigms [30, 37], including implementation via neural nets. Several variants of VI with improved performance were developed [36, 14]. For instance, the recent paper by Sidford et al. presented a new class of randomized VI techniques with the best theoretical runtime bounds (for certain values of parameters) among all known MDP solvers. The paper also expresses hope that their techniques “will be useful in the development of even faster MDP algorithms.” To get insight into the underlying structure of VI, which might enable or limit further such accelerations, we take a complexity-theoretic vantage point and study the theoretical complexity of computing an outcome of a VI execution. That is, we consider the following decision problem ValIt: given an MDP with a finite horizon $H$ (encoded as a binary number), does a given action maximize the RHS of (1) for $n = H$? This problem is inspired by the paper of Fearnley and Savani, where they show PSPACE-hardness (and thus also completeness) for the problem of determining an outcome of policy iteration, another well-known algorithm for MDP solving. To the best of our knowledge, VI has not yet been explicitly subjected to this type of analysis. However, questions about the complexity of ValIt were implicitly raised by previous work on the complexity of finite-horizon MDPs, as discussed in the next paragraph.
The complexity of finite-horizon MDPs is a long-standing open problem. Since “finding an optimal policy” is a function problem, we can instead consider the decision variant: “In a given finite-horizon MDP, is it optimal to use a given action in the first step?” As discussed above, this is exactly the ValIt problem in disguise.
In the seminal 1987 paper on the complexity of MDPs, Papadimitriou and Tsitsiklis showed P-completeness of a special case of finite-horizon optimization where the horizon has magnitude polynomial in the size of the MDP. At the same time, they noted that in the general case of binary-encoded $H$, VI can be executed on an EXPTIME-bounded Turing machine (since $H$ is represented using $\log H$ bits, the number of iterations is exponential in the size of the input). Hence ValIt is in EXPTIME. However, the exact complexity of the general finite-horizon optimization remained open ever since, with the best lower bound being the P-hardness inherited from the “polynomial $H$” sub-problem. Tseng presented a more efficient (though still exponential) algorithm for finite-horizon MDPs satisfying a certain stability condition; in the same paper, he comments that “in view of the stability assumptions needed to obtain an exact solution and the absence of negative results, we are still far from a complete complexity theory for this problem.”
In this paper, we address this issue, provide the missing negative results, and provide tight bounds on the computational complexity of ValIt and finite-horizon MDP optimization.
The main result of the paper is that ValIt is EXPTIME-complete (Section 2.1). In the rest of this section, we first explain some challenges we needed to overcome to obtain the result. Then we sketch our main techniques and conclude by discussing the significance of our results, which extends beyond MDPs to several areas of independent interest.
Bitsize of numbers.
One might be tempted to believe that ValIt is in PSPACE, since the algorithm needs to store only polynomially many values at a time. However, the bitsize of these values may become exponentially large during the computation (e.g., a quantity may halve in every step). Hence, the algorithm cannot be directly implemented by a polynomial-space Turing machine (TM). One could try to adapt the method of Allender et al. [20, 2] based on an intricate use of the Chinese remainder representation (CRR) of integers. However, there is no known way of computing the $\max$ operation directly and efficiently on numbers in CRR.
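The growth in bitsize is easy to observe on a tiny example. The sketch below is an illustration only: it assumes a hypothetical single-state, single-action MDP with reward 1 and discount factor 1/2, and the helper name `denominator_bits` is ours. It iterates the one-dimensional recurrence exactly and reports how many bits the denominator of the value needs.

```python
from fractions import Fraction

def denominator_bits(horizon, gamma=Fraction(1, 2), reward=Fraction(1)):
    """Bit length of the denominator of the exact value after `horizon` iterations."""
    v = Fraction(0)
    for _ in range(horizon):
        v = reward + gamma * v      # one-state instance of the Bellman recurrence
    return v.denominator.bit_length()

bits_10 = denominator_bits(10)
bits_1000 = denominator_bits(1000)
```

Under these assumptions the denominator needs one more bit per iteration, so after $H$ iterations the exact value occupies about $H$ bits, which is exponential in the roughly $\log H$ bits used to write $H$ in binary.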
Complex optimal policies.
Another hope for PSPACE membership would be a possibly special structure of optimal policies. Fixing any concrete policy turns an MDP into a Markov chain, whose $H$-step behavior can be evaluated in polynomial space (using, e.g., the aforementioned CRR technique of Allender et al.). If we could prove that (A) an optimal policy can be represented in polynomial space and (B) that the Markov chain induced by such a policy is polynomially large in the size of the MDP, we would get the following algorithm: cycle through all policies that satisfy (A) and (B), evaluate each of them, and keep track of the best one found so far. Tseng commented that optimal policies in finite-horizon MDPs are “poorly understood”. Hence, there was still hope that optimal Markovian deterministic policies may have a shape that satisfies both (A) and (B). Unless PSPACE = EXPTIME, our results put such hopes to rest.
No hardness by succinctness.
One might try to prove EXPTIME-hardness using a succinctness argument. The results of Papadimitriou and Tsitsiklis show that ValIt is P-hard when the horizon is written in unary, and many optimization problems over discrete structures incur an exponential blow-up in complexity when the discrete structure is encoded succinctly, e.g., by a circuit. Giving a horizon $H$ in binary amounts to a succinct encoding of an exponentially large MDP obtained by “unfolding” the original MDP into a DAG-like MDP of depth $H$. This unfolded MDP is “narrow” in the sense that it consists of many polynomial-sized layers, while standard EXPTIME-hardness-by-succinctness proofs use succinct structures of an exponential “width” and “depth”, accommodating the tape contents of an EXPTIME-bounded TM. Hence, straightforward succinctness proofs do not apply here; e.g., there does not seem to be a direct reduction from the succinct circuit value problem.
To obtain EXPTIME-hardness of ValIt, we proceed by a sequence of non-trivial reductions. Below we outline these reductions in the order in which they appear in the sequence; see Figure 1. In the main text, we present the reductions in a different order (indicated by the numbering of propositions and theorems), so that we start with MDPs and gradually introduce more technical notions.
We start from a canonical EXPTIME-complete problem: the halting problem for an exponential-time TM. We then present a reduction to a halting problem for a class of counter programs (CPs; simple imperative programs with integer variables) that allow for linear variable updates. In this way, we encode the tape contents into numerical values (4). The crucial feature of this reduction is that the produced CP possesses a special simplicity property, which imposes certain restrictions on the use of tests during the computation.
Next, we introduce straight-line programs (SLPs) with $\max$, $+$, and $-$ operations. SLPs are a standard model of arithmetic computation and they can be equivalently viewed as arithmetic circuits consisting (in our case) of $\max$, $+$, and $-$ gates. We also consider a sub-class of SLPs with only $\max$ and $+$ operations, so-called monotone SLPs. We define the following powering problem: given a function $f$ represented as an SLP, a horizon $H$, an initial argument $x$, and two indices $i, j$, is it true that the $i$-component of $f^H(x)$, i.e. the image of $x$ with respect to the $H$-fold composition of $f$, is greater than the $j$-component of $f^H(x)$? Although VI in MDPs does not necessarily involve integers, the powering problem for monotone SLPs captures the complexity inherent in iterating the recurrence (1). To obtain a reduction from CPs to SLP powering, we construct SLP gadgets with $\max$, $+$, and $-$ (minus) operations to simulate the tests in CPs; the simplicity of the input CP is crucial for this reduction to work (Section 4.1). To get rid of the minus operation, we adapt a technique by Allender et al., which introduces a new “offset” counter and models subtraction by increasing the value of the offset (Section 3.2).
A final step is to show a reduction from monotone SLP powering to ValIt. The reduction proceeds via an intermediate problem of synchronizing reachability in MDPs (maximize the probability of being in a target set of states after exactly $H$ steps). This divides a rather technical reduction into more comprehensible parts. We present novel reductions from monotone SLP powering to synchronizing reachability (Section 3.2), and from the latter problem to ValIt (Section 2.2). As a by-product, we present a reduction proving EXPTIME-hardness of finite-horizon reachability in MDPs, arguably the conceptually simplest objective in probabilistic decision-making (Section 2.2).
As our main result, we characterize the complexity of computing an outcome of VI, one of the fundamental algorithms for solving both finite- and infinite-horizon MDPs. As a consequence, we resolve a long-standing complexity issue  of solving finite-horizon MDPs.
On our way to proving this result, we encounter non-trivial stepping stones which are of independent interest. First, we shed light on the complexity of succinctly represented arithmetic circuits, showing that comparing two output wires of a given $(\max,+)$-circuit incurs an exponential blow-up in complexity already when employing a very rudimentary form of succinctness: composing a single $(\max,+)$-circuit with copies of itself, yielding a circuit of exponential “height” but only polynomial “width.” Second, we obtain new hardness results for the bounded reachability problem in linear-update counter programs. CPs are related to several classical abstractions of computational machines, such as Minsky machines and Petri nets, where a breakthrough was obtained recently. Our work establishes a novel connection between counter programs and MDPs.
Further Related Work
Our work is also related to a series of papers on finite-horizon planning [21, 17, 18, 23]. The survey paper  provides a comprehensive overview of these results. These papers consider either MDPs with a polynomially large horizon, or succinctly represented MDPs of possibly exponential “width” (the succinctness was achieved by circuit-encoding). The aforementioned hardness-by-succinctness proofs are often used here. The arbitrary horizon problem for standard MDPs, which we study, is left open in these papers, and our work employs substantially different techniques. The complexity of finite-horizon decentralized MDPs was studied in .
2 Markov Decision Processes and Finite-Horizon Problems
We start with some preliminaries. A probability distribution over a finite set $S$ is a function $d : S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$. We denote by $\mathcal{D}(S)$ the set of all (rational) probability distributions over $S$. The Dirac distribution on $s \in S$ assigns probability 1 to $s$.
A Markov decision process (MDP) $M = (S, A, P, R, \gamma)$ consists of a finite set $S$ of states, a finite set $A$ of actions, a transition function $P : S \times A \to \mathcal{D}(S)$, a reward function $R : S \times A \to \mathbb{Q}$, and a discount factor $\gamma$. The transition function assigns to each state $s$ and action $a$ a distribution over the successor states, while the reward function assigns to $s$ and $a$ a rational reward.
A path is an alternating sequence of visited states and played actions in (that starts and ends in a state); write for the length of . We may use to denote that path goes from to . We extend the reward function from single state-action pairs to paths by .
A policy for the controller is a function that assigns to each path a distribution over actions. Let denote the probability of a path starting in when the controller follows the policy . This probability is defined inductively by setting if , and otherwise. For a path , we set
We omit the subscripts from if they are clear from the context. Additionally, we extend to sets of paths of the same length by summing the probabilities of all the paths in the set.
In this paper, we focus on a special class of policies: A (deterministic) Markov policy is a function . Intuitively, a controller following a Markov policy plays from if it is the -th visited state, irrespective of the other states in the path. Markov policies suffice for the problems we consider.
2.1 Finite-Horizon Problems
Given an MDP, the core problem of MDPs is computing the values of states with respect to the maximum expected reward. Let $v_n$ denote the vector of $n$-step maximum expected rewards obtainable from each state of the MDP. That is, for all states $s$ we have that
Note that by this definition. The vector can be computed by value iteration, i.e. by iterating the recurrence stated in Equation (1). From that recurrence, for each and state , one can extract an (optimal) Markov policy that achieves the maximum value after steps: for each and for we have
Papadimitriou and Tsitsiklis posed the finite-horizon reward problem which asks to compute such an optimal policy for the controller . Formally, given an MDP , an initial state , a distinguished action , and a horizon encoded in binary, the finite-horizon reward problem asks whether there exists a policy achieving by choosing as the first action from . Note that this problem is equivalent to the ValIt problem defined in the introduction.
Consider the MDP depicted in Figure 2 with . By iterating the indicated recurrence, we have that . The value of is due to the second argument of (corresponding to action ), hence a policy to maximize starts with in .
The finite-horizon reward problem can be decided by value iteration in exponential time by unfolding recurrence (1) for $H$ steps, while the best known lower bound is P-hardness. Our main result closes this long-standing complexity gap:
The finite-horizon reward problem (and thus also the ValIt problem) is EXPTIME-complete.
To prove EXPTIME-completeness of the finite-horizon reward problem, we introduce a variant of reachability, which we call synchronized reachability. Let $t$ be a target state. For reachability, the objective is to maximize the probability of taking a path from the initial state to $t$, whereas in synchronized reachability only a subset of such paths with the same length are considered.
Let be an MDP, an initial state, and an action. Define as the vector of maximum probabilities of taking a path to within steps. Similarly, define to be the vector of maximum probabilities of taking such a path with length exactly . Formally, for all we have that
Given a horizon , encoded in binary, the finite-horizon reachability problem asks whether an optimal policy achieving chooses action as the first action from ; the finite-horizon synchronized-reachability problem asks whether an optimal policy achieving chooses action as the first action from .
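The difference between the two objectives can be made concrete with a short sketch. It assumes the standard dynamic-programming recurrences for the two quantities (start from the indicator vector of the target and apply one maximization step per unit of horizon; for the "within" variant the target keeps value 1), which we take to match the definitions above. The function name `reach_values`, the dictionary encoding `P[s][a]`, and the three-state example are hypothetical.

```python
from fractions import Fraction

def reach_values(states, actions, P, target, horizon, exact):
    """Maximal probability of being in `target` after exactly `horizon` steps
    (exact=True), or of visiting it within `horizon` steps (exact=False)."""
    val = {s: Fraction(int(s == target)) for s in states}
    for _ in range(horizon):
        nxt = {}
        for s in states:
            if not exact and s == target:
                nxt[s] = Fraction(1)          # already reached: stays reached
                continue
            nxt[s] = max(sum(p * val[t] for t, p in P[s][a].items())
                         for a in actions)
        val = nxt
    return val

# Hypothetical example: from s, action "a" enters the target t immediately,
# but t is left again in the next step; action "b" goes to u, which enters t
# with probability 1/2 per step.
half = Fraction(1, 2)
states, actions = ["s", "t", "u"], ["a", "b"]
P = {"s": {"a": {"t": Fraction(1)}, "b": {"u": Fraction(1)}},
     "t": {"a": {"u": Fraction(1)}, "b": {"u": Fraction(1)}},
     "u": {"a": {"t": half, "u": half}, "b": {"t": half, "u": half}}}
exact2 = reach_values(states, actions, P, "t", 2, exact=True)
within2 = reach_values(states, actions, P, "t", 2, exact=False)
```

In this example, action `a` is the best first choice for reaching the target within two steps, whereas action `b` is the best first choice for being in the target after exactly two steps.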
2.2 Connections Among Finite-Horizon Problems
We now prove the following theorem.
The finite-horizon synchronized-reachability problem reduces, in polynomial time, to the finite-horizon reward problem.
Consider an MDP , an initial state , an action and a target state . The following recurrence can be used to compute :
where and for all . We construct a new MDP obtained from by replacing all transitions by two consecutive transitions. The construction is such that the probability of going from to with a path of length in is equal to the probability of going from to with a path of length in . More formally, for all and with , the transition is replaced with if and with otherwise; where is an arbitrary chosen discount factor for , and the intermediate state in both cases is a new state. The MDP in Figure 2 is the result of applying the construction to in Figure 3 with .
For the constructed MDP , one can show that for all states , an action is optimal to maximize if and only if it is optimal to maximize . Consider the MDPs from Figure 2 as an example. We have previously argued that a policy maximizing in starts with action . Observe that the optimal first choice to maximize is also . This implies that an optimal policy of for synchronized-reachability with starts with , too. By the above argument, the finite-horizon synchronized-reachability problem reduces to the finite-horizon reward problem.
Hence, to obtain Section 2.1, it remains to determine the complexity of the finite-horizon synchronized-reachability problem. To this aim, we show a close connection between MDPs and a class of piecewise-affine functions represented by straight line programs (SLPs). Section 3 provides the details.
We also show that the finite-horizon synchronized-reachability problem reduces to the finite-horizon reachability problem. We remark that the natural probability-1 variants of these problems have different complexities: specifically, the problem of reaching the target within $H$ steps with probability 1 is in P; however, the analogous problem of reaching the target in exactly $H$ steps with probability 1 is PSPACE-complete.
The finite-horizon synchronized reachability problem reduces, in polynomial time, to the finite-horizon reachability problem.
3 Straight-Line Programs and The Powering Problem
We now establish the connection between MDPs and SLP powering. We start with preliminaries.
For all , define the set of variables and the collection of terms
A straight-line program (SLP) of order is a sequence of commands of the form , where and is non-empty. We refer to commands as initializations. Recall that .
For complexity analyses we shall assume that , for every command, is given explicitly as a list of terms. Each term is also assumed to be explicitly represented as a constant, a list of coefficients , and a list of indices , both lists having length (i.e. the number of variables). The size of , and also that of the command, corresponds to the length of its list of terms; the size of the SLP, the sum of the sizes of its commands.
A valuation is a vector in , where the -th coordinate gives the value of . The semantics of a command is a function , transforming a valuation into another. An SLP defines the function obtained by composing the constituent commands: . Clearly this is a piecewise-affine function. Given a function , we define its -th power as where
is the -fold composition of .
We denote by the set of terms where the coefficients are in . An SLP that only uses terms in is called monotone. Note that monotone SLPs induce monotone functions from to (subtraction and are not allowed).
3.1 The Powering Problem
For an SLP $f$ of order $k$, a valuation $\nu$ and $N$ (encoded in binary), let $\mu = f^N(\nu)$. Given two variables $x_i, x_j$ of the SLP, the powering problem asks whether $\mu_i > \mu_j$. Since the initial valuations are always non-negative, all valuations obtained by powering monotone SLPs are non-negative. The above problem is P-complete if the exponent is written in unary.
Observe that all numbers generated by powering an SLP can be represented using exponentially-many bits in the bitsize of the exponent. It follows that the powered SLP can be explicitly evaluated in exponential time. We provide a matching lower bound in Section 4. Before that, we show the connection of SLP powering to MDPs.
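The explicit evaluation mentioned above can be sketched as follows. The encoding is an assumption made for illustration: each command is a pair `(target, terms)` that assigns to variable `target` the maximum over a non-empty list of affine terms, and each term is a pair `(constant, {variable_index: coefficient})`. The names `run_slp` and `power_slp` and the two-variable example are hypothetical.

```python
def run_slp(commands, valuation):
    """Apply the commands once, in order, to a copy of the valuation."""
    x = list(valuation)
    for target, terms in commands:
        x[target] = max(c + sum(k * x[i] for i, k in coeffs.items())
                        for c, coeffs in terms)
    return x

def power_slp(commands, valuation, n):
    """Naive n-fold composition: time linear in n, hence exponential in the bitsize of n."""
    x = list(valuation)
    for _ in range(n):
        x = run_slp(commands, x)
    return x

# Hypothetical monotone SLP over x0, x1:
#   x0 <- max(1, 2*x0)      (roughly doubles every round)
#   x1 <- max(x1 + 3)       (grows linearly)
commands = [(0, [(1, {}), (0, {0: 2})]),
            (1, [(3, {1: 1})])]
after4 = power_slp(commands, [0, 0], 4)
after5 = power_slp(commands, [0, 0], 5)
```

Here the comparison between the two variables flips between the fourth and the fifth power, so the answer to the powering question depends on the exponent. The values themselves can double in each round, which is why they may need a number of bits that is exponential in the bitsize of the exponent.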
3.2 Synchronized Reachability and SLP Powering
The connection is stated in the following Theorem.
The powering problem for monotone SLPs reduces, in polynomial time, to the finite-horizon synchronized reachability problem in MDPs.
To illustrate this reduction, let us consider the SLP of order :
This SLP is normalized, that is to say all its commands have exactly two arguments and furthermore have exactly two summands. (Note that focusing on normalized SLPs is no loss of generality.) We are interested in the -nd power of with initial valuation and . In Figure 3, two copies of are shown on the right to visualize the concept of powering it. To obtain an MDP, we consider a set of actions and have each variable become a state. In the example, and are the corresponding states for and . The arguments of commands determine the successors of actions , respectively, where each successor has probability . The command translates to and , as shown in the MDP in Figure 3. Since , we make a target state. Now the -th iteration of value iteration of (2) (corresponding to the -th step before the horizon) is tightly connected to the -th power of the SLP. Indeed, letting , one can prove that and .
SLP vs. monotone SLP powering.
It thus remains to provide a lower bound for the monotone SLP powering problem. The crucial step, which we cover in Section 4, is providing lower bounds for the non-monotone variant. The remaining step from non-monotone to monotone powering can be made by adapting the techniques of Allender et al.
The powering problem for arbitrary SLPs reduces, in polynomial time, to the powering problem for monotone SLPs.
4 Main Reductions
To show EXPTIME-hardness of all the problems introduced so far, we introduce a class of counter programs that allow linear updates on counters and show that a (time-)bounded version of the termination problem for these programs is EXPTIME-complete. Finally, we reduce this bounded termination problem to the powering problem.
A deterministic linear-update counter program (CP) consists of counters , ranging over , and a sequence of instructions. We consider instructions of the form
where and , and the final instruction is always . More precisely, the instructions allow
adding or subtracting two counters, assigning the result to a third one, and continuing to the next instruction;
testing two counters against each other, and jumping to some given instruction if the result of the test is positive, continuing to the next instruction otherwise.
The halt instruction only loops to itself.
A configuration of a CP is a tuple consisting of an instruction and values of the counters (e.g., is the value for the counter ). We equip CPs with a fixed initial configuration lying in . Given a CP, the termination problem asks whether the halt instruction is reached. The bounded termination problem additionally takes as input an integer , encoded in binary, and asks whether the halt instruction is reached within steps.
The bounded termination problem is in EXPTIME: in a computation with $N$ steps, where $N$ is the given bound, the magnitude of the counters grows at most exponentially in $N$ (each update at most doubles the largest magnitude), so each step can be simulated in time exponential in the bitsize of $N$. We will now show that the problem is EXPTIME-hard already for a certain subclass of CPs which facilitates the reductions to the powering problem.
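The step-by-step simulation behind this upper bound can be sketched as a small interpreter. The instruction encoding is an assumption made for illustration: `("add", i, j, k)` and `("sub", i, j, k)` assign the sum or difference of counters `j` and `k` to counter `i`, `("jlt", j, k, target)` jumps to `target` if counter `j` is smaller than counter `k` (the exact form of the test is our choice), and `("halt",)` ends the program. The function name `halts_within` and the countdown program are hypothetical.

```python
def halts_within(program, counters, bound):
    """Return True if the halt instruction is reached after at most `bound` steps."""
    c, pc = list(counters), 0
    for _ in range(bound):
        op = program[pc]
        if op[0] == "halt":
            return True
        if op[0] == "add":
            _, i, j, k = op
            c[i] = c[j] + c[k]
            pc += 1
        elif op[0] == "sub":
            _, i, j, k = op
            c[i] = c[j] - c[k]
            pc += 1
        elif op[0] == "jlt":
            _, j, k, target = op
            pc = target if c[j] < c[k] else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return program[pc][0] == "halt"

# Hypothetical countdown: c0 is decremented by c1 = 1 until it reaches c2 = 0.
program = [
    ("jlt", 2, 0, 2),   # 0: if 0 < c0, go to the decrement at 2
    ("jlt", 2, 1, 4),   # 1: 0 < 1 always holds, so this jumps to halt
    ("sub", 0, 0, 1),   # 2: c0 <- c0 - 1
    ("jlt", 2, 1, 0),   # 3: unconditional jump back to 0
    ("halt",),          # 4
]
reached_in_11 = halts_within(program, [3, 1, 0], 11)
reached_in_10 = halts_within(program, [3, 1, 0], 10)
```

The interpreter performs up to `bound` iterations, so with a binary-encoded bound its running time is exponential in the input size, in line with the upper bound stated above.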
Simple counter programs.
A CP is simple if it satisfies the following conditions. First, all values in all reachable configurations are non-negative: for all (one may “guard” subtractions by test instructions to achieve this). Second, all test instructions use counters and exclusively. Moreover, for each such instruction , there are counters such that in all reachable configurations we have that
and with . That is, the values of tested counters are “scaled-up” versions of the values of other counters.
Additionally, the absolute difference of the values of the tested counters is larger than the values of all other counters, in symbols
Note that the class of simple CPs is a semantically defined subclass of all CPs. Further observe that for every test instruction we necessarily have that .
The following proposition kick-starts our sequence of reductions. The bounded termination problem for simple CPs is EXPTIME-complete.
To prove the proposition, we follow the classical recipe of first simulating a Turing machine using a machine with two stacks, and then simulating the two-stack machine by a CP. We note two key differences between our construction and the classical reduction: () We use the expressiveness of linear updates in CPs to simulate pushing and popping on the stack in a linear number of steps of the CP. () We instrument the two-stack machine to ensure that the height of the two stacks differs by at most along any computation. This is crucial to allow us to simulate the two-stack machine by a simple linear-update counter program.
4.1 From the Termination Problem to the Powering Problem
We now sketch the main ideas behind the last (and most technically involved) missing link in our sequence of reductions.
The bounded termination problem for simple CPs reduces, in polynomial time, to the powering problem for SLPs.
Given a CP we construct an SLP of order with variables including . Let us denote by for . The reduction is such that a configuration of is encoded as a valuation of the SLP with the property that and for all . In this way, the instruction of the CP is encoded in the variables of the SLP (recall that SLPs are stateless).
Given this encoding, the main challenge is to realize the transition function of the CP as a function computed by an SLP. Once this is accomplished, for every , the -th power of the SLP represents the -step transition function of the CP.
Intuitively, to encode the transition function we would like to equip the SLP with conditional commands, whose execution depends on a conditional. Specifically, we want to implement the following two kinds of conditional updates
in terms of primitive commands of an SLP. In both commands, if the condition is not satisfied, the command is not executed, and the value of or remains unchanged. For example, one can simulate the first type of conditional commands by executing , where is an expression that is if the test is passed and less than otherwise. Intuitively, we think of as “masking” the assignment if the test fails.
For the following result, which formalizes how we implement conditional commands, we call a valuation valid if there exists with and for all .
Let and be distinct. The following equation holds for all valid valuations :
Moreover, if , then the following holds:
The equations follow directly from the assumption that is valid, since if then we also have . In addition, if and , we will have . ∎
Using the property that the simulated program is simple, Equation (3) can be used to simulate the conditional update where masks the update. Likewise, Equation (4) can be used to simulate the second type of conditional update where the masking expression is . Finally, the multiplication-by-a-constant required for the second type of the conditional update is achieved via repeated addition.
Encoding the instructions.
We recall that we encode being at the instruction of the CP by a valuation such that for all .
Using the aforementioned conditional commands, we can construct the SLP as the composition of smaller SLPs. Each sub-SLP simulates an instruction from the given CP . Hence , when applied upon a valid valuation (i.e., a properly-encoded configuration of ), simulates all of its instructions at once. By using conditional commands, we make sure that only one sub-SLP results in a non-zero update: executing has no effect on the valuation unless for all .
In this way, powering allows us to simulate consecutive steps of . In particular, for all we have that where is the instruction, holds if and only if halts after at most steps.
By the virtue of our chain of reductions (see Figure 1), we get the following theorem.
All the following problems are EXPTIME-complete:
The finite-horizon reward problem for MDPs, and thus also the ValIt problem.
The finite-horizon reachability and synchronized reachability problems for MDPs.
The powering problem for SLPs and for monotone SLPs.
The bounded termination problem for simple counter programs.
The exact complexity of the following variant of the problem remains open: given an MDP and a horizon encoded in binary, determine whether there exists a policy achieving some given expected-reward threshold (with no restriction on the actions used to do so).
-  Pieter Abbeel and Andrew Y. Ng. Learning first-order Markov models for control. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1–8. MIT Press, 2005. URL: http://papers.nips.cc/paper/2569-learning-first-order-markov-models-for-control.pdf.
-  Eric Allender, Nikhil Balaji, and Samir Datta. Low-depth uniform threshold circuits and the bit-complexity of straight line programs. In Erzsébet Csuhaj-Varjú, Martin Dietzfelbinger, and Zoltán Ésik, editors, Mathematical Foundations of Computer Science 2014 - 39th International Symposium, MFCS 2014, Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, volume 8635 of Lecture Notes in Computer Science, pages 13–24. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44465-8_2, doi:10.1007/978-3-662-44465-8_2.
-  Eric Allender, Peter Bürgisser, Johan Kjeldgaard-Pedersen, and Peter Bro Miltersen. On the complexity of numerical analysis. SIAM Journal on Computing, 38(5):1987–2006, 2009.
-  Eric Allender, Andreas Krebs, and Pierre McKenzie. Better complexity bounds for cost register automata. Theory of Computing Systems, pages 1–19, 2017.
-  Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
-  Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
-  Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
-  Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1987.
-  Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 2005.
-  Vincent D Blondel and John N Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 36(9):1249–1274, 2000.
-  Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance. Springer-Verlag Berlin Heidelberg, 2011.
-  Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors. Handbook of Model Checking. Springer International Publishing, 2018.
-  Wojciech Czerwinski, Slawomir Lasota, Ranko Lazic, Jérôme Leroux, and Filip Mazowiecki. The reachability problem for Petri nets is not elementary (extended abstract). CoRR, abs/1809.07115, 2018. URL: http://arxiv.org/abs/1809.07115, arXiv:1809.07115.
-  Peng Dai, Mausam, Daniel S. Weld, and Judy Goldsmith. Topological value iteration algorithms. J. Artif. Intell. Res., 42:181–209, 2011. URL: http://jair.org/papers/paper3390.html.
-  Laurent Doyen, Thierry Massart, and Mahsa Shirmohammadi. Limit synchronization in Markov decision processes. In Anca Muscholl, editor, Foundations of Software Science and Computation Structures - 17th International Conference, FOSSACS 2014, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2014, Grenoble, France, April 5-13, 2014, Proceedings, volume 8412 of Lecture Notes in Computer Science, pages 58–72. Springer, 2014. URL: https://doi.org/10.1007/978-3-642-54830-7_4, doi:10.1007/978-3-642-54830-7_4.
-  John Fearnley and Rahul Savani. The complexity of the simplex method. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC ’15, pages 201–208, New York, NY, USA, 2015. ACM. URL: http://doi.acm.org/10.1145/2746539.2746558, doi:10.1145/2746539.2746558.
-  Judy Goldsmith, Michael L Littman, and Martin Mundhenk. The complexity of plan existence and evaluation in probabilistic domains. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI’97), pages 182–189. Morgan Kaufmann Publishers Inc., 1997.
-  Judy Goldsmith and Martin Mundhenk. Complexity issues in Markov decision processes. In Proceedings of the 13th Annual IEEE Conference on Computational Complexity, Buffalo, New York, USA, June 15-18, 1998, pages 272–280. IEEE Computer Society, 1998. URL: https://doi.org/10.1109/CCC.1998.694621, doi:10.1109/CCC.1998.694621.
-  R. Greenlaw, H.J. Hoover, and W.L. Ruzzo. Limits to Parallel Computation: P-completeness Theory. Oxford University Press, 1995. URL: https://books.google.fr/books?id=YZHnCwAAQBAJ.
-  William Hesse, Eric Allender, and David A. Mix Barrington. Uniform constant-depth threshold circuits for division and iterated multiplication. J. Comput. Syst. Sci., 65(4):695–716, 2002. URL: https://doi.org/10.1016/S0022-0000(02)00025-9, doi:10.1016/S0022-0000(02)00025-9.
-  Michael L Littman. Probabilistic propositional planning: Representations and complexity. In AAAI’97, pages 748–754, 1997.
-  Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Philippe Besnard and Steve Hanks, editors, UAI ’95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, Montreal, Quebec, Canada, August 18-20, 1995, pages 394–402. Morgan Kaufmann, 1995. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=457&proceeding_id=11.
-  Michael L Littman, Judy Goldsmith, and Martin Mundhenk. The computational complexity of probabilistic planning. Journal of Artificial Intelligence Research, 9:1–36, 1998.
-  Yishay Mansour and Satinder P. Singh. On the complexity of policy iteration. In Kathryn B. Laskey and Henri Prade, editors, UAI ’99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999, pages 401–408. Morgan Kaufmann, 1999. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=192&proceeding_id=15.
-  Ernst W. Mayr. An algorithm for the general Petri net reachability problem. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC’81), pages 238–246. ACM, 1981.
-  Martin Mundhenk, Judy Goldsmith, Christopher Lusena, and Eric Allender. Complexity of finite-horizon Markov decision process problems. J. ACM, 47(4):681–720, July 2000. URL: http://doi.acm.org/10.1145/347476.347480, doi:10.1145/347476.347480.
-  Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Math. Oper. Res., 12(3):441–450, 1987. URL: https://doi.org/10.1287/moor.12.3.441, doi:10.1287/moor.12.3.441.
-  Christos H. Papadimitriou and Mihalis Yannakakis. A note on succinct representations of graphs. Information and Control, 71(3):181–185, 1986. URL: https://doi.org/10.1016/S0019-9958(86)80009-2, doi:10.1016/S0019-9958(86)80009-2.
-  Martin L. Puterman. Markov Decision Processes. Wiley-Interscience, 2005.
-  Tim Quatmann and Joost-Pieter Katoen. Sound value iteration. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verification, pages 643–661. Springer International Publishing, 2018.
-  Manfred Schäl. Markov decision processes in finance and dynamic options. In Handbook of Markov Decision Processes, International Series in Operations Research & Management Science, pages 461–487. Springer, 2002.
-  Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes, pages 770–787. SIAM, 2018. URL: https://epubs.siam.org/doi/abs/10.1137/1.9781611975031.50, doi:10.1137/1.9781611975031.50.
-  Olivier Sigaud and Olivier Buffet. Markov Decision Processes in Artificial Intelligence. John Wiley & Sons, 2013.
-  R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2018.
-  Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2154–2162. Curran Associates, Inc., 2016. URL: http://papers.nips.cc/paper/6046-value-iteration-networks.pdf.
-  Paul Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H). Operations Research Letters, 9(5):287–297, 1990.
-  Zhimin Wu, Ernst Moritz Hahn, Akin Günay, Lijun Zhang, and Yang Liu. GPU-accelerated value iteration for the computation of reachability probabilities in MDPs. In Gal A. Kaminka, Maria Fox, Paolo Bouquet, Eyke Hüllermeier, Virginia Dignum, Frank Dignum, and Frank van Harmelen, editors, ECAI 2016 - 22nd European Conference on Artificial Intelligence, pages 1726–1727. IOS Press, 2016. URL: https://doi.org/10.3233/978-1-61499-672-9-1726, doi:10.3233/978-1-61499-672-9-1726.
-  Yinyu Ye. A new complexity result on solving the Markov decision problem. Mathematics of Operations Research, 30(3):733–749, 2005.
-  Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.
Appendix A Proof of Theorem 2.2
In this section, we will show that there is a polynomial-time many-one reduction from the finite-horizon synchronized-reachability problem to the finite-horizon reward problem — for any discount factor .
Proof of Theorem 2.2.
Consider an instance of the finite-horizon synchronized-reachability problem, i.e., an MDP , an initial state , a target state , and horizon . In polynomial time we can construct a new MDP with an initial state and horizon such that:
is such that for each and we have: and for all other arguments, returns zero;
For we construct (also in polynomial time), the reward function as follows: for each such that we have and . If , then and . This ensures that under any strategy in the -step (expected) value, which we denote by , is equal to
if is even, and otherwise it is equal to
where . Finally, we set .
Intuitively, is formed by subdividing each probabilistic transition into two transitions by using newly added “middle” states (those of the form ). Hence, there is a one-to-one correspondence between runs of some length in and finite paths of length in . The correspondence naturally extends to sets of runs and strategies (since there is no real choice in states of the form ). Moreover, under corresponding strategies, the probabilities of corresponding sets of runs are identical. Hence, for all and all there exists a strategy in such that in if and only if there is a strategy in such that in . But by the discussion in the previous paragraph, such a strategy exists if and only if there is in such that
It follows that an action is an optimal first action for the finite-horizon synchronized-reachability problem in (with horizon ) if and only if it is an optimal first action for the finite-horizon reward problem in (with horizon ). ∎
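The subdivision step used in this proof can be illustrated with a minimal Python sketch. It is not the authors' exact construction: the encoding of middle states as tuples `("mid", s, a, t)`, the dictionary representation of the transition function, and the names `subdivide` and `distribution_after` are all assumptions made for illustration, and the reward function is omitted. The sketch only shows that following k actions in the original MDP and the corresponding 2k actions in the subdivided MDP yields the same distribution over original states.

```python
from fractions import Fraction


def subdivide(delta):
    """Split every probabilistic transition into two half-steps.

    delta maps (state, action) -> {successor: probability}.
    Middle states are encoded as ("mid", s, a, t); this encoding is an
    assumption of the sketch, not taken from the paper.
    """
    actions = sorted({a for (_, a) in delta})
    delta2 = {}
    for (s, a), dist in delta.items():
        # First half-step: the random choice of successor t is made here.
        delta2[(s, a)] = {("mid", s, a, t): p for t, p in dist.items()}
        for t in dist:
            # Second half-step: every action deterministically completes
            # the move, so there is no real choice in a middle state.
            for b in actions:
                delta2[(("mid", s, a, t), b)] = {t: Fraction(1)}
    return delta2


def distribution_after(delta, start, action_seq):
    """State distribution after playing a fixed action sequence from start."""
    dist = {start: Fraction(1)}
    for a in action_seq:
        nxt = {}
        for s, p in dist.items():
            for t, q in delta[(s, a)].items():
                nxt[t] = nxt.get(t, Fraction(0)) + p * q
        dist = nxt
    return dist
```

Exact rational arithmetic (`Fraction`) is used so that the two distributions can be compared for equality without floating-point error.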
Appendix B Proof of Theorem 2.2
We show that there is a polynomial-time many-one reduction from the finite-horizon synchronized-reachability problem to the finite-horizon reachability problem.
Proof of Theorem 2.2.
Given an instance of the finite-horizon synchronized-reachability problem — where is the target state, is an initial state, and is a binary-encoded horizon — we compute an instance of the finite-horizon reachability problem where the MDP is a modification of . In particular, (standing for “good” and “bad”, respectively) are two fresh states, both sinks in . For all other states and all actions we define a modified transition function : with probability action leads to the new target , and with probability it does whatever it did in ; formally, we put and for all states in . Finally, we introduce a fresh action, , which leads to and , both with probability , when played from ; formally, we put and for all . Observe that from state and at time it is best to play , as . Strictly before time , any optimal strategy does not play — regardless of the current state — as playing two different consecutive actions instead leads to with probability at least ; rather, any optimal strategy maximizes the probability of reaching state at time (because then it is beneficial to play ). It follows that any optimal strategy (for synchronization) in corresponds naturally to an optimal strategy (for reachability) in , and vice versa. Specifically, the first action played in the two strategies is the same. This gives the reduction. ∎
Appendix C Proof of Theorem 3.2
We show how to remove all but one last subtraction — which is then subsumed by the comparison — in a Max-Plus-Minus SLP.
Proof of Theorem 3.2.
Let be a (general) SLP of order . Without loss of generality, we suppose consists only of (binary) addition, subtraction, and max commands. To eliminate subtractions, we closely follow the proof of Theorem in Allender et al. . We construct a monotone SLP of order from by first introducing a new variable that will help us maintain the invariant that for all and all we have .
We proceed as follows: First we initialize . Whenever we encounter a command in , we replace this with the following sequence of commands in .
We leave the commands in the SLP unchanged in . Notice that the sequence of commands above is monotone (i.e., it includes no subtraction commands), and the number of commands used in is at most times the number of commands used in plus an additional command to initialize . To complete the proof, it suffices to note that for every command in , the corresponding sequence of commands in satisfies the invariant that the value of every variable for all of , is obtained as the difference of variables and in . ∎
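The offset idea behind this proof can be sketched in Python as follows. This is a hedged illustration, not the authors' exact command sequence. It assumes the invariant is that each original variable equals its shifted copy minus one shared offset `z`, and it represents an SLP as a list of tuples `(op, i, j, k)` meaning `x_i := x_j op x_k`. The two scratch cells, the handling of addition by doubling `z`, and the names `eliminate_subtractions` and `run` are choices of the sketch.

```python
def eliminate_subtractions(n, program):
    """Translate an SLP over x_0..x_{n-1} into a subtraction-free one.

    program: list of (op, i, j, k), meaning x_i := x_j op x_k,
             with op in {"+", "-", "max"}.
    Output indices 0..n-1 hold shifted copies x'_i, index n holds the
    offset z, and n+1, n+2 are scratch cells (an assumption of this sketch).
    Assumed invariant after each translated command: x_i == x'_i - z.
    """
    z, t, s = n, n + 1, n + 2
    out = []
    for op, i, j, k in program:
        if op == "max":
            # max(a - z, b - z) == max(a, b) - z, so the command carries over.
            out.append(("max", i, j, k))
        elif op == "-":
            out.append(("+", t, j, z))          # t := x'_j + z
            out.append(("max", s, k, k))        # s := x'_k (copy via max)
            for l in range(n):
                if l != i:
                    out.append(("+", l, l, s))  # shift every other variable
            out.append(("+", z, z, s))          # z := z + x'_k
            out.append(("max", i, t, t))        # x'_i := t
        elif op == "+":
            out.append(("+", t, j, k))          # t := x'_j + x'_k
            for l in range(n):
                if l != i:
                    out.append(("+", l, l, z))  # compensate for doubling z
            out.append(("+", z, z, z))          # z := 2 * z
            out.append(("max", i, t, t))        # x'_i := t
        else:
            raise ValueError(f"unknown operation: {op}")
    return out


def run(values, program):
    """Execute an SLP on a list of integers and return the final values."""
    vals = list(values)
    for op, i, j, k in program:
        a, b = vals[j], vals[k]
        vals[i] = a + b if op == "+" else a - b if op == "-" else max(a, b)
    return vals
```

Running the translated program on the original inputs extended by three zeros (offset and scratch cells) and subtracting the final offset from each shifted copy recovers the original values. Because all copies share the same offset, comparing two original variables amounts to comparing their shifted copies, so no final subtraction is needed for the comparison.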
Appendix D Proof of Proposition 4
The proof is by a reduction from the termination problem of a -time bounded Turing machine. We suppose the tape alphabet of the Turing machine is with denoting the empty-cell symbol. For convenience, we also assume that the machine always overwrites empty-cell symbols it reads and replaces them with some sequence of symbols meant to internally represent an empty-and-read cell. This is without loss of generality. Moreover, it implies that cells with ’s and ’s are never separated by ’s.
The tape can be encoded into the two stacks and , such that the top of the stack encodes the contents of the tape cell the head of the Turing machine is currently at; the top of , those of the tape cell immediately right of it; the bottom of , the leftmost part of the simulated tape; the bottom of , its rightmost part. The stack alphabet we use consists of binary codes 110, 101, 100, corresponding to the symbols from the Turing-machine tape , , . We then keep the contents of and