1 Introduction
Markov Decision Processes (MDPs) are a framework for decision making with broad applications to financial, robotics, operations research and many other domains. At their root, MDPs are formulated as the tuple where is the state at a given time , is the action taken by the agent at time as a result of the decision process, is the reward received by the agent as a result of taking the action, and
is a transition function that describes the dynamics of the environment and capture the probability
of transitioning to a state given the action taken from state . An MDP is said to be deterministic if there is no uncertainty or randomness on the transition between and given . The output of an MDP is termed a policy, , which describes an action that should be taken at every state . When an MDP is solved completely such that the policy is optimal, it is typically denoted as . The optimal policy has the property that it maximizes the expected cumulative reward from any initial starting state. Alternatively, the MDP solution can also be viewed as a value function that describes the value of being at each state, or also as a value function that describes the value of taking a specific action from a given state. Given one representation, it is possible to recover the other representations. We use the notation of for the value function and for the optimal value function. MDPs which contain states which "terminate", meaning that once the state is reached, no further actions are taken. In chess, for example, a terminating state would be checkmate. In other problems there may be no natural terminating state. Such problems are said to be continuous.The reward function defines the reward that the agent receives for taking action from state . Reward functions can be based off only the state, , off the state and action, , and occasionally off the resulting next state, .
There are many well known methods for solving MDPs exactly including value iteration and policy iteration, which are iterative methods based on the dynamic programming approach proposed by Bellman. These algorithms use a tablebased approach to represent the stateaction space exactly and iteratively converge to the optimal policy and corresponding value function . These tablebased methods have a well known disadvantage that they quickly become intractable. As the number of states and actions increases in number or dimension, the number of entries in the (multidimensional) table increases exponentially. Many realworld problems quickly exhaust the resources of even high performing computers.
This curse of dimensionality is typically overcome by resorting to various forms of approximation of the optimal value function or optimal policy, some of which also have convergence guarantees or bounds on the error. Other techniques have focused on managing the size of the state space explosion through factorization or through aggregation and tiling.
Bertram bertram proposes an algorithm to solve MDPs named Exact that treats an MDP as a graph and uses the connectivity of the graph and the distance between nodes in the graph to solve an MDP in time complexity and memory complexity. The output of the algorithm was the full tablebased representation of the value function . This paper was restricted to deterministic continuous MDPs and provided performance improvements when the number of rewards was small compared to the number of states () and only supported reward functions based on state, . While performance was improved for this small class of MDPs, it continued to have a dependence on the size of the state space and therefore ultimately only partially mitigates the curse of dimensionality. It may be able to support larger state spaces than typical value iteration or policy iteration, but it eventually too will become intractable as the state action space itself grows exponentially – even linear dependence on the size of the state action space is a significant limitation.
In this paper, we propose an extension to Exact which we name Memoryless that removes the dependence on the size of the state space resulting in time complexity of and memory complexity of for the same restricted class of MDPs. Rather than outputting the full value function, the algorithm in this paper outputs an ordered list in which rewards should be processed using the same techniques as described in bertram . We propose a companion algorithm that can efficiently follow the optimal policy by calculating the value of neighboring states ondemand. We show performance against both value iteration and the prior algorithm for tractable state spaces.
2 Related Work
Dealing with the ’curse of dimensionality’ bellman2013dynamic
has been an ongoing struggle within the machine learning and optimization communities for many decades, especially within the Markov Decision Process community. Many attempts have been made to allow MDPs to scale to larger problems. Factored MDPs
schuurmans2002direct ; guestrin2003efficient attempt to alleviate the problem of state space explosion by identifying subsets of the MDP that can be broken into smaller problems. Primarily, though, approximation methods under the general umbrella of Approximate Dynamic Programming have been used as a compromise to obtain reasonable approximations of the underlying true value function in cases where the stateaction space (or the transition matrix ) is too large to represent with traditional exact methods, which are summarized by the excellent texts bertsekas1995dynamic ; powell2007approximate . Notably, linear function approximation methods such as GradientTD methods gradientTD ; tdcandgtd2 ; precup2001off , statistical evaluation methods such montecarlo tree search kocsis2006bandit , and nonlinear function approximation methods such as TDC with nonlinear function approximation bhatnagar2009convergent and DQN DQNatari are good examples of some of the approaches taken using approximation.bertram identifies a new method for a restricted class of MDPs that solves MDPs exactly by calculating the effect of each reward in the state space based off the distance to each reward. This paper relies heavily on that work and proposes an optimization to that algorithm that removes the dependency on the size of the state space.
3 Methodology
See bertram for details, but here we briefly recall for the reader that the key insight from this work was to describe an MDP in terms of a graph, to take advantage of known structure of the MDP, and to utilize discoveries on how the expected reward from multiple reward sources interplay and result in the value function.
The algorithm locates "peaks" in the value function due to the collection of rewards. It iteratively selects the most valuable peak and calculates an intermediate value function which represents the optimal solution of an MDP with the same environment but a subset of the rewards. The iterations continue until all rewards have been considered, and results in the optimal value function for the original MDP.
In the proof for the algorithm in bertram , it was shown that the complete value function can be determined from these peaks. As the algorithm processes each peak, it examines neighboring states, referring to the intermediate value function to look up values of these neighboring states. Note however that the number of neighboring states that are looked up is typically a very small number (on the order of ); in essence, because the values of a few states are needed, the values of all states are computed each iteration of the algorithm.
Instead, this paper changes the algorithm to compute the neighboring state values as needed from a list of the peaks sorted by order in which they were processed by the algorithm. Additionally, this paper proposes a mechanism to calculate the value of any state from this ordered list. During each iteration of the algorithm, this method calculates the value of any required states and results in a final list of how all peaks were processed.
This change to the algorithm severs its dependence on the size of the state space , effectively trading between additional computation time and memory storage. The intermediate value function can be viewed as a lookup table that improves computational efficiency; the new method essentially sacrifices this lookup table method for a slower computation based method that requires a pass through the list of peaks, a operation. However, when the number of rewards is small, this tradeoff can be acceptable, especially considering that the algorithm is no longer dependent on the size of the state space .
3.1 Algorithm
The only changes to the Exact algorithm from bertram are related to intermediate computation of the value function and subsequent lookup of the value of states which neighbor states that are under consideration during operation of the algorithm. To differentiate this algorithm from Exact , we name it Memoryless .
We begin by discussing the algorithm used to compute the value of a state of the intermediate (or final) value function, which requires the (possibly empty) ordered list of peaks that have been processed by previous iterations of the algorithm. In the event that the list of peaks is empty, a value of 0 is assumed. Note again that, as with bertram , only positive realvalued rewards are considered here.
The function iterates over all previously selected peaks, keeping track of the maximum value that could be derived from any of the previous peaks, which is the value of the state given the rewards that are represented by the selected peaks. This is at worst a operation, which grows from to as the rewards are processed. Note here that the data structure alluded to here for a peak contains fields for a primary and secondary state. For baseline and delta peaks only the primary is used, for combined peaks both the primary and secondary field are filled in; this is an artifact of implementation details of how the code represents combined peaks.
The remaining changes to the algorithm are simply to replace references to lookup of states in the intermediate value function with calls to this new function, and then removal of the allocation of memory and update of the value function, as shown below. See the Appendix for full pseudocode.
Line 2 initializes an empty list to track which peaks have been processed by the algorithm. Line 3 precomputes baseline peaks and combined peaks based off a list of reward sources and stores them in the form of a sorted list, sorted by value of each peak. Lines 48 continue until we have exhausted the potential peaks and each iteration of the loop whittles away at the list of possible peaks. Line 5 computes delta peaks for any remaining reward sources by calculating neighboring states values ondemand. Line 6 removes any peaks that have become invalid due to broken minimum cycles. Line 7 selects the peak with maximum value. Line 8 removes any other potential peaks in the list that are affected by selecting the peak with maximum value. Rather than returning a value function, we instead return the ordered list of peaks that have been processed by the algorithm.
To recover the full value function, we could simply call ValueOnDemand for each state in the state space and then follow the optimal policy normally. However, we also present a simple algorithm that follows the optimal policy given the final list of peaks produced by the algorithm that does not require computation of the full value function representation. Starting at the initial state, it computes the value of all neighboring states from the list of peaks. Once the value of all neighboring states are known, we then have enough information to determine which action is optimal. We see this as navigating the global value function using only local information about nearby states.
Line 2 initializes the current state, and (due to it being a continuous MDP) lines 36 loop forever. Line 4 finds the maximum neighbor for the current state (which consists of calling ValueOnDemand for each state that can be reached from the current state). Line 5 determines the (deterministic) action that leads to the neighbor with maximum value (which could be combined with line 4 but is separated here for clarity). Line 6 represents executing the selected action in the environment and receiving a new state from the environment, which is treated as the current state for the next pass through the loop.
As can be seen from the above, this ends up being a fairly straightforward extension of the prior work once the concepts in bertram are understood.
4 Experiments
Figure 2(a) shows the effects of varying the number of reward sources on the performance of the algorithm. For this result, a 50x50 grid world was used. The xaxis shows the number of reward sources used for a test configuration and the yaxis shows the length of time required to solve the MDP. For each test configuration, 10 randomly generated configurations were created for the number of reward sources specified in the test configuration with reward values ranging from 1 to 10. For each generated configuration, value iteration, the prior work Exact and this paper’s algorithm Memoryless
were run to obtain performance measurements. As an additional check, the exact solution calculated by this algorithm was compared to the value iteration result to ensure they produced the same result (within a tolerance due to value iteration approximating the exact solution due to the use of a bellman residual as a terminating condition.) In the plot, the bold line is the average and the colored envelope shows the standard deviation for each test configuration.
The figure shows that as the number of reward sources increases, value iteration remains invariant of the number of reward sources and the prior work grows slowly. In contrast, we see the tradeoff of increase in time complexity which is traded for not having to hold the value function in memory. For the algorithm proposed in this paper, for small numbers of reward sources the algorithm clearly continues to outperform value iteration. As the number of reward sources increases, however, an intersection point will occur and value iteration will begin to perform better. However, as the size of the state space increases so to does the execution time of value iteration, so the exact point where the intersection occurs will be problemspecific.
Figure 2(b) shows the effects of varying the size of the state space on the performance of the algorithm. For this a fixed number of reward sources (5) were used, and only the size of the state space was varied (by making the grid world larger). The x axis shows the number of states in the grid world (e.g., ) and the y axis shows length of time required to solve the MDP. For each grid world size, 10 randomly generated reward configurations with the fixed number of reward sources were generated. The results show that value iteration quickly increases in execution time, the prior work grows very slowly, and the algorithm proposed in this paper is invariant of the state space size.
Figure 2(c) shows the effects of varying the discount factor on the performance of the algorithm. For this test, a fixed number of reward sources (5) and state space size (50x50) were used, and only the discount factor was varied. The x axis shows the discount factor and the y axis shows the length of time required to solve the MDP. For each discount factor, 10 randomly generated reward configurations with the fixed discount factor were generated. The results show that value iteration increases apparently exponentially with the discount factor, whereas the prior work and the algorithm proposed in this paper are both invariant to the discount factor. This follows from the exact calculation of the value based off the distance, where the discount factor is simply a constant that is used in the calculation.
All tests were performed on a highend "gaming class" Alienware laptop with a quadcore Intel i7 running at 4.4 GHz with 32GB RAM without using any GPU hardware acceleration (i.e., CPU only). All code is single threaded, python only and no special optimization libraries other than numpy were used (for example, the python numba library was not used to accelerate numpy calculations.) Both value iteration and the proposed algorithm use numpy. The results presented here are meant to most fairly present the performance differences between the algorithms, thus further optimizations should yield improved performance beyond what is presented here.
5 Conclusion
In this paper, we have presented a novel approach to solving a certain subclass of deterministic continuous MDPs exactly that has no dependency on the size of the state space. This new algorithm’s computational speed greatly exceeds that of value iteration for sparse reward sources and, furthermore, is invariant to both the discount factor and the number of states in the state space. Performance of the algorithm is , where is the number of reward sources, is the number of actions, and is the number of states. Memory complexity for the algorithm is . We also propose an algorithm to follow the optimal policy using this technique which at each iteration is that leads to an efficient method to both solve the MDP and follow the optimal policy at runtime. Given the quick time to solve the MDP, it also lends itself to allowing the reward source locations to change arbitrarily between time steps. Given the lack of dependence on the size of the state space, this algorithm provides a way to solve previously intractable MDPs for which the stateaction space was too large to solve exactly.
For deterministic environments with sparse rewards such as certain robotics and unmanned vehicle problems, this new method’s performance allows computation to be performed with very minimal memory footprint allowing computations to be performed on very lowperforming and lowpower embedded hardware. If the number of rewards is sufficiently small, the algorithm could also perform sufficiently well to allow for realtime constraints to be met in an embedded environment such as a robot or unmanned vehicle.
To our knowledge, this is the first time that MDPs can be solved exactly without a full representation of the state space held in memory or relying on iterative convergence to the optimal policy or value function. If this method can be appropriately extended to a larger subset of MDPs (e.g., stochastic MDPs), it could result in broad impacts to the efficiency of solving certain types of MDPs useful in robotics and related spaces.
References
 [1] J. R. Bertram, X. Yang, and P. Wei. Fast Online Exact Solutions for Deterministic MDPs with Sparse Rewards. ArXiv eprints, May 2018.
 [2] Richard Bellman. Dynamic programming. Courier Corporation, 2013.

[3]
Dale Schuurmans and Relu Patrascu.
Direct valueapproximation for factored mdps. In Advances in Neural Information Processing Systems, pages 1579–1586, 2002. 
[4]
Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman.
Efficient solution algorithms for factored mdps.
Journal of Artificial Intelligence Research
, 19:399–468, 2003.  [5] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.
 [6] Warren B Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.
 [7] Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent o(n) temporaldifference algorithm for offpolicy learning with linear function approximation. In Advances in neural information processing systems, pages 1609–1616, 2009.
 [8] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradientdescent methods for temporaldifference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
 [9] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Offpolicy temporaldifference learning with function approximation. In ICML, pages 417–424, 2001.
 [10] Levente Kocsis and Csaba Szepesvári. Bandit based montecarlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
 [11] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporaldifference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
 [12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
6 Appendix 1: Detailed Pseudocode
Line 2 initializes an empty list to track which peaks have been processed by the algorithm. Line 3 precomputes baseline peaks and combined peaks based off a list of reward sources and stores them in the form of a sorted list, sorted by value of each peak. Lines 48 continue until we have exhausted the potential peaks and each iteration of the loop whittles away at the list of possible peaks. Line 5 computes delta peaks for any remaining reward sources by calculating neighboring states values ondemand. Line 6 removes any peaks that have become invalid due to broken minimum cycles. Line 7 selects the peak with maximum value. Line 8 removes any other potential peaks in the list that are affected by selecting the peak with maximum value. Rather than returning a value function, we instead return the ordered list of peaks that have been processed by the algorithm.
We next examine the ValueOnDemand function, presenting it out of the calling tree order so that we can characterize its computational complexity to understand its impact on the rest of the code:
The function iterates over all previously selected peaks, keeping track of the maximum value that could be derived from any of the previous peaks, which is the value of the state given the rewards that are represented by the selected peaks. This is at worst a operation, which grows from to as the rewards are processed. Note here that the data structure alluded to here for a peak contains fields for a primary and secondary state. For baseline and delta peaks only the primary is used, for combined peaks both the primary and secondary field are filled in; this is an artifact of implementation details of how the code represents combined peaks.
Line 2 initializes a sorted list that is sorted by value of the peaks. In Lines 34, a baseline peak is computed for each reward source. In lines 58, if any reward sources are next to each other, their combined peaks are computed. Note that at this stage, the new ValueOnDemand function is not called; because no peaks have been selected, the value function at this point is assumed to be zeros everywhere.
PrecomputePeaks() is a algorithm that is done one time at the beginning of the algorithm and yields a list with worst case length of entries (but only if the reward sources are all adjacent to each other).
Line 2 initializes a sorted list that is sorted by value of the peaks. Lines 37 compute delta peak for any reward sources that remain. Lines 46 use the new ValueOnDemand function to compute the value of the current and neighboring states. Line 67 properly sort the delta with respect to neighboring states.
ComputeDeltas( valueFunction ) in [1] was a algorithm that is done for each pass of the loop, but with the addition of the ValueOnDemand function, the complexity grows to .
Lines 25 remove any peaks that have become invalid.
PruneInvalidPeaks() in [1] was a algorithm that is done for each pass of the loop. With the ValueOnDemand function, it now grows to .
Lines 24 remove any peaks that have been eliminated by the choice of the peak with maximum value.
RemoveAffectedPeaks operates over the sortedPeaks list, but this also shrinks by entries each pass.
6.0.1 Time Complexity
The main loop of the Memoryless function is a function, but the ComputeDelta and PruneInvalidPeaks functions are both due to their usage of the ValueOnDemand function, bringing the overall algorithm complexity to . Note here there is no dependence upon the size of the state space .
For environments where the connected distance is not easily determined (arbitrary transition graph), then the complexity to determine the distance between states must be taken into consideration. However, it is assumed that this can be precomputed offline because is assumed to be stationary.
For environments like the 2D grid world where the structure of the space is known, determining the connected distance between states is a calculation, which can be represented as a simple function call to determine the neighbors of each state ondemand.
6.0.2 Memory Complexity
Memory complexity for the algorithm is
Comments
There are no comments yet.