I Introduction and Related Work
Decision-making in uncertain environments is a basic problem in the area of artificial intelligence[1, 2], and Markov decision processes (MDPs) have become very popular for modeling non-deterministic planning problems with full observability [3, 4].
In this paper, we investigate the exact solution methods for solving the MDPs.We are especially interested in the MDPs that are widely applied in the artificial intelligence domain, such as the decision-theoretic planning , where the agent following the optimal policy eventually enter a terminal state defined as a goal/destination.
Our objective is to further improve the convergence behavior of the MDPs’ solving mechanism. To realize that, we are introducing a new MDP solution method with a novel design, and we are showing that such design allows us to gain insights on designing efficient heuristics to speed up the convergence of the MDPs. In greater detail, we introduce the notion ofreachability landscape which is essentially a “grid-map" measuring the degree of difficulty for each state transiting to the terminal/goal state. By “reachability" we mean that based on the current discrete stochastic system, how hard it is for the agent to transit from the current state to the given goal state. In other words, reachability can be viewed as the reachability with regards to arriving at the terminal state.
To compute the reachability landscape, we use the Mean First Passage Time (MFPT) which can be formulated into a simple linear system. We show that such reachability characterization of each state reflects the importance of this state, and thus provides a natural basis that can be used to upgrade and accelerate both value iteration (VI) and policy iteration (PI) mechanisms, e.g., it can be utilized for prioritizing states for the standard VI process and approximating policy for the standard PI procedure. More specifically, we design two novel algorithms that have been modified from the state-of-the-art solution methods. Our two algorithms include the Mean First Passage Time based Value Iteration (MFPT-VI) and the Mean First Passage Time based Policy Iteration (MFPT-PI). Experimental evaluations showed that both MFPT-VI and MFPT-PI are superior to the state-of-the-art methods in terms of both the practical runtime and the number of iterations needed to converge.
and reinforcement learning problems[7, 8]. In general, the fundamental solving mechanism is to exploit the dynamic programming (DP) structure which can be computed in polynomial time. Value iteration and policy iteration are two of the most famous and most widely used algorithms to solve the MDPs [9, 10].
An important related heuristic for efficiently solving MDPs is the prioritized sweeping , which has been broadly employed to further speed up the value iteration process. This heuristic evaluates each state and obtains a score based on the state’s contribution to the convergence, and then prioritizes/sorts all states based on their scores (e.g., those states with larger difference in value between two consecutive iterations will get higher scores) [12, 13]. Then in the immediately next dynamic programming iteration, evaluating the value of states follow the newly prioritized order. The prioritized sweeping heuristic is also leveraged in our MFPT based value iteration procedure, and comparisons with baseline approaches have been conducted in our experimental section.
Important related frameworks for solving MDPs also include compact representations such as linear function representation and approximation [9, 3] used in the policy iteration algorithms. The linear equation based techniques (detailed formulation is provided in the paper) do not exploit regions of uniformity in value functions associated with states, but rather a compact form of state features that can somewhat reflect values . Our method for computing the MFPT can also be formulated into a linear system. However, the intermediate results generated from MFPT are more direct: they very well capture – and also allow us to visualize – the “importance" of states, and can lead to a faster convergence speed which is demonstrated in the experiments.
Another relevant strategy is called real-time dynamic programming (RTDP)  where states are not treated uniformly. Specifically, in each DP iteration, only a subset of most important states might be explored, and the selection of the subset of states are usually built on and related to agent’s exploration experience. For a single DP iteration, the RTDP usually requires less computation time in comparison to the classic DP where all states need to be swept, and thus can be extended as an online process and integrated into the real-time reinforcement learning framework . Similar strategies also include the state abstraction [17, 18], where states with similar characteristics are hierarchically and/or adaptively grouped together, either in offline static or online dynamic aggregation style. Although we believe our proposed framework can be easily extended to the fashions of RTDP’s partial states exploration and the adaptive states abstraction/computation, in this work we consider the complete and full exploration of all states, and compare with state-of-the-art methods that evaluate across the entire state space.
Ii-a Markov Decision Processes
A Markov Decision Process (MDP) is a tuple , where is a set of states and is a set of actions. The state transition function is a probability function such that
is a probability function such thatis the probability that action in state will lead to state , and is a reward function where returns the immediate reward received on taking action in state that will lead to state .
A Markov system is defined as absorbing if from every non-terminal state it is possible to eventually enter a terminal state such as a goal/destination state . We restrict our attention to absorbing Markov systems so that the agent can arrive and stop at a goal.
Ii-B MDP Solution Methods
Here we discuss prevalent solution methods including the value iteration and policy iteration, as well as state-of-the-art heuristics/variants.
Ii-B1 Value Iteration
The value iteration (VI) is an iterative procedure that calculates the value (or utility in some literature) of each state based on the values of the neighbouring states that can directly transit to. The value of state at each iteration can be calculated by the Bellman equation shown below
where is a reward discounting parameter. The stopping criteria for the algorithm is when the values calculated on two consecutive iterations are close enough, i.e., , where is an optimization tolerance/threshold value, which determines the level of convergence accuracy.
Prioritized Sweeping: The standard Bellman recursion evaluates values of all states in a sweeping style, following the index of the states stored in the memory. To speed up the convergence, a heuristic called prioritized sweeping has been proposed and widely used as a benchmark framework for non-domain-specific applications. The algorithm labels a state as more “interesting" or more “important" during a particular iteration, if the change in the state value is higher when compared to its previous iteration. The essential idea is that, the larger the value changes, the higher impact that updating that state will change its dependent states, thereby taking a larger step towards convergence.
Ii-B2 Policy Iteration
The policy iteration (PI) is an anytime algorithm because at any iteration there will be a feasible policy (although not necessarily optimal). The PI involves the following two steps. The first step also known as policy evaluation is responsible for calculating the value of each state given some fixed policy until convergence. This is followed by a step called policy improvement. Here, the policy is updated based on the resulting converged values of each state. The optimal policy is obtained at convergence when , where is the iteration index. Each policy improvement step is based on the following equation:
There are two ways of computing the policy evaluation step.
The first approach is to perform Bellman updates iteratively till convergence given some fixed policy using the below equation.
We represent this version of policy iteration where policy evaluation is performed using Bellman updates as Policy Iteration (PI) throughout the paper.
The second approach to perform policy evaluation is through linear equation approximation where we solve a linear system as shown in Eq. (4). This outputs the optimal values of each state corresponding to a given policy.
The variables in this system of linear equation are , whereas are constants given a fixed policy. We call this version of policy iteration as Policy Iteration - Linear Equation (PI-LE) throughout the paper.
Ii-C Mean First Passage Times
The first passage time (FPT), , is defined as the number of state transitions involved in reaching states when started in state for the first time. The mean first passage time (MFPT), from state to is the expected number of hopping steps for reaching state given initially it was in state . The MFPT analysis is built on the Markov chain, and has nothing to do with the agent’s actions. Remember that, when a MDP is associated to a fixed policy, it then behaves as a Markov chain .
Formally, let us define a Markov chain with states and transition probability matrix, . If the transition probability matrix is regular, then each MFPT, , satisfies the below conditional expectation formula:
where, represents the transition probability from state to , and is an event where the first state transition happens from state to . From the definition of mean first passage times, we have, . So, we can rewrite Eq. (5) as follows.
Since, = 1, Eq. (6) can be formulated as per the below equation:
Solving all MFPT variables can be viewed as solving a system of linear equations
The values , , , represents the MFPTs calculated for state transitions from states , , , to and . To solve above equation, efficient decomposition methods  may help to avoid a direct matrix inversion.
Iii Technical Approach
Iii-a Reachability Characterization using MFPT
The notion of MFPT allows us to define the reachability of a state. By “reachability of a state" we mean that based on current fixed policy, how hard it is for the agent to transit from the current state to the given goal/absorbing state. With all MFPT values obtained, we can construct a reachability landscape which is essentially a “map" measuring the degree of difficulty of all states transiting to the goal.
Fig. 1 shows a series of landscapes represented in heatmap in our simulated environment. The values in the heatmap range from 0 (cold color) to 100 (warm color). In order to better visualize the low MFPT spectrum that we are most interested, any value greater than 100 has been clipped to 100. Fig. 1-1 show the change of landscapes as the MFPT-PI algorithm proceeds. Initially, all states except the goal state are initialized as unreachable, as shown by the high MFPT color in Fig. 1.
We observe that the reachability landscape conveys very useful information on potential impacts of different states. More specifically, a state with a better reachability (smaller MFPT value) is more likely to make a larger change during the MDP convergence procedure, leading to a bigger convergence step. With such observation and the new metric, we can design state prioritization schemes for the VI and policy update heuristics for the PI, which are discussed next.
Iii-B Mean First Passage Time based Value Iteration (MFPT-VI)
Classic VI converges the solution through Bellman update as shown in Eq. (1), which essentially sweeps all states by treating them equally important. The “prioritized sweeping" mechanism has further improved the convergence rate by exploiting and ranking the states based on their potential contributions, where the metric is based on comparing the difference of values for each state between two consecutive iterations.
The new MFPT-VI method is also built on the prioritized sweeping mechanism, but we propose a metric using the reachability (MFPT values), because as aforementioned, the reachability characterization of each state reflects the potential impact/contribution of this state, and thus provides a natural basis for prioritization. Our reachability based metric is different from the existing value-difference metric as follows: 0.7cm The reachability landscape can very well capture the degree of importance for all states from a global viewpoint, whereas the classic value-difference strategy evaluates states locally, and may fail grasping the correct global “convergence direction" due to the local viewpoint. Note that, since the MFPT computation is relatively expensive, it is not necessary to compute the MFPT at every iteration, but rather after every few iterations. The computational process of MFPT-VI is pseudo-coded in Alg. 1.
Iii-C Mean First Passage Time based Policy Iteration (MFPT-PI)
Remember that the classic PI algorithm involves two steps: policy evaluation and policy improvement. In this sub-section, we propose a new policy improvement mechanism.
Policy Evaluation: Classic policy evaluation utilizes either local policy optimization (shown in Eq. (3)) or a linear equation approximation (shown in Eq. (4)) to obtain a new set of values . Our MFPT-PI also employs the local optimization rule described in Eq. (3).
Policy Improvement: The policy improvement of MFPT-PI is based on MFPT values. To approximate a good and feasible policy, the operation can be even simpler than the classic approach. We propose an intuitively simple and computationally cheap heuristic which shows great performance: the action selection can be following (or weighted by) the “gradient" of reachability landscape. For instance, in the agent motion planning scenario where each state transits to some set of states in the vicinity, the locally optimal action can be selected to transit to the neighbouring state with the least MFPT value.
Alternating the above two steps allow us to converge to the solution. Fig. 1 illustrates the basic idea. In the beginning, a random policy is initialized and all states except the goal are pre-set with high MFPT values, see Fig. 1. In the first iteration, after completing the policy evaluation, the reachability landscape is shown as Fig. 1, which characterizes the reachability regions at a high level. The result is more like an image segmentation solution where regions with distinct reachability levels are roughly partitioned: since in the middle area there are a few obstacles that separate the space, the reachability landscape is split into two regions, with upper area being well reachable and the lower part fully un-reachable. With a couple more iterations, the reachability landscape is refined until low level details are well characterized (see Figs 1-1). In essence, a major benefit of the proposed MFPT-PI algorithm is that: at earlier stage of the convergence, it captures the big picture; and later it focuses on improving the details once the policy becomes better.
It is worth mentioning that, if there are multiple goal/absorbing states, then each goal state will require to compute its own MFPT landscape, and the final reachability landscape will be normalized by all MFPTs corresponding to all terminal states.
Iii-D Time Complexity Analysis
The classic VI algorithm has a time complexity of per iteration where represents number of states and represents number of actions. The classic PI has a time complexity of if a linear equation solver is used (the matrix decomposition has a time complexity of if state-of-the-art algorithms are employed , given that the size of matrix is the number of states ). Since, both the proposed MFPT-VI and MFPT-PI algorithms involve a key step for calculating the MFPT which also needs to solve a linear system, therefore, for each iteration, both the MFPT-VI and MFPT-PI algorithms have a time complexity of . Although, the MFPT is an extra component (and cost) introduced to the VI and PI mechanisms, our reachability characterization enables the MDP solving process to converge in much fewer steps. An important point to note here is that the MFPT values don’t need to be computed frequently at every iteration and it is only calculated when characterizing global feature is needed, which also reduces the practical runtime.
It is also worth mentioning that, although the time complexity for matrix decomposition is in general, for sparse matrices, efficient decomposition techniques perform much faster than where the time complexity depends on the number of non-zero elements in the matrix. For example, in certain robotic motion planning scenario, the robot transits to states within a certain vicinity causing the transition probability matrix to be sparse by nature. Inspite of this additional time complexity per iteration, MFPT based algorithms have faster overall runtime than existing solution methods because they converge in much fewer steps. Results are presented in the following section.
Iv Experimental Results
Iv-a Experimental Setting
We validated our method through numerical evaluations with two different simulators running on a Linux machine. We consider the MDP problem where each action can lead to transitions to all other states with certain transition probabilities. However, in many practical scenarios, the probability of transiting from a state to another state that is “weakly connected" can be small, even close to 0. This can potentially result in non-dense transition matrix. For example, in the robotic motion planning scenario, the state transition probability from state to state can be correlated with the time or distance of traveling from to , and it is more likely for a state to transit to some states within certain vicinity.
For the first task, we developed a simulator in C++ using OpenGl. To obtain the discrete MDP states, we tessellate the agent’s workspace into a 2D grid map, and represent the center of each grid as a state. In this way, the state hopping between two states represents the corresponding motion in the workspace. Each non-boundary state has a total of nine actions, i.e., a non-boundary state can move in the directions of N, NE, E, SE, S, SW, W, NW, plus an idle action returning to itself. A demonstration is shown in Fig. 2.
For the second task, we developed a simulator in C++ using ROS and rviz. The agent’s workspace is partitioned into a 3D grid map where the center of each grid represents a MDP state. Each non-boundary state has a total of five actions, i.e., a non-boundary state can move in the directions of N, E, S, W, TOP, BOTTOM plus an idle action.
In both tasks, the objective of the agent is to reach a goal state from an initial start location in the presence of obstacles. The reward function for both setups is represented as high positive reward at the goal state and -1 for obstacle states. All experiments were performed on a laptop computer with 8GB RAM and 2.6 GHz quad-core Intel i7 processor.
Iv-B 2D Grid Setup
In this setup, we compare our proposed algorithms with their corresponding baseline algorithms in terms of the practical runtime performance, iterations required to converge and their convergence profile. Later, we also performed some analysis on time costs of individual components and MFPT invoking frequency.
Practical Runtime Performance: We first investigate the time taken by the algorithms with changing number of states. Fig. 3 compares the time taken by VI, VI-PS and MFPT-VI algorithms. In MFPT-VI, the MFPT component is computed every three iterations. Also, the convergence thresholds are set the same for all methods. The results show that VI is the slowest, and VI-PS is faster than VI due to the prioritized sweeping heuristic. And our proposed MFPT-VI is much faster than the other two algorithms. Although, time taken per iteration for MFPT-VI is higher than that of the classic VI, overall the MFPT-VI takes the least time due to much fewer convergence steps and also because MFPT values are computed every three iterations. Below, we will further elaborate on why we have decided not to calculate the MFPT values every iteration.
Fig. 3 compares the time taken by PI, PI-LE, and MFPT-PI. The results also show that the MFPT-PI is the fastest compared to the other two algorithms. Note that, both MFPT-PI and PI-LE require to compute linear equations, and in our implementation we use the Eigen library. (Since the reachability characterization does not require high MFPT accuracy, among many provided decomposition solvers we chose one of the fast variants which however do not provide the highest accuracy.)
Number of Iterations: We then analyze the number of iterations taken by the algorithms as the number of states change. Fig. 4 compares the number of iterations taken by VI, VI-PS and MFPT-VI, respectively. The results reveal that MFPT-VI converges the fastest compared to the other two algorithms. Fig. 4 compares the iterations taken by PI, PI-LE, and MFPT-PI. Again, we can see that MFPT-PI converges the fastest among all three algorithms.
Convergence Profile: Here, we investigate the detailed convergence profile along with the increasing number of iterations. Since one important criterion to judge the convergence is to see if the error/difference between two consecutive iterations is small enough, thus, we utilize the maximum error
across all states as an evaluation metric. Specifically, for VI, VI-PS, MFPT-VI, the erroris defined as
where , represent the values of states at iteration and . For PI, PI-LE and MFPT-PI, the is defined as the total number of mismatches between two consecutive iterations’ policies:
where and represent the policy at iteration and , respectively.
Fig. 5 shows that initially VI, VI-PS and MFPT-VI start with the same value. It is very obvious that our MFPT-VI approach has an extremely steep decrement in the first few iterations. For example, in our simulation scenario with around 2500 states and a threshold value of 0.1, the MFPT-VI converged in only 19 iterations; in contrast, VI took 55 iterations and VI-PS took 45.
Fig. 5 shows profiles for PI, PI-LE and MFPT-PI. Note that, with the policy obtained after the first iteration, MFPT-PI already produced a smaller value compared to the other two algorithms. After that, the values in case of MFPT-PI are lower than those of PI and PI-LE. For instance, in the same simulation example mentioned above, our MFPT-PI method converged in 9 iterations whereas PI took 13 iterations and PI-LE took 11 iterations. The profiles also indicate that the PI based methods take much fewer iterations, but more time per iteration, than those of VI based approaches.
Time Costs of Components: We also looked into the detailed time taken by critical components of each algorithm. The VI related algorithms are analyzed in Fig. 6-6. Specifically, for VI most time is used for Bellman backup operations as shown in Fig. 6; for VI-PS there is an additional step involved in sorting the states for the purpose of states prioritization, which is negligible when compared with the cost of Bellman backup, as illustrated by Fig. 6. In contrast, in our MFPT-VI algorithm, a big chunk of time is used for computing the MFPT values, as shown in Fig. 6.
Fig. 6-6 show results for PI based algorithms. We recorded the time taken by the policy evaluation and policy improvement phases. In addition, for MFPT-PI, we also recorded the time used by MFPT calculation. From the results we can see that, policy evaluation dominates the time in PI and PI-LE, whereas for MFPT-PI, the MFPT calculation takes most of the time as shown in Fig. 6. However, MFPT-PI compensated for the expensive MFPT calculation by converging in fewer iterations, leading to a faster convergence in general.
MFPT Invoking Frequency: As mentioned earlier, the MFPT values are computed after every few iterations when there is a need for characterizing global features. Next, we present some analysis related to this.
In Fig. 7, the number on the -axis represents that the MFPT values are computed after every p VI iterations. The time taken by MFPT-VI initially decreases to a certain point and then increases from there on as p increases. We also observe a similar behavior in Fig. 7 when we analyze the number of iterations taken to converge under varying .
This implies that, (1) invoking MFPT calculation at every iteration is not the optimal way; and (2) for different application scenarios with different patterns of transition matrices, the best range of can be obtained by running a few trials, using probing mechanisms such as the binary search. In our testing scenario, a ranging from three to five is a good choice.
Iv-C 3D Grid Setup
A demonstration of the trajectory taken by the agent in 3D simulation environment is shown in Fig. 8. Next, we compare our proposed algorithms with their corresponding baseline algorithms in terms of the practical runtime performance and iterations required to converge.
We first investigate the time taken by the algorithms with changing number of states. Fig. 9 compares the time taken by VI, VI-PS and MFPT-VI algorithms. The results show that our proposed MFPT-VI is much faster than the VI and VI-PS. Also, as expected, due to the prioritized sweeping heuristic, VI-PS is faster than VI. In Fig. 9, we compare the time taken by PI, PI-LE, and MFPT-PI. The results show that MFPT-PI is the fastest compared to the other two algorithms.
The outstanding performance of MFPT-VI and MFPT-PI indicate that, reachability characterization based framework is superior to state-of-the-art solutions, yet the implementation of our methods are as simple as the classic ones.
As in the case of 2D grid setup, we also analyze the number of iterations required by the algorithms to converge as the number of states change. In Fig. 10,we compare the number of iterations taken by VI, VI-PS and MFPT-VI, respectively. Similar to the 2D grid setup, our results reveal that MFPT-VI converges the fastest compared to VI and VI-PS. Fig. 10 compares the iterations taken by PI, PI-LE, and MFPT-PI. Again, we can see that MFPT-PI converges the fastest among all three algorithms.
Another interesting observation is that as the number of states increase, the number of iterations required to converge flattens out for both MFPT-VI and MFPT-PI. These results clearly show the power of reachability characterization, which very well captures the convergence optimization feature, thereby requiring much fewer iterations to reach optimality.
In this paper, we propose a new framework for efficiently solving the MDPs. The proposed method explores reachability of states using MFPT values, which characterizes the degree of difficulty of reaching given goal states. Different from the classic VI and PI methods, the proposed solutions based on MFPTs reflect a patterned landscape of states, which very well captures – and also allows us to visualize – the degree of importance of states. The reachability characterization enables one to design efficient heuristics such as value prioritization and policy approximation, and we propose two specific algorithms called MFPT-VI and MFPT-PI. We show that the implementation of proposed methods is as simple as classic methods, but our algorithms converge much faster, with less runtime and fewer iterations than state-of-art approaches.
-  Stuary Russell and Peter Norvig. Artifical intelligence: A modern approach. Accessed October 22, 2004 at http://aima.cs.berkeley.edu/, 2002.
-  Olivier Sigaud and Olivier Buffet. Markov decision processes in artificial intelligence. John Wiley & Sons, 2013.
-  Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
-  Douglas J White. A survey of applications of markov decision processes. Journal of the Operational Research Society, 44(11):1073–1096, 1993.
-  Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
Richard S Sutton.
Integrated architectures for learning, planning, and reacting based
on approximating dynamic programming.
Proceedings of the seventh international conference on machine learning, pages 216–224, 1990.
-  Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators, volume 39. CRC press, 2010.
-  Martijn van Otterlo and Marco Wiering. Reinforcement learning and markov decision processes. In Reinforcement Learning, pages 3–42. Springer, 2012.
-  Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
-  D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, N.J., 1987.
-  Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. In Machine Learning, pages 103–130, 1993.
-  David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. Advances in Neural Information Processing Systems, 1998.
-  David Wingate and Kevin D Seppi. Prioritization methods for accelerating mdp solvers. Journal of Machine Learning Research, 6(May):851–881, 2005.
-  Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial intelligence, 121(1):49–107, 2000.
-  Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81–138, January 1995.
-  Blai Bonet and Hector Geffner. Labeled rtdp: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.
-  David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
-  Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps. In ISAIM, 2006.
-  John G Kemeny, Hazleton Mirkill, J Laurie Snell, and Gerald L Thompson. Finite mathematical structures. Prentice-Hall, 1959.
-  David Assaf, Moshe Shared, and J. George Shanthikumar. First-passage times with pfr densities. Journal of Applied Probability, 22(1):185–196, 1985.
-  Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.