The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of the environment to plan and make high-quality decisions (Kaelbling et al., 1996; Sutton & Barto, 1998). Although the behavior of model-based algorithms in tabular environments is well understood and can be effective (Sutton & Barto, 1998), scaling up to the approximate setting can cause instabilities. Even small model errors can be magnified by the planning process resulting in poor performance (Talvitie, 2014).
In this paper, we study model-based RL through the lens of Lipschitz continuity, intuitively related to the smoothness of a function. We show that the ability of a model to make accurate multi-step predictions is related to the model’s one-step accuracy, but also to the magnitude of the Lipschitz constant (smoothness) of the model. We further show that the dependence on the Lipschitz constant carries over to the value-prediction problem, ultimately influencing the quality of the policy found by planning.
We consider a setting with continuous state spaces and stochastic transitions where we quantify the distance between distributions using the Wasserstein metric. We introduce a novel characterization of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set of component deterministic functions. This allows us to study any stochastic dynamic using the Lipschitz continuity of its component deterministic functions. To learn a Lipschitz model class in continuous state spaces, we provide an Expectation-Maximization algorithm(Dempster et al., 1977).
One promising direction for mitigating the effects of inaccurate models is the idea of limiting the complexity of the learned models or reducing the horizon of planning (Jiang et al., 2015)
. Doing so can sometimes make models more useful, much as regularization in supervised learning can improve generalization performance(Tibshirani, 1996). In this work, we also examine a type of regularization that comes from controlling the Lipschitz constant of models. This regularization technique can be applied efficiently, as we will show, when we represent the transition model by neural networks.
We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple. Here, by we mean a continuous state space and by we mean a discrete action set. The functions and denote the reward and transition dynamics. Finally, is the discount rate. If , the setting is called a Markov reward process (MRP).
2.1 Lipschitz Continuity
Our analyses leverage the “smoothness” of various functions, quantified as follows.
Given two metric spaces and consisting of a space and a distance metric, a function is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as
Equivalently, for a Lipschitz ,
The concept of Lipschitz continuity is visualized in Figure 1.
A Lipschitz function is called a non-expansion when and a contraction when . Lipschitz continuity, in one form or another, has been a key tool in the theory of reinforcement learning (Bertsekas, 1975; Bertsekas & Tsitsiklis, 1995; Littman & Szepesvári, 1996; Müller, 1996; Ferns et al., 2004; Hinderer, 2005; Rachelson & Lagoudakis, 2010; Szepesvári, 2010; Pazis & Parr, 2013; Pirotta et al., 2015; Pires & Szepesvári, 2016; Berkenkamp et al., 2017; Bellemare et al., 2017) and bandits (Kleinberg et al., 2008; Bubeck et al., 2011). Below, we also define Lipschitz continuity over a subset of inputs.
A function is uniformly Lipschitz continuous in if
Note that the metric is defined only on .
2.2 Wasserstein Metric
We quantify the distance between two distributions using the following metric:
Sometimes referred to as “Earth Mover’s distance”, Wasserstein is the minimum expected distance between pairs of points where the joint distribution is constrained to match the marginals and . New applications of this metric are discovered in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017).
Wasserstein is linked to Lipschitz continuity using duality:
This equivalence, known as Kantorovich-Rubinstein duality (Villani, 2008), lets us compute Wasserstein by maximizing over a Lipschitz set of functions , a relatively easier problem to solve. In our theory, we utilize both definitions, namely the primal definition (3) and the dual definition (4).
3 Lipschitz Model Class
We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a set of deterministic components.
Given a metric state space and an action space , we define as a collection of functions: distributed according to where . We say that is a Lipschitz model class if
Our definition captures a subset of stochastic transitions, namely ones that can be represented as a state-independent distribution over deterministic transitions. An example is provided in Figure 2. We further prove in the appendix (see Claim 1) that any finite MDP transition probabilities can be decomposed into a state-independent distribution over a finite set of deterministic functions .
Associated with a Lipschitz model class is a transition function given by:
Given a state distribution , we also define a generalized notion of transition function given by:
We are primarily interested in , the Lipschitz constant of . However, since takes as input a probability distribution and also outputs a probability distribution, we require a notion of distance between two distributions. This notion is quantified using Wasserstein and is justified in the next section.
4 On the Choice of Probability Metric
We consider the stochastic model-based setting and show through an example that the Wasserstein metric is a reasonable choice compared to other common options.
Consider a uniform distribution over statesas shown in black in Figure 3 (top). Take a transition function in the environment that, given an action , uniformly randomly adds or subtracts a scalar . The distribution of states after one transition is shown in red in Figure 3 (middle). Now, consider a transition model that approximates by uniformly randomly adding or subtracting the scalar . The distribution over states after one transition using this imperfect model is shown in blue in Figure 3 (bottom).
We desire a metric that captures the similarity between the output of the two transition functions. We first consider Kullback-Leibler (KL) divergence and observe that:
unless the two constants are exactly the same.
The next possible choice is Total Variation (TV) defined as:
if the two distributions have disjoint supports regardless of how far the supports are from each other.
In contrast, Wasserstein is sensitive to how far the constants are as:
It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely approximates . This is particularly important in high-dimensional spaces where the true distribution is known to usually lie in low-dimensional manifolds. (Narayanan & Mitter, 2010)
5 Understanding the Compounding Error Phenomenon
To extract a prediction with a horizon , model-based algorithms typically apply the model for steps by taking the state input in step to be the state output from the step . Previous work has shown that model error can result in poor long-horizon predictions and ineffective planning (Talvitie, 2014, 2017). Observed even beyond reinforcement learning (Lorenz, 1972; Venkatraman et al., 2015), this is referred to as the compounding error phenomenon. The goal of this section is to provide a bound on multi-step prediction error of a model. We formalize the notion of model accuracy below:
Given an MDP with a transition function , we identify a Lipschitz model as -accurate if its induced satisfies:
We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error and the Lipschitz constant of the transition function . We provide a bound on the Lipschitz constant of using the following lemma:
A generalized transition function induced by a Lipschitz model class is Lipschitz with a constant:
Intuitively, Lemma 1 states that, if the two input distributions are similar, then for any action the output distributions given by are also similar up to a factor. We prove this lemma, as well as the subsequent lemmas, in the appendix.
Given the one-step error (Definition 5), a start state distribution and a fixed sequence of actions , we desire a bound on -step error:
where and is defined similarly. We provide a useful lemma followed by the theorem.
(Composition Lemma) Define three metric spaces , , and . Define Lipschitz functions and with constants and . Then, is Lipschitz with constant .
Similar to composition, we can show that summation preserves Lipschitz continuity with a constant bounded by the sum of the Lipschitz constants of the two functions. We omitted this result due to brevity.
Define a -accurate with the Lipschitz constant and an MDP with a Lipschitz transition function with constant . Let . Then :
We construct a proof by induction. Using Kantarovich-Rubinstein duality (Lipschitz property of not shown for brevity) we first prove the base of induction:
We now prove the inductive step. Assuming we can write:
There exist similar results in the literature relating one-step transition error to multi-step transition error and sub-optimality bounds for planning with an approximate model. The Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009) is for discrete state MDPs and relates error in the one-step model to the value obtained by using it for planning. A related result for continuous state-spaces (Kakade et al., 2003) bounds the error in estimating the probability of a trajectory using total variation. A second related result (Venkatraman et al., 2015) provides a slightly looser bound for prediction error in the deterministic case—our result can be thought of as a generalization of their result to the probabilistic case.
6 Value Error with Lipschitz Models
We next investigate the error in the state-value function induced by a Lipschitz model class. To answer this question, we consider an MRP denoted by and a second MRP that only differs from the first in its transition function . Let be the action set with a single action . We further assume that the reward function is only dependent upon state. We first express the state-value function for a start state with respect to the two transition functions. By below, we mean a Dirac delta function denoting a distribution with probability 1 at state .
Next we derive a bound on .
Assume a Lipschitz model class with a -accurate with . Further, assume a Lipschitz reward function with constant . Then and
We first define the function It can be observed that . We now write:
Let . Then given :
We can derive the same bound for using the fact that Wasserstein distance is a metric, and therefore symmetric, thereby completing the proof. ∎
Regarding the tightness of our bounds, we can show that when the transition model is deterministic and linear then Theorem 1 provides a tight bound. Moreover, if the reward function is linear, the bound provided by Theorem 2 is tight. (See Claim 2 in the appendix.) Notice also that our proof does not require a bounded reward function.
7 Lipschitz Generalized Value Iteration
We next show that, given a Lipschitz transition model, solving for the fixed point of a class of Bellman equations yields a Lipschitz state-action value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) for planning with arbitrary backup operators.
To prove the result, we make use of the following lemmas.
Given a Lipschitz function with constant :
The following operators (Asadi & Littman, 2017) are Lipschitz with constants:
For any non-expansion backup operator outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by if .
Two implications of this result: First, PAC exploration in continuous state spaces is shown assuming a Lipschitz value function (Pazis & Parr, 2013). However, the theorem shows that it is sufficient to have a Lipschitz model, an assumption perhaps easier to confirm. The second implication relates to value-aware model learning (VAML) objective (Farahmand et al., 2017). Using the above theorem, we can show that minimizing Wasserstein is equivalent to minimizing the VAML objective (Asadi et al., 2018).
Our first goal in this section111We release the code here: github.com/kavosh8/Lip is to compare TV, KL, and Wasserstein in terms of the ability to best quantify error of an imperfect model. To this end, we built finite MRPs with random transitions, states, and . In the first case the reward signal is randomly sampled from , and in the second case the reward of an state is the index of that state, so small Euclidean norm between two states is an indication of similar values. For trials, we generated an MRP and a random model, and then computed model error and planning error (Figure 4). We understand a good metric as the one that computes a model error with a high correlation with value error. We show these correlations for different values of in Figure 5.
It is known that controlling the Lipschitz constant of neural nets can help in terms of improving generalization error due to a lower bound on Rademacher complexity (Neyshabur et al., 2015; Bartlett & Mendelson, 2002). It then follows from Theorems 1 and 2 that controlling the Lipschitz constant of a learned transition model can achieve better error bounds for multi-step and value predictions. To enforce this constraint during learning, we bound the Lipschitz constant of various operations used in building neural network. The bound on the constant of the entire neural network then follows from Lemma 2. In Table 1, we provide Lipschitz constant for operations (see Appendix for proof) used in our experiments. We quantify these results for different -norms .
Given these simple methods for enforcing Lipschitz continuity, we performed empirical evaluations to understand the impact of Lipschitz continuity of transition models, specifically when the transition model is used to perform multi-step state-predictions and policy improvements. We chose two standard domains: Cart Pole and Pendulum. In Cart Pole, we trained a network on a dataset of tuples . During training, we ensured that the weights of the network are smaller than . For each , we performed 20 independent model estimation, and chose the model with median cross-validation error.
Using the learned model, along with the actual reward signal of the environment, we then performed stochastic actor-critic RL. (Barto et al., 1983; Sutton et al., 2000) This required an interaction between the policy and the learned model for relatively long trajectories. To measure the usefulness of the model, we then tested the learned policy on the actual domain. We repeated this experiment on Pendulum. To train the neural transition model for this domain we used samples. Notably, we used deterministic policy gradient (Silver et al., 2014) for training the policy network with the hyper parameters suggested by Lillicrap et al. (2015). We report these results in Figure 6.
Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory, controlling the Lipschitz constant in practice can combat the compounding errors and can help in the value estimation problem. This ultimately results in learning a better policy.
We next examined if the benefits carry over to stochastic settings. To capture stochasticity we need an algorithm to learn a Lipschitz model class (Definition 4). We used an EM algorithm to joinly learn a set of functions , parameterized by , and a distribution over functions . Note that in practice our dataset only consists of a set of samples and does not include the function the sample is drawn from. Hence, we consider this as our latent variable . As is standard with EM, we start with the log-likelihood objective (for simplicity of presentation we assume a single action in the derivation):
where we used Jensen’s inequality and concavity of log in the last line. This derivation leads to the following EM algorithm.
In the M step, find by solving for:
In the E step, compute posteriors:
Note that we assume each point is drawn from a neural network with probability:
and with a fixed variancetuned as a hyper-parameter.
We used a supervised-learning domain to evaluate the EM algorithm. We generated 30 points from 5 functions (written at the end of Appendix) and trained 5 neural networks to fit these points. Iterations of a single run is shown in Figure 7 and the summary of results is presented in Figure 8. Observe that the EM algorithm is effective, and that controlling the Lipschitz constant is again useful.
We next applied EM to train a transition model for an RL setting, namely the gridworld domain from Moerland et al. (2017). Here a useful model needs to capture the stochastic behavior of the two ghosts. We modify the reward to be -1 whenever the agent is in the same cell as either one of the ghosts and 0 otherwise. We performed environmental interactions for 1000 time-steps and measured the return. We compared against standard tabular methods(Sutton & Barto, 1998), and a deterministic model that predicts expected next state (Sutton et al., 2008; Parr et al., 2008). In all cases we used value iteration for planning.
Results in Figure 9 show that tabular models fail due to no generalization, and expected models fail since the ghosts do not move on expectation, a prediction not useful for planner. Performing value iteration with a Lipschitz model class outperforms the baselines.
We took an important step towards understanding model-based RL with function approximation. We showed that Lipschitz continuity of an estimated model plays a central role in multi-step prediction error, and in value-estimation error. We also showed the benefits of employing Wasserstein for model-based RL. An important future work is to apply these ideas to larger problems.
The authors recognize the assistance of Eli Upfal, John Langford, George Konidaris, and members of Brown’s Rlab specifically Cameron Allen, David Abel, and Evan Cater. The authors also thank anonymous ICML reviewer 1 for insights on a value-aware interpretation of Wasserstein.
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
- Asadi & Littman (2017) Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.
- Asadi et al. (2018) Asadi, K., Cater, E., Misra, D., and Littman, M. L. Equivalence between wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265, 2018.
- Bartlett & Mendelson (2002) Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pp. 834–846, 1983.
- Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.
- Bellman (1957) Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.
- Berkenkamp et al. (2017) Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–919, 2017.
- Bertsekas (1975) Bertsekas, D. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
- Bertsekas & Tsitsiklis (1995) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pp. 560–564. IEEE, 1995.
- Bubeck et al. (2011) Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.
- Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
Farahmand et al. (2017)
Farahmand, A.-M., Barreto, A., and Nikovski, D.
Value-Aware Loss Function for Model-based Reinforcement Learning.In
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1486–1494, 2017.
- Ferns et al. (2004) Ferns, N., Panangaden, P., and Precup, D. Metrics for finite markov decision processes. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 162–169. AUAI Press, 2004.
- Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. G-learning: Taming the noise in reinforcement learning via soft updates. Uncertainty in Artifical Intelligence, 2016.
- Gao & Pavel (2017) Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.
- Hinderer (2005) Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.
- Jiang et al. (2015) Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181–1189, 2015.
- Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
- Kakade et al. (2003) Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.
- Kearns & Singh (2002) Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
Kleinberg et al. (2008)
Kleinberg, R., Slivkins, A., and Upfal, E.
Multi-armed bandits in metric spaces.
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM, 2008.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Littman & Szepesvári (1996) Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310–318, 1996.
- Lorenz (1972) Lorenz, E. Predictability: does the flap of a butterfly’s wing in Brazil set off a tornado in Texas? na, 1972.
- Moerland et al. (2017) Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.
- Müller (1996) Müller, A. Optimal selection from distributions with unknown parameters: Robustness of bayesian models. Mathematical Methods of Operations Research, 44(3):371–386, 1996.
- Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.
Narayanan & Mitter (2010)
Narayanan, H. and Mitter, S.
Sample complexity of testing the manifold hypothesis.In Advances in Neural Information Processing Systems, pp. 1786–1794, 2010.
- Neu et al. (2017) Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
- Neyshabur et al. (2015) Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, pp. 1376–1401, 2015.
Parr et al. (2008)
Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L.
An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning.In Proceedings of the 25th international conference on Machine learning, pp. 752–759. ACM, 2008.
- Pazis & Parr (2013) Pazis, J. and Parr, R. Pac optimal exploration in continuous space markov decision processes. In AAAI, 2013.
- Pires & Szepesvári (2016) Pires, B. Á. and Szepesvári, C. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pp. 121–151, 2016.
- Pirotta et al. (2015) Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.
- Rachelson & Lagoudakis (2010) Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics, ISAIM 2010, Fort Lauderdale, Florida, USA, January 6-8, 2010, 2010.
- Russell & Norvig (1995) Russell, S. J. and Norvig, P. Artificial intelligence: A modern approach, 1995.
- Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
- Strehl et al. (2009) Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
- Sutton et al. (2008) Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, pp. 528–536, 2008.
- Szepesvári (2010) Szepesvári, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
- Talvitie (2014) Talvitie, E. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780–789. AUAI Press, 2014.
- Talvitie (2017) Talvitie, E. Self-correcting models for model-based reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017.
- Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
- Vaserstein (1969) Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.
- Venkatraman et al. (2015) Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., 2015.
- Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
In a finite MDP, transition probabilities can be expressed using a finite set of deterministic functions and a distribution over the functions.
Let denote the probability of a transiton from to when executing the action . Define an ordering over states with an additional unreachable state . Now define the cumulative probability distribution:
Further define as the set of distinct entries in :
Note that, since the MDP is assumed to be finite, then is finite. We sort the values of and denote, by , th smallest value of the set. Note that and . We now build determinstic set of functions as follows: and , define if and only if:
We also define the probability distribution over as follows:
Given the functions and the distribution , we can now compute the probability of a transition to from after executing action :