Function approximation for Markov decision processes (MDPs) is an important problem in reinforcement learning. Simply extending classical representations from supervised learning is not straightforward because of the specific non-linear structure of MDPs; for example the linear parametrization of the value function is not totally adapted to the algebraic structure of the Bellman operator which is central in their analysis and involves “max” operations.
Following [24, 1] which applied similar concepts to problems in optimal control, we consider a different semi-ring than the usual ring , namely the max-plus semi-ring . The new resulting algebra is natural for MDPs, as for example for deterministic discounted MDPs, the Bellman operator happens to be additive and positively homogeneous for the max-plus algebra.
In this paper, we explore classical concepts in linear representations in machine learning and signal processing, namely approximations from a finite basis and sparse approximations through greedy algorithms such as matching pursuit[23, 22, 32, 12]
, explore them for the max-plus algebra and apply it to deterministic MDPs with known dynamics (where the goal is to estimate the optimal value function). We make the following contributions, after briefly reviewing MDPs in Section2:
In Section 3, we apply max-plus algebra tools to approximate the value iteration algorithm by a smaller-dimensional iteration based on a representation on dictionaries of value functions.
As shown in Section 4, the set-up naturally leads to novel theoretical results which are simply formulated due to the max-plus algebra structure. For example, when considering a fixed (non adaptive) finite basis, the computational complexity of approximating the optimal value function is not directly related to the number of states, but to notions of covering numbers.
In Section 5, in order to circumvent the curse of dimensionality in factored state-spaces, we consider adaptive basis that can adapt to particular problems leading to an algorithm similar to matching pursuit. It currently comes with no theoretical guarantees but works empirically well in Section 6 on simple deterministic MDPs derived from low-dimensional control problems.
2 Markov Decision Processes
, transition probabilities, and a discount factor . In this paper, we focus on the goal of finding (or approximating) the optimal value function , from which the optimal policy , that leads to the optimal sum of discounted rewards, can be obtained as . We assume that the transition probabilities and the reward function are known.
We denote by the Bellman operator from to defined as, for a function ,
Value iteration algorithm.
The usual value iteration algorithm considers the recursion , which converges exponentially fast to if . More precisely, if is defined as , and we initialize at , we reach precision after at most iterations . In this paper, we consider discount factors which are close to , and the term dependent on or will always be the leading one—this thus excludes from consideration sampling-based algorithms with a better complexity in terms of and but worse dependence on the discount factor (see, e.g., [30, 17] and references therein). Throughout this paper, we are going to refer to as the horizon), which characterizes the expected number of steps in the future that need to be taken into account for computing rewards.
In this paper, we consider deterministic MDPs, i.e., MDPs for which given a state and an action , a deterministic state is reached. For these MDPs, choosing an action is equivalent to choosing the reachable state . Thus, the transition behavior if fully characterized by an edge set , and we obtain the resulting reward function , where is the maximal reward from actions leading from state to state , which is defined to be if . The Bellman operator then takes the form
We also consider large state spaces (typically coming from the discretization of control problems). A classical example will be factored state-spaces where , but we do not assume in general factorized dependences of the reward function on certain variables like in factored MDPs .
3 Max-plus Algebra applied to Deterministic MDPs
Many works consider a regular linear parameterization of the value function (see, e.g.,  and references therein), as where is a set of basis functions . Following , we consider max-plus linear combinations. For more general properties of max-plus algebra, see, e.g., [2, 8, 10].
3.1 Max-plus-linear operators and max-plus-linear combinations
In this section, we consider algebraic properties of our problem within the max-plus-algebra. For deterministic MDPs, the key property is that the Bellman operator is additive and max-plus-homogeneous, that is, (a) for any constant , which can be rewritten as (where the equality for , is an extension of the relationship ), and (b) , which can be rewritten as , where all operations are taken element-wise for all .
We now explore various max-plus approximation properties . First, regular linear combinations become: . We introduce the notation
so that the equation above may be rewritten as . Inverses of max-plus-linear operator are typically not defined, but a weaker notion of pseudo-inverse (often called “residuation” ) can be defined due to the idempotence of . We thus consider the operator defined as
We have, for the pointwise partial order on , , that is:
One approximation algorithm would be to replace by , that is, if we consider of the form , then would be of the form where
which requires to solve at each iteration an infimum problem over , which is computionally expensive as , which is typically worse than for classical value iteration. We are thus looking for an extra approximation that will lead to a decomposition where after some compilation, the iteration complexity is only dependent on the number of basis functions.
3.2 Max-plus transposition
An important idea from  is to first project onto a different low-dimensional different image, with a projection which is efficient. In the regular functional linear setting, this would be equivalent to imposing equality only for certain efficient measurements. We thus define, given a set of functions from to (which can be equal to or not),
The notation comes from the following definition of dot-product; we define the max-plus dot-product between functions and from to (and more generally for all functions defined on or ) as . We then have for any , , hence the transpose notation. We can also define the residuation as:
so that goes from to . The operator on functions from to is also the projection on the image of ; moreover, for all and is idempotent and non-expansive for the -norm. Note that so that properties of can be inferred from the ones of . Similarly and, .
3.3 Reduced value iteration
Extending  to MDPs, we consider of , the iteration If is represented as , then is represented as where , which we can decompose as and .
The key point is then that the two operators and , can be pre-computed at a cost that will be independent of the discount factor . Indeed, given and , we have:
Therefore, the computational complexity of the iteration is , once the values , and have been computed. This requires some compilation time which is independent of and the final required precision. Note that if these values are computed up to some precision , then we get an overall extra approximation factor of . The iterations then become
As seen below, they correspond to -contractant operators and are thus converging exponentially fast. In the worst case (full graph), the complexity is . Moreover, as presented below, a good approximation of by and leads to an approximation guarantee (see proof in App. A.2).
(a) The operator is -contractive and has a unique fixed point . (b) If and , then .
Therefore, an approximation guarantee for translates to an approximation error multiplied by the horizon . Thus large horizons (i.e., large ) will degrade performance (see examples in Figure 2). If we consider a discount factor of (corresponding to the operator instead of , for , see below), in the result above, is replaced by (see proof in App. A.2).
After steps of the approximate algorithms in Eq. (5) and Eq. (6), starting from zero, we get such that , and thus such that , with the overall bound if the assumption (b) in Prop. 1 is satisfied. Thus to get an error of for a fixed , it is sufficient to have an approximation error and a number of iterations . Thus, the overall complexity will be proportional to in addition to the compilation time required to compute the values and (which is independent of ).
We can apply the reasoning above to and replace by , with then fewer iterations in the leading terms, but a more expensive compilation time. The horizon is then equal to which is equivalent to for . Moreover, when , the approximation error is reduced and we only need to have . This will also be helpul within matching pursuit (see Section 5).
4 Approximation by Max-plus Operators
In this section, we consider several classes of functions and , with their associated approximation properties, that are needed for Prop. 1 to provide interesting guarantees. Some of these sets are already present in  (distance and squared distance functions), whereas others are new (piecewise constant approximations and Bregman divergences). We present examples of such approximations in Figure 1 and Figure 2, where we note a difference between the approximation of some function by or and their approximation with the same basis functions within an MDP, as obtained from Prop. 1 (with very large, it would be equivalent, but not for small ).
4.1 Indicator functions and piecewise constant approximations
We consider functions of the form if and otherwise, where is a subset of , with forming a partition of . For simplicity, we assume here that . The image of is the set of piecewise constant functions with respect to this partition. Given a function , is the lower approximation of by a piecewise constant function, while is the upper approximation of by a piecewise constant function (see Figure 1).
Since the image of and are the same, the iteration reduces to . Thus, representing as with , we get
where and is the set of such that is finite. This is is exactly a deterministic MDPs on the clusters. The complexity of the algorithm depends on some compilation time, in order to compute the reduced matrix (or the one corresponding to ), and some running-time per iteration. These depends whether we consider a dense graph or a sparse -dimensional graphs (corresponding to a -dimensional grid). The complexities are presented below (with all constants dependent on removed), and proved in App. C.
|Compilation time||Iteration time|
|Value iteration (dense)|
|Value iteration (sparse)|
|Reduced MDP (dense)|
|Reduced MDP (sparse)|
For all cases above, the optimization error (to approximate in Prop. 1) is ; thus, in the sparse case, the number of iterations to achieve precision is proportional to . Below, for simplicity, we are only considering trade-offs for the sparse graph cases. For plain value iteration, having larger than one could be beneficial only when (otherwise, the compilation time scales as and the iteration running time as , which are both increasing in ). When using the clustered representation, it seems that the situation is the same, but as shown later the approximation error comes into play and larger values of could be useful (both for running time with fixed dictionaries and for stability of matching pursuit).
If we denote by the smallest radius of a cover of by balls of a given radius, by considering the Voronoi partition associated with the ball centers, we get:
where is the -th order Hölder continuity constant of (i.e., so that for all , ).
While for factored spaces and with no assumptions on the optimal value function beyond continuity, may not scale well, the approximation could be much better in special cases (e.g., when depends only on a subset of variables). For factored spaces , with a -metric, then . If each factor is simple (e.g., a chain graph), then , and we get . Here, we do not escape the curse of dimensionality.
Going beyond exponential complexity.
In order to avoid the rate above, we can consider further assumptions: if is well approximated by a piecewise constant function with respect to a specific partition (dedicated to ), the bound in Eq. (8) can be greatly reduced. Note that the approximation with a small number of basis functions here is the same for and when (this will not be the case in Section 4.2 below).
We can get a reduced set of basis functions, if for example depends only on variables, then there is a partition for which the error in Eq. (8) is of the order , and we can then escape the curse of dimensionality if is small and we can find these relevant variables. This will be done algorithmically in Section 5, by considering a greedy algorithm in with sets and a split according to a single dimension at every iteration. This variable selection does not need to be global and the covering number can benefit from local independences.
Optimal choices of hyperparameters.
The approximation error from Prop. 2 is of order , and thus, for , in order to achieve a final approximation error of , we need to be of the order of . Without compilation time this leads to a complexity proportional to ; thus, it is advantageous to have large values of when (full dependence). This has to be mitigated by the compilation time that is less than , which grows with and . We will see in Section 6 that larger values are also better for a good selection of dictionary elements in matching pursuit.
4.2 Distance functions and piecewise-affine functions
Following , we consider functions of the form for , and a distance on . Thus may be identified to a subset of . When is a subset of and with or metrics, we get piecewise affine functions (see Figure 1). We then have (see proof in App. B.2):
(a) If and (Lipschitz-constant of ), then .
(b) If , then .
We thus get an approximation guarantee for Prop. 1 from a good covering of . The approximation also applies to . In terms of approximation guarantees, then we need to cover with sufficiently many elements of , and thus with points in we get an approximation of exactly the same order than for clustered partitions (but with the need to know the Lipschitz constant to set the extra parameter).
In terms of approximation by a finite basis, a problem here is that being well approximated by some , for small, does not mean that one can be well approximated by , with small (it is in one dimension, not in higher dimensions). Moreover, these functions are not local so computing and could be harder. Indeed, the compile time is while each iteration of the reduced algorithm is .
Distance functions for variable selection.
To allow variable selection, we can consider a more general family of distance functions, namely, function of the form or where for factored spaces. We consider the second option in our experiments in Section 6.
As outlined by , in the Euclidean continuous case, we can consider square distance funtions, which we generalize to Bregman divergences in Appendix B.3. Note that these will approximate smooth functions with more favorable approximation guarantees (but note that in MDPs obtained from the discretization of a continuous control problem, optimal value functions are non-smooth in general ). Finally, other functions could be considered as well, such as linear functions restricted to a subset, or ridge functions of the form .
If we select a random set of points from a Poisson process with fixed intensity, then the maximal diameter of the associated random Voronoi partition has a known scaling , similar to the covering number up to logarithmic terms. Thus, we can use distance functions from Section 4.2 (where we can sample the constant
from an exponential distribution) or clusters from Section4.1, or also squared distances like in Section B.3.
5 Greedy Selection by Matching Pursuit
The sets of functions proposed in Section 4 still suffer from the curse of dimensionality, that is, the cardinalities and should still scale (at most linearly) with , and thus exponentially in dimension if
is a factored state-space. We consider here greedily selecting new functions, to use dictionaries adapted to a given MDP, mimicking the similar approach of sparse decompositions in signal processing and unsupervised learning[23, 22].
We assume that we are given and , and we want to test new sets and , which are close to and (typically one function split in two, that is, or one additional function ). Note that pruning  could also be considered as well.
We assume that we have the optimal and (after convergence of the iterations defined in Eq. (5) and Eq. (6)), such that and . We denote by and . The criterion will be based on considering the best improvement of the new projections and to the relevant function.
Criterion for .
Given the current difference between and its projection , the criterion to minimize is , for a given norm . If , we have , and our criterion thus becomes for the -norm.
Criterion for .
Given the current difference between and its projection , the criterion to minimize is . If , we have , and and our criterion becomes .
Like in regular matching pursuit for classical linear approximation, allowing full flexibility for or would lead to trivial choices, here and : in our MDP situation, we essentially end up with few changes compared to plain value iteration as the new atoms are just obtained by using our own approximation or applying the transition operator to another approximation (i.e., ). Thus, within matching pursuit, in order to benefit from a reduction in global approximation error, ensuring the diversity of the dictionary of atoms which we select from is crucial. To go beyond selection from a finite set (and learn a few parameters), we propose a convex relaxation of estimating or in App. D.
Special case of partitions.
In this case , we have , and thus we only need to consider above. A simplification there is to consider the current set which contains attaining the bound above, and only try to split this cluster.
Variable selection within an MDP could be seen as a special case of factored MDPs  where the reward function depends on a subset of variables, it would be interesting to study theoretically more complex dependences. Beyond factored MDPs, variable selection and more generally function approximation within MDPs has been considered in several works (see, e.g., [5, 27, 18, 21, 7, 20]), but do not provide the theoretical analysis that we provide here or consider linear function approximations, while we consider max-plus combinations;  considers variable resolution within the discretization of an optimal control problem, but not for a generic Markov decision process.
All experiments can be exactly reproduced with the Matlab code which can be obtained from the author’s webpage.111www.di.ens.fr/~fbach/maxplus.zip
Simulations on discretizations of control problems on .
We consider the discretization of , which we can identify to . Given the set of actions , following , we consider the dynamical systems (going left or right). The goal of the control problem is to maximize where is the exit time of from (i.e., reaching the boundary). We then define the value function as the supremum over controls starting from .
Then satisfies the Hamilton-Jacobi-Bellman (HJB) equation [11, 25]: , with the boundary condition for . With the discretization above, with a step , we have the absorbing states and . We thus need to construct the reward from to , and from to , for any equal to the reweighted function for the reached state, and a discount factor equal to . From states and , no move can be made and the reward is equal to and . When goes to zero, then the MDP solution converges to the solution of the optimal control problem.
In our example in Figure 1 and Figure 2, we consider pairs that satisfy the HJB equation. In Figure 3 (top), for the same problem, we consider the performance of our greedy method (matching pursuit with an -norm criterion) or fixed basis for piecewise constant functions from Section 4.1 and piecewise affine functions from Section 4.2, when the horizon varies (from values of ) and the number of basis functions varies. We can see (a) the benefits of piecewise affine over piecewise constant functions in the sets and (that is, better approximation properties with the same number of basis functions), (b) the beneficial effect of larger (i.e., using instead of ), in particular for greedy techniques where the selection of good atoms does require a larger .
Simulations on discretizations of control problems on .
We consider two-dimensional extensions where the optimal value function only depends on a single variable (see details and more experiments in Appendix E.2, with a full dependence on the two variables) and we show performance plots in Figure 3 (bottom). Because of the sparsity assumption, the benefits of matching pursuit are greater than for the one-dimensional case (empirically, only relevant atoms are selected, and thus converges to zero faster as the number of basis functions increases).
In this paper, we have presented a max-plus framework for value function approximation with a greedy matching pursuit algorithm to select atoms from a large dictionary. While our current framework and experiments deal with low-dimensional deterministic MDPs, there are many avenues for further algorithmic and theoretical developments.
First, for non-deterministic MDPs, the Bellman operator is unfortunately not max-plus additive; however, there are natural extensions to general MDPs using measure-based formulations on probability measures on [16, 19], where using a policy , one goes from a measure to , with a reward going to and equal to
. The difficulty here is the increased dimensionality of the problem due a new problem defined on probability measures. Also, going beyond model-based reinforcement learning could be done by estimation of model parameters; the max-plus formalism is naturally compatible with dealing with confidence intervals. Finally, it would be worth exploring multi-resolution techniques which are common in signal processing, to deal with higher-dimensional problems, where short-term and long-term interactions could be partially decoupled.
We acknowledge support the European Research Council (grant SEQUOIA 724063).
-  Marianne Akian, Stéphane Gaubert, and Asma Lakhoua. The max-plus finite element method for solving deterministic optimal control problems: basic properties and convergence analysis. SIAM Journal on Control and Optimization, 47(2):817–848, 2008.
-  François Baccelli, Guy Cohen, Geert Jan Olsder, and Jean-Pierre Quadrat. Synchronization and Linearity: an Algebra for Discrete Event Systems. John Wiley & Sons, 1992.
-  Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.
-  Dimitri P. Bertsekas. Weighted sup-norm contractions in dynamic programming: A review and some new applications. Technical Report LIDS-P-2884, Dept. Elect. Eng. Comput. Sci., Massachusetts Institute of Technology, 2012.
-  Dimitri P. Bertsekas and David A. Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598, 1989.
-  Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
-  Lucian Busoniu, Damien Ernst, Bart De Schutter, and Robert Babuska. Cross-entropy optimization of control policies with adaptive basis functions. IEEE Transactions on Systems, Man, and Cybernetics, 41(1):196–209, 2011.
-  Peter Butkovič. Max-linear Systems: Theory and Algorithms. Springer Science & Business Media, 2010.
-  Pierre Calka and Nicolas Chenavier. Extreme values for characteristic radii of a Poisson-Voronoi tessellation. Extremes, 17(3):359–385, 2014.
-  Guy Cohen, Stéphane Gaubert, and Jean-Pierre Quadrat. Duality and separation theorems in idempotent semimodules. Linear Algebra and its Applications, 379(1):395–422, 2004.
-  Michael G. Crandall and Pierre-Louis Lions. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society, 277(1):1–42, 1983.
-  Bogdan Dumitrescu and Paul Irofti. Dictionary Learning Algorithms and Applications. Springer, 2018.
-  Stéphane Gaubert, William McEneaney, and Zheng Qu. Curse of dimensionality reduction in max-plus based approximation methods: Theoretical estimates and improved pruning algorithms. In Conference on Decision and Control and European Control Conference, pages 1054–1061, 2011.
-  Matthieu Geist and Olivier Pietquin. A brief survey of parametric value function approximation. Technical report, Supélec, 2010.
Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman.
Efficient solution algorithms for factored MDPs.
Journal of Artificial Intelligence Research, 19:399–468, 2003.
-  Onésimo Hernández-Lerma and Jean-Bernard Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media, 2012.
-  Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. Technical Report 1802.09184, arXiv, 2018.
-  Philipp W. Keller, Shie Mannor, and Doina Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2006.
-  Jean-Bernard Lasserre, Didier Henrion, Christophe Prieur, and Emmanuel Trélat. Nonlinear optimal control via occupation measures and lmi-relaxations. SIAM Journal on Control and Optimization, 47(4):1643–1666, 2008.
-  De-Rong Liu, Hong-Liang Li, and Ding Wang. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. International Journal of Automation and Computing, 12(3):229–242, 2015.
-  Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169–2231, 2007.
-  Julien Mairal, Francis Bach, and Jean Ponce. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014.
-  Stéphane Mallat. A Wavelet Tour of Signal Processing: the Sparse Way. Academic Press, 2008.
-  William M. McEneaney. Error analysis of a max-plus algorithm for a first-order HJB equation. In Stochastic Theory and Control, pages 335–351. Springer, 2002.
-  Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
-  Rémi Munos and Andrew Moore. Variable resolution discretization in optimal control. Machine Learning, 49(2-3):291–323, 2002.
-  Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier, and Carlos Guestrin. Greedy linear value-approximation for factored Markov decision processes. In Proc. AAAI, 2002.
-  Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
-  Joan Serra-Sagristà. Enumeration of lattice points in -norm. Information Processing Letters, 76(1-2):39–44, 2000.
-  Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2018.
-  Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
-  Vladimir Temlyakov. Sparse Approximation with Bases. Springer, 2015.
Appendix A Value iteration and reduced value-iteration convergence
In this appendix, we provide short proofs for the lemmas and propositions of the main paper related to (reduced) value iteration.
a.1 Approximation of the value function through approximate fixed points
This is taken from [4, Prop. 2.1] and presented because the proof structure is used later.
Lemma 1 ()
If , then .
. We have
, because is -contractive. This leads to
, thus by letting tend to infinity.
a.2 Proof Prop. 1
(a) This is consequence of the non-expansiveness of and , and the -contractiveness of .
(b) We have: . Using the non-expansivity of , we get
This imples implies that using the same reasoning as in the proof of Lemma 1.
The result extends to assumptions on instead of (and similarly for ). Indeed, we have, with ,
by non-expansivity of . By taking the infimum over , we get that