1 Introduction
Function approximation for Markov decision processes (MDPs) is an important problem in reinforcement learning. Simply extending classical representations from supervised learning is not straightforward because of the specific nonlinear structure of MDPs; for example, the linear parametrization of the value function is not fully adapted to the algebraic structure of the Bellman operator, which is central in their analysis and involves "max" operations.
Following [24, 1], which applied similar concepts to problems in optimal control, we consider a different semiring than the usual ring $(\mathbb{R}, +, \times)$, namely the max-plus semiring $(\mathbb{R} \cup \{-\infty\}, \max, +)$. The resulting new algebra is natural for MDPs: for example, for deterministic discounted MDPs, the Bellman operator happens to be additive and positively homogeneous for the max-plus algebra.
In this paper, we explore classical concepts in linear representations in machine learning and signal processing, namely approximations from a finite basis and sparse approximations through greedy algorithms such as matching pursuit [23, 22, 32, 12], extend them to the max-plus algebra, and apply them to deterministic MDPs with known dynamics (where the goal is to estimate the optimal value function). We make the following contributions, after briefly reviewing MDPs in Section 2:
In Section 3, we apply max-plus algebra tools to approximate the value iteration algorithm by a smaller-dimensional iteration based on a representation on dictionaries of value functions.

As shown in Section 4, the setup naturally leads to novel theoretical results which are simply formulated due to the max-plus algebra structure. For example, when considering a fixed (non-adaptive) finite basis, the computational complexity of approximating the optimal value function is not directly related to the number of states, but to notions of covering numbers.

In Section 5, in order to circumvent the curse of dimensionality in factored state-spaces, we consider adaptive bases that can adapt to particular problems, leading to an algorithm similar to matching pursuit. It currently comes with no theoretical guarantees but works well empirically in Section 6 on simple deterministic MDPs derived from low-dimensional control problems.
2 Markov Decision Processes
We consider a standard MDP [28, 31], defined by a finite state space $\mathcal{X}$, a finite action space $\mathcal{A}$, a reward function $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, transition probabilities $p(x'|x,a)$, and a discount factor $\gamma \in [0,1)$. In this paper, we focus on the goal of finding (or approximating) the optimal value function $V_*$, from which an optimal policy $\pi_*$, that leads to the optimal sum of discounted rewards, can be obtained as $\pi_*(x) \in \arg\max_{a \in \mathcal{A}} \big\{ r(x,a) + \gamma \sum_{x' \in \mathcal{X}} p(x'|x,a) V_*(x') \big\}$. We assume that the transition probabilities and the reward function are known. We denote by $T$ the Bellman operator from $\mathbb{R}^\mathcal{X}$ to $\mathbb{R}^\mathcal{X}$ defined as, for a function $V \in \mathbb{R}^\mathcal{X}$,
$$(TV)(x) = \max_{a \in \mathcal{A}} \Big\{ r(x,a) + \gamma \sum_{x' \in \mathcal{X}} p(x'|x,a) V(x') \Big\}.$$
The optimal value function $V_*$ is the unique fixed point of $T$. In order to find an approximation of $V_*$, we simply need to find $V$ such that $\|TV - V\|_\infty$ is small, as $\|V - V_*\|_\infty \leq \frac{1}{1-\gamma} \|TV - V\|_\infty$ (see proof in App. A.1, taken from [4, Prop. 2.1]).
Value iteration algorithm.
The usual value iteration algorithm considers the recursion $V_{t+1} = T V_t$, which converges exponentially fast to $V_*$ when $\gamma < 1$ [31]. More precisely, if $R$ is defined as $R = \max_{x,a} |r(x,a)|$, and we initialize at $V_0 = 0$, we reach precision $\varepsilon$ after at most $\frac{1}{1-\gamma} \log \frac{R}{\varepsilon(1-\gamma)}$ iterations [31]. In this paper, we consider discount factors $\gamma$ which are close to $1$, and the term depending on $\gamma$ will always be the leading one; this thus excludes from consideration sampling-based algorithms with a better complexity in terms of the numbers of states and actions but a worse dependence on the discount factor (see, e.g., [30, 17] and references therein). Throughout this paper, we refer to $\frac{1}{1-\gamma}$ as the horizon of the MDP (this is the expectation of a random variable with geometric distribution proportional to powers of $\gamma$), which characterizes the expected number of steps in the future that need to be taken into account for computing rewards.
Deterministic MDPs.
In this paper, we consider deterministic MDPs, i.e., MDPs for which, given a state $x$ and an action $a$, a deterministic state $x'$ is reached. For these MDPs, choosing an action is equivalent to choosing the reachable state $x'$. Thus, the transition behavior is fully characterized by an edge set $E \subseteq \mathcal{X} \times \mathcal{X}$, and we obtain the resulting reward function $r(x,x')$, where $r(x,x')$ is the maximal reward among actions leading from state $x$ to state $x'$, and is defined to be $-\infty$ if $(x,x') \notin E$. The Bellman operator then takes the form
$$(TV)(x) = \max_{x' : (x,x') \in E} \big\{ r(x,x') + \gamma V(x') \big\}.$$
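For illustration, plain value iteration on a deterministic MDP given by an edge set can be sketched in a few lines; the toy graph, rewards, and discount factor below are made up for the example (this is not the author's Matlab code):

```python
# Toy deterministic MDP given by an edge set: edges[x] maps state x to the
# dict {x': r(x, x')} of reachable states and edge rewards (hypothetical example).
edges = {0: {1: 1.0, 2: 0.0},
         1: {2: 2.0},
         2: {3: 0.5, 0: 0.0},
         3: {0: 1.0}}
gamma = 0.9

def bellman(V):
    # (TV)(x) = max over edges (x, x') of [ r(x, x') + gamma * V(x') ]
    return [max(r + gamma * V[y] for y, r in edges[x].items()) for x in edges]

V = [0.0] * len(edges)
for _ in range(300):   # of the order of (1/(1-gamma)) log(R / (eps (1-gamma))) steps
    V = bellman(V)

# V is (numerically) the unique fixed point of T: the residual ||TV - V|| is tiny.
residual = max(abs(tv - v) for tv, v in zip(bellman(V), V))
```

The residual bound of App. A.1 then guarantees that `V` is within `residual / (1 - gamma)` of the optimal value function.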
We focus primarily on deterministic MDPs but note in Section 7 that the framework can be applied to all MDPs by considering a measure-based formulation [16, 19].
Factored state-spaces.
We also consider large state spaces (typically coming from the discretization of control problems). A classical example will be factored state-spaces where $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, but we do not assume in general factorized dependences of the reward function on certain variables like in factored MDPs [15].
3 Max-Plus Algebra applied to Deterministic MDPs
Many works consider a regular linear parameterization of the value function (see, e.g., [14] and references therein), as $V(x) = \sum_{i=1}^n \alpha_i w_i(x)$, where $\{w_1, \dots, w_n\}$ is a set of basis functions from $\mathcal{X}$ to $\mathbb{R}$. Following [1], we consider max-plus linear combinations. For more general properties of the max-plus algebra, see, e.g., [2, 8, 10].
3.1 Max-plus-linear operators and max-plus-linear combinations
In this section, we consider algebraic properties of our problem within the max-plus algebra. For deterministic MDPs, the key property is that the Bellman operator is additive and max-plus-homogeneous, that is, (a) $T(V + c) = TV + \gamma c$ for any constant $c \in \mathbb{R}$ (which can be rewritten multiplicatively within the max-plus semiring), and (b) $T(\max(V, V')) = \max(TV, TV')$, where all operations are taken elementwise for all $V, V' \in \mathbb{R}^\mathcal{X}$.
We now explore various max-plus approximation properties [10]. First, regular linear combinations become $V(x) = \max_{i \in \{1,\dots,n\}} \{ \alpha_i + w_i(x) \}$. We introduce the notation
$$(W\alpha)(x) = \max_{i \in \{1,\dots,n\}} \{ \alpha_i + w_i(x) \}, \qquad (1)$$
so that the equation above may be rewritten as $V = W\alpha$. Inverses of max-plus-linear operators are typically not defined, but a weaker notion of pseudo-inverse (often called "residuation" [2]) can be defined due to the idempotence of $\max$. We thus consider the operator $W^+$ defined as
$$(W^+ v)_i = \min_{x \in \mathcal{X}} \{ v(x) - w_i(x) \}, \qquad (2)$$
which satisfies, for the pointwise partial order on $\mathbb{R}^\mathcal{X}$, $W\alpha \leq v \Leftrightarrow \alpha \leq W^+ v$; that is, $W^+ v$ is the largest $\alpha$ such that $W\alpha \leq v$.
Moreover, as shown in [10, 1], $W W^+ W = W$ and $W^+ W W^+ = W^+$, and thus $W^+$ plays the role of a pseudo-inverse, and $W W^+$ the role of a projection on the image of $W$; moreover, $W W^+ v \leq v$ for all $v$, and $W W^+$ is idempotent and nonexpansive for the $\ell_\infty$-norm.
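These residuation properties are easy to check numerically. In the sketch below, the conventions $(W\alpha)(x) = \max_i \{\alpha_i + w_i(x)\}$ and $(W^+ v)_i = \min_x \{v(x) - w_i(x)\}$ follow Eq. (1) and Eq. (2); the random basis and target function are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis, n_states = 5, 30
W = rng.normal(size=(n_basis, n_states))   # W[i, x] = w_i(x), arbitrary basis
v = rng.normal(size=n_states)              # target function to project

def W_apply(alpha):
    # (W alpha)(x) = max_i [ alpha_i + w_i(x) ]   (max-plus linear combination)
    return np.max(alpha[:, None] + W, axis=0)

def W_plus(v):
    # residuation: (W^+ v)_i = min_x [ v(x) - w_i(x) ],
    # the largest alpha such that W alpha <= v pointwise
    return np.min(v[None, :] - W, axis=1)

P = lambda u: W_apply(W_plus(u))           # projection on the image of W

assert np.all(P(v) <= v + 1e-12)           # W W^+ v <= v
assert np.allclose(P(P(v)), P(v))          # idempotence of W W^+
u = v - rng.uniform(0, 1, size=n_states)
assert np.max(np.abs(P(v) - P(u))) <= np.max(np.abs(v - u)) + 1e-12  # l_inf nonexpansive
```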
One approximation algorithm would be to replace the recursion $V_{t+1} = T V_t$ by its projected version; that is, if we consider $V_t$ of the form $V_t = W \alpha_t$, then $V_{t+1}$ would be of the form $W \alpha_{t+1}$ where $\alpha_{t+1} = W^+ T W \alpha_t$, which requires solving at each iteration an infimum problem over the full state space $\mathcal{X}$, which is computationally expensive and typically worse than classical value iteration. We are thus looking for an extra approximation that will lead to a decomposition where, after some compilation, the iteration complexity only depends on the number of basis functions.
3.2 Max-plus transposition
An important idea from [1] is to first project onto a different low-dimensional image, with a projection which is efficient to compute. In the regular functional linear setting, this would be equivalent to imposing equality only for certain efficient measurements. We thus define, given a set of functions $\{z_1, \dots, z_p\}$ from $\mathcal{X}$ to $\mathbb{R}$ (which can be equal to $\{w_1, \dots, w_n\}$ or not),
$$(Z^\top v)_j = \max_{x \in \mathcal{X}} \{ z_j(x) + v(x) \}. \qquad (3)$$
The notation comes from the following definition of dot-product; we define the max-plus dot-product between functions $u$ and $v$ from $\mathcal{X}$ to $\mathbb{R}$ (and more generally for all functions defined on $\mathcal{X}$ with values in $\mathbb{R} \cup \{-\infty\}$) as $\langle u, v \rangle = \max_{x \in \mathcal{X}} \{ u(x) + v(x) \}$. We then have, for any $v$, $(Z^\top v)_j = \langle z_j, v \rangle$, hence the transpose notation. We can also define the residuation as:
$$(Z^{\top+} \beta)(x) = \min_{j \in \{1,\dots,p\}} \{ \beta_j - z_j(x) \}, \qquad (4)$$
so that $Z^{\top+}$ goes from $\mathbb{R}^p$ to $\mathbb{R}^\mathcal{X}$. The operator $Z^{\top+} Z^\top$ on functions from $\mathcal{X}$ to $\mathbb{R}$ is also the projection on the image of $Z^{\top+}$; moreover, $Z^{\top+} Z^\top v \geq v$ for all $v$, and $Z^{\top+} Z^\top$ is idempotent and nonexpansive for the $\ell_\infty$-norm. Note that $Z^\top v = -Z^+(-v)$, so that properties of $Z^\top$ can be inferred from the ones of $Z^+$. Similarly, $Z^{\top+} \beta = -Z(-\beta)$.
3.3 Reduced value iteration
Extending [1] to MDPs, we consider, instead of $V_{t+1} = T V_t$, the iteration $\alpha_{t+1} = W^+ Z^{\top+} Z^\top T W \alpha_t$. If $V_t$ is represented as $W \alpha_t$, then $V_{t+1}$ is represented as $W \alpha_{t+1}$, which we can decompose as $\beta_t = Z^\top T W \alpha_t$ and $\alpha_{t+1} = W^+ Z^{\top+} \beta_t$.
The key point is then that the two operators $Z^\top T W$ and $W^+ Z^{\top+}$ can be precomputed at a cost that will be independent of the discount factor $\gamma$. Indeed, given $\alpha$ and $\beta$, using the additivity and homogeneity of $T$, we have:
$$(Z^\top T W \alpha)_j = \max_{i} \{ \gamma \alpha_i + \langle z_j, T w_i \rangle \}, \qquad (W^+ Z^{\top+} \beta)_i = \min_{j} \{ \beta_j - \langle z_j, w_i \rangle \}.$$
Therefore, the computational complexity of the iteration is $O(np)$, once the values $\langle z_j, T w_i \rangle$ and $\langle z_j, w_i \rangle$ have been computed. This requires some compilation time which is independent of $\gamma$ and of the final required precision. Note that if these values are computed up to some precision, then we get an overall extra approximation error proportional to that precision. The iterations then become
$$(\beta_t)_j = \max_{i \in \{1,\dots,n\}} \{ \gamma (\alpha_t)_i + \langle z_j, T w_i \rangle \}, \qquad (5)$$
$$(\alpha_{t+1})_i = \min_{j \in \{1,\dots,p\}} \{ (\beta_t)_j - \langle z_j, w_i \rangle \}. \qquad (6)$$
As seen below, they correspond to contracting operators and thus converge exponentially fast. In the worst case (full graph), the complexity is . Moreover, as presented below, a good approximation of $V_*$ by the images of $W$ and $Z^{\top+}$ leads to an approximation guarantee (see proof in App. A.2).
Proposition 1
(a) The operator is contractive and has a unique fixed point . (b) If and , then .
Therefore, an approximation guarantee for $V_*$ translates to an approximation error multiplied by the horizon $\frac{1}{1-\gamma}$. Thus large horizons (i.e., $\gamma$ close to $1$) will degrade performance (see examples in Figure 2). If we consider a discount factor of $\gamma^m$ (corresponding to the operator $T^m$ instead of $T$, for $m \geq 1$, see below), in the result above, $\frac{1}{1-\gamma}$ is replaced by $\frac{1}{1-\gamma^m}$ (see proof in App. A.2).
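As an illustration, the following sketch implements one consistent reading of the reduced iteration of Eqs. (5) and (6), with compiled matrices $M_{ji} = \langle z_j, T w_i \rangle$ and $N_{ji} = \langle z_j, w_i \rangle$; the toy deterministic MDP, the distance-function basis, and the choice $z_j = w_j$ are assumptions made for the example:

```python
import numpy as np

nX, gamma = 40, 0.95
succ = [((x + 1) % nX, (x + 3) % nX) for x in range(nX)]   # deterministic edges
rng = np.random.default_rng(0)
rew = rng.uniform(0, 1, size=(nX, 2))                      # edge rewards

def T(V):
    # deterministic Bellman operator
    return np.array([max(rew[x, k] + gamma * V[y] for k, y in enumerate(succ[x]))
                     for x in range(nX)])

grid = np.arange(nX)
centers = np.arange(0, nX, 4)
Wmat = -0.5 * np.abs(grid[None, :] - centers[:, None])     # basis w_i(x)
Zmat = Wmat                                                # test functions z_j = w_j

# Compilation (done once, independently of the final precision):
# M[j, i] = <z_j, T w_i>,  N[j, i] = <z_j, w_i>,  with <u, v> = max_x [u(x) + v(x)]
TW = np.array([T(w) for w in Wmat])
M = np.max(Zmat[:, None, :] + TW[None, :, :], axis=2)
N = np.max(Zmat[:, None, :] + Wmat[None, :, :], axis=2)

alpha = np.zeros(len(centers))
for _ in range(600):
    beta = np.max(gamma * alpha[None, :] + M, axis=1)   # Eq. (5): O(np) per step
    alpha = np.min(beta[:, None] - N, axis=0)           # Eq. (6): O(np) per step

V_hat = np.max(alpha[:, None] + Wmat, axis=0)           # V = W alpha
```

Each iteration only touches the compiled matrices, never the full transition structure, which is the point of the compilation step.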
Algorithmic complexity.
After steps of the approximate algorithms in Eq. (5) and Eq. (6), starting from zero, we get such that , and thus such that , with the overall bound if the assumption (b) in Prop. 1 is satisfied. Thus to get an error of for a fixed , it is sufficient to have an approximation error and a number of iterations . Thus, the overall complexity will be proportional to in addition to the compilation time required to compute the values and (which is independent of ).
Extensions.
We can apply the reasoning above to $T^m$, i.e., replace $T$ by $T^m$, with then fewer iterations in the leading terms, but a more expensive compilation time. The horizon is then equal to $\frac{1}{1-\gamma^m}$, which is equivalent to $\frac{1}{m(1-\gamma)}$ for $\gamma \to 1$. Moreover, when $m$ grows, the approximation error is reduced. This will also be helpful within matching pursuit (see Section 5).
4 Approximation by Max-Plus Operators
In this section, we consider several classes of functions $w_i$ and $z_j$, with their associated approximation properties, that are needed for Prop. 1 to provide interesting guarantees. Some of these sets are already present in [1] (distance and squared distance functions), whereas others are new (piecewise-constant approximations and Bregman divergences). We present examples of such approximations in Figure 1 and Figure 2, where we note a difference between the approximation of some function by the projections above and its approximation with the same basis functions within an MDP, as obtained from Prop. 1 (with very large, it would be equivalent, but not for small ).
4.1 Indicator functions and piecewise-constant approximations
We consider functions of the form $w_i(x) = 0$ if $x \in A_i$ and $w_i(x) = -\infty$ otherwise, where $A_i$ is a subset of $\mathcal{X}$, with $(A_i)_{i=1,\dots,n}$ forming a partition of $\mathcal{X}$. For simplicity, we assume here that $z_i = w_i$. The image of $W$ is the set of piecewise-constant functions with respect to this partition. Given a function $v$, $W W^+ v$ is the lower approximation of $v$ by a piecewise-constant function, while $Z^{\top+} Z^\top v$ is the upper approximation of $v$ by a piecewise-constant function (see Figure 1).
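With indicator atoms, the lower projection takes the minimum of $v$ on each cluster, and the upper one the maximum. A short sketch (the partition and target function below are made up):

```python
import numpy as np

v = np.sin(np.linspace(0.0, 3.0, 60))   # function to approximate
labels = np.arange(60) // 10            # partition of the 60 states into 6 clusters

# Lower approximation (W W^+ v): within-cluster minimum, broadcast to the cluster.
# Upper approximation (transposed projection): within-cluster maximum.
lower = np.array([v[labels == labels[x]].min() for x in range(60)])
upper = np.array([v[labels == labels[x]].max() for x in range(60)])

assert np.all(lower <= v) and np.all(v <= upper)
# Both approximations are piecewise constant on the partition.
```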
Reduced iteration.
Since the images of the two projections are the same, the iteration reduces to a single recursion on $\alpha$. Thus, representing $V$ as $W\alpha$ with $\alpha \in \mathbb{R}^n$, we get
$$\alpha_{t+1,i} = \max_{j : (i,j) \in \bar{E}} \{ \bar{r}(i,j) + \gamma \alpha_{t,j} \}, \qquad (7)$$
where $\bar{r}(i,j)$ is the maximal reward among edges going from $A_i$ to $A_j$, and $\bar{E}$ is the set of $(i,j)$ such that $\bar{r}(i,j)$ is finite. This is exactly a deterministic MDP on the clusters. The complexity of the algorithm depends on some compilation time, in order to compute the reduced matrix $\bar{r}$ (or the one corresponding to $T^m$), and some running time per iteration. These depend on whether we consider a dense graph or a sparse $d$-dimensional graph (corresponding to a $d$-dimensional grid). The complexities are presented below (with all constants dependent on $d$ removed), and proved in App. C.
Compilation time  Iteration time  

Value iteration (dense)  
Value iteration (sparse)  
Reduced MDP (dense)  
Reduced MDP (sparse) 
For all cases above, the optimization error (to approximate the fixed point in Prop. 1) decreases geometrically; thus, in the sparse case, the number of iterations to achieve a given precision is proportional to the horizon. Below, for simplicity, we only consider trade-offs for the sparse graph cases. For plain value iteration, having $m$ larger than one could be beneficial only in some regimes (otherwise, the compilation time and the iteration running time are both increasing in $m$). When using the clustered representation, it seems that the situation is the same, but as shown later the approximation error comes into play and larger values of $m$ could be useful (both for running time with fixed dictionaries and for stability of matching pursuit).
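The reduced iteration of Eq. (7) amounts to value iteration on a deterministic MDP over clusters, after compiling the maximal rewards over inter-cluster edges. A sketch under these assumptions (the toy MDP and the partition are made up):

```python
import numpy as np

nX, gamma = 60, 0.9
succ = [((x + 1) % nX, (x + 5) % nX) for x in range(nX)]   # deterministic edges
rng = np.random.default_rng(2)
rew = rng.uniform(0, 1, size=(nX, 2))
labels = np.arange(nX) // 10                                # 6 clusters
nC = 6

# Compilation: rbar[c, c'] = max reward over edges from cluster c to cluster c'
rbar = np.full((nC, nC), -np.inf)
for x in range(nX):
    for k, y in enumerate(succ[x]):
        c, cp = labels[x], labels[y]
        rbar[c, cp] = max(rbar[c, cp], rew[x, k])

# Value iteration on the reduced deterministic MDP over the 6 clusters
u = np.zeros(nC)
for _ in range(300):
    u = np.max(rbar + gamma * u[None, :], axis=1)

resid = np.max(np.abs(np.max(rbar + gamma * u[None, :], axis=1) - u))
```

The per-iteration cost only involves the 6 x 6 compiled matrix, not the original 60-state graph.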
Approximation properties.
See the proof of the following proposition in App. B.1, which provides approximation guarantees for Prop. 1. We assume here that the basis is composed of piecewise-constant functions.
Proposition 2
If we denote by $h(n)$ the smallest radius of a cover of $\mathcal{X}$ by $n$ balls, then, by considering the Voronoi partition associated with the ball centers, we get:
(8) 
where $C_s$ is the $s$-th order Hölder continuity constant of $V_*$ (i.e., such that $|V_*(x) - V_*(y)| \leq C_s \, d(x,y)^s$ for all $x, y \in \mathcal{X}$).
While for factored spaces, and with no assumptions on the optimal value function beyond continuity, the covering number may not scale well, the approximation could be much better in special cases (e.g., when $V_*$ depends only on a subset of variables). For factored spaces $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ with a product metric, the covering number is the product of the covering numbers of the individual factors. If each factor is simple (e.g., a chain graph), the number of balls thus grows exponentially in $d$. Here, we do not escape the curse of dimensionality.
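To make this scaling concrete, the sketch below counts the balls needed to cover a product of chain graphs under the $\ell_\infty$ product metric; the chain-cover formula $\lceil p/(2h+1) \rceil$ is a standard counting fact, and the parameters are arbitrary:

```python
import math

def chain_cover(p, h):
    # A radius-h ball (in graph distance) on a chain covers at most 2h+1 nodes,
    # and centers spaced 2h+1 apart achieve this, so the cover size is:
    return math.ceil(p / (2 * h + 1))

p, h = 100, 2
for d in [1, 2, 3, 4]:
    # For the l_inf product metric, the cover of the grid is the product of
    # the covers of the factors: size (p / (2h+1))^d, exponential in d.
    print(d, chain_cover(p, h) ** d)
```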
Going beyond exponential complexity.
In order to avoid the rate above, we can consider further assumptions: if is well approximated by a piecewise constant function with respect to a specific partition (dedicated to ), the bound in Eq. (8) can be greatly reduced. Note that the approximation with a small number of basis functions here is the same for and when (this will not be the case in Section 4.2 below).
We can get a reduced set of basis functions: if for example $V_*$ depends only on $k$ of the $d$ variables, then there is a partition for which the error in Eq. (8) only depends on covers of these $k$ factors, and we can then escape the curse of dimensionality if $k$ is small and we can find these relevant variables. This will be done algorithmically in Section 5, by considering a greedy algorithm with sets split according to a single dimension at every iteration. This variable selection does not need to be global, and the covering number can benefit from local independences.
Optimal choices of hyperparameters.
The approximation error from Prop. 2 is of order , and thus, for , in order to achieve a final approximation error of , we need to be of the order of . Without compilation time, this leads to a complexity proportional to ; thus, it is advantageous to have large values of $m$ when there is a full dependence on all variables. This has to be mitigated by the compilation time, which grows with $m$. We will see in Section 6 that larger values of $m$ are also better for a good selection of dictionary elements in matching pursuit.
4.2 Distance functions and piecewise-affine functions
Following [1], we consider functions of the form $w_i(x) = -c \, d(x, x_i)$ for $x_i \in \mathcal{X}$, $c > 0$, and $d$ a distance on $\mathcal{X}$. Thus the dictionary may be identified to a subset of $\mathcal{X}$. When $\mathcal{X}$ is a subset of $\mathbb{R}^d$ and with the $\ell_1$ or $\ell_\infty$ metrics, we get piecewise-affine functions (see Figure 1). We then have (see proof in App. B.2):
Proposition 3
(a) If and (Lipschitzconstant of ), then .
(b) If , then .
We thus get an approximation guarantee for Prop. 1 from a good covering of $\mathcal{X}$. The approximation also applies to the transposed projection. In terms of approximation guarantees, we then need to cover $\mathcal{X}$ with sufficiently many centers, and thus with $n$ points we get an approximation of exactly the same order as for clustered partitions (but with the need to know the Lipschitz constant to set the extra parameter $c$).
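For a $c$-Lipschitz target, the projection onto a dictionary of slope-$c$ distance functions is a lower approximation whose error is at most $2c$ times the covering radius of the centers. A numerical sketch (the target function, slope, and centers are choices of ours):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 201)
v = np.abs(np.sin(2 * np.pi * x))      # (2*pi)-Lipschitz target
c = 2 * np.pi                          # slope of the distance functions
centers = np.linspace(0.0, 1.0, 11)    # covering radius h = 0.05
h = 0.05
W = -c * np.abs(x[None, :] - centers[:, None])   # w_i(x) = -c |x - x_i|

alpha = np.min(v[None, :] - W, axis=1)           # residuation W^+ v  (= v at centers)
v_low = np.max(alpha[:, None] + W, axis=0)       # projection W W^+ v

assert np.all(v_low <= v + 1e-9)                 # lower approximation
assert np.max(v - v_low) <= 2 * c * h + 1e-9     # error <= 2 c h (covering radius h)
```

Refining the centers (smaller `h`) tightens the bound linearly, matching the $s = 1$ case of the covering-number guarantees.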
In terms of approximation by a finite basis, a problem here is that being well approximated by a combination of few such functions is not as easily characterized as in the clustered case (it is in one dimension, but not in higher dimensions). Moreover, these functions are not local, so the compiled quantities could be harder to compute: the compilation time is larger, while each iteration of the reduced algorithm has the same cost.
Distance functions for variable selection.
To allow variable selection, we can consider a more general family of distance functions, namely distance functions depending only on a subset of the variables (in one of two possible forms for factored spaces). We consider the second option in our experiments in Section 6.
4.3 Extensions
Smooth functions.
As outlined in [1], in the Euclidean continuous case, we can consider squared distance functions, which we generalize to Bregman divergences in Appendix B.3. Note that these will approximate smooth functions with more favorable approximation guarantees (but note that in MDPs obtained from the discretization of a continuous control problem, optimal value functions are non-smooth in general [11]). Finally, other functions could be considered as well, such as linear functions restricted to a subset, or ridge functions of the form $x \mapsto \varphi(w^\top x)$.
Random functions.
If we select a random set of points from a Poisson process with fixed intensity, then the maximal diameter of the associated random Voronoi partition has a known scaling [9], similar to the covering number up to logarithmic terms. Thus, we can use distance functions from Section 4.2 (where we can sample the constant $c$ from an exponential distribution), clusters from Section 4.1, or also squared distances like in Section B.3.
5 Greedy Selection by Matching Pursuit
The sets of functions proposed in Section 4 still suffer from the curse of dimensionality, that is, the cardinalities $n$ and $p$ should still scale (at most linearly) with the number of states, and thus exponentially in the dimension if $\mathcal{X}$ is a factored state-space. We consider here greedily selecting new functions, to use dictionaries adapted to a given MDP, mimicking the similar approach of sparse decompositions in signal processing and unsupervised learning [23, 22]. We assume that we are given the current sets of basis and test functions, and we want to test new sets which are close to them (typically one function split in two, or one additional function). Note that pruning [13] could also be considered.
We assume that we have the optimal coefficients (after convergence of the iterations defined in Eq. (5) and Eq. (6)). The criterion will be based on the best improvement brought by the new projections to the relevant function.
Criterion for .
Given the current difference between and its projection , the criterion to minimize is , for a given norm . If , we have , and our criterion thus becomes for the norm.
Criterion for .
Given the current difference between and its projection , the criterion to minimize is . If , we have , and our criterion becomes .
Like in regular matching pursuit for classical linear approximation, allowing full flexibility for the new atoms would lead to trivial choices: in our MDP situation, we essentially end up with few changes compared to plain value iteration, as the new atoms are just obtained by using our own approximation or by applying the transition operator to another approximation. Thus, within matching pursuit, in order to benefit from a reduction in the global approximation error, ensuring the diversity of the dictionary of atoms which we select from is crucial. To go beyond selection from a finite set (and learn a few parameters), we propose a convex relaxation in App. D.
Special case of partitions.
In this case, the basis and test functions coincide, and thus we only need to consider the criterion above. A simplification is to consider the current cluster attaining the bound above, and to only try to split this cluster.
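The partition-splitting heuristic can be sketched as follows, in a simplified form where the criterion is the $\ell_\infty$ residual of the best piecewise-constant fit and the worst cluster is always split at its midpoint (both simplifications, and the target function, are ours):

```python
import numpy as np

v = np.exp(-np.linspace(0.0, 4.0, 256))   # target: most variation near the left end

def pc_error(bounds):
    # l_inf error of the best piecewise-constant fit on intervals [b_k, b_{k+1})
    return max((v[a:b].max() - v[a:b].min()) / 2
               for a, b in zip(bounds[:-1], bounds[1:]))

bounds = [0, 256]
errs = []
for _ in range(8):
    # greedy step: split the interval with the largest oscillation of v
    a, b = max(zip(bounds[:-1], bounds[1:]),
               key=lambda ab: v[ab[0]:ab[1]].max() - v[ab[0]:ab[1]].min())
    bounds = sorted(bounds + [(a + b) // 2])
    errs.append(pc_error(bounds))
```

Greedy refinement concentrates the splits where the target varies most, so the residual decreases without refining flat regions.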
Related work.
Variable selection within an MDP could be seen as a special case of factored MDPs [15] where the reward function depends on a subset of variables; it would be interesting to theoretically study more complex dependences. Beyond factored MDPs, variable selection, and more generally function approximation within MDPs, has been considered in several works (see, e.g., [5, 27, 18, 21, 7, 20]), but these either do not provide the theoretical analysis that we provide here or consider linear function approximations, while we consider max-plus combinations; [26] considers variable resolution within the discretization of an optimal control problem, but not for a generic Markov decision process.
6 Experiments
All experiments can be exactly reproduced with the Matlab code which can be obtained from the author's webpage: www.di.ens.fr/~fbach/maxplus.zip.
Simulations on discretizations of control problems on .
We consider the discretization of , which we can identify to . Given the set of actions , following [25], we consider the dynamical systems (going left or right). The goal of the control problem is to maximize where is the exit time of from (i.e., reaching the boundary). We then define the value function as the supremum over controls starting from .
The value function then satisfies the Hamilton-Jacobi-Bellman (HJB) equation [11, 25], with a boundary condition at the two endpoints. With the discretization above, with a step $h$, the two boundary points are absorbing states. We thus need to construct the reward for each discretized transition, equal to the reweighted running reward for the reached state, with a discount factor depending on $h$; from the absorbing states, no move can be made. When $h$ goes to zero, the MDP solution converges to the solution of the optimal control problem.
In our example in Figure 1 and Figure 2, we consider pairs that satisfy the HJB equation. In Figure 3 (top), for the same problem, we consider the performance of our greedy method (matching pursuit with an $\ell_\infty$-norm criterion) or a fixed basis, for piecewise-constant functions from Section 4.1 and piecewise-affine functions from Section 4.2, when the horizon varies and the number of basis functions varies. We can see (a) the benefits of piecewise-affine over piecewise-constant functions in the basis and test sets (that is, better approximation properties with the same number of basis functions), and (b) the beneficial effect of larger $m$ (i.e., using $T^m$ instead of $T$), in particular for greedy techniques where the selection of good atoms does require a larger $m$.
Simulations on discretizations of control problems in two dimensions.
We consider two-dimensional extensions where the optimal value function only depends on a single variable (see details and more experiments in Appendix E.2, with a full dependence on the two variables), and we show performance plots in Figure 3 (bottom). Because of the sparsity assumption, the benefits of matching pursuit are greater than for the one-dimensional case (empirically, only relevant atoms are selected, and the error thus converges to zero faster as the number of basis functions increases).
7 Conclusion
In this paper, we have presented a max-plus framework for value function approximation, with a greedy matching pursuit algorithm to select atoms from a large dictionary. While our current framework and experiments deal with low-dimensional deterministic MDPs, there are many avenues for further algorithmic and theoretical developments.
First, for non-deterministic MDPs, the Bellman operator is unfortunately not max-plus additive; however, there are natural extensions to general MDPs using measure-based formulations on probability measures [16, 19], where, using a policy, one goes from one measure to another, with an associated reward. The difficulty here is the increased dimensionality due to a new problem defined on probability measures. Also, going beyond model-based reinforcement learning could be done by estimation of model parameters; the max-plus formalism is naturally compatible with dealing with confidence intervals. Finally, it would be worth exploring multiresolution techniques which are common in signal processing [23], to deal with higher-dimensional problems, where short-term and long-term interactions could be partially decoupled.
Acknowledgements
We acknowledge support from the European Research Council (grant SEQUOIA 724063).
References
 [1] Marianne Akian, Stéphane Gaubert, and Asma Lakhoua. The max-plus finite element method for solving deterministic optimal control problems: basic properties and convergence analysis. SIAM Journal on Control and Optimization, 47(2):817–848, 2008.
 [2] François Baccelli, Guy Cohen, Geert Jan Olsder, and Jean-Pierre Quadrat. Synchronization and Linearity: an Algebra for Discrete Event Systems. John Wiley & Sons, 1992.
 [3] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.
 [4] Dimitri P. Bertsekas. Weighted sup-norm contractions in dynamic programming: A review and some new applications. Technical Report LIDS-P-2884, Dept. Elect. Eng. Comput. Sci., Massachusetts Institute of Technology, 2012.
 [5] Dimitri P. Bertsekas and David A. Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598, 1989.
 [6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [7] Lucian Busoniu, Damien Ernst, Bart De Schutter, and Robert Babuska. Crossentropy optimization of control policies with adaptive basis functions. IEEE Transactions on Systems, Man, and Cybernetics, 41(1):196–209, 2011.
 [8] Peter Butkovič. Max-linear Systems: Theory and Algorithms. Springer Science & Business Media, 2010.
 [9] Pierre Calka and Nicolas Chenavier. Extreme values for characteristic radii of a Poisson-Voronoi tessellation. Extremes, 17(3):359–385, 2014.
 [10] Guy Cohen, Stéphane Gaubert, and Jean-Pierre Quadrat. Duality and separation theorems in idempotent semimodules. Linear Algebra and its Applications, 379(1):395–422, 2004.
 [11] Michael G. Crandall and Pierre-Louis Lions. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society, 277(1):1–42, 1983.
 [12] Bogdan Dumitrescu and Paul Irofti. Dictionary Learning Algorithms and Applications. Springer, 2018.
 [13] Stéphane Gaubert, William McEneaney, and Zheng Qu. Curse of dimensionality reduction in max-plus based approximation methods: Theoretical estimates and improved pruning algorithms. In Conference on Decision and Control and European Control Conference, pages 1054–1061, 2011.
 [14] Matthieu Geist and Olivier Pietquin. A brief survey of parametric value function approximation. Technical report, Supélec, 2010.

 [15] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
 [16] Onésimo Hernández-Lerma and Jean-Bernard Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media, 2012.
 [17] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. Technical Report 1802.09184, arXiv, 2018.
 [18] Philipp W. Keller, Shie Mannor, and Doina Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2006.
 [19] Jean-Bernard Lasserre, Didier Henrion, Christophe Prieur, and Emmanuel Trélat. Nonlinear optimal control via occupation measures and LMI-relaxations. SIAM Journal on Control and Optimization, 47(4):1643–1666, 2008.
 [20] De-Rong Liu, Hong-Liang Li, and Ding Wang. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. International Journal of Automation and Computing, 12(3):229–242, 2015.
 [21] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169–2231, 2007.
 [22] Julien Mairal, Francis Bach, and Jean Ponce. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014.
 [23] Stéphane Mallat. A Wavelet Tour of Signal Processing: the Sparse Way. Academic Press, 2008.
 [24] William M. McEneaney. Error analysis of a max-plus algorithm for a first-order HJB equation. In Stochastic Theory and Control, pages 335–351. Springer, 2002.
 [25] Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
 [26] Rémi Munos and Andrew Moore. Variable resolution discretization in optimal control. Machine Learning, 49(2-3):291–323, 2002.
 [27] Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier, and Carlos Guestrin. Greedy linear value-approximation for factored Markov decision processes. In Proc. AAAI, 2002.
 [28] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
 [29] Joan Serra-Sagristà. Enumeration of lattice points in $\ell_1$ norm. Information Processing Letters, 76(1-2):39–44, 2000.
 [30] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2018.
 [31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
 [32] Vladimir Temlyakov. Sparse Approximation with Bases. Springer, 2015.
Appendix A Value iteration and reduced value iteration convergence
In this appendix, we provide short proofs for the lemmas and propositions of the main paper related to (reduced) value iteration.
A.1 Approximation of the value function through approximate fixed points
This is taken from [4, Prop. 2.1] and presented here because the proof structure is used later.
Lemma 1 ([4])
If $\|TV - V\|_\infty \leq \varepsilon$, then $\|V - V_*\|_\infty \leq \frac{\varepsilon}{1-\gamma}$.
Proof
Consider the sequence $V_k = T^k V$. We have $\|V_{k+1} - V_k\|_\infty \leq \gamma \|V_k - V_{k-1}\|_\infty \leq \gamma^k \varepsilon$, because $T$ is $\gamma$-contractive. This leads to $\|T^k V - V\|_\infty \leq (1 + \gamma + \cdots + \gamma^{k-1}) \varepsilon \leq \frac{\varepsilon}{1-\gamma}$, thus $\|V_* - V\|_\infty \leq \frac{\varepsilon}{1-\gamma}$ by letting $k$ tend to infinity.
A.2 Proof of Prop. 1
(a) This is a consequence of the nonexpansiveness of the two projections, and the contractiveness of $T$.
(b) We have: . Using the nonexpansivity of , we get
This implies the result, using the same reasoning as in the proof of Lemma 1.
The result extends to assumptions on instead of (and similarly for ). Indeed, we have, with , by nonexpansivity of . By taking the infimum over , we get the result.