This paper is concerned with the estimation, via simulation, of value functions in the context of expected infinite horizon discounted rewards for Markov chains. Estimating such value functions plays an important role in approximate dynamic programming and applied probability in general. In many problems of practical interest, the state space is huge or even continuous and the value function is computationally intractable. Therefore, we need to approximate the value function. In this work, we develop a fully non-parametric method to estimate the value function by incorporating shape constraints, such as knowledge of convexity, monotonicity, or Lipschitz constants.
The most common method employed to approximate the value function is parametric approximate dynamic programming; see Powell (2011) and Bertsekas (2007). In this method, the user specifies an “approximation architecture” (i.e. a set of basis functions) and the algorithm then produces an approximation in the span of this basis. Selecting the basis function is essential because an inappropriate “approximation architecture” might cause unsatisfactory results, and cannot be improved by additional sampling or computational effort.
In contrast, we are proposing a fully “non-parametric” method to avoid the difficulty of choosing a correct approximation architecture. The general idea is to take advantage of shape properties of the optimal value function in estimating the function. A variety of control problems exist on continuous state spaces for which convexity in the value function naturally arises. For instance, in a linear transition system, if the reward function in each stage is convex, then the value function is convex. Inventory models represent a well-known example of this class of problems. Singular stochastic control (Kumar and Muthuraman (2004)) and partially observed Markov processes (Smallwood and Sondik (1973)) are two other subclasses of problems for which the value function is convex. As another example, Karoui et al. (1998) show that the American-style option is convex for a generalized Black–Scholes model.
Monotonicity properties have been studied in the literature for various problems formulated as Markov decision processes. For instance, if the reward is monotone and the chain is stochastically monotone, the value function is monotone. InPapadaki and Powell (2007), the monotonicity of the value function is studied in the case of the multi-product batch dispatch problem. Stokey (1989, p. 267-268) presents general conditions that guarantee the value function will be monotonic in the underlying state variable. Discussions of monotonicity appear also in Serfozo (1976) and Topkis (1998).
In Smith and McCardle (2002) and Atakan (2003), sufficient conditions are provided on the transition probability of a stochastic dynamic programming problem to ensure the shape properties of the value function. The goal of our work is to exploit this type of shape property to estimate the value function.
We suggest two methods for computing an approximation of the value function for a fixed policy. In the first method, we estimate the value function along a path by explicitly incorporating the shape constraint. For instance, in the case that we know the value function is convex, we consider the set of all convex functions which is a convex cone in the space of measurable functions. Having a sample path of the underlining process, one can reach a noisy observation of the value function. By projecting this noisy observation to the cone of convex functions, we achieve an estimator for the value function. Since this method requires only one sample path of the process, it can be used in reinforcement learning applications.
The second method is based on estimating the value function by taking advantage of the fixed point property of the value function in addition to the shape constraint. The value function satisfies a specific linear system of equations. Therefore, estimating the value function is possible by approximating the fixed point of this system of equations over the cone of convex functions. This fixed point can be obtained by iteratively projecting onto the cone of convex functions. The simulation results show that the second approach has reduced variance and provides more accurate estimators as compared to the first approach.
The projection onto the cone of convex functions is possible by solving a least square optimization problem. This optimization problem can be interpreted as a multi-dimensional convex regression. Convex regression is concerned with computing the best fit of a convex function to a dataset of observations;
for . Convex regression derives a convex estimator of by solving a least square problem. In one dimension, the theory of convex regression is well established; see Hanson and Pledger (1976) for the consistency result and Mammen (1991) and Groeneboom et al. (2001) for the rate of convergence. The consistency of convex regression has been shown in Lim and Glynn (2012), and in Seijo and Sen (2011) in the multi-dimensional case where the observations are independent.
To show the consistency of the estimator in our method, we extend the results in convex regression literature to the Markov processes. Let be a positive Harris recurrent and the noise sequence be a correlated sequence satisfying suitable technical assumptions. We show that the estimator is converging to the projection of onto the cone of convex functions in the Hilbert space of measurable functions. Lim and Glynn (2012) studied the behavior of the estimator when the model is mis-specifed so that the function is non-convex under the much stronger assumption that the function is bounded. Our result relaxes this assumption.
Recently, Hannah and Dunson (2011, 2013) employed the notion of fitting convex functions in solving dynamic programming problems. The key differences between our work and Hannah and Dunson (2013, 2011) are as follows:
Our method is fully non-parametric while their approach is semi-parametric and required adjusting several parameters before fitting a convex function or determining the prior distribution for Bayesian updating.
It is well known that value iteration type algorithms often lead to errors that grow exponentially in the problem horizon. Small local changes at each iteration can lead to a large global error of the approximation; see Section IV of Tsitsiklis and Van Roy (2001) and Ma and Powell (2009). In contrast, in our method the projection to the convex set occurs asymptotically with respect to the stationary distribution of the underlying Markov chain and has a convergence guarantee.
The literature on approximate dynamic programming (ADP) is also related to our work. Some recent works in this area suggest that the performance of parametric ADP algorithms is improved by exploiting structural properties; see Wang and Judd (2000); Cai and Judd (2010, 2012a, 2012b, 2012c); Cai et al. (2013). In addition, Godfrey and Powell (2001) and Powell et al. (2004) consider the cases where the value functions are known to be convex and approximate the value function by separable, piecewise linear functions of one variable. In Kunnumkal and Topaloglu (2010), the monotonicity of value functions are used to approximate the value function where the state space is finite.
In greater detail, we make the following contributions:
We rigorously develop a fully non-parametric method to estimate shape constrained value functions of multi-dimensional continuous state space M.C. In the case that the value function is convex, the estimator can be represented as a piecewise linear function and evaluated at each point in linear time.
We extend the convex regression to the case in which explanatory variables are sampled along a Markov chain path. Moreover, the observations are correlated and generated along the same path.
We identify the behavior of the estimator in the case of mis-specification, where the value function is not-convex.
We show the convergence of the estimator to the solution of the projected Bellman equation as the length of the sample path goes to infinity,
We extend the non-parametric method to estimate the value functions which are Lipschitz or monotone and convex.
The rest of this section is organized as follows: In Section 2, we precisely introduce the mathematical framework for our analysis. In section 3, we describe our methods. Section 4 presents the extension of multi-dimensional convex regression to the Markov processes and shows the consistency of convex regression in this general framework. In Section 5, we use the results of Section 4 to prove the convergence of our methods. In Section 6, we extend our methods to estimate the value functions by exploiting other shape structures. In Section 7, we study the efficacy of our methods by applying them to a pricing problem in energy market.
Let be a discrete time Markov chain evolving on a general continuous state space embedded on
. Each random variableis measurable with respect to the Borel -algebra associated with . The transition probability of the Markov chain represents the time-homogeneous probability that the next state will be given that the current state is . Let be the reward function received at time , and be a discounting factor with .
The value function, which is the expected infinite horizon discounted reward for the Markov chain, is given by
According to the Markov property, we have
Define the operator by
where is the space of measurable functions over . The operator can be considered as the Bellman operator for a fixed policy. It is well known that is a contraction with respect to the sup norm.
for every . Furthermore, the value function is the unique fixed point of equation ; see Bertsekas (2007, p.408).
Let be a probability measure on Define
Suppose that is the set of all convex functions over which are measurable with respect to . Note that is a closed convex cone over the space of functions ; see Lim and Glynn (2012). The projection operator onto the cone with respect to the measure , represented by , is defined as
The projection of onto the convex cone , denoted by , can be characterized by
for every .
3 Convex Value Exploration
In this section, we suggest two different methods to approximate the value function for a given fixed policy by incorporating the shape constraints. Here, we first focus on the convexity as a shape constraint. Next, we extend our methods to monotonicity and Lipschitz constraints. In the first method, we estimate the value function by explicitly incorporating the shape constraints. In the second one, we improve the estimator by incorporating the shape constraint and simultaneously taking advantage of the fact that the function satisfies a specific linear system of equations. In the subsequent sections, we discuss the convergence of these methods.
3.1 Truncated Method
Let be the underlying Markov process. Consider a single sample path of . The total discounted rewards over this sample path rise to noisy observations of the value function at these sample points. Fitting a convex function to these observations gives an estimation of the value function. Since we truncate the infinite horizon discounted reward stream to get the noisy observation at each sample point, we call this method the truncated method.
Let be a sample path of the Markov chain with length . Also, assume that is the sequence of corresponding rewards at sample points for . A noisy observation of is thus given by
for . Observe that , where is close to zero if the number of sample points is sufficiently large. We can construct an estimator of by projecting the noisy observations onto the cone of convex functions. The projection is possible by fitting a convex function to the points . Assuming is a convex function, we use the least squares estimator (LSE) to project the noisy observations onto the cone of convex functions by solving
Since is an infinite-dimensional space, this minimization may appear to be computationally intractable. However, it turns out that this minimization can be formulated as a finite-dimensional quadratic program (QP):
In Lim and Glynn (2012), it is shown that this least square problem has a minimizer , and any minimizer of (2) over satisfies . We defer more discussion of solving this optimization problem more efficiently to Chapter 5. We define our estimator as
The function is a convex and finite value function over the convex hull of the points . Furthermore, it is straightforward to show that is a piecewise linear convex function given by
In the next section, we will show that as the sample size , the estimator converges uniformly to over every compact set.
3.2 Fixed Point Projection
In this section, we improve the previous method by taking advantage of the fixed point property of the value function in addition to the shape constraint. The rationale of the method is to iteratively apply the Bellman operator and project to the cone of convex functions. This method provides an approximation of the value function as the fixed point of the operator . First, we start with the ideal case, in which we can exactly compute the expectation with respect to the stationary distribution as well as the projection to the space of convex functions. Next, we explain a numerical algorithm that approximately follows this ideal iteration procedure.
Here, we assume the value function belongs to the space of measurable functions and is convex. In the next section, we study the behavior of the estimator in the general case where the value function is not convex. The value function is the fixed point of the operator , so . Moreover, by the convexity assumption, is a fixed point of the projection operator onto the cone of convex functions, and we have . Therefore, is the fixed point of the combination of the operators and and satisfies
In the next theorem, we show the existence of such a fixed point as a result the of contraction of both operators and .
Let be the stationary distribution of the Markov chain , and . Then there exists a unique fixed point such that
Moreover, let be a sequence of functions in the convex closed cone , defined by
Then, we have
The sequence generated in (5) does not converge for an arbitrary norm . The assumption that is the stationary distribution of the underlying Markov chain is essential to guarantee the convergence of the sequence. For instance, the projection with respect to the sup-norm is not contraction (see Example B.1). For a similar discussion in the context of parametric ADP, see Tsitsiklis and Van Roy (2001).
First, we show that if is the stationary distribution of the Markov chain, then the operator is a contraction with respect to the norm . For any two functions , we have
Moreover, we know that , the projection operator onto the convex cone , is also a contraction with respect to norm; see P.26 Borwein and Lewis (2005). More precisely, if , then we have
Thus, for every . Therefore,
The rest of the theorem follows directly from the Banach fixed point theorem. ∎
In the rest of this section, we develop a computational method to approximate the fixed point over the cone by using simulated trajectories. Exact computation of is not generally viable. Evaluating at any involves the computation of the expectation This expectation is over a potentially high-dimensional or infinite-dimensional space and hence can pose a computational challenge. The following proposition provides an equivalent characterization to the operator . As a result of this proposition, it suffices to evaluate at two sample points rather than computing
Let and be two independent samples of an M.C. at time given . Moreover, define the random variable such that
For every measurable function , we have
Let be the projection of onto the cone of convex functions , which is the minimizer of
By using the independence of and given , we obtain
for every function . Therefore, we can conclude that is also the minimizer of the optimization
By using the ergodic property of the Markov chains, it is straightforward to calculate an estimator of . At each time step , we generate two independent copies and given . We call a “two copy sample path”.
Under appropriate conditions over the process , we have
Here, we discuss a potential but unsuccessful Monte Carlo approach to approximate . By the convexity assumption, the value function is the fixed point of , and therefore is the minimizer of the optimization problem
Similar to Proposition 3.3, it is possible to show that the fixed point is also the minimizer of
One might solve the optimization problem
However, it can be easily shown that this optimization problem is non-convex and unbounded for any finite sample path of length ; see Example B.2. We can solve this difficulty by employing an iterative projection procedure. Before discussing this method, we impose an additional shape constraint to bound the value function. This assumption helps to restrict the cone of convex functions and make the projection more tractable.
Let the state space be bounded. Moreover, assume that for every , the sub-gradient of is bounded by a constant :
Suppose that there exists a constant such that for every state we have
It is straightforward to show that
Therefore, if the reward function is bounded over the state space, then Assumption (3.4) holds.
Now, we present an alternative method to estimate the value function by using convexity and the fixed point property. The method is similar to the ideal procedure in Theorem 3.1
. The main difference is using the random vectorfor a piecewise linear function instead of . We first generate a two copy sample path of length . This sample path does not change throughout the procedure. We iteratively compute the random vector for a piecewise linear function . Next, we project onto the convex cone to achieve . Each convex projection is a least square finite-dimensional optimization problem. By following this procedure iteratively, an estimation of the fixed point over the cone is obtained. The details of the method are as follows:
- Generating Sample Path:
Generate a “two copy sample path” of length . At each time step , generate two independent copies and given .
- Updating Step:
for every . The sequence is a noisy observation of .
Project onto the cone of convex functions by solving the finite-dimensional convex program
Given the optimal solution to this optimization, we can construct a piecewise linear convex function. Define
The updating and projection stages for a fixed “two copy sample
path” should be continued until a desired level of accuracy is reached. We
can consider as an estimator for the value function. In
the next section, we will show that for sufficiently large sample size
and a large number of iterations , the estimator
converges uniformly to the value function over every compact set.
4 Empirical Projection Consistency
In this section we describe a generalization of the consistency result of convex regression in Lim and Glynn (2012) to the positive Harris chains. Our result includes the model mis-specification case without any extra assumption to bound the function. In the next section, we use this result to show that the estimators in truncated method and fixed point projection method converge to the value function as the sample size grows to infinity.
Let be defined on the probability space . For every -measurable random variable , we can define the projection onto the cone with respect to the norm as the solution of
Let be a sequence of random vectors in which for every . We show that if converges on average to , then the empirical projection of this sequence onto the cone of convex functions gives a consistent estimate of projecting onto this cone. For ease of exposition, we define a sequence of random vectors as strongly ergodic in the following way:
Suppose that is a positive Harris chain with stationary distribution , and be a sequence of random variables. We call this sequence “strongly ergodic” if there exists a -measurable random variable such that ,
for every function , and .
To illustrate this definition, we provide several examples.
Let the be such that
is a sequence of i.i.d noise terms with respect to such that
, and is a convex function in . Then by the
strong law of large numbers, we obtain that
. Then by the strong law of large numbers, we obtain thatis “strongly ergodic”.
In Lemma A.2, we show that if , then the two following random sequences are “strongly ergodic”:
Let be the optimizer of the convex optimization problem
for . Note that similar to (11), we can convert this optimization problem to a finite quadratic convex problem. In the following theorem, we show that is an estimator for , the projection of onto the space of convex functions.
We need some assumptions over the structure of the Markov chain.
For the Markov chain , we have:
It is positive Harris recurrent with unique stationary distribution .
for every positive radius ball which is a subset of state space .
The proof follows the same steps as the convergence proof in Lim and Glynn (2012). The main difference is the use of the ergodic property of Harris chains instead of the strong law of large number for i.i.d random variables. Moreover, we continue to allow the model mis-specification in which is not a convex function.
We first start by showing the consistency of the projection onto the compact disk for every . Then, by expanding this projection over the whole space, we conclude the theorem.
For every , define as the set of all functions such that is a convex function over the disc . Similar to Proposition 3 in Lim and Glynn (2012), we can show that is a closed subset of . Therefore, there exists a unique function which is the projection of onto .
It is clear that for almost every . In Lemma A.1, we show converges to as goes to infinity.
and for sufficiently large
We conclude the last inequality from the “strongly ergodic” property of . By the Cauchy–Schwarz inequality, the tail of the empirical inner product can be uniformly bounded for every and sufficiently large . Observe that
In the last line, we used the “strongly ergodic” assumption and the triangle inequality. Since and , the right hand side can be smaller than any for large enough . Thus,
for sufficiently large . Therefore, the terms in (14), which correspond to the samples outside the disk can be made arbitrarily small. In the next lemma, we show that converges to inside the disk.
Then for every ,
See the Appendix for the proof. This lemma ensures that there exists a sequence converging to zero such that
We can get a lower bound for (18) by the Cauchy–Schwarz inequality.
By combining these inequalities and using Lemma 4.7, we obtain
According to Lemma A.1, there exists a sufficiently large for every such that
Therefore, there exists a sufficiently large for any such that