
A Structure-aware Online Learning Algorithm for Markov Decision Processes

To overcome the curse of dimensionality and the curse of modeling in Dynamic Programming (DP) methods for solving classical Markov Decision Process (MDP) problems, Reinforcement Learning (RL) algorithms are popular. In this paper, we consider an infinite-horizon average reward MDP problem and prove the optimality of the threshold policy under certain conditions. Traditional RL techniques do not exploit the threshold nature of the optimal policy while learning. In this paper, we propose a new RL algorithm which utilizes the known threshold structure of the optimal policy while learning by reducing the feasible policy space. We establish that the proposed algorithm converges to the optimal policy. It provides a significant improvement in convergence speed and in computational and storage complexity over traditional RL algorithms. The proposed technique can be applied to a wide variety of optimization problems, including energy-efficient data transmission and the management of queues. We exhibit the improvement in convergence speed of the proposed algorithm over other RL algorithms through simulations.





1. Introduction

The framework of Markov Decision Process (MDP) (Puterman, 2014) is used in the modeling and optimization of stochastic systems that involve decision making. An MDP is a controlled stochastic process on a state space with an associated control process of ‘actions’, where the transition from one state to the next depends only on the current state-action pair and not on the past history of the system (known as the controlled Markov property). Each state transition is associated with a reward. Our MDP problem aims to maximize the average reward; its solution is an optimal policy. A policy is a mapping from states to actions, describing which action is to be chosen in each state; an optimal policy is one that maximizes the average reward.

A common approach for solving MDP problems is Dynamic Programming (DP) (Puterman, 2014). In this paper, we consider an MDP problem and prove that the optimal policy has a threshold structure using DP methods. In other words, we prove that up to a certain threshold in the state space, a specific action is preferred and thereafter another action is preferred.

Classical iterative methods for DP are computationally inefficient in the face of large state and action spaces; this is known as the curse of dimensionality. Moreover, the computation of the optimal policy using DP methods requires knowledge of the state transition probability matrix, which is often governed by the statistics of unknown system dynamics. For example, in a telecommunication system, transition probabilities between different states are determined by the statistics of user arrival rates. This is known as the curse of modeling. In practice, it may be difficult to gather knowledge of the statistics of the system dynamics beforehand. When no such prior knowledge is available, a popular approach is Reinforcement Learning (RL), which learns the optimal policy iteratively by trial and error (Sutton and Barto, 1998). Examples of RL techniques include TD(λ) (Sutton and Barto, 1998), Q-learning (Watkins and Dayan, 1992), actor-critic (Borkar, 2005), policy gradient (Sutton et al., 2000) and Post-Decision State (PDS) learning (Powell, 2007; Salodkar et al., 2008). Consider, e.g., Q-learning and PDS learning. Q-learning (Watkins and Dayan, 1992) is one of the most popular learning algorithms. It iteratively computes the Q-function associated with every state-action pair using a combination of exploration and exploitation. Since Q-learning needs to learn the optimal policy for all state-action pairs, the storage complexity of the scheme is of the order of the cardinality of the state space times the cardinality of the action space. In many cases of practical interest, the state and action spaces are large, which renders Q-learning impractical. Furthermore, due to the presence of exploration, the convergence rate of Q-learning is generally slow. The idea of PDS learning (Powell, 2007; Salodkar et al., 2008), obtained by reformulating the Relative Value Iteration Algorithm (RVIA) (Puterman, 2014) equation, has been adopted in the literature for various problems. The main advantage of PDS learning is that it circumvents action exploration, thereby improving the convergence rate. Also, there is no need to store the Q-functions of state-action pairs; it suffices to store the value functions associated with the states. Therefore the storage complexity of the PDS learning scheme is lower than that of Q-learning.
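As a concrete point of reference, tabular Q-learning can be sketched as follows. The toy two-action chain below, its rewards, and all parameters are illustrative assumptions, not the model analyzed in this paper; the sketch only shows the O(|S|·|A|) storage and the ε-greedy exploration that the text attributes to Q-learning.

```python
import numpy as np

def q_learning(n_states=10, steps=5000, eps=0.1, alpha=0.1, gamma=0.95, seed=0):
    """Tabular Q-learning on a toy two-action chain (illustrative).

    The Q-table keeps one entry per state-action pair, so storage is
    O(|S| x |A|); eps-greedy exploration is part of what slows convergence.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))            # O(|S||A|) storage
    s = 0
    for _ in range(steps):
        # eps-greedy: explore with probability eps, else exploit
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        r = 1.0 if a == 1 else 0.0          # toy reward: only action 1 pays
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        # one-step Q-learning update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```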

A common drawback of the learning schemes described above is that they do not exploit any known properties of the structure of the optimal policy. In other words, while learning the optimal policy, these schemes search for it within the set of all possible policies. However, depending on the structure of the optimal policy, the size of the feasible action set in various states can be reduced. Moreover, depending on the optimal policy, some of the states may not be visited at all. If we incorporate such knowledge into the learning process, intuitively, faster convergence can be achieved due to reductions in the state and action spaces or in the range of possible policies. Furthermore, this may reduce storage and computational complexity as well.

In this paper, we propose a Structure-Aware Learning (SAL) algorithm which exploits the threshold nature of the optimal policy and searches for the optimal policy only within the set of threshold policies. To be precise, instead of learning the optimal policy for the entire state space, it learns only the threshold in the state space at which the optimal action changes. Based on the gradient of the average reward of the system, the threshold is updated on a slower timescale than that of the value function iterates. As a result, the convergence time of the proposed algorithm is reduced, along with its computational and storage complexity, in comparison to traditional schemes such as Q-learning and PDS learning. We prove that the proposed scheme indeed converges to the optimal policy. In general, the proposed technique is applicable to a large variety of optimization problems where the optimal policy is threshold in nature, e.g., (Agarwal et al., 2008; Sinha and Chaporkar, 2012; Koole, 1998; Brouns and Van Der Wal, 2006; Ngo and Krishnamurthy, 2009). Simulation results are presented in which the proposed technique is employed on a well-known problem from queuing theory (Koole, 1998) to demonstrate that the proposed algorithm indeed offers faster convergence than traditional algorithms.

There are a few works in the literature (Kunnumkal and Topaloglu, 2008; Fu and van der Schaar, 2012; Ngo and Krishnamurthy, 2010) which exploit structural properties in the learning framework. In (Fu and van der Schaar, 2012), an online learning algorithm is proposed which approximates the value functions using piecewise linear functions. However, this scheme involves a trade-off between complexity and approximation accuracy. In (Kunnumkal and Topaloglu, 2008), the authors propose a variant of Q-learning where the value function iterates are projected in such a manner that they preserve monotonicity in the system state. A similar model is adopted in (Ngo and Krishnamurthy, 2010). Although these schemes improve the convergence rate over conventional Q-learning, they achieve little gain in computational complexity. Unlike us, none of these works considers the threshold as a parameter in the learning framework; therefore they are computationally less efficient than our solution.

The rest of the paper is organized as follows. The system model and problem formulation are described in Section 2. In Section 3, the optimality of threshold policy is established. In Section 4, the structure-aware learning algorithm is proposed along with a proof of convergence. We provide a comparative study of computational and storage complexities of different RL schemes in Section 5. Simulation results are provided in Section 6. Section 7 discusses possible extensions of the problem, followed by conclusions in Section 8.

2. System Model & Problem Formulation

We consider a controlled time-homogeneous Discrete Time Markov Chain (DTMC), denoted by , which takes values from the finite state space . Without loss of generality, we assume that , where is a fixed positive integer. For the sake of simplicity, we assume that each state is associated with an action space . Let the action space consist of two actions, viz., and . Let the transition probability of going from state to state under action be denoted as . Therefore, we have and . Let the action process be denoted by . Therefore, the evolution of can be described by

Let us assume that whenever is chosen in state , no reward is obtained, and the system remains in the same state with probability and goes to state with the remaining probability, where . We further assume that whenever the system is in state and is chosen, a non-negative fixed reward is obtained and the system moves to state with probability and moves to state with the remaining probability. Note that the is not feasible in state .

We have used this model for the sake of specificity and because it does arise in practice. Analogous schemes can be developed for other models that naturally lead to a threshold structure.
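A minimal simulator of the kind of transition rule described above can be sketched as follows. Since the symbols are elided in this copy of the paper, the action labels, probabilities, and movement directions below are assumptions chosen only to illustrate the structure (action 1 earns a fixed reward and tends to push the state up; action 0 earns nothing and lets the state drain; action 1 is infeasible in the top state):

```python
import random

def step(s, a, n=10, p=0.6, q=0.7, reward=1.0):
    """One transition of an illustrative birth-death chain.

    Assumed semantics (the paper's exact rules are elided in this copy):
    under action 0 the state stays put w.p. p and drops by one otherwise,
    with no reward; under action 1 a fixed reward is earned and the state
    rises by one w.p. q, dropping by one otherwise.  Action 1 is
    infeasible in the top state n - 1.
    """
    if a == 1:
        assert s < n - 1, "action 1 infeasible in the top state"
        s_next = s + 1 if random.random() < q else max(s - 1, 0)
        return s_next, reward
    s_next = s if random.random() < p else max(s - 1, 0)
    return s_next, 0.0
```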

We aim to obtain a policy which maximizes the average expected reward of the system. Let be the set of memoryless policies, where the decision rule at time depends only on the state of the system at time and not on the past history. Under the assumption that the underlying Markov chain is unichain, which guarantees the existence of a unique stationary distribution, let the average reward of the system over the infinite horizon under policy be independent of the initial condition and be denoted by . That is, we intend to maximize


where denotes the reward function in state under action , and denotes the expectation operator under policy . The limit in Equation (1) may be taken to exist because the optimal policy is known to be stationary. The DP equation depicted below provides the necessary condition for optimality.


where and denote the value function of state and the optimal average reward, respectively. The above yields the optimal policy, i.e., optimal action as a function of current state. RVIA can be used to solve this problem using the iterative scheme described below.



is the value function estimate in the iteration of RVIA and is a fixed state.
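When the transition probabilities are known, the RVIA iteration above can be sketched in a few lines. The interface below (a per-action transition array `P`, a reward array `r`, and state 0 as the fixed reference state) is an illustrative assumption, not the paper's exact notation:

```python
import numpy as np

def rvia(P, r, ref=0, iters=5000, tol=1e-10):
    """Relative Value Iteration for an average-reward MDP.

    P[a, s, s2] is the probability of moving from s to s2 under action a;
    r[s, a] is the one-step reward.  Returns the relative value function,
    the average-reward estimate, and a greedy policy.
    """
    A, S, _ = P.shape
    h = np.zeros(S)
    rho = 0.0
    for _ in range(iters):
        # Bellman backup: Q[s, a] = r[s, a] + sum_s2 P[a, s, s2] * h[s2]
        Q = r + np.einsum('asx,x->sa', P, h)
        Th = Q.max(axis=1)
        rho = Th[ref]              # offset by the reference state's value
        h_new = Th - rho
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return h, rho, Q.argmax(axis=1)
```

On an aperiodic unichain model the iterates settle to a fixed point of the relative Bellman operator, with the reference state's relative value pinned to zero.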

3. Structure of Optimal Policy

In this section, we investigate the structure of the optimal policy. We prove the structural properties using the ‘non-increasing difference’ property of the value function in the lemma described next.

Lemma 0 ().

is non-increasing in .


Proof is presented in Appendix A. ∎

The following theorem shows that the optimal policy is of threshold type, where is optimal only up to a certain threshold.

Theorem 2 ().

The optimal policy has a threshold structure where the optimal action changes from to after a certain threshold in .


If is optimal in state , then . Using Lemma 1, is non-increasing in . Therefore, it follows that there exists a threshold such that is optimal only below the threshold, thereafter. ∎

4. Structure-aware Online RL Algorithm

In this section, we propose a learning algorithm that exploits the threshold properties of the optimal policy. Unlike traditional RL algorithms, which optimize over the entire policy space, our algorithm searches for the optimal policy only within the set of threshold policies. As a result, the proposed algorithm converges faster than traditional RL algorithms such as Q-learning and PDS learning. The computational and storage complexities of learning are also reduced, as argued later.

4.1. Gradient Based RL Framework

Since we know that the optimal policy is threshold in nature, where the optimal action changes from to after a certain threshold, knowing the value of the threshold specifies the optimal policy completely. However, the value of the threshold depends on the transition probabilities (i.e., ) between different states. Therefore, in the absence of knowledge regarding , instead of learning the optimal policy from the set of all policies, we learn only the optimal value of the threshold. We optimize over the threshold using an update rule designed so that the threshold iterate converges to the optimal threshold.

We consider the set of threshold policies and describe them in terms of the value of the threshold parameter (, say). The approach we adopt in this paper is to compute the gradient of the average expected reward of the system with respect to the threshold and improve the threshold policy in the direction of the gradient by updating the value of . Before proceeding, we need to indicate explicitly the dependence of the associated MDP on by redefining the notation in the context of threshold policies.

Let the steady state stationary probability of state , the value function of state and the average reward of the Markov chain in terms of threshold parameter be denoted by , and , respectively. Let the transition probability from state to state under threshold be denoted as . Therefore,

We later embed the discrete parameter into a continuous valued one. With this in mind, we make the following assumption regarding .

Assumption 1 ().

is a twice differentiable function of with bounded first and second derivatives. Moreover, is bounded.

The proposition described below provides a closed-form expression for the gradient of the average reward .

Proposition 0 ().

Under Assumption 1,


Detailed proof can be found in (Marbach and Tsitsiklis, 2001). ∎

The system model considered by us is a special case of the model considered in (Marbach and Tsitsiklis, 2001), with the exception that unlike in (Marbach and Tsitsiklis, 2001), the reward function in our case does not have any dependence on .
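For concreteness, the Marbach–Tsitsiklis gradient expression, specialized to our setting where the reward does not depend on the threshold (so the reward-gradient term vanishes), takes the following form. The symbols are reconstructed here since the displayed formula is elided in this copy: write σ(τ) for the average reward, π_τ for the stationary distribution, P_τ(s, s′) for the transition probabilities, and V_τ for the value function under threshold τ.

```latex
% Average-reward gradient (Marbach and Tsitsiklis, 2001), with the
% reward-gradient term dropped because the reward is independent of tau:
\nabla_{\tau}\,\sigma(\tau)
  \;=\; \sum_{s \in S} \pi_{\tau}(s)
        \sum_{s' \in S} \frac{\partial P_{\tau}(s,s')}{\partial \tau}\,
        V_{\tau}(s')
```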

4.2. Online RL Algorithm

Optimal policy can be obtained using RVIA if the transition probabilities between different states are known beforehand. In the absence of knowledge regarding transition probabilities, we can use theory of Stochastic Approximation (SA) (Borkar, 2008) to remove the expectation operation in Equation (3) and converge to the optimal policy by averaging over time. Let be a positive step-size sequence having the following properties.


Let be another step-size sequence with similar properties as in Equation (4) along with the following additional property.


In order to learn the optimal policy, we adopt the following strategy. We update the value function of one state at a time and keep others unchanged. Let be the state whose value function is updated at iteration. Let denote the number of times the value function of the state is updated till iteration. Symbolically,

The scheme for the update of value function can be described as follows.


where denotes the value function of state at the iteration on the faster timescale when the current value of threshold is . The scheme (6) solves a dynamic programming equation for a fixed value of threshold , referred to as primal RVIA. To obtain the optimal threshold value, has to be iterated in a separate timescale . Intuitively, in order to learn the value of the optimal threshold, we can determine the value of based on the current value of threshold at the iteration and then update the value of threshold in the direction of the gradient. This is similar to a stochastic gradient scheme which can be expressed as


The assumptions described in Equations (4) and (5) guarantee that value function and threshold parameter are updated in two separate timescales without interfering in each other’s convergence behavior. The value functions are updated in a faster timescale than that of the threshold. From the faster timescale, the value of threshold appears to be fixed. From the slower timescale, the value functions seem to be equilibrated according to the current threshold value. This behavior is commonly known as “leader-follower” scheme.
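One admissible pair of step-size sequences satisfying (4), (5), and the timescale-separation condition can be sketched as follows; the paper's exact sequences are elided in this copy, so this particular choice is an assumption:

```python
import math

def step_sizes(n):
    """One admissible choice of two-timescale step sizes (illustrative).

    alpha(n) drives the fast value-function updates, gamma(n) the slow
    threshold updates.  Both sum to infinity with summable squares, and
    gamma(n) / alpha(n) -> 0, which yields the leader-follower behavior:
    the threshold looks frozen to the value-function iterates, while the
    value functions look equilibrated to the threshold iterate.
    """
    alpha = 1.0 / (n + 1)                       # faster timescale
    gamma = 1.0 / ((n + 1) * math.log(n + 2))   # slower timescale
    return alpha, gamma
```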

Given a threshold , we assume that the transition from state is determined by the rule , if and by the rule , otherwise. For example, consider that the system is in state and . Then the next state to which the system moves is governed by the rule for action . Therefore, the system moves to the state . However, if , then the state transition is given by the rule for action . Therefore, the system remains in state . This scheme is applied to Equation (6) for a fixed value of threshold .

To update the threshold, we need to interpolate the value of threshold which takes discrete values, to continuous domain so that the online rule can be applied. Since the threshold policy can be described as a step function which takes discrete non-negative values as input and follows

up to a threshold and thereafter, the derivative does not exist at all points (see Assumption 1). Therefore, we propose an approximation to the threshold policy using a randomized policy. The randomized policy is a mixture of two policies, depicted by and , with corresponding probabilities and . To be precise,


Note that the function , which decides how much importance is to be given to the respective policies, is a function of the state and the current value of threshold . For a convenient approximation, should be an increasing function of . The idea is to give comparable importance to both and near the threshold and to reduce the importance of () away from the threshold in the left (right) direction. We choose the following function owing to its nice properties, such as continuous differentiability and a non-zero derivative everywhere.


This does not satisfy Assumption 1 at , but that does not affect our subsequent analysis if we take right or left derivatives at these points.
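A standard sigmoid is one function with exactly the stated properties: continuously differentiable, nonzero derivative everywhere, increasing in the state, and close to 1/2 at the threshold. The paper's exact choice is elided in this copy, so the form below is an assumption:

```python
import math

def mix_weight(s, tau):
    """Randomized-policy mixing weight (sigmoid assumed; the paper's
    exact function is elided in this copy).

    mix_weight(s, tau) ~ 0 well below the threshold, ~ 1 well above it,
    and ~ 1/2 near s = tau, so both actions get comparable weight near
    the threshold.
    """
    return 1.0 / (1.0 + math.exp(-(s - tau)))

def d_mix_weight_d_tau(s, tau):
    """Derivative of the mixing weight with respect to the threshold."""
    v = mix_weight(s, tau)
    return -v * (1.0 - v)
```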

Remark 1 ().

Another choice of could be the following.

Since this function exactly replicates the step function nature of the optimal policy in the interval and and uses approximation only in the interval , the approximation error in this case is less than that of Equation (9). However, the derivative of the function is nonzero only in the interval . Therefore, if the initial guess of the threshold is outside this range, then the proposed learning scheme may not converge to the optimal threshold as the gradient becomes zero.

While devising an update rule for the threshold, we evaluate as a representative of and use that in Equation (7). From Equation (8), we get,


Since multiplication by a constant factor does not impact the online update of the proposed scheme, we incorporate an extra multiplicative factor of to the right hand side of Equation (10). This operation can be described in the following manner. In every iteration, we choose transition according to and with equal probabilities. is a state-dependent term which denotes how much importance is to be given to the value function of the state. Therefore, the update of in the slower timescale is as follows.


is a random variable which takes values

and with equal probabilities. If , then the transition is determined by the rule , else by . Therefore, where . The averaging effect of SA scheme enables us to obtain the effective drift in Equation (10). The projection operator is introduced to guarantee that the iterates remain bounded in .

Therefore, the online RL scheme where the value functions are updated in the faster timescale and the threshold parameter in the slower one, can be summarized as


The transitions in (11) from to correspond to a single run of a simulated chain as is common in RL. For each current state , the in (12) is generated separately as per .

Theorem 2 ().

The schemes (11) and (12) converge to optimality almost surely (a.s.).


Proof is provided in Appendix B. ∎

We describe the resulting two-timescale SAL algorithm in Algorithm 1.

1:Initialize number of iterations , value function and the threshold .
2:while TRUE do
3:     Choose action governed by the current value of .
4:     Update the value function of state using Equation (11).
5:     Update threshold using Equation (12).
6:     Update and .
7:end while
Algorithm 1 Two-timescale SAL algorithm

As described in Algorithm 1, the number of iterations, the value functions and the threshold are initialized at the beginning. At every decision epoch, we choose the action specified by the current value of the threshold. Based on the reward obtained, the value functions of the states and the value of the threshold are updated on the faster and slower timescales, respectively. The rules for these updates are provided in Equations (11) and (12), respectively.
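The loop of Algorithm 1 can be sketched as below on an illustrative birth-death chain. The chain, its parameters, the sigmoid mixing function, and the exact form of the threshold step are all assumptions standing in for the elided details of Equations (11) and (12); the sketch shows only the two-timescale structure (fast RVIA-style value update with a per-state local clock, slow projected gradient step in the threshold driven by an equiprobable ±1 variable):

```python
import math
import random

def sal(n=10, reward=1.0, p=0.6, q=0.7, iters=5000, seed=0):
    """Sketch of the two-timescale SAL loop (illustrative assumptions)."""
    rng = random.Random(seed)
    V = [0.0] * n                 # value functions: O(|S|) storage
    tau = n / 2.0                 # threshold iterate
    counts = [0] * n              # local clocks, one per state
    s, ref = 0, 0                 # current state and reference state
    for k in range(iters):
        gamma = 1.0 / ((k + 1) * math.log(k + 2))      # slow step size
        fv = 1.0 / (1.0 + math.exp(-(s - tau)))        # sigmoid weight
        # beta picks which pure rule drives this transition (eq. (12) sketch)
        beta = 1 if rng.random() < 0.5 else -1
        a = 1 if (beta > 0 and s < n - 1) else 0
        if a == 1:                                     # rewarding action
            r = reward
            s_next = s + 1 if rng.random() < q else max(s - 1, 0)
        else:                                          # idle action
            r = 0.0
            s_next = s if rng.random() < p else max(s - 1, 0)
        counts[s] += 1
        alpha = 1.0 / counts[s]                        # fast, local clock
        # fast timescale: RVIA-style value update (eq. (11) sketch)
        V[s] += alpha * (r + V[s_next] - V[ref] - V[s])
        # slow timescale: projected gradient step in tau (eq. (12) sketch)
        d_fv = -fv * (1.0 - fv)
        tau += gamma * beta * d_fv * V[s_next]
        tau = min(max(tau, 0.0), float(n - 1))         # projection
        s = s_next
    return V, tau
```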

Remark 2 ().

Even if the optimal policy of an MDP problem does not have a threshold structure, the methodology presented in this paper, which is guaranteed to converge to the (at least locally) optimal threshold policy, can still be used. In general, threshold policies are easy to implement and have low storage complexity. Moreover, a well-chosen threshold policy often provides good performance.

5. Computational and Storage Complexity

In this section, we provide a comparative study of computational and storage complexities associated with traditional learning algorithms such as Q-learning, PDS learning and the SAL algorithm. The comparison is summarized in Table 1.

Algorithm Storage Computational
complexity complexity
Q-learning (Sutton and Barto, 1998; Watkins and Dayan, 1992)
PDS learning (Salodkar et al., 2008; Powell, 2007)
Table 1. Computational and storage complexities of various RL algorithms.

As described in Table 1, Q-learning algorithm needs to store the value function associated with every state-action pair. Thus, the storage complexity associated with Q-learning is . PDS learning algorithm needs to store the value functions associated with only the PDSs along with feasible actions in every state, thereby requiring storage. The SAL algorithm proposed by us needs to store the value functions of all the states and the value of threshold. We no longer need to store feasible actions corresponding to every state since the value of threshold completely specifies the policy. Therefore, the storage complexity of SAL algorithm is . However, for all practical purposes, once the algorithm converges, it is sufficient to store only the value of threshold instead of optimal actions associated with every state, as required by Q-learning and PDS learning.

The Q-learning algorithm updates the value function associated with a state-action pair in every iteration by evaluating functions and choosing the best one. Therefore, the per-iteration complexity associated with Q-learning is . In the case of PDS learning, each iteration involves the evaluation of functions, thereby having a per-iteration complexity of . As evident from Equations (11) and (12), a single iteration of the proposed algorithm involves updating the value function of one state and the value of the threshold. Therefore, the computational complexity of our proposed algorithm is . This is a considerable reduction in computational complexity in comparison to Q-learning and PDS learning.

6. Simulation Results

In this section, we demonstrate the advantages offered by the proposed algorithm in terms of convergence speed with respect to traditional algorithms such as Q-learning (Sutton and Barto, 1998) and PDS learning (Salodkar et al., 2008). We adopt a simple queuing model from (Koole, 1998) and exhibit that the SAL algorithm converges faster than other RL algorithms. In general, the proposed learning technique is applicable to models involving a threshold structure of the optimal policy, such as (Agarwal et al., 2008; Sinha and Chaporkar, 2012; Brouns and Van Der Wal, 2006; Ngo and Krishnamurthy, 2009).

The authors in (Koole, 1998) consider a single queue where the service time is exponentially distributed (with parameter , say) and the arrival process is Poisson. The system incurs a constant cost upon blocking a user. Additionally, there is a holding cost which is a convex function of the number of customers in the system. The authors prove that it is optimal to admit a user only below a threshold on the number of customers. We conduct ns-3 simulations of the SAL algorithm to exploit the threshold structure of the optimal policy in (Koole, 1998) and compare its convergence performance with the Q-learning and PDS learning algorithms.

6.1. Convergence Analysis

Figure 1. Plot of average cost vs. number of iterations for different algorithms.

As illustrated in Figs. 1(a) and 1(b), the SAL algorithm converges faster than both Q-learning and PDS learning. Due to the absence of an exploration mechanism, PDS learning has better convergence behavior than Q-learning. However, the SAL algorithm outperforms both, owing to the fact that it operates on a smaller feasible policy space (the set of threshold policies only) than the other algorithms. On the other hand, for both Q-learning and PDS learning, the policy at any given iteration may be non-threshold in nature, which increases the convergence time to optimality. As observed in Fig. 1(a), while Q-learning and PDS learning require around and iterations, respectively, for convergence, the SAL algorithm requires only iterations. Similarly, in Fig. 1(b), the number of iterations reduces from in Q-learning and in PDS learning to iterations in the SAL algorithm.

Figure 2. Plot of average cost vs. sum of step sizes till iteration for different algorithms.

However, for practical purposes, even if we have not converged to the optimal policy, if the average cost of the system does not change much over a window of iterations, we can say that a stopping criterion has been reached; in other words, the current policy is close to the optimal policy with high probability. Instead of a window of iterations, we consider the sum of step sizes up to the present iteration as the parameter of choice, to eliminate the effect of the declining step size on convergence. We choose the window size equal to and observe in Fig. 2(a) that convergence for Q-learning, PDS learning and the SAL algorithm is achieved in approximately and iterations, respectively. Similarly, we observe in Fig. 2(b) that the number of iterations required for practical convergence reduces from and in Q-learning and PDS learning to in the SAL algorithm.

7. Possible Extensions

In this section, we describe possible extensions of the techniques proposed in this paper. Although these techniques are primarily focused on solving MDP problems, they can also be employed for learning problems involving Constrained MDPs (CMDPs). Due to the presence of constraints, usually a two-timescale learning approach is adopted (Borkar, 2008), where the value functions are updated on one timescale and the associated Lagrange Multiplier (LM) on another. Structure-aware learning may introduce another timescale on which the value of the threshold is updated. However, since the iterates for the LM and the threshold do not depend on each other, they can be updated on the same timescale.

The proposed learning technique can also be extended to MDP/CMDP problems parameterized by a set of threshold parameters rather than only one. On the slower timescale, one threshold parameter can be updated in a single iteration based on the visited state, while the rest are kept fixed. Since the update of the threshold parameters follows a stochastic gradient scheme, contrary to the value function iterates, the threshold parameter iterates do not need individual local clocks for convergence. However, for the scheme to work, the relative frequencies of the updates of the individual threshold parameters have to be bounded away from zero (Borkar, 2008). Yet another future direction is to develop RL schemes for restless bandits, wherein threshold policies often lead to simple index-based policies; see (Borkar and Chadha, 2018) for a step towards this.

8. Conclusions

In this paper, we have considered an MDP problem and proved the optimality of threshold policies. To this end, we have proposed an RL algorithm which exploits the threshold structure of the optimal policy while learning. Contrary to traditional RL algorithms, the proposed algorithm searches for the optimal policy only within the set of threshold policies and hence converges faster. We have proved that the proposed scheme indeed converges to the globally optimal threshold policy. An analysis has been presented to exhibit the effectiveness of the proposed technique in reducing computational and storage complexity. Simulation results demonstrate the improvement in convergence behavior of the proposed algorithm in comparison to that of Q-learning and PDS learning.

Appendix A Proof of Lemma 1

We rewrite the optimality equation for the value function as

Let the value function of state in iteration of Value Iteration Algorithm (VIA) be denoted by . Start with . Hence, is non-increasing in . We have,


Using Equation (13), is non-increasing in . Now, we assume that is non-increasing in . We need to prove that is non-increasing in . Let us define as follows.

Also define . Let us define . Therefore,

Since is non-increasing in , is non-increasing in . Let and be the maximizing actions in states and , respectively.

Let . For proving that is non-increasing in , we need to prove . Let us consider four cases as follows.

Since and are non-increasing in , is non-increasing in (Using (13)). Since , is non-increasing in .

Appendix B Proof of Theorem 2

The proof methodology adopted in this paper is similar to that of (Salodkar et al., 2008). We adopt the Ordinary Differential Equation (ODE) approach for analyzing SA algorithms, viewing them as a noisy discretization of a limiting ODE (Borkar, 2008). Step size parameters are considered as discrete time steps, and if the discrete values of the iterates are linearly interpolated, they closely follow the trajectory of the ODE. The assumptions on step sizes, viz., (4) and (5), guarantee that the discretization error and the error due to noise are asymptotically negligible. As a result, in the asymptotic sense, the iterates closely follow the trajectory of the ODEs and converge a.s. to the globally asymptotically stable equilibrium.

Update rules for value functions and threshold in the faster and slower timescale, respectively, are as follows.


Following the two timescale analysis adopted in (Borkar, 2008), we consider Equation (14) first keeping threshold fixed. Let be a map given by


The knowledge of is required only for the sake of analysis; the proposed algorithm can operate without it. Since is kept constant, Equation (14) tracks the following limiting ODE.


As , converges to the fixed point of (i.e., ) (Konda and Borkar, 1999), which is the asymptotically stable equilibrium of the ODE. Similar approaches are adopted in (Abounadi et al., 2001; Konda and Borkar, 1999).
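The two-timescale mechanism can be illustrated with a minimal SA sketch. The update maps, step sizes, and target values below are invented for illustration only; the point is that the fast iterate equilibrates against a quasi-static slow iterate (here the fast fixed point is v*(theta) = 2*theta, and the slow iterate performs gradient ascent on a concave objective of that fixed point):

```python
import numpy as np

rng = np.random.default_rng(0)
v, theta = 0.0, 0.0
for n in range(1, 100001):
    a = 1.0 / n ** 0.6        # fast step size
    b = 1.0 / n               # slow step size; b/a -> 0 separates the timescales
    # Fast timescale: v tracks the fixed point v*(theta) = 2*theta from
    # noisy samples, with theta effectively constant at this timescale.
    v += a * (2.0 * theta + rng.normal(0.0, 0.1) - v)
    # Slow timescale: gradient ascent on J(theta) = -(v*(theta) - 3)^2,
    # substituting the current fast estimate v for v*(theta).
    theta += b * (-4.0 * (v - 3.0))
print(theta, v)   # theta approaches 1.5, so v approaches 3.0
```

Both step-size sequences sum to infinity while their squares are summable, matching the usual SA conditions such as (4) and (5).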

The lemma presented next establishes the boundedness of value functions and threshold iterates.

Lemma 0 ().

The value function and the threshold iterates are bounded a.s.


Let be a map given by


Note that Equation (16) reduces to (18) if the immediate reward is zero. Now, . Consider the limiting ODE


Observe that the globally asymptotically stable equilibrium of the ODE (19) is the origin. Also, notice that the ODE (19) is a scaled limit of the ODE (17). Boundedness of follows (Borkar and Meyn, 2000).

Boundedness of iterates of follows from (12). ∎

The physical interpretation of the proof is as follows. If the iterates of the value functions were to become unbounded along a subsequence, then a suitably scaled version of the iterates would approximately follow the scaled ODE. Since the scaled ODE is globally asymptotically stable at the origin, the scaled iterates must return toward the origin, and therefore the value function iterates must return to a bounded set. This ensures the stability of the value function iterates.

Lemma 0 ().

a.s., where is the value function of the states for .


We know that the threshold is varied on a slower timescale than that of . Therefore, the value function iterates treat the threshold value as quasi-static (constant), and the iterations can be viewed as , where . Thus, the limiting ODEs associated with the value function and threshold iterates are and , respectively. Since , it is sufficient to consider the ODE for a fixed value of . The rest of the proof is similar to (Borkar, 2005). ∎

The subsequent lemmas prove that the average reward under a threshold () is unimodal in , and hence the threshold iterations converge to the optimal threshold . Therefore, converges to .

Lemma 3 ().

is non-increasing in .


Proof is provided in Appendix C. ∎

Lemma 4 ().

is unimodal in .


Proof is provided in Appendix D. ∎

Lemma 0 ().

The threshold iterates .


The limiting ODE for Equation (15) is the gradient ascent

with inward-pointing gradient at . By Lemma 4, there does not exist any local maximum other than the global maximum . Therefore, . ∎

Remark 3 ().

In general, in an MDP problem with a threshold structure, unimodality of the average reward may not hold. In such cases, the threshold iterates may converge only to a local maximum.
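A minimal sketch of why unimodality matters for the slower-timescale threshold update: local ascent over an assumed unimodal reward curve (the quadratic below is a made-up stand-in for the average reward as a function of the threshold) cannot stall anywhere except at the global maximizer.

```python
# Local ascent over integer thresholds; sigma is an invented unimodal curve.
def sigma(T):
    return -(T - 6) ** 2 / 10.0   # single peak at T = 6

T = 0
for _ in range(50):
    if sigma(T + 1) > sigma(T):       # move up if the right neighbor is better
        T += 1
    elif T > 0 and sigma(T - 1) > sigma(T):
        T -= 1                        # move down if the left neighbor is better
print(T)  # 6
```

If sigma had a second, lower peak, the same ascent could stop there instead, which is exactly the caveat in the remark above.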

Appendix C Proof of Lemma 3

We need to prove that is non-increasing in . We use induction. When , and . We have,

Let . Then and .

Now, assume that the claim holds for some , i.e., . We need to prove that . It is easy to see that . Therefore, to complete the proof, we need to show that . Let be the maximizing actions in states and , respectively, at iteration . Let be the maximizing actions in states and , respectively, at iteration . Now, it is impossible to have and . If , we have . From Lemma 1, we must have . If , we have , which contradicts the inductive assumption. Therefore, we consider the following three cases. For given values of and , if the inequality holds for arbitrary values of and , then it also holds for the maximizing actions.
1) If , then we choose . We have,

2) If , then we choose , and the inequality holds as in the previous case.
3) If and , then we choose and .

Thus, we have, .

Appendix D Proof of Lemma 4

We know that if the optimal action in state is , then . Since VIA converges to the policy with threshold , such that , and . Let be the optimal threshold at iteration of VIA. Symbolically, . If the inequality holds for no value of , then is taken as . Using Lemma 3, must monotonically decrease with , and .

Consider a modified problem where is not permitted in any state , for a given threshold . Lemma 3 holds for this modified problem too. Let be the first VIA iteration in which the threshold drops to . The value function iterates for the modified and the original problem are the same for , because is never chosen as the optimal action for in the original problem in these iterations. Therefore, must be finite, and the following inequality holds for both the original and the modified problem after iterations.


Using Lemma 3, Equation (20) holds . Therefore, in the considered modified problem, converges to . This implies that the threshold policy with threshold is better than the one with threshold . Since can be chosen arbitrarily, the average reward is monotonically decreasing in , .

Now, if we have , we must have . Therefore, . Thus, is unimodal in .


Work of VSB is supported in part by a J. C. Bose Fellowship and CEFIPRA grant for “Machine Learning for Network Analytics”.


  • Abounadi et al. (2001) Jinane Abounadi, D Bertsekas, and Vivek S Borkar. 2001. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization 40, 3 (2001), 681–698.
  • Agarwal et al. (2008) Mukul Agarwal, Vivek S Borkar, and Abhay Karandikar. 2008. Structural properties of optimal transmission policies over a randomly varying channel. IEEE Trans. Automat. Control 53, 6 (2008), 1476–1491.
  • Borkar (2005) Vivek S Borkar. 2005. An actor-critic algorithm for constrained Markov decision processes. Systems & control letters 54, 3 (2005), 207–213.
  • Borkar (2008) Vivek S Borkar. 2008. Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press.
  • Borkar and Chadha (2018) Vivek S Borkar and Karan Chadha. 2018. A reinforcement learning algorithm for restless bandits. In 2018 Indian Control Conference (ICC). IEEE, 89–94.
  • Borkar and Meyn (2000) Vivek S Borkar and Sean P Meyn. 2000. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 2 (2000), 447–469.
  • Brouns and Van Der Wal (2006) Gido AJF Brouns and Jan Van Der Wal. 2006. Optimal threshold policies in a two-class preemptive priority queue with admission and termination control. Queueing Systems 54, 1 (2006), 21–33.
  • Fu and van der Schaar (2012) Fangwen Fu and Mihaela van der Schaar. 2012. Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology 61, 9 (2012), 3931–3945.
  • Konda and Borkar (1999) Vijaymohan R Konda and Vivek S Borkar. 1999. Actor-Critic-Type Learning Algorithms for Markov Decision Processes. SIAM Journal on control and Optimization 38, 1 (1999), 94–123.
  • Koole (1998) Ger Koole. 1998. Structural results for the control of queueing systems using event-based dynamic programming. Queueing systems 30, 3-4 (1998), 323–339.
  • Kunnumkal and Topaloglu (2008) Sumit Kunnumkal and Huseyin Topaloglu. 2008. Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing 20, 2 (2008), 288–301.
  • Marbach and Tsitsiklis (2001) Peter Marbach and John N Tsitsiklis. 2001. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control 46, 2 (2001), 191–209.
  • Ngo and Krishnamurthy (2009) Minh Hanh Ngo and Vikram Krishnamurthy. 2009. Optimality of threshold policies for transmission scheduling in correlated fading channels. IEEE Transactions on Communications 57, 8 (2009).
  • Ngo and Krishnamurthy (2010) Minh Hanh Ngo and Vikram Krishnamurthy. 2010. Monotonicity of constrained optimal transmission policies in correlated fading channels with ARQ. IEEE Trans. Signal Processing 58, 1 (2010), 438–451.
  • Powell (2007) Warren B Powell. 2007. Approximate Dynamic Programming: Solving the curses of dimensionality. Vol. 703. John Wiley & Sons.
  • Puterman (2014) Martin L Puterman. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Salodkar et al. (2008) Nitin Salodkar, Abhijeet Bhorkar, Abhay Karandikar, and Vivek S Borkar. 2008. An on-line learning algorithm for energy efficient delay constrained scheduling over a fading channel. IEEE Journal on Selected Areas in Communications 26, 4 (2008), 732–742.
  • Sinha and Chaporkar (2012) Abhinav Sinha and Prasanna Chaporkar. 2012. Optimal power allocation for a renewable energy source. In 2012 National Conference on Communications (NCC). IEEE, 1–5.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT press Cambridge.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems. 1057–1063.
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.