Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes

12/21/2019 ∙ by Arghyadip Roy, et al. ∙ Indian Institute of Technology Kanpur

Markov Decision Process (MDP) problems can be solved using Dynamic Programming (DP) methods, which suffer from the curse of dimensionality and the curse of modeling. To overcome these issues, Reinforcement Learning (RL) methods are adopted in practice. In this paper, we aim to obtain the optimal admission control policy in a system where different classes of customers are present. Using DP techniques, we prove that it is optimal to admit the i-th class of customers only up to a threshold τ(i) which is a non-increasing function of i. Contrary to traditional RL algorithms, which do not take into account the structural properties of the optimal policy while learning, we propose a structure-aware learning algorithm which exploits the threshold structure of the optimal policy. We prove the asymptotic convergence of the proposed algorithm to the optimal policy. Due to the reduction in the policy space, the structure-aware learning algorithm provides remarkable improvements in storage and computational complexities over classical RL algorithms. Simulation results also establish the gain in the convergence rate of the proposed algorithm over other RL algorithms. The techniques presented in the paper can be applied to any general MDP problem covering various applications such as inventory management, financial planning and communication networks.


I Introduction

Markov Decision Process (MDP)[puterman2014markov] is a framework which is widely used for the optimization of stochastic systems involving uncertainty to make optimal temporal decisions. An MDP is a controlled stochastic process operating on a state space; in each state, a controller chooses an action from an associated set of feasible actions. An MDP satisfies the controlled Markov property, i.e., the transition from one state to another is governed only by the current state-action pair and is independent of the past history of the system. Each transition gives rise to a certain amount of reward which depends on the current state and action. A stationary policy is a mapping from states to actions describing which action is to be chosen in each state. The objective of the MDP problem considered here is to determine the optimal policy which maximizes the average expected reward of the system. It is known that it suffices to consider only stationary policies for this problem.
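As a concrete illustration of these definitions, the following minimal sketch (Python, with hypothetical data structures not taken from the paper) represents a finite MDP by its transition probabilities and rewards and evaluates the long-run average reward of a fixed stationary policy by solving for the stationary distribution of the induced Markov chain, assuming the chain is unichain.

```python
import numpy as np

def average_reward(P, r, policy):
    """Long-run average reward of a fixed stationary policy.

    P[a][s, s'] : transition probability from state s to s' under action a
    r[s, a]     : reward earned in state s when action a is chosen
    policy[s]   : action chosen in state s (stationary, deterministic)
    """
    n = r.shape[0]
    # Markov chain induced by the policy and its per-state reward.
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    r_pi = np.array([r[s, policy[s]] for s in range(n)])
    # Stationary distribution: solve mu P_pi = mu with sum(mu) = 1.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(mu @ r_pi)

# Toy two-state, two-action example.
P = {0: np.array([[0.9, 0.1], [0.5, 0.5]]),
     1: np.array([[0.2, 0.8], [0.6, 0.4]])}
r = np.array([[1.0, 0.0], [0.0, 2.0]])
print(average_reward(P, r, policy=[0, 1]))
```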

MDP has been extensively used to model problems related to queue management frequently arising in telecommunication systems, inventory management and production management. A generalized framework for MDP based modeling in the context of queue management is provided in [ccil2009effects]. The authors in [ccil2009effects] also investigate the structural properties of the optimal policy using Dynamic Programming (DP) [puterman2014markov] methods and study the impact of various system parameters on the structural properties. DP techniques for the computation of the optimal policy suffer from the following major drawbacks. First, DP based methods such as the Value Iteration Algorithm (VIA) and the Policy Iteration Algorithm (PIA) [puterman2014markov] are computationally inefficient in the face of large state and action spaces. This is known as the curse of dimensionality. Furthermore, computation of the optimal policy requires the knowledge of the underlying transition probabilities, which often depend on the statistics of unknown system parameters such as the arrival rates of users. In reality, it may be hard to gather these statistics beforehand. This drawback is known as the curse of modeling.

RL techniques [sutton1998reinforcement] address the issue of the curse of modeling. They learn the optimal policy in an iterative fashion without requiring the knowledge of the statistics of the system dynamics. However, popular RL techniques such as Q-learning [watkins1992q], Temporal Difference (TD) learning [sutton1998reinforcement], policy gradient [sutton2000policy], actor-critic learning [borkar2005actor] and Post-Decision State (PDS) learning [powell2007approximate, salodkar2008line] suffer from the shortcoming that they do not exploit the known structural properties of the optimal policy, if any, within the learning framework. When these schemes iteratively learn the optimal policy by trial and error, the policy search space consists of an exhaustive collection of all possible policies. However, existing literature on operations research and communications reveals that in many cases of practical interest, the value functions of states satisfy properties like monotonicity, convexity/concavity and sub-modularity/super-modularity. These results are often exploited to prove various structural properties of the optimal policy, including threshold structure, transience of certain states of the underlying Markov chain and index rules [agarwal2008structural, smith2002structural]. In the learning framework, if one can exploit these structural properties to reduce the search space while learning, then faster convergence can be achieved along with a significant reduction in the computational complexity.

To illustrate the benefit provided by the awareness of the structural properties in the context of RL, we consider the following scenario. We consider a multi-server queuing system with a finite buffer where multiple classes of customers are present. We aim to determine the optimal admission control policy which maximizes the average expected reward over the infinite horizon. The system model presented in this paper is motivated by [ccil2009effects], which considers the infinite buffer case. A similar model for batch arrivals is considered in [ccil2007structural]. We prove the existence of a threshold-based optimal policy using DP methods under certain assumptions on the reward function. Specifically, we prove that it is optimal to admit the i-th class of customers only up to a threshold τ(i) which is a non-increasing function of i. Therefore, learning the optimal policy is equivalent to learning the value of the threshold τ(i) for each value of i.
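The threshold structure can be made concrete with a small sketch. It assumes (hypothetically) that the relevant part of the state is the total number of customers s in the system, that tau is a vector of per-class thresholds ordered as τ(1) ≥ τ(2) ≥ …, and that admission is allowed strictly below the threshold; with these assumptions the admission decision reduces to a single comparison.

```python
def admit(s: int, class_idx: int, tau: list) -> bool:
    """Threshold admission rule: admit a class-i arrival only while the
    number of customers in the system is below that class's threshold.

    s         : current number of customers in the system
    class_idx : class of the arriving customer, 0-based (class 1 -> index 0)
    tau       : per-class thresholds, non-increasing in the class index
    """
    return s < tau[class_idx]

# Example: three classes with thresholds ordered tau(1) >= tau(2) >= tau(3).
tau = [8, 5, 2]
print(admit(4, 0, tau), admit(4, 1, tau), admit(4, 2, tau))  # True True False
```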

Motivated by this, we propose a Structure-Aware Learning for MUltiple Thresholds (SALMUT) algorithm which eliminates the set of non-threshold policies and considers only the set of threshold policies where the thresholds are ordered, i.e., non-increasing in the class index. We consider a two-timescale approach. On the faster timescale, the value functions of the states are updated. On the slower timescale, we update the values of the threshold parameters based on the gradient of the average reward with respect to the threshold. We establish that this scheme results in reductions in storage and computational complexities in comparison to traditional RL schemes. Since the associated search space of policies is smaller, SALMUT converges faster than classical RL techniques. We prove that the proposed algorithm converges to the optimal policy in an asymptotic sense. Simulation results are presented to exhibit the gain in convergence speed achieved by SALMUT in comparison to classical RL algorithms. Note that the techniques presented in this paper are of independent interest and can be employed to learn the optimal policy in a large set of optimization problems where the optimality of threshold policies holds, see e.g., [agarwal2008structural, sinha2012optimal, koole1998structural, brouns2006optimal, ngo2009optimality].

We provide a generic framework in this paper for learning a set of threshold parameters. In many cases of practical interest, instead of a set of thresholds, we may need to learn only a single threshold, see e.g., our preliminary work in [roy2019structure]. The algorithm proposed in this paper is generic enough to be applied to the setting of [roy2019structure] without any modification. In another work [roy2019low], a structure-aware online learning algorithm is proposed for learning a single parameterized threshold. However, unlike [roy2019low], where the thresholds for different parameter values are independent of each other, in this paper the thresholds have to satisfy certain ordering constraints and are therefore dependent on each other. Consequently, the scheme for updating the thresholds on the slower timescale and the corresponding convergence behavior differ significantly from those of [roy2019low].

One of the advantages of the proposed scheme is that it essentially reduces a non-linear system (involving maximization over a set of actions) to a linear system for a quasi-static value of the threshold on the faster timescale, while the optimal threshold is learned on the slower timescale. This reduces the per-iteration computational complexity significantly compared to other learning schemes in the literature.

I-A Related Work

Many RL algorithms have been proposed in the literature over the years; see [sutton1998reinforcement, bertsekas1995dynamic] for excellent overviews. Model-free RL algorithms do not require any prior knowledge regarding the transition probabilities of the underlying model. Among policy iteration based methods, the actor-critic class of methods is popular. It uses simulation for approximate policy evaluation (by a critic) and utilizes it for approximate policy improvement (by an actor). Policy gradient based methods learn a parameterized policy iteratively using the gradient of a chosen performance metric with respect to the considered parameter.

Value iteration based methods choose different actions, observe the rewards and iteratively learn the best action in each state. The TD learning algorithm is one of the first model-free RL algorithms which can be used to estimate the value functions of different states. Q-learning [watkins1992q] is a very popular RL algorithm which iteratively evaluates the Q-function of every state-action pair. It uses a combination of exploration and exploitation where the exploration is gradually reduced over time. Since the Q-function of every state-action pair needs to be stored, the storage complexity is quite high, especially under large state and action spaces. Furthermore, the exploration mechanism makes the convergence of Q-learning slow for practical purposes.
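For reference, a minimal tabular Q-learning update with ε-greedy exploration is sketched below. This is a generic illustration rather than the paper's formulation: it uses the common discounted form (the paper's setting is average reward), and the state/action encoding, step size and discount factor are placeholders.

```python
import random
from collections import defaultdict

GAMMA = 0.99   # discount factor used only for this generic illustration

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore with probability eps, otherwise exploit the current Q estimates."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_step(Q, s, a, reward, s_next, actions, alpha=0.1):
    """One tabular Q-learning update: move Q(s, a) toward the sampled target."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (reward + GAMMA * best_next - Q[(s, a)])

Q = defaultdict(float)   # one entry per state-action pair (hence the storage cost)
actions = [0, 1]         # e.g., block / admit
q_learning_step(Q, s=3, a=1, reward=2.0, s_next=4, actions=actions)
print(epsilon_greedy(Q, 4, actions))
```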

The PDS learning algorithm [powell2007approximate, salodkar2008line] addresses these issues by removing the requirement of action exploration. Due to the absence of exploration, the convergence rate of PDS learning is faster than that of Q-learning. Moreover, the storage complexity of PDS learning is smaller than that of Q-learning since the Q-functions associated with state-action pairs no longer need to be stored; only the value functions of the states are required. In [mastronarde2012joint], a Virtual Experience (VE) learning algorithm, in which multiple PDSs can be updated at a time, is proposed. The reduction in convergence time is achieved at the cost of increased computational complexity.

The main limitation of these learning schemes is that they do not exploit the existing structural properties of the optimal policy, if any. If the knowledge of the structural properties [smith2002structural] can be exploited in the learning framework, then improved convergence can be achieved due to a reduction in the dimensionality of the effective policy space. A few works in the literature [kunnumkal2008exploiting, fu2012structure, ngo2010monotonicity, sharma2018accelerated] focus on the exploitation of structural properties while learning the optimal policy. A Q-learning based algorithm is proposed in [kunnumkal2008exploiting] where, in every iteration, the value functions are projected in such a way that monotonicity in the system state is guaranteed. Similar methodologies are adopted in [ngo2010monotonicity]. Although this approach provides an improvement in convergence speed over traditional Q-learning, the per-iteration computational complexity does not improve. In [fu2012structure], a learning algorithm that uses a piecewise linear approximation of the value function is proposed. However, as the approximation becomes better, the complexity of the proposed scheme increases. The authors in [sharma2018accelerated] combine the idea of VE learning [mastronarde2012joint] and a piecewise planar approximation of the PDS value functions to exploit the structural properties. However, the computational complexity is worse than that of PDS learning.

I-B Our Contributions

In this paper, we propose the SALMUT algorithm, a structure-aware learning algorithm which exploits the benefits provided by the knowledge of the structural properties. First, we aim to obtain the optimal admission control policy in a multi-server queuing system with limited buffer size which handles customers of multiple classes. We establish the existence of a threshold-based optimal policy where the optimal threshold to admit the i-th class of customers is non-increasing in i. Based on this, the proposed SALMUT algorithm learns the optimal policy only from the set of ordered threshold policies. A two-timescale approach is adopted, in which the value functions of the states are updated on a faster timescale than the threshold vector. The asymptotic convergence of the proposed scheme to the optimal policy is established. Simulation results establish that the proposed SALMUT algorithm converges faster than traditional algorithms such as the Q-learning and PDS learning algorithms due to a reduction in the size of the feasible policy space. To the best of our knowledge, contrary to other works in the literature [kunnumkal2008exploiting, fu2012structure, ngo2010monotonicity, sharma2018accelerated], we for the first time consider the threshold vector as a parameter in the learning process to obtain a linear system for a fixed value of the threshold and hence a significant reduction in the per-iteration computational complexity. Our main contributions can be summarized as follows.

  • We establish that the optimal admission threshold τ(i) for the i-th class of customers is non-increasing in i.

  • We propose the SALMUT algorithm which exploits the knowledge of structural properties in the learning framework. The convergence proof of the proposed algorithm is presented.

  • Analytical results demonstrate that significant improvements in storage and computational complexities are achieved in comparison to other state-of-the-art algorithms.

  • The proposed algorithm provides a novel framework and hence can be utilized in other problems where the optimal policy is threshold in nature.

  • Using simulations, we establish that the proposed SALMUT algorithm converges faster than classical RL algorithms.

The rest of the paper is organized as follows. The system model and the problem formulation are described in Section II. In Section III, we establish the optimality of threshold policies. In Section IV, we propose the SALMUT algorithm along with a proof of convergence. A comparison of storage and computational complexities of the proposed algorithm with those of traditional RL algorithms is provided in Section V. Section VI presents the simulation results. We conclude the paper in Section VII.

II System Model & Problem Formulation

We consider a queuing system with a number of identical servers and a finite buffer. We investigate an optimal admission control problem in the system where multiple classes of customers are present. It is assumed that the arrivals of each class of customers form a Poisson process. We further assume that the service time is exponentially distributed with a class-independent mean.

II-A State & Action Space

We model the system as a controlled time-homogeneous continuous-time stochastic process. The state of the system can be represented by a pair consisting of the total number of customers in the system and the type of the current event. Arrivals and departures of different classes of customers are taken as decision epochs; one event type corresponds to a departure, and the remaining event types correspond to arrivals of the respective classes of customers. Note that state transitions happen only at the decision epochs. The associated continuous-time Markov chain has a finite number of states and hence is regular; in other words, the rate of the exponentially distributed sojourn time in each state is bounded. Therefore, it is sufficient to observe the system state only at the decision epochs to know the entire sample path [kumar2012discrete]. The system state need not be observed at other time points.

The action space consists of two actions for an arrival, viz., blocking of the arriving user and admission of the arriving user. In case of a departure, there is no choice of action; the only available action is to continue/do nothing. Note that when the system is full, the only feasible action in case of an arrival is blocking.
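To make the event-driven dynamics concrete, the following sketch samples the next decision epoch of such a queue using competing exponential clocks. The arrival rates, the service rate and the number of servers are hypothetical parameters, and the M/M/c-style departure rate (service rate times the number of busy servers) is an assumption consistent with, but not copied from, the model description.

```python
import random

def next_event(s, arrival_rates, mu, n_servers):
    """Sample the time to, and type of, the next decision epoch.

    s             : current number of customers in the system
    arrival_rates : list of Poisson arrival rates, one per customer class
    mu            : service rate of each (identical) server
    Returns (holding_time, event) with event = ("arrival", class_idx)
    or ("departure", None).
    """
    busy = min(s, n_servers)
    rates = list(arrival_rates) + [busy * mu]   # competing exponential clocks
    total = sum(rates)
    holding_time = random.expovariate(total)
    u, acc = random.random() * total, 0.0
    for k, rate in enumerate(rates):
        acc += rate
        if u <= acc:
            break
    event = ("departure", None) if k == len(arrival_rates) else ("arrival", k)
    return holding_time, event

print(next_event(s=3, arrival_rates=[1.0, 0.5], mu=2.0, n_servers=2))
```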

II-B State Transitions and Rewards

Based on the current system state and the chosen action , the system moves to state with a positive probability. The transition from state to can be factored into two parts, viz., the deterministic transition due to the chosen action and the probabilistic transition due to the next event. Let the transition probability due to chosen action be denoted by . Then,

where . Let the sum of arrival and service rates in state (which is independent of ) be denoted by . Therefore,

Let and . Note that although do not depend on , these notations are introduced for the ease of representation. Now,

Hence the transition probability from state to state (, say) is expressed as

Based on the system state and the chosen action , finite amounts of reward rate (, say) and cost rate are obtained. Let the non-negative reward rate obtained by the admission of a class- customer be , where for . Therefore,

We assume that if the system is in state , then a non-negative cost rate of (independent of ) where and are non-decreasing functions of (convex increasing in the discrete domain), is incurred.

II-C Problem Formulation

At each arrival instant, the system either admits or rejects the incoming customer. We first obtain the optimal admission control policy which maximizes the average expected reward of the system over an infinite horizon. This problem can be formulated as a continuous-time MDP problem.

Let be the set of stationary policies (decision rule at time depends only on the system state at time and not on the past history). Since the zero state is reachable from any state with positive probability, the underlying Markov chain is unichain. This ensures the existence of a unique stationary distribution. Let the infinite horizon average reward (which is independent of the initial state) under policy be denoted by . We aim to maximize

(1)

where is the total reward till time and is the expectation operator under policy . For a stationary policy, the limit in Equation (1) exists.

The DP equation which describes the necessary condition for optimality in a semi-Markov decision process ( and ) is

where , and denote the value function of state , the optimal average reward and the mean transition time from state upon choosing action , respectively. We rewrite the DP equation after substituting the values of and transition probabilities as

where corresponds to a departure event. We define

Therefore, the following relations hold.

and

(2)

where . Equation (2) reveals that instead of considering the system state as the pair , we can consider the system state as with value function and transition probability to state under action , and the analysis remains unaffected. However, in this model, the reward rate is the weighted average of reward rates for different events in the original model. The probability of the event acts as the corresponding weight.

Since the sojourn times are exponentially distributed, this converts into a continuous-time controlled Markov chain. The resulting optimality equation is as follows.

(3)

where are controlled transition rates which satisfy (for ) and . Note that Equation (3) follows directly from the Poisson equation [marbach2001simulation]. Scaling the transition rates by a positive quantity is equivalent to time scaling. This operation scales the average reward for every policy including the optimal one, however, without changing the optimal policy. Therefore, we assume without loss of generality. This implies that for . We obtain

(4)

by adding to both sides of Equation (3). Here, for and . Equation (4) is the DP equation for an equivalent discrete time MDP (say having controlled transition probabilities ) which is used throughout the rest of the paper.

This problem can be solved using the Relative Value Iteration Algorithm (RVIA) according to the following iterative scheme.

(5)

where a fixed reference state is used and the iterates denote the successive estimates of the value functions.
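Although the paper ultimately avoids model-based computation, the following sketch shows what relative value iteration looks like when the transition probabilities are known. It is a generic implementation for the average-reward criterion, not the paper's exact recursion; the reference state, tolerance and data layout are arbitrary choices.

```python
import numpy as np

def rvia(P, r, ref_state=0, tol=1e-8, max_iter=10_000):
    """Relative Value Iteration for an average-reward MDP.

    P[a][s, s'] : transition probability from s to s' under action a
    r[s, a]     : one-step reward for action a in state s
    Returns (value-function estimate, greedy policy, average-reward estimate).
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman backup: immediate reward plus expected value of the next state.
        Q = np.array([[r[s, a] + P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1) - Q[ref_state].max()   # subtract the reference value
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)
    rho = Q[ref_state].max()   # estimate of the optimal average reward
    return V, policy, rho
```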

III Structural Property of Optimal Policy

In this section, we show that there exists a threshold-based optimal policy which admits a customer of class i only up to a threshold τ(i). Moreover, τ(i) is a non-increasing function of i. We prove these properties using the following lemma.

Lemma 1.

is non-increasing in .

Proof.

Proof is presented in Appendix A. ∎

Theorem 1.

The optimal policy is of threshold type: it is optimal to admit a customer of class i only up to a threshold τ(i), which is a non-increasing function of i.

Proof.

For class i of customers, if admission is optimal in a state, then the corresponding inequality between the value functions and the admission reward holds. From Lemma 1, the relevant value-function difference is non-increasing in the number of customers in the system, which proves the existence of a threshold τ(i) for class i of customers.

Since the reward rates are assumed to be ordered across the classes, if in a given state the optimal action for class i of customers is to block, then blocking has to be optimal for class i+1 as well. Therefore, τ(i) is a non-increasing function of i. ∎

IV Exploitation of Structural Properties in RL

In this section, we propose an RL algorithm which exploits the knowledge regarding the existence of a threshold-based optimal policy.

IV-A Gradient-based RL Technique

Given that the optimal policy is threshold in nature, with the optimal action changing from admission to blocking at τ(i) for class i of customers, the knowledge of the thresholds uniquely characterizes the optimal policy. However, these threshold parameters can be computed only if the event probabilities in each state (governed by the arrival and service rates) are known beforehand. When these rates are unknown, we can learn the ordered thresholds instead of learning the optimal policy from the set of all policies, including the non-threshold policies. We devise an iterative update rule for a threshold vector with one component per class, so that the threshold vector iterates converge to the optimal threshold vector.

We consider the set of threshold policies where the thresholds for different classes of customers are ordered (τ(i) ≥ τ(i+1) for every class i) and represent them as policies parameterized by the threshold vector τ. In this context, we redefine the notations associated with the MDP to reflect their dependence on τ. The aim is to compute the gradient of the average expected reward of the system with respect to τ and improve the policy by updating τ in the direction of the gradient.

Let us denote the transition probability from state to state corresponding to the threshold vector by . Hence,

Let the value function of a state, the average reward of the Markov chain and the steady-state stationary probability of a state, each parameterized by the threshold vector τ, be denoted accordingly. The following assumption is made so that the discrete parameter τ can later be embedded into a continuous domain.

Assumption 1.

is bounded and a twice differentiable function of . It has bounded first and second derivatives.

Under these assumptions, the following proposition provides a closed-form expression for the gradient of the average reward with respect to τ.

Proposition 1.
(6)
Proof.

Proof is provided in [marbach2001simulation]. ∎

Note that [marbach2001simulation] considers a more general case where, unlike here, the reward function also depends on the parameter τ.

IV-B Structure-aware Online RL Algorithm

As shown in Equation (5), the optimal policy can be computed using RVIA if we know the transition probabilities between different states and the arrival rates of different types of users. When these parameters are unknown, the theory of Stochastic Approximation (SA) [borkar2008stochastic] enables us to replace the expectation operation in Equation (5) by averaging over time and still converge to the optimal policy. Consider a positive step-size sequence satisfying the following properties:

(7)

Let there be another step-size sequence which, apart from the properties in Equation (7), has the following property:

(8)
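Concrete schedules satisfying these conditions are easy to construct. The sketch below uses one common choice (not the schedules used in the paper's experiments, which are not reproduced here): both sequences are square-summable but not summable, and the slower one vanishes relative to the faster one, which is the separation-of-timescales property in Equation (8).

```python
def fast_step(n: int) -> float:
    """Faster-timescale step size a(n): sum a(n) = inf, sum a(n)^2 < inf."""
    return 1.0 / (n + 1) ** 0.6

def slow_step(n: int) -> float:
    """Slower-timescale step size b(n): same summability conditions,
    plus b(n)/a(n) -> 0, which separates the two timescales."""
    return 1.0 / (n + 1)

# The ratio b(n)/a(n) = (n + 1)**(-0.4) indeed tends to zero.
print([round(slow_step(n) / fast_step(n), 3) for n in (1, 10, 100, 1000)])
```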

We adopt the following strategy in order to learn the optimal policy. We update the value function of the system state (based on the type of arrival) at any given iteration and keep the value functions of other states unchanged. Let be the state of the system at iteration. Let the number of times state is updated till iteration be denoted by . Therefore,

The update of the value function of the current state (corresponding to the arrival of a particular type of customer) is done using the following scheme:

(9)

where the iterate denotes the estimate of the value function at the corresponding iteration for the given threshold vector. This is known as the primal RVIA, which is performed on the faster timescale.

Remark 1.

Note that according to the proposed scheme (9), for a fixed threshold policy, the maximization operator in Equation (5) goes away, and the resulting system becomes a linear system.

The scheme (9) works for a fixed value of the threshold vector. To obtain the optimal value of τ, the threshold vector needs to be iterated on a slower timescale. The idea is to learn the optimal threshold vector by computing the gradient of the average reward based on the current value of the threshold and updating the threshold in the direction of the gradient. This scheme is similar to a stochastic gradient routine, as described below.

(10)

where the iterate denotes the threshold vector at the corresponding iteration. Equations (7) and (8) ensure that the value function and threshold vector iterates are updated on different timescales. From the slower timescale, the value functions seem to be quasi-equilibrated, whereas from the faster timescale, the threshold vector appears to be quasi-static (known as the “leader-follower” behavior).

Given a threshold vector τ, it is assumed that the transition from a state upon the arrival of a class-i customer is driven by the admission rule if the number of customers is below τ(i) and by the blocking rule otherwise. Under the admission rule, the system admits the arriving customer and moves to the state with one more customer. On the other hand, the blocking rule dictates that the system remains in the current state. For a fixed τ, Equation (9) is updated using the above rule.

For a given class of customers, the threshold policy chooses the admission rule up to the corresponding threshold on the state space and follows the blocking rule thereafter. Therefore, the threshold policy is defined at discrete points and does not satisfy Assumption 1, as the derivative is undefined. To address this issue, we propose an approximation (an interpolation to the continuous domain) of the threshold policy, which resembles a step function, so that the derivative exists at every point. This results in a randomized policy which, in a given state, chooses the admission and blocking rules with complementary probabilities. In other words,

(11)

Intuitively, this probability, which is a function of the system state and the threshold vector, should be designed in such a manner that it allocates similar probabilities to the two rules near the threshold. As we move away from the threshold towards the left (right), the probability of choosing the blocking (admission) rule should decrease. Therefore, the blocking probability needs to be an increasing function of the number of customers in the system. The following sigmoid function is chosen as a convenient approximation because it is continuously differentiable and its derivative is non-zero at every point.

(12)
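A minimal sketch of such a smoothed threshold rule is given below. It assumes the standard logistic form for Equation (12); the exact parameterization used in the paper (any scaling or offset) is not reproduced here, so the function should be read as illustrative only.

```python
import math
import random

def block_probability(s: float, tau_i: float) -> float:
    """Smoothed threshold rule: probability of applying the blocking rule
    when there are s customers and the class threshold is tau_i.
    Assumed logistic form: increasing in s and roughly 0.5 at the threshold."""
    return 1.0 / (1.0 + math.exp(-(s - tau_i)))

def randomized_action(s: float, tau_i: float) -> str:
    """Sample the randomized policy induced by the smoothed rule."""
    return "block" if random.random() < block_probability(s, tau_i) else "admit"

# Far below the threshold the arrival is almost surely admitted, far above it
# is almost surely blocked, and near the threshold the two rules are mixed.
print(block_probability(1, 5), block_probability(5, 5), block_probability(9, 5))
```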
Remark 2.

Note that although the state space is discrete, individual threshold vector component iterates may take values in the continuous domain. However, only an ordinal comparison dictates which action needs to be chosen in the current system state.

Remark 3.

Instead of the sigmoid function in Equation (12), the following function, which uses the approximation only in a small interval around the threshold, could have been chosen.

Clearly, this function does not employ the approximation outside that interval, leading to a smaller approximation error than that of Equation (12). However, it may lead to slow convergence since the derivative of the function, and hence the gradient, becomes zero outside the interval.

Based on the proposed approximation, we devise an online update rule for the threshold vector on the slower timescale, following Equation (10). The gradient is evaluated at the current state as a representative of the expression in Equation (6), since the steady-state stationary probabilities in Equation (6) can be replaced by averaging over time. Using Equation (11), we get

(13)

We incorporate a constant multiplying factor on the right-hand side of Equation (13), since multiplication by a constant does not alter the scheme. The physical significance of this operation is that at every iteration, transitions following the admission and blocking rules are adopted with equal probabilities. The resulting update direction depends on the system state and the threshold vector at any given iteration.

Based on this analysis, when a class-i customer arrives, the online update rule for the i-th component of the threshold vector is as follows.

where the update is driven by a random variable that takes two values with equal probabilities: under one realization the transition is governed by the admission rule, and under the other by the blocking rule. The projection operator ensures that the iterates remain bounded in a specific interval, as specified later. Recall that in Theorem 1 we have established that the threshold for class i of customers is a non-increasing function of i. Therefore, the i-th component of the threshold vector iterate should always be less than or equal to the (i-1)-th component.

The first component of the threshold vector is considered to be a free variable which can take any value in its admissible interval. The projection operator ensures that every other component remains bounded accordingly. To be precise,

The framework of SA enables us to obtain the effective drift in Equation (13) by performing averaging.

Therefore, the two-timescale online RL scheme, where the value functions and the threshold vector are updated on the faster and the slower timescale, respectively, is as described below. We suppress the parametric dependence of the value functions on τ.

(14)

and

(15)

The physical significance of Equation (15) is that when a class-i customer arrives, the i-th component of the threshold vector is updated. However, since the components are provably ordered, the components whose ordering would otherwise be violated need to be updated too. The remaining components no longer need to be updated, since the order is already preserved while updating the i-th component. Also, in Equation (14), the reward function is taken to be the one corresponding to the arriving class of customer. The expectation operation in Equation (5) is mimicked by the averaging over time implicit in a stochastic approximation scheme. Note that contrary to [roy2019low], where, due to the independence among the threshold parameters, only one threshold is updated at a time, in this paper multiple threshold parameters may need to be updated to capture the ordering constraints.
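The two updates can be summarized in code. The sketch below is a schematic rendering of Equations (14) and (15) under assumptions flagged in the comments: the smoothed rule of Equation (12) is taken to be the standard logistic, the gradient sample follows the generic one-sample form suggested around Equation (13) (comparing the values of admitting versus blocking), and the projection simply clips each component and re-imposes the ordering. None of these details are reproduced verbatim from the paper.

```python
import math

def fast_value_update(V, s, s_next, reward, a_n, ref_state=0):
    """Faster timescale (schematic form of Eq. (14)): an online relative-value
    update of the state just visited, with a fixed reference state."""
    V[s] += a_n * (reward + V[s_next] - V[ref_state] - V[s])

def smoothed_block_prob(s, tau_i):
    """Assumed logistic form of the smoothed threshold rule (cf. Eq. (12))."""
    return 1.0 / (1.0 + math.exp(-(s - tau_i)))

def slow_threshold_update(tau, i, s, V, b_n, buffer_size):
    """Slower timescale (schematic form of Eq. (15)): nudge the i-th threshold
    along a one-sample gradient estimate, then project so that the components
    stay in [0, buffer_size] and remain non-increasing in the class index."""
    p = smoothed_block_prob(s, tau[i])
    grad_f = p * (1.0 - p)                      # magnitude of the logistic derivative
    s_admit, s_block = min(s + 1, buffer_size), s
    advantage = V[s_admit] - V[s_block]         # is admitting currently worth it?
    tau[i] += b_n * grad_f * advantage
    # Projection: clip and restore the ordering tau[0] >= tau[1] >= ...
    tau[i] = min(max(tau[i], 0.0), float(buffer_size))
    for j in range(i + 1, len(tau)):
        tau[j] = min(tau[j], tau[j - 1])
    for j in range(i - 1, -1, -1):
        tau[j] = max(tau[j], tau[j + 1])

V = [0.0] * 11                  # value estimates for 0..10 customers in the system
tau = [8.0, 5.0, 2.0]           # one threshold per class, already ordered
fast_value_update(V, s=4, s_next=5, reward=2.0, a_n=0.5)
slow_threshold_update(tau, i=1, s=4, V=V, b_n=0.1, buffer_size=10)
print(V[4], tau)
```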

Remark 4.

Instead of the two-timescale approach adopted in this paper, a multi-timescale approach, where each individual threshold is updated on a separate timescale, may be chosen. However, since the updates of the thresholds are coupled only through the ordering constraints, they can be updated on the same timescale. Moreover, in practice, a multi-timescale approach may not work well since the fastest (slowest) timescale may be too fast (slow).

Theorem 2.

The schemes described by Equations (14) and (15) converge to the optimal policy almost surely (a.s.).

Proof.

Proof is given in Appendix B. ∎

Based on the foregoing analysis, we describe the resulting SALMUT algorithm in Algorithm 1. The iteration counter, the value functions and the threshold vector are initialized first. At a decision epoch, if there is an arrival of a specific class of customer, then the action is chosen based on the current value of the threshold vector. The value function of the current state is then updated using Equation (14) on the faster timescale, and the threshold vector is updated following Equation (15) on the slower timescale. Note that only one value function is updated at a time, whereas multiple components of the threshold vector may need to be updated in a single iteration.

1: Initialize the number of iterations, the value functions and the threshold vector.
2: while TRUE do
3:     if arrival of a class-i customer then
4:         Choose the action based on the current value of the threshold vector.
5:     end if
6:     Update the value function of the current state using Equation (14).
7:     Update the threshold vector using Equation (15).
8:     Update the iteration and visit counters.
9: end while
Algorithm 1 Two-timescale SALMUT algorithm
Remark 5.

Even if there does not exist a threshold policy which is optimal for a given MDP problem, the techniques presented in this paper can be applied to learn the best threshold policy (locally at least) asymptotically. Threshold policies are easy to implement. In many cases, they provide comparable performances to that of the optimal policy, with a significantly lower storage complexity.

V Complexity Analysis

In this section, we compare the storage and computational complexities of the proposed SALMUT algorithm with those of existing learning schemes including Q-learning and PDS learning. We summarize our analysis in Table I.

Algorithm Computational Storage
complexity complexity
Q-learning [sutton1998reinforcement, watkins1992q]
Monotone Q-learning [kunnumkal2008exploiting, ngo2010monotonicity]
PDS learning [salodkar2008line, powell2007approximate]
VE learning [mastronarde2012joint]
Grid learning [sharma2018accelerated]
Adaptive appx. learning [fu2012structure]
SALMUT
Table I: Computational and storage complexities of RL algorithms.

Q-learning needs to store the Q-function of every state-action pair. While updating the value function, it chooses the best action after evaluating the Q-functions of all feasible actions. Thus, the storage and per-iteration computational complexities of Q-learning scale with the sizes of the state-action space and the action space, respectively. Since at each iteration the monotone Q-learning algorithm [kunnumkal2008exploiting, ngo2010monotonicity] projects the policy obtained using Q-learning onto the set of monotone policies, its storage and computational complexities are identical to those of Q-learning.

PDS learning which involves computation of functions in every iteration, has a per-iteration computational complexity of . The storage complexity of PDS learning is because value functions of PDSs and feasible actions in different states need to be stored. VE learning [mastronarde2012joint] updates multiple PDSs at a time. Therefore, the computational complexity contains an additional term which signifies the cardinality of the VE tuple. Similarly, grid learning in [sharma2018accelerated] and adaptive approximation learning in [fu2012structure] are associated with additional factors and , respectively. and depend on the depth of a quadtree used for value function approximation and the approximation error threshold , respectively.

In the SALMUT algorithm, we need to store the value function of every state. Moreover, since the threshold vector completely characterizes the policy, we no longer need to store the feasible actions in different states. This results in a reduced storage complexity. The SALMUT algorithm may require updating all components of the threshold vector at a time (see Equation (15)). Furthermore, the update of the value function involves the computation of a single function corresponding to the current value of the threshold (see Equation (14)). Therefore, the per-iteration computational complexity scales only with the number of classes. Thus, the proposed algorithm provides significant improvements in storage and per-iteration computational complexities compared to traditional RL schemes. Note that the computational complexity of the proposed scheme does not depend on the size of the action space and depends only on the number of classes of customers.

VI Simulation Results

In this section, we demonstrate the advantage offered by the proposed SALMUT algorithm in terms of convergence speed with respect to traditional RL algorithms such as Q-learning and PDS learning. We simulate the multi-server system with a finite buffer where two classes of customers are present. We take s. We choose , and . The cost function is chosen as . The step size schedules are chosen as and . Our observations establish that the SALMUT algorithm converges faster than the other algorithms. Note that the proposed algorithm can be applied to any general scenario involving optimality of threshold policies, e.g., [agarwal2008structural, sinha2012optimal, brouns2006optimal, ngo2009optimality].

VI-A Convergence Behavior

Figure 1: Plot of average reward vs. number of iterations for different algorithms.

We describe the convergence behaviors of the Q-learning, PDS learning and SALMUT algorithms in Figs. 1(a) and 1(b). We exclude the initial 10 burn-in period values of the iterates to facilitate a convenient representation. Since, unlike PDS learning, Q-learning involves an exploration mechanism, the convergence of PDS learning is faster than that of Q-learning. However, the proposed algorithm converges faster than both the Q-learning and PDS learning algorithms since it operates on a smaller policy space (threshold policies only). As observed in Fig. 1(a), the Q-learning and PDS learning algorithms take approximately 3000 and 2000 iterations to converge, which translate into 1502 and 1008 s, respectively. On the other hand, the SALMUT algorithm converges in only 1500 iterations (779 s). Similarly, in Fig. 1(b), the convergence time reduces from 1540 s (Q-learning and PDS learning) to 509 s (SALMUT). These correspond to approximately 3000, 3000 and 1000 iterations, respectively.

VI-B Stopping Criteria for Practical Convergence

Figure 2: Plot of average reward vs. sum of step sizes till the current iteration for different algorithms.

In practical cases, one may not wait for actual convergence to happen. When the average reward of the system does not change much over a suitable window, we may conclude that the stopping condition is met. This translates into the obtained policy being in a close neighborhood of the optimal policy with high probability. The window is defined over the sum of step sizes rather than the iteration count, to eliminate the effect of the diminishing step size on the convergence behavior. We choose a fixed window size, observe the ratio of the maximum and minimum average rewards over this window, and conclude that convergence is achieved when this ratio is sufficiently close to one. Fig. 2(a) reveals that practical convergence for the Q-learning, PDS learning and SALMUT algorithms is achieved in 1180, 580 and 426 iterations, respectively. Similarly, in Fig. 2(b), the Q-learning, PDS learning and SALMUT algorithms converge in approximately 1180, 1180 and 580 iterations, respectively.
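A practical implementation of such a stopping rule is sketched below. It is a simplified version indexed by iterations rather than by the sum of step sizes, and the window length and ratio threshold are placeholders rather than the values used in the experiments.

```python
def has_converged(avg_rewards, window=100, ratio_threshold=0.99):
    """Declare practical convergence when the running average reward is nearly
    flat over a trailing window: min/max of the windowed values >= threshold."""
    if len(avg_rewards) < window:
        return False
    recent = avg_rewards[-window:]
    lo, hi = min(recent), max(recent)
    if hi <= 0:                       # guard against degenerate traces
        return False
    return lo / hi >= ratio_threshold

# Example: a reward trace that flattens out eventually triggers the rule.
trace = [1.0 + 0.5 / (n + 1) for n in range(400)]
print(has_converged(trace))
```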

VII Conclusions & Future Directions

In this paper, we have considered the optimal admission control problem in a multi-server queuing system. We have proved the existence of a threshold-based optimal policy where the threshold for the admission of class i of customers is a non-increasing function of i. We have proposed an RL algorithm which exploits the threshold nature of the optimal policy in the learning framework. Since the proposed algorithm operates only on the set of ordered threshold policies, it converges faster than traditional RL algorithms. The convergence of the proposed algorithm to the globally optimal threshold vector is established. Apart from the gain in convergence speed, the proposed scheme provides improvements in storage and computational complexities too. Simulation results establish the improvement in convergence behavior with respect to state-of-the-art RL schemes.

In the future, this work can be extended to solve Constrained MDP (CMDP) based RL problems. Usually, a CMDP is handled with a two-timescale approach [borkar2008stochastic] where the value functions and the Lagrange Multiplier (LM) are updated on the faster and slower timescale, respectively. Structure-awareness may introduce a third timescale for the update of the threshold parameter. Alternatively, the LM and the threshold parameter can be updated on the same slower timescale as they are independent of each other. Another possible future direction is to develop RL algorithms for restless bandits such as [borkar2018reinforcement], since threshold policies often translate into index-based policies.

Appendix A Proof of Lemma 1

The proof techniques are similar to those of our earlier work [roy2019structure]. The optimality equation for the value function is

In Value Iteration Algorithm (VIA), let the value function of state at iteration be denoted by . We start with . Therefore, is a non-increasing function of . Since (using definition of )

(16)

and is non-decreasing in , same property holds for . Let us assume that is a non-increasing function of . We require to prove that is a non-increasing function of . Since , this implies the lemma.

We define and as

and . Also, we define
and . Hence, we have,

and

Since is a non-increasing function of , is non-increasing in , and . Let the maximizing actions for the admission of a class- customer in states and be denoted by and , respectively. Therefore,

Let us denote . To prove that is non-increasing in , we need to prove that . We consider the following cases.


  • .



  • .


To analyze the difference of second and third terms in Equation (16) corresponding to states and , we consider two cases.

  • : Both and are non-increasing in .

  • : The difference is equal to

    which is non-increasing in .

Since is non-decreasing in , this proves that is non-increasing in . Hence, is non-increasing in .

Appendix B Proof of Theorem 2

The proof methodologies are based on the approach of viewing SA algorithms as a noisy discretization of a limiting Ordinary Differential Equation (ODE) and are similar to those of [roy2019structure, roy2019low]. Step-size parameters are viewed as discrete time steps. Standard assumptions on the step sizes (viz., Equations (7) and (8)) ensure that the errors due to noise and discretization are negligible asymptotically. Therefore, asymptotically, the iterates closely follow the trajectory of the ODE, ensuring a.s. convergence to the globally asymptotically stable equilibrium.

Using the two-timescale approach in [borkar2008stochastic], we consider Equation (14) for a fixed value of the threshold vector, and consider the map described by

(17)

Note that the knowledge of these quantities is not required for the proposed algorithm and is needed only for the purpose of analysis. For a fixed threshold vector, Equation (14) tracks the following limiting ODE:

(18)

The iteration converges to the fixed point of the map [konda1999actor], which is the asymptotically stable equilibrium of the ODE, as time goes to infinity. Analogous methodologies are adopted in [abounadi2001learning, konda1999actor].

Next we establish that the value function and threshold vector iterates are bounded.

Lemma 2.

The threshold vector and value function iterates are bounded a.s.

Proof.

Consider the following map:

(19)

Clearly, if the reward and cost functions are zero, Equation (17) is the same as Equation (19). The globally asymptotically stable equilibrium of the following ODE, which is a scaled limit of the ODE (18),

(20)

is the origin. The boundedness of the value function and threshold vector iterates follows from [borkar2000ode] and Equation (15), respectively. ∎

In this proof, we have considered a scaled ODE which approximately follows the original ODE if the value functions become unbounded along a subsequence. Since the origin is the globally asymptotically stable equilibrium of the scaled ODE, the scaled ODE must return to the origin. Hence, the original value function iterates must also move towards a bounded set, which guarantees the stability of the value function iterates.

Lemma 3.