1 Introduction
We consider the problem of reinforcement learning (RL) for controlling an unknown dynamical system with an unbounded state space. Such problems are ubiquitous in various application domains, as exemplified by scheduling for networked systems. As a paradigm for learning to control dynamical systems, RL has a rich literature. In particular, algorithms for settings with finite, bounded or compact state spaces have been well studied, with both classical asymptotic results and recent nonasymptotic performance guarantees. However, the literature on problems with an unbounded state space is scarce, with few exceptions such as the linear quadratic regulator [1, 15], where the structure of the dynamics is known. Indeed, the unboundedness of the state space presents new challenges for algorithm and policy design, as well as for analysis, i.e., for quantifying the "goodness" of a policy.
A motivating example. To exemplify the challenges involved, we consider a simple example of a discrete-time queueing system with two queues, as shown in Figure 1: jobs arrive to queue $i \in \{1, 2\}$ per a Bernoulli process with rate $\lambda_i$. A central server can choose to serve a job from one of the queues at each time, and if a job from queue $i$ is chosen to be served, it departs the system with probability $\mu_i$. That is, in effect $\lambda_i/\mu_i$ amount of "work" arrives to queue $i$ per unit time, while the total amount of work the system can do is $1$. The state of the system is $x = (x_1, x_2)$, with $x_i$ representing the number of jobs in the $i$th queue.
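The two-queue dynamics just described can be made concrete with a short simulation. The rates $\lambda = (0.3, 0.3)$, $\mu = (0.8, 0.8)$ and the serve-the-longest-queue rule are illustrative assumptions used only for this sketch; they are not fixed by the model.

```python
import random

def step(x, action, lam=(0.3, 0.3), mu=(0.8, 0.8)):
    """One time slot of the two-queue system: a potential departure from the
    served queue, then Bernoulli(lam_i) arrivals to each queue."""
    x = list(x)
    # A job departs from the served queue with probability mu[action],
    # provided that queue is nonempty.
    if x[action] > 0 and random.random() < mu[action]:
        x[action] -= 1
    # Independent Bernoulli arrivals to each queue.
    for i in range(2):
        if random.random() < lam[i]:
            x[i] += 1
    return tuple(x)

def serve_longest(x):
    """Illustrative scheduling rule: serve the longer queue."""
    return 0 if x[0] >= x[1] else 1

random.seed(0)
x = (0, 0)
for t in range(10_000):
    x = step(x, serve_longest(x))
print("final state:", x)
```

With total load $\lambda_1/\mu_1 + \lambda_2/\mu_2 = 0.75 < 1$, the queue lengths remain small over the run; an unstable rule (e.g., always serving queue 1) would let queue 2 grow without bound.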
The evolution of the system is controlled by a scheduling decision that specifies which queue
to serve at each time. Viewed as a Markov decision process (MDP), the state space is $\mathbb{Z}_+^2$ and the action space is $\{1, 2\}$. Operating under a Markov policy $\pi$, the server will serve queue $\pi(x) \in \{1, 2\}$ at state $x$. The problem of stochastic control is to identify a policy that optimizes a given criterion (e.g., average or discounted total queue lengths). Finding such a policy using empirical data is the question of Reinforcement Learning (RL).

Solutions that may not work. In the traditional RL approach for finite, bounded or compact state spaces, the policy is trained offline using finitely many samples and then deployed in the wild without further changes. A natural adaptation of such an approach is to restrict the RL policy to a finite subset of the state space, chosen appropriately or arbitrarily. Examples of such algorithms include model-based methods (such as UCRL/PSRL [25, 38]
) that estimate the transition probabilities of the system and then solve the dynamic programming problem on the estimated system, and model-free methods (such as TD/Q-Learning [48]) that directly estimate the value function. However, even in our simple, motivating example, the system will reach a state not contained in the training data with nonzero probability. At such a state $x$, the estimates of the transition probabilities and of the value function remain at their initial/default values (say 0). With such an uninformative estimate, the corresponding policy is independent of the state $x$, and it may end up serving an empty queue with nonzero probability. This can cause the queues to grow unboundedly with strictly positive probability. Clearly, more sophisticated approaches to truncating the system are not going to help, as they suffer from a similar issue.
An alternative to truncation is to find "lower dimensional structure" through function approximation, e.g., by parametrizing the policy
within some function class (such as linear functions or neural networks). For this approach to work, the function class must be expressive enough to contain a stable policy. However, it is not at all clear,
a priori, which parametric function class has this property, even for the simple example in Figure 1. This challenge is only exacerbated in more complicated systems. Although some approximation architectures work well empirically [34, 33], there is no rigorous performance guarantee in general.

Challenges. To sum up, traditional RL approaches for finite, bounded or compact state spaces are not well suited for systems with an unbounded state space. Approaches that rely on offline training alone are bound to fail: the system will reach a state that was not observed among the finitely many samples seen during offline training, and hence the policy offers no meaningful guidance there. Therefore, to learn a reasonable policy with an unbounded state space, the policy ought to be updated whenever a new scenario is encountered. That is, unlike traditional RL, we need to consider online policies, i.e., policies that are continually updated upon encountering new scenarios, much like in the context of multi-armed bandits (cf. [9]).
Another challenge is in analyzing or quantifying the "goodness" of such a policy. Traditionally, the "goodness" of an RL policy is measured in terms of the error induced in approximating, for example, the optimal value function over the entire state space, usually through a sup-norm error bound; cf. [28, 8]. Since the state space is unbounded, expecting a good approximation of the optimal value function over the entire state space is not a meaningful measure. Therefore, we need an alternative way to quantify the "goodness" of a policy.
Questions of interest. In this work, we are interested in the following questions: (a) What is an appropriate notion of "goodness" for an RL policy over an unbounded state space? (b) Is there an online, data-driven RL policy that achieves such "goodness"? And if so, (c) how does the number of samples required per time step scale?
Our contributions. Motivated by the above considerations, we consider discounted Markov Decision Processes with an unbounded state space and a finite action space, under a generative model.
New Notion of Stability. As the main contribution, we introduce a notion of stability to quantify the "goodness" of an RL policy for an unbounded state space, inspired by the literature on queueing systems and control theory. Informally, an RL policy is stable if the system dynamics under the policy returns to a finite, bounded or compact subset of the state space infinitely often; in the context of our example, it would imply that queue sizes remain finite with probability $1$.
Stable RL Policy. As a proof of concept, we present a simple RL policy using a Sparse Sampling Monte Carlo Oracle that is stable for any MDP, as long as the optimal policy respects a Lyapunov function with a drift condition (cf. Condition 2). Our policy does not require knowledge of or access to such a Lyapunov function. It recommends an action at each time using finitely many simulations of the MDP through the oracle. That is, the policy is online and guarantees stability for each trajectory, starting without any prior training. The number of samples required at each time step scales superpolynomially in $1/\alpha$, where $\alpha > 0$ is the drift of the Lyapunov function. To the best of our knowledge, ours is the first online RL policy that is stable for generic MDPs with unbounded state space.
Sample Efficient Stable RL Policy. To further improve efficiency, for MDPs with a Lipschitz optimal value function, we propose a modified Sparse Sampling Monte Carlo Oracle for which the number of samples required at each time step scales polynomially in $1/\alpha$, with the degree of the polynomial depending on the dimension $d$ of the state space. That is, the sample complexity becomes polynomial in $1/\alpha$, from being superpolynomial with the vanilla oracle. The efficient oracle utilizes the minimal structure of smoothness in the optimal value function and should be of interest in its own right, as it provides an improvement for all policies in the literature where such an oracle plays a key role, e.g., [27, 39].
Adaptive Algorithm Based on Statistical Tests. While the algorithm does not require knowing the Lyapunov function itself, it does have a parameter whose optimal value depends on the drift parameter $\alpha$ of the Lyapunov function. Therefore, we further develop an adaptive, agnostic version of our algorithm that automatically searches for an appropriate tuning parameter. We establish that either this algorithm discovers the right value and hence ensures stability, or the system is near-stable in the sense that $\|x_t\|/t \to 0$ as $t \to \infty$. Near-stability is a form of sublinear regret. For example, in the context of a queueing system, this would correspond to queues growing sublinearly, as $o(t)$, with time, in contrast to bounded queues under a stable (or optimal) policy. Further, in the context of a queueing system, it would imply "fluid" or "rate" stability, cf. [13, 12]; to the best of our knowledge, this is the first general RL policy for generic queueing systems with such a property.
Table 1: Comparison with representative prior work.

|      | Settings/Conditions | Guarantees |
|------|---------------------|------------|
| Ours | Unbounded space; unknown dynamics; existence of unknown Lyapunov function | Stochastic stability |
| [15] | Linear dynamics; unknown parameters; quadratic cost (LQR) | Constraints on state & action |
| [49] | Finite space; deterministic, known dynamics; Gaussian safety function | Constraints on state & action |
| [55] | Unknown dynamics; compact parametrized policy class | Expected constraint costs (CMDP) |
| [51] | Gaussian process dynamics; unknown parameters | Control-theoretic stability |
| [7]  | Compact space; deterministic, partially known dynamics; access to Lyapunov func. | Control-theoretic stability |
Related work. The concept of stability introduced in this paper is related to the notion of safety in RL, but with crucial differences. Various definitions of safety exist in the literature [40, 19], such as hard constraints on individual states and/or actions [18, 23, 14, 15, 29], expected cumulative constraint costs formulated as constrained MDPs [2, 11, 55], and control-theoretic notions of stability [6, 51, 7]. Importantly, in our work, stability is defined in terms of positive recurrence of Markov chains, which cannot be immediately written as constraints/costs over the states and actions. In particular, our stability notion captures the long-term behavior of the system: it should eventually stay in a desirable, bounded region with high probability. In general, there does not exist an action that immediately drives the system back to that region; learning a policy that achieves this in the long run is nontrivial and is precisely our goal. Overall, we believe this new notion of stability provides a generic, formal framework for studying RL with unbounded state space. In Table 1, we provide a concise comparison with some prior work, and we refer the reader to Appendix A for detailed discussions.
2 Setup and Problem Statement
2.1 Markov Decision Process and Online Policy
Markov Decision Process (MDP). We consider a discrete-time discounted Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space, respectively, $P$ is the Markovian transition kernel, $r$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. At time $t$, the system is in state $x_t \in \mathcal{S}$; upon taking action $a_t \in \mathcal{A}$, the system transits to $x_{t+1}$ with probability $P(x_{t+1} \mid x_t, a_t)$ and generates a reward $r(x_t, a_t)$. At time $t$, the action $a_t$ is chosen as per some policy $\pi_t$, where $\pi_t(a \mid x)$ represents the probability of taking action $a$ given state $x$. If $\pi_t = \pi$ for all $t$, then the policy is called stationary. For each stationary policy $\pi$, we define the standard value function and Q-function, respectively, as $V^{\pi}(x) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \mid x_0 = x\big]$ and $Q^{\pi}(x, a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \mid x_0 = x, a_0 = a\big]$. An optimal stationary policy $\pi^*$ is a policy that achieves the optimal value, i.e., $V^{\pi^*}(x) = \sup_{\pi} V^{\pi}(x)$ for all $x$. Correspondingly, $V^* = V^{\pi^*}$ and $Q^* = Q^{\pi^*}$. It is well understood that such an optimal policy and the associated value/Q functions, $V^*$ and $Q^*$, exist in reasonable generality; cf. [8]. We focus on MDPs satisfying the following condition.
Condition 1.
Action space $\mathcal{A}$ is finite, and state space $\mathcal{S} \subseteq \mathbb{R}^d$ is unbounded, for some dimension $d \geq 1$.
We assume that the reward function is bounded and takes values in $[0, 1]$. Consequently, for any policy, the $V$ and $Q$ functions are bounded and take values in $[0, V_{\max}]$, where $V_{\max} = 1/(1-\gamma)$. Breaking ties randomly, we shall restrict to an optimal policy of the following form: for each $x \in \mathcal{S}$,
$\pi^*(x) \in \arg\max_{a \in \mathcal{A}} Q^*(x, a)$.
Define the gap $\Delta(x) = Q^*(x, \pi^*(x)) - \max_{a \notin \arg\max_{a'} Q^*(x, a')} Q^*(x, a)$. For a given state $x$, if not all actions are optimal, then $\Delta(x) > 0$. Denote the minimum gap by $\Delta_{\min} = \inf_{x \in \mathcal{S}} \Delta(x)$. Throughout the paper, we assume that $\Delta_{\min} > 0$.
System dynamics and online policy. In this work, our interest is in designing an online policy starting with no prior information. Precisely, the system starts with an arbitrary initial state $x_0$ and an initial policy $\pi_0$. At time $t$, given state $x_t$, action $a_t$ is chosen with probability $\pi_t(a_t \mid x_t)$, leading to state $x_{t+1}$ with probability $P(x_{t+1} \mid x_t, a_t)$. At each time $t$, the policy $\pi_t$ is decided using potentially a finite number of simulation steps from the underlying MDP, in addition to the historical information observed till time $t$. In this sense, the policy is online.
2.2 Stability
We desire our online policy to have a stability property, formally defined as follows.
Definition 1 (Stability).
We call a policy stable if for any $\epsilon \in (0, 1)$, there exists a bounded set $B_\epsilon \subset \mathcal{S}$ such that the following are satisfied:

Boundedness:
(1) $\limsup_{t \to \infty} \mathbb{P}(x_t \notin B_\epsilon) \leq \epsilon.$

Recurrence: Let $\tau_{B_\epsilon} = \inf\{t \geq 1 : x_t \in B_\epsilon\}$.^{1} [Footnote 1: The infimum of the empty set is defined as $+\infty$.] Then
(2) $\mathbb{P}(\tau_{B_\epsilon} < \infty) = 1.$
In words, we desire that starting with no prior information, the policy learns as it goes and manages to retain the state in a finite, bounded set with high probability. It is worth remarking that the above stability property is similar to the recurrence property for Markov chains.
Problem statement. Design an online stable policy for a given, discounted cost MDP.
MDPs respecting a Lyapunov function. As we search for a stable policy for the MDP, note that if the optimal stationary policy is stable, then the resulting system dynamics under the optimal policy is a time-homogeneous Markov chain that is positive recurrent. The property of positive recurrence is known to be equivalent to the existence of a so-called Lyapunov function; see [35]. In particular, if there exists a policy for the MDP under which the resulting Markov chain is positive recurrent, then this policy admits a Lyapunov function. These observations motivate us to restrict attention to MDPs with the following property.
Condition 2.
The Markov chain over $\mathcal{S}$ induced by the MDP operating under the optimal policy $\pi^*$ respects a Lyapunov function $L : \mathcal{S} \to \mathbb{R}_+$ such that $L(x) \to \infty$ as $\|x\| \to \infty$, the level set $\{x : L(x) \leq \ell\}$ is bounded for all finite $\ell$, and such that for some $\nu \geq 1$ and $\alpha > 0$:

(Bounded Increment) For every $x \in \mathcal{S}$ and every $a \in \mathcal{A}$,
(3) $|L(x') - L(x)| \leq \nu$
for all possible next states $x'$ with $p(x' \mid x, a) > 0$.^{2} [Footnote 2: $p(\cdot \mid x, a)$ should be understood as the density of the conditional distribution of the next state with respect to the Lebesgue measure (for continuous state space) or the counting measure (for discrete state space).]

(Drift Condition) For every $x \in \mathcal{S}$ such that $L(x) > \nu$,
(4) $\mathbb{E}\big[L(x_{t+1}) - L(x_t) \mid x_t = x\big] \leq -\alpha.$
Recall our motivating example of a queueing system with two queues. It is well known and can be easily verified that for the serve-the-longest-queue policy, a norm of the queue-length vector, e.g., $L(x) = \|x\|_2$, serves as a Lyapunov function satisfying Condition 2 as long as the load satisfies $\lambda_1/\mu_1 + \lambda_2/\mu_2 < 1$. As explained above, the requirement of the MDP respecting a Lyapunov function under the optimal policy is not restrictive. We further note that our policy will not require precise knowledge of or access to such a Lyapunov function.
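Under the same illustrative rates as before, the drift condition for serve-the-longest-queue can be checked numerically. This sketch estimates the one-step expected change of $L(x) = \|x\|_2$ at a large state; the rates, the choice of norm, and the probed state are assumptions for illustration only.

```python
import math
import random

def expected_drift(x, lam=(0.3, 0.3), mu=(0.8, 0.8), n=20_000):
    """Monte Carlo estimate of E[L(x') - L(x)] with L = Euclidean norm,
    under the serve-the-longest-queue action at state x."""
    a = 0 if x[0] >= x[1] else 1           # serve the longer queue
    L = lambda s: math.hypot(s[0], s[1])
    total = 0.0
    for _ in range(n):
        y = list(x)
        # Departure from the served queue, then Bernoulli arrivals.
        if y[a] > 0 and random.random() < mu[a]:
            y[a] -= 1
        for i in range(2):
            if random.random() < lam[i]:
                y[i] += 1
        total += L(y) - L(x)
    return total / n

random.seed(1)
print(expected_drift((40, 35)))  # negative: drift back toward the origin
```

At the probed state the estimate is strictly negative, consistent with the Drift Condition (4) for states far from the origin.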
3 Online Stable Policy
We present our main results in this section. First, we describe a stationary, online policy that is stable under Conditions 1 and 2. This policy is simple and provides the key intuition behind our approach, though it is sample-inefficient. Next, we present an efficient version thereof that utilizes the minimal structure of a Lipschitz optimal value function (cf. Condition 3). Finally, we design an adaptive version of our algorithm that automatically discovers the appropriate tuning parameter without knowing the value of the drift parameter of the Lyapunov function.
3.1 Sample Inefficient Stable Policy
Overview of policy. At each time $t$, given the state $x_t$, an action is chosen by sampling from a distribution over $\mathcal{A}$. The distribution is determined by an online planning algorithm that uses finitely many simulations of the MDP performed at time $t$ and depends only on the input state $x_t$. Therefore, the policy is stationary. Precisely, using the Sparse Sampling Monte Carlo Oracle with parameters $H$ (depth) and $m$ (width) (details below, adapted from [26]), we produce $\hat{Q}(x_t, a)$ as an estimate of $Q^*(x_t, a)$, for each action $a \in \mathcal{A}$. We use the Boltzmann distribution with temperature parameter $\theta > 0$ for sampling action $a_t$:
$\pi_t(a \mid x_t) = \frac{\exp\big(\hat{Q}(x_t, a)/\theta\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\hat{Q}(x_t, a')/\theta\big)}.$
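The Boltzmann sampling step can be sketched as follows; the parameter name `theta` and the max-subtraction trick for numerical stability are implementation choices, not part of the paper's specification.

```python
import math
import random

def boltzmann_sample(q_values, theta, rng=random):
    """Sample an action index from pi(a) ∝ exp(q_values[a] / theta).
    Small theta concentrates the distribution on the argmax action."""
    m = max(q_values)                      # subtract max for numerical stability
    w = [math.exp((q - m) / theta) for q in q_values]
    z = sum(w)
    r, acc = rng.random() * z, 0.0
    for a, wa in enumerate(w):
        acc += wa
        if r <= acc:
            return a
    return len(q_values) - 1               # guard against rounding
```

For instance, with Q-estimates `[1.0, 0.0]` and `theta = 0.1`, the first action is returned almost always, mimicking a near-greedy choice.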
Sparse Sampling Monte Carlo Oracle. We describe the Monte Carlo Oracle based on sparse sampling [26]. Given an input state $x$, two integer-valued parameters $H$ (depth) and $m$ (width), and an estimate of the value function at the leaves, it produces an estimate $\hat{Q}(x, a)$ of $Q^*(x, a)$ for all $a \in \mathcal{A}$. The error in the estimate depends on $H$, $m$, and the error in the leaf value estimate. With larger values of $H$ and $m$, the estimation error decreases but the number of simulations of the MDP required increases. Next we provide details of the algorithm.
The sparse sampling oracle constructs a tree of depth $H$ representing a partial $H$-step lookahead of the MDP, starting with the input state $x$ as its root. The tree is constructed iteratively as follows: at any state node $s$ in the tree (initially, the root node $x$), for each action $a \in \mathcal{A}$, sample $m$ times the next state of the MDP for the state-action pair $(s, a)$. Each of the resulting next states, for each $a$, is placed as a child node, with the edge between $s$ and the child node labeled by the generated reward. The process is repeated for all nodes of each level until reaching a depth of $H$. That is, it builds an $(m|\mathcal{A}|)$-ary tree of depth $H$ representing a partial $H$-step lookahead from the queried state $x$; hence the term sparse sampling oracle.
To obtain an estimate for the values associated with $x$, we start by assigning the value $0$ (or the supplied leaf estimate) to each leaf node $s$ at depth $H$, i.e., $\hat{V}_H(s) = 0$. These values, together with the associated rewards on the edges, are then backed up to find value estimates for their parents, i.e., nodes at depth $H-1$. The estimate for the value of a parent node $s$ and an action $a$ is a simple average over $s$'s children under $a$, i.e., $\hat{Q}_h(s, a) = \frac{1}{m} \sum_{s' \in C_a(s)} \big(r(s, a) + \gamma \hat{V}_{h+1}(s')\big)$, where $r(s, a)$ is the reward on the edges of action $a$, and $C_a(s)$ is the set of $s$'s children nodes under $a$. The estimate for the value of a state $s$ at depth $h$ is given by $\hat{V}_h(s) = \max_{a \in \mathcal{A}} \hat{Q}_h(s, a)$. The process is recursively applied from the leaves up to the root level to find the estimates $\hat{Q}(x, a) = \hat{Q}_0(x, a)$ for the root node $x$.
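The tree construction and backup above amount to a short mutually recursive procedure. This is a minimal sketch: the generative model `simulate(s, a)`, returning one `(reward, next_state)` sample, is an assumed interface, and leaves are pinned to value 0.

```python
def sparse_sampling_q(simulate, actions, x, H, m, gamma):
    """Estimate Q*(x, a) for every action a by building an (m*|A|)-ary
    lookahead tree of depth H rooted at x.  `simulate(s, a)` must return
    one (reward, next_state) sample from a generative model of the MDP."""
    def value(s, h):
        # V_h(s) = max_a Q_h(s, a); depth-H leaves are assigned value 0.
        if h == H:
            return 0.0
        return max(qvalue(s, a, h) for a in actions)

    def qvalue(s, a, h):
        # Average of r + gamma * V_{h+1}(s') over m sampled children.
        total = 0.0
        for _ in range(m):
            r, s_next = simulate(s, a)
            total += r + gamma * value(s_next, h + 1)
        return total / m

    return {a: qvalue(x, a, 0) for a in actions}
```

As a sanity check, for a deterministic single-action MDP with constant reward 1, the estimate equals the truncated geometric sum $\sum_{h=0}^{H-1} \gamma^h$.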
Lemma 1 (Oracle Guarantee).
Given input state $x$ and $\epsilon, \delta > 0$, under the Sparse Sampling Monte Carlo Oracle, we have
(5) $\max_{a \in \mathcal{A}} \big|\hat{Q}(x, a) - Q^*(x, a)\big| \leq \epsilon$
with probability at least $1 - \delta$, with a choice of parameters
(6) $H = \Theta\Big(\frac{1}{1-\gamma} \log \frac{1}{\epsilon(1-\gamma)}\Big), \qquad m = \tilde{\Theta}\Big(\frac{V_{\max}^2}{\epsilon^2}\Big).$
The number of simulation steps of the MDP utilized is $O\big((m|\mathcal{A}|)^H\big)$, which as a function of $1/\epsilon$ scales superpolynomially, as $(1/\epsilon)^{O(\log(1/\epsilon))}$.
We omit the proof of Lemma 1, as it follows directly from the result in [26]. Further, we provide a proof of Lemma 2, which establishes similar guarantees for a modified, sample-efficient Sparse Sampling Monte Carlo Oracle.
3.1.1 Stability Guarantees
Stability. We state the result establishing stability of the stationary policy described above under an appropriate choice of the parameters $\epsilon$ and $\theta$. The proof is provided in Appendix B.

Theorem 1. Suppose the MDP satisfies Conditions 1 and 2. Then the stationary policy described in Section 3.1, with oracle accuracy $\epsilon$ small enough relative to the drift $\alpha$ and the minimum gap $\Delta_{\min}$ (cf. (7)) and Boltzmann parameter $\theta$ chosen accordingly, is stable in the sense of Definition 1.
3.2 Sample Efficient Stable Policy
The Purpose. Theorem 1 suggests that the number of samples required per time step, as a function of $\alpha$, scales superpolynomially in $1/\alpha$. Indeed, as $\alpha$ becomes larger, i.e., the negative drift of the Lyapunov function under the optimal policy increases, the system starts living in a smaller region with a higher likelihood. The challenging regime is when $\alpha$ is small, formally $\alpha \to 0$. Back to our queueing example, this corresponds to what is known as the heavy traffic regime in queueing systems. Analyzing complex queueing systems in this regime allows an understanding of the fundamental "performance bottlenecks" within the system and, subsequently, of the properties of the desired optimal policy. Indeed, a great deal of progress has been made over more than the past four decades; see for example [24, 22, 31]. Back to the setting of the MDP with $\alpha \to 0$: based on Theorem 1, the number of samples required per time step for the policy described in Section 3.1 scales as $(1/\alpha)^{O(\log(1/\alpha))}$. That is, the policy is superpolynomial in $1/\alpha$.
Minimal Structural Assumption. In what follows, we describe a stable policy with the number of samples required scaling polynomially in $1/\alpha$, with the degree of the polynomial depending on the dimension $d$ of the state space $\mathcal{S} \subseteq \mathbb{R}^d$. This efficiency is achieved due to minimal structure in the optimal value function in terms of Condition 3, which effectively states that the value function is Lipschitz. Specifically, we provide an efficient Sparse Sampling Monte Carlo Oracle exploiting the Bounded Increment property (3) in Condition 2 along with the Lipschitzness of the optimal value function as described in Condition 3. We remark that for learning with continuous state/action spaces, smoothness assumptions such as Lipschitz continuity are natural and typical [4, 45, 44, 16].
Condition 3.
Let $\mathcal{S} \subseteq \mathbb{R}^d$. The optimal value function $V^*$ is a Lipschitz continuous function. Precisely, there exists a constant $C > 0$ such that for any $x, y \in \mathcal{S}$,
(8) $|V^*(x) - V^*(y)| \leq C \|x - y\|.$
Overview of policy. The policy is exactly the same as that described in Section 3.1, but with a single difference: instead of using the Sparse Sampling Monte Carlo Oracle, we replace it with an efficient version that exploits Condition 3 as described next.
Efficient Sparse Sampling Monte Carlo Oracle. We describe a modification of the Monte Carlo Oracle based on sparse sampling described earlier. As before, our interest is in obtaining an approximate estimate of $Q^*(x, a)$ for a given $x$ and any $a \in \mathcal{A}$ with a minimal number of samples of the underlying MDP. To that end, we shall utilize property (3) of Condition 2 and Condition 3 to propose a modification of the method described in Section 3.1. For a given parameter $\omega > 0$, define the discretized set
$\mathcal{S}_\omega = \{z \in \mathcal{S} : z = \omega k, \ k \in \mathbb{Z}^d\}.$
For any $x \in \mathcal{S}$, there exists $z \in \mathcal{S}_\omega$ so that $\|x - z\|_\infty \leq \omega$. Define the map $\phi : \mathcal{S} \to \mathcal{S}_\omega$ that maps $x$ to its closest element in $\mathcal{S}_\omega$, i.e., $\phi(x) \in \arg\min_{z \in \mathcal{S}_\omega} \|x - z\|$.
For a given $x \in \mathcal{S}$, we shall obtain a good estimate of $Q^*(x, \cdot)$ effectively using the method described in Section 3.1. We start with $x$ as the root node. For each state-action pair $(x, a)$, we sample the next state of the MDP $m$ times, leading to states $y_1, \dots, y_m$. In contrast to the method described in Section 3.1, we use the states $\phi(y_1), \dots, \phi(y_m)$ in place of $y_1, \dots, y_m$. These form the states (or nodes) of the sampling tree at level $1$. For each state, say $z$, at level $h$, for each $a \in \mathcal{A}$, we sample $m$ next states of the MDP, and replace them by their closest elements in $\mathcal{S}_\omega$ to obtain the states or nodes at level $h+1$; we repeat until reaching level $H$. Note that all states on levels $1$ to $H$ are from $\mathcal{S}_\omega$. To improve sample efficiency, during the construction, if a state has been visited before, we reuse the previously sampled next states for each action $a$, instead of obtaining new samples.
To obtain an estimate for the Q-value associated with the root state $x$ and any action $a$, we start by assigning the value $0$ to each leaf node $z$ at depth $H$, i.e., $\hat{V}_H(z) = 0$. These values, together with the associated rewards on the edges, are then backed up to find value estimates for their parents at depth $H-1$, and this is repeated till we reach the root node $x$. Precisely, for $0 \leq h < H$ and node $z$ at level $h$,
(9) $\hat{Q}_h(z, a) = \frac{1}{m} \sum_{y \in C_a(z)} \big(r(z, a) + \gamma \hat{V}_{h+1}(y)\big), \qquad \hat{V}_h(z) = \max_{a \in \mathcal{A}} \hat{Q}_h(z, a).$
The method outputs $\hat{Q}_0(x, a)$ as the estimate of $Q^*(x, a)$ for all $a \in \mathcal{A}$.
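A sketch of the modified oracle, combining lattice rounding with memoized sampling. The interface `simulate(s, a) -> (reward, next_state)`, the coordinate-wise rounding used for $\phi$, and the parameter names are assumptions for illustration.

```python
def make_efficient_oracle(simulate, actions, omega, H, m, gamma):
    """Sparse-sampling oracle over the omega-lattice: sampled next states are
    rounded to their nearest lattice point, and the m samples for each rounded
    (state, action) pair are drawn once and then reused (memoized)."""
    def snap(x):
        # phi: round each coordinate to the nearest multiple of omega.
        return tuple(omega * round(c / omega) for c in x)

    memo = {}  # (state, action) -> list of m cached (reward, next_state) samples

    def samples(s, a):
        if (s, a) not in memo:
            memo[(s, a)] = [simulate(s, a) for _ in range(m)]
        return memo[(s, a)]

    def value(s, h):
        if h == H:
            return 0.0                      # leaves are assigned value 0
        return max(qvalue(s, a, h) for a in actions)

    def qvalue(s, a, h):
        # Backup (9): average of r + gamma * V_{h+1}(phi(y)) over cached samples.
        total = 0.0
        for r, y in samples(s, a):
            total += r + gamma * value(snap(y), h + 1)
        return total / m

    def oracle(x):
        # The root x itself is not rounded; levels 1..H live on the lattice.
        return {a: qvalue(x, a, 0) for a in actions}

    return oracle
```

Because every node below the root lies on the lattice, distinct tree paths revisit the same discretized states and the memoized samples are shared across them, which is the source of the improved sample complexity.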
3.2.1 Stability Guarantees
Improved Sample Complexity. In Lemma 2, we summarize the estimation error as well as the number of samples utilized by this modified Sparse Sampling Monte Carlo Oracle.
Lemma 2 (Modified Oracle Guarantee).
Given input state $x$ and $\epsilon, \delta > 0$, under the modified Sparse Sampling Monte Carlo Oracle, we have
$\max_{a \in \mathcal{A}} \big|\hat{Q}(x, a) - Q^*(x, a)\big| \leq \epsilon$
with probability at least $1 - \delta$, with the parameters $H$, $m$ and the discretization $\omega$ chosen appropriately as functions of $(\epsilon, \delta, \gamma, d)$ and the constants in Conditions 2 and 3.
The number of simulation steps of the MDP utilized, as a function of $1/\epsilon$, scales polynomially in $1/\epsilon$, with the degree of the polynomial linear in the dimension $d$.
Stability. We state the result establishing stability of the stationary policy described above under an appropriate choice of the parameters $\epsilon$ and $\theta$.
Theorem 2.
As discussed earlier, as $\alpha \to 0$, with the above choice $\epsilon = \Theta(\alpha)$, the number of samples required per time step scales as $\mathrm{poly}(1/\alpha)$. That is, the sample complexity per time step is polynomial in $1/\alpha$, where $\alpha$ is the Lyapunov drift parameter, rather than superpolynomial as required in Theorem 1.
3.3 Discovering Appropriate Policy Parameter
The Purpose. The sample-inefficient policy described in Section 3.1 and the sample-efficient policy described in Section 3.2 are stable only if the oracle accuracy parameter $\epsilon$ (or, equivalently, the Boltzmann parameter $\theta$, since the two are tied together in Theorem 1 or 2) is chosen to be small enough. However, what constitutes a small enough value of $\epsilon$ is not clear a priori without knowledge of the system parameters, as stated in (7) or (10). In principle, we can continue reducing $\epsilon$; however, the sample complexity required per unit time increases as $\epsilon$ is reduced. Therefore, it is essential to ensure that the reduction is eventually terminated. Below we describe such a method, based on a hypothesis test using the positive recurrence property established in the proofs of Theorems 1 and 2.
A Statistical Hypothesis Test. Toward the goal of finding an appropriate value of $\epsilon$, i.e., small enough but not too small, we describe a statistical hypothesis test: if $\epsilon$ is small enough, then the test passes with high probability. We shall utilize this test to devise an adaptive method that finds the right value of $\epsilon$, as described below. We state the following structural assumption.

Condition 4.
Consider the setting of Condition 2. Let the Lyapunov function $L$, in addition, satisfy
(11) $c_1 \|x\| \leq L(x) \leq c_2 \|x\|$
for all $x \in \mathcal{S}$, with constants $c_2 \geq c_1 > 0$.
Let $\epsilon_0$ be as defined in (7) and suppose $\epsilon \leq \epsilon_0$. Under the above condition, arguments used in the proofs of Theorems 1 and 2 establish that an exponential tail bound of the following form holds for appropriate constants $C, c > 0$:
$\mathbb{P}\big(\|x_t\| > s\big) \leq C \exp(-c\, s), \quad \text{for all sufficiently large } s.$
Note that the probability bound on the right-hand side is summable over $t$ when choosing, say, $s = \kappa \log t$ for a large enough constant $\kappa$. This property enables a hypothesis test, via checking whether $\|x_t\| \leq \kappa \log t$, which is used below to devise an adaptive method. The norm in use can be arbitrary, since all norms are equivalent in a finite-dimensional vector space.
An Adaptive Method for Tuning $\epsilon$. Using the statistical hypothesis test, we now describe an algorithm that finds a value of $\epsilon$ under which the hypothesis test is satisfied with probability exponentially close to $1$, and which is strictly positive. Initially, set $\epsilon_1 = 1$. At each time $t$, we decide whether to adjust the value of $\epsilon_t$ or not by checking whether $\|x_t\| > \kappa \log t$. If yes, then we set $\epsilon_{t+1} = \epsilon_t / 2$; else we keep $\epsilon_{t+1} = \epsilon_t$.
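The adaptive schedule can be sketched by replaying a trajectory of state norms; the threshold $\kappa \log t$, the halving rule, and the initial value used here are assumptions chosen only to make the sketch concrete.

```python
import math

def adaptive_parameter_schedule(x_norms, kappa=10.0, eps0=0.5):
    """Replay a trajectory of state norms ||x_t|| and emit the accuracy
    parameter eps_t used at each time: eps is halved whenever the test
    ||x_t|| <= kappa * log(t) fails, and kept otherwise."""
    eps = eps0
    schedule = []
    for t, xn in enumerate(x_norms, start=2):   # start at t=2 so log(t) > 0
        if xn > kappa * math.log(t):            # hypothesis test violated
            eps /= 2.0                          # search for a smaller parameter
        schedule.append(eps)
    return schedule

# A trajectory that stays small never triggers a halving:
print(adaptive_parameter_schedule([1.0, 2.0, 3.0], kappa=10.0)[-1])  # prints 0.5
```

Since the tail bound makes test violations summable once $\epsilon$ is small enough, the halving eventually stops with probability 1, which is the content of Theorem 3 below.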
3.3.1 Stability Guarantees
The following theorem provides guarantees for the above adaptive algorithm.
Theorem 3.
Consider the setup of Theorem 1 (respectively Theorem 2). Let, in addition, Condition 4 hold. Consider the system operating with the choice of parameter $\epsilon_t$ at each time $t$ as per the above-described method, with the policy described in Section 3.1 (respectively Section 3.2). Then with probability 1, the system operating under such a changing choice of $\epsilon_t$ is either eventually operating with a fixed, strictly positive value of $\epsilon$ and hence stable, or
(12) $\lim_{t \to \infty} \frac{\|x_t\|}{t} = 0 \quad \text{almost surely;}$
moreover, we have
(13) $\mathbb{E}\big[\|x_t\|\big] = o(t).$
The theorem guarantees that the system is either stable, as the algorithm finds the appropriate policy parameter, or near-stable in the sense that the state grows sublinearly, as $o(t)$. That is, this adaptive algorithm induces at worst $o(t)$ regret, since the optimal policy retains a bounded state. This is, indeed, a low-regret algorithm in that respect.
Connecting to Fluid Stability. When viewed in the context of a queueing system, this result suggests that the queue sizes (which constitute the state of the system) scale as $o(t)$. As mentioned earlier, this is related to fluid stability. Precisely, consider the rescaling $\bar{x}^t(s) = x_{\lfloor st \rfloor}/t$ for $s \geq 0$. Then, for any fixed $s$, (12) suggests that $\bar{x}^t(s) \to 0$ as $t \to \infty$. Since the system trajectory induces Lipschitz sample paths, the limit points of $\bar{x}^t(\cdot)$ exist (due to compactness of an appropriately defined metric space) and they are all $0$. This implies fluid or rate stability in a queueing system: the "departure rate" of jobs is the same as the "arrival rate" of jobs, cf. [12, 13]. This is a highly desirable guarantee for queueing systems, and it is implied by our sample-efficient, online and adaptive RL policy.
4 Conclusion
In this paper, we investigate reinforcement learning for systems with unbounded state space, motivated by classical queueing systems. To tackle the challenges due to the unboundedness of the state space, we propose a natural notion of stability to quantify the "goodness" of a policy and, importantly, design efficient, adaptive algorithms that achieve such stability.
Stability in problems with unbounded state spaces is of central importance in classical queueing and control theory, yet this aspect has received relatively little attention in the existing reinforcement learning literature. As reinforcement learning becomes increasingly popular in various application domains, we believe that modeling and achieving stability is critical. This paper and the framework introduced provide some first steps in this direction.
References
[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.

[2] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 22–31. JMLR.org, 2017.
[3] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
[4] András Antos, Csaba Szepesvári, and Rémi Munos. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pages 9–16, 2008.
[5] Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
 [6] Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 491–496. IEEE, 2016.
[7] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
 [8] D.P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
[9] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 [10] Steven Chen, Kelsey Saulnier, Nikolay Atanasov, Daniel D Lee, Vijay Kumar, George J Pappas, and Manfred Morari. Approximating explicit model predictive control using constrained neural networks. In 2018 Annual American Control Conference (ACC), pages 1520–1527. IEEE, 2018.
[11] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
[12] Jim G Dai. On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid limit models. The Annals of Applied Probability, pages 49–77, 1995.

[13] Jim G Dai and Sean P Meyn. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Transactions on Automatic Control, 40(11):1889–1904, 1995.
[14] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
 [15] Sarah Dean, Stephen Tu, Nikolai Matni, and Benjamin Recht. Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), pages 5582–5588. IEEE, 2019.
[16] François Dufour and Tomás Prieto-Rumeau. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications, 388(2):1254–1267, 2012.
 [17] Atilla Eryilmaz and Rayadurgam Srikant. Asymptotically tight steadystate queue length bounds implied by drift conditions. Queueing Systems, 72(34):311–359, 2012.

[18] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
[20] Peter W Glynn, Assaf Zeevi, et al. Bounding stationary expectations of Markov processes. In Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz, pages 195–214. Institute of Mathematical Statistics, 2008.
[21] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502–525, 1982.
 [22] Shlomo Halfin and Ward Whitt. Heavytraffic limits for queues with many exponential servers. Operations research, 29(3):567–588, 1981.
 [23] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In ESANN, pages 143–148, 2008.
 [24] J Michael Harrison. The diffusion approximation for tandem queues in heavy traffic. Advances in Applied Probability, 10(4):886–905, 1978.
[25] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 [26] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for nearoptimal planning in large markov decision processes. Machine learning, 49(23):193–208, 2002.
 [27] Michael Kearns, Yishay Mansour, and Satinder Singh. Fast planning in stochastic games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 309–316, 2000.
 [28] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 996–1002, Cambridge, MA, USA, 1999. MIT Press.
 [29] Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE, 2018.
 [30] Subhashini Krishnasamy, Ari Arapostathis, Ramesh Johari, and Sanjay Shakkottai. On learning the cμ rule in single and parallel server networks. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 153–154. IEEE, 2018.
 [31] Harold Kushner. Heavy traffic analysis of controlled queueing and communication networks, volume 47. Springer Science & Business Media, 2013.
 [32] Bai Liu, Qiaomin Xie, and Eytan Modiano. Reinforcement learning for optimal control of queueing systems. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 663–670. IEEE, 2019.
 [33] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication, pages 270–288. ACM, 2019.
 [34] Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, and Mohammad Alizadeh. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264, 2018.
 [35] Jean-François Mertens, Ester Samuel-Cahn, and Shmuel Zamir. Necessary and sufficient conditions for recurrence and transience of Markov chains, in terms of inequalities. Journal of Applied Probability, 15(4):848–851, 1978.
 [36] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [37] Ciamac C Moallemi, Sunil Kumar, and Benjamin Van Roy. Approximate and data-driven dynamic programming for queueing networks. Submitted for publication, 2008.
 [38] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 [39] David C Parkes, Dimah Yanovsky, and Satinder P Singh. Approximately efficient online mechanism design. In Advances in Neural Information Processing Systems, pages 1049–1056, 2005.
 [40] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning–an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.
 [41] Theodore J Perkins and Andrew G Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.
 [42] Benjamin V Roy and Daniela D Farias. Approximate linear programming for average-cost dynamic programming. In Advances in Neural Information Processing Systems, pages 1619–1626, 2003.
 [43] Dorsa Sadigh and Ashish Kapoor. Safe control under uncertainty with probabilistic signal temporal logic. In Proceedings of Robotics: Science and Systems, 2016.
 [44] Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3115–3125. Curran Associates, Inc., 2018.
 [45] Devavrat Shah, Qiaomin Xie, and Zhi Xu. Non-asymptotic analysis of Monte Carlo tree search. In ACM SIGMETRICS, 2020.
 [46] R Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923, 2019.
 [47] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
 [48] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
 [49] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.
 [50] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
 [51] Julia Vinogradska, Bastian Bischoff, Duy Nguyen-Tuong, Anne Romer, Henner Schmidt, and Jan Peters. Stability of controllers for Gaussian process forward models. In International Conference on Machine Learning, pages 545–554, 2016.
 [52] Kim P Wabersich and Melanie N Zeilinger. Safe exploration of nonlinear dynamical systems: A predictive safety filter for reinforcement learning. arXiv preprint arXiv:1812.05506, 2018.
 [53] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 [54] Yuzhe Yang, Guo Zhang, Zhi Xu, and Dina Katabi. Harnessing structures for value-based planning and reinforcement learning. ICLR, 2020.
 [55] Ming Yu, Zhuoran Yang, Mladen Kolar, and Zhaoran Wang. Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 3121–3133, 2019.
Appendix A Related Work
The concept of stability introduced in this paper is related to the notion of safety in RL, but with crucial differences. Various definitions of safety exist in the literature [40, 19]. One line of work defines safety as hard constraints on individual states and/or actions [18, 23, 14, 15, 29]. Other work considers safety guarantees in terms of keeping the expected sum of certain costs over a trajectory below a given threshold [2, 11, 55]. In our work, stability is defined in terms of the positive recurrence of Markov chains, which cannot be immediately written as constraints or costs over states and actions. In particular, our stability notion captures the long-term behavior of the system: it should eventually stay in a desirable region of the state space with high probability. In general, there does not exist an action that immediately drives the system back to that region; learning a policy that achieves this in the long run is nontrivial and is precisely our goal.
Much of the work on RL safety is model-based, either requiring a known safe backup policy [18, 23] or using model predictive control approaches [52, 43, 5, 29]. One line of work focuses specifically on systems with a linear model and constraints (e.g., LQR) [10, 15]. Other work considers model-free policy search algorithms [2, 11, 55] under the framework of constrained Markov decision processes [3], which models safety as expected cumulative constraint costs.
Another line of work considers control-theoretic notions of stability [6, 51, 7], which bear similarity to our framework. We remark that these results mostly focus on systems with deterministic and partially unknown dynamics, in contrast to our setting where the dynamics are stochastic and unknown. Moreover, their approaches are limited to compact state spaces where discretization is feasible.
Our analysis makes use of Lyapunov functions, a classical tool in control and Markov chain theory for studying stability and steady-state behavior [20, 17]. The work [41] is among the first to use Lyapunov functions in RL, studying the closed-loop stability of an agent. More recent work uses Lyapunov functions to establish finite-time error bounds for TD learning [46], to solve constrained MDPs [11], and to find regions of attraction for deterministic systems [7, 6].
Our RL algorithm fits broadly into the class of value-based methods [53, 47, 36, 50, 54, 44]. Approximate dynamic programming techniques and RL have been applied to queueing problems in prior work [30, 42, 37], though their settings and goals are quite different from ours, and their approaches exploit prior knowledge of queueing theory and problem-specific structure. Most closely related to our work is the recent paper [32], which also considers problems with unbounded state spaces. Their algorithm makes use of a known stabilizing policy. We do not assume knowledge of such a policy; rather, our goal is to learn one from data.
Appendix B Proof of Theorem 1
For any given , we know that at each step, with probability ,
(14) 
with an appropriate choice of parameters as stated in Lemma 1. The stationary policy utilizes the Boltzmann transformation of . The following lemma establishes the approximation error between the Boltzmann policy and the optimal policy.
Lemma 3.
Given state let be such that with probability at least ,
Consider two Boltzmann policies with temperature
Then, we have that

With probability at least

With probability at least ,
From the above lemma, with notation for any , we obtain that for our stochastic policy , with probability ,
(15) 
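To make the Boltzmann (softmax) transformation concrete, the following is a minimal sketch; the Q-values and temperature shown are hypothetical and not taken from the paper. The policy assigns each action a probability proportional to the exponentiated Q-value divided by the temperature:

```python
import math

def boltzmann_policy(q_values, temperature):
    """Boltzmann (softmax) distribution over actions given Q-values.

    Subtracting the max Q-value before exponentiating avoids overflow
    and does not change the resulting distribution.
    """
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical Q-values for a two-action problem.
probs_hot = boltzmann_policy([1.0, 2.0], temperature=100.0)  # near-uniform
probs_cold = boltzmann_policy([1.0, 2.0], temperature=0.01)  # near-greedy
```

As the temperature decreases, the distribution concentrates on the argmax action, so the total variation distance to the greedy (optimal-with-respect-to-Q) policy shrinks; this is the role the temperature plays in Lemma 3.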
By Condition 2, we know that the MDP dynamics under the optimal policy respect a Lyapunov function with the bounded increment and drift properties. As per (15), the Boltzmann policy is a good approximation of the optimal policy at each time step with high probability. The following lemma argues that under such an approximate policy, the MDP respects the same Lyapunov function but with a slightly modified drift condition.
Lemma 4.
Consider the setup of Theorem 1. Suppose that at each time step , a stochastic policy (i.e., is a distribution over ) is executed such that for each , with probability at least ,
Then, for every such that , we have
Now, based on Lemma 4, we note that for every such that , the following drift inequality holds for our stochastic policy :
(16) 
We shall argue that under the choice of and as per (7), the right-hand side of (16) is less than . To do so, it is sufficient to argue that . That is, we want to argue that
(17) 
To establish the above claim, it is sufficient to argue that each of (I), (II), and (III) is no more than , under the choice of as per (7) and . To that end, (III) is less than immediately since . For (II), a similar claim follows since . For (I), using the facts that and for and , we have that (with )
Thus, we conclude that if , then
(18) 
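A drift condition of the form (18) can be checked exactly in small examples. The following sketch uses the two-queue system from the introduction with hypothetical arrival rates (0.2, 0.2) and service rates (0.6, 0.6), the linear Lyapunov function V(x) = x1 + x2, and a serve-the-longer-queue policy (all of these choices are illustrative assumptions, not the paper's construction); it computes the exact one-step expected drift by enumerating arrival and departure outcomes:

```python
import itertools

LAM = (0.2, 0.2)  # hypothetical Bernoulli arrival rates
MU = (0.6, 0.6)   # hypothetical service success probabilities

def V(x):
    """Linear Lyapunov function: total number of jobs in the system."""
    return x[0] + x[1]

def expected_drift(x):
    """Exact one-step drift E[V(X') - V(X) | X = x] under serve-the-longer-queue.

    Arrivals are applied first, then the served queue completes a job
    with probability MU[a] if it is nonempty.
    """
    a = 0 if x[0] >= x[1] else 1  # serve the longer queue (ties to queue 0)
    drift = 0.0
    for arr in itertools.product([0, 1], repeat=2):
        p_arr = 1.0
        for i in (0, 1):
            p_arr *= LAM[i] if arr[i] else 1 - LAM[i]
        for dep in (0, 1):
            p_dep = MU[a] if dep else 1 - MU[a]
            nxt = [x[0] + arr[0], x[1] + arr[1]]
            if dep and nxt[a] > 0:
                nxt[a] -= 1
            drift += p_arr * p_dep * (V(nxt) - V(x))
    return drift
```

For any state away from the origin the served queue is nonempty, so the drift equals the total arrival rate minus the service rate, 0.4 − 0.6 = −0.2 < 0: a strictly negative drift outside a bounded set, which is exactly the shape of condition (18).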
We recall the following result of [21], which implies the positive recurrence property for a stochastic system satisfying a drift condition as in (18).
Lemma 5.
Consider a policy . Suppose that there exists a Lyapunov function such that the policy satisfies the bounded increment condition with parameter and the drift condition with parameters and . Let . Let , and . Then it follows that for all ,
(19)  
(20) 
By immediate application of Lemma 5, with , and , it follows that
(21) 
and ,
(22) 
where is the return time to a set on which the Lyapunov function is bounded by , starting from time .
Define level set . By Condition 2, for any finite . Now (21) implies that for any small , we have
By letting and be large enough, we can make the above probability bound as small as desired. That is, for any given , there exists a large and a small , such that
In addition, (22) implies that
Therefore,
That is, given the current state , the return time to the bounded set is uniformly bounded across all . This establishes the stability of the policy as desired, completing the proof of Theorem 1.
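The return-time quantity bounded above can also be estimated by simulation. The following sketch reuses the hypothetical two-queue system (arrival rates (0.2, 0.2), service rates (0.6, 0.6), serve-the-longer-queue policy; all illustrative assumptions) and measures the number of steps until the Lyapunov function V(x) = x1 + x2 first drops into a level set:

```python
import random

LAM = (0.2, 0.2)  # hypothetical arrival rates
MU = (0.6, 0.6)   # hypothetical service rates

def step(x):
    """One time slot of the two-queue system under serve-the-longer-queue."""
    a = 0 if x[0] >= x[1] else 1
    nxt = [x[0] + (random.random() < LAM[0]),
           x[1] + (random.random() < LAM[1])]
    if nxt[a] > 0 and random.random() < MU[a]:
        nxt[a] -= 1
    return tuple(nxt)

def return_time(x0, level, horizon=100000):
    """Steps until V(x) = x1 + x2 first drops to `level` or below."""
    x, t = x0, 0
    while x[0] + x[1] > level and t < horizon:
        x = step(x)
        t += 1
    return t

random.seed(0)
times = [return_time((20, 20), level=5) for _ in range(200)]
avg = sum(times) / len(times)
```

With a per-step drift of about −0.2, the average time to descend from V = 40 to the level set {V ≤ 5} is roughly 35 / 0.2 = 175 slots, and the simulated average concentrates near that value; this is the finite-expected-return-time behavior that Lemma 5 guarantees.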
B.1 Proof of Lemma 3
We first bound the total variation distance between and . For each , we have