I Introduction
Markov decision processes (MDPs) are a classical formalization of sequential decision making in discrete-time stochastic control processes [sutton2018reinforcement]. In MDPs, the outcomes of actions are uncertain and influence not only immediate rewards, but also future rewards through subsequent states. Policies, which are strategies for action selection, should therefore strike a tradeoff between immediate and delayed rewards, and be optimal in some sense. MDPs have gained recognition in diverse fields such as operations research, economics, engineering, wireless networks, artificial intelligence, and learning systems
[kallenberg2011markov]. Moreover, an MDP is a mathematically idealized form of the reinforcement learning problem, an active research field within the machine learning community. In many situations, however, finding optimal policies with respect to a single reward function does not suffice to fully describe sequential decision making in problems with multiple conflicting objectives
[mannor2004geometric, van2014multi]. The framework of constrained MDPs (CMDPs) is the natural approach for handling multi-objective decision making under uncertainty [altman1999constrained]. Algorithmic methods for solving CMDPs have been studied extensively both when the underlying transition probability function is known
[altman1999constrained, dolgov2005stationary, zadorojniy2006robustness, piunovskiy2000constrained, chen2004dynamic, piunovskiy2006dynamic, chen2007non], and when it is unknown [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self, bhatnagar2010actor, bhatnagar2012online, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, prashanth2014actor, paternain2019safe, ding2020provably, raybenchmarking, khairy2020constrained]. In the case of finite state-action spaces with a known transition probability function, the solution of a CMDP can be obtained by solving finite linear programs [altman1999constrained, zadorojniy2006robustness, dolgov2005stationary], or by deriving a Bellman optimality equation with respect to an augmented MDP whose state vector consists of two parts: the first part is the state of the original MDP, while the second part keeps track of the cumulative constraint cost
[piunovskiy2000constrained, piunovskiy2006dynamic, chen2004dynamic, chen2007non]. Linear programming methods become computationally impractical at a much smaller number of states than dynamic programming, by an estimated factor of about 100
[sutton2018reinforcement]. In practice, MDP-specific algorithms, which are dynamic programming based methods, hold more promise for efficient solution [littman2013complexity]. While MDP augmentation based methods provide a theoretical framework for applying dynamic programming to constrained stochastic control, they introduce continuous variables into the state space, which rules out practical tabular methods and makes the design of a solution algorithm challenging [chow2015risk]. On the other hand, solution methods for the constrained reinforcement learning problem, i.e., CMDPs with an unknown transition probability function, are generally based on Lagrangian primal-dual optimization. In these methods, gradient ascent is performed on state values at a fast timescale to find the optimal value function for a given set of Lagrange multipliers, while gradient descent is performed on the Lagrange multipliers at a slower timescale. This process is repeated until convergence to a saddle point. Existing works have explored the primal-dual optimization approach in the tabular setting, i.e., without function approximation [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self]
, and with function approximators such as deep neural networks
[bhatnagar2010actor, bhatnagar2012online, prashanth2014actor, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, raybenchmarking, ding2020provably, paternain2019safe, khairy2020constrained]. While this approach is appealing in its simplicity, it suffers from the typical problems of gradient optimization on nonsmooth objectives: sensitivity to the initialization of the Lagrange multipliers, and dependence of convergence on the learning rate sequence [chow2017risk, achiam2017constrained, liu2019ipo, liang2018accelerated]. If the learning rate is too small, the Lagrange multipliers will not update quickly enough to enforce the constraint; if it is too large, the algorithm may oscillate around the optimal solution. In practice, a sequence of decreasing learning rates should be adopted to guarantee convergence [borkar2005actor], yet there is no obvious method to determine the optimal sequence, nor can we assess the quality of a solution in cases where the objective is not differentiable at the optimal solution.
In this brief, we develop a new approach to solving finite CMDPs with discrete state-action spaces and a known probability transition function. We first prove that the optimization objective in the dual linear program of a finite CMDP is a piecewise linear convex (PWLC) function with respect to the Lagrange penalty multipliers. Next, we treat the dual linear program of a finite CMDP as a search problem over the Lagrange penalty multipliers, and propose a novel two-level Gradient-Aware Search (GAS) algorithm, which exploits the PWLC structure to find the optimal state-value function and Lagrange penalty multipliers of a CMDP. We empirically compare the convergence performance of the proposed GAS algorithm with binary search (BS), Lagrangian primal-dual optimization (PDO), and linear programming (LP) in two application domains: robot navigation in a grid world, and wireless network management.
Compared with the benchmark algorithms, we show that the proposed GAS algorithm converges to the optimal solution faster, does not require hyperparameter tuning, and is not sensitive to the initialization of the Lagrange penalty multiplier.
The remainder of this paper is organized as follows. A background on unconstrained and constrained MDPs is given in Section II. Our Gradient-Aware Search (GAS) algorithm is proposed in Section III. Performance evaluation of GAS in two application domains is presented in Section IV, followed by concluding remarks and future work in Section V.
II Background and Related Works
II-A Unconstrained Markov Decision Processes
An infinite-horizon Markov decision process (MDP) with discounted returns is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mu, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are finite sets of states and actions, respectively, $\mathcal{P}$ is the model's state-action-state transition probability function, $\mu$ is the initial distribution over the states, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function which maps every state-action pair to a real number, and $\gamma \in [0,1)$ is the discount factor. Denote the transition probability from state $s$ to state $s'$ if action $a$ is chosen by $\mathcal{P}(s'|s,a)$. The transition probability from state $s$ to state $s'$ under the adopted policy $\pi$ is therefore $\mathcal{P}^{\pi}(s'|s) = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{P}(s'|s,a)$. The state-value function $V^{\pi}(s)$ of state $s$ under policy $\pi$ is the expected discounted return starting in state $s$ and following $\pi$ thereafter,

$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right].$  (1)
Let $V^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ be the vector of state values. The solution of an MDP is a Markov stationary policy $\pi^{*}$ which maximizes the inner product $\langle \mu, V^{\pi} \rangle$, i.e.,

$\pi^{*} = \arg\max_{\pi} \sum_{s} \mu(s) V^{\pi}(s).$  (2)
There exist several methods to solve (2), including linear programming [kallenberg2011markov] and dynamic programming methods such as value iteration and policy iteration [sutton2018reinforcement]. Based on the linear programming formulation, the optimal value function $V^{*}$ can be obtained by solving the following primal linear program [kallenberg2011markov],
$\min_{V} \sum_{s} \mu(s) V(s) \quad \text{s.t.} \quad V(s) \ge r(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a) V(s'), \quad \forall s \in \mathcal{S},\, a \in \mathcal{A}.$  (3)
Linear program (3) has $|\mathcal{S}|$ variables and $|\mathcal{S}||\mathcal{A}|$ constraints, which becomes computationally impractical to solve for MDPs with large state-action spaces. In contrast, dynamic programming is comparatively better suited to handling large state spaces than linear programming, and is widely considered the only feasible way of solving (2) for large MDPs [sutton2018reinforcement]. Of particular interest is the value iteration algorithm, which generates an optimal policy in polynomial time for a fixed $\gamma$ [littman2013complexity]. In the value iteration algorithm, the state-value function is iteratively updated based on Bellman's principle of optimality,
$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[ r(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a) V_{k}(s') \right].$  (4)
It has been shown that the nonlinear map in (4) is a monotone contraction mapping that admits a unique fixed point, which is the optimal value function [kallenberg2011markov]. Once the optimal value function is obtained, the optimal policy can be derived using one-step lookahead: the optimal action at state $s$ is the one that attains the equality in (4), with ties broken arbitrarily.
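To make the backup in (4) concrete, here is a minimal value iteration sketch for a small tabular MDP; the transition tensor, reward matrix, and the two-state example are hypothetical and not taken from the paper.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Value iteration for a finite MDP.

    P: transition probabilities, shape (S, A, S); r: rewards, shape (S, A).
    Returns the optimal value function and a greedy (one-step lookahead) policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = r + gamma * (P @ V)            # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical two-state example: both actions in state 0 stay in state 0,
# but action 1 earns reward 1; state 1 is absorbing with zero reward.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 0.0]])
V, pi = value_iteration(P, r, gamma=0.9)   # V[0] ≈ 1 / (1 - 0.9) = 10
```

The `tol` stopping rule mirrors the fixed-point argument above: because the Bellman map is a $\gamma$-contraction, successive iterates converge to the unique optimal value function.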
II-B Constrained Markov Decision Processes
In constrained MDPs (CMDPs), an additional immediate cost function $c(s,a)$ is introduced, such that a CMDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mu, r, c, \gamma)$ [altman1999constrained]. The state-value function $V^{\pi}(s)$ is defined as in (1). In addition, the infinite-horizon discounted cost of a state $s$ under policy $\pi$ is defined as,

$C^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t) \,\middle|\, s_0 = s \right].$  (5)
Let $C^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ be the vector of state costs, and let $E$ be a given constant which represents the constraint upper bound. The solution of a CMDP is a Markov stationary policy $\pi^{*}$ which maximizes $\langle \mu, V^{\pi} \rangle$ subject to the constraint $\langle \mu, C^{\pi} \rangle \le E$,

$\pi^{*} = \arg\max_{\pi} \sum_{s} \mu(s) V^{\pi}(s) \quad \text{s.t.} \quad \sum_{s} \mu(s) C^{\pi}(s) \le E.$  (6)
The canonical solution methodology for (6) is based on convex linear programming [altman1999constrained]. Of interest is the following dual linear program (the primal linear program is defined over a convex set of occupancy measures, and by Lagrangian strong duality (6) can be recovered; interested readers can find more details about this formulation in Chapter 3 of [altman1999constrained]), which can be solved for the optimal state-value function $V^{*}_{\lambda^{*}}$ and Lagrange penalty multiplier $\lambda^{*}$,

$\min_{\lambda \ge 0} \left[ \sum_{s} \mu(s) V^{*}_{\lambda}(s) + \lambda E \right],$  (7)

where $V^{*}_{\lambda}$ is the optimal value function of the unconstrained MDP with penalized reward $r - \lambda c$.
Note that for a fixed Lagrange penalty multiplier $\lambda$, (7) is analogous to (3), and can be solved using dynamic programming and Bellman's optimality equation [altman1999constrained],

$V_{\lambda}(s) = \max_{a \in \mathcal{A}} \left[ r(s,a) - \lambda c(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a) V_{\lambda}(s') \right].$  (8)
This formulation has led to the development of multi-timescale stochastic-approximation-based Lagrangian primal-dual learning algorithms to solve CMDPs with and without value function approximation [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self, bhatnagar2010actor, bhatnagar2012online, prashanth2014actor, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, raybenchmarking, ding2020provably, paternain2019safe, khairy2020constrained]. In such algorithms, $\lambda$ is updated by gradient descent at a slow timescale, while the state-value function is optimized at a faster timescale using dynamic programming (model-based) or gradient ascent (model-free), until an optimal saddle point is reached. These approaches, however, suffer from the typical problems of gradient optimization on nonsmooth objectives: sensitivity to the initialization of $\lambda$, and dependence of convergence on the learning rate sequence used to update $\lambda$ [chow2017risk, achiam2017constrained, liu2019ipo, liang2018accelerated]. As we will show in the next section, the optimization objective in (7) is piecewise linear convex with respect to $\lambda$, and so it is not differentiable at the cusps. This means that we cannot assess the optimality of a solution, because $\lambda^{*}$ usually does not have a zero gradient. It also means that convergence to $\lambda^{*}$ depends on specifying an optimal learning rate sequence, which is itself a very challenging task. Unlike existing methods, our proposed algorithm explicitly tackles these problems by exploiting the problem structure, and so it does not require hyperparameter tuning or specifying a learning rate sequence.
III A Novel CMDP Framework
In this section, we describe our novel methodology for solving finite CMDPs with discrete state-action spaces and a known probability transition function. For ease of notation, we consider the case of a single constraint; our proposed algorithm can be readily extended to the multi-constraint case by successively iterating the method described hereafter over each Lagrange penalty multiplier, one at a time.
Our proposed method works as an iterative two-level optimization scheme. For a given Lagrange penalty multiplier $\lambda$, an unconstrained MDP with a penalized reward function $r - \lambda c$ is specified, and the corresponding optimal value function $V^{*}_{\lambda}$ is found using dynamic programming (8). Denote the optimization objective in (7) as a function of $\lambda$ by $L(\lambda)$, i.e., $L(\lambda) = \sum_{s} \mu(s) V^{*}_{\lambda}(s) + \lambda E$. Thus, to evaluate $L(\lambda)$, one has to solve for $V^{*}_{\lambda}$ first. How can we efficiently search for the optimal $\lambda^{*}$? To answer this question, we first prove that $L(\lambda)$ is a piecewise linear convex function of $\lambda$, and then design an efficient solution algorithm.
Definition 1
Let $\boldsymbol{a} = (a_1, \ldots, a_k) \in \mathbb{R}^{k}$ and $\boldsymbol{b} = (b_1, \ldots, b_k) \in \mathbb{R}^{k}$ for some positive integer $k$. A piecewise linear (affine is more accurate but less common) convex function (PWLC) is given by $f(\lambda) = \max_{i \in \{1, \ldots, k\}} \left( a_i \lambda + b_i \right)$.
Theorem 1
Given that $V^{*}_{\lambda}$ is the optimal value function of an unconstrained MDP with penalized reward function $r - \lambda c$, $\mu(s)$ is the probability that the initial state is $s$, and $E$ is a constant representing the cost constraint upper bound in the original CMDP, $L(\lambda) = \sum_{s} \mu(s) V^{*}_{\lambda}(s) + \lambda E$ is a PWLC function with respect to $\lambda$.
For ease of exposition, we introduce the following three properties of PWLC functions and show how they can be used to construct the proof.

1) The pointwise maximum $h(\lambda) = \max(f(\lambda), g(\lambda))$ of two PWLC functions $f(\lambda) = \max_i (a_i \lambda + b_i)$ and $g(\lambda) = \max_j (a'_j \lambda + b'_j)$ is PWLC. Notice that for the pointwise maximum, $h(\lambda) = \max\big( \max_i (a_i \lambda + b_i), \max_j (a'_j \lambda + b'_j) \big)$. Hence, $h(\lambda)$ is a maximum over the union of the affine pieces of $f$ and $g$, which is PWLC.

2) The weighted sum $h(\lambda) = \alpha f(\lambda) + \beta g(\lambda)$ of two PWLC functions is PWLC for any $\alpha, \beta \ge 0$. Notice that $h(\lambda) = \max_i \alpha (a_i \lambda + b_i) + \max_j \beta (a'_j \lambda + b'_j)$. Hence, $h(\lambda) = \max_{i,j} \big[ (\alpha a_i + \beta a'_j) \lambda + (\alpha b_i + \beta b'_j) \big]$, which is PWLC.

3) The sum $h(\lambda) = f(\lambda) + c \lambda + d$ of a PWLC function and an affine function is PWLC. Notice that $h(\lambda) = \max_i \big[ (a_i + c) \lambda + (b_i + d) \big]$, which is PWLC.
Based on these properties, the proof proceeds as follows. Given $\lambda$, the recursive formula in (8) is expanded by iteratively substituting $V_{\lambda}(s')$ with the right-hand side of (8). This expansion allows us to deduce that $V^{*}_{\lambda}(s)$ has the functional form
$V^{*}_{\lambda}(s) = \max_i \left( a_i - \lambda b_i \right)$
for some constants $a_i, b_i$ which come from discounted rewards, costs, and their products with the transition probability function. Based on successive application of properties 1 and 2, $V^{*}_{\lambda}(s)$ is a PWLC function with respect to $\lambda$ for every $s$. Also, $\sum_{s} \mu(s) V^{*}_{\lambda}(s)$ is PWLC with respect to $\lambda$ based on property 2. Finally, $L(\lambda) = \sum_{s} \mu(s) V^{*}_{\lambda}(s) + \lambda E$ is a PWLC function with respect to $\lambda$ based on property 3.
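The PWLC structure established by Theorem 1 can be checked numerically on a toy example. The sketch below evaluates a function of the form in Definition 1 as a pointwise maximum of affine pieces and verifies convexity on a grid; the coefficients are made up for illustration.

```python
import numpy as np

# Hypothetical affine pieces a_i * lam + b_i of a PWLC function
# f(lam) = max_i (a_i * lam + b_i), cf. Definition 1.
a = np.array([-2.0, 0.5, 3.0])
b = np.array([4.0, 1.0, -6.0])

def f(lam):
    return float(np.max(a * lam + b))

# Convexity check on a grid: the midpoint value never exceeds the chord.
grid = np.linspace(-5.0, 5.0, 101)
for x, y in zip(grid[:-1], grid[1:]):
    assert f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12
```

The cusps of such a function are the intersections of adjacent affine pieces, which is exactly where the minimum of $L(\lambda)$ typically sits.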
To efficiently find $\lambda^{*}$ and solve (7), we propose a novel Gradient-Aware Search (GAS) algorithm which exploits the piecewise linear convex structure. Our proposed GAS algorithm, which is outlined in Algorithm 1, operates over two loops. For a given Lagrange multiplier (line 6) at an outer loop iteration, the inner loop finds the optimal state-value function, using value iteration, as well as the gradient components of $L(\lambda)$ with respect to $\lambda$ (lines 7-16). Based on this information, the outer loop exploits the PWLC structure to calculate the next $\lambda$ (line 17, lines 20-25). This process continues until a termination criterion is met (lines 18-19). In what follows, the proposed GAS algorithm is described in detail.
Inner loop: The value iterations of the inner loop are subscripted with the index $k$. Let $Q(s,a)$ denote the state-action value function, which is the expected discounted return starting in state $s$, taking action $a$, and following the greedy policy thereafter, for a fixed $\lambda$. Note that we drop the subscript $\lambda$ to avoid notational clutter, but the dependence on $\lambda$ remains. The state-action value function at each step can be represented as a linear function of $\lambda$,

$Q(s,a) = Q^{r}(s,a) - \lambda Q^{c}(s,a).$  (9)
In this representation, accumulated discounted rewards contribute to $Q^{r}(s,a)$ and accumulated discounted costs contribute to $Q^{c}(s,a)$ (line 12). Given $Q^{r}$ and $Q^{c}$, the value function at state $s$ can be obtained by $V(s) = Q(s, a^{*}(s))$ (line 10), where $a^{*}(s)$ denotes the action which attains the maximum state-action value among the set of actions at state $s$ (line 11). Notice that in the inner loop, two functions are estimated using value iteration: the total state-action value function $Q(s,a)$ (line 8), and the state-action cost function $Q^{c}(s,a)$ (line 12). Dynamic programming steps in the inner loop are performed until the mean relative absolute error in the estimation of the state-value function is less than a preset accuracy threshold (lines 14-16). When the inner loop concludes, the optimal $V^{*}_{\lambda}$ and $Q^{c}$ are returned to the outer loop.
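A minimal sketch of the inner loop under the representation in (9): value iteration on the $\lambda$-penalized reward while tracking the discounted cost of the greedy policy, which the outer loop needs for the gradient in (11). The single-state CMDP at the bottom is hypothetical.

```python
import numpy as np

def inner_loop(P, r, c, lam, gamma, tol=1e-8):
    """Value iteration on the penalized reward r - lam * c.

    Alongside the value function V, the discounted cost Vc of the greedy
    policy is tracked, so the outer loop can form the gradient in (11).
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    Vc = np.zeros(S)
    while True:
        Q = (r - lam * c) + gamma * (P @ V)   # penalized Q, cf. (9)
        a_star = Q.argmax(axis=1)
        V_new = Q[np.arange(S), a_star]
        Qc = c + gamma * (P @ Vc)             # cost-to-go of greedy actions
        Vc = Qc[np.arange(S), a_star]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Vc
        V = V_new

# Hypothetical single-state CMDP with a self-loop: action 0 is risky
# (reward 1, cost 1), action 1 is safe (reward 0, cost 0).
P = np.ones((1, 2, 1))
r = np.array([[1.0, 0.0]])
c = np.array([[1.0, 0.0]])
V0, C0 = inner_loop(P, r, c, lam=0.0, gamma=0.9)   # risky action chosen
V2, C2 = inner_loop(P, r, c, lam=2.0, gamma=0.9)   # safe action chosen
```

Given the initial distribution $\mu$ and the bound $E$, the gradient in (11) is then $E - \sum_{s} \mu(s)\,\mathrm{Vc}(s)$.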
Outer loop: Given $V^{*}_{\lambda}$ and $Q^{c}$ from the inner loop, $L(\lambda)$ and the gradient of $L(\lambda)$ with respect to $\lambda$ can be obtained (line 17),

$L(\lambda) = \sum_{s} \mu(s) V^{*}_{\lambda}(s) + \lambda E,$  (10)

and,

$\nabla L(\lambda) = E - \sum_{s} \mu(s) Q^{c}(s, a^{*}(s)).$  (11)
When $\nabla L(\lambda)$ is negative, i.e., $\sum_{s} \mu(s) Q^{c}(s, a^{*}(s)) > E$, the current policy is infeasible, which means $\lambda$ should be increased. On the other hand, when $\nabla L(\lambda)$ is nonnegative, i.e., $\sum_{s} \mu(s) Q^{c}(s, a^{*}(s)) \le E$, the current policy is feasible, yet we can potentially obtain another feasible policy with larger returns by decreasing $\lambda$. Notice that if $\lambda$ is a cusp which lies at the intersection of two linear segments of the PWLC function $L(\lambda)$, the derivative does not exist in theory, although the two one-sided derivatives (left and right) are well defined. The nonexistence of a derivative at cusps does not pose a challenge to our proposed algorithm, because computations are accurate only up to the machine's floating point precision. To clarify, suppose $\hat{\lambda}$ is a cusp on the curve of $L(\lambda)$. In practice, we can only compute the gradient (11) at $\hat{\lambda} \pm \epsilon_{m}$, where $\epsilon_{m}$ is a very small precision loss due to rounding in floating point arithmetic, on the order of $10^{-16}$ for double precision as defined by the IEEE 754-2008 standard. This means that computing the gradient at a point is essentially computing a one-sided derivative, which always exists. It is worth mentioning that our proposed algorithm does not perform gradient optimization (which suffers on nonsmooth objectives), but rather exploits the structure of the PWLC objective to find the optimal $\lambda^{*}$. The algorithm always retains two values of $\lambda$, $(\lambda^{-}, \lambda^{+})$, where $\nabla L(\lambda^{-}) < 0$ and $\nabla L(\lambda^{+}) \ge 0$. $\lambda^{-}$ along with $\nabla L(\lambda^{-})$ defines a line $\ell_{1}$, while $\lambda^{+}$ along with $\nabla L(\lambda^{+})$ defines another line $\ell_{2}$. $\ell_{1}$ and $\ell_{2}$ intersect at a point $\lambda_{3}$, which is passed to the inner loop. There are two possible outcomes when running the inner loop with $\lambda_{3}$:

1) $\nabla L(\lambda_{3}) < 0$: in this case $\lambda_{3} < \lambda^{*}$, because $L(\lambda)$ is PWLC with respect to $\lambda$. The algorithm retains $\lambda_{3}$ and $\lambda^{+}$ by setting $\lambda^{-} = \lambda_{3}$ and $\nabla L(\lambda^{-}) = \nabla L(\lambda_{3})$ (lines 22-23).

2) $\nabla L(\lambda_{3}) \ge 0$: in this case $\lambda_{3} \ge \lambda^{*}$, because $L(\lambda)$ is PWLC with respect to $\lambda$. The algorithm retains $\lambda^{-}$ and $\lambda_{3}$ by setting $\lambda^{+} = \lambda_{3}$ and $\nabla L(\lambda^{+}) = \nabla L(\lambda_{3})$ (lines 24-25).
The algorithm terminates when the absolute error between $L_{\min}$, the minimum objective value found so far, and $L(\lambda_{3})$ is less than an arbitrarily small positive number $\epsilon$ (lines 18-19). Note that the initial $\lambda^{+}$ is set to an arbitrarily large number which attains $\nabla L(\lambda^{+}) \ge 0$. If no such $\lambda^{+}$ can be found, then (7) is not feasible, i.e., $L(\lambda)$ is not lower bounded and can be made arbitrarily small by increasing $\lambda$. Figure 1 shows a visual illustration of how Algorithm 1 works in practice. It is worth mentioning that Algorithm 1 only evaluates $L(\lambda)$ at the points indicated by a green circle.
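The outer loop can be sketched generically against any oracle that returns $L(\lambda)$ and a (one-sided) gradient: the next candidate $\lambda_{3}$ is the intersection of the tangent lines at $\lambda^{-}$ and $\lambda^{+}$. The PWLC test objective below is made up for illustration and stands in for the inner loop.

```python
import numpy as np

def gas(oracle, lam_neg, lam_pos, eps=1e-9, max_iter=100):
    """Gradient-Aware Search on a PWLC objective.

    oracle(lam) -> (L(lam), grad); lam_neg / lam_pos must satisfy
    grad(lam_neg) < 0 <= grad(lam_pos), bracketing the minimum.
    """
    L_neg, g_neg = oracle(lam_neg)
    L_pos, g_pos = oracle(lam_pos)
    best = min(L_neg, L_pos)
    lam3, L3 = lam_pos, L_pos
    for _ in range(max_iter):
        # Intersection of the tangent lines at lam_neg and lam_pos
        lam3 = (L_pos - L_neg + g_neg * lam_neg - g_pos * lam_pos) / (g_neg - g_pos)
        L3, g3 = oracle(lam3)
        if abs(best - L3) < eps:       # termination test on the objective
            break
        best = min(best, L3)
        if g3 < 0:
            lam_neg, L_neg, g_neg = lam3, L3, g3
        else:
            lam_pos, L_pos, g_pos = lam3, L3, g3
    return lam3, L3

# Toy PWLC objective max(-2x + 4, 0.5x + 1, 3x - 6); minimum at x = 1.2.
a = np.array([-2.0, 0.5, 3.0])
b = np.array([4.0, 1.0, -6.0])

def oracle(x):
    i = int(np.argmax(a * x + b))
    return a[i] * x + b[i], a[i]

lam_star, L_star = gas(oracle, -5.0, 5.0)
```

On this toy objective the search brackets the cusp at $\lambda = 1.2$ within a few oracle calls, since each tangent-line intersection lands on the boundary of a new affine segment.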
IV Case Studies
In this section, we evaluate the efficacy of the proposed GAS algorithm in two different domains: robot navigation in a grid world, and solar-powered UAV-based wireless network management. In all experiments, the discount factor and the accuracy thresholds are held fixed unless otherwise specified.
IV-A Grid World Robot Navigation
In the grid world robot navigation problem, the agent controls the movement of a robot in a rectangular grid world, where states represent grid points on a 2D terrain map. The robot starts in a safe region in the bottom-right corner (blue square), and is required to travel to the goal located in the top-right corner (green square), as can be seen from Figure 3. At each time step, the agent can move in one of four directions to a neighbouring state. However, the movement of the robot is stochastic and only partially depends on the chosen direction: with some fixed probability the robot moves in the chosen direction, and otherwise it moves uniformly at random to one of the four neighboring states. At each time step, the agent receives a small negative reward to account for fuel usage. Upon reaching the goal, the agent receives a large positive reward. In between the starting and destination states there are a number of obstacles (red squares) that the agent should avoid, and hitting an obstacle incurs a cost. It is important to note that the 2D terrain map is built such that a shorter path induces a higher risk of hitting obstacles. By maximizing the long-term expected discounted reward subject to a constraint on the long-term expected discounted cost, the agent finds the shortest path from the starting state to the destination state such that the toll for hitting obstacles does not exceed an upper bound. This problem is inspired by classic grid world problems in the literature [tessler2018reward, chow2015risk].
IV-A1 Experiment Results
We choose a rectangular grid world with the start state in the bottom-right corner, the destination in the top-right corner, and a number of deployed obstacles. We follow [chow2015risk] in the choice of parameters, which trades off a high penalty for obstacle collisions against computational complexity. $L(\lambda)$ is plotted in Figure 2(a) over a range of $\lambda$. It can be seen from Figure 2(a) that $L(\lambda)$ is a PWLC function, as given by Theorem 1. It can also be seen that the global minimum is a nondifferentiable cusp with unequal one-sided derivatives.
In Figure 2(b), we compare the convergence performance of the proposed GAS with binary search (BS) by plotting the number of outer loop iterations versus the accuracy level $\epsilon$. The initial search window for $\lambda$ is problem specific; the ability to choose an arbitrarily high value for the initial $\lambda^{+}$ without negatively impacting convergence time is desirable, because it demonstrates the algorithm's capability to make adaptively large strides in the search space towards the optimal solution. Given a value for $\lambda$, the inner loop evaluates (10) and its gradient (11). The outer loop then determines the next $\lambda$, based either on the proposed GAS or on BS. This iterative procedure continues until the convergence criterion is met. Compared to BS, the proposed GAS algorithm requires a smaller number of outer loop iterations at all accuracy thresholds $\epsilon$. This is because GAS adaptively shrinks the search window by evaluating points at the intersection of line segments with nonnegative and negative gradients, whereas BS blindly halves the window size every iteration. These results demonstrate that $\epsilon$ can be set arbitrarily small, and the initial search window arbitrarily large, without substantially impacting the convergence time.

In Figure 2(c), we compare the convergence performance of the proposed GAS with Lagrangian primal-dual optimization (Lagrangian PDO) by plotting the total number of value iterations, i.e., the sum of the number of value iterations over all outer loop iterations, versus the learning rate decay parameter on a log-log scale. In Lagrangian PDO, gradient descent on $\lambda$ is performed alongside dynamic programming iterations on the state-value function, which motivates us to compare convergence performance in terms of the total number of value iterations. In Lagrangian PDO, the update rule for $\lambda$ is $\lambda \leftarrow [\lambda - \alpha \nabla L(\lambda)]^{+}$, where $\alpha$ is the step size parameter. In our experiments, $\alpha$ is initially set to a large value to speed up the descent towards the minimum, and it is decayed by a multiplicative factor each time $\nabla L(\lambda)$ changes sign. While this choice of learning rate decay does not necessarily produce an optimal sequence, it serves to expose the sensitivity of gradient-based optimization to the initial $\alpha$ and the decay parameter. For every value of the decay parameter, $\lambda$ is uniformly randomly initialized a number of times, and the mean total number of value iterations is plotted in Figure 2(c).
It can be seen from Figure 2(c) that the convergence of the Lagrangian PDO method is sensitive to the initialization of $\lambda$, which can be inferred from the initial window size, and to the decay rate parameter. For some decay parameters, Lagrangian PDO converges faster than GAS, but for others it lags behind. In practice, the decay parameter is a hyperparameter which requires tuning, consuming extra computational resources. In contrast, the proposed GAS algorithm works out of the box, without the need to specify a learning rate sequence or tune any hyperparameters.
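For contrast, here is a sketch of the baseline Lagrangian PDO update with the sign-flip decay rule described above; the PWLC toy objective and all step size constants are made up. On nonsmooth objectives the iterate oscillates around the cusp, and its progress is governed entirely by the decay schedule, which is exactly the sensitivity the experiment exposes.

```python
import numpy as np

a = np.array([-2.0, 0.5, 3.0])       # toy PWLC objective max_i(a_i x + b_i),
b = np.array([4.0, 1.0, -6.0])       # minimized at x = 1.2

def grad(x):
    # Subgradient: slope of the active affine piece.
    return a[int(np.argmax(a * x + b))]

def pdo(lam0, alpha0=10.0, beta=0.5, iters=200):
    """Projected subgradient descent on the multiplier: the step size is
    multiplied by beta each time the gradient changes sign."""
    lam, alpha, prev = lam0, alpha0, 0.0
    for _ in range(iters):
        g = grad(lam)
        if prev * g < 0:                  # sign flip -> decay the step size
            alpha *= beta
        prev = g
        lam = max(0.0, lam - alpha * g)   # projection onto lam >= 0
    return lam

lam_pdo = pdo(0.0)
```

Changing `alpha0` or `beta` changes how long the iterate bounces between the two slopes flanking the cusp, whereas GAS jumps directly to tangent-line intersections.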
In Table I, we compare the results obtained by the proposed GAS algorithm at convergence with those obtained by solving (7) using linear programming. We use the LP solver provided by Gurobi, which is arguably among the most powerful mathematical optimization solvers on the market. The optimal $\lambda^{*}$, the minimum, mean, and maximum Bellman errors, and the objective value are reported, where the Bellman error at a state is the absolute difference between the two sides of (8). It can be observed that Gurobi's LP solver converges to a suboptimal solution with a higher objective value compared with the proposed GAS algorithm (recall that (7) is a minimization problem, hence lower is better). This demonstrates that generic LP solvers may struggle to solve CMDPs, which has motivated the research community to develop CMDP-specific algorithms.

TABLE I
                        Grid World                  Solar-Powered UAV-Based
                        Robot Navigation            Wireless Network
                        LP (Gurobi)    GAS          LP (Gurobi)    GAS
$\lambda^{*}$           68.33333       90.08102     0.030786       0.030786
Min Bellman error       0.0            3.11e-07     0.0            9.32e-09
Mean Bellman error      9.89e-10       3.11e-07     0.001788       9.32e-09
Max Bellman error       1.13e-07       3.11e-07     1.160397       9.33e-09
Objective value         15216.83       15184.57     93.92830       93.92830
In Figure 3, filled contour maps of the value function are plotted; regions with the same value share the same color shade. The value function is plotted for four different values of the cost constraint upper bound, along with the start state (blue square), destination (green square), and obstacles (red squares). With a tight cost constraint, a safer policy is learned, in which the agent takes a longer path around the obstacles as in Figure 3(d), successfully reaching the destination in a higher fraction of evaluation rollouts. When the cost constraint is loose, a riskier policy is learned, in which the agent takes a shorter path by navigating between the obstacles as in Figure 3(a), reaching the destination successfully in a lower fraction of rollouts.
IV-B Solar-Powered UAV-Based Wireless Networks
As a second application domain, we study a wireless network management problem of interest to the wireless networking research community. We consider a solar-powered UAV-based wireless network consisting of one UAV which relays data from network servers to wireless devices. Wireless devices are independently and uniformly deployed within a circular geographical area. The UAV is deployed at the center of the area, and is equipped with solar panels which harvest solar energy to replenish its on-board battery storage. Because solar light is attenuated by the cloud cover, solar energy harvesting is highest at higher altitudes. System time is slotted into fixed-length discrete time units indexed by $t$. During a time slot, the solar-powered UAV provides downlink wireless coverage to the devices on the ground. The UAV is equipped with an intelligent controller which, at each time slot, controls the UAV's altitude, its directional antenna half-power beamwidth angle, and its transmission power, based on its battery energy state and altitude, to maximize the wireless coverage probability for the worst-case edge user while ensuring energy sustainability of the solar-powered UAV. Note that the worst-case edge user is the user farthest from the UAV. Due to distance-dependent free-space path loss, the received signal at the edge user from the UAV is heavily attenuated, resulting in a low signal-to-noise ratio (SNR) and a decreased coverage probability. Thus, by maximizing the coverage probability of the edge user, the communication performance is improved for every user in the network, albeit not proportionally. This wireless network control problem can be modeled as follows.
IV-B1 Communication Model
During time slot $t$, the antenna gain can be approximated by [mozaffari2016efficient],
(12) 
where the half-power beamwidth is measured in degrees, the main lobe covers a sector below the UAV, and the antenna gain outside the main lobe is a small constant; the antenna gain in dB scale follows directly. The air-to-ground wireless channel can be modeled by considering line-of-sight (LoS) and non-line-of-sight (NLoS) links between the UAV and the users separately. The probability of occurrence of a LoS link for an edge user during time slot $t$ can be modeled as an increasing function of the elevation angle between the UAV and the edge user, with constant parameters reflecting the environment impact [mozaffari2016efficient]; the probability of an NLoS link is the complement. The shadow fading for the LoS and NLoS links during time slot $t$ can be modeled by normal random variables whose mean and variance are in dB scale and depend on the elevation angle and environment parameters [mozaffari2016efficient]. Accordingly, the coverage probability for the edge user during time slot $t$, defined as the probability that the SNR received by the edge user is larger than a threshold, is,

(13)
where the Gaussian tail probability of a standard normal random variable appears through the lognormal shadow fading, and the remaining quantities are the noise floor power, the distance-dependent path loss, the speed of light, the carrier frequency, and the path loss exponent.
IV-B2 UAV Energy Model
The attenuation of solar light passing through a cloud is modeled as an exponential decay in the distance that the solar light travels through the cloud, with an absorption coefficient modeling the optical characteristics of the cloud [sun2019optimal]. The solar energy harvested in time slot $t$ is,
(14) 
where the harvested energy depends on a constant representing the energy harvesting efficiency, the area of the solar panels, the average solar radiation intensity on earth, the time slot duration, and the altitudes of the upper and lower boundaries of the cloud. Based on this model, the output solar energy is highest above the cloud cover, and it attenuates exponentially through the cloud cover down to its lower boundary. During a time slot, the UAV can cruise upwards or downwards at a constant speed, or hover at the same altitude. The energy consumed for cruising or hovering during time slot $t$ is,
(15)  
Here, the consumed power depends on the weight of the UAV, the air density, and the total area of the UAV rotor disks, in addition to a static power consumed to maintain the operation of the UAV. It is worth mentioning that cruising upwards consumes more power than cruising downwards or hovering. Denote the battery energy storage of the UAV at the beginning of slot $t$ by $B_t$. The battery energy at the next slot is given by,
$B_{t+1} = \min\left( B_t + E^{\mathrm{harv}}_t - E^{\mathrm{cons}}_t,\; B_{\max} \right),$  (16)

where $E^{\mathrm{harv}}_t$ and $E^{\mathrm{cons}}_t$ denote the energy harvested and consumed during slot $t$, and $B_{\max}$ is the battery capacity.
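A sketch of the battery dynamics in (16), assuming stored energy is clipped between zero and the battery capacity; the function name and numeric values are illustrative.

```python
def next_battery(B, harvested, consumed, capacity):
    """Battery update, cf. (16): add harvested energy, subtract the energy
    consumed for cruising/hovering, and clip to [0, capacity] (Joules)."""
    return min(max(B + harvested - consumed, 0.0), capacity)

# Illustrative values: harvesting above the cloud cover tops up the battery.
b1 = next_battery(50.0, 30.0, 10.0, capacity=60.0)   # clipped at capacity
b2 = next_battery(5.0, 0.0, 10.0, capacity=60.0)     # cannot go negative
```

The clipping at both ends is what makes the discretized battery level a well-defined finite state component in the CMDP formulation below.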
IV-B3 CMDP Formulation
The constrained control problem of maximizing the wireless coverage probability for the worst-case edge user, while ensuring energy sustainability of the solar-powered UAV, can be modeled as a discrete-time CMDP with discrete state-action spaces as follows. First, the altitude of the UAV is discretized into a finite set of altitude levels. In addition, the finite battery energy of the UAV is discretized into a finite number of energy units. The CMDP can then be formulated as follows:

- The state of the agent consists of the UAV battery energy level and the UAV altitude, both taken from their finite discretized sets.

- The agent controls the UAV vertical velocity and the antenna transmission power