Markov decision processes (MDPs) are a classical formalization of sequential decision making in discrete-time stochastic control processes [sutton2018reinforcement]. In MDPs, the outcomes of actions are uncertain and influence not only immediate rewards but also future rewards through the next states. Policies, which are strategies for action selection, should therefore strike a trade-off between immediate and delayed rewards, and be optimal in some sense. MDPs have gained recognition in diverse fields such as operations research, economics, engineering, wireless networks, artificial intelligence, and learning systems [kallenberg2011markov].
Moreover, an MDP is a mathematically idealized form of the reinforcement learning problem, an active research field within the machine learning community. In many situations, however, finding an optimal policy with respect to a single reward function does not suffice to fully describe sequential decision making in problems with multiple conflicting objectives [mannor2004geometric, van2014multi]. The framework of constrained MDPs (CMDPs) is the natural approach for handling multi-objective decision making under uncertainty [altman1999constrained].
Algorithmic methods for solving CMDPs have been extensively studied when the underlying transition probability function is known [altman1999constrained, dolgov2005stationary, zadorojniy2006robustness, piunovskiy2000constrained, chen2004dynamic, piunovskiy2006dynamic, chen2007non], and unknown [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self, bhatnagar2010actor, bhatnagar2012online, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, prashanth2014actor, paternain2019safe, ding2020provably, raybenchmarking, khairy2020constrained]. In the case of finite state-action spaces with a known transition probability function, the solution of a CMDP can be obtained by solving finite linear programs [altman1999constrained, zadorojniy2006robustness, dolgov2005stationary], or by deriving a Bellman optimality equation with respect to an augmented MDP whose state vector consists of two parts: the first part is the state of the original MDP, while the second part keeps track of the cumulative constraint cost [piunovskiy2000constrained, piunovskiy2006dynamic, chen2004dynamic, chen2007non].
Linear programming methods become computationally impractical at a much smaller number of states than dynamic programming methods do, by an estimated factor of about 100 [sutton2018reinforcement]. In practice, MDP-specific algorithms, which are dynamic programming based methods, hold more promise for efficient solution [littman2013complexity]. While MDP augmentation based methods provide a theoretical framework to apply dynamic programming to constrained stochastic control, they introduce continuous variables to the state space, which rules out practical tabular methods and makes the design of a solution algorithm challenging [chow2015risk].
On the other hand, solution methods for the constrained reinforcement learning problem, i.e., CMDPs with unknown transition probability, are generally based on Lagrangian primal-dual type optimization. In these methods, gradient-ascent is performed on state-values at a fast time scale to find the optimal value function for a given set of Lagrangian multipliers, while gradient-descent is performed on the Lagrangian multipliers at a slower time scale. This process is repeated until convergence to a saddle point. Existing works have explored the primal-dual optimization approach in the tabular setting, i.e., without function approximation [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self]
, and with function approximators such as deep neural networks [bhatnagar2010actor, bhatnagar2012online, prashanth2014actor, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, raybenchmarking, ding2020provably, paternain2019safe, khairy2020constrained]. While this approach is appealing in its simplicity, it suffers from the typical problems of gradient optimization on non-smooth objectives: sensitivity to the initialization of the Lagrange multipliers, and dependence of convergence on the learning rate sequence [chow2017risk, achiam2017constrained, liu2019ipo, liang2018accelerated]. If the learning rate is too small, the Lagrange multipliers will not update quickly enough to enforce the constraint; if it is too high, the algorithm may oscillate around the optimal solution. In practice, a sequence of decreasing learning rates should be adopted to guarantee convergence [borkar2005actor], yet we do not have an obvious method to determine the optimal sequence, nor can we assess the quality of a solution in cases where the objective is not differentiable at the optimal solution.
In this brief, we develop a new approach to solve finite CMDPs with discrete state-action space and known probability transition function. We first prove that the optimization objective in the dual linear program of a finite CMDP is a piece-wise linear convex function (PWLC) with respect to the Lagrange penalty multipliers. Next, we treat the dual linear program of a finite CMDP as a search problem over the Lagrange penalty multipliers, and propose a novel two-level Gradient-Aware Search (GAS) algorithm, which exploits the PWLC structure to find the optimal state-value function and Lagrange penalty multipliers of a CMDP. We empirically compare the convergence performance of the proposed GAS algorithm with binary search (BS), Lagrangian primal-dual optimization (PDO), and Linear Programming (LP), in two application domains, robot navigation in grid world and wireless network management. Compared with benchmark algorithms, we show that the proposed GAS algorithm converges to the optimal solution faster, does not require hyper-parameter tuning, and is not sensitive to initialization of the Lagrange penalty multiplier.
The remainder of this paper is organized as follows. A background of unconstrained and constrained MDPs is given in Section II. Our proposed Gradient-Aware Search (GAS) algorithm is proposed in Section III. Performance evaluation of GAS in two application domains is presented in Section IV, followed by our concluding remarks and future work in Section V.
II. Background and Related Works
II-A Unconstrained Markov Decision Processes
An infinite-horizon Markov decision process (MDP) with discounted returns is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, \mu, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are finite sets of states and actions, respectively, $P$ is the model's state-action-state transition probability function, $\mu$ is the initial distribution over the states, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function which maps every state-action pair to the set of real numbers, and $\gamma \in [0,1)$ is the discount factor. Denote by $P(s'|s,a)$ the transition probability from state $s$ to state $s'$ if action $a$ is chosen. The transition probability from state $s$ to state $s'$ under policy $\pi$ is therefore $P^\pi(s'|s) = \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a)$, where $\pi$ is the adopted policy. The state-value function of state $s$ under policy $\pi$, $V^\pi(s)$, is the expected discounted return starting in state $s$ and following $\pi$ thereafter,
Let $V^\pi$ be the vector of state values, $V^\pi = (V^\pi(s))_{s \in \mathcal{S}}$. The solution of an MDP is a Markov stationary policy $\pi^*$ which maximizes the inner product $\langle \mu, V^\pi \rangle$, i.e.,
There exist several methods to solve (2), including linear programming [kallenberg2011markov] and dynamic programming methods such as value iteration and policy iteration [sutton2018reinforcement]. Based on the linear programming formulation, the optimal value function can be obtained by solving the following primal linear program [kallenberg2011markov],
Linear program (3) has $|\mathcal{S}|$ variables and $|\mathcal{S}||\mathcal{A}|$ constraints, which becomes computationally impractical to solve for MDPs with large state-action spaces. By contrast, dynamic programming is comparatively better suited to handling large state spaces than linear programming, and is widely considered the only feasible way of solving (2) for large MDPs [sutton2018reinforcement]. Of particular interest is the value iteration algorithm, which generates an optimal policy in polynomial time for a fixed $\gamma$ [littman2013complexity]. In the value iteration algorithm, the state-value function is iteratively updated based on Bellman's principle of optimality,
It has been shown that the non-linear map in (4) is a monotone contraction mapping that admits a unique fixed point, which is the optimal value function [kallenberg2011markov]. Having obtained the optimal value function, the optimal policy can be derived using a one-step look-ahead: the optimal action at state $s$ is the one that attains the equality in (4), with ties broken arbitrarily.
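The Bellman backup (4) and the one-step look-ahead policy extraction can be sketched as follows. This is a minimal illustration, not the paper's implementation; the array layout (`P` as an $S \times A \times S$ tensor) and all names are our own assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """P: (S, A, S) transition tensor, r: (S, A) reward matrix.

    Returns the (approximately) optimal value function and the greedy
    policy obtained by one-step look-ahead, as in update (4)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Bellman optimality backup: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = r + gamma * P @ V          # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # one-step look-ahead policy
        V = V_new
    return V, Q.argmax(axis=1)
```

Because the backup is a $\gamma$-contraction, the loop converges geometrically to the unique fixed point regardless of the initialization of `V`.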
II-B Constrained Markov Decision Processes
In constrained MDPs (CMDPs), an additional immediate cost function $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is augmented, such that a CMDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, \mu, r, c, \gamma)$ [altman1999constrained]. The state-value function $V^\pi(s)$ is defined as in (1). In addition, the infinite-horizon discounted cost of a state $s$ under policy $\pi$, $C^\pi(s)$, is defined as,
Let $C^\pi$ be the vector of state costs, $C^\pi = (C^\pi(s))_{s \in \mathcal{S}}$, and $\bar{c}$ a given constant which represents the constraint upper bound. The solution of a CMDP is a Markov stationary policy which maximizes $\langle \mu, V^\pi \rangle$ subject to the constraint $\langle \mu, C^\pi \rangle \le \bar{c}$,
The canonical solution methodology for (6) is based on convex linear programming [altman1999constrained]. Of interest is the following dual linear program, which can be solved for the optimal state-value function and Lagrange penalty multiplier $\lambda^*$. (The primal linear program is defined over a convex set of occupancy measures, and by Lagrangian strong duality the solution of (6) can be obtained; interested readers can find more details about this formulation in Chapter 3 of [altman1999constrained].)
This formulation has led to the development of multi-time-scale stochastic-approximation-based Lagrangian primal-dual learning algorithms to solve CMDPs with and without value function approximation [geibel2005risk, geibel2012learning, borkar2005actor, abad2002self, bhatnagar2010actor, bhatnagar2012online, prashanth2014actor, tessler2018reward, liang2018accelerated, fu2018risk, achiam2017constrained, raybenchmarking, ding2020provably, paternain2019safe, khairy2020constrained]. In such algorithms, $\lambda$ is updated by gradient descent at a slow time-scale, while the state-value function is optimized on a faster time-scale using dynamic programming (model-based) or gradient ascent (model-free), until an optimal saddle point is reached. These approaches, however, suffer from the typical problems of gradient optimization on non-smooth objectives: sensitivity to the initialization of $\lambda$, and dependence of convergence on the learning rate sequence used to update $\lambda$ [chow2017risk, achiam2017constrained, liu2019ipo, liang2018accelerated]. As we will show in the next section, the optimization objective in (7) is piecewise linear convex with respect to $\lambda$, and so it is not differentiable at the cusps. This means that we cannot assess the optimality of a solution, because the objective usually does not have a zero gradient at the optimal $\lambda^*$. It also means that convergence to $\lambda^*$ depends on specifying an optimal learning rate sequence, which is by itself a very challenging task. Unlike existing methods, our proposed algorithm explicitly tackles these problems by exploiting the problem structure, and so it does not require hyper-parameter tuning or specifying a learning rate sequence.
III. A Novel CMDP Framework
In this section, we describe our novel methodology to solve finite CMDPs with discrete state-action space and known probability transition function. For ease of notation, we hereby consider the case of one constraint, yet our proposed algorithm can be readily extended to the multi-constraint case by successively iterating the method described hereafter over each Lagrange penalty multiplier, one at a time.
Our proposed method works in an iterative two-level optimization scheme. For a given Lagrange penalty multiplier $\lambda$, an unconstrained MDP with a penalized reward function $r - \lambda c$ is specified, and the corresponding optimal value function is found using dynamic programming (8). Denote the optimization objective in (7), viewed as a function of $\lambda$, by $\theta(\lambda)$. Thus, to evaluate $\theta(\lambda)$, one has to solve for the optimal value function of the penalized MDP first. How can we efficiently search for the optimal $\lambda^*$? To answer this question, we first prove that $\theta(\lambda)$ is a piecewise linear convex function with respect to $\lambda$, and then design an efficient solution algorithm.
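The two-level scheme can be sketched as follows: for each candidate multiplier, solve the penalized MDP by value iteration and read off the dual objective. This is a minimal illustration under our own assumptions (in particular, the encoding `theta(lam) = mu @ V_lam + lam * c_bar` and all names are ours, not the paper's code).

```python
import numpy as np

def dual_objective(lam, P, r, c, mu, c_bar, gamma, tol=1e-8):
    """Evaluate the dual objective theta(lam) = <mu, V_lam> + lam * c_bar,
    where V_lam is the optimal value function of the unconstrained MDP
    with penalized reward r - lam * c (an assumed form of (7))."""
    S, A, _ = P.shape
    V = np.zeros(S)
    r_pen = r - lam * c                         # penalized reward function
    while True:
        V_new = (r_pen + gamma * P @ V).max(axis=1)   # Bellman backup (8)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return mu @ V_new + lam * c_bar
```

Evaluating `dual_objective` on a grid of multipliers traces out the piecewise linear convex curve whose minimum the search algorithm targets.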
Let $\lambda \in \mathbb{R}$, and $a_i, b_i \in \mathbb{R}$ for $i = 1, \dots, k$ for some positive integer $k$. A piecewise linear convex (PWLC) function is given by $f(\lambda) = \max_{i = 1, \dots, k} (a_i \lambda + b_i)$. (Piecewise affine is more accurate but less common.)
Given that $V_\lambda$ is the optimal value function of an unconstrained MDP with a penalized reward function $r - \lambda c$, $\mu(s)$ is the probability that the initial state is $s$, and $\bar{c}$ is a constant representing the cost-constraint upper bound in the original CMDP, $\theta(\lambda)$ is a PWLC function with respect to $\lambda$.
For ease of exposition, we introduce the following three properties of PWLC functions, and show how they can be used to construct the proof:
Property 1: the point-wise maximum $\max(f_1(\lambda), f_2(\lambda))$ is PWLC for any PWLC $f_1$ and $f_2$. Notice that for the point-wise maximum, $\max(f_1(\lambda), f_2(\lambda)) = \max\big(\max_i (a_i \lambda + b_i), \max_j (c_j \lambda + d_j)\big)$. Hence, $\max(f_1(\lambda), f_2(\lambda))$ is a maximum over the combined set of affine functions $\{(a_i, b_i)\} \cup \{(c_j, d_j)\}$, which is PWLC.
Property 2: the sum $f_1(\lambda) + f_2(\lambda)$ is PWLC for any PWLC $f_1$ and $f_2$. Notice that $f_1(\lambda) + f_2(\lambda) = \max_i (a_i \lambda + b_i) + \max_j (c_j \lambda + d_j)$. Hence, $f_1(\lambda) + f_2(\lambda) = \max_{i,j} \big((a_i + c_j)\lambda + (b_i + d_j)\big)$, which is PWLC.
Property 3: $f(\lambda) + a\lambda + b$ is PWLC for any PWLC $f$ and constants $a, b$. Notice that $f(\lambda) + a\lambda + b = \max_i \big((a_i + a)\lambda + (b_i + b)\big)$, which is PWLC.
Based on these properties, the proof proceeds as follows. Given $\lambda$, the recursive formula in (8) is expanded by iteratively substituting the right-hand side of (8) for the value function. This expansion allows us to deduce that the optimal value function has the following functional form,
for some constants which come from discounted rewards, costs, and their products with the transition probability function. By successive application of Properties 1 and 2, the values $V_\lambda(s)$ are PWLC functions with respect to $\lambda$. Also, the weighted sum $\sum_s \mu(s) V_\lambda(s)$ is PWLC with respect to $\lambda$ by Property 2. Finally, $\theta(\lambda)$ is a PWLC function with respect to $\lambda$ by Property 3.
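The three PWLC-preserving properties can also be checked numerically. The sketch below is our own illustration (not part of the paper's algorithm): it samples a point-wise maximum of random affine functions and verifies midpoint convexity of a sum of such functions plus an affine term, as in Properties 1-3.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)   # random affine pieces

def pwlc(lam):
    # Property 1 in action: a point-wise max of affine functions is PWLC.
    return np.max(a * lam + b)

# Property 2 (sum of PWLC functions) and Property 3 (adding an affine term)
# preserve convexity; verify midpoint convexity on a uniform grid.
lams = np.linspace(-5, 5, 201)
f = np.array([pwlc(l) + pwlc(2 * l) + 0.3 * l for l in lams])
assert all(f[i] <= (f[i - 1] + f[i + 1]) / 2 + 1e-12 for i in range(1, 200))
```

On a uniform grid, midpoint convexity `f[i] <= (f[i-1] + f[i+1]) / 2` is exactly the discrete analogue of the convexity the theorem establishes for the dual objective.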
To efficiently find $\lambda^*$ and solve (7), we propose a novel Gradient-Aware Search (GAS) algorithm which exploits the piecewise linear convex structure. Our proposed GAS algorithm, outlined in Algorithm 1, operates over two loops. For a given Lagrange multiplier (line 6) at an outer loop iteration, the inner loop finds the optimal state-value function using value iteration, as well as the gradient components of $\theta(\lambda)$ with respect to $\lambda$ (lines 7-16). Based on this information, the outer loop exploits the PWLC structure to calculate the next $\lambda$ (line 17, lines 20-25). This process continues until a termination criterion is met (lines 18-19). In what follows, the proposed GAS algorithm is described in detail.
Inner loop: The value iterations of the inner loop are sub-scripted with the index $k$. Let $Q_k(s,a)$ denote the state-action value function, which is the expected discounted return starting in state $s$, taking action $a$, and following the greedy policy thereafter, for a fixed $\lambda$. Note that we drop the subscript $\lambda$ to avoid notational clutter, but the dependence on $\lambda$ remains unchanged. The state-action value function at each step can be represented as a linear function of $\lambda$,
In this representation, accumulated discounted rewards contribute to the reward component and accumulated discounted costs contribute to the cost component (line 12). Given these two components, the value function at state $s$ is obtained from the penalized state-action value at the greedy action (line 10), where the greedy action is the one which attains the maximum state-value among the set of actions at state $s$ (line 11). Notice that in the inner loop, two functions are estimated using value iterations: the total state-action value function (line 8), and the state-action cost function (line 12). Dynamic programming steps in the inner loop are performed until the mean relative absolute error in the estimation of the state-value function falls below a preset accuracy threshold (lines 14-16). When the inner loop concludes, the optimal value and cost components are returned to the outer loop.
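A sketch of the inner loop, tracking the reward and cost components of the state-action value function separately so that the penalized value is their combination and the cost component can later feed the gradient computation. All names and the array layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def inner_loop(lam, P, r, c, gamma, tol=1e-8, max_iter=100_000):
    """Value iteration on the penalized MDP that keeps Q = Q_r - lam * Q_c
    decomposed, so the cost component of the value function is available
    to the outer loop for the gradient computation."""
    S, A, _ = P.shape
    V_r, V_c = np.zeros(S), np.zeros(S)
    for _ in range(max_iter):
        Q_r = r + gamma * P @ V_r      # discounted-reward component of Q
        Q_c = c + gamma * P @ V_c      # discounted-cost component of Q
        greedy = (Q_r - lam * Q_c).argmax(axis=1)   # maximize penalized value
        idx = np.arange(S)
        V_r_new, V_c_new = Q_r[idx, greedy], Q_c[idx, greedy]
        # convergence is checked on the penalized value V = V_r - lam * V_c
        delta = np.max(np.abs((V_r_new - lam * V_c_new) - (V_r - lam * V_c)))
        V_r, V_c = V_r_new, V_c_new
        if delta < tol:
            break
    return V_r, V_c
```

The returned cost component `V_c`, weighted by the initial distribution, is exactly what the outer loop needs to form the gradient of the dual objective.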
Outer loop: Given the value and cost components from the inner loop, $\theta(\lambda)$ and the gradient of $\theta(\lambda)$ with respect to $\lambda$ can be obtained (line 17),
When the gradient is negative, the current policy is infeasible, which means $\lambda$ should be increased. On the other hand, when the gradient is non-negative, the current policy is feasible, yet we can potentially obtain another feasible policy with larger returns by decreasing $\lambda$. Notice that if $\lambda$ is a cusp which lies at the intersection point of two linear segments of the PWLC function $\theta(\lambda)$, the derivative does not exist in theory, although the two one-sided derivatives (left and right) are well-defined. The nonexistence of a derivative at cusps does not pose a challenge to our proposed algorithm, because computations are accurate only up to the machine's floating-point precision. To further clarify, suppose $\lambda_0$ is a cusp on the curve of $\theta(\lambda)$. In practice, we can only compute the gradient (11) at $\lambda_0 \pm \epsilon_m$, where
$\epsilon_m$ is a very small precision loss due to rounding in floating-point arithmetic, which is on the order of $10^{-16}$ for double precision as defined by the IEEE 754-2008 standard. This means that computing the gradient at a point is essentially computing a one-sided derivative, which always exists.
It is worth mentioning that our proposed algorithm does not perform gradient optimization (which suffers on non-smooth objectives) but rather exploits the structure of the PWLC objective to find the optimal $\lambda^*$. The algorithm always retains two values of $\lambda$, $\lambda^-$ and $\lambda^+$, where the gradient of $\theta$ is negative at $\lambda^-$ and non-negative at $\lambda^+$. $\lambda^-$ along with its gradient defines a line $L^-$, while $\lambda^+$ along with its gradient defines another line $L^+$. $L^-$ and $L^+$ intersect at a point $\lambda^{int}$, which is passed to the inner loop. There are two possible outcomes when running the inner loop with $\lambda^{int}$,
If the gradient at $\lambda^{int}$ is negative, then $\theta(\lambda^{int}) \le \theta(\lambda^-)$ because $\theta(\lambda)$ is PWLC with respect to $\lambda$. The algorithm retains $\lambda^{int}$ and $\lambda^+$ by setting $\lambda^- \leftarrow \lambda^{int}$ (lines 22-23).
If the gradient at $\lambda^{int}$ is non-negative, then $\theta(\lambda^{int}) \le \theta(\lambda^+)$ because $\theta(\lambda)$ is PWLC with respect to $\lambda$. The algorithm retains $\lambda^-$ and $\lambda^{int}$ by setting $\lambda^+ \leftarrow \lambda^{int}$ (lines 24-25).
The algorithm terminates when the absolute error between the minimum objective value found so far and the objective value at the intersection of the two lines is less than an arbitrarily small positive number $\delta$ (lines 18-19). Note that the initial $\lambda^+$ is set to an arbitrarily large number which attains a non-negative gradient. If no such $\lambda^+$ can be found, then (7) is not feasible, i.e., $\theta(\lambda)$ is not lower bounded and can be made arbitrarily small by taking $\lambda \to \infty$. Figure 1 below shows a visual illustration of how Algorithm 1 works in practice. It is worth mentioning that Algorithm 1 only evaluates $\theta(\lambda)$ at the points indicated by a green circle.
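The outer-loop update above can be sketched as follows, assuming a routine `theta_and_grad` that runs the inner loop for a given multiplier and returns the objective value together with a (one-sided) gradient. The interface and all names are illustrative assumptions.

```python
def gas_outer(theta_and_grad, lam_neg, lam_pos, eps=1e-6, max_iter=100):
    """Gradient-aware search over a PWLC objective theta(lam).

    lam_neg / lam_pos bracket the minimum: the gradient is negative at
    lam_neg and non-negative at lam_pos. Each step evaluates theta at the
    intersection of the two tangent lines and shrinks the bracket."""
    th_n, g_n = theta_and_grad(lam_neg)
    th_p, g_p = theta_and_grad(lam_pos)
    best = min(th_n, th_p)                 # incumbent objective value
    for _ in range(max_iter):
        # Intersection of th_n + g_n (x - lam_neg) and th_p + g_p (x - lam_pos):
        x = (th_p - th_n + g_n * lam_neg - g_p * lam_pos) / (g_n - g_p)
        th_x = th_n + g_n * (x - lam_neg)  # tangent-line value: a lower bound
        th, g = theta_and_grad(x)
        best = min(best, th)
        if abs(best - th_x) < eps:         # incumbent meets the lower bound
            return x, th
        if g < 0:
            lam_neg, th_n, g_n = x, th, g  # x replaces the negative-gradient end
        else:
            lam_pos, th_p, g_p = x, th, g  # x replaces the non-negative end
    return x, th
```

Because the tangent-line intersection lower-bounds a convex function on the bracket, the gap between the incumbent and that bound certifies near-optimality at termination, without any learning rate.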
IV. Case Studies
In this section, we evaluate the efficacy of the proposed GAS algorithm in two different domains: robot navigation in a grid world, and solar-powered UAV-based wireless network management. In all experiments, the accuracy thresholds $\epsilon$ and $\delta$ are kept fixed unless otherwise specified.
IV-A Grid World Robot Navigation
In the grid world robot navigation problem, the agent controls the movement of a robot in a rectangular grid world, where states represent grid points on a 2D terrain map. The robot starts in a safe region in the bottom right corner (blue square) and is required to travel to the goal located in the top right corner (green square), as can be seen in Figure 3. At each time step, the agent can move in one of four directions to a neighbouring state. The movement of the robot is stochastic, however, and only partially depends on the chosen direction: with some probability the robot moves in the chosen direction, and otherwise it moves uniformly at random to one of the four neighboring states. At each time step, the agent receives a small negative reward to account for fuel usage; upon reaching the goal, the agent receives a large positive reward. In between the starting and destination states there are a number of obstacles (red squares) that the agent should avoid, and hitting an obstacle incurs a cost. It is important to note that the 2D terrain map is built such that a shorter path induces a higher risk of hitting obstacles. By maximizing the long-term expected discounted reward subject to a bound on the long-term expected discounted cost, the agent finds the shortest path from the start state to the destination such that the toll for hitting obstacles does not exceed an upper bound. This problem is inspired by classic grid world problems in the literature [tessler2018reward, chow2015risk].
IV-A.1 Experiment Results
We choose a rectangular grid with the start state in the bottom right corner, the destination in the top right corner, and a number of deployed obstacles. We follow [chow2015risk] in the choice of parameters, which trades off a high penalty for obstacle collisions against computational complexity. $\theta(\lambda)$ is plotted in Figure 2(a) over a range of $\lambda$. It can be seen from Figure 2(a) that $\theta(\lambda)$ is a PWLC function, as established by the theorem. It can also be seen that the global minimum is a non-differentiable cusp with unequal one-sided derivatives.
In Figure 2(b), we compare the convergence performance of the proposed GAS with binary search (BS) by plotting the number of outer loop iterations versus the accuracy level $\delta$. The initial search window for $\lambda$ is problem specific; the ability to choose an arbitrarily large initial window without negatively impacting convergence time is desirable because it demonstrates the algorithm's capability to make adaptively large strides in the search space towards the optimal solution. Given a value for $\lambda$, the inner loop evaluates (10) and its gradient (11). The outer loop then determines the next $\lambda$ based on either the proposed GAS or BS. This iterative procedure continues until the convergence criterion is met. Compared to BS, the proposed GAS algorithm requires a smaller number of outer loop iterations at all accuracy thresholds $\delta$. This is because GAS adaptively shrinks the search window by evaluating points at the intersection of line segments with non-negative and negative gradients, whereas BS blindly halves the window size every iteration. These results demonstrate that $\delta$ can be set arbitrarily small and the initial search window arbitrarily large without substantially impacting the convergence time.
In Figure 2(c), we compare the convergence performance of the proposed GAS with Lagrangian primal-dual optimization (Lagrangian PDO) by plotting the total number of value iterations, i.e., the sum of the number of value iterations over all outer loop iterations, versus the learning rate decay parameter $\beta$ on a log-log scale. In Lagrangian PDO, gradient descent on $\lambda$ is performed alongside dynamic programming iterations on the value function, which motivates comparing convergence in terms of the total number of value iterations. In Lagrangian PDO, $\lambda$ is updated by a gradient step with step size $\alpha$. In our experiments, $\alpha$ is initially set to a large value $\alpha_0$ to speed up the descent towards the minimum, and is decayed according to $\alpha = \alpha_0 \beta^k$, where $\beta$ is the learning rate decay parameter and $k$ is the number of times the gradient changes sign. While this choice of learning rate decay does not necessarily produce an optimal sequence, it serves to expose the sensitivity of gradient-based optimization to the initial step size $\alpha_0$ and the decay parameter $\beta$. For every $\beta$, $\lambda$ is uniformly randomly initialized several times from the initial search window, and the mean number of total value iterations is plotted in Figure 2(c). It can be seen from Figure 2(c) that the convergence of the Lagrangian PDO method is sensitive to the initialization of $\lambda$, which can be inferred from the initial window size, and to the decay rate parameter. For some $\beta$, Lagrangian PDO converges faster than GAS, but for other values it lags behind. In practice, $\beta$ is a hyper-parameter which requires tuning, consuming extra computational resources. In contrast, the proposed GAS algorithm works out of the box without the need to specify a learning rate sequence or tune any hyper-parameters.
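For reference, a minimal sketch of the Lagrangian PDO baseline described above, with the step size decayed each time the gradient changes sign. The decay schedule and all names are illustrative assumptions consistent with the description, not the paper's code.

```python
def pdo(theta_and_grad, lam0=0.0, alpha0=1.0, beta=0.7, iters=200):
    """Lagrangian primal-dual baseline: projected gradient descent on the
    multiplier, with step size alpha0 * beta**k after k gradient sign flips."""
    lam, alpha, prev_sign, k = lam0, alpha0, 0, 0
    for _ in range(iters):
        _, g = theta_and_grad(lam)         # inner loop supplies the gradient
        sign = 1 if g >= 0 else -1
        if prev_sign and sign != prev_sign:
            k += 1
            alpha = alpha0 * beta ** k     # decay on each sign flip
        prev_sign = sign
        lam = max(0.0, lam - alpha * g)    # projected gradient step (lam >= 0)
    return lam
```

On a simple PWLC objective this oscillates around the cusp with amplitude proportional to the current step size, which is exactly the sensitivity to `alpha0` and `beta` that the experiment exposes.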
We also solve (7) using linear programming (LP). We use the LP solver provided by Gurobi, which is arguably the most powerful mathematical optimization solver on the market. The optimal objective value, Lagrange multiplier, and Bellman errors are reported, where the Bellman error at a state is the residual of the Bellman optimality equation at that state. It can be observed that Gurobi's LP solver converges to a sub-optimal solution with a higher objective value compared with the proposed GAS algorithm (recall that (7) is a minimization problem, hence lower is better). This demonstrates that generic LP solvers may struggle to solve CMDPs, which has motivated the research community to develop CMDP-specific algorithms.
TABLE: Comparison of LP (Gurobi) and GAS for the grid world robot navigation and solar-powered UAV-based wireless network problems.
In Figure 3, filled contour maps of the value function are plotted; regions with the same value share the same color shade. The value function is plotted for four different values of the cost-constraint upper bound, along with the start state (blue square), destination (green square), and obstacles (red squares). With a tighter cost constraint, a safer policy is learned in which the agent takes a longer path around the obstacles, as in Figure 3(d). When the cost constraint is loose, a riskier policy is learned in which the agent takes a shorter path by navigating between the obstacles, as in Figure 3(a). The success rates of reaching the destination under each policy are estimated from roll-outs.
IV-B Solar-Powered UAV-Based Wireless Networks
As a second application domain, we study a wireless network management problem of interest to the wireless networking research community. We consider a solar-powered UAV-based wireless network consisting of one UAV which relays data from network servers to wireless devices. Wireless devices are independently and uniformly deployed within a circular geographical area, and the UAV is deployed at its center. The UAV is equipped with solar panels which harvest solar energy to replenish its on-board battery storage. Because solar light is attenuated by the cloud cover, solar energy harvesting is highest at higher altitudes. System time is slotted into fixed-length discrete time units indexed by $t$. During a time slot, the solar-powered UAV provides downlink wireless coverage to wireless devices on the ground. The UAV is equipped with an intelligent controller which, at each time slot, controls the UAV's altitude, its directional antenna half-power beamwidth angle, and its transmission power in dBm, based on its battery energy state and altitude, so as to maximize the wireless coverage probability for the worst-case edge user while ensuring energy sustainability of the solar-powered UAV. Note that the worst-case edge user is the user farthest from the UAV. Due to distance-dependent free-space path loss, the received signal at the edge user is heavily attenuated, resulting in a low signal-to-noise ratio (SNR) and decreased coverage probability. Thus, by maximizing the coverage probability of the edge user, the communication performance is improved for every user in the network, albeit not proportionally. This wireless network control problem can be modeled as follows.
IV-B.1 Communication Model
During time slot $t$, the antenna gain can be approximated by [mozaffari2016efficient],
where the half-power beamwidth angle is in degrees and the antenna gain outside the main lobe is a small constant; the antenna gain in dB scale follows by taking $10\log_{10}(\cdot)$. The air-to-ground wireless channel can be modeled by considering line-of-sight (LoS) and non-line-of-sight (NLoS) links between the UAV and the users separately. The probability of occurrence of a LoS link for an edge user during time slot $t$ can be modeled as a function of the elevation angle between the UAV and the edge user, with constant parameters reflecting the environment impact [mozaffari2016efficient]; the probability of an NLoS link is the complement. The shadow fading for the LoS and NLoS links during time slot $t$ can be modeled by normal random variables, respectively, whose means and variances (in dB scale) depend on the elevation angle and environment parameters [mozaffari2016efficient]. Accordingly, the coverage probability for the edge user during time slot $t$, defined as the probability that the SNR received by the edge user is larger than a threshold, is,
where $Q(\cdot)$ denotes the tail distribution function of a standard normal random variable, and the remaining quantities are the noise floor power and the distance-dependent path loss, which depends on the speed of light, the carrier frequency, and the path-loss exponent.
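A hedged sketch of such a coverage-probability computation, mixing LoS and NLoS links under Gaussian (dB-scale) shadowing via the standard-normal tail function. All parameter names and the extra NLoS loss term are illustrative assumptions, not the paper's exact model.

```python
import math

def q_func(x):
    # Gaussian Q-function: P(Z > x) for a standard normal Z.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def coverage_prob(p_tx_dbm, gain_db, pathloss_db, noise_dbm, snr_th_db,
                  p_los, sigma_los_db, sigma_nlos_db, extra_nlos_loss_db=20.0):
    """Edge-user coverage probability: LoS/NLoS mixture, each link with
    log-normal shadowing of the given dB-scale standard deviation."""
    def link_cov(extra_loss_db, sigma_db):
        mean_snr_db = p_tx_dbm + gain_db - pathloss_db - extra_loss_db - noise_dbm
        return q_func((snr_th_db - mean_snr_db) / sigma_db)
    return (p_los * link_cov(0.0, sigma_los_db)
            + (1 - p_los) * link_cov(extra_nlos_loss_db, sigma_nlos_db))
```

Raising the SNR threshold (or the path loss) shifts the argument of the Q-function up and monotonically reduces the coverage probability, matching the qualitative behavior described above.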
IV-B.2 UAV Energy Model
The attenuation of solar light passing through a cloud is modeled as an exponential decay in the distance that the solar light travels through the cloud, with an absorption coefficient modeling the optical characteristics of the cloud [sun2019optimal]. The solar energy harvesting model in time slot $t$ is,
where the harvested energy depends on a constant representing the energy harvesting efficiency, the area of the solar panels, the average solar radiation intensity on earth, the time slot duration, and the altitudes of the upper and lower boundaries of the cloud. Based on this model, the output solar energy is highest above the cloud cover, and it attenuates exponentially through the cloud cover down to the cloud's lower boundary. During a time slot, the UAV can cruise upwards or downwards at a constant speed, or hover at the same altitude. The energy consumed for cruising or hovering during time slot $t$ is,
Here, the power consumption depends on the weight of the UAV, the air density, and the total area of the UAV rotor disks, in addition to a static power consumed to maintain the operation of the UAV. It is worth mentioning that cruising upwards consumes more power than cruising downwards or hovering. Denote the battery energy storage of the UAV at the beginning of slot $t$ by $B_t$. The battery energy at the next slot is given by,
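The harvesting and battery dynamics above can be sketched as follows. Parameter names, the clipping behavior at the capacity limits, and the cloud geometry are illustrative assumptions consistent with the description, not the paper's exact equations.

```python
import math

def harvested_energy(h, eta, area, g_solar, dt, beta, h_low, h_up):
    """Solar energy harvested in one slot: full output above the cloud top
    h_up, decaying exponentially with the light path through the cloud
    layer [h_low, h_up], and constant below the cloud bottom."""
    d_cloud = min(max(h_up - h, 0.0), h_up - h_low)  # light path inside cloud
    return eta * area * g_solar * dt * math.exp(-beta * d_cloud)

def next_battery(b, e_harvest, e_consumed, b_max):
    """Battery dynamics: add harvested energy, subtract propulsion/static
    energy, and clip to the finite battery capacity [0, b_max]."""
    return min(max(b + e_harvest - e_consumed, 0.0), b_max)
```

This captures the altitude trade-off the controller faces: flying higher increases `harvested_energy` but costs more cruising energy, and both effects meet in the clipped battery update.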
IV-B.3 CMDP Formulation
This constrained control problem of maximizing the wireless coverage probability for the worst-case edge user, while ensuring energy sustainability of the solar-powered UAV, can be modeled as a discrete-time CMDP with discrete state-action spaces as follows. First, the altitude of the UAV is discretized into discrete units, defining a finite set of possible UAV altitudes. In addition, the finite battery energy of the UAV is discretized into energy units of a fixed number of Joules each, defining a finite set of possible UAV battery energy levels. Accordingly, the CMDP can be formulated as follows.
The state of the agent is the pair of the UAV's battery energy level and altitude.
The agent controls the UAV vertical velocity , the antenna transmission power