1 Introduction
Reinforcement learning (RL) is a framework to address sequential decision-making problems (Sutton & Barto, 2018; Szepesvári, 2010). In RL, a decision maker learns a policy to optimize a long-term objective by interacting with the (unknown or partially known) environment. The RL agent obtains evaluative feedback, usually known as reward or cost, for its actions at each time step, allowing it to improve the performance of subsequent actions (Sutton & Barto, 2018). With the advent of deep learning, RL has witnessed huge successes in recent times
(Silver et al., 2017). However, since most of these methods rely on model-free RL, several unsolved challenges restrict the use of these algorithms in many safety-critical physical systems (Vamvoudakis et al., 2015; Benosman, 2018). For example, it is very difficult for most model-free RL algorithms to ensure basic properties such as stability of solutions and robustness with respect to model uncertainties. This has led to several research directions that study incorporating robustness, constraint satisfaction, and safe exploration during learning for safety-critical applications. While robust constraint satisfaction and stability guarantees are highly desirable properties, they are also very challenging to incorporate into RL algorithms. The main goal of our work is to formalize this incorporation as robust constrained MDPs (RCMDPs) and derive the theory necessary to solve them.

Constrained Markov Decision Processes (CMDPs) are a superclass of MDPs that incorporate expected cumulative cost constraints
(Altman, 2004). Several solution methods have been proposed in the literature for solving CMDPs: trust-region-based methods (Achiam et al., 2017), linear-programming-based solutions (Altman, 2004), surrogate-based methods (Chamiea et al., 2016; Dalal et al., 2018), and Lagrangian methods (Geibel & Wysotzki, 2005; Altman, 2004). We refer to these CMDPs as non-robust, since they do not take model uncertainties into account. Another line of work explicitly handles model uncertainties and is known as Robust MDPs (RMDPs) (Nilim & Ghaoui, 2004; Wiesemann et al., 2013). RMDPs consider a set of plausible models drawn from so-called ambiguity sets, and compute solutions that perform well even for the worst possible realization of models (Russel & Petrik, 2019a; Wiesemann et al., 2013; Iyengar, 2005). However, unlike CMDPs, RMDPs are not capable of handling safety constraints.

Safety constraints are important in real-life applications (Altman, 2004): in many situations one cannot afford to risk violating given constraints. For example, in autonomous cars, there are hard safety constraints on the car velocities and steering angles (Lin et al., 2018). Moreover, for many practical applications, training often occurs in a simulated environment, the goal being to mitigate the sample inefficiency of model-free RL algorithms (van Baar et al., 2019)
. The resulting policy is then transferred to the real world, typically followed by fine-tuning, a process referred to as Sim2Real. The simulator is by definition inaccurate with respect to the targeted problem, due to approximations and a lack of system identification. Heuristic approaches like domain randomization (van Baar et al., 2019) and meta-learning (Finn et al., 2017) try to address model uncertainty in this setting, but they often are not theoretically sound. In safety-critical applications, a policy trained in simulation is expected to offer certain safety guarantees when transferred to the real world.

In light of these practical motivations, we propose to unite the two concepts of RMDPs and CMDPs, leading to a new framework we refer to as RCMDPs. The motivation is to ensure both safety and robustness: the goal of RCMDPs is to learn policies that satisfy given safety constraints while also performing well under worst-case model realizations. The contributions of this paper are fourfold: 1) we formulate the concept of RCMDPs and derive the related theory, 2) we propose gradient-based methods to optimize the RCMDP objective, 3) we independently derive a Lyapunov-based reward-shaping technique, and 4) we empirically validate the utility of the proposed ideas on several problem domains.
The paper is organized as follows: Section 2 describes the formulation of our RCMDP framework and the objective we seek to optimize. A Lagrange-based approach is presented in Section 3, along with the required gradient update formulas and corresponding policy optimization algorithms. Section 4 is dedicated to Lyapunov-stable RCMDPs and presents the idea of Lyapunov-based reward shaping. We draw concluding remarks in Section 5.
2 Problem Formulation: The RCMDP Concept
We consider Robust Markov Decision Processes (RMDPs) with a finite set of states $\mathcal{S} = \{1, \dots, S\}$ and a finite set of actions $\mathcal{A} = \{1, \dots, A\}$. Every action $a \in \mathcal{A}$ is available for the decision maker to take in every state $s \in \mathcal{S}$. After taking an action $a$ in state $s$, the decision maker transitions to a next state $s'$ according to the true, but unknown, transition probability $p^\star_{s,a} \in \Delta^S$ and receives a reward $r_{s,a,s'} \in \mathbb{R}$. We use $p_{s,a} \in \Delta^S$ to denote transition probabilities from state $s$ under action $a$, and condense these to refer to a transition function as $P \colon \mathcal{S} \times \mathcal{A} \to \Delta^S$. We condense the rewards to vectors $r_{s,a} \in \mathbb{R}^S$ and $r \in \mathbb{R}^{S \cdot A \cdot S}$.

Our RMDP setting assumes that the transition $p_{s,a}$ is chosen adversarially from an ambiguity set $\mathcal{P}_{s,a} \subseteq \Delta^S$ for each state $s$ and action $a$. An ambiguity set $\mathcal{P}_{s,a}$, defined for each state $s$ and action $a$, is a set of feasible transition probabilities quantifying the uncertainty in transitions. We restrict our attention to rectangular ambiguity sets, which simply assume independence between the transition probabilities of different state-action pairs (Le Tallec, 2007; Wiesemann et al., 2013). We define the $L_1$-norm-bounded ambiguity sets around the nominal transition probability $\bar{p}_{s,a}$, estimated from some dataset $\mathcal{D}$, as:
$$\mathcal{P}_{s,a} = \big\{ p \in \Delta^S \;:\; \| p - \bar{p}_{s,a} \|_1 \le \psi_{s,a} \big\},$$
where $\psi_{s,a} \ge 0$ is the budget of allowed deviations. This budget can be computed for each $s$ and $a$ using the Hoeffding bound (Russel & Petrik, 2019b): $\psi_{s,a} = \sqrt{\tfrac{2}{n_{s,a}} \log \tfrac{S A 2^S}{\delta}}$, where $n_{s,a}$ is the number of transitions in dataset $\mathcal{D}$ originating from state $s$ and action $a$, and $1 - \delta$ is the confidence level. This $\psi_{s,a}$, if used to compute a policy in RMDPs, guarantees that the computed return is a lower bound on the true return with probability $1 - \delta$. Note that this is just one specific choice for the ambiguity set; our method extends to any other type of ambiguity set, e.g., $L_\infty$-norm, Bayesian, weighted, or sampling-based sets. We use $\mathcal{P}$ to refer collectively to the ambiguity sets $\mathcal{P}_{s,a}$ of all state-action pairs; this collective set carries the notion of independence between state-action pairs in a tabular setting with discrete states and actions. Sampling-based sets under approximate methods (e.g., neural networks) for large and continuous problems also build on this notion of ambiguity sets (Tamar et al., 2014; Derman et al., 2018).

A stationary randomized policy $\pi(\cdot \mid s)$ for state $s$ defines a probability distribution over actions $a \in \mathcal{A}$. The set of all randomized stationary policies is denoted by $\Pi$. We parameterize the randomized policy for state $s$ as $\pi_\theta(\cdot \mid s)$, where $\theta \in \mathbb{R}^k$ is a $k$-dimensional parameter vector. Let $\tau = (s_0, a_0, s_1, a_1, \dots)$ be a sampled trajectory generated by executing a policy $\pi_\theta$ from a starting state $s_0 \sim \mu$ under transition probabilities $P$, where $\mu$ is the distribution of initial states. Then the probability of sampling trajectory $\tau$ is $\Pr(\tau \mid \theta, P) = \mu(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, p_{s_t,a_t}(s_{t+1})$, and the total discounted reward along the trajectory is $R(\tau) = \sum_{t \ge 0} \gamma^t r_{s_t,a_t,s_{t+1}}$ (Puterman, 2005; Sutton & Barto, 2018). The value function for a policy $\pi_\theta$ and transition probability $P$ is $v_P(s) = \mathbb{E}\big[ R(\tau) \mid s_0 = s \big]$, and the total return is $\rho(\pi_\theta, P, r) = \mathbb{E}_{\tau \sim \Pr(\cdot \mid \theta, P)}\big[ R(\tau) \big] = \mu^\top v_P$. Because the RMDP setting considers different possible transition probabilities within the ambiguity set $\mathcal{P}$, we use a subscript (e.g., $v_P$) to indicate which one is used, in case it is not clear from the context.
We define a robust value function for an ambiguity set $\mathcal{P}$ as $\hat{v}(s) = \min_{P \in \mathcal{P}} v_P(s)$. Similar to ordinary MDPs, the robust value function can be computed using the robust Bellman operator $\mathfrak{T}$ (Iyengar, 2005; Nilim & Ghaoui, 2005):
$$(\mathfrak{T} v)(s) = \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{s,a}} \; p^\top \big( r_{s,a} + \gamma v \big).$$
The optimal robust value function $\hat{v}^\star$ and the robust value function $\hat{v}^\pi$ for a policy $\pi$ are unique and satisfy $\hat{v}^\star = \mathfrak{T} \hat{v}^\star$ and $\hat{v}^\pi = \mathfrak{T}_\pi \hat{v}^\pi$ (Iyengar, 2005). The robust return for a policy $\pi_\theta$ and ambiguity set $\mathcal{P}$ is defined as (Nilim & Ghaoui, 2005; Russel & Petrik, 2019a):
$$\hat{\rho}(\pi_\theta, \mathcal{P}, r) = \min_{P \in \mathcal{P}} \; \mathbb{E}_{\tau \sim \Pr(\cdot \mid \theta, P)}\big[ R(\tau) \big] = \min_{P \in \mathcal{P}} \; \mu^\top v_P,$$
where $\mu$ is the initial state distribution.
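The inner minimization over an $L_1$-bounded ambiguity set, which appears in both the robust Bellman operator and the robust return, admits a simple greedy solution: shift up to half the budget of probability mass onto the lowest-value next state and remove the same mass from the highest-value states. A minimal sketch (our own illustration, not the paper's code):

```python
import numpy as np

def worst_case_expectation(p_bar, v, psi):
    """min_{p in simplex, ||p - p_bar||_1 <= psi} p^T v, by greedy mass shifting."""
    p = p_bar.copy()
    # add up to psi/2 mass to the worst (lowest-value) next state
    i_min = np.argmin(v)
    eps = min(psi / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    # remove the same amount of mass, taking from the highest-value states first
    for i in np.argsort(v)[::-1]:
        if eps <= 0:
            break
        if i == i_min:
            continue
        removed = min(eps, p[i])
        p[i] -= removed
        eps -= removed
    return p @ v
```

Applying this per state-action pair with $v$ replaced by $r_{s,a} + \gamma v$ evaluates the inner minimum of the robust Bellman operator.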
Constrained RMDP (RCMDP)
In addition to the rewards of the RMDPs described above, we incorporate a constraint cost $d_{s,a,s'} \in \mathbb{R}$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, representing some kind of constraint on the safety of the agent's behavior. Consider, for example, an autonomous car that makes money (reward $r$) for each completed trip but incurs a big fine (constraint cost $d$) for traffic violations or a collision. We define the constraint cost to be a negative reward, which brings consistency in representing the worst case with a minimum over the ambiguity set for both the objective and the constraint. An associated constraint budget $\beta$ describes the total budget for constraint violations. This arrangement resembles the constrained-MDP setting described in (Altman, 2004), but with additional robustness.
Similar to the reward-based estimates described above, the total constraint cost along a trajectory $\tau$ is $D(\tau) = \sum_{t \ge 0} \gamma^t d_{s_t,a_t,s_{t+1}}$, the robust constraint value function for a policy $\pi$ and ambiguity set $\mathcal{P}$ is $\hat{v}_d(s) = \min_{P \in \mathcal{P}} \mathbb{E}\big[ D(\tau) \mid s_0 = s \big]$, and the robust constraint return is $\hat{\rho}(\pi_\theta, \mathcal{P}, d) = \min_{P \in \mathcal{P}} \mathbb{E}_{\tau}\big[ D(\tau) \big]$. Similar to $\hat{v}$, the optimal constraint value function is also unique and independently satisfies the Bellman optimality equation (Altman, 2004). We now formally define the objective of a Robust Constrained MDP (RCMDP) as:
$$\max_{\pi_\theta \in \Pi} \; \hat{\rho}(\pi_\theta, \mathcal{P}, r) \quad \text{s.t.} \quad \hat{\rho}(\pi_\theta, \mathcal{P}, d) \ge \beta. \quad (2)$$
This objective resembles the objective of a CMDP (Altman, 2004), but with additional robustness integrated through the quantification of the uncertainty about the model. The interpretation of the objective is to find a policy that maximizes the worst-case return estimate while satisfying the constraint in all possible situations.
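For intuition, the two quantities in the RCMDP objective can be estimated by Monte Carlo rollouts: the discounted reward return and the discounted constraint return, the latter compared against the budget. The toy sketch below uses a fixed nominal transition model and our own naming assumptions; the robust versions would additionally minimize over the ambiguity set.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(pi, P, r, d, gamma, s0, horizon=50):
    """Sample one trajectory; return its discounted reward and constraint returns."""
    s, R, D, g = s0, 0.0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(len(pi[s]), p=pi[s])         # sample action from policy
        s2 = rng.choice(P.shape[2], p=P[s, a])      # sample next state
        R += g * r[s, a, s2]
        D += g * d[s, a, s2]
        g *= gamma
        s = s2
    return R, D

def mc_returns(pi, P, r, d, gamma, mu, n=2000):
    """Monte Carlo estimates of the reward return and constraint return."""
    ests = [rollout(pi, P, r, d, gamma, rng.choice(len(mu), p=mu)) for _ in range(n)]
    R, D = np.mean(ests, axis=0)
    return R, D  # the constraint (for this fixed P) is satisfied if D >= beta
```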
3 Robust Constrained Optimization
A standard approach for solving the optimization problem (2) is to apply the Lagrange relaxation procedure (Bertsekas, 2003, Ch. 3), which turns it into an unconstrained optimization problem:
$$\max_{\theta} \; \min_{\lambda \ge 0} \; L(\theta, \lambda), \qquad L(\theta, \lambda) := \hat{\rho}(\pi_\theta, \mathcal{P}, r) + \lambda \big( \hat{\rho}(\pi_\theta, \mathcal{P}, d) - \beta \big), \quad (1)
where $\lambda \ge 0$ is known as the Lagrange multiplier. Note that the objective in (1) is nonconvex and therefore not tractable. The dual function of (1) involves a pointwise maximum with respect to $\theta$ and is written as $g(\lambda) = \max_{\theta} L(\theta, \lambda)$ (Paternain et al., 2019).
The dual function $g(\lambda)$ provides an upper bound on (1) and therefore needs to be minimized to contract the gap from optimality: $\min_{\lambda \ge 0} g(\lambda)$.
This dual problem is convex and tractable, but the question remains of how large the duality gap is; in other words, how suboptimal the solution of the dual problem is with respect to the solution of the original problem stated in (2). To answer this question, Paternain et al. (2019) show that strong duality holds in this case under some mild conditions, and that the duality gap is arbitrarily small even with the parameterization $\pi_\theta$ of the policies. We thus aim to optimize the dual version of this problem using gradients.
Proposition 1.
The relaxed RCMDP objective of (1) can be restated as:
$$L(\theta, \lambda) = \min_{P \in \mathcal{P}} \; \mathbb{E}_{\tau \sim \Pr(\cdot \mid \theta, P)} \big[ R(\tau) + \lambda \big( D(\tau) - \beta \big) \big]. \quad (3)
Proof.
We defer the detailed derivation to Appendix A.1. ∎
The goal is then to find a saddle point $(\theta^\star, \lambda^\star)$ of $L(\theta, \lambda)$ in (3), i.e., a point satisfying $L(\theta, \lambda^\star) \le L(\theta^\star, \lambda^\star) \le L(\theta^\star, \lambda)$ for all $\theta$ and all $\lambda \ge 0$. This is achieved by ascending in $\theta$ and descending in $\lambda$ using the gradients of the objective with respect to $\theta$ and $\lambda$, respectively (Chow & Ghavamzadeh, 2014).
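The ascend-in-$\theta$ / descend-in-$\lambda$ scheme can be illustrated on a toy constrained problem with the same Lagrangian structure as (1). Here $f(\theta) = -\theta^2$ stands in for the robust reward return and $g(\theta) = \theta$ for the robust constraint return, with budget $\beta = 1$; these are purely illustrative stand-ins, not the paper's objective.

```python
def primal_dual(steps=20000, eta_theta=0.01, eta_lam=0.001):
    """Saddle-point search: gradient ascent in theta, projected descent in lambda.

    L(theta, lam) = f(theta) + lam * (g(theta) - beta),
    with f(theta) = -theta**2, g(theta) = theta, beta = 1.
    The optimum is theta = 1 with multiplier lam = 2.
    """
    theta, lam, beta = 0.0, 0.0, 1.0
    for _ in range(steps):
        grad_theta = -2.0 * theta + lam           # dL/dtheta
        grad_lam = theta - beta                   # dL/dlambda = g(theta) - beta
        theta += eta_theta * grad_theta           # ascend the primal variable (fast)
        lam = max(0.0, lam - eta_lam * grad_lam)  # descend the dual variable (slow), keep lam >= 0
    return theta, lam
```

Note the two step sizes: the primal update runs on a faster timescale than the dual one, mirroring the two-timescale structure used by the algorithm in the next subsection.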
Theorem 3.1.
The gradient of $L(\theta, \lambda)$ with respect to $\theta$ and $\lambda$ can be computed as:
$$\nabla_\theta L(\theta, \lambda) = \mathbb{E}_{\tau \sim \Pr(\cdot \mid \theta, P)} \Big[ \big( R(\tau) + \lambda D(\tau) \big) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big], \qquad \nabla_\lambda L(\theta, \lambda) = \mathbb{E}_{\tau \sim \Pr(\cdot \mid \theta, P)} \big[ D(\tau) \big] - \beta,$$
where $P$ is the worst-case transition function attaining the minimum in (3).
Proof.
See Appendix A.2 for the detailed derivation. ∎
With a fixed Lagrange multiplier $\lambda$, the constraint budget $\beta$ in (3) offsets the objective by a constant amount. We can therefore omit this constant and define a Bellman operator for RCMDPs. We then show that this operator is a contraction.
Proposition 2.
(Bellman Equation) For a fixed policy $\pi$ and discount factor $\gamma$, the RCMDP value function $v^\pi$ satisfies a Bellman equation for each $s \in \mathcal{S}$:
$$v^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \; \min_{p \in \mathcal{P}_{s,a}} \; p^\top \big( z_{s,a} + \gamma v^\pi \big), \quad (4)$$
where $z_{s,a,s'} = r_{s,a,s'} + \lambda\, d_{s,a,s'}$.
Proof.
The proof is deferred to Appendix A.3. ∎
We define the Bellman optimality equation for RCMDPs as:
$$v^\star(s) = \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{s,a}} \; p^\top \big( z_{s,a} + \gamma v^\star \big). \quad (5)
Proposition 3.
(Contraction) The Bellman operator defined in (5) is a contraction.
Proof.
The proof follows directly from Theorem 3.2 of Iyengar (2005). ∎
The RCMDP Bellman operator therefore satisfies the Bellman optimality equation and converges to the fixed point $v^\star$, the optimal RCMDP value function.
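The contraction property can be checked numerically: iterating the RCMDP optimality operator on the shifted reward $z = r + \lambda d$ makes the sup-norm difference between successive iterates shrink by at least a factor $\gamma$. A tabular sketch under our own naming assumptions:

```python
import numpy as np

def worst_p(p_bar, u, psi):
    """Adversarial distribution in the L1 ball around p_bar w.r.t. values u (greedy mass shift)."""
    p = p_bar.copy()
    i = np.argmin(u)
    eps = min(psi / 2.0, 1.0 - p[i])
    p[i] += eps
    for j in np.argsort(u)[::-1]:           # remove mass from the highest-value states
        if eps <= 0 or j == i:
            continue
        cut = min(eps, p[j]); p[j] -= cut; eps -= cut
    return p

def rcmdp_value_iteration(p_bar, psi, r, d, lam, gamma, iters=200):
    """Iterate the RCMDP optimality operator on z = r + lam * d; return v and per-iteration gaps."""
    S, A, _ = p_bar.shape
    z = r + lam * d
    v = np.zeros(S)
    gaps = []
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                u = z[s, a] + gamma * v
                q[s, a] = worst_p(p_bar[s, a], u, psi[s, a]) @ u
        v_new = q.max(axis=1)               # max over actions, min over transitions inside worst_p
        gaps.append(np.max(np.abs(v_new - v)))
        v = v_new
    return v, gaps
```

If the operator is a $\gamma$-contraction, each gap is at most $\gamma$ times the previous one, so the iteration converges geometrically to the fixed point.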
Policy Gradient Algorithm
Algorithm 1 presents a robust constrained policy gradient (RCPG) algorithm based on the gradient update rules derived above in Theorem 3.1. The algorithm proceeds in an episodic fashion, sampling trajectories and updating the parameters based on Monte Carlo estimates. The algorithm requires an ambiguity set as its input, which can be constructed from empirical estimates for smaller problems (Wiesemann et al., 2013; Russel & Petrik, 2019a; Behzadian et al., 2021); for larger problems, it can be a parameterized estimate instead (Janner et al., 2019).
[Algorithm 1: Robust Constrained Policy Gradient (RCPG)]
The step-size schedules used in Algorithm 1 satisfy the standard conditions for stochastic approximation algorithms (Borkar, 2009). That is, the $\theta$-update is on the fastest timescale, whereas the $\lambda$-update is on a slower timescale, which results in a two-timescale stochastic approximation algorithm. We derive its convergence to a saddle point below.
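For example, polynomially decaying schedules of different orders satisfy these standard conditions: both sums diverge, both squared sums converge, and the slower ($\lambda$) step size is $o(\cdot)$ of the faster ($\theta$) step size. A sketch with illustrative exponents; the exact schedules used in the experiments are not specified here.

```python
def zeta_fast(k):
    """Faster step size for the theta update, O(1/k^0.6)."""
    return 1.0 / (k + 1) ** 0.6

def zeta_slow(k):
    """Slower step size for the lambda update, O(1/k); note zeta_slow(k)/zeta_fast(k) -> 0."""
    return 1.0 / (k + 1)
```

Both exponents lie in $(1/2, 1]$, so the sums diverge while the squared sums converge, and the ratio $\zeta_{\text{slow}}(k) / \zeta_{\text{fast}}(k) = (k+1)^{-0.4}$ vanishes, separating the two timescales.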
Actor Critic Algorithm
The general issue of high variance in Monte-Carlo-based policy gradient algorithms can be handled by introducing state values as baselines (Sutton & Barto, 2018). As the optimal value function for RCMDPs can be computed using Bellman-style recursive updates, as shown in (4), an extension of the above PG algorithm to the actor-critic framework is straightforward. Algorithm 2, reported in Appendix A.4, presents an actor-critic (AC) algorithm for RCMDPs. The state-value parameterization brings a new dimension into Algorithm 2 and results in a three-timescale stochastic approximation algorithm. The convergence properties of this AC algorithm can be derived in a way similar to Theorem 3.2, and we therefore omit the detailed derivations.

4 Stable Robust-Constrained RL: Lyapunov-based RCMDP Concept
In this section, we propose Lyapunov-based reward shaping for RCMDPs.^1 The motivation for this is threefold: i) learn a good policy faster, ii) serve as a proxy to guide robustness when an estimate for the value function is not readily available, and iii) guarantee stability (in the sense of Lyapunov) of the learning process. We first briefly introduce the idea of Lyapunov stability, Lyapunov functions, and some of their useful characteristics. We then introduce an additive shaping-reward strategy based on Lyapunov functions and analyze its properties.

^1 Other works have applied different notions of Lyapunov stability in the context of model-based RL (Farahmand & Benosman, 2017) and MDPs (Perkins & Barto, 2000; Chow et al., 2018); however, none of these works incorporate explicit robustness in their formulation, i.e., the RCMDP context.
Definition 1.
(Lyapunov stability) (Haddad, 2008) Consider the general nonlinear discrete-time system $x_{k+1} = f(x_k)$, $x_k \in \mathcal{D}$, (Sy), where $\mathcal{D} \subseteq \mathbb{R}^n$ is an open set containing the equilibrium point $x_e$, and $f$ is a continuous function on $\mathcal{D}$. Then the equilibrium point $x_e$ of (Sy), satisfying $f(x_e) = x_e$, is said to be:
- Lyapunov stable if, for every $\epsilon > 0$, there exists $\delta = \delta(\epsilon) > 0$ s.t. if $\| x_0 - x_e \| < \delta$, then $\| x_k - x_e \| < \epsilon$ for all $k \ge 0$;
- Asymptotically stable if it is Lyapunov stable and there exists $\delta > 0$ s.t. if $\| x_0 - x_e \| < \delta$, then $\lim_{k \to \infty} \| x_k - x_e \| = 0$.
Definition 2.
(Lyapunov direct method) (Haddad, 2008) Consider the system (Sy), and assume that there exists a continuous Lyapunov function $V \colon \mathcal{D} \to \mathbb{R}$ s.t.
(6a) $V(x_e) = 0$,
(6b) $V(x) > 0$ for all $x \in \mathcal{D}$, $x \ne x_e$,
(6c) $V(f(x)) - V(x) \le 0$ for all $x \in \mathcal{D}$,
then the equilibrium point $x_e$ is Lyapunov stable. If, in addition, $V(f(x)) - V(x) < 0$ for all $x \in \mathcal{D}$, $x \ne x_e$, then $x_e$ is asymptotically stable.
4.1 Stability Constraints for RMDPs
We propose to incorporate the Lyapunov stability descent property (6c) as a constraint in the RCMDP objective (2), where the constraint cost is given by $d_{s_k} = -\big( V(s_{k+1}) - V(s_k) \big)$. We set the budget $\beta = 0$ to enforce Lyapunov stability, or $\beta > 0$ to achieve asymptotic stability. Note that in this setting we assume that the only constraint cost is the stability cost $d$, and thus we are in the setting of RMDPs to which we add a virtual stability constraint cost. In this setting, we apply Algorithm 1 to obtain a Lyapunov stable-RCPG algorithm, and use the results of Theorem 3.2 to deduce its asymptotic convergence to a locally optimal stationary policy in the infinite-horizon case. We summarize this in the following proposition.
Proposition 4.
Under assumptions (A1)-(A7) as stated in Section A.5, with $\beta = 0$, the sequence of parameter updates of Algorithm 1 converges almost surely to a locally optimal, a.s. Lyapunov stable policy $\pi_{\theta^\star}$ as $k \to \infty$. Furthermore, if $\beta > 0$, the policy is a.s. asymptotically stable.
Proof.
Consider the control problem defined by (2), under assumptions (A1)-(A7), with the stability constraint cost $d$ defined above. Then, based on Theorem 3.2, we can conclude that Algorithm 1 converges asymptotically almost surely to a locally optimal policy $\pi_{\theta^\star}$. Furthermore, since $\pi_{\theta^\star}$ is computed under the constraint of the Lyapunov descent property in expectation, the equilibrium point of the controlled system is a.s.^2 Lyapunov stable (Definition 3.5, Mahmoud et al. (2003)) when $\beta = 0$, and a.s. asymptotically Lyapunov stable (Definition 3.8, Mahmoud et al. (2003)) when $\beta > 0$. ∎

^2 Almost surely (a.s.) (asymptotic) Lyapunov stability is to be understood as (asymptotic) Lyapunov stability for almost all samples of the states.
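To make the virtual stability cost concrete, the following sketch evaluates it along a trajectory of a contracting linear system with the candidate Lyapunov function $V(x) = x^\top x$. The sign convention (cost positive on descent, so that the constraint return stays $\ge \beta = 0$) is our own assumption, chosen to be consistent with the constraint $\hat{\rho}(\pi, \mathcal{P}, d) \ge \beta$.

```python
import numpy as np

def V(x):
    """Candidate Lyapunov function (assumed given by the designer)."""
    return float(np.dot(x, x))

def stability_cost(x, x_next):
    """Virtual constraint cost for one transition.

    Sign convention (our assumption): the cost is positive when V descends, so
    that the constraint return >= beta with beta = 0 enforces descent in
    expectation (Lyapunov stability), and beta > 0 enforces strict descent.
    """
    return -(V(x_next) - V(x))

# check on a contracting linear system x_{k+1} = 0.8 * x_k
x = np.array([2.0])
costs = []
for _ in range(20):
    x_next = 0.8 * x
    costs.append(stability_cost(x, x_next))
    x = x_next
# every transition descends V, so each per-step stability cost is nonnegative
```

For this system every per-step cost equals $0.36\,x_k^2 \ge 0$, so the (discounted) stability return is nonnegative and the constraint with $\beta = 0$ is satisfied.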
4.2 Stability Constraints for RCMDPs
In the case where the problem at hand is an RCMDP with a constraint cost (e.g., physical obstacle-avoidance constraints for a mobile robot), we propose two ways to incorporate the stability descent condition: as a hard constraint cost, as in Sec. 4.1, or as a soft constraint. For the latter, we draw a parallel between the notion of soft constraints, where the Lyapunov descent condition is not enforced as a constraint cost, and reward shaping (Ng et al., 1999). Indeed, we propose to add the Lyapunov stability descent condition directly to the reward of the RCMDP (2).
Reward Shaping with Lyapunov Constraint
We define the shaping reward function based on this Lyapunov descent property as:
$$\bar{r}_{s_t,a_t,s_{t+1}} = r_{s_t,a_t,s_{t+1}} + V(s_t) - \gamma V(s_{t+1}). \quad (7)$$
The motivation behind this is quite intuitive: a transition in the descent direction of $V$ leads to a desired region of the state space faster and therefore should be rewarded. So, if we were to receive a reward $r_{s_t,a_t,s_{t+1}}$ in the original setting, we instead pretend to receive the reward $\bar{r}_{s_t,a_t,s_{t+1}}$ on the same event. This yields a transformed RCMDP with the same state space, action space, and transition probabilities; only the reward function is reshaped with the additional shaping signal.
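The shaping transformation can be sanity-checked numerically. Written in the potential-based form of Ng et al. (1999) with potential $\Phi = -V$ (our assumed reading of (7)), the shaped return differs from the original return only by the policy-independent term $V(s_0) - \gamma^T V(s_T)$, which telescopes out; all numbers below are toy values.

```python
def shaped_reward(r, V_s, V_s_next, gamma):
    """Potential-based shaping with potential Phi = -V (assumed form, consistent
    with Ng et al. (1999)): descending the Lyapunov function is rewarded."""
    return r + (V_s - gamma * V_s_next)

gamma = 0.9
V_traj = [4.0, 2.5, 1.2, 0.3, 0.0]   # V along a sampled trajectory (toy numbers)
r_traj = [1.0, -0.5, 2.0, 0.7]       # original rewards along the same trajectory

orig = sum(gamma**t * r for t, r in enumerate(r_traj))
shaped = sum(gamma**t * shaped_reward(r_traj[t], V_traj[t], V_traj[t + 1], gamma)
             for t in range(len(r_traj)))
# telescoping: shaped == orig + V(s_0) - gamma**T * V(s_T)
```

Because the difference does not depend on the policy, the transformation preserves the policy ordering, which is the intuition behind Theorem 4.1 below only insofar as the shaping is potential-based.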
Theorem 4.1.
Every optimal finite-horizon policy in the transformed RCMDP is also an optimal finite-horizon policy in the original RCMDP under the Lyapunov-based reward transformation stated in (7). Furthermore, under the assumption of a transient MDP, every optimal infinite-horizon policy in the transformed RCMDP is also an optimal infinite-horizon policy in the original RCMDP.
Proof.
In the finite-horizon case, this result is a simple extension of Theorem 1 of Ng et al. (1999) to the RCMDP setting, and the proof follows directly from Ng et al. (1999). In the infinite-horizon case, one needs to rely on the transience assumption for the MDP (in the sense of Def. 7.1 in Altman (2004)) to conclude the convergence of the finite-horizon problem to the infinite-horizon problem, using the arguments of Theorem 15.1 in Altman (2004). See Section B.1 for the full derivation. ∎
Remark 4.2.
Note that the concept of Lyapunov reward transformation is independent of the RL algorithm, and can thus be applied with any existing mainstream approach, such as TRPO, PPO, or CPO. The Lyapunov reward transformation allows faster convergence for these existing approaches, as verified in our empirical analysis.
5 Conclusion
In this paper, we studied robust constrained MDPs (RCMDPs) to simultaneously deal with constraints and model uncertainties in reinforcement learning. We proposed the RCMDP framework, derived the related theoretical analysis, and proposed algorithms to optimize the RCMDP objective. We also proposed Lyapunov-RCMDPs (L-RCMDPs), an extension of RCMDPs based on a Lyapunov function, and analyzed our L-RCMDP algorithms in the context of reward shaping. We provided theoretical analysis of Lyapunov stability and asymptotic convergence for our methods, and empirically validated the proposed algorithms on three different problem domains. Future work should focus on automated learning of the Lyapunov function from the domain itself, and on applying the proposed approach to more complex practical problem domains.
References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. International Conference on Machine Learning, 2017.

Altman, E. Constrained Markov Decision Processes. 2004.

Behzadian, B., Russel, R. H., Petrik, M., and Ho, C. P. Optimizing percentile criterion using robust MDPs. International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.

Benosman, M. Model-based vs data-driven adaptive control: An overview. International Journal of Adaptive Control and Signal Processing, 2018.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 2003.

Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. International Statistical Review, 2009.

Chamiea, M. E., Yu, Y., and Acikmese, B. Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In IEEE American Control Conference, pp. 6290–6295, 2016.

Chow, Y. and Ghavamzadeh, M. Algorithms for CVaR optimization in MDPs. Advances in Neural Information Processing Systems, 2014.

Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, 2018.

Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces, 2018.

Derman, E., Mankowitz, D. J., Mann, T. A., and Mannor, S. Soft-robust actor-critic policy-gradient. Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

Farahmand, A.-M. and Benosman, M. Towards stability in learning-based control: A Bayesian optimization-based adaptive controller. In The Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2017.

Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-shot visual imitation learning via meta-learning. In Levine, S., Vanhoucke, V., and Goldberg, K. (eds.), 2017.

Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 2005.

Haddad, W. M. Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press, 2008.

Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 2005.

Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. arXiv, 2019.

Le Tallec, Y. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.

Lin, S. C., Zhang, Y., Hsu, C. H., Skach, M., Haque, M. E., Tang, L., and Mars, J. The architectural implications of autonomous driving: Constraints and acceleration. ACM SIGPLAN Notices, 2018.

Mahmoud, M. M., Jiang, J., and Zhang, Y. Active Fault Tolerant Control Systems: Stochastic Analysis and Synthesis. Springer, 2003.

Ng, A., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, 1999.

Nilim, A. and Ghaoui, L. E. Robust solutions to Markov decision problems with uncertain transition matrices. Operations Research, 53(5):780, 2004.

Nilim, A. and Ghaoui, L. E. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, Sep 2005. ISSN 0030-364X. doi: 10.1287/opre.1050.0216.

Paternain, S., Chamon, L. F., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Conference on Neural Information Processing Systems, 2019.

Perkins, T. J. and Barto, A. G. Lyapunov-constrained action sets for reinforcement learning. 2000.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.

Russel, R. H. and Petrik, M. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. Advances in Neural Information Processing Systems (NeurIPS), 2019a.

Russel, R. H. and Petrik, M. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. Advances in Neural Information Processing Systems, 2019b.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

Tamar, A., Glassner, Y., and Mannor, S. Optimizing the CVaR via sampling. 2014.

Vamvoudakis, K., Antsaklis, P., Dixon, W., Hespanha, J., Lewis, F., Modares, H., and Kiumarsi, B. Autonomy and machine intelligence in complex systems: A tutorial. 2015.

van Baar, J., Sullivan, A., Corcodel, R., Jha, D., Romeres, D., and Nikovski, D. N. Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. In IEEE International Conference on Robotics and Automation (ICRA), 2019.

Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 2013.
Appendix A RCMDP Derivations
A.1 Proof of Proposition 1
We rewrite the objective (1) and perform some algebraic manipulation as below:
where $\mathcal{T}_r$ is the set of all possible trajectories induced by policy $\pi_\theta$ under the transition function minimizing the reward-based return, and $\mathcal{T}_d$ is the set of all possible trajectories induced by $\pi_\theta$ under the transition function minimizing the constraint-based return. The first step above follows by assuming that the initial state distribution concentrates all of its mass on a single state $s_0$, and the second step follows from the definitions of $R(\tau)$ and $D(\tau)$. Note that $\mathcal{T}_r$ and $\mathcal{T}_d$ are distinct and independent, and depend on the rewards and constraint costs, respectively. However, the rewards and constraint costs are coupled together in reality, meaning that the two sets of trajectories $\mathcal{T}_r$ and $\mathcal{T}_d$ would not differ. We therefore select one set of trajectories $\mathcal{T}$, being either $\mathcal{T}_r$ or $\mathcal{T}_d$. This selection of $\mathcal{T}$ may happen based on our priorities toward robustness of the reward (with corresponding trajectory set $\mathcal{T}_r$) or of the constraint cost (with corresponding trajectory set $\mathcal{T}_d$); alternatively, it can be the better set (e.g., yielding the higher objective value) among $\mathcal{T}_r$ and $\mathcal{T}_d$ that satisfies the constraint. We then have a simplified formulation of $L(\theta, \lambda)$ as below:
A.2 Proof of Theorem 3.1
Proof.
The objective as specified in (3):
We first derive the gradient update rule of $L(\theta, \lambda)$ with respect to $\theta$ as below:
Next, we derive the gradient update rule of $L(\theta, \lambda)$ with respect to $\lambda$:
∎
A.3 Proof of Proposition 2
Proof.
Here, the first step follows by expanding the total return given a trajectory, and the second step follows by evaluating the one-step immediate transition separately. ∎
A.4 Actor-Critic Algorithm
A.5 Convergence Analysis of Algorithm 1
Assumptions
(A1) For any state $s$ and action $a$, the policy $\pi_\theta(a \mid s)$ is continuously differentiable with respect to the parameter $\theta$, and $\nabla_\theta \pi_\theta(a \mid s)$ is a Lipschitz function in $\theta$ for every $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
(A2) The step-size schedules $\{\zeta_1(k)\}$ (for $\lambda$) and $\{\zeta_2(k)\}$ (for $\theta$) satisfy:
$$\sum_k \zeta_1(k) = \sum_k \zeta_2(k) = \infty, \quad (8)$$
$$\sum_k \big( \zeta_1(k)^2 + \zeta_2(k)^2 \big) < \infty, \quad (9)$$
$$\zeta_1(k) = o\big( \zeta_2(k) \big). \quad (10)$$
These assumptions are standard step-size conditions for stochastic approximation algorithms (Borkar, 2009). Equation (8) ensures that the discretization covers the entire time axis. Equation (9) ensures that the errors resulting from the discretization of the Ordinary Differential Equation (ODE) and the errors due to the noise both become negligible asymptotically with probability one (Borkar, 2009). Equations (8) and (9) together ensure that the iterates asymptotically capture the behavior of the ODE. Equation (10) mandates that the updates corresponding to $\lambda$ are on a slower timescale than those of $\theta$.

A.5.1 Policy Gradient Algorithm
The general stochastic approximation scheme used by Borkar (2009) is of the form:
$$x_{k+1} = x_k + \zeta(k) \big[ h(x_k) + M_{k+1} \big], \quad (11)$$
where $\{M_{k+1}\}$ is a sequence of integrable random variables representing the noise, and $\zeta(k)$ are the step sizes. The expression inside the square brackets is a noisy measurement in which $h(x_k)$ and $M_{k+1}$ are not separately available; only their sum is available. The terms of (11) need to satisfy the additional assumptions below:

(A3)
The function $h$ is Lipschitz. That is, $\| h(x) - h(y) \| \le L \| x - y \|$ for some $L > 0$.
(A4)
$\{M_k\}$ is a martingale difference sequence: $\mathbb{E}\big[ M_{k+1} \mid \mathcal{F}_k \big] = 0$. In addition, $M_k$ are square-integrable: $\mathbb{E}\big[ \| M_{k+1} \|^2 \mid \mathcal{F}_k \big] \le K \big( 1 + \| x_k \|^2 \big)$ for some constant $K > 0$.
Our proposed policy gradient algorithm is a two-timescale stochastic approximation algorithm. The parameter update iterations of the policy gradient algorithm are defined as below:
(12) 
(13) 
(14) 
where the last term is a zero-mean i.i.d. random variable representing noise. To apply the general convergence analysis techniques derived for (11) in Borkar (2009), we take the special form in (14) and transform it to the general format of (11) as below:
(15) 
With these transformation techniques, we obtain the general update for $\theta$ from (12):
$\theta$ update:
(16) 
where $h(\theta_k)$ is the gradient of $L(\theta, \lambda)$ w.r.t. $\theta$, $x_k = \theta_k$, and $\zeta(k) = \zeta_2(k)$. Note that the noise term is omitted because the noise is inherent in our sample-based iterations.
Proposition 5.
$h(\theta)$ of (16) is Lipschitz in $\theta$.
Proof.
Recall that the gradient of $L(\theta, \lambda)$ with respect to $\theta$ is:
(17) 
Assumption (A1) implies that $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ in equation (17) is a Lipschitz function in $\theta$ for any $s$ and $a$. As the expectation of a sum of finitely many Lipschitz functions is also Lipschitz, we conclude that $h(\theta)$ is Lipschitz in $\theta$.
∎
Proposition 6.
The noise term of (16) satisfies assumption (A4).
We transform our update rule of (13) as:
$\lambda$ update:
(18) 
where $h(\lambda_k)$ is the gradient of $L(\theta, \lambda)$ w.r.t. $\lambda$, $x_k = \lambda_k$, and $\zeta(k) = \zeta_1(k)$.
Notice that this gradient is a constant function of $\lambda$, and therefore $h(\lambda)$ is a constant function of $\lambda$.
Proposition 7.
The noise term of (18) satisfies assumption (A4).
(19) 
(20) 
With assumption (A2), $\lambda$ is quasi-static from the perspective of $\theta$, turning (19) into an ODE in which $\lambda$ is held fixed:
(21) 
We additionally assume that:
(A5)
(21) has a globally asymptotically stable equilibrium $\theta^\star(\lambda)$ such that $\theta^\star(\cdot)$ is a Lipschitz map.
Assumption (A5) turns (20) into:
(22) 
Let us further assume that:
(A6)
The ODE (22) has a globally asymptotically stable equilibrium $\lambda^\star$.
(A7)
$\sup_k \| \theta_k \| < \infty$ and $\sup_k \| \lambda_k \| < \infty$ almost surely.
Proof of Theorem 3.2
Proof.
Above are the necessary conditions to apply Theorem 2 from chapter 6 of Borkar (2009), which shows that