Reinforcement learning (RL) is a framework for addressing sequential decision-making problems (Sutton & Barto, 2018; Szepesvári, 2010). In RL, a decision maker learns a policy that optimizes a long-term objective by interacting with an unknown or partially known environment. At each time step the RL agent obtains evaluative feedback for its actions, usually known as reward or cost, which allows it to improve the performance of subsequent actions (Sutton & Barto, 2018).
With the advent of deep learning, RL has witnessed huge successes in recent times (Silver et al., 2017). However, since most of these methods rely on model-free RL, several unsolved challenges restrict the use of these algorithms in many safety-critical physical systems (Vamvoudakis et al., 2015; Benosman, 2018). For example, it is very difficult for most model-free RL algorithms to ensure basic properties such as stability of solutions or robustness with respect to model uncertainties. This has led to several research directions that study how to incorporate robustness, constraint satisfaction, and safe exploration during learning for safety-critical applications. While robust constraint satisfaction and stability guarantees are highly desirable properties, they are also very challenging to incorporate in RL algorithms. The main goal of our work is to formulate this incorporation as robust constrained MDPs (RCMDPs) and to derive the theory necessary to solve them.
Constrained Markov Decision Processes (CMDPs) are a superclass of MDPs that incorporate expected cumulative cost constraints (Altman, 2004). Several solution methods have been proposed in the literature for solving CMDPs: trust-region-based methods (Achiam et al., 2017), linear-programming-based solutions (Altman, 2004), surrogate-based methods (Chamiea et al., 2016; Dalal et al., 2018), and Lagrangian methods (Geibel & Wysotzki, 2005; Altman, 2004). We refer to these CMDPs as non-robust, since they do not take model uncertainties into account. Another line of work, known as Robust MDPs (RMDPs), explicitly handles model uncertainties (Nilim & Ghaoui, 2004; Wiesemann et al., 2013). RMDPs consider a set of plausible models drawn from so-called ambiguity sets and compute solutions that perform well even for the worst possible realization of the model (Russel & Petrik, 2019a; Wiesemann et al., 2013; Iyengar, 2005). However, unlike CMDPs, RMDPs are not capable of handling safety constraints.
Safety constraints are important in real-life applications (Altman, 2004), where one often cannot afford to risk violating given constraints. For example, in autonomous cars there are hard safety constraints on the car velocities and steering angles (Lin et al., 2018). Moreover, in many practical applications training occurs in a simulated environment in order to mitigate the sample inefficiency of model-free RL algorithms (van Baar et al., 2019). The result is then transferred to the real world, typically followed by fine-tuning, a process referred to as Sim2Real. The simulator is by definition inaccurate with respect to the targeted problem, due to approximations and lack of system identification. Heuristic approaches like domain randomization (van Baar et al., 2019) and meta-learning (Finn et al., 2017) try to address model uncertainty in this setting, but they are often not theoretically sound. In safety-critical applications, a policy trained in simulation is expected to offer certain safety guarantees when transferred to the real world.
In light of these practical motivations, we propose to unite the two concepts of RMDPs and CMDPs, leading to a new framework we refer to as RCMDPs. The motivation is to ensure both safety and robustness. The goal of RCMDPs is to learn policies that simultaneously satisfy certain safety constraints and perform well under worst-case scenarios. The contributions of this paper are four-fold: 1) formulate the concept of RCMDPs and derive related theories, 2) propose gradient-based methods to optimize the RCMDP objective, 3) independently derive a Lyapunov-based reward shaping technique, and 4) empirically validate the utility of the proposed ideas on several problem domains.
The paper is organized as follows: Section 2 describes the formulation of our RCMDP framework and the objective we seek to optimize. A Lagrange-based approach is presented in Section 3 along with required gradient update formulas and corresponding policy optimization algorithms. Section 4 is dedicated to the Lyapunov stable RCMDPs and presents the idea of Lyapunov based reward shaping. We draw the concluding remarks in Section 5.
2 Problem Formulation: RCMDP concept
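We consider Robust Markov Decision Processes (RMDPs) with a finite number of states $\mathcal{S} = \{1, \ldots, S\}$ and a finite number of actions $\mathcal{A} = \{1, \ldots, A\}$. Every action $a \in \mathcal{A}$ is available for the decision maker to take in every state $s \in \mathcal{S}$. After taking an action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$, the decision maker transitions to a next state $s' \in \mathcal{S}$ according to the true, but unknown, transition probability $p^\star_{s,a} \in \Delta^S$ (the probability simplex over $\mathcal{S}$) and receives a reward $r_{s,a,s'} \in \mathbb{R}$. We use $p_{s,a} \in \Delta^S$ to denote the transition probabilities from $s \in \mathcal{S}$ and $a \in \mathcal{A}$, and condense them into a transition function $p = (p_{s,a})_{s \in \mathcal{S}, a \in \mathcal{A}}$. We condense the rewards to vectors $r_{s,a} = (r_{s,a,s'})_{s' \in \mathcal{S}}$ and $r = (r_{s,a})_{s \in \mathcal{S}, a \in \mathcal{A}}$.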
Our RMDP setting assumes that the transition $p_{s,a}$ is chosen adversarially from an ambiguity set $\mathcal{P}_{s,a}$ for each $s \in \mathcal{S}$ and $a \in \mathcal{A}$. An ambiguity set $\mathcal{P}_{s,a}$, defined for each state $s$ and action $a$, is a set of feasible transitions quantifying the uncertainty in transition probabilities. We restrict our attention to $(s,a)$-rectangular ambiguity sets, which simply assume independence between the transition probabilities of different state-action pairs (Le Tallec, 2007; Wiesemann et al., 2013). We define the $L_1$-norm bounded ambiguity sets around the nominal transition probability $\bar{p}_{s,a}$, estimated from some dataset $\mathcal{D}$, as:
$$\mathcal{P}_{s,a} = \{\, p \in \Delta^S : \| p - \bar{p}_{s,a} \|_1 \le \psi_{s,a} \,\},$$
where $\psi_{s,a} \ge 0$ is the budget of allowed deviations. This budget can be computed for each $s \in \mathcal{S}$, $a \in \mathcal{A}$ using the Hoeffding bound (Russel & Petrik, 2019b): $\psi_{s,a} = \sqrt{\frac{2}{n_{s,a}} \log \frac{S A 2^S}{\delta}}$, where $n_{s,a}$ is the number of transitions in dataset $\mathcal{D}$ originating from state $s$ and action $a$, and $\delta$ is the confidence level. This $\psi_{s,a}$, if used to compute a policy in RMDPs, guarantees that the computed return is a lower bound with probability $1 - \delta$. Note that this is just one specific choice of ambiguity set. Our method can be extended to other types of ambiguity sets, e.g., other norm-based, Bayesian, weighted, or sampling-based sets. We use $\mathcal{P}$ to refer generally to the ambiguity sets $\mathcal{P}_{s,a}$ over all state-action pairs and over the time steps remaining in the horizon. This collectively represents the ambiguity set along with the notion of independence between state-action pairs in a tabular setting with discrete states and actions. Sampling-based sets under approximate methods, e.g., neural networks, for large and continuous problems build on a similar notion of ambiguity sets (Tamar et al., 2014; Derman et al., 2018).
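As an illustration of the norm-bounded ambiguity set described above, the following minimal Python sketch computes a Hoeffding-style budget and tests whether a candidate transition vector lies inside the set. It assumes the $L_1$ form of the set and the budget formula $\sqrt{(2/n)\log(SA2^S/\delta)}$; the function names `hoeffding_l1_budget` and `in_l1_ambiguity_set` are hypothetical and not part of the original work.

```python
import math

def hoeffding_l1_budget(n_sa, num_states, num_actions, delta):
    """Budget for a state-action pair observed n_sa times.

    Assumed form: sqrt((2 / n) * log(S * A * 2^S / delta)).
    """
    if n_sa <= 0:
        return 2.0  # no data: an L1 ball of radius 2 covers the whole simplex
    log_term = (math.log(num_states * num_actions)
                + num_states * math.log(2.0)
                - math.log(delta))
    return min(2.0, math.sqrt(2.0 / n_sa * log_term))

def in_l1_ambiguity_set(p, p_bar, psi, tol=1e-9):
    """True if p is a distribution within L1 distance psi of the nominal p_bar."""
    is_distribution = abs(sum(p) - 1.0) <= tol and all(x >= -tol for x in p)
    l1_distance = sum(abs(x - y) for x, y in zip(p, p_bar))
    return is_distribution and l1_distance <= psi + tol

# The budget shrinks as more transitions are observed.
psi_small_sample = hoeffding_l1_budget(n_sa=10, num_states=3, num_actions=2, delta=0.05)
psi_large_sample = hoeffding_l1_budget(n_sa=10_000, num_states=3, num_actions=2, delta=0.05)
```

With few samples the budget is large and the set admits almost any transition vector; with many samples only small perturbations of the nominal estimate remain feasible.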
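A stationary randomized policy $\pi(\cdot \mid s)$ for state $s \in \mathcal{S}$ defines a probability distribution over actions $a \in \mathcal{A}$. The set of all randomized stationary policies is denoted by $\Pi$. We parameterize the randomized policy for state $s$ as $\pi_\theta(\cdot \mid s)$, where $\theta$ is a finite-dimensional parameter vector. Let $\xi = (s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T)$ be a sampled trajectory generated by executing a policy $\pi_\theta$ from a starting state $s_0 \sim p_0$ under transition probabilities $p \in \mathcal{P}$, where $p_0$ is the distribution of initial states. Then the probability of sampling a trajectory $\xi$ is $p^\theta(\xi) = p_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, and the total reward along the trajectory, with discount factor $\gamma$, is $g(\xi, r) = \sum_{t=0}^{T-1} \gamma^t r_{s_t, a_t, s_{t+1}}$ (Puterman, 2005; Sutton & Barto, 2018). The value function for a policy $\pi_\theta$ and transition probability $p$ is $v_p^{\pi_\theta}(s) = \mathbb{E}_{\xi \sim p^\theta}[\, g(\xi, r) \mid s_0 = s \,]$, and the total return is:
$$\rho(\pi_\theta, p, r) = \mathbb{E}_{s \sim p_0}[\, v_p^{\pi_\theta}(s) \,] = p_0^\top v_p^{\pi_\theta}.$$
Because the RMDP setting considers different possible transition probabilities within the ambiguity set $\mathcal{P}$, we use a subscript (e.g., $v_p$) to indicate which one is used, in case it is not clear from the context.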
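The trajectory probability and the discounted trajectory return can be made concrete with a small sketch. It assumes the standard product form (initial-state probability times policy and transition probabilities) and tabular inputs indexed as `policy[s][a]`, `transition[s][a][s_next]`, and `reward[s][a][s_next]`; these names and the toy numbers are illustrative assumptions.

```python
def trajectory_probability(traj, p0, policy, transition):
    """traj = [s0, a0, s1, a1, ..., sT]; returns p0(s0) * prod_t pi(a_t|s_t) p(s_{t+1}|s_t,a_t)."""
    states, actions = traj[0::2], traj[1::2]
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= policy[s][a] * transition[s][a][s_next]
    return prob

def discounted_return(traj, reward, gamma):
    """Sum of gamma^t * reward[s_t][a_t][s_{t+1}] along the trajectory."""
    states, actions = traj[0::2], traj[1::2]
    return sum(gamma ** t * reward[states[t]][a][states[t + 1]]
               for t, a in enumerate(actions))

# Toy two-state, two-action problem (all numbers are made up for illustration).
p0 = [1.0, 0.0]
policy = [[0.5, 0.5], [0.9, 0.1]]
transition = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]
reward = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]

traj = [0, 1, 1, 0, 1]  # s0=0, a0=1, s1=1, a1=0, s2=1
example_prob = trajectory_probability(traj, p0, policy, transition)
example_return = discounted_return(traj, reward, gamma=0.9)
```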
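We define a robust value function $\hat{v}^{\pi}$ for an ambiguity set $\mathcal{P}$ as $\hat{v}^{\pi}(s) = \min_{p \in \mathcal{P}} v_p^{\pi}(s)$. Similar to ordinary MDPs, the robust value function can be computed using the robust Bellman operator (Iyengar, 2005; Nilim & Ghaoui, 2005):
$$(\hat{T}_{\mathcal{P}} v)(s) = \max_{a \in \mathcal{A}} \min_{p \in \mathcal{P}_{s,a}} (r_{s,a} + \gamma v)^\top p.$$
The optimal robust value function $\hat{v}^\star$ and the robust value function $\hat{v}^{\pi}$ for a policy $\pi$ are unique and satisfy $\hat{v}^\star = \hat{T}_{\mathcal{P}} \hat{v}^\star$ and $\hat{v}^{\pi} = \hat{T}^{\pi}_{\mathcal{P}} \hat{v}^{\pi}$, where $\hat{T}^{\pi}_{\mathcal{P}}$ replaces the maximum over actions with the expectation under $\pi$ (Iyengar, 2005). The robust return for a policy $\pi$ and ambiguity set $\mathcal{P}$ is defined as (Nilim & Ghaoui, 2005; Russel & Petrik, 2019a):
$$\hat{\rho}(\pi, \mathcal{P}, r) = \min_{p \in \mathcal{P}} \rho(\pi, p, r) = p_0^\top \hat{v}^{\pi},$$
where $p_0$ is the initial state distribution.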
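The robust Bellman backup can be sketched for the special case of $L_1$-bounded, state-action rectangular sets, where the inner minimization has a simple greedy solution: shift up to half the budget of probability mass onto the lowest-value next state and remove it from the highest-value ones. The code below is a minimal sketch under that assumption, for robust evaluation of a fixed policy; `worst_case_l1` and `robust_policy_evaluation` are hypothetical names, and the toy numbers are made up.

```python
def worst_case_l1(z, p_bar, psi):
    """Minimise sum_i p[i] * z[i] over distributions p with ||p - p_bar||_1 <= psi."""
    p = list(p_bar)
    i_min = min(range(len(z)), key=lambda i: z[i])
    # An L1 deviation of psi moves at most psi / 2 of probability mass.
    eps = min(psi / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    # Take the same amount of mass away from the highest-value states first.
    for i in sorted(range(len(z)), key=lambda i: z[i], reverse=True):
        if eps <= 0.0:
            break
        if i == i_min:
            continue
        moved = min(eps, p[i])
        p[i] -= moved
        eps -= moved
    return p

def robust_policy_evaluation(policy, p_bar, psi, reward, gamma, iters=500):
    """Iterate the robust backup for a fixed randomized policy."""
    n_states = len(policy)
    v = [0.0] * n_states
    for _ in range(iters):
        new_v = []
        for s in range(n_states):
            total = 0.0
            for a, prob_a in enumerate(policy[s]):
                z = [reward[s][a][s2] + gamma * v[s2] for s2 in range(n_states)]
                p = worst_case_l1(z, p_bar[s][a], psi[s][a])
                total += prob_a * sum(q * zi for q, zi in zip(p, z))
            new_v.append(total)
        v = new_v
    return v

policy = [[0.5, 0.5], [0.9, 0.1]]
p_bar = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]
reward = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
budget = [[0.2, 0.2], [0.2, 0.2]]
no_budget = [[0.0, 0.0], [0.0, 0.0]]

v_robust = robust_policy_evaluation(policy, p_bar, budget, reward, gamma=0.9)
v_nominal = robust_policy_evaluation(policy, p_bar, no_budget, reward, gamma=0.9)
```

With a zero budget the backup reduces to the ordinary Bellman backup under the nominal model, so the robust values are never larger than the nominal ones.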
Constrained RMDP (RCMDP)
In addition to the rewards $r_{s,a,s'}$ for RMDPs described above, we incorporate a constraint cost $d_{s,a,s'} \in \mathbb{R}$, where $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, representing some kind of constraint on safety for the agent's behavior. Consider for example an autonomous car that makes money (reward $r$) for each completed trip but incurs a big fine (constraint cost $d$) for traffic violations or a collision. We define the constraint cost $d$ to be a negative reward, which brings consistency in representing the worst case with a minimum over the ambiguity set for both the objective and the constraint. An associated constraint budget $\beta \in \mathbb{R}$ describes the total budget for constraint violations. This arrangement resembles the constrained-MDP setting described in Altman (2004), but with additional robustness.
Similar to the reward-based estimates described above, the total constraint cost along a trajectory $\xi$ is $g(\xi, d) = \sum_{t=0}^{T-1} \gamma^t d_{s_t, a_t, s_{t+1}}$, the robust value function for policy $\pi_\theta$ and ambiguity set $\mathcal{P}$ is $\hat{v}_d^{\pi_\theta}(s) = \min_{p \in \mathcal{P}} \mathbb{E}_{\xi \sim p^\theta}[\, g(\xi, d) \mid s_0 = s \,]$, and the robust return is:
$$\hat{\rho}(\pi_\theta, \mathcal{P}, d) = \min_{p \in \mathcal{P}} \rho(\pi_\theta, p, d) = p_0^\top \hat{v}_d^{\pi_\theta}.$$
Similar to $\hat{v}^\star$, the optimal constraint value function $\hat{v}_d^\star$ is also unique and independently satisfies the Bellman optimality equation (Altman, 2004). We now formally define the objective of the Robust Constrained MDP (RCMDP) as:
$$\max_{\pi_\theta \in \Pi} \hat{\rho}(\pi_\theta, \mathcal{P}, r) \quad \text{s.t.} \quad \hat{\rho}(\pi_\theta, \mathcal{P}, d) \ge \beta. \tag{1}$$
This objective resembles the objective of a CMDP (Altman, 2004), but with additional robustness integrated through the quantification of the uncertainty about the model. The interpretation of the objective is to find a policy that maximizes the worst-case return estimate while satisfying the constraint in all possible situations.
3 Robust Constrained Optimization
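The Lagrangian of the constrained problem (1) is
$$L(\pi_\theta, \lambda) = \hat{\rho}(\pi_\theta, \mathcal{P}, r) + \lambda \big( \hat{\rho}(\pi_\theta, \mathcal{P}, d) - \beta \big),$$
where $\lambda \ge 0$ is known as the Lagrange multiplier. Note that the objective in (1) is non-convex and therefore not tractable. The dual function of $L(\pi_\theta, \lambda)$ involves a point-wise maximum with respect to $\theta$ and is written as (Paternain et al., 2019):
$$D(\lambda) = \max_{\theta} L(\pi_\theta, \lambda).$$
The dual function provides an upper bound on (1) and therefore needs to be minimized to shrink the gap from optimality:
$$\min_{\lambda \ge 0} D(\lambda). \tag{2}$$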
The dual problem in (2) is convex and tractable, but the question remains of how large the duality gap is. In other words, how sub-optimal is the solution of the dual problem (2) with respect to the solution of the original problem stated in (1)? To answer that question, Paternain et al. (2019) show that strong duality holds in this case under some mild conditions and that the duality gap is arbitrarily small even with the parameterization ($\theta$) of policies. We thus aim to optimize the dual version of this problem using gradients.
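Proposition 1. The relaxed RCMDP objective of (1) can be restated as:
$$L(\pi_\theta, \lambda) = \sum_{\xi \in \Xi} p^\theta(\xi) \big( g(\xi, r) + \lambda\, g(\xi, d) \big) - \lambda \beta, \tag{3}$$
where $\Xi$ is the set of trajectories induced by policy $\pi_\theta$ under the selected worst-case transition function.
Proof. We defer the detailed derivation to Appendix A.1. ∎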
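The goal is then to find a saddle point $(\theta^\star, \lambda^\star)$ of $L$ in (3) that satisfies $L(\pi_\theta, \lambda^\star) \le L(\pi_{\theta^\star}, \lambda^\star) \le L(\pi_{\theta^\star}, \lambda)$ for all $\theta$ and all $\lambda \ge 0$. This is achieved by ascending in $\theta$ and descending in $\lambda$ using the gradients of the objective $L$ with respect to $\theta$ and $\lambda$, respectively (Chow & Ghavamzadeh, 2014).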
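The ascent-descent search for a saddle point can be illustrated on a toy constrained problem that is unrelated to any MDP. The sketch below assumes a scalar parameter, the hypothetical objective $f(\theta) = -(\theta - 2)^2$, the hypothetical constraint function $g(\theta) = -\theta \ge \beta$ with $\beta = -1$, and constant step sizes chosen so that the multiplier moves more slowly than the parameter.

```python
def primal_dual(steps=5000, lr_theta=0.05, lr_lambda=0.01, beta=-1.0):
    """Gradient ascent in theta and projected gradient descent in lambda on
    L(theta, lam) = f(theta) + lam * (g(theta) - beta),
    with toy f(theta) = -(theta - 2)^2 and g(theta) = -theta."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_theta = -2.0 * (theta - 2.0) - lam   # dL/dtheta
        grad_lambda = -theta - beta               # dL/dlambda = g(theta) - beta
        theta = theta + lr_theta * grad_theta          # ascend in theta
        lam = max(0.0, lam - lr_lambda * grad_lambda)  # descend in lambda, keep lam >= 0
    return theta, lam

theta_star, lambda_star = primal_dual()
```

The unconstrained maximizer $\theta = 2$ violates the constraint $\theta \le 1$, so the multiplier grows until the iterate settles on the constraint boundary, where the stationarity condition $-2(\theta - 2) - \lambda = 0$ holds.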
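Theorem 3.1. The gradient of $L$ with respect to $\theta$ and $\lambda$ can be computed as:
$$\nabla_\theta L = \sum_{\xi \in \Xi} p^\theta(\xi) \big( g(\xi, r) + \lambda\, g(\xi, d) \big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \nabla_\lambda L = \sum_{\xi \in \Xi} p^\theta(\xi)\, g(\xi, d) - \beta.$$
Proof. See Appendix A.2 for the detailed derivation. ∎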
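A Monte Carlo version of these gradients can be sketched as follows, assuming the usual score-function (likelihood-ratio) form in which each sampled trajectory contributes its summed score $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ weighted by $g(\xi, r) + \lambda g(\xi, d)$. The function name and the input layout are hypothetical; trajectories are assumed to be sampled under the worst-case model.

```python
def lagrangian_gradients(samples, lam, beta):
    """Sample-average gradient estimates of the Lagrangian.

    samples: list of (score, g_r, g_d) tuples, one per trajectory, where
      score is the list sum_t grad_theta log pi_theta(a_t | s_t),
      g_r is the trajectory reward return, g_d the constraint return.
    """
    n = len(samples)
    dim = len(samples[0][0])
    grad_theta = [0.0] * dim
    grad_lambda = -beta
    for score, g_r, g_d in samples:
        weight = (g_r + lam * g_d) / n
        for i in range(dim):
            grad_theta[i] += weight * score[i]
        grad_lambda += g_d / n
    return grad_theta, grad_lambda

# Two hand-made trajectories with a two-dimensional parameter vector.
samples = [([1.0, 0.0], 2.0, -1.0), ([0.0, 1.0], 4.0, -3.0)]
g_theta, g_lambda = lagrangian_gradients(samples, lam=0.5, beta=-1.5)
```

A negative multiplier gradient indicates that the estimated constraint return is below the budget, so a descent step in $\lambda$ increases the multiplier and puts more weight on the constraint.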
With a fixed Lagrange multiplier , the constraint budget in (3) offsets the sum by a constant amount. We can therefore omit this constant and define the Bellman operator for RCMDPs. We then show that this operator is a contraction.
(Bellman Equation) For a fixed policy and discount factor , the RCMDP value function satisfies a Bellman equation for each :
The proof is deferred to Appendix A.3. ∎
We define the Bellman optimality equation for RCMDPs as:
(Contraction) The Bellman operator defined in (5) is a contraction.
The proof follows directly from Theorem 3.2 of Iyengar (2005). ∎
The RCMDP Bellman operator therefore satisfies the Bellman optimality equation and converges to a fixed point of the optimal RCMDP value function .
Policy Gradient Algorithm
Algorithm 1 presents a robust constrained policy gradient (RCPG) algorithm based on the gradient update rules derived above in Theorem 3.1. The algorithm proceeds episodically based on trajectories and updates the parameters using Monte Carlo estimates. The algorithm requires an ambiguity set as its input, which can be constructed from empirical estimates for smaller problems (Wiesemann et al., 2013; Russel & Petrik, 2019a; Behzadian et al., 2021). For larger problems it can be a parameterized estimate instead (Janner et al., 2019).
The step size schedules used in Algorithm 1 satisfy the standard conditions for stochastic approximation algorithms (Borkar, 2009). That is, the $\theta$-update is on the fastest time-scale, whereas the $\lambda$-update is on a slower time-scale, which results in a two time-scale stochastic approximation algorithm. We derive its convergence to a saddle point as below.
Actor Critic Algorithm
The high variance of the Monte Carlo based policy gradient algorithm can be handled by introducing state values as baselines (Sutton & Barto, 2018). As the optimal value function for RCMDPs can be computed using Bellman-style recursive updates as shown in (4), extending the above PG algorithm to the actor-critic framework is straightforward. The algorithm reported in Appendix A.4 presents an actor-critic (AC) algorithm for RCMDPs. The additional state-value parameterization brings a new dimension to the algorithm and results in a three time-scale stochastic approximation algorithm. The convergence properties of this AC algorithm can be derived in a way similar to Theorem 3.2, and we therefore omit the detailed derivations.
4 Stable Robust-Constrained RL: Lyapunov-based RCMDP Concept
In this section, we propose Lyapunov-based¹ reward shaping for RCMDPs. The motivation is threefold: i) learn a good policy faster, ii) serve as a proxy to guide robustness when an estimate of the value function is not readily available, and iii) guarantee stability (in the sense of Lyapunov) in the learning process. We first briefly introduce the idea of Lyapunov stability, the Lyapunov function, and some of its useful characteristics. We then introduce an additive shaping reward strategy based on Lyapunov functions and analyze its properties.
¹ Other works have applied different notions of Lyapunov stability in the context of model-based RL (Farahmand & Benosman, 2017; ETH_Bastards2017) and MDPs (Perkins & Barto, 2000; Chow et al., 2018); however, none of these works incorporates explicit robustness in its formulation, i.e., in the context of RCMDPs.
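(Lyapunov stability) (Haddad, 2008) Consider the general nonlinear discrete system $x_{t+1} = f(x_t)$ (Sy), where $x_t \in \mathcal{D} \subseteq \mathbb{R}^n$, $\mathcal{D}$ is an open set containing $x_e$, and $f : \mathcal{D} \to \mathcal{D}$ is a continuous function on $\mathcal{D}$. Then, the equilibrium point $x_e$ of (Sy), satisfying $f(x_e) = x_e$, is said to be:
- Lyapunov stable if $\forall \epsilon > 0$, $\exists \delta > 0$, s.t., if $\| x_0 - x_e \| < \delta$, then $\| x_t - x_e \| < \epsilon$ for all $t \ge 0$.
- Asymptotically stable if it is Lyapunov stable and $\exists \delta > 0$, s.t., if $\| x_0 - x_e \| < \delta$, then $\lim_{t \to \infty} x_t = x_e$.
(Lyapunov direct method) (Haddad, 2008) Consider the system (Sy), and assume that there exists a continuous Lyapunov function $V : \mathcal{D} \to \mathbb{R}$, s.t.,
$$V(x_e) = 0, \tag{6a}$$
$$V(x) > 0, \quad \forall x \in \mathcal{D} \setminus \{x_e\}, \tag{6b}$$
$$V(f(x)) - V(x) \le 0, \quad \forall x \in \mathcal{D}, \tag{6c}$$
then the equilibrium point $x_e$ is Lyapunov stable. If, in addition, $V(f(x)) - V(x) < 0$ for all $x \in \mathcal{D} \setminus \{x_e\}$, then $x_e$ is asymptotically stable.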
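The conditions of the direct method can be checked numerically on a simple example. The sketch below assumes a hypothetical linear system $f(x) = 0.5\,x$ with equilibrium at the origin and the candidate Lyapunov function $V(x) = \|x\|^2$; a finite grid of sample points is only an illustration of the descent property, not a proof of stability.

```python
def f(x):
    """Hypothetical contractive linear system with equilibrium at the origin."""
    return [0.5 * xi for xi in x]

def V(x):
    """Candidate Lyapunov function: squared Euclidean norm."""
    return sum(xi * xi for xi in x)

def descent_holds(points, strict=False):
    """Check V(f(x)) - V(x) <= 0 on the given points.

    With strict=True, additionally require V(f(x)) - V(x) < 0 away from the origin.
    """
    for x in points:
        diff = V(f(x)) - V(x)
        away_from_equilibrium = any(xi != 0.0 for xi in x)
        if diff > 0.0 or (strict and away_from_equilibrium and diff >= 0.0):
            return False
    return True

grid = [[i / 2.0, j / 2.0] for i in range(-4, 5) for j in range(-4, 5)]
stable_on_grid = descent_holds(grid)
asymptotically_stable_on_grid = descent_holds(grid, strict=True)
```

For this system $V(f(x)) - V(x) = -0.75\,\|x\|^2$, so both the non-strict and the strict descent conditions hold at every sampled point.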
4.1 Stability Constraints for RMDPs
We propose to incorporate the Lyapunov stability descent property (6c) as a constraint in the RCMDP objective (2), where the constraint cost is given by . We set the budget to enforce Lyapunov stability or set for achieving asymptotic stability. Note that in this setting, we assume that the only constraint cost is the stability cost , and thus we are in the setting of RMDPs to which we add a virtual stability constraint cost. In this setting, we apply Algorithm 1 to propose a Lyapunov stable-RCPG algorithm, and use the results of Theorem 3.2 to deduce its asymptotic convergence to a locally optimal stationary policy for the infinite-horizon case. We summarize this in the following proposition.
Under assumptions (A1) - (A7) as stated in Section A.5, the sequence of parameter updates of Algorithm 1, where converges almost surely to a locally optimal a.s. Lyapunov stable policy as . Furthermore, if , the policy is a.s. asymptotically stable.
Consider the control problem defined by (2), under assumptions (A1) - (A7), and where . Then, based on Theorem 3.2, we can conclude that Algorithm 1 converges asymptotically almost surely to a locally optimal policy . Furthermore, since is computed under the constraint of the Lyapunov descent property in expectation, the equilibrium point of the controlled system is a.s.² Lyapunov stable (Definition 3.5, Mahmoud et al. (2003)) when , and a.s. asymptotically Lyapunov stable (Definition 3.8, Mahmoud et al. (2003)) when . ∎
² Almost sure (a.s.) (asymptotic) Lyapunov stability is to be understood as (asymptotic) Lyapunov stability for almost all samples of the states.
4.2 Stability Constraints for RCMDPs
In the case where the problem at hand is an RCMDP with a constraint cost (e.g., physical obstacle avoidance constraints for a mobile robot), we propose a second way to incorporate the stability descent constraint. We draw a parallel between the notion of soft constraints, where the Lyapunov descent constraint is not enforced as a constraint cost as in Sec. 4.1, and reward shaping (Ng et al., 1999). Indeed, we propose to add the Lyapunov stability descent constraint directly to the reward of the RCMDP (2).
Reward Shaping with Lyapunov Constraint
We define the shaping reward function based on this Lyapunov descent property.
The motivation behind this is quite intuitive: a transition in a descent direction leads to a desired region of the state space faster and should therefore be rewarded. So, if we were to receive a reward in the original setting, we would instead pretend to receive a reward of on the same event. This renders a transformed RCMDP with the same state space, action space, and transition probabilities. Only the reward function is reshaped with additional reward signals .
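The effect of such a shaping term can be illustrated with a short sketch. It assumes, as one plausible reading of the descent property, an undiscounted additive shaping term $V(s) - V(s')$ for a transition from $s$ to $s'$ (a potential-based term in the sense of Ng et al. (1999) with potential $-V$); the quadratic `lyapunov` function and the scalar states are hypothetical.

```python
def lyapunov(s):
    """Hypothetical Lyapunov function on scalar states, equilibrium at s = 0."""
    return s * s

def shaped_reward(r, s, s_next):
    """Original reward plus an assumed descent bonus V(s) - V(s_next)."""
    return r + (lyapunov(s) - lyapunov(s_next))

# One trajectory that mostly descends toward the equilibrium, with one ascent step.
states = [4.0, 3.0, 3.5, 1.0, 0.0]
rewards = [-1.0, -1.0, -1.0, -1.0]
shaped = [shaped_reward(r, s, s2) for r, s, s2 in zip(rewards, states, states[1:])]

total_original = sum(rewards)
total_shaped = sum(shaped)
```

Along any trajectory the shaping terms telescope to $V(s_0) - V(s_T)$, so for fixed start and end states the shaped and original returns differ by a constant that does not depend on the policy, while individual descent steps are rewarded and ascent steps are penalized.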
Every optimal finite-horizon policy in the transformed RCMDP is also an optimal finite-horizon policy in the original RCMDP under the Lyapunov-based reward transformation stated in (7). Furthermore, under the assumption of a transient MDP, every optimal infinite-horizon policy in the transformed RCMDP is also an optimal infinite-horizon policy in the original RCMDP.
In the finite-horizon case, this result is a simple extension of Theorem 1 of Ng et al. (1999) into the RCMDP setting and the proof follows directly from Ng et al. (1999). In the infinite-horizon case, one needs to rely on the transient assumption for the MDP (in the sense of Def. 7.1 in Altman (2004)) to conclude about the convergence of the finite-horizon problem to the infinite-horizon problem, using the arguments in (Theorem 15.1, Altman (2004)) . See Section B.1 for the full derivation. ∎
Note that the concept of Lyapunov reward transformation is independent of the RL algorithm, and thus can be applied with any existing mainstream approaches such as TRPO, PPO, or CPO. The Lyapunov reward transformation will allow faster convergence for these existing approaches, as verified in our empirical analysis.
5 Conclusion
In this paper, we studied robust constrained MDPs (RCMDPs) to simultaneously deal with constraints and model uncertainties in reinforcement learning. We proposed the RCMDP framework, derived the related theoretical analysis, and proposed algorithms to optimize the objective of RCMDPs. We also proposed an extension to Lyapunov-RCMDPs (L-RCMDPs) based on the Lyapunov function. We analyzed the performance of our L-RCMDP algorithms in the context of reward shaping. We provided a theoretical analysis of Lyapunov stability and asymptotic convergence for our methods. We also empirically validated the proposed algorithms on three different problem domains. Future work should focus on automated learning of the Lyapunov function from the domain itself and on applying the proposed approach to more complex practical problem domains.
References
- Achiam et al. (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained Policy Optimization. International Conference on Machine Learning, 2017.
- Altman (2004) Altman, E. Constrained Markov Decision Processes. 2004.
- Behzadian et al. (2021) Behzadian, B., Russel, R. H., Petrik, M., and Ho, C. P. Optimizing Percentile Criterion Using Robust MDPs. International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
- Benosman (2018) Benosman, M. Model‐based vs data‐driven adaptive control: An overview. International Journal of Adaptive Control and Signal Processing, 2018.
- Bertsekas (2003) Bertsekas, D. P. Nonlinear programming. Athena Scientific, 2003.
- Borkar (2009) Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. International Statistical Review, 2009.
- Chamiea et al. (2016) Chamiea, M. E., Yu, Y., and Acikmese, B. Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In IEEE American Control Conference, pp. 6290–6295, 2016.
- Chow & Ghavamzadeh (2014) Chow, Y. and Ghavamzadeh, M. Algorithms for CVaR optimization in MDPs. Advances in Neural Information Processing Systems, 2014.
- Chow et al. (2018) Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, 2018.
- Dalal et al. (2018) Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces, 2018.
- Derman et al. (2018) Derman, E., Mankowitz, D. J., Mann, T. A., and Mannor, S. Soft-robust actor-critic policy-gradient. Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
- Farahmand & Benosman (2017) Farahmand, A.-M. and Benosman, M. Towards stability in learning based control: A bayesian optimization based adaptive controller. In The Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2017.
- Finn et al. (2017) Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-Shot Visual Imitation Learning via Meta-Learning. In Levine, S., Vanhoucke, V., and Goldberg, K. (eds.), 2017.
- Geibel & Wysotzki (2005) Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 2005.
- Haddad (2008) Haddad, W. M. Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press, 2008.
- Iyengar (2005) Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 2005.
- Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. arXiv, 2019.
- Le Tallec (2007) Le Tallec, Y. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.
- Lin et al. (2018) Lin, S. C., Zhang, Y., Hsu, C. H., Skach, M., Haque, M. E., Tang, L., and Mars, J. The architectural implications of autonomous driving: Constraints and acceleration. ACM SIGPLAN Notices, 2018.
- Mahmoud et al. (2003) Mahmoud, M. M., Jiang, J., and Zhang, Y. Active Fault Tolerant Control Systems: Stochastic Analysis and Synthesis. Springer, 2003.
- Ng et al. (1999) Ng, A., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, 1999.
- Nilim & Ghaoui (2004) Nilim, A. and Ghaoui, L. E. Robust solutions to Markov decision problems with uncertain transition matrices. Operations Research, 53(5):780, 2004.
- Nilim & Ghaoui (2005) Nilim, A. and Ghaoui, L. E. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, sep 2005. ISSN 0030-364X. doi: 10.1287/opre.1050.0216.
- Paternain et al. (2019) Paternain, S., Chamon, L. F., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Conference on Neural Information Processing Systems, 2019.
- Perkins & Barto (2000) Perkins, T. J. and Barto, A. G. Lyapunov-Constrained Action Sets for Reinforcement Learning. 2000.
- Puterman (2005) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2005.
- Russel & Petrik (2019a) Russel, R. H. and Petrik, M. Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs. Advances in Neural Information Processing Systems (NeurIPS), 2019a.
- Russel & Petrik (2019b) Russel, R. H. and Petrik, M. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. Advances in Neural Information Processing Systems, 2019b.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 2017.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Szepesvári (2010) Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
- Tamar et al. (2014) Tamar, A., Glassner, Y., and Mannor, S. Optimizing the CVaR via Sampling. 2014.
- Vamvoudakis et al. (2015) Vamvoudakis, K., Antsaklis, P., Dixon, W., Hespanha, J., Lewis, F., Modares, H., and Kiumarsi, B. Autonomy and machine intelligence in complex systems: A tutorial. 2015.
- van Baar et al. (2019) van Baar, J., Sullivan, A., Corcodel, R., Jha, D., Romeres, D., and Nikovski, Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
- Wiesemann et al. (2013) Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 2013.
Appendix A RCMDP Derivations
A.1 Proof of Proposition 1
We rewrite the objective (1) and perform some algebraic manipulation as below:
Where is the set of all possible trajectories induced by policy under transition function . Similarly, is the set of all possible trajectories induced by policy under transition function . Step above follows by assuming that the initial state distribution concentrates all of its mass to one single state . And follows with and . Note that, and are distinct, independent and depend on rewards and constraint costs respectively. However, the rewards and constraint costs are coupled together in reality, meaning that the set of two trajectories and would not be different. So we select one set of trajectories being either or . This selection of may happen based on our priorities toward robustness of reward (with corresponding trajectory ) or constraint cost (with corresponding trajectory ). Or, it can also be the best (e.g. yielding higher objective value) set among and satisfying the constraint. We then have a simplified formulation for as below:
A.2 Proof of Theorem 3.1
The objective as specified in (3):
We first derive the gradient update rule of with respect to as below:
Next, we derive the gradient update rule for with respect to :
A.3 Proof of Proposition 2
Here follows by expanding total return given a trajectory and follows by evaluating the one-step immediate transition apart. ∎
A.4 Actor-Critic Algorithm
A.5 Convergence Analysis of Algorithm
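(A1) For any state $s$, the policy $\pi_\theta(\cdot \mid s)$ is continuously differentiable with respect to the parameter $\theta$, and $\nabla_\theta \pi_\theta(a \mid s)$ is a Lipschitz function in $\theta$ for every $s \in \mathcal{S}$ and $a \in \mathcal{A}$.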
(A2) The step size schedules satisfy:
Equation (9) ensures that the errors resulting from the discretization of the Ordinary Differential Equation (ODE) and the errors due to noise both become negligible asymptotically with probability one (Borkar, 2009). Equations (8) and (9) together ensure that the iterates asymptotically capture the behavior of the ODE. Equation (10) mandates that the updates corresponding to $\lambda$ are on a slower time scale than those of $\theta$.
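Step size conditions of this kind can be illustrated with a pair of concrete schedules. The sketch below assumes the usual requirements (step sizes that are not summable but are square-summable, with the ratio of the slower to the faster schedule tending to zero) and picks the hypothetical schedules $k^{-0.6}$ for the policy parameters and $1/k$ for the multiplier; finite partial sums only hint at the limiting behavior.

```python
def zeta_theta(k):
    """Hypothetical faster schedule for the policy parameters."""
    return k ** -0.6

def zeta_lambda(k):
    """Hypothetical slower schedule for the Lagrange multiplier."""
    return 1.0 / k

N = 10 ** 5
sum_theta = sum(zeta_theta(k) for k in range(1, N + 1))           # keeps growing with N
sum_lambda = sum(zeta_lambda(k) for k in range(1, N + 1))         # keeps growing with N
sum_sq_theta = sum(zeta_theta(k) ** 2 for k in range(1, N + 1))   # stays bounded
sum_sq_lambda = sum(zeta_lambda(k) ** 2 for k in range(1, N + 1)) # stays bounded

# Ratio of slow to fast step size at a large iteration index: k^(-0.4).
ratio_at_large_k = zeta_lambda(10 ** 6) / zeta_theta(10 ** 6)
```

Because the ratio tends to zero, the multiplier appears almost constant from the point of view of the faster policy update, which is the separation of time scales used in the convergence argument.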
A.5.1 Policy Gradient Algorithm
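The general stochastic approximation scheme used by Borkar (2009) is of the form:
$$x_{k+1} = x_k + a(k)\, [\, h(x_k) + M_{k+1} \,], \quad k \ge 0, \tag{11}$$
where $\{M_k\}$ is a sequence of integrable random variables representing the noise and $a(k)$ are step sizes (e.g., $a(k) = 1/k$). The expression inside the square brackets is the noisy measurement, where $h(x_k)$ and $M_{k+1}$ are not separately available; only their sum is available. The terms of (11) need to satisfy the additional assumptions below:
(A3) The function $h : \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz. That is, $\| h(x) - h(y) \| \le L \| x - y \|$ for some $0 < L < \infty$.
(A4) $\{M_k\}$ is a martingale difference sequence:
$$\mathbb{E}[\, M_{k+1} \mid x_n, M_n, n \le k \,] = 0 \quad \text{a.s.}, \quad k \ge 0.$$
In addition, $\{M_k\}$ are square-integrable:
$$\mathbb{E}[\, \| M_{k+1} \|^2 \mid x_n, M_n, n \le k \,] \le K (1 + \| x_k \|^2) \quad \text{a.s.}, \quad k \ge 0,$$
for some constant $K > 0$.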
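A minimal simulation of a stochastic approximation iteration of this type is sketched below. It assumes a scalar iterate, the hypothetical Lipschitz drift $h(x) = -(x - 3)$ whose associated ODE has the stable equilibrium $x = 3$, independent zero-mean Gaussian noise, and step sizes $1/k$; all of these choices are illustrative.

```python
import random

def stochastic_approximation(h, x0, steps, noise_std=1.0, seed=0):
    """Iterate x <- x + a(k) * (h(x) + noise) with step size a(k) = 1 / k."""
    rng = random.Random(seed)
    x = x0
    for k in range(1, steps + 1):
        noise = rng.gauss(0.0, noise_std)   # zero-mean noise, only h(x) + noise is observed
        x = x + (1.0 / k) * (h(x) + noise)
    return x

x_final = stochastic_approximation(lambda x: -(x - 3.0), x0=0.0, steps=10_000)
```

With this drift and step size the iterate equals the running average of the noisy observations of the target, so it approaches the equilibrium of the ODE $\dot{x} = h(x)$ as the noise averages out.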
Our proposed policy gradient algorithm is a two time-scale stochastic approximation algorithm. The parameter update iterations of the policy gradient algorithm are defined as below:
Where is a zero mean i.i.d. random variable representing noise. To apply general convergence analysis techniques derived for (11) in Borkar (2009), we take the special form in (14) and transform it to the general format of (11) as below:
With these transformation techniques, we obtain the general update for from (12):
where, is the gradient w.r.t , , and . Note that, the noise term is omitted because the noise is inherent in our sample based iterations.
is Lipschitz in .
Recall that the gradient of with respect to is:
Assumption (A1) implies that, in the equation (17) is a Lipschitz function in for any and . As the expectation of sum of number of Lipschitz functions is also Lipschitz, we conclude that is Lipschitz in .
of (16) satisfies assumption (A4).
We transform our update rule of (13) as:
where, is the gradient w.r.t , , and .
Notice that is a constant function of . And therefore, is a constant function of .
of (18) satisfies assumption (A4).
With assumption (A2), is quasi-static from the perspective of turning (19) into an ODE. where is held fixed:
We additionally assume that:
(21) has a globally asymptotically stable equilibrium such that is a Lipschitz map.
Assumption (A5) turns (20) into:
Let’s further assume that:
The ODE (22) has a globally asymptotically stable equilibrium .