## 1 Introduction

One of the central challenges in reinforcement learning (RL) is to design efficient algorithms for high-dimensional environments. Recently, model-free deep RL has shown promise in tackling high-dimensional continuous environments. The primary model-free approaches at the two extremes of the deep RL spectrum are value-based and policy-gradient-based methods. Value-based approaches learn a function that approximates the value of each action at any state, and then exploit the approximated function to reason about the actions. Policy-gradient-based approaches, on the other hand, learn the policy directly and remove the overhead of value learning.

Value-based model-free approaches have been successful in a wide variety of simulated domains, but they are limited to MDPs (Mnih et al., 2015). Similarly, even though the principle of policy gradient does not require any model assumption (Aleksandrov, 1968; Rubinstein, 1969; Baxter and Bartlett, 2001; Williams, 1992), previous advances in more sophisticated policy gradient methods are mainly limited to MDPs (Sutton et al., 2000; Schulman et al., 2015, 2017; Lillicrap et al., 2015). However, real-world problems rarely follow an MDP, and the entire environment is rarely observable. Moreover, Sutton et al. (1998) argue that when function approximation is deployed to represent the states, the loss of information in the representation function turns the problem into a POMDP in general. In addition, previous analyses in MDPs have mostly been dedicated to infinite-horizon settings (Sutton et al., 2000; Schulman et al., 2015, 2017; Lillicrap et al., 2015), whereas empirical examinations of these methods are mostly carried out in episodic environments.

If the underlying environment is an MDP, then restricting attention to memoryless policies is appropriate, since the optimal policy in an MDP is usually deterministic and memoryless. On the other hand, if the environment is a POMDP, then the optimal policy in the class of memoryless policies is in general stochastic. Policies for POMDPs can also depend on the entire history. However, maintaining history-dependent policies is infeasible (PSPACE-complete) (Montúfar et al., 2015; Azizzadenesheli et al., 2016a; Vlassis et al., 2012), and hence many works dealing with POMDPs restrict themselves to the class of stochastic memoryless policies (Azizzadenesheli et al., 2016b; Montúfar et al., 2015). Extending value-based methods to POMDPs requires computation in belief space, which is expensive and requires maintaining the entire history. Extending value-based methods to stochastic memoryless and limited-memory policies is not possible if optimality is a concern. With policy gradient methods, however, we can design efficient algorithms for the class of stochastic memoryless policies.

Despite the MDP assumption in the mainstream of recent policy gradient methods, empirical studies have demonstrated superior performance when the class of stochastic policies is considered (Schulman et al., 2017, 2015). Stochastic policies also contribute to exploration. Interestingly, in many recent works the stochasticity is kept until the end of the training phase, even though these works do not explicitly assume a POMDP (Schulman et al., 2017, 2015).

In on-policy policy gradient methods, we collect data under the current policy and exploit the acquired data to search for a new policy. This procedure iteratively improves the policy and maximizes the expected return. Under the infinite-horizon MDP modeling assumption, Kakade and Langford (2002) and Schulman et al. (2015) study trust-region methods, a class of policy gradient methods which perform the policy search in the vicinity of the current policy, e.g., TRPO. They construct a surrogate objective using advantage functions and propose a policy gradient on this surrogate objective. They prove that the expected return of the updated policy increases monotonically. This is the so-called Monotonic Improvement Lemma.

In the low-sample setting, precise estimation of trust regions is not tractable.

TRPO (Schulman et al., 2015), as a trust-region method for MDPs, explicitly imposes the trust-region constraints on the parameter space, which might be hard to maintain in the low-sample setting. To mitigate this, Schulman et al. (2017) offer Proximal Policy Optimization (PPO), a simple extension to TRPO which approximately retains the trust-region constraints directly on the policy space. It also significantly reduces the computation cost of TRPO, and is therefore a reasonable choice for empirical study.

Contributions: In this work, we develop a new surrogate objective and advantage function for POMDP environments. We show that a policy gradient step around the current policy using the new objective and advantage function results in policies whose improvements in expected return are bounded from below. Therefore, we preserve the Monotonic Improvement Lemma. To achieve this guarantee, we show that the advantage function needs to depend on three consecutive observations. Surprisingly, this matches the statement in Azizzadenesheli et al. (2016b), which shows that three consecutive observations are necessary to learn the POMDP dynamics and guarantee a regret upper bound in a model-based RL setting. In the analysis of TRPO, the construction of the trust region is independent of the episode length, which results in a biased estimate of the region in the episodic setting. We show that it is necessary to incorporate the length of each episode and to construct the trust region as a function of the episode lengths.

Generally, discount factors play an essential role in the construction of trust regions. The discount factor represents how important each segment of a trajectory is, and therefore how critical the role of each segment is in the construction of the trust region. While none of the prior works considers discount factors in the study of trust regions, we further extend the analysis in this work and introduce a new notion of divergence which deploys the discount factor to construct a meaningful trust region. We also show how to extend the techniques developed in this work to prior methods, e.g., TRPO and PPO, in episodic MDPs. In GTRPO, we use the same techniques as in PPO to reduce computational complexity. We apply it to a variety of RoboSchool (Schulman et al., 2017) environments, which are extensions of the MuJoCo environments (Todorov et al., 2012). We empirically study the performance of GTRPO on these simulated environments and report its behavior under different simulation design choices. Throughout the experiments, we observe similar behavior for the MDP-based approach PPO and the POMDP-based approach GTRPO. This might be due to the simplicity of the environments, as well as to the fact that the current state of these environments is close to an MDP.

## 2 Preliminaries

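An episodic POMDP $M$ is a tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{Y}, P_1, T, R, O, \gamma, x_T \rangle$ with latent state space $\mathcal{X}$, observation space $\mathcal{Y}$, action space $\mathcal{A}$, discount factor $0 \le \gamma \le 1$, and stochastic reward distribution $R(x, a)$ with mean $\bar{r}(x, a) = \mathbb{E}[R(x, a)]$. $x_T$ is the terminal state, which is accessible from any other state, i.e., starting from any other state there is a nonzero probability mass of reaching $x_T$ in finitely many time steps. The episode terminates when the process reaches $x_T$. The initial latent state is drawn from a distribution $P_1$, and the dynamics then follow stochastically as $x' \sim T(x' \mid x, a)$. The observation process is generated as $y \sim O(y \mid x)$, and a memoryless policy is deployed which maps the current observation to a distribution over actions. The graphical model associated with the POMDP is illustrated in Fig. 1.

We consider a set of parameterized policies $\pi_\theta$ with $\theta \in \Theta$. For each pair $(y, a)$, let $\pi_\theta(a \mid y)$ denote the conditional probability of choosing action $a$ under the policy $\pi_\theta$ when observation $y$ is observed. Furthermore, we define a random trajectory $\tau$ as a finite-length sequence of events $\{(x_1, y_1, a_1, r_1), (x_2, y_2, a_2, r_2), \ldots, (x_{|\tau|}, y_{|\tau|}, a_{|\tau|}, r_{|\tau|})\}$, where the termination happens at the step after $|\tau|$, i.e., $x_{|\tau|+1} = x_T$. Let $f(\tau; \theta)$ denote the probability density of trajectory $\tau$ under policy $\pi_\theta$, and let $\Upsilon$ be the set of all possible trajectories. Furthermore, $R(\tau)$ denotes the cumulative $\gamma$-discounted reward of the trajectory $\tau \in \Upsilon$. The agent's goal is to maximize the unnormalized expected cumulative return $\eta(\theta) = \mathbb{E}_{\tau \sim f(\cdot; \theta)}[R(\tau)]$;

$$\theta^* = \arg\max_{\theta \in \Theta} \eta(\theta), \qquad \eta(\theta) := \int_{\tau \in \Upsilon} f(\tau; \theta)\, R(\tau)\, d\tau \tag{1}$$

with $\pi^* = \pi_{\theta^*}$ the optimal policy.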
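To make the episodic setting concrete, the following is a minimal sketch of sampling one trajectory from a small POMDP under a memoryless policy and computing its discounted return. The two-state environment, its probabilities, the function names, and the convention that the first reward carries weight 1 are all illustrative assumptions, not part of the paper.

```python
import random


def discounted_return(rewards, gamma):
    """Cumulative gamma-discounted reward of one trajectory.

    Assumed convention: the first reward has weight gamma**0 = 1.
    """
    return sum((gamma ** h) * r for h, r in enumerate(rewards))


def rollout(policy, gamma, rng, max_steps=1000):
    """Sample one episode of a hypothetical two-state POMDP.

    policy[y] is the probability of taking action 1 when observing y in {0, 1};
    the policy is memoryless because it only looks at the current observation.
    """
    x = rng.choice([0, 1])                          # initial latent state
    observations, actions, rewards = [], [], []
    for _ in range(max_steps):
        y = x if rng.random() < 0.8 else 1 - x      # noisy observation of x
        a = 1 if rng.random() < policy[y] else 0    # action depends on y only
        r = 1.0 if a == x else 0.0                  # reward depends on latent x
        observations.append(y)
        actions.append(a)
        rewards.append(r)
        if rng.random() < 0.1:                      # terminal state reachable from any state
            break
        x = x if rng.random() < 0.7 else 1 - x      # latent state transition
    return observations, actions, rewards, discounted_return(rewards, gamma)
```

Calling `rollout({0: 0.1, 1: 0.9}, 0.99, random.Random(0))` returns the observed part of one trajectory together with its return; the latent states are deliberately not returned, since the agent never sees them.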

## 3 Policy Gradient

In this section, we study policy gradient methods for POMDPs. Generally, the optimization problem of interest in Eq. 1 is non-convex, so hill-climbing methods might not converge to a global optimum. While finding the optimal solution to this problem is intractable in general, we study gradient-ascent-based approaches. Gradient ascent on Eq. 1 results in the policy gradient method. It is a well-known lemma that the gradient of the expected cumulative return does not require explicit knowledge of the dynamics but only of the cumulative reward distribution (Williams, 1992; Baxter and Bartlett, 2001). This lemma has mainly been proven through the construction of the score function (see Section A.1). In this section, we re-derive the same lemma through importance sampling, since this route is more closely related to the later parts of this paper.

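Importance sampling is a general technique for estimating the properties of a particular distribution while only having samples generated from another distribution. One can estimate $\eta(\theta')$, $\theta' \in \Theta$, while the expectation is over the trajectory distribution $f(\tau; \theta)$ induced by $\pi_\theta$;

$$\eta(\theta') = \mathbb{E}_{\tau \sim f(\cdot; \theta')}\big[R(\tau)\big] = \mathbb{E}_{\tau \sim f(\cdot; \theta)}\left[\frac{f(\tau; \theta')}{f(\tau; \theta)}\, R(\tau)\right] \tag{2}$$

as long as, for each $\tau$ with $f(\tau; \theta') > 0$, also $f(\tau; \theta) > 0$. The gradient of $\eta(\theta')$ with respect to $\theta'$ is

$$\nabla_{\theta'} \eta(\theta') = \mathbb{E}_{\tau \sim f(\cdot; \theta)}\left[\frac{\nabla_{\theta'} f(\tau; \theta')}{f(\tau; \theta)}\, R(\tau)\right] = \mathbb{E}_{\tau \sim f(\cdot; \theta)}\left[\frac{f(\tau; \theta')}{f(\tau; \theta)}\, \nabla_{\theta'} \log f(\tau; \theta')\, R(\tau)\right]$$

The gradient at $\theta' = \theta$ is;

$$\nabla_{\theta} \eta(\theta) = \mathbb{E}_{\tau \sim f(\cdot; \theta)}\big[\nabla_{\theta} \log f(\tau; \theta)\, R(\tau)\big] \tag{3}$$

Since for each trajectory $\tau$, the log-density $\log f(\tau; \theta)$ decomposes as;

$$\log f(\tau; \theta) = \log\Big(P_1(x_1)\, O(y_1 \mid x_1)\, R(r_1 \mid x_1, a_1) \prod_{h=2}^{|\tau|} T(x_h \mid x_{h-1}, a_{h-1})\, O(y_h \mid x_h)\, R(r_h \mid x_h, a_h)\; T(x_T \mid x_{|\tau|}, a_{|\tau|})\Big) + \log\Big(\prod_{h=1}^{|\tau|} \pi_\theta(a_h \mid y_h)\Big)$$

and the first part is independent of $\theta$, we have;

$$\nabla_\theta \log f(\tau; \theta) = \nabla_\theta \log\Big(\prod_{h=1}^{|\tau|} \pi_\theta(a_h \mid y_h)\Big) = \sum_{h=1}^{|\tau|} \nabla_\theta \log \pi_\theta(a_h \mid y_h)$$

This derivation suggests that, given trajectories collected under a policy $\pi_\theta$, we can compute the gradient of the expected return with respect to the parameters of $\pi_\theta$ without knowledge of the dynamics. In practice, however, we are not able to compute the exact expectation. Instead, we can deploy Monte Carlo sampling to estimate the gradient. Given $m$ trajectories $\{\tau^1, \ldots, \tau^m\}$ with elements $(x_h^t, y_h^t, a_h^t, r_h^t)$, $h \in \{1, \ldots, |\tau^t|\}$, $t \in \{1, \ldots, m\}$, generated under a policy $\pi_\theta$, we can estimate the gradient in Eq. 3 at point $\theta$;

$$\nabla_\theta \hat{\eta}(\theta) = \frac{1}{m} \sum_{t=1}^{m} \nabla_\theta \log\Big(\prod_{h=1}^{|\tau^t|} \pi_\theta(a_h^t \mid y_h^t)\Big)\, R(\tau^t) \tag{4}$$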
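The Monte Carlo estimator above can be sketched in a few lines for a tabular softmax policy over observations. The parameterization (`theta[y][a]` as logits), the function names, and the input format are assumptions made for illustration; the sketch only uses observations, actions, and returns, never the latent states or the dynamics.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(theta, y, a):
    """Gradient of log pi_theta(a | y) for a tabular softmax policy.

    theta[y][b] is the logit of action b under observation y (assumed form).
    Only the row of the observed y receives a nonzero gradient.
    """
    probs = softmax(theta[y])
    grad = [[0.0] * len(row) for row in theta]
    for b, p in enumerate(probs):
        grad[y][b] = (1.0 if b == a else 0.0) - p
    return grad


def estimate_policy_gradient(theta, trajectories):
    """Score-function estimate of the gradient of the expected return.

    trajectories: list of (observations, actions, discounted_return) tuples
    sampled under pi_theta. Averages, over trajectories, the summed score
    of the policy along the trajectory times the trajectory's return.
    """
    m = len(trajectories)
    total = [[0.0] * len(row) for row in theta]
    for observations, actions, ret in trajectories:
        for y, a in zip(observations, actions):
            g = grad_log_policy(theta, y, a)
            for i, row in enumerate(g):
                for j, value in enumerate(row):
                    total[i][j] += value * ret / m
    return total
```

In practice one would usually subtract a baseline from the return to reduce variance; that is omitted here to stay close to the estimator as stated.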

### 3.1 Natural Policy Gradient

Generally, the notion of a gradient depends on the metric space in which it lives. Given a pre-specified Riemannian metric, a gradient direction is defined. When the metric is Euclidean, the notion of gradient reduces to the standard gradient (Lee, 2006). This general notion of gradient adjusts the standard gradient direction based on the local curvature induced by the Riemannian manifold of interest. Knowledge of the curvature helps to find an ascent direction that can lead to a large increase in the objective function. This approach is also interpreted as a trust-region method, where we are interested in assuring that the ascent steps do not change the objective beyond a safe region in which the local curvature is still valid. In general, a suitable manifold might not be given, and we need to adopt one. Fortunately, when the objective function is an expectation over a parameterized distribution, Amari (2016) recommends employing the Riemannian metric induced by the Fisher information. This choice of metric results in a well-known notion of gradient, the so-called natural gradient. For the objective function in Eq. 1, the Fisher information matrix is defined as follows;

$$F(\theta) := \int_{\tau \in \Upsilon} f(\tau; \theta)\, \big[\nabla_\theta \log f(\tau; \theta)\big] \big[\nabla_\theta \log f(\tau; \theta)\big]^\top d\tau \tag{5}$$

Natural gradients were first deployed by Kakade (2002) for RL in MDPs. Consequently, the direction of the gradient with respect to $F$ is defined as $F(\theta)^{-1} \nabla_\theta \eta(\theta)$. One can compute the inverse of this matrix to obtain the direction of the natural gradient. Since storing the Fisher matrix is not always possible and computing its inverse is not practical, direct use of $F(\theta)^{-1} \nabla_\theta \eta(\theta)$ is not feasible. As is also done in TRPO, we suggest first deploying the divergence substitution technique and then the conjugate gradient method to tackle the computation and storage bottlenecks.
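As an illustration of how the conjugate gradient method avoids storing or inverting the Fisher matrix, the sketch below solves $F x = g$ using only Fisher-vector products built from per-trajectory score vectors. The function names, the flattened-list representation of parameters, and the damping term are illustrative assumptions rather than the paper's implementation.

```python
def fisher_vector_product(scores, v, damping=1e-8):
    """Compute (F_hat + damping * I) v without forming F_hat.

    scores: list of per-trajectory score vectors (gradients of the
    trajectory log-density), each a flat list of floats.
    F_hat is the empirical average of the outer products s s^T.
    """
    m = len(scores)
    out = [damping * vi for vi in v]
    for s in scores:
        dot = sum(si * vi for si, vi in zip(s, v))
        for i, si in enumerate(s):
            out[i] += si * dot / m
    return out


def conjugate_gradient(matvec, b, iters=50, tol=1e-12):
    """Solve A x = b for a symmetric positive (semi-)definite A.

    matvec(p) must return A p; A itself is never materialized.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(b)            # current search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs < tol:
            break
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Given a gradient estimate `g`, a natural gradient direction is then obtained as `conjugate_gradient(lambda v: fisher_vector_product(scores, v), g)`. Each iteration costs one pass over the score vectors, and memory stays linear in the number of parameters.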

###### Lemma 1.

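Under some regularity conditions;

$$\nabla^2_{\theta'} D_{KL}(\theta, \theta')\big|_{\theta' = \theta} = F(\theta) \tag{6}$$

with $D_{KL}(\theta, \theta') := -\int_{\tau \in \Upsilon} f(\tau; \theta) \log\frac{f(\tau; \theta')}{f(\tau; \theta)}\, d\tau$.

The proof of Lemma 1 is in Subsection A.2. In practice, it is not feasible to compute the expectation in either the Fisher information matrix or the $D_{KL}$ divergence; only their empirical estimates are available. Given $m$ trajectories,

$$\hat{F}(\theta) = \frac{1}{m} \sum_{t=1}^{m} \big[\nabla_\theta \log f(\tau^t; \theta)\big]\big[\nabla_\theta \log f(\tau^t; \theta)\big]^\top, \qquad \hat{D}_{KL}(\theta, \theta') = -\frac{1}{m} \sum_{t=1}^{m} \sum_{h=1}^{|\tau^t|} \log\frac{\pi_{\theta'}(a_h^t \mid y_h^t)}{\pi_{\theta}(a_h^t \mid y_h^t)}$$

This derivation of $D_{KL}$ is common between MDPs and POMDPs. The analyses of most state-of-the-art policy gradient methods, e.g., TRPO and PPO, are dedicated to infinite-horizon MDPs, while almost all the experimental studies are in episodic settings. Therefore the estimator used in these methods;

$$\hat{D}_{KL}^{TRPO}(\theta, \theta') = -\frac{1}{\sum_{t=1}^{m} |\tau^t|} \sum_{t=1}^{m} \sum_{h=1}^{|\tau^t|} \log\frac{\pi_{\theta'}(a_h^t \mid y_h^t)}{\pi_{\theta}(a_h^t \mid y_h^t)}$$

is a biased estimate of $D_{KL}$ in episodic settings.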

###### Remark 1.

[$D_{KL}$ vs $D_{KL}^{TRPO}$] The use of the trajectory-level divergence $D_{KL}$ instead of the per-step divergence $D_{KL}^{TRPO}$ used in TRPO is motivated by theory and is also intuitively appealing. A small change in the policy at the beginning of a short episode does not drastically shift the distribution of that trajectory, but it might cause radical shifts when the horizon is long. Therefore, for longer horizons, the trust region needs to shrink. Consider two trajectories, one long and one short. $D_{KL}$ induces a region which allows larger changes in the policy for the short trajectory while limiting changes for the long one, whereas $D_{KL}^{TRPO}$ induces a region which does not convey the length of each trajectory and treats the samples as if they came from the stationary distribution of an infinite-horizon MDP. Consider a simple game where, at the beginning of learning, when the policy is poor, the agent dies early in each episode and the game terminates. In this case, the trust region under $D_{KL}$ is vast and allows for more substantial changes in the policy space, while $D_{KL}^{TRPO}$ again does not consider the length of the episode. On the other hand, toward the end of learning, when the agent plays a good policy, the horizon grows, and small changes in the policy cause drastic changes in the trajectory distribution. Therefore the trust region shrinks, and only tiny changes in the policy space are allowed, which is again captured by $D_{KL}$ but not by $D_{KL}^{TRPO}$.
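The contrast drawn in this remark can be illustrated numerically. The sketch below assumes two empirical estimators: a trajectory-level one that sums the per-step log-ratios of the new and current policy along each trajectory and averages over trajectories, and a per-step one that averages the same log-ratios over all time steps. These forms and the function names are assumptions chosen to match the description in the remark, not definitions quoted from the paper.

```python
def trajectory_kl_estimate(log_ratios):
    """Trajectory-level estimate: average over trajectories of the summed log-ratios.

    log_ratios[t][h] = log pi_new(a | y) - log pi_current(a | y) at step h of
    trajectory t, with trajectories sampled under the current policy.
    """
    m = len(log_ratios)
    return -sum(sum(traj) for traj in log_ratios) / m


def per_step_kl_estimate(log_ratios):
    """Per-step estimate: average of the log-ratios over all time steps."""
    n_steps = sum(len(traj) for traj in log_ratios)
    return -sum(sum(traj) for traj in log_ratios) / n_steps


def per_trajectory_contributions(log_ratios):
    """Contribution of each trajectory to the trajectory-level estimate."""
    return [-sum(traj) for traj in log_ratios]
```

For example, with one trajectory of 10 steps and one of 100 steps, each step having a log-ratio of -0.01, the per-step estimate is about 0.01 regardless of episode length, while the per-trajectory contributions are about 0.1 and 1.0. Under a fixed divergence budget, the same per-step policy change is therefore far more constrained on the long episode, which is the behavior the remark argues for.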

##### Compatible Function Approximation

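As mentioned before, one way of computing the direction of the natural gradient is to estimate $\hat{F}(\theta)$ and use conjugate gradient methods to find $\hat{F}(\theta)^{-1} \nabla_\theta \eta(\theta)$. There is also another interesting way to estimate $F(\theta)^{-1} \nabla_\theta \eta(\theta)$, which is based on compatible function approximation methods. Kakade (2002) studies this approach in the context of MDPs. In the following, we develop this approach for POMDPs. Consider a feature map $\phi(\tau)$ in some ambient space defined on $\Upsilon$. We approximate the return $R(\tau)$ by a linear function $\omega$ of the feature representation $\phi(\tau)$, i.e.,

$$\min_{\omega}\ \epsilon(\omega) := \mathbb{E}_{\tau \sim f(\cdot; \theta)}\Big[\big(\phi(\tau)^\top \omega - R(\tau)\big)^2\Big]$$

To find the optimal $\omega$ we take the gradient of $\epsilon(\omega)$ and set it to zero;

$$0 = \nabla_\omega \epsilon(\omega)\big|_{\omega = \omega^*} = 2\, \mathbb{E}_{\tau \sim f(\cdot; \theta)}\Big[\phi(\tau)\big(\phi(\tau)^\top \omega^* - R(\tau)\big)\Big]$$

For the optimality,

$$\mathbb{E}_{\tau \sim f(\cdot; \theta)}\big[\phi(\tau)\phi(\tau)^\top\big]\, \omega^* = \mathbb{E}_{\tau \sim f(\cdot; \theta)}\big[\phi(\tau) R(\tau)\big]$$

If we consider $\phi(\tau) = \nabla_\theta \log f(\tau; \theta) = \sum_{h=1}^{|\tau|} \nabla_\theta \log \pi_\theta(a_h \mid y_h)$, the LHS of this equation is $F(\theta)\, \omega^*$ and the RHS is the policy gradient of Eq. 3. Therefore

$$F(\theta)\, \omega^* = \nabla_\theta \eta(\theta) \implies \omega^* = F(\theta)^{-1} \nabla_\theta \eta(\theta)$$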

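In practice, either of the discussed approaches to computing the natural gradient is applicable, and one needs to choose between them depending on the problem and application at hand. Due to the close relationship between $D_{KL}$ and the Fisher information matrix (Lemma 1), and the fact that the Fisher matrix is the Hessian appearing in the second-order Taylor expansion of $D_{KL}$, instead of considering the area $(\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) \le \delta$, or $(\theta' - \theta)^\top \nabla^2_{\theta'} D_{KL}(\theta, \theta')\big|_{\theta' = \theta}\, (\theta' - \theta) \le \delta$, we can approximately consider $D_{KL}(\theta, \theta') \le \delta / 2$. The relationship between these three approaches to trust regions is used throughout this paper.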

## 4 TRPO for POMDPs

In this section we extend the MDP analysis of Kakade and Langford (2002) and Schulman et al. (2015) to POMDPs, propose GTRPO, and derive a guarantee on its monotonic improvement property. We prove the monotonic improvement property using $D_{KL}$. We also develop a new discount-factor-dependent divergence $D_\gamma$ and provide the same guarantee under the new divergence.

The $D_{KL}$ divergence and the Fisher information matrix in Eq. 6 and Eq. 5 do not convey the effect of the discount factor. Consider a setting with a small discount factor $\gamma$. In this setting, we do not mind drastic distributional changes in the later part of episodes. Therefore, we desire an even wider trust region that allows bigger changes for later parts of trajectories. This is a valid intuition, and in the following we re-derive the divergence by also incorporating $\gamma$. Let denote the elements in up to time step ; we rewrite as follows;

Following the reasoning of Amari (2016) for the Fisher information of each component of the sum, we derive a $\gamma$-dependent divergence;

(7) |

for some upper bound on the trajectory length. This divergence penalizes the distribution mismatch in the later parts of trajectories less. Similarly, taking into account the relationship between the KL divergence and the Fisher information, we obtain a discount-factor-dependent definition of the Fisher information;

In the following we develop the GTRPO monotonic improvement guarantee under both $D_{KL}$ and $D_\gamma$.

### 4.1 Advantage function on the hidden states

In the following, let $\pi_\theta$ denote the policy under which we collect data, the so-called current policy, and $\pi_{\theta'}$ the policy whose performance we evaluate, the so-called new policy. Generally, any policy on the observation space is transferable to a policy on the latent states as follows; $\pi(a \mid x) = \int_{y \in \mathcal{Y}} \pi(a \mid y)\, O(y \mid x)\, dy$ for each pair $(x, a)$. Consider the case where the agent also observes the latent state, i.e., the POMDP reduces to an MDP. Since the dynamics on the latent states are an MDP, we define the advantage function on the latent states, at time step $h$ of an episode;

Where denote the value function of underlying MDP of latent states when a policy is deployed. For this choice of advantage function we have Therefore,

If we had the advantage function of the current policy and sampled trajectories from the new policy, we could compute the improvement in the expected return and therefore maximize it. In this case we could also potentially just maximize the expected return of the new policy without incorporating any knowledge from the current policy. In practice, however, we do not have sampled trajectories from the new policy; rather, we have sampled trajectories from the current policy. Therefore, we might be interested in maximizing the following surrogate objective function, since we can compute it;

For infinite-horizon MDPs, when $O$ is an identity matrix, i.e., at each time step $y_h = x_h$, Kakade and Langford (2002) and Schulman et al. (2015) show that optimizing the surrogate objective over the parameters of the new policy can provide an improvement in the expected discounted return. They derive a lower bound on this improvement if the divergence between the current and new policies is bounded at every state. In the following, we extend these analyses to the general class of environments, i.e., POMDPs, and show that such guarantees are preserved.

Generally, in POMDPs, when classes of memoryless policies are considered, neither the $Q$ nor the $V$ function is well-defined as it is for MDPs through the Bellman equation. In the following, we define two quantities similar to $Q$ and $V$ in MDPs, while for simplicity we use the same $Q$ and $V$ notation for them. The conditional value and Q-value functions of POMDPs are

(8) | |||

(9) |

For we relax the conditioning on for and simply denote it as . Deploying these two quantities, we define the advantage function as follows;

Here denotes the reward random variable conditioned on the current observation and action as well as the one-step successor observation. It is worth noting that, since the reward process is not Markovian, this conditioning does not make it independent of the past or future. Furthermore, we define the following surrogate objective function;

(10) |

Similar to MDPs, one can compute and maximize the surrogate objective function in Eq. 10 using only sampled trajectories and the advantage function of the current policy.

###### Lemma 2.

The improvement in expected return, , is as follows;

In practice, one can estimate the advantage function by approximating and using on-policy data of . It is worth noting that also has the following nice property, which follows from the derivation of the policy gradient theorem

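In the following we show that maximizing the surrogate objective over the parameters of the new policy results in a lower bound on the improvement when the current and new policies are close under the $D_{KL}$ or $D_\gamma$ divergence. Let us define the averaged advantage function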

also the maximum span of the averaged advantage function and its discounted sum as follows;

###### Theorem 1 (Monotonic Improvement Guarantee).

For two and , construct , then

and also

###### Proof.

Following the result in the Lemma 2 we have

therefore,

following the definition of

Deploying the maximum span of averaged advantage function and the Pinsker’s inequality we have

Which results in the first part of the theorem. On the other hand

Deploying the definition of and the Pinsker’s inequality again we have

and the second part of the theorem goes through. ∎

Theorem 1 recommends optimizing over in the vicinity defined by the $D_{KL}$ or $D_\gamma$ divergence. Therefore, given the current policy, we are interested in either of the following optimization problems:

Here and are problem-dependent constants, which are also the knobs that restrict the trust region. Similar to TRPO, using and as they are might result in tiny changes in the policy. Therefore, for practical purposes, we view them as the knobs that restrict the trust region, denoted by , and turn these optimization problems into constrained optimization problems;

Taking into account the relationship between the KL divergence and the Fisher information, we can also approximate these two optimization problems by the second-order Taylor expansion of their constraints;

These analyses provide insights for designing algorithms similar to TRPO and PPO for the general class of problems, i.e., POMDPs.

## 5 Experiments

##### Extension to PPO:

Usually, in the low-sample setting, estimating and then constructing the trust region is hard, especially when the region depends on the inverse of the estimated Fisher matrix or requires optimizing over the non-convex function of in the KL divergence. Therefore, trusting the estimated trust region is questionable. While TRPO constructs the trust region in the parameter space, its final goal is to keep the new policy close to the current policy, i.e., small or . Proximal Policy Optimization (PPO) instead imposes the structure of the trust region directly on the policy space. This method approximately translates the constraints developed in TRPO to the policy space. It penalizes the gradient of the objective function when the policy starts to operate beyond the region of trust by setting it to zero.

We dropped the dependency in the advantage function since this approach is for the infinite horizon. If the advantage function is positive and the importance weight is above the upper clipping threshold, this objective function saturates. When the advantage function is negative and the importance weight is below the lower clipping threshold, this objective function saturates again. In either case, when the objective function saturates, its gradient is zero, and therefore further movement in that direction is obstructed. This approach, despite its simplicity, approximates the trust region effectively and substantially reduces the computation cost of TRPO. Note: In the original PPO paper .
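The saturation behavior described above can be checked with a small sketch of the standard PPO clipped surrogate for a single sample. The parameter names `eps_low` and `eps_high` and their default value of 0.2 are illustrative assumptions; they stand in for the lower and upper clipping thresholds discussed in the text.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate for one (observation, action) sample.

    ratio: importance weight pi_new(a | y) / pi_current(a | y).
    advantage: advantage estimate for the sample under the current policy.
    Returns the minimum of the unclipped and clipped objective terms.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the value stops increasing once `ratio` exceeds `1 + eps_high`, and with a negative advantage it stops changing once `ratio` drops below `1 - eps_low`. In both regimes the objective is constant in `ratio`, so its gradient with respect to the policy parameters is zero. In the other two cases (positive advantage with a small ratio, negative advantage with a large ratio) the unclipped term is the minimum and the gradient is retained.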

Following TRPO, the clipping trick ensures that the importance weight, derived from the estimation of , does not go beyond a certain limit, i.e., each . This results in

(11) |

As discussed in Remark 1, we propose a principled change in the clipping such that it matches Eq. 6 and conveys information about the length of episodes; ; therefore for

(12) |

This change ensures more restrictive clipping for longer trajectories and softer clipping for shorter ones. Moreover, as suggested by Theorem 1 and the definition of $D_\gamma$ in Eq. 7, we propose a further extension of the clipping to convey information about the discount factor. For a sample at time step of an episode we have . Therefore;
