# Diverse Exploration via Conjugate Policies for Policy Gradient Methods

We address the challenge of effective exploration while maintaining good performance in policy gradient methods. As a solution, we propose diverse exploration (DE) via conjugate policies. DE learns and deploys a set of conjugate policies which can be conveniently generated as a byproduct of conjugate gradient descent. We provide both theoretical and empirical results showing the effectiveness of DE at achieving exploration, improving policy performance, and the advantage of DE over exploration by random policy perturbations.


## Introduction

Policy gradient (PG) methods [Peters and Schaal2008, Schulman et al.2015, Sutton et al.1999, Wu et al.2017] in reinforcement learning (RL) [Sutton and Barto1998] have shown the ability to train large function approximators with many parameters but suffer from slow convergence and data inefficiency due to a lack of exploration. Achieving exploration while maintaining effective operation is a challenging problem, as exploratory decisions may degrade performance. Conventional exploration strategies which achieve exploration via noisy action selection (like ε-greedy [Sutton and Barto1998]) or actively reducing uncertainty (like R-MAX [Brafman and Tennenholtz2003]) do not guarantee the performance of the behavior policy.

This work follows an alternative line that performs exploration in policy space, and is inspired by two recent advances: Diverse Exploration (DE) [Cohen, Yu, and Wright2018] and parameter space noise for exploration [Plappert et al.2018]. The key insight of DE is that, in many domains, there exist multiple different policies at various levels of policy quality. Effective exploration can be achieved without sacrificing exploitation if an agent learns and deploys a set of diverse behavior policies within some policy performance constraint. This work shares a similar intuitive motivation in the context of PG methods: multiple parameterizations of "good" but, importantly, different policies exist in the local region of a main policy. Deploying a set of these policies increases the knowledge of the local region and can improve the gradient estimate in policy updates. Though similarly motivated, this work provides distinct theoretical results and an algorithmic solution to a unique challenge in PG methods: to maximally explore local policy space in order to improve the gradient estimate while ensuring performance.

Parameter space noise for exploration [Plappert et al.2018] can be thought of as a DE approach specific to the PG context. To achieve exploration, different behavior policies are generated by randomly perturbing policy parameters. To maintain the guarantees of the policy improvement step from the previous iteration, the magnitude of these perturbations has to be limited which inherently limits exploration. Thus, for effective exploration in PG methods, we need an optimal diversity objective and a principled approach of maximizing diversity. In light of this, we propose DE by conjugate policies that maximize a theoretically justified Kullback–Leibler (KL) divergence objective for exploration in PG methods.

The novel contributions of this paper are three-fold. First, it proposes a DE solution via conjugate policies for natural policy gradient (NPG) methods. DE learns and deploys a set of conjugate policies in the local region of policy space and follows the natural gradient descent direction during each policy improvement iteration.

Second, it provides a formal explanation for why DE via conjugate policies is effective in NPG methods. Our theoretical results show that: (1) maximizing the diversity (in terms of KL divergence) among perturbed policies is inversely related to the variance of the perturbed gradient estimate, contributing to more accurate policy updates; and (2) conjugate policies generated by conjugate vectors maximize pairwise KL divergence among a constrained number of perturbations. In addition to justifying DE via conjugate policies, these theoretical results explain why parameter space noise [Plappert et al.2018] improves upon NPG methods but is not optimal in terms of the maximum diversity objective proposed in this work.

Finally, it develops a general algorithmic framework of DE via conjugate policies for NPG methods. The algorithm efficiently generates conjugate policies by taking advantage of conjugate vectors produced in each policy improvement iteration when computing the natural gradient descent direction. Experimental results based on Trust Region Policy Optimization (TRPO) [Schulman et al.2015] on three continuous control domains show that TRPO with DE significantly outperforms the baseline TRPO as well as TRPO with random perturbations.

## Preliminaries

RL problems are described by Markov Decision Processes (MDPs) [Puterman1994]. An MDP, $M$, is defined as a 5-tuple, $M = (S, A, P, r, \gamma)$, where $S$ is a fully observable set of states, $A$ is a set of possible actions, $P$ is the state transition model such that $P(s' \mid s, a)$ describes the probability of transitioning to state $s'$ after taking action $a$ in state $s$, $r(a, s)$ is the expected value of the immediate reward after taking $a$ in $s$, resulting in $s'$, and $\gamma$ is the discount factor on future rewards. A trajectory of length $T$ is an ordered set of transitions: $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1})$.

A solution to an MDP is a policy, $\pi(a \mid s)$, which provides the probability of taking action $a$ in state $s$ when following policy $\pi$. The performance of policy $\pi$ is the expected discounted return

$$J(\pi) = \mathbb{E}_{\tau}[R(\tau)] = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(a_t, s_t)\right], \quad \text{where } s_0 \sim \rho(s_0),\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t),$$

and $\rho$ is the distribution over start states.

The state-action value function, value function and advantage function are defined as:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(a_{t+l}, s_{t+l})\right], \qquad V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(a_{t+l}, s_{t+l})\right],$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t), \quad \text{where } a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t).$$

$\pi$ is represented by a function approximator such as a neural network parameterized by vector $\theta$. These methods maximize the expected return of $\pi_\theta$ via gradient descent on the objective function:

$$\max_{\theta} J(\theta) = \mathbb{E}_{\tau}[R(\tau)].$$

The gradient of the objective is:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, R_t(\tau)\right],$$

which is derived using the likelihood ratio. This is estimated empirically via

$$\widehat{\nabla_{\theta} J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, \hat{A}^{\pi}(s_t, a_t)\right],$$

where an empirical estimate $\hat{A}^{\pi}(s_t, a_t)$ of the advantage function is used instead of $R_t(\tau)$ to reduce variance and $N$ is the number of trajectories. The policy update is then $\theta_{i+1} = \theta_i + \alpha\, \widehat{\nabla_{\theta} J}(\theta)$, where $\alpha$ is the stepsize. This is known as the 'vanilla' policy gradient.
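As an illustrative sketch of this empirical estimator (assuming a toy linear-softmax policy over discrete actions; `grad_log_pi`, the trajectories, and the advantage values below are hypothetical stand-ins, not the paper's setup):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s; theta) for a linear-softmax policy.

    theta has shape (n_actions, n_features).
    """
    probs = softmax(theta @ s)
    grad = -np.outer(probs, s)   # -pi(b|s) * s for every action row b
    grad[a] += s                 # +s for the action actually taken
    return grad

def vanilla_pg_estimate(theta, trajectories, advantages):
    """(1/N) sum_i sum_t grad log pi(a_t|s_t; theta) * A_hat(s_t, a_t)."""
    g = np.zeros_like(theta)
    for traj, advs in zip(trajectories, advantages):
        for (s, a), a_hat in zip(traj, advs):
            g += grad_log_pi(theta, s, a) * a_hat
    return g / len(trajectories)

rng = np.random.default_rng(0)
theta = np.zeros((2, 3))  # 2 actions, 3 state features
trajectories = [[(rng.normal(size=3), int(rng.integers(2))) for _ in range(5)]
                for _ in range(4)]
advantages = [rng.normal(size=5) for _ in range(4)]
g_hat = vanilla_pg_estimate(theta, trajectories, advantages)
print(g_hat.shape)  # (2, 3)
```

The update $\theta \leftarrow \theta + \alpha\, \hat{g}$ then gives one vanilla PG step.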

### Natural Gradient Descent and TRPO

A shortcoming of vanilla PG methods is that they are not invariant to the scale of parameterization, nor do they consider the more complex manifold structure of parameter space. Natural gradient descent methods [Kakade2002, Amari and Nagaoka2000] address this by correcting for the curvature of the parameter space manifold by scaling the gradient with the inverse Fisher Information Matrix (FIM) $F_\theta^{-1}$, where

$$F_{ij,\theta} = -\mathbb{E}_{s \sim \rho}\left[\frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} \log \pi_\theta(\cdot \mid s)\right]$$

is the $(i,j)$th entry in the FIM and $\rho$ is the state distribution induced by policy $\pi_\theta$. The natural policy gradient descent direction and policy update are then

$$\widetilde{\nabla}_{\theta} J(\theta) = F_\theta^{-1} \nabla_{\theta} J(\theta), \qquad \theta_{i+1} = \theta_i + \alpha\, \widetilde{\nabla}_{\theta} J(\theta).$$

Selecting the stepsize $\alpha$ is not trivial. TRPO [Schulman et al.2015], a robust, state-of-the-art approach, follows the natural gradient descent direction from the current policy $\pi_{\theta'}$ but enforces a strict KL divergence constraint by optimizing

$$\max_{\theta} \mathbb{E}_{s \sim \rho_{\theta'},\, a \sim \pi_{\theta'}}\left[\frac{\pi(a \mid s; \theta)}{\pi(a \mid s; \theta')}\, R_t(\tau)\right] \quad \text{subject to } D_{KL}(\pi_\theta \,\|\, \pi_{\theta'}) \leq \delta,$$

which is equivalent to the standard objective. The KL divergence between two policies is

$$D_{KL}(\pi_\theta \,\|\, \pi_{\theta'}) := \mathbb{E}_{s \sim \rho}\left[D_{KL}(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s))\right].$$

Via a Taylor expansion of $D_{KL}$, one can obtain the following local approximation

$$\widetilde{D}_{KL}(\theta \,\|\, \theta + d\delta) = \frac{1}{2} (\theta + d\delta - \theta)^T F_\theta (\theta + d\delta - \theta) = \frac{1}{2}\, d\delta^T F_\theta\, d\delta,$$

where $\theta$ and $\theta + d\delta$ parameterize two policies. This approximation is used throughout this work.
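The quality of this quadratic approximation can be checked on a one-dimensional Gaussian policy with fixed standard deviation, where the Fisher information with respect to the mean is $1/\sigma^2$; this is an illustrative special case (not the paper's policy class) in which the approximation happens to be exact:

```python
# For N(theta, sigma^2) with fixed sigma, KL(N(theta)||N(theta')) =
# (theta - theta')^2 / (2 sigma^2) and the FIM w.r.t. the mean is 1/sigma^2,
# so the quadratic form (1/2) d F d matches the exact KL.

sigma = 0.5
F = 1.0 / sigma**2  # Fisher information of the mean parameter

def exact_kl(theta, theta_prime):
    return (theta - theta_prime) ** 2 / (2 * sigma**2)

def approx_kl(d_delta):
    return 0.5 * d_delta * F * d_delta

theta, d = 0.3, 0.05
print(exact_kl(theta, theta + d), approx_kl(d))  # both 0.005 (up to rounding)
```

For general policies the two quantities agree only to second order in the perturbation, which is why the paper bounds perturbations to a small local region.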

### Parameter Space Noise for Exploration

The reinforcement learning gradient estimation can be generalized, with inspiration from evolutionary strategies [Wierstra et al.2014], by sampling parameters from a search distribution [Plappert et al.2018]:

$$\nabla_{\phi,\Sigma}\, \mathbb{E}_{\theta \sim \mathcal{N}(\phi,\Sigma),\, \tau \sim \pi}[R(\tau)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I),\, \tau \sim \pi}\left[\sum_{t=0}^{T} \nabla_{\phi,\Sigma} \log \pi(a_t \mid s_t; \phi + \epsilon \Sigma^{\frac{1}{2}})\, R_t(\tau)\right] \tag{1}$$

which is derived using likelihood ratios and reparameterization [Kingma and Welling2014]. The corresponding empirical estimate is

$$\frac{1}{N} \sum_{i=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\phi,\Sigma} \log \pi(a_t \mid s_t; \phi + \epsilon_i \Sigma^{\frac{1}{2}})\, \hat{A}^{\pi}(s_t, a_t)\right].$$

This gradient enables exploration because it aggregates samples from multiple policies (each $\phi + \epsilon_i \Sigma^{\frac{1}{2}}$ defines a different, perturbed policy). It may seem that using trajectories collected from perturbed policies introduces off-policy bias (and it would for the standard on-policy gradient estimate). However, the generalized objective in Equation (1) does not have this issue since the gradient is computed over a perturbation distribution.

Perturbation approaches suffer from an exploration-exploitation dilemma. Large perturbations increase exploration but potentially degrade performance since the perturbed policy becomes significantly different from the main, unperturbed policy. A small perturbation provides limited exploration but will benefit from online performance similar to that of the main policy. Our approach, DE, is designed to maximize diversity in a limited number of perturbations within a bounded local region of parameter space, addressing a tradeoff which random perturbations do not. From here on, we refer to Equation (1) as the "perturbed gradient estimate" and refer to the approach that samples random perturbations as RP.

## Theoretical Analysis

In this section, we first prove that increasing the KL divergence between perturbed policies reduces the variance of the perturbed gradient estimate. Further, we prove that conjugate vectors maximize pairwise KL divergence among a constrained number of perturbations.

### Variance Reduction of Parameter Perturbation Gradient Estimator

We consider the general case where the perturbations are drawn from a perturbation distribution $\mathcal{P}$. When $\mathcal{P} = \mathcal{N}(0, \Sigma)$, we recover the gradient in Equation (1) in the Preliminaries. To simplify notation in the variance analysis of the perturbed gradient estimate, we write $\pi_\epsilon$ as shorthand for $\pi(a \mid s; \phi + \epsilon)$, the policy with parameters perturbed by $\epsilon$. Moreover,

$$G_\epsilon := \mathbb{E}_{\tau \sim \pi_\epsilon}\left[\sum_{t=0}^{T} \nabla_\phi \log \pi_\epsilon(a_t \mid s_t)\, R_t(\tau)\right]$$

is the gradient with respect to $\phi$ with perturbation $\epsilon$. The final estimate of the true gradient in Equation (1) is the Monte Carlo estimate of $G_\epsilon$ over $k$ perturbations. For any $\epsilon$, $G_\epsilon$ is an unbiased estimate of the gradient, so the averaged estimator is too. Therefore, by reducing the variance, we reduce the estimate's mean squared error. The variance of the estimate over $k$ perturbations is

$$\mathbb{V}\left(\frac{1}{k} \sum_{i=1}^{k} G_{\epsilon_i}\right) = \frac{1}{k^2} \sum_{i=1}^{k} \mathbb{V}_{\epsilon_i}(G_{\epsilon_i}) + \frac{2}{k^2} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{Cov}_{\epsilon_i,\epsilon_j}(G_{\epsilon_i}, G_{\epsilon_j}) \tag{2}$$

where $\mathbb{V}_{\epsilon_i}(G_{\epsilon_i})$ is the variance of the gradient estimate $G_{\epsilon_i}$ and $\mathrm{Cov}_{\epsilon_i,\epsilon_j}(G_{\epsilon_i}, G_{\epsilon_j})$ is the covariance between the gradients $G_{\epsilon_i}$ and $G_{\epsilon_j}$.

$\mathbb{V}_{\epsilon_i}(G_{\epsilon_i})$ is equal to a constant for all $i$ because the $\epsilon_i$ are identically distributed. So, the first term in Equation (2) approaches zero as $k$ increases and does not contribute to the asymptotic variance. The covariance term determines whether the overall variance can be reduced. To see this, consider the extreme case when $\epsilon_i = \epsilon_j$ for all $i, j$. Equation (2) becomes $\mathbb{V}_{\epsilon}(G_{\epsilon})$ because all $\mathrm{Cov}_{\epsilon_i,\epsilon_j}(G_{\epsilon_i}, G_{\epsilon_j}) = \mathbb{V}_{\epsilon}(G_{\epsilon})$. The standard PG estimation (i.e. TRPO) falls into this extreme as a special case of the perturbed gradient estimate where all perturbations are the zero vector.

Next consider the special case where $\mathrm{Cov}_{\epsilon_i,\epsilon_j}(G_{\epsilon_i}, G_{\epsilon_j}) = 0$ for $i \neq j$. Then, the second term vanishes and $\mathbb{V}(\frac{1}{k}\sum_{i=1}^{k} G_{\epsilon_i}) = \frac{1}{k}\mathbb{V}_{\epsilon}(G_{\epsilon})$. The RP approach strives for this case by i.i.d. sampling of perturbations $\epsilon_i$. This explains why RP was shown to outperform TRPO in some experiments [Plappert et al.2018]. However, it is important to note that i.i.d. $\epsilon_i$ do not necessarily produce uncorrelated gradients, as this depends on the local curvature of the objective function. For example, perturbations in a flat portion of parameter space will produce equal gradient estimates that are perfectly positively correlated. Thus, the $G_{\epsilon_i}$ are identically distributed but not necessarily independent. This suggests that a perturbation distribution such as $\mathcal{N}(0, \Sigma)$ may suffer from potentially high variance if further care is not taken. This work develops a principled way to select perturbations in order to reduce the covariance.
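The two extremes of Equation (2) can be made concrete with a toy simulation, where Gaussian samples stand in for the per-perturbation gradient estimates $G_\epsilon$ (a deliberate simplification for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
k, trials = 10, 20000

# Uncorrelated case: Cov(G_i, G_j) = 0 for i != j, so the variance of the
# mean shrinks by a factor of k.
indep = rng.normal(size=(trials, k)).mean(axis=1)

# Perfectly correlated case: all G_i identical (as when every perturbation
# is the same), so averaging gains nothing.
shared = rng.normal(size=(trials, 1)).repeat(k, axis=1).mean(axis=1)

print(indep.var(), shared.var())  # roughly 1/k = 0.1 versus roughly 1.0
```

Real perturbed gradients fall between these extremes, which is why controlling the covariance term matters.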

There are two major sources of variance in the covariance terms: the correlations between the perturbed policies $\pi_{\epsilon_i}$ and $\pi_{\epsilon_j}$, and correlations related to the returns $R_t(\tau)$. The difference in performance of two policies (as measured by $J$) can be bounded by a function of the average KL divergence between them [Schulman et al.2015]. So, the contribution to the covariance from the returns will be relatively fixed since all perturbations have a bounded KL divergence to the main policy. In view of this, we focus on controlling the correlation between $\pi_{\epsilon_i}$ and $\pi_{\epsilon_j}$.

This brings us to Theorem 1 (with proof in the supplementary file), in which we show that maximizing the diversity in terms of KL divergence between two perturbed policies $\pi_{\epsilon_i}$ and $\pi_{\epsilon_j}$ minimizes the trace of the covariance between $G_{\epsilon_i}$ and $G_{\epsilon_j}$.

###### Theorem 1.

Let $\epsilon_i$ and $\epsilon_j$ be two perturbations such that $\|\epsilon_i\| = \|\epsilon_j\| = c$. Then, (1) the trace of $\mathrm{Cov}_{\epsilon_i,\epsilon_j}(G_{\epsilon_i}, G_{\epsilon_j})$ is minimized and (2) the estimated KL divergence $\widetilde{D}_{KL}(\phi + \epsilon_i \,\|\, \phi + \epsilon_j)$ is maximized, when $\epsilon_i = -\epsilon_j$ and they are along the direction of the eigenvector of $F_\phi$ with the largest eigenvalue.

This theorem shows that, when two perturbations $\epsilon_i$ and $\epsilon_j$ have a fixed norm $c$, the perturbations that maximize the KL divergence and also minimize the trace of the covariance are uniquely defined by the positive and negative directions along the eigenvector with the largest eigenvalue. This provides a principled way to select two perturbations to minimize the covariance.
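The KL part of this claim is easy to check numerically: among perturbation pairs of a fixed norm $c$, the pair $\epsilon, -\epsilon$ along the top eigenvector of $F$ maximizes $\frac{1}{2}(\epsilon_i - \epsilon_j)^T F (\epsilon_i - \epsilon_j)$. Here a random positive-definite matrix stands in for the FIM (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 6, 0.1
A = rng.normal(size=(n, n))
F = A @ A.T + n * np.eye(n)   # symmetric positive definite stand-in for the FIM

eigvals, eigvecs = np.linalg.eigh(F)
v_top = eigvecs[:, -1]        # eigenvector with the largest eigenvalue

def div(e_i, e_j):
    d = e_i - e_j
    return 0.5 * d @ F @ d    # local KL approximation between the two perturbations

best = div(c * v_top, -c * v_top)  # equals 2 c^2 lambda_max

# No random pair of the same norm should beat the +/- top-eigenvector pair.
worst_gap = min(
    best - div(*(c * u / np.linalg.norm(u) for u in rng.normal(size=(2, n))))
    for _ in range(1000)
)
print(worst_gap >= 0)
```

The bound follows since $\frac{1}{2} d^T F d \leq \frac{1}{2}\lambda_{\max}\|d\|^2$ and $\|\epsilon_i - \epsilon_j\| \leq 2c$, with equality exactly at the antipodal top-eigenvector pair.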

### Conjugate Vectors Maximize KL Divergence

In domains with high sample cost, there is likely a limit on the number of samples an agent can collect, and so too on the number of policies which can be deployed per iteration. Therefore, it is important to generate a small number of perturbations which yield maximum variance reduction. Theorem 1 shows that reduction of the covariance can be achieved by maximizing the KL divergence, and that eigenvectors can achieve this. Eigenvectors are a special case of what are known as conjugate vectors. Later in this section, Theorem 2 shows that, for a fixed number of perturbations, conjugate vectors maximize the sum of the pairwise KL divergences. We first establish notation.

Since the FIM is symmetric positive definite, there exist $n$ conjugate vectors $\{\mu_1, \mu_2, \ldots, \mu_n\}$ with respect to $F_\phi$, where $n$ is the length of the parameter vector $\phi$. Formally, $\mu_i$ and $\mu_j$ are conjugate if $\mu_i^T F_\phi \mu_j = 0$ for $i \neq j$. We define $\pi_i$ and $\pi_j$ as conjugate policies if their parameterizations can be written as $\phi + \mu_i$ and $\phi + \mu_j$ for two conjugate vectors $\mu_i$ and $\mu_j$. $\{\mu_1, \mu_2, \ldots, \mu_n\}$ forms a basis, so any local perturbation to $\phi$, after scaling, can be written as a linear combination of the $\mu_i$:

$$\epsilon = \eta_1 \mu_1 + \eta_2 \mu_2 + \cdots + \eta_n \mu_n \quad \text{where } \|\eta\| \leq 1. \tag{3}$$

For convenience, we assume that $\eta_i \geq 0$. Since the negative of a conjugate vector is also conjugate, if there is a negative $\eta_i$, we may flip the sign of the corresponding $\mu_i$ to make it positive.

Recall the approximation of KL divergence from the Preliminaries,

$$\widetilde{D}_{KL}(\phi \,\|\, \phi + \epsilon) = \frac{1}{2} \epsilon^T F_\phi\, \epsilon.$$

The measure of KL divergence that concerns us is the total divergence between all pairs of perturbed policies:

$$\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \widetilde{D}_{KL}(\phi + \epsilon_j \,\|\, \phi + \epsilon_i) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \frac{1}{2} (\epsilon_i - \epsilon_j)^T F_\phi (\epsilon_i - \epsilon_j) \tag{4}$$

where $k$ is the number of perturbations. Note that we use $\phi$ and not $\phi + \epsilon_j$ in the subscript of the FIM, which would be more precise with respect to the local approximation. The use of the former is a practical choice which allows us to estimate a single FIM and avoid estimating the FIM of each perturbation. Estimating the FIM is already a computational burden and, since perturbations are small and bounded, using $F_\phi$ instead of $F_{\phi + \epsilon_j}$ has little effect and performs well in practice as demonstrated in experiments. For the remainder of this section, we omit $\phi$ in the subscript of $F$ for convenience. The constraint on the number of perturbations brings us to the following optimization problem, which optimizes a set of perturbations $P$ to maximize (4) while constraining $|P|$:

$$P^* = \arg\max_{P} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \widetilde{D}_{KL}(\phi + \epsilon_j \,\|\, \phi + \epsilon_i) \quad \text{subject to } |P| = k \leq n. \tag{5}$$

We define $\|\cdot\|_F$ as the norm induced by $F$, that is,

$$\|x\|_F = x^T F x.$$

Without loss of generality, assume the conjugate vectors are ordered with respect to the $F$-norm,

$$\|\mu_1\|_F \geq \|\mu_2\|_F \geq \cdots \geq \|\mu_n\|_F.$$

The following theorem gives an optimal solution to the objective (5). The proof comes easily by induction on $k$ (full details in the supplementary file).

###### Theorem 2.

The set of the top $k$ conjugate vectors $\{\mu_1, \ldots, \mu_k\}$ maximizes the objective (5) among any $k$ perturbations.

If we relax the assumption that $\eta_i \geq 0$, then the set of vectors that maximizes the objective (5) simply includes the negative of each conjugate vector as well, i.e., $\{\pm\mu_1, \ldots, \pm\mu_k\}$. Including the negatives of perturbations is known as symmetric sampling [Sehnke et al.2010], which is discussed further in the next section.

Theorem 2 makes clear that randomly generated perturbations will, with high probability, be sub-optimal with respect to the objective (5) because the optimal solution is uniquely the top $k$ conjugate vectors. Identifying the top $k$ conjugate vectors in each iteration of policy improvement will require significant computation when the FIM is large. Fortunately, there exist computationally efficient methods of generating sequences of conjugate vectors, such as conjugate gradient descent [Wright and Nocedal1999] (to be discussed), although they may not provide the top $k$. From Theorem 2, we also observe that when all conjugate vectors have the same $F$-norm, any set of $k$ conjugate vectors maximizes the objective (5). If we bound the perturbation radius (the maximum KL divergence a perturbation may have from the main policy) as in [Plappert et al.2018], DE achieves a computationally efficient, optimal solution to the objective (5).
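The structural reason conjugate perturbations suit objective (4)-(5) is that the cross term $\epsilon_i^T F \epsilon_j$ vanishes, so every pairwise divergence is simply the sum of the two individual divergences from the main policy. A small numeric check, using eigenvectors of a random positive-definite stand-in for the FIM as the conjugate set:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.normal(size=(n, n))
F = A @ A.T + n * np.eye(n)   # SPD stand-in for the FIM
_, mu = np.linalg.eigh(F)     # columns are mutually F-conjugate

max_cross = 0.0      # largest |eps_i^T F eps_j| over pairs
max_mismatch = 0.0   # largest |pairwise KL - sum of individual KLs|
for i in range(n):
    for j in range(i + 1, n):
        e_i, e_j = mu[:, i], mu[:, j]
        max_cross = max(max_cross, abs(e_i @ F @ e_j))
        pair = 0.5 * (e_i - e_j) @ F @ (e_i - e_j)
        solo = 0.5 * e_i @ F @ e_i + 0.5 * e_j @ F @ e_j
        max_mismatch = max(max_mismatch, abs(pair - solo))

print(max_cross < 1e-8, max_mismatch < 1e-8)
```

For non-conjugate perturbations, a positive cross term subtracts from the pairwise divergence, which is how randomly drawn perturbations lose diversity.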

## Method

In this section, we first discuss an efficient method to generate conjugate policies and then provide a general algorithmic framework of DE via conjugate policies.

### Generating Conjugate Policies

Generating conjugate policies by finding the top $k$ conjugate vectors is feasible but computationally expensive. It would require estimating the full empirical FIM of a large neural network (for which efficient approximate methods exist [Grosse and Martens2016]) and a decomposition into conjugate vectors. We avoid this additional computational burden altogether and generate conjugate policies by taking advantage of runoff from the conjugate gradient descent (CGD) algorithm [Wright and Nocedal1999]. CGD is often used to efficiently approximate the natural gradient descent direction, as in [Schulman et al.2015].

CGD iteratively minimizes the error in the estimate of the natural gradient descent direction along a vector conjugate to all directions minimized in previous iterations. DE utilizes these conjugate vectors as perturbations. Although they are not necessarily the top $k$ conjugate vectors, they are computed essentially for free because they are generated from one application of CGD when estimating the natural gradient descent direction. To account for the suboptimality, we introduce a perturbation radius $\delta_p$ such that for any perturbation $\epsilon$,

$$\widetilde{D}_{KL}(\phi \,\|\, \phi + \epsilon) \leq \delta_p. \tag{6}$$

We can perform a line search along each perturbation direction such that $\widetilde{D}_{KL}(\phi \,\|\, \phi + \epsilon) = \delta_p$. With this constraint, the use of any vectors is optimal as long as they are conjugate, and the benefit comes from achieving the optimal pairwise divergence.
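The byproduct described here can be sketched directly: solving $Fx = g$ by conjugate gradient produces a sequence of search directions $p_0, p_1, \ldots$ that are mutually $F$-conjugate, which DE reuses as perturbation directions. A small random SPD system stands in for the FIM and gradient (an illustrative assumption, not the paper's training loop):

```python
import numpy as np

def cg_directions(F, g, iters):
    """Run conjugate gradient on Fx = g, returning the search directions."""
    x = np.zeros_like(g)
    r = g - F @ x          # residual
    p = r.copy()           # first search direction
    dirs = []
    for _ in range(iters):
        alpha = (r @ r) / (p @ F @ p)
        x = x + alpha * p
        dirs.append(p.copy())          # saved as a perturbation direction
        r_new = r - alpha * (F @ p)
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x, dirs

rng = np.random.default_rng(3)
n = 8
A = rng.normal(size=(n, n))
F = A @ A.T + n * np.eye(n)
g = rng.normal(size=n)

x, dirs = cg_directions(F, g, iters=4)
max_cross = max(abs(dirs[i] @ F @ dirs[j])
                for i in range(4) for j in range(i + 1, 4))
print(max_cross < 1e-6)  # the saved directions are mutually F-conjugate
```

Each saved direction would then be rescaled by a line search to sit exactly at the perturbation radius $\delta_p$.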

For each conjugate vector, we also include its negative (i.e., symmetric sampling), as motivated by the more general form of Theorem 2 with relaxed assumptions (without $\eta_i \geq 0$). In methods following different gradient frameworks, symmetric sampling was used to improve gradient estimations by alleviating a possible bias due to a skewed reward distribution [Sehnke et al.2010]. Finally, we linearly reduce $\delta_p$, motivated by the observation in [Cohen, Yu, and Wright2018] that, as a policy approaches optimal, there exist fewer policies with similar performance.

### Algorithm Framework

A general framework for DE is sketched in Algorithm 1. In line 1, DE assumes a starting policy (e.g., one generated randomly) which is used to initialize conjugate policies as exact copies. The initial parameterization of the starting policy is the mean vector $\phi$. The number of conjugate policies to be generated, and the numbers of samples to collect from the main and conjugate policies, are user-defined arguments; their relative values control how much exploration will be performed by conjugate policies. It is worth noting that DE reduces to the standard PG algorithm when no conjugate policies are generated or no samples are collected from them.

In the $i$th iteration, after sampling the main and conjugate policies in line 3, line 4 updates $\phi_i$ via natural gradient descent using the perturbed gradient estimate, and returns the updated policy parameterized by $\phi_{i+1}$ and the set of conjugate policies parameterized by $\phi_{i+1}$ perturbed by conjugate vectors; line 4 is a placeholder for any RL algorithm that accomplishes this. Computing perturbations could be done in a separate subroutine (i.e., estimating the FIM and taking an eigendecomposition). When computing the natural gradient by CGD as discussed in the previous section, the intermediate conjugate vectors are saved to be used as perturbations.
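A minimal runnable skeleton of this framework, with hypothetical placeholders: `diverse_exploration` elides trajectory sampling and the NPG/TRPO update, and the conjugate directions and FIM below are toy stand-ins rather than quantities from a real policy network:

```python
import numpy as np

def scale_to_radius(eps, F, delta_p):
    """Scale eps so the local KL approximation (1/2) eps^T F eps equals delta_p."""
    return eps * np.sqrt(2.0 * delta_p / (eps @ F @ eps))

def diverse_exploration(phi, F, conj_dirs, delta_p, iterations):
    policies_per_iter = []
    for _ in range(iterations):
        # Deploy each conjugate direction and its negative (symmetric sampling),
        # line-searched to sit exactly at the perturbation radius.
        perturbed = []
        for mu in conj_dirs:
            eps = scale_to_radius(mu, F, delta_p)
            perturbed += [phi + eps, phi - eps]
        policies_per_iter.append(perturbed)
        # Placeholder: a real implementation samples all policies here and
        # takes a natural gradient step with the perturbed gradient estimate.
        delta_p = max(delta_p - 0.005, 0.0)  # linearly reduce the radius
    return phi, policies_per_iter

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, n))
F = A @ A.T + n * np.eye(n)                      # SPD stand-in for the FIM
dirs = [np.linalg.eigh(F)[1][:, i] for i in range(2)]
phi, pols = diverse_exploration(np.zeros(n), F, dirs, delta_p=0.05, iterations=3)
print(len(pols[0]))  # 4 perturbed policies: 2 conjugate directions x (+/-)
```

The scaling step mirrors the line search to the boundary of constraint (6), so every deployed policy sits at KL radius $\delta_p$ from the main policy.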

## Empirical Study

We evaluate the impact of DE via conjugate policies on TRPO [Schulman et al.2015]. TRPO is state-of-the-art in its ability to train large neural networks as policies for complex problems. In its standard form, TRPO only uses on-policy data, so its capacity for exploration is inherently limited.

In experiments, we investigate three aspects of DE in comparison with baseline methods: first, the performance of all deployed policies through iterations of policy improvement (it is important to examine the performance of not only the main policy but also the perturbed policies in order to take the cost of exploration into account); second, the pairwise KL divergence achieved by the perturbed policies of DE and RP, which measures the diversity of the perturbed policies; and third, the trace of the covariance matrix of perturbed gradient estimates. We demonstrate that high KL divergence correlates with a low trace of covariance, in support of the theoretical analysis. Additionally, we demonstrate the diminishing benefit of exploration when decreasing the number of perturbed policies.

### Methods in Comparison

We use two different versions of TRPO as baselines: standard TRPO, and TRPO with random perturbations (RP) and symmetric sampling. The RP baseline follows the same framework as DE but with random perturbations instead of conjugate perturbations. When implementing RP, we replace learning the covariance $\Sigma$ in the perturbed gradient estimate with a fixed $\Sigma$, as in [Plappert et al.2018], in which it was noted that the computation for learning $\Sigma$ was prohibitively costly. The authors also propose a simple scheme to adjust the noise scale to control for parameter sensitivity to perturbations. The adjustment ensures perturbed policies maintain a bounded distance to the main policy. We achieve this by, for both conjugate and random perturbations, searching along the perturbation direction to find the parameterization furthest from the main policy but still within the perturbation radius $\delta_p$. In light of the theoretical results, the use of symmetric sampling in RP serves as a more competitive baseline.

Policies are represented by feedforward neural networks with two hidden layers of 32 nodes and nonlinear activation functions. We found that increasing the complexity of the networks did not significantly impact performance and only increased computation cost. Additionally, we use layer normalization [Ba, Kiros, and Hinton2016], as in [Plappert et al.2018], to ensure that networks are sensitive to perturbations. Policies map a state to the mean of a Gaussian distribution, with an independent variance for each action dimension that is independent of the state, as in [Schulman et al.2015]. We significantly constrain the values of these variance parameters to align with the motivation for parameter perturbation approaches discussed in the Introduction. This also limits the degree of exploration resulting from noisy action selection. We use the TD(1) [Sutton and Barto1998] algorithm to estimate a value function $\hat{V}$ over all trajectories collected by both the main and perturbed policies. To estimate the advantage function, the empirical return of the trajectory is used as the $Q^{\pi}$ component and $\hat{V}$ as a baseline. TRPO hyperparameters are taken from [Schulman et al.2015, Duan et al.2016].

We display results on three difficult continuous control tasks, Hopper, Walker and HalfCheetah, implemented in OpenAI gym [Brockman et al.2016] using the MuJoCo physics simulator [Todorov, Erez, and Tassa2012]. As mentioned in the discussion of Algorithm 1, the numbers of perturbed policies and of samples collected from each determine the exploration performed by perturbed policies. TRPO is at the extreme of minimal exploration since all samples come from the main policy. To promote exploration, in DE and RP we collect samples equally from all policies. We use fewer perturbations for Hopper than for Walker and HalfCheetah, for both DE and RP; Walker and HalfCheetah each have more action dimensions than Hopper and so require more exploration and hence more agents. The total number of samples collected in each policy improvement iteration is held fixed across methods: TRPO collects all of them from the main policy, while DE and RP divide the same budget equally between the main and perturbed policies. Through our experiments, we observed a trend of diminishing effect of exploration on policy performance when the total samples are held constant and the number of perturbed policies increases. The initial perturbation radius was tuned per domain (one value shared by Hopper and HalfCheetah, another for Walker). Larger perturbation radii yielded performance similar to the reported results but suffered from greater instability. Reducing sensitivity to this hyperparameter is a direction for future research.

### Results

The two rows of Figure 1 and Table 1 aim to address the three points of investigation raised at the beginning of this section. Our goal is to show that perturbations with larger pairwise KL divergence are key to both strong online performance and enhanced exploration.

In the first column of Figure 1 and Table 1, we report results on the Hopper domain. Figure (a) contains curves of the average performance (sum of all rewards per episode) attained by TRPO, RP and DE. For RP and DE, this average includes the main and perturbed policies. RP has a slight performance advantage over TRPO throughout all iterations and converges to a superior policy. DE shows a statistically significant advantage in performance over RP and TRPO, as confirmed by a two-sided paired t-test of the average performance at each iteration. Additionally, DE converges to a stronger policy and shows a larger rate of increase over both RP and TRPO. DE also results in the smallest variance in policy performance, as shown by the interquartile range (IQR), which indicates that DE escapes local optima more consistently than the baselines. These results demonstrate the effect of enhanced exploration by DE over TRPO and RP.

The trace of covariance of the perturbed gradient estimates are contained in Figure (d). Note, the covariance of TRPO gradient estimates can be computed by treating TRPO as RP but with policies perturbed by the zero vector. Interestingly, Figure (d) shows an increasing trend for all approaches. We posit two possible explanations for this; that policies tend to become more deterministic across learning iterations as they improve and, for DE and RP, the decreasing perturbation radius. Ultimately, both limit the variance of action selection and so yield more similar gradient estimates. Nevertheless, at any iteration, DE can significantly reduce the trace of covariance matrix due to its diversity.

The Hopper column of Table 1 reports the average total pairwise KL divergence over all perturbed policies. DE's conjugate policies have significantly larger pairwise KL divergence than RP's. This significant advantage in pairwise KL divergence yields lower variance gradient estimates, which explains the observed superiority in performance, rate of improvement and lower IQR as discussed.

Similar trends are observed in Figures (b) and (e) and the Walker column of Table 1. The performance of DE is clearly superior to both baselines but, due to the higher variance of the baselines' performance, does not yield a statistically significant advantage. Despite this, DE maintains a significantly higher KL divergence between perturbed policies and significantly lower trace covariance estimates across iterations. The same trends are observed in Figures (c) and (f) and the HalfCheetah column of Table 1: DE shows a statistically significant advantage in terms of performance and pairwise KL divergence over RP and TRPO despite their more similar covariance estimates.

Finally, we present a study of the impact of decreasing the number of perturbed policies, while keeping the samples collected constant, on the Hopper domain. In Figure 2, we display the average performance of DE for decreasing numbers of perturbed policies as well as TRPO (no perturbed policies). Decreasing the number of perturbed policies leads to decreasing average performance and rate of improvement, and to increasing performance variance. Both of these observations demonstrate that increasing diversity among behavior policies is key to strong online performance and exploration.

## Related Work

Achieving exploration by deploying multiple policies has surfaced in the literature in varied contexts. The most closely related is [Plappert et al.2018], which follows the same framework but does not provide theory or consider an optimal diversity objective. [Cohen, Yu, and Wright2018] provides a DE algorithm and theory for exploration under a safety model but does not address exploration for policy gradient methods. [Dimakopoulou and Roy2018] uses multiple agents with diverse estimates of the MDP to efficiently learn a model of the reward and transition function. [Hong et al.2018] adds an explicit divergence regularizer to the policy gradient objective which encourages divergence from a heuristically chosen subset of previously deployed policies.

Others have studied parameter space noise for exploration in gradient-based evolutionary strategies [Salimans et al.2017, Sehnke et al.2010, Wierstra et al.2014], but they do not optimize diversity within policy performance constraints. [Fortunato et al.2018] proposes exploration by adding a trainable noise parameter to each network parameter, which incurs significant computational cost.

Different generalizations of the policy gradient with the common goal of variance reduction in gradient estimates [Ciosek and Whiteson2018, Gu et al.2017] exist but do not address the same exploration issue studied in this work. An alternate line of work reduces variance using control variates [Liu et al.2018].

## Conclusions and Future Work

We have proposed a novel exploration strategy and an algorithm framework for DE via conjugate policies for policy gradient methods. We have also provided a theoretical explanation for why DE works, and experimental results on three continuous control problems showing that DE outperforms the two baselines (TRPO and RP).

One future research direction is to investigate other efficient ways of generating a limited number of conjugate vectors that optimize KL divergence. Another is a more principled way of selecting the perturbation radius and reduction factor. More generally, future research will expand the DE theory and algorithm framework to other policy gradient methods and paradigms of RL.
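To make the "byproduct of conjugate gradient descent" relationship concrete, the following is a minimal NumPy sketch, not the paper's implementation: the search directions produced by standard conjugate gradient iterations on a system involving a positive definite matrix `F` (here a random stand-in for the Fisher information matrix) are mutually F-conjugate and can serve as perturbation directions.

```python
import numpy as np

def conjugate_directions(F, b, k):
    """Run k conjugate gradient iterations on F x = b and return the
    search directions p_1, ..., p_k, which are mutually F-conjugate."""
    x = np.zeros(F.shape[0])
    r = b - F @ x              # initial residual
    p = r.copy()               # first search direction
    dirs = []
    for _ in range(k):
        dirs.append(p.copy())
        Fp = F @ p
        alpha = (r @ r) / (p @ Fp)
        x = x + alpha * p
        r_new = r - alpha * Fp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p   # next direction, F-conjugate to the previous ones
        r = r_new
    return dirs

# Sanity check: p_i^T F p_j ≈ 0 for i != j on a random SPD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
F = A @ A.T + 8 * np.eye(8)
dirs = conjugate_directions(F, rng.standard_normal(8), k=4)
for i in range(4):
    for j in range(i):
        assert abs(dirs[i] @ F @ dirs[j]) < 1e-8 * np.sqrt(
            (dirs[i] @ F @ dirs[i]) * (dirs[j] @ F @ dirs[j]))
```

In practice the matrix-vector products `F @ p` would be Fisher-vector products computed from policy samples rather than an explicit matrix, which is what makes the directions essentially free to collect during natural-gradient optimization.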

## Acknowledgements

Yu’s work is in part supported by the National Science Foundation (Award 1617915). Tong’s work is in part supported by the National Natural Science Foundation of China (Award 61572418).

## References

• [Amari and Nagaoka2000] Amari, S., and Nagaoka, H., eds. 2000. Methods of Information Geometry. Oxford University Press.
• [Ba, Kiros, and Hinton2016] Ba, J. L.; Kiros, R.; and Hinton, G. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
• [Brafman and Tennenholtz2003] Brafman, R. I., and Tennenholtz, M. 2003. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.
• [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
• [Ciosek and Whiteson2018] Ciosek, K., and Whiteson, S. 2018. Expected policy gradients. In Proceedings of the 32nd Conference on Artificial Intelligence, 2868–2875.
• [Cohen, Yu, and Wright2018] Cohen, A.; Yu, L.; and Wright, R. 2018. Diverse exploration for fast and safe policy improvement. In Proceedings of the 32nd Conference on Artificial Intelligence, 2876–2883.
• [Dimakopoulou and Roy2018] Dimakopoulou, M., and Roy, B. V. 2018. Coordinated exploration in concurrent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 80:1271–1279.
• [Duan et al.2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In Proceedings of The 33rd International Conference on Machine Learning, 1329–1338.
• [Fortunato et al.2018] Fortunato, M.; Azar, M.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; Blundell, C.; and Legg, S. 2018. Noisy networks for exploration. In International Conference on Learning Representations.
• [Grosse and Martens2016] Grosse, R., and Martens, J. 2016. A Kronecker-factored approximate Fisher matrix for convolution layers. In Proceedings of The 33rd International Conference on Machine Learning, 573–582.
• [Gu et al.2017] Gu, S.; Lillicrap, T.; Turner, R.; Ghahramani, Z.; Schölkopf, B.; and Levine, S. 2017. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, 3849–3858.
• [Hong et al.2018] Hong, Z.; Shann, A.; Su, S.; Chang, Y.; Fu, T.; and Lee, C. 2018. Diversity-driven exploration strategy for deep reinforcement learning. In Proceedings of the 32nd Conference on Neural Information Processing Systems.
• [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In International Conference on Learning Representations.
• [Liu et al.2018] Liu, H.; Feng, Y.; Mao, Y.; Zhou, D.; Peng, J.; and Liu, Q. 2018. Action-dependent control variates for policy optimization via Stein’s identity. In International Conference on Learning Representations.
• [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. In Neurocomputing, 1180–1190.
• [Plappert et al.2018] Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.; Chen, X.; Asfour, T.; Abbeel, P.; and Andrychowicz, M. 2018. Parameter space noise for exploration. In International Conference on Learning Representations.
• [Puterman1994] Puterman, M. 1994. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc. New York, NY, USA.
• [Salimans et al.2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
• [Schulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 1889–1897.
• [Sehnke et al.2010] Sehnke, F.; Osendorfer, C.; Rückstieß, T.; Peters, J.; and Schmidhuber, J. 2010. Parameter-exploring policy gradients. Neural Networks 23:551–559.
• [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. The MIT Press.
• [Sutton et al.1999] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
• [Todorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026–5033. IEEE.
• [Wierstra et al.2014] Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; and Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15:949–980.
• [Wright and Nocedal1999] Wright, S., and Nocedal, J., eds. 1999. Numerical Optimization. New York, New York: Springer.
• [Wu et al.2017] Wu, Y.; Mansimov, E.; Grosse, R.; Liao, S.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in Neural Information Processing Systems 30, 5285–5294.

## Appendix A Proof of Theorem 1

###### Theorem 1.

Let ϵi and ϵj be two perturbations such that ∥ϵi∥ = ∥ϵj∥ = δ. Then, (1) the trace of Cov[∇ϕlog(πϵj), ∇ϕlog(πϵi)] is minimized and (2) the estimated KL divergence DKL(πϵi ∥ πϵj) is maximized, when ϵj = −ϵi and they are along the direction of the eigenvector of the Fisher information matrix F with the largest eigenvalue.

###### Proof.

First, we note that the covariance

 Cov[∇ϕlog(πϵj), ∇ϕlog(πϵi)]
  = Es,a[∇ϕlog(πϵj(a|s)) ∇ϕlog(πϵi(a|s))ᵀ] − Es,a[∇ϕlog(πϵj(a|s))] Es,a[∇ϕlog(πϵi(a|s))]ᵀ
  = Es,a[∇ϕlog(πϵj(a|s)) ∇ϕlog(πϵi(a|s))ᵀ]

since the score function has zero mean: Es,a[∇ϕlog(πϵ(a|s))] = 0.

The arguments (a|s) are dropped in the following for clarity. Consider the multivariable Taylor expansion

 ∇ϕlog(πϵj) = ∇ϕlog(πϵi) + ∇²ϕlog(πϵi)(ϵj−ϵi) + R(πϵj),

where the remainder R(πϵj) is a vector each of whose elements is o(∥ϵj−ϵi∥). Taking the trace of the covariance, we have

 Trace{Es,a[∇ϕlog(πϵj) ∇ϕlog(πϵi)ᵀ]}
  = Trace{Es,a[∇ϕlog(πϵi) ∇ϕlog(πϵi)ᵀ + [∇²ϕlog(πϵi)(ϵj−ϵi)] ∇ϕlog(πϵi)ᵀ + R(πϵj) ∇ϕlog(πϵi)ᵀ]}
  = Trace{F(ϵi)} − Trace{Es,a[^F(ϵi)(ϵj−ϵi) ∇ϕlog(πϵi)ᵀ]} + o(∥ϵj−ϵi∥)
  = Trace{F(ϵi)} − Es,a[∇ϕlog(πϵi)ᵀ ^F(ϵi)(ϵj−ϵi)] + o(∥ϵj−ϵi∥)

Since the first term Trace{F(ϵi)} does not involve the difference ϵj − ϵi and the remainder term converges to zero as ϵj → ϵi, to minimize the trace of the covariance we need to focus on maximizing the second term.

 Es,a[∇ϕlog(πϵi)ᵀ ^F(ϵi)(ϵj−ϵi)]
  = Es,a{[(ϵj−ϵi)ᵀ ^F(ϵi) ∇ϕlog(πϵi) ∇ϕlog(πϵi)ᵀ ^F(ϵi)(ϵj−ϵi)]}^{1/2}
  = Es,a{(ϵj−ϵi)ᵀ ^F(ϵi) ~F(ϵi) ^F(ϵi)(ϵj−ϵi)}^{1/2}
  ≈ {(ϵj−ϵi)ᵀ F(ϵi)³ (ϵj−ϵi)}^{1/2},

Here F(ϵi) is the true Fisher information, ^F(ϵi) = −∇²ϕlog(πϵi) is the observed information, and ~F(ϵi) = ∇ϕlog(πϵi) ∇ϕlog(πϵi)ᵀ, and we note that Es,a[^F(ϵi)] = Es,a[~F(ϵi)] = F(ϵi). Consider the eigen-decomposition F(ϵi) = UΛUᵀ. We have

 (ϵj−ϵi)ᵀ F(ϵi)³ (ϵj−ϵi) = (ϵj−ϵi)ᵀ UΛ³Uᵀ (ϵj−ϵi) = ∑_{k=1}^p λk³ [ukᵀ(ϵj−ϵi)]².

This objective is maximized at ϵj = −ϵi = δu1 (or ϵi = −ϵj = δu1), that is, the two perturbations are opposite to one another and lie along the eigenvector u1 corresponding to the largest eigenvalue λ1. In this case, the maximal value is 4δ²λ1³.

We now note that (ϵj−ϵi)ᵀ F(ϵi) (ϵj−ϵi) is closely related to the KL divergence DKL(πϵi ∥ πϵj), whose second-order approximation is ½(ϵj−ϵi)ᵀ F(ϵi) (ϵj−ϵi). Using the same argument as above, this quantity is maximized by ϵj = −ϵi = δu1 (or ϵi = −ϵj = δu1).

In conclusion, the pair of perturbations ϵj = −ϵi = δu1 maximizes the KL divergence and minimizes the trace of the covariance up to a term which converges to 0 faster than ∥ϵj−ϵi∥. ∎
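The maximizer can be sanity-checked numerically. In the sketch below (illustrative only), a random symmetric positive definite matrix stands in for the Fisher information F(ϵi), and the dimension `p`, radius `delta`, and trial count are assumed values, not ones from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, delta = 6, 0.1
A = rng.standard_normal((p, p))
F = A @ A.T + p * np.eye(p)            # synthetic SPD stand-in for the Fisher matrix
lam, U = np.linalg.eigh(F)             # eigenvalues in ascending order
u1, lam1 = U[:, -1], lam[-1]           # leading eigenvector / eigenvalue
F3 = np.linalg.matrix_power(F, 3)

def objective(eps_i, eps_j):
    """The quadratic form (eps_j - eps_i)^T F^3 (eps_j - eps_i)."""
    d = eps_j - eps_i
    return d @ F3 @ d

# The opposite pair along the leading eigenvector attains 4 * delta^2 * lambda_1^3 ...
best = objective(delta * u1, -delta * u1)
assert np.isclose(best, 4 * delta**2 * lam1**3)

# ... and no random pair of radius-delta perturbations beats it.
for _ in range(1000):
    ei, ej = rng.standard_normal((2, p))
    ei *= delta / np.linalg.norm(ei)
    ej *= delta / np.linalg.norm(ej)
    assert objective(ei, ej) <= best + 1e-9
```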

## Appendix B Proof of Theorem 2

###### Theorem 2.

The set of conjugate vectors {μ1, …, μk} maximizes the objective (5) among any k perturbations.

###### Proof.

Let ϵj = ∑_{i=1}^n ηji μi, where j = 1, …, k, be a set of perturbations expressed in terms of the conjugate vectors μi. First, we show the result for k = 2. In this case, the value of the objective (5) is

 ~DKL(ϵ1∥ϵ2) = (ϵ1−ϵ2)ᵀ F (ϵ1−ϵ2)
  = ∑_{i=1}^n (η1i² − 2η1iη2i + η2i²) ∥μi∥F
  = ∑_{i=1}^n η1i² ∥μi∥F + ∑_{i=1}^n η2i² ∥μi∥F − 2∑_{i=1}^n η1iη2i ∥μi∥F,

where ∥μi∥F = μiᵀFμi and the cross terms μiᵀFμj (i ≠ j) vanish by conjugacy.

Recall that ∥μi∥F > 0 for all i. Therefore, to maximize the objective, we must select η1 and η2 so that η1iη2i ≤ 0 for all i. Next, since distinct conjugate directions already satisfy η1iη2i = 0 for all i, it suffices to let η1 = e1 and η2 = e2, the first two standard basis vectors.

The same argument generalizes to k > 2. In the end, we choose ηj to be the vector of all zeros except a 1 at the j-th entry. This choice corresponds to the case that the j-th perturbation is the j-th conjugate vector. ∎
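The vanishing cross terms can be checked numerically. In the sketch below (illustrative only), F-conjugate vectors are produced by Gram-Schmidt in the F-inner product as a stand-in for however the conjugate vectors are actually generated, and the matrix and coefficients are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
F = A @ A.T + n * np.eye(n)            # synthetic SPD stand-in for the Fisher matrix

# Build F-conjugate vectors mu_1, ..., mu_n by Gram-Schmidt in the F-inner product.
mus = []
for v in rng.standard_normal((n, n)):
    for m in mus:
        v = v - ((m @ F @ v) / (m @ F @ m)) * m
    mus.append(v)
M = np.array(mus)                       # row i is mu_i

# Two perturbations written in the conjugate basis: eps_j = sum_i eta_ji * mu_i.
eta1, eta2 = rng.standard_normal((2, n))
eps1, eps2 = M.T @ eta1, M.T @ eta2

d = eps1 - eps2
lhs = d @ F @ d                         # (eps_1 - eps_2)^T F (eps_1 - eps_2)
rhs = sum((eta1[i] - eta2[i])**2 * (M[i] @ F @ M[i]) for i in range(n))
assert np.isclose(lhs, rhs)             # cross terms vanish by conjugacy
```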