1 Introduction
In the paradigm of Reinforcement Learning (RL), an agent interacts with the environment to learn a policy that maximizes a certain form of cumulative reward (Sutton and Barto, 1998). When the policy function is modeled with a deep neural network, the policy gradient method can be applied to optimize the current policy (Sutton et al., 2000). However, direct optimization with respect to the reward function is prone to getting stuck in sub-optimal solutions, which hinders policy optimization (Liepins and Vose, 1991; Lehman and Stanley, 2011; Plappert et al., 2018). Consequently, an appropriate exploration strategy is crucial for the success of policy learning (Auer, 2002; Bellemare et al., 2016; Houthooft et al., 2016; Tang et al., 2017; Ostrovski et al., 2017; Tessler et al., 2019; Ciosek et al., 2019; Plappert et al., 2017).
Recently, many works have shown that incorporating curiosity into policy learning leads to better exploration strategies (Pathak et al., 2017; Burda et al., 2018a, b; Liu et al., 2019). In these works, visiting a previously unseen or infrequent state is rewarded with an extra curiosity bonus. Different from these curiosity-driven methods, which focus on the discovery of new states within the learning procedure of a single policy, another direction, Novel Policy Seeking (Lehman and Stanley, 2011; Zhang et al., 2019; Pugh et al., 2016), focuses on learning different policies with diverse, or so-called novel, behaviors to solve the primal task. In the process of novel policy seeking, policies in new iterations are usually encouraged to be different from previous policies. Novel policy seeking can therefore be viewed as an extrinsic curiosity-driven method at the level of policies, as well as an exploration strategy for a population of agents. Besides encouraging exploration (Eysenbach et al., 2018; Gangwani et al., 2018; Liu et al., 2017), novel policy seeking is also related to policy ensembles (Osband et al., 2018, 2016; Florensa et al., 2017) and evolution strategies (ES) (Salimans et al., 2017; Conti et al., 2018).
In order to generate novel policies, previous work often defines a heuristic metric for novelty estimation (e.g., differences of state distributions estimated by neural networks are used in Zhang et al., 2019) and tries to solve the problem under the formulation of multi-objective optimization. However, most of these metrics have difficulty dealing with episodic novelty reward, i.e., the difficulty of episodic credit assignment (Sutton et al., 1998), so their effectiveness in learning novel policies is limited. Moreover, the difficulty of balancing different objectives prevents the agent from finding a well-performing policy for the primal task, as shown by Fig. 1, which compares the policy gradients of three cases: learning without novel policy seeking, novelty seeking with multi-objective optimization, and novelty seeking with constrained optimization.
In this work, we take into consideration both the novelty of learned policies and their performance on the primal task when addressing the problem of novel policy seeking. To achieve this goal, we propose to seek novel policies with a constrained optimization formulation. Two specific algorithms under this formulation are designed to seek novel policies while maintaining their performance on the primal task, avoiding excessive novelty seeking. As a consequence, with these two algorithms, the performance of our learned novel policies can be guaranteed and even further improved.
Our contributions are threefold. First, we introduce a new metric to compute the difference between policies with instant feedback at every timestep. Second, we propose a constrained optimization formulation for novel policy seeking and design two practical algorithms resembling two approaches in the constrained optimization literature. Third, we evaluate our proposed algorithms on the MuJoCo locomotion environments, showing the advantages of these constrained novelty-seeking methods, which can generate a series of diverse and well-performing policies, over previous multi-objective novelty-seeking methods.
2 Related Work
Intrinsic motivation methods. In previous work, different approaches have been proposed to provide intrinsic motivation or intrinsic reward as a supplement to the primal task reward for better exploration (Houthooft et al., 2016; Pathak et al., 2017; Burda et al., 2018a, b; Liu et al., 2019). All those approaches leverage the weighted sum of two rewards: the primal rewards provided by environments, and intrinsic rewards provided by different heuristics. On the other hand, DIAYN and DADS (Eysenbach et al., 2018; Sharma et al., 2019) learn diverse skills without extrinsic reward. Those approaches focus on decomposing diverse skills of a single policy, while our work focuses on learning diverse behaviors among a batch of policies for the same task.
Diverse policy seeking methods. Such et al. (2018) show that different RL algorithms may converge to different policies for the same task. In contrast, we are interested in how to learn different policies through a single learning algorithm that is capable of avoiding local optima. Pugh et al. (2016) establish a standard framework for understanding and comparing different approaches to searching for quality diversity (QD). Conti et al. (2018) propose a solution that avoids local optima and achieves higher performance by adding novelty search and QD to evolution strategies. The Task-Novelty Bisector (TNB) learning method (Zhang et al., 2019) aims to solve the novelty-seeking problem by jointly optimizing the extrinsic rewards and novelty rewards defined by an auto-encoder. In this work, one of the two proposed methods is closely related to TNB, but is adapted to the constrained optimization formulation.
Constrained Markov Decision Process
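The Constrained Markov Decision Process (CMDP) (Altman, 1999) considers the situation where an agent interacts with the environment under certain constraints. Formally, a CMDP can be defined as a tuple $(\mathcal{S}, \mathcal{A}, \gamma, r, c, C, P, s_0)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces; $\gamma$ is a discount factor; $r$ and $c$ denote the reward function and cost function; $C$ is the upper bound of permitted expected cumulative cost; $P$ denotes the transition dynamics, and $s_0$ is the initial state. Denote the Markovian policy class as $\Pi$, where $\pi(a|s): \mathcal{S} \times \mathcal{A} \to [0, 1]$ for $\pi \in \Pi$. The learning objective for a CMDP is to find a $\pi^* \in \Pi$ such that
$$\pi^* = \arg\max_{\pi \in \Pi}\ \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^t r(s_t, a_t)\Big], \quad \text{s.t.}\ \ \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^t c(s_t, a_t)\Big] \le C, \tag{1}$$
where $\tau$ indicates a trajectory $(s_0, a_0, s_1, \dots)$ and $\tau \sim \pi$ represents the distribution over trajectories following policy $\pi$: $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$. Previous literature provides several approaches to solving CMDPs (Achiam et al., 2017; Chow et al., 2018; Ray et al., 2019), and in this work we include CPO (Achiam et al., 2017) as a baseline, following the comparison in (Ray et al., 2019).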
3 Methodology
In Sec.3.1, we start with defining a metric space that measures the difference between policies, which is the fundamental ingredient for the methods introduced later. In Sec.3.2, we develop a practical estimation method for this metric. Sec.3.3 describes the formulation of constrained optimization on novel policy seeking. The implementations of two practical algorithms are further introduced in Sec.3.4.
We denote the policies as $\{\pi_{\theta_i};\ \theta_i \in \Theta,\ i = 1, 2, \dots\}$, wherein $\theta_i$ represents the parameters of the $i$-th policy and $\Theta$ denotes the whole parameter space. In this work, we focus on improving the behavior diversity of policies from PPO (Schulman et al., 2017), thus we use $\Theta$ to represent $\Theta_{PPO}$ in this paper. It is worth noting that the proposed methods can be easily extended to other RL algorithms (Schulman et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). To simplify the notation, we omit $\pi$ and denote a policy $\pi_{\theta_i}$ as $\theta_i$ unless stated otherwise.
3.1 Measuring the Difference between Policies
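In this work, we use the Wasserstein metric $W_p$ (Rüschendorf, 1985; Villani, 2008; Arjovsky et al., 2017) to measure the distance between policies. Concretely, we consider Gaussian-parameterized policies, where the $W_2$ between two policies can be written in closed form as $W_2^2 = \|m_1 - m_2\|_2^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\big)$, where $m_1, \Sigma_1, m_2, \Sigma_2$ are the means and covariance matrices of the two normal distributions. In the following, we use $D_W(\theta_i, \theta_j)$ to denote the $W_2$ distance between policies $\theta_i$ and $\theta_j$. It is worth noting that when the covariance matrices are identical, the trace term disappears and only the term involving the means remains, i.e., $D_W = \|m_1 - m_2\|$, as for Dirac delta distributions located at points $m_1$ and $m_2$. This diversity metric satisfies the three properties of a metric, namely identity, symmetry and the triangle inequality.
Proposition 1 (Metric Space $(\Theta, \overline{D}_W^q)$)
The expectation of $D_W(\cdot, \cdot)$ of two policies over any state distribution $q(s)$:
$$\overline{D}_W^q(\theta_i, \theta_j) := \mathbb{E}_{s \sim q(s)}\big[D_W(\theta_i(a|s), \theta_j(a|s))\big] \tag{2}$$
is a metric on $\Theta$, thus $(\Theta, \overline{D}_W^q)$ is a metric space.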
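As an illustration of the closed form above, the following minimal sketch computes the 2-Wasserstein distance for the special case of diagonal covariance matrices, where the trace term reduces to the squared difference of the standard deviations. The function name `w2_diag_gaussians` and the example values are assumptions made for this sketch, not part of the original implementation.

```python
import math


def w2_diag_gaussians(mean_1, std_1, mean_2, std_2):
    """2-Wasserstein distance between two Gaussians with diagonal covariances.

    For diagonal covariances diag(std_1^2) and diag(std_2^2), the trace term
    tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}) equals sum((std_1 - std_2)^2).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_1, mean_2))
    trace_term = sum((a - b) ** 2 for a, b in zip(std_1, std_2))
    return math.sqrt(mean_term + trace_term)


# Identical covariances: only the distance between the means remains.
same_cov = w2_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
# Identical means: only the trace term remains.
same_mean = w2_diag_gaussians([0.0], [1.0], [0.0], [3.0])
```

With identical covariances the result is the Euclidean distance between the mean actions, which is the case used when the Gaussian noise is removed from the policies.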
The proof of Proposition 1 is straightforward. It is worth mentioning that the Jensen-Shannon divergence $D_{JS}$ or the Total Variation Distance $D_{TV}$ (Endres and Schindelin, 2003; Fuglede and Topsoe, 2004; Schulman et al., 2015) can also be applied to build alternative metric spaces. We choose $D_W$ in our work because the Wasserstein metric better preserves continuity (Arjovsky et al., 2017).
On top of the metric space $(\Theta, \overline{D}_W^q)$, we can then compute the novelty of a policy as follows.
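Definition 1 (Novelty of Policy)
Given a reference policy set $\Theta_{ref}$ such that $\Theta_{ref} \subset \Theta$, the novelty $\mathcal{N}(\theta | \Theta_{ref})$ of a policy $\theta$ is the minimal difference between $\theta$ and all policies in the reference policy set, i.e.,
$$\mathcal{N}(\theta | \Theta_{ref}) := \min_{\theta_j \in \Theta_{ref}} \overline{D}_W^q(\theta, \theta_j) \tag{3}$$
Consequently, to encourage the discovery of novel policies, typical novelty-seeking methods tend to directly maximize the novelty of a new policy, i.e., $\max_\theta \mathcal{N}(\theta | \Theta_{ref})$, where $\Theta_{ref}$ includes all existing policies.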
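The definition of novelty can be sketched as follows. This is a minimal illustration that assumes deterministic policies represented as plain callables from a state to a mean action, and a finite batch of sampled states standing in for the state distribution; the names `mean_policy_distance` and `novelty` and the toy policies are hypothetical.

```python
def mean_policy_distance(policy, reference, states):
    """Monte Carlo estimate of the expected action distance over sampled states."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(policy(s), reference(s))) ** 0.5
        for s in states
    ]
    return sum(distances) / len(distances)


def novelty(policy, reference_policies, states):
    """Novelty is the distance to the closest policy in the reference set."""
    return min(
        mean_policy_distance(policy, ref, states) for ref in reference_policies
    )


# Toy one-dimensional policies (assumed for illustration only).
new_policy = lambda s: [s[0] + 1.0]
ref_a = lambda s: [s[0]]        # at distance 1 from new_policy everywhere
ref_b = lambda s: [s[0] + 3.0]  # at distance 2 from new_policy everywhere
sampled_states = [[0.0], [1.0]]

novelty_value = novelty(new_policy, [ref_a, ref_b], sampled_states)
```

Because the minimum is taken over the reference set, a new policy is only considered novel if it differs from every existing policy, not merely from their average.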
3.2 Estimation of $\overline{D}_W^q$ and the Selection of $q(s)$
In practice, the calculation of $\overline{D}_W^q(\theta_i, \theta_j)$ is based on Monte Carlo estimation, where we need to sample $s$ from $q(s)$. Although $q(s)$ in Eq.(2) can be selected simply as a uniform distribution over the state space, there remain two obstacles: first, in a finite state space we can get a precise estimate after establishing ergodicity, but problems arise in continuous state spaces due to the difficulty of efficiently obtaining enough samples; second, when $s$ is sampled from a uniform distribution $q$, we can only get a sparse episodic reward instead of a dense online reward, which is more useful in learning. Therefore, we make an approximation here based on importance sampling.
Formally, we denote the domain of $q(s)$ as $\mathcal{S}_q \subset \mathcal{S}$ and assume $q(s)$ to be a uniform distribution over $\mathcal{S}_q$, without loss of generality in later analysis. Notice that $\mathcal{S}_q$ is closely related to the algorithm being used in generating trajectories (Henderson et al., 2018). As we only care about the reachable regions of a certain algorithm (in this work, PPO), the domain $\mathcal{S}_q$ can be decomposed as $\mathcal{S}_q = \bigcup_i \mathcal{S}_{\theta_i}$, where $\mathcal{S}_{\theta_i}$ denotes all the possible states a policy $\theta_i$ can visit given a starting state distribution.
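In order to get an online reward, we estimate Eq.(2) with
$$\overline{D}_W^q(\theta_i, \theta_j) = \mathbb{E}_{s \sim q(s)}\big[D_W(\theta_i(a|s), \theta_j(a|s))\big] = \mathbb{E}_{s \sim \rho_{\theta_i}(s)}\Big[\frac{q(s)}{\rho_{\theta_i}(s)} D_W(\theta_i(a|s), \theta_j(a|s))\Big] \tag{4}$$
where we use $\rho_\theta(s)$ to denote the stationary state visitation frequency under policy $\theta$, i.e., $\rho_\theta(s) = P(s_0 = s|\theta) + P(s_1 = s|\theta) + \dots + P(s_T = s|\theta)$ in finite horizon problems. We propose to use the averaged stationary visitation frequency $\bar{\rho}(s)$ as $q(s)$, e.g., for PPO, $\bar{\rho}(s) = \mathbb{E}_{\theta \sim \Theta_{PPO}}[\rho_\theta(s)]$. Clearly, choosing $q(s) = \bar{\rho}(s)$ is much better than choosing a uniform distribution, as the importance weight will be closer to $1$. Such an importance sampling process requires that $q(s)$ and $\rho_{\theta_i}(s)$ have the same domain, which can be guaranteed by applying sufficient exploration noise to $\theta_i$.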
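The importance-sampling identity used here can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-state space, the two distributions and the per-state distances are made-up values, and `importance_sampled_mean` is a hypothetical helper name.

```python
import random


def importance_sampled_mean(samples, target_prob, sampling_prob, value):
    """Average of value(s) * target_prob(s) / sampling_prob(s) over the samples."""
    weighted = [value[s] * target_prob[s] / sampling_prob[s] for s in samples]
    return sum(weighted) / len(weighted)


states = [0, 1, 2]
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # target state distribution (assumed)
rho = {0: 0.5, 1: 0.3, 2: 0.2}       # visitation frequency of the sampled policy (assumed)
dist = {0: 1.0, 1: 2.0, 2: 3.0}      # per-state policy distance (assumed)

rng = random.Random(0)
samples = rng.choices(states, weights=[rho[s] for s in states], k=20000)

# Weighted estimate targets E_q[dist] = 2.0 although states are drawn from rho.
weighted_estimate = importance_sampled_mean(samples, q, rho, dist)
# Setting the importance weight to 1 instead targets E_rho[dist] = 1.7.
naive_estimate = sum(dist[s] for s in samples) / len(samples)
```

The gap between the two estimates shrinks as the sampling distribution approaches the target distribution, which is the motivation for choosing the averaged visitation frequency rather than a uniform distribution.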
Another difficulty lies in the estimation of $\rho_{\theta_i}(s)$, which is intractable given a limited number of trajectories. However, during training, $\theta_i$ is a policy to be optimized and $\theta_j$ is a fixed reference policy. The error introduced by approximating the importance weight as $1$ grows as $\theta_i$ becomes more distinct from normal policies, at least in terms of the state visitation frequency. We may therefore simply regard an increase in the approximation error as the discovery of novel policies.
Proposition 2 (Unbiased Single Trajectory Estimation)
The estimation of $\rho_\theta(s)$ using a single trajectory $\tau$ is unbiased.
Proposition 2 follows the usual trick in RL of using a single trajectory to estimate the stationary state visitation frequency. Given the definition of novelty and a practically unbiased sampling method, the next step is to develop an efficient learning algorithm.
3.3 Constrained Optimization Formulation for Novel Policy Seeking
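In the traditional RL paradigm, maximizing the expectation of cumulative rewards is commonly used as the objective, i.e., $\max_\theta \mathbb{E}_{\tau \sim \theta}[g]$, where $g = \sum_{t} \gamma^t r_t$ and $\tau \sim \theta$ denotes a trajectory sampled from the policy $\theta$.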
To improve the diversity of different agents' behaviors, the learning objective must take both the reward from the primal task and the policy novelty into consideration. Previous approaches (Houthooft et al., 2016; Pathak et al., 2017; Burda et al., 2018a, b; Liu et al., 2019) often directly use the weighted sum of these two terms as the objective:
$$\max_\theta\ \mathbb{E}_{\tau \sim \theta}\big[\alpha\, g_{task} + (1 - \alpha)\, g_{int}\big] \tag{5}$$
where $0 < \alpha < 1$ is a weight hyper-parameter, $g_{task} = \sum_t \gamma^t r_t$ is the cumulative reward from the primal task, and $g_{int} = \sum_t \gamma^t r_{int,t}$ is the cumulative intrinsic reward of the intrinsic reward $r_{int}$. In our case, the intrinsic reward is the novelty reward. These methods can be summarized as Weighted Sum Reward (WSR) methods (Zhang et al., 2019). Such an objective is sensitive to the selection of $\alpha$ as well as the formulation of $r_{int}$. For example, in our case different functional forms of the novelty reward will lead to significantly different results, as they determine the trade-off between the two terms given $\alpha$. Besides, a dilemma also arises in the selection of $\alpha$: while a large $\alpha$ may undermine the contribution of the intrinsic reward, a small $\alpha$ could ignore the importance of the primal task, leading to the failure of an agent in solving the task.
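The sensitivity of a weighted-sum objective to its weight can be seen in a small sketch. The helper `weighted_sum_return` and the two toy reward sequences are assumptions for illustration; they are not taken from the experiments.

```python
def weighted_sum_return(alpha, task_rewards, novelty_rewards, gamma=1.0):
    """Discounted return of the mixed reward alpha * r_task + (1 - alpha) * r_novelty."""
    total = 0.0
    for t, (r_task, r_nov) in enumerate(zip(task_rewards, novelty_rewards)):
        total += gamma ** t * (alpha * r_task + (1.0 - alpha) * r_nov)
    return total


# Behaviour A solves the task but is not novel; behaviour B is novel but weak
# on the task (made-up numbers).
task_a, novelty_a = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
task_b, novelty_b = [0.2, 0.2, 0.2], [1.0, 1.0, 1.0]

prefers_a_large_alpha = (
    weighted_sum_return(0.9, task_a, novelty_a)
    > weighted_sum_return(0.9, task_b, novelty_b)
)
prefers_b_small_alpha = (
    weighted_sum_return(0.3, task_b, novelty_b)
    > weighted_sum_return(0.3, task_a, novelty_a)
)
```

The preferred behaviour flips with the weight alone, so the weight implicitly decides whether the primal task is solved at all.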
To tackle such an issue, the crux is to deal with the conflict between different objectives. Zhang et al. (2019) propose TNB, where the task reward is regarded as the dominant one while the novelty reward is regarded as subordinate. However, as TNB considers the novelty gradient all the time, it may hinder the learning process. Intuitively, well-performing policies should be more similar to each other than to randomly initialized policies. As a new randomly initialized policy is already different enough from previous policies, considering the novelty gradient at the beginning of training will result in a much slower learning process.
In order to tackle the above problems and adjust the extent of novelty in new policies, we propose to solve the novelty-seeking problem from the perspective of constrained optimization. The basic idea is as follows: while the task reward is considered as the learning objective, the novelty reward should be considered as a bonus instead of another objective, and should not impede the learning of the primal task. Fig. 1 illustrates how novelty gradients impede the learning of a policy: at the beginning of learning, a randomly initialized policy should on the whole learn to be more similar to a well-performing policy rather than to be different. The seeking of novelty should not be taken into consideration all the time during learning. With such an insight, we change the multi-objective optimization problem in Eq.(5) into a constrained optimization problem:
$$\max_\theta\ f(\theta) = \mathbb{E}_{\tau \sim \theta}[g_{task}], \quad \text{s.t.}\ \ r_0 - \bar{r}_{int,t} \le 0,\ \ \forall t = 1, 2, \dots, T \tag{6}$$
where $g_{task}$ is the cumulative task reward, $r_0$ is a threshold indicating the minimal permitted novelty, and $\bar{r}_{int,t}$ denotes a moving average of $r_{int,t}$. We use a moving average because we need not force every single action of a new agent to be different from others; instead, we care more about long-term differences. Therefore, we use cumulative novelty terms as constraints. Moreover, the constraints can be flexibly applied only after the first $t_S$ timesteps, to allow for similar starting sequences, so that the constraints can be written as $\bar{r}_{int,t} - r_0 \ge 0,\ \forall t > t_S$.
3.4 Practical Novel Policy Seeking Methods
We note here, WSR and TNB proposed in previous work (Zhang et al., 2019) can correspond to different approaches in constrained optimization problems, yet some important ingredients are missing. We improve TNB according to the Feasible Direction Method in constrained optimization and then propose the Interior Policy Differentiation (IPD) method according to the Interior Point Method in constrained optimization.
WSR: Penalty Method
The Penalty Method handles the constraints of Eq.(6) by moving them into a penalty term and then solving the unconstrained problem
$$\max_\theta\ f(\theta) + \frac{1 - \alpha}{\alpha} \sum_{t} \min\{\bar{r}_{int,t} - r_0,\ 0\} \tag{7}$$
in an iterative manner. The limit of the above unconstrained problem when $\alpha \to 0$ then leads to the solution of the original constrained problem. As an approximation, WSR chooses a fixed weight $\alpha$ and uses the gradient of the cumulative intrinsic reward instead of the gradient of the penalty term, thus the final solution relies heavily on the selection of $\alpha$.
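The behaviour of a penalty formulation can be illustrated on a one-dimensional toy problem that is unrelated to the RL setting: maximize $-(x - 2)^2$ subject to $x \le 1$, using the quadratic penalty $\mu \max(0, x - 1)^2$. Both the toy problem and the name `penalty_solution` are assumptions made for this sketch; the maximizer has the closed form $(2 + \mu) / (1 + \mu)$.

```python
def penalty_solution(mu):
    """Maximizer of -(x - 2)^2 - mu * max(0, x - 1)^2, in closed form.

    Setting the derivative for x > 1 to zero gives
    -2 (x - 2) - 2 mu (x - 1) = 0, i.e. x = (2 + mu) / (1 + mu).
    """
    return (2.0 + mu) / (1.0 + mu)


# A fixed, moderate penalty weight leaves the solution well outside x <= 1.
fixed_weight_solution = penalty_solution(1.0)

# Only an increasing sequence of weights approaches the constrained optimum x = 1.
penalty_path = [penalty_solution(mu) for mu in (1.0, 10.0, 100.0, 1e6)]
```

For any finite weight the solution violates the constraint, and how much it does so depends entirely on the chosen weight, which mirrors the sensitivity of a fixed-weight approximation.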
TNB: Feasible Direction Method
The Feasible Direction Method (FDM) (Ruszczyński, 1980; Herskovits, 1998) solves the constrained optimization problem by finding a direction $v$ along which a gradient step increases the objective function while keeping the constraints satisfied, i.e.,
(8) 
TNB proposes to use a revised bisector of the task gradient and the novelty gradient as $v$,
(9) 
Clearly, Eq.(9) satisfies Eq.(8), but it is stricter than Eq.(8), as the novelty gradient term always exists during the optimization of TNB. Based on TNB, we provide a revised approach, named Constrained Task Novel Bisector (CTNB), which more closely resembles FDM. Specifically, when the novelty constraints are satisfied, CTNB does not apply the novelty gradient to $\theta$. It is clear that TNB is a special case of CTNB when the novelty threshold $r_0$ is set to infinity. We note that in both TNB and CTNB the learning stride is fixed, which may lead to problems: the final optimization result relies heavily on the selection of the constraint function, i.e., its shape is crucial for the success of this approach.
IPD: Interior Point Method
The Interior Point Method (Potra and Wright, 2000; Dantzig and Thapa, 2006) is another approach to solving constrained optimization problems. Here we solve Eq.(6) using Interior Policy Differentiation (IPD), which can be regarded as an analogy of the Interior Point Method. In the vanilla Interior Point Method, a constrained optimization problem such as Eq.(6), with constraints written as $g_j(\theta) \le 0$, is solved by reforming it into an unconstrained form with an additional barrier term in the objective,
$$\max_\theta\ f(\theta) + \alpha \sum_{j} \log\big(-g_j(\theta)\big) \tag{10}$$
or more precisely in our problem
$$\max_\theta\ f(\theta) + \alpha \sum_{t} \log\big(\bar{r}_{int,t} - r_0\big) \tag{11}$$
where $\alpha > 0$ is the barrier factor, $\bar{r}_{int,t}$ is the moving-average novelty reward and $r_0$ is the novelty threshold. Other barrier functions can be used in place of the logarithm. When $\alpha$ is small, the barrier term has only a minuscule influence on the objective. On the other hand, when $\theta$ gets closer to the barrier, the magnitude of the barrier term grows rapidly. It is clear that the solution of the objective with the barrier term gets closer to that of the original objective as $\alpha$ gets smaller. Thus in practice, we can choose a sequence $\{\alpha_i\}$ such that $0 < \alpha_{i+1} < \alpha_i$ and $\alpha_i \to 0$ as $i \to \infty$. The limit of Eq.(11) when $\alpha \to 0$ then leads to the solution of Eq.(6). The convergence of such methods is established in previous works (Conn et al., 1997; Wright, 2001).
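The effect of shrinking the barrier factor can be illustrated on a one-dimensional toy problem that is unrelated to the RL setting: maximize $-(x - 2)^2$ subject to $x \le 1$, with the log barrier $\alpha \log(1 - x)$. The toy problem and the name `barrier_solution` are assumptions made for this sketch; the feasible stationary point has the closed form $(3 - \sqrt{1 + 2\alpha}) / 2$.

```python
import math


def barrier_solution(alpha):
    """Maximizer of -(x - 2)^2 + alpha * log(1 - x) on x < 1, in closed form.

    Setting the derivative to zero gives 2 (x - 2) (x - 1) = alpha, whose root
    inside the feasible region is x = (3 - sqrt(1 + 2 alpha)) / 2.
    """
    return (3.0 - math.sqrt(1.0 + 2.0 * alpha)) / 2.0


# A decreasing sequence of barrier factors, as described in the text.
alphas = [1.5, 0.1, 1e-3, 1e-6]
barrier_path = [barrier_solution(a) for a in alphas]
```

Every iterate stays strictly inside the feasible region and approaches the constrained optimum $x = 1$ from the inside as the barrier factor decreases.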
However, directly applying IPM is computationally expensive and numerically unstable. Fortunately, in the RL paradigm, the learning of an agent is determined by the experiences used in the calculation of policy gradients, so a more natural approach is available. Specifically, since the learning process is based on sampled transitions, we can simply bound the collected transitions in the feasible region by allowing previously trained policies to send termination signals during the training of new agents. In other words, we implicitly bound the feasible region by terminating any new agent that steps outside it.
Consequently, during the training process, all valid samples we collect are inside the feasible region, which means these samples are less likely to be produced by previously trained policies. At the end of training, we then naturally obtain a new policy that has sufficient novelty. In this way, we no longer need to deliberately consider the trade-off between intrinsic and extrinsic rewards. The learning process of IPD is thus more robust and no longer suffers from objective inconsistency.
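The termination mechanism described above can be sketched as a rollout loop. This is a minimal illustration under several assumptions: policies are deterministic callables, the environment is given as two plain functions, the running average is taken over the whole episode so far, and all names (`collect_feasible_episode`, `warmup_steps`, the toy environment) are hypothetical rather than taken from the authors' code.

```python
def collect_feasible_episode(env_reset, env_step, policy, reference_policies,
                             distance, novelty_threshold, warmup_steps, max_steps):
    """Collect transitions until the episode ends or the novelty constraint fails."""
    state = env_reset()
    transitions = []
    novelty_sums = [0.0] * len(reference_policies)
    for t in range(1, max_steps + 1):
        action = policy(state)
        for j, ref in enumerate(reference_policies):
            novelty_sums[j] += distance(action, ref(state))
        # Running-average novelty with respect to the closest reference policy.
        running_novelty = min(total / t for total in novelty_sums)
        if t > warmup_steps and running_novelty < novelty_threshold:
            break  # a reference policy sends the termination signal
        next_state, reward, done = env_step(state, action)
        transitions.append((state, action, reward))
        state = next_state
        if done:
            break
    return transitions


# Toy one-dimensional environment and policies (assumed for illustration).
reset = lambda: 0.0
step = lambda s, a: (s + a, 1.0, False)
abs_distance = lambda a, b: abs(a - b)
reference = lambda s: 1.0
copycat = lambda s: 1.0   # identical to the reference policy
novel = lambda s: 2.0     # always at distance 1 from the reference policy

copycat_episode = collect_feasible_episode(
    reset, step, copycat, [reference], abs_distance, 0.5, 3, 10)
novel_episode = collect_feasible_episode(
    reset, step, novel, [reference], abs_distance, 0.5, 3, 10)
```

In this toy setting the copycat policy is cut off right after the warm-up steps and therefore collects less task reward, while the novel policy runs for the full horizon; no novelty term needs to be added to the reward.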
4 Experiments
According to Proposition 2, the novelty reward in Eq.(6) under our novelty metric can be approximated without bias by the per-timestep novelty reward $r_{int,t} = \min_{\theta_j \in \Theta_{ref}} D_W(\theta(a|s_t), \theta_j(a|s_t))$. We thus utilize this novelty metric directly throughout our experiments. We apply different novel policy seeking methods, namely WSR, TNB, CTNB, and IPD, to the backbone RL algorithm PPO (Schulman et al., 2017). The extension to other popular RL algorithms is straightforward. More implementation details are given in Appendix D. Experiments by Henderson et al. (2018) show that one can simply change the random seed before training to get policies that perform differently. Therefore, we also use PPO with varying random seeds as a baseline method for novel policy seeking, and we use the averaged differences between policies learned by this baseline as the default threshold in CTNB and IPD. Algorithm 1 and Algorithm 2 show the pseudo-code of IPD and CTNB based on PPO, where the blue lines show the additions to the primal PPO algorithm.
4.1 The MuJoCo environment
We evaluate our proposed method on OpenAI Gym environments based on the MuJoCo engine (Brockman et al., 2016; Todorov et al., 2012). Concretely, we test on three locomotion environments: Hopper-v3 (11 observations and 3 actions), Walker2d-v3 (17 observations and 6 actions), and HalfCheetah-v3 (17 observations and 6 actions). Although relaxing the healthy termination thresholds in Hopper and Walker may permit more visible behavior diversity, all the environment parameters are set to their default values in our experiments to demonstrate the generality of our method.
4.1.1 Comparison on Novelty and Performance
We implement WSR, TNB, CTNB, and IPD using the same hyper-parameter settings per environment. We also apply CPO (Achiam et al., 2017) as a baseline solution to the CMDP. For each method, we first train policies using PPO with different random seeds. Those PPO policies are used as the primal reference policies, and then we train novel policies that try to be different from the previous reference policies. Concretely, in each method, the first novel policy is trained to be different from the previous PPO policies, the second should be different from the PPO policies and the first novel policy, and so on. More implementation details are given in Appendix D.
Method  Reward (Hopper)  Reward (Walker2d)  Reward (HalfCheetah)  Success Rate (Hopper)  Success Rate (Walker2d)  Success Rate (HalfCheetah)
PPO  
CPO  
WSR  
TNB  
CTNB (Ours)  
IPD (Ours) 
Fig. 2 shows our experimental results in terms of novelty (the x-axis) and performance (the y-axis). Policies close to the upper right corner are the more novel ones with higher performance. In all environments, CTNB, IPD and CPO outperform WSR and TNB, showing the advantage of constrained optimization approaches in novel policy seeking. Specifically, the results of CTNB are all better than those of its multi-objective counterpart, TNB, showing the superiority of seeking novel policies with constrained optimization. Moreover, the IPD method provides more novelty than CTNB and CPO, while the primal task performance is still guaranteed.
Comparisons of the task-related rewards are carried out in Table 1, where, among the four novelty-seeking methods, IPD provides sufficient diversity with minimum loss of performance. Rather than performance decay, we find that IPD is able to find better policies in the Hopper and HalfCheetah environments. Moreover, in the Hopper environment, the agents trained with PPO tend to fall into the same local minimum (e.g., they all jump as far as possible and then terminate the episode). In contrast, PPO with IPD keeps new agents from falling into the same local minimum, because once an agent has reached some local minimum, agents learned later will try to avoid this region due to the novelty constraints. This property shows that IPD can enhance traditional RL schemes to tackle the local exploration challenge (Tessler et al., 2019; Ciosek et al., 2019). A similar feature brings about reward growth in the HalfCheetah environment. Detailed analysis and discussions are developed in Appendix E.
4.1.2 Success Rate of Each Method
In addition to averaged reward, we also use the success rate as another metric to compare the performance of different approaches. Roughly speaking, the success rate evaluates the stability of each method in terms of generating a policy that performs as well as the policies PPO generates. In this work, we regard a policy as successful when its performance is at least as good as the median performance of policies trained with PPO. To be specific, we use the median of the final performance of PPO as the baseline, and if a novel policy, which aims at performing differently to solve the same task, surpasses the baseline during its training process, it is regarded as a successful policy. By definition, the success rate of PPO is 50% as a baseline for every environment. Table 1 shows the success rate of all the methods. The results show that all constrained novelty-seeking methods (CTNB, IPD, CPO) can surpass the baseline during training, while the multi-objective optimization approaches normally cannot. Thus the performance of constrained novelty-seeking methods can be ensured.
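The success-rate criterion can be written down in a few lines. The sketch below follows the description above (median of the final PPO returns as the baseline, success if the training curve of a novel policy ever reaches it); the function name and all numbers are made up for illustration.

```python
import statistics


def success_rate(ppo_final_returns, novel_training_curves):
    """Fraction of novel policies whose training curve reaches the PPO median."""
    baseline = statistics.median(ppo_final_returns)
    successes = [max(curve) >= baseline for curve in novel_training_curves]
    return sum(successes) / len(successes)


# Hypothetical final returns of PPO policies and training curves of novel policies.
ppo_finals = [900.0, 1000.0, 1100.0]
novel_curves = [
    [200.0, 800.0, 1050.0],   # reaches the baseline at the end
    [100.0, 500.0, 900.0],    # never reaches the baseline
    [300.0, 1200.0, 950.0],   # reaches the baseline during training
    [50.0, 60.0, 70.0],       # never reaches the baseline
]

rate = success_rate(ppo_finals, novel_curves)
```

Note that the third curve counts as a success even though its final return is below the baseline, since the criterion only requires surpassing the baseline at some point during training.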
5 Conclusion
In this work, we rethink the novel policy seeking problem from the perspective of constrained optimization. We introduce a new metric to measure the distance between policies, and then we introduce the definition of policy novelty. We propose a new perspective by connecting the domain of constrained optimization to the domain of RL, and come up with two methods for seeking novel policies, namely the Constrained Task Novel Bisector (CTNB) and Interior Policy Differentiation (IPD). Our experimental results demonstrate that the proposed methods can effectively learn various well-behaved yet diverse policies, outperforming previous methods that follow the multi-objective formulation.
Broader Impact
Our proposed method benefits RL research where diversity between policies is needed. As normal policy gradient methods are only guaranteed to converge to local optima, our proposed method can help discover multiple solutions of a given task, and possibly find better ones.
References

Constrained policy optimization.
In
Proceedings of the 34th International Conference on Machine LearningVolume 70
, pp. 22–31. Cited by: §2, §4.1.1.  Constrained markov decision processes. Vol. 7, CRC Press. Cited by: §2.
 Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 214–223. External Links: Link Cited by: §3.1, §3.1.
 Using confidence bounds for exploitationexploration tradeoffs. Journal of Machine Learning Research 3 (Nov), pp. 397–422. Cited by: §1.
 Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479. Cited by: §1.
 OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.1.
 Largescale study of curiositydriven learning. arXiv preprint arXiv:1808.04355. Cited by: §1, §2, §3.3.
 Exploration by random network distillation. arXiv preprint arXiv:1810.12894. Cited by: §1, §2, §3.3.
 A lyapunovbased approach to safe reinforcement learning. In Advances in neural information processing systems, pp. 8092–8101. Cited by: §2.
 Better exploration with optimistic actor critic. In Advances in Neural Information Processing Systems, pp. 1785–1796. Cited by: §1, §4.1.1.
 A globally convergent lagrangian barrier algorithm for optimization with general inequality constraints and simple bounds. Mathematics of Computation of the American Mathematical Society 66 (217), pp. 261–288. Cited by: §3.4.
 Improving exploration in evolution strategies for deep reinforcement learning via a population of noveltyseeking agents. In Advances in Neural Information Processing Systems, pp. 5027–5038. Cited by: §1, §2.
 Linear programming 2: theory and extensions. Springer Science & Business Media. Cited by: §3.4.

A new metric for probability distributions
. IEEE Transactions on Information theory. Cited by: §3.1.  Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §1, §2.
 Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012. Cited by: §1.
 Jensenshannon divergence and hilbert space embedding. In International Symposium onInformation Theory, 2004. ISIT 2004. Proceedings., pp. 31. Cited by: §3.1.
 Addressing function approximation error in actorcritic methods. arXiv preprint arXiv:1802.09477. Cited by: §3.
 Learning selfimitating diverse policies. arXiv preprint arXiv:1805.10309. Cited by: §1.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §3.

Deep reinforcement learning that matters.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, Cited by: §3.2, §4.  Feasible direction interiorpoint technique for nonlinear optimization. Journal of optimization theory and applications 99 (1), pp. 121–146. Cited by: §3.4.
 Variational information maximizing exploration. Cited by: §1, §2, §3.3.
 Novelty search and the problem with objectives. In Genetic programming theory and practice IX, pp. 37–56. Cited by: §1, §1.

Deceptiveness and genetic algorithm dynamics
. In Foundations of genetic algorithms, Vol. 1, pp. 36–50. Cited by: §1.  Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §3.
 Competitive experience replay. CoRR abs/1902.00528. External Links: Link, 1902.00528 Cited by: §1, §2, §3.3.
 Stein variational policy gradient. arXiv preprint arXiv:1704.02399. Cited by: §1.
 Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8617–8629. Cited by: §1.
 Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034. Cited by: §1.
 Countbased exploration with neural density models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2721–2730. Cited by: §1.

Curiositydriven exploration by selfsupervised prediction.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
, pp. 16–17. Cited by: §1, §2, §3.3.  Multigoal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464. Cited by: §1.
 Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: §1.
 Interiorpoint methods. Journal of Computational and Applied Mathematics 124 (12), pp. 281–302. Cited by: §3.4.

Quality diversity: a new frontier for evolutionary computation
. Frontiers in Robotics and AI 3, pp. 40. Cited by: §1, §2.  Benchmarking safe exploration in deep reinforcement learning. openai. Cited by: §2.
 The wasserstein distance and approximation theorems. Probability Theory and Related Fields 70 (1), pp. 117–129. Cited by: §3.1.
 Feasible direction methods for stochastic programming problems. Mathematical Programming 19 (1), pp. 220–229. Cited by: §3.4.
 Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §1.
 Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §3.1, §3.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3, §4.
 Dynamicsaware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657. Cited by: §2.
 An atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. arXiv preprint arXiv:1812.07069. Cited by: §2.
 Introduction to reinforcement learning. Vol. 2, MIT press Cambridge. Cited by: §1.
 Reinforcement learning: an introduction. Cited by: §1.
 Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §1.
 # exploration: a study of countbased exploration for deep reinforcement learning. In Advances in neural information processing systems, pp. 2753–2762. Cited by: §1.
 Distributional policy optimization: an alternative approach for continuous control. arXiv preprint arXiv:1905.09855. Cited by: §1, §4.1.1.
MuJoCo: a physics engine for model-based control. In IROS, pp. 5026–5033. ISBN 9781467317375. Cited by: §4.1.
 Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §3.1.
On the convergence of the Newton/log-barrier method. Mathematical Programming 90 (1), pp. 71–100. Cited by: §3.4.
Learning novel policies for tasks. CoRR abs/1905.05252. Cited by: Figure 1, §1, §1, §2, §3.3, §3.3, §3.4.
Appendix A Metric Space
Definition 2
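A metric space is an ordered pair $(M, d)$ where $M$ is a set and $d$ is a metric on $M$, i.e., a function $d: M \times M \to \mathbb{R}$ such that for any $x, y, z \in M$, the following holds:
1. $d(x, y) \ge 0$, and $d(x, y) = 0 \Leftrightarrow x = y$,
2. $d(x, y) = d(y, x)$,
3. $d(x, z) \le d(x, y) + d(y, z)$.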
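As an illustration of the three axioms, the following minimal sketch checks them numerically on a finite sample of points. The helper name `is_metric_on` and the two candidate functions are assumptions introduced here, not part of the paper; passing the check on a sample is only evidence, not a proof, that a function is a metric.

```python
import itertools
import math

def is_metric_on(points, d, tol=1e-9):
    """Check the three metric axioms for d on a finite sample of points."""
    for x, y in itertools.product(points, repeat=2):
        # 1. Non-negativity, and d(x, y) = 0 exactly when x = y.
        if d(x, y) < -tol:
            return False
        if (abs(d(x, y)) <= tol) != (x == y):
            return False
        # 2. Symmetry.
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    for x, y, z in itertools.product(points, repeat=3):
        # 3. Triangle inequality.
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

def euclidean(p, q):
    return math.dist(p, q)

def squared_euclidean(p, q):
    # Hypothetical counterexample: violates the triangle inequality.
    return math.dist(p, q) ** 2

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 1.5)]
euclidean_ok = is_metric_on(sample, euclidean)
squared_ok = is_metric_on(sample, squared_euclidean)
```

The Euclidean distance satisfies all three axioms on this sample, while the squared distance fails the third one on the collinear points (0, 0), (1, 0) and (2, 0), since 4 > 1 + 1.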
Appendix B Proof of Proposition 1
The first two properties are obviously guaranteed by . As for the triangle inequality,
Appendix C Proof of Proposition 2
Appendix D Implementation Details
Calculation of
We use the deterministic part of the policies in the calculation of , i.e., we remove the Gaussian noise on the action space in PPO and use .
Network Structure
We use an MLP with 2 hidden layers as the actor model in PPO. The first hidden layer is fixed to have 32 units. We choose to use , and hidden units for the three tasks respectively in all of the main experiments, after taking the success rate, performance and computational expense (i.e., the preference for fewer units when the other two factors are similar) into consideration.
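The actor structure described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the class name `MLPActor`, the tanh activation, the initialization scheme, the second hidden width and the observation and action dimensions are all assumptions. Only the two-hidden-layer layout and the 32-unit first layer come from the text. The forward pass returns the deterministic output only, i.e., no Gaussian noise is added to the action.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # Assumed initialization: uniform in [-1/sqrt(n_in), 1/sqrt(n_in)], zero biases.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, activation=None):
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

class MLPActor:
    """Two-hidden-layer actor; the first hidden layer has 32 units as in the text."""

    def __init__(self, obs_dim, act_dim, second_hidden, first_hidden=32, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(obs_dim, first_hidden, rng)
        self.l2 = init_layer(first_hidden, second_hidden, rng)
        self.out = init_layer(second_hidden, act_dim, rng)

    def mean_action(self, obs):
        # Deterministic part of the policy: no sampling noise is applied.
        h = dense(obs, self.l1, math.tanh)   # tanh is an assumed activation
        h = dense(h, self.l2, math.tanh)
        return dense(h, self.out)

# Illustrative sizes only; second_hidden=10 is a placeholder, not a value from the paper.
actor = MLPActor(obs_dim=11, act_dim=3, second_hidden=10)
action = actor.mean_action([0.1] * 11)
```

Keeping the second hidden width as a constructor argument mirrors the per-task choice of hidden units described above.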
Training Timesteps
We fix the training timesteps in our experiments. The timesteps are fixed to be M in Hopper-v3, M in Walker2d-v3 and M in HalfCheetah-v3.
Appendix E Discussion
E.1 Novel Policy Seeking without Performance Decay
The multi-objective formulation of novel policy seeking risks sacrificing the primal performance, as the overall objective needs to consider both novelty and primal task rewards. In contrast, from the perspective of constrained optimization, there is no longer a trade-off between novelty and final reward, as the only objective is the task reward. Given a certain novelty threshold, the algorithms tend to find the optimal solution in terms of task reward under the constraints. The learning process thus becomes more controllable and reliable, i.e., one can use the novelty threshold to control the degree of novelty.
Intuitively, a proper magnitude of the novelty threshold will lead to more exploration among a population of policies, so the performance of later-found policies may be better than or at least as good as that of policies trained without novelty seeking. However, when a larger novelty threshold is applied, the performance of the novel policies found will decrease, because finding a feasible solution becomes harder under stricter constraints. Fig. 3 shows our ablation study on adjusting the thresholds, which verifies our intuition.
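The contrast between the two formulations can be written schematically as below. The symbols are assumptions introduced for illustration and may differ from the notation of the main text: $J_{\text{task}}$ denotes the expected task return, $J_{\text{novel}}$ a novelty measure, $\lambda$ a weighting coefficient and $r_0$ the novelty threshold.

```latex
% Multi-objective formulation: novelty enters the objective, weighted by \lambda,
% so increasing novelty can trade off against task reward.
\max_{\theta}\; J_{\text{task}}(\theta) + \lambda\, J_{\text{novel}}(\theta)

% Constrained formulation: task reward is the only objective;
% novelty only defines the feasible set through the threshold r_0.
\max_{\theta}\; J_{\text{task}}(\theta)
\quad \text{s.t.} \quad J_{\text{novel}}(\theta) \ge r_0
```

In the constrained form, raising $r_0$ shrinks the feasible set, which is consistent with the observation that overly large thresholds make feasible solutions harder to find.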
E.2 Curriculum Learning in HalfCheetah
Moreover, we observe a kind of auto-curriculum learning behavior in the learning of HalfCheetah, which may also help explain the performance improvement in this environment. HalfCheetah differs from the other two environments in that its default setting has no explicit early termination signal (i.e., there is no explicit threshold on the states whose violation would trigger termination). At the beginning of learning, a PPO agent acts randomly and keeps twitching without moving, resulting in a large number of repeated, trivial samples and large control costs. In contrast, when learning with IPD, the agent can receive termination signals, since repeated behaviors break the novelty constraint; this prevents it from wasting too much effort acting randomly. Moreover, such termination signals also encourage the agent to imitate previous policies to get out of random exploration at the starting stage, avoiding heavy control costs and receiving fewer negative rewards. After that, the agent begins to learn to behave differently to pursue higher positive rewards. From this point of view, the learning process can be interpreted as a kind of implicit curriculum, which saves many interactions with the environment, improves sample efficiency, and therefore achieves better performance within the given number of training timesteps.
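The early-termination effect described above can be illustrated with a toy rollout loop. This is a sketch under stated assumptions, not the IPD termination rule itself: the function names, the running-mean novelty criterion, the `warmup` parameter and the one-dimensional toy dynamics are all hypothetical.

```python
def rollout_with_novelty_termination(step_fn, novelty_fn, threshold, max_steps, warmup=0):
    """Run an episode; stop early once mean per-step novelty drops below threshold.

    Returns (total_reward, steps_taken, terminated_early).
    """
    total_reward = 0.0
    total_novelty = 0.0
    for t in range(1, max_steps + 1):
        state, reward = step_fn(t)
        total_reward += reward
        total_novelty += novelty_fn(state)
        # Assumed criterion: compare the running mean novelty to the threshold.
        if t > warmup and total_novelty / t < threshold:
            return total_reward, t, True
    return total_reward, max_steps, False

def twitching_step(t):
    # Toy agent that never leaves state 0.0 and pays a control cost each step.
    return 0.0, -1.0

def moving_step(t):
    # Toy agent that moves away from the origin and earns positive reward.
    return 0.5 * t, 1.0

def novelty_to_previous(state):
    # Toy novelty: distance to a previous policy that always stays at 0.0.
    return abs(state - 0.0)

twitch_result = rollout_with_novelty_termination(
    twitching_step, novelty_to_previous, threshold=0.1, max_steps=100, warmup=5)
moving_result = rollout_with_novelty_termination(
    moving_step, novelty_to_previous, threshold=0.1, max_steps=100, warmup=5)
```

In this toy setting the twitching agent is cut off right after the warm-up period and accumulates only a small control cost, while the agent that behaves differently from the previous policy runs for the full episode.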
Appendix F Visualize Diversity in Toy Model
F.1 The Four Reward Maze Problem
We first use a basic 2D environment named Four Reward Maze as a diagnostic environment in which learned policies can be visualized directly. In this environment, four positive rewards of different values (e.g., for top, down, left and right respectively) are assigned to the four midpoints of the edges of a 2D square map, each with radius . We use in our experiments. The observation of a policy is the current position, and the agent receives a negative reward of at each timestep unless it steps into a reward region. Each episode starts from a randomly initialized position and the action space is limited to . The performance of each agent is evaluated by averaging over 100 trials.
Results are shown in Fig. 4. The behaviors of the PPO agents are quite similar, suggesting that the diversity provided by random seeds is limited. WSR and TNB approach the novelty-seeking problem through the multi-objective optimization formulation and thus suffer from an imbalance between performance and novelty. While WSR and TNB both provide sufficient novelty, the performance of agents learned by WSR decays significantly, as does that of TNB due to an encumbered learning process, as analyzed in Sec. 3.3. Both CTNB and IPD, which solve the novelty-seeking task through the constrained optimization formulation, provide evident behavior diversity and perform noticeably better than TNB and WSR.
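A minimal sketch of such an environment is given below. All numeric values (reward magnitudes, radius, step penalty, action bound, map size) are placeholders chosen for illustration, not the values used in the experiments. The class name, the interface, and the choice to end the episode when a reward region is reached are likewise assumptions.

```python
import math
import random

class FourRewardMaze:
    """Toy 2D square map with a reward region at the midpoint of each edge."""

    def __init__(self, rewards, radius, step_penalty, max_action, size=1.0, seed=0):
        half = size / 2.0
        # (center, reward value) for each of the four edge midpoints.
        self.goals = [
            ((half, size), rewards["top"]),
            ((half, 0.0), rewards["down"]),
            ((0.0, half), rewards["left"]),
            ((size, half), rewards["right"]),
        ]
        self.radius = radius
        self.step_penalty = step_penalty
        self.max_action = max_action
        self.size = size
        self.rng = random.Random(seed)
        self.pos = (half, half)

    def reset(self):
        # Each episode starts from a randomly initialized position.
        self.pos = (self.rng.uniform(0.0, self.size), self.rng.uniform(0.0, self.size))
        return self.pos

    def step(self, action):
        # Clip each action component to the allowed range, then clip to the map.
        dx, dy = (max(-self.max_action, min(self.max_action, a)) for a in action)
        x = max(0.0, min(self.size, self.pos[0] + dx))
        y = max(0.0, min(self.size, self.pos[1] + dy))
        self.pos = (x, y)
        for (gx, gy), value in self.goals:
            if math.hypot(x - gx, y - gy) <= self.radius:
                return self.pos, value, True   # assumed: reaching a reward ends the episode
        return self.pos, self.step_penalty, False

# Placeholder values for illustration only.
env = FourRewardMaze(
    rewards={"top": 4.0, "down": 3.0, "left": 2.0, "right": 1.0},
    radius=0.1, step_penalty=-0.01, max_action=0.1)
env.pos = (0.5, 0.9)
obs, reward, done = env.step((0.0, 0.1))          # steps into the top reward region
env.pos = (0.5, 0.5)
obs2, reward2, done2 = env.step((1.0, 0.0))       # action is clipped to max_action
```

Because the observation is just the 2D position, trajectories of different policies can be plotted directly on the map to compare their behaviors.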