1 Introduction
While deep reinforcement learning (RL) approaches have led to many successful applications in challenging domains like Atari (mnih2015human), Go (silver2016mastering), Chess (silver2018general), Dota (berner2019dota), and StarCraft (vinyals2019grandmaster) in recent years, deep RL agents still prove to be brittle, often failing to transfer to environments only slightly different from those encountered during training (zhang2018dissection; coinrun). To ensure learning of robust and well-generalizing policies, agents must train on sufficiently diverse and informative variations of environments (e.g. see Section 3.1 of (procgen_benchmark)). However, it is not always feasible to specify an appropriate training distribution or a generator thereof. Agents may therefore benefit greatly from methods that automatically adapt the distribution over environment variations throughout training (paired; plr). Throughout this paper, we will call a particular environment instance or configuration (e.g. an arrangement of blocks, race tracks, or generally any of the environment's constituent entities) a level.
Two recent works (paired; plr) have sought to empirically demonstrate this need for a more targeted, agent-adaptive mechanism for selecting levels on which to train RL agents, so as to ensure efficient learning and generalization to unseen levels, as well as to provide methods implementing such mechanisms. The first method, Protagonist Antagonist Induced Regret Environment Design (PAIRED) (paired), introduces a self-supervised RL paradigm called Unsupervised Environment Design (UED). Here, an environment generator (a teacher) is coevolved with a student policy that trains on levels actively proposed by the teacher, leading to a form of adaptive curriculum learning. The aim of this coevolution is for the teacher to gradually learn to generate environments that exemplify properties of those that might be encountered at deployment time, and for the student to simultaneously learn a good policy that enables zero-shot transfer to such environments. PAIRED's specific adversarial approach to environment design ensures a useful robustness characterization of the final student policy in the form of a minimax regret guarantee (savage1951theory), assuming that its underlying teacher-student multi-agent system arrives at a Nash equilibrium (NE, nash1950equilibrium). In contrast, the second method, Prioritized Level Replay (PLR) (plr)
, embodies an alternative form of dynamic curriculum learning that does not assume control of level generation, but instead assumes the ability to selectively replay existing levels. PLR tracks levels previously proposed by a black-box environment generator and, for each, estimates the agent's learning potential in that level, in terms of how useful it would be to gather new experience from that level again in the future. The PLR algorithm exploits these scores to adapt a schedule for revisiting or
replaying levels to maximize learning potential. PLR has been shown to produce scalable and robust results, improving both the sample complexity of agent training and the generalization of the learned policy in diverse environments. However, unlike PAIRED, PLR is motivated by heuristic arguments and lacks a useful theoretical characterization of its learning behavior.
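The curation mechanic PLR relies on can be sketched in a few lines. This is a minimal illustration of the replay-or-generate decision and score-based eviction, not the authors' implementation; the bandit-style prioritization and staleness weighting of the full algorithm are omitted:

```python
import random

class LevelBuffer:
    """Minimal sketch of PLR-style level curation (illustrative only)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.scores = {}  # level -> estimated learning potential

    def update(self, level, score):
        # Track the latest learning-potential estimate for this level.
        self.scores[level] = score
        if len(self.scores) > self.capacity:
            # Evict the level with the lowest learning potential.
            worst = min(self.scores, key=self.scores.get)
            del self.scores[worst]

    def sample_replay(self):
        # Sample levels proportionally to estimated learning potential.
        levels, weights = zip(*self.scores.items())
        return random.choices(levels, weights=weights, k=1)[0]

def next_level(buffer, generate_fn, replay_prob=0.5):
    """With probability replay_prob, replay a curated level; otherwise generate."""
    if buffer.scores and random.random() < replay_prob:
        return buffer.sample_replay(), True   # replayed
    return generate_fn(), False               # newly generated
```

Here the learning-potential scores stand in for PLR's value-loss-based estimates; any scoring function could be plugged in.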
In this paper, we argue that PLR is, in and of itself, an effective form of UED: Through curating even randomly generated levels, PLR can generate novel and complex levels for learning robust policies. This insight leads to a natural class of UED methods which we call Dual Curriculum Design (DCD). In DCD, a student policy is challenged by a team of two coevolving teachers. One teacher actively generates new, challenging levels, while the other passively curates existing levels for replaying, by prioritizing those estimated to be most suitably challenging for the student. We show that PAIRED and PLR are distinct members of the DCD class of algorithms and prove in Section 3 that all DCD algorithms enjoy similar minimax regret guarantees to that of PAIRED.
We make use of this result to provide the first theoretical characterization of PLR, which immediately suggests a simple yet highly counterintuitive adjustment to PLR: by only training on trajectories in replay levels, PLR becomes provably robust at NE. We call this resulting variant PLR⊥ (Section 4). From this perspective, we see that, in a diametrically opposite manner to PAIRED, PLR effectively performs level design through prioritized selection rather than active generation. A second corollary to the provable robustness of DCD algorithms shows that PLR can be extended to make use of the teacher from PAIRED as a level generator while preserving the robustness guarantee of PAIRED, resulting in a method we call Replay-Enhanced PAIRED (REPAIRED) (Section 5). We hypothesize that in this arrangement, PLR plays a complementary role to PAIRED in robustifying student policies.
Our experiments in Section 6 investigate the learning dynamics of PLR⊥, REPAIRED, and their replay-free counterparts in a challenging maze domain and a novel continuous-control UED setting based on the popular CarRacing environment (gym). In both of these highly distinct settings, our methods provide significant improvements over PLR and PAIRED, producing agents capable of out-of-distribution (OOD) generalization to a variety of human-designed mazes and Formula 1 tracks.
In summary, we present the following contributions: (i) We establish a common framework, Dual Curriculum Design, that encompasses PLR and PAIRED. This allows us to develop new theory, which provides the first robustness guarantees for PLR at NE, as well as for REPAIRED, which augments PAIRED with a PLR-based replay mechanism. (ii) Crucially, our theory suggests a highly counterintuitive improvement to PLR: convergence to NE should be assisted by training on less data, namely by only taking gradient updates from data that originates from the PLR buffer, using the samples from the environment distribution only for computing the prioritization of levels in the buffer. (iii) Our experiments across a maze domain and a novel car-racing domain show that this indeed improves the performance of PLR. Both this new variant of PLR and REPAIRED outperform alternative UED methods in these two challenging domains.
2 Background
2.1 Unsupervised Environment Design
Unsupervised Environment Design (UED), as introduced by (paired), is the problem of automatically designing a distribution of environments that adapts to the learning agent. UED is defined in terms of an Underspecified POMDP (UPOMDP), given by $\mathcal{M} = \langle A, O, \Theta, S, \mathcal{T}, \mathcal{I}, \mathcal{R}, \gamma \rangle$, where $A$ is a set of actions, $O$ is a set of observations, $S$ is a set of states, $\mathcal{T}: S \times A \times \Theta \rightarrow \Delta(S)$ is a transition function, $\mathcal{I}: S \rightarrow O$ is an observation (or inspection) function, $\mathcal{R}: S \rightarrow \mathbb{R}$ is a reward function, and $\gamma$ is a discount factor. This definition is identical to a POMDP with the addition of $\Theta$ to represent the free parameters of the environment. These parameters can be distinct at every time step and are incorporated into the transition function $\mathcal{T}$. For example, $\theta$ could represent the possible positions of obstacles in a maze. We will refer to the environment resulting from a fixed $\theta$ as $\mathcal{M}_\theta$, or with a slight abuse of notation, simply $\theta$ when clear from context. We define the value of $\pi$ in $\mathcal{M}_\theta$ to be $V^\theta(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $r_t$ are the rewards attained by $\pi$ in $\mathcal{M}_\theta$. Aligning with terminology from (plr), we refer to a fully-specified environment $\mathcal{M}_\theta$ as a level.
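As a concrete, purely illustrative example of free parameters, consider a toy UPOMDP where $\theta$ is the set of blocked cells in a 1-D corridor. The names and dynamics below are our own sketch, not the paper's environments:

```python
from dataclasses import dataclass

@dataclass
class UPOMDP:
    """Toy UPOMDP: a 1-D corridor whose obstacle set is the free parameter theta."""
    length: int = 8

    def step(self, state, action, theta):
        # theta is a set of blocked cells; the transition function depends on it.
        nxt = max(0, min(self.length - 1, state + action))
        if nxt in theta:        # blocked cell: the move fails
            nxt = state
        reward = 1.0 if nxt == self.length - 1 else 0.0
        return nxt, reward

def value(env, policy, theta, start=0, horizon=20, gamma=0.99):
    """Evaluate a deterministic policy on the level defined by theta."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        state, r = env.step(state, policy(state), theta)
        total += discount * r
        discount *= gamma
        if r > 0:               # reached the goal
            break
    return total
```

Fixing, say, theta = {3} yields one level of this underspecified environment; a different obstacle set is a different level of the same UPOMDP.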
2.2 Protagonist Antagonist Induced Regret Environment Design
Protagonist Antagonist Induced Regret Environment Design (PAIRED, paired) presents a UED approach consisting of simultaneously training agents in a three-player game: the protagonist $\pi^P$ and the antagonist $\pi^A$ are trained in environments generated by the teacher $\tilde{\theta}$. The objective of this game is defined by the regret, $\textsc{Regret}^\theta(\pi^P, \pi^A) = V^\theta(\pi^A) - V^\theta(\pi^P)$. The protagonist and antagonist are both trained to maximize their discounted environment returns, while the teacher is trained to maximize $\textsc{Regret}^\theta(\pi^P, \pi^A)$. Note that by maximizing regret, the teacher is disincentivized from generating unsolvable levels, which have a maximum regret of 0. As shorthand, we will sometimes refer to the protagonist and antagonist jointly as the student agents. The counterclockwise loop beginning at the student agents in Figure 2 summarizes this approach, with the students being both the protagonist and antagonist.
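The teacher's objective reduces to a one-line score per proposed level. The sketch below is hypothetical; the `evaluate` helper is assumed to return a policy's discounted return on a level:

```python
def paired_regret(level, protagonist, antagonist, evaluate):
    """Teacher's objective: REGRET(level) = V(antagonist) - V(protagonist).

    `evaluate(policy, level)` is an assumed helper returning the policy's
    discounted return on the level. An unsolvable level gives both students
    a return of 0, hence a regret of 0, which is why regret maximization
    steers the teacher away from unsolvable levels.
    """
    return evaluate(antagonist, level) - evaluate(protagonist, level)
```

The teacher would then adapt its level designs (by gradient updates in PAIRED) to maximize this score.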
As both student agents grow more adept at solving different levels, the teacher continues to adapt its level designs to exploit the weaknesses of the protagonist in relation to the antagonist. As this dynamic unfolds, PAIRED produces an emergent curriculum of progressively more complex levels along the boundary of the protagonist’s capabilities. PAIRED is a creative method in the sense that the teacher may potentially generate an endless sequence of novel levels. However, as the teacher only adapts through gradient updates, it is inherently slow to adapt to changes in the student policies.
2.3 Prioritized Level Replay
Prioritized Level Replay (PLR, plr)
is an active-learning strategy shown to improve a policy's sample efficiency and generalization to unseen levels when training and evaluating on levels from a common UPOMDP, typically implemented as a seeded simulator. PLR maintains a level buffer $\Lambda$ of the top previously visited levels with the highest learning potential, as estimated by the time-averaged L1 value loss of the learning agent over the last episode on each level. At the start of each training episode, with some predefined replay probability $p$, PLR uses a bandit to sample the level from $\Lambda$ that maximizes the estimated learning potential; otherwise, with probability $1-p$, PLR samples a new level from the simulator. In contrast to the generative but slow-adapting PAIRED, PLR does not create new levels, but instead acts as a fast-adapting curation mechanism for selecting the next training level among previously encountered levels. Also unlike PAIRED, PLR does not provide a robustness guarantee. By extending the theoretical foundation of PAIRED to PLR, we will show how PLR can be modified to provide a robustness guarantee at NE, as well as how PAIRED can exploit PLR's complementary curation to quickly switch among generated levels to maximize the student's regret.

3 The Robustness of Dual Curriculum Design
The previous approaches of PAIRED and PLR reveal a natural duality: approaches that gradually learn to generate levels, like PAIRED, and methods which cannot generate levels but instead quickly curate existing ones, like PLR. This duality suggests combining slow level generators with fast level curators. We call this novel class of UED algorithms Dual Curriculum Design (DCD). For instance, PLR can be seen as a curator using a prioritized sampling mechanism, paired with a random generator, while PAIRED is a regret-maximizing generator without a curator. DCD further encompasses Domain Randomization (DR) as a degenerate case: a random level generator without a curator.
To theoretically analyze this space of methods, we model DCD as a three-player game among a student agent and two teachers, called the dual curriculum game. However, to formalize this game, we must first formalize the single-teacher setting: Suppose the UPOMDP is clear from context. Then, given a utility function $U_t(\pi, \theta)$ for a single teacher, we can naturally define the base game between the student and teacher as $G = \langle (\pi, \theta), (U_S, U_t) \rangle$, where $\Pi \ni \pi$ is the strategy set of the student, $\Theta \ni \theta$ is the strategy set of the teacher, and $U_S(\pi, \theta) = V^\theta(\pi)$ is the utility function of the student. In Sections 4 and 5, we will study settings corresponding to different choices of utility functions for the teacher agents, namely the maximum-regret objective $U^R_t$ and the uniform objective $U^U_t$. These two objectives are defined as follows (for any constant $C$):
$$U^R_t(\pi, \theta) = \textsc{Regret}^\theta(\pi) = \max_{\pi^* \in \Pi} \left\{ V^\theta(\pi^*) - V^\theta(\pi) \right\} \quad (1)$$
$$U^U_t(\pi, \theta) = C \quad (2)$$
In the dual curriculum game $\overline{G}$, the first teacher plays the game with probability $p$, and the second, with probability $1-p$, or, more formally, $\overline{G} = \langle (\pi, \theta^1, \theta^2), (\overline{u}_S, \overline{u}^1_t, \overline{u}^2_t) \rangle$, where the utility functions for the student and two teachers respectively are defined as follows:
$$\overline{u}_S(\pi, \theta^1, \theta^2) = p \cdot U_S(\pi, \theta^1) + (1-p) \cdot U_S(\pi, \theta^2) \quad (3)$$
$$\overline{u}^1_t(\pi, \theta^1, \theta^2) = p \cdot U^1_t(\pi, \theta^1) \quad (4)$$
$$\overline{u}^2_t(\pi, \theta^1, \theta^2) = (1-p) \cdot U^2_t(\pi, \theta^2) \quad (5)$$
Our main theorem is that NE of the dual curriculum game are approximate NE of both the base game for either of the original teachers and the base game with a teacher maximizing their joint objective $p \cdot U^1_t + (1-p) \cdot U^2_t$, where the quality of the approximation depends on the mixing probability $p$.
Theorem 1.
Let $\bar{B}$ bound both the maximum difference between $U^1_t$ and $U^2_t$ and the range of the student utility $U_S$, and let $(\pi, \theta^1, \theta^2)$ be a NE for $\overline{G}$. Then $\pi$, paired with the corresponding teacher strategy, forms an approximate NE for the base game with either teacher or for a teacher optimizing their joint objective. More precisely, $(\pi, p\theta^1 + (1-p)\theta^2)$ is a $2p(1-p)\bar{B}$-approximate NE when $u_t = p \cdot U^1_t + (1-p) \cdot U^2_t$, $(\pi, \theta^1)$ is a $\frac{1-p}{p}\bar{B}$-approximate NE when $u_t = U^1_t$, and $(\pi, \theta^2)$ is a $\frac{p}{1-p}\bar{B}$-approximate NE when $u_t = U^2_t$.
The intuition behind this theorem is that, since the two teachers do not affect each other's behavior, their best response to a fixed student $\pi$ is to choose a strategy that maximizes $U^1_t$ and $U^2_t$, respectively. Moreover, the two teachers' strategies can be viewed as a single combined strategy for the base game with the joint objective, or with each teacher's own objective. In fact, the teachers provide an approximate best response to each case of the base game simply by playing their individual best responses. Thus, when we reach a NE of the dual curriculum game, the teachers arrive at approximate best responses for both the base game with the joint objective and with their own objectives, meaning they are also in an approximate NE of the base game with either teacher. The full details of this proof are outlined in Appendix A.
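The teacher-side half of this intuition is easy to check numerically: in a finite game, if each teacher best-responds to a fixed student, the mixed strategy lands within $2p(1-p)\bar{B}$ of the joint best response. The following brute-force check uses arbitrary bounded utility tables of our own construction, matching the bound as reconstructed here:

```python
import random

random.seed(1)
levels = list(range(6))
# Arbitrary bounded teacher utilities for a fixed student policy pi.
u1 = {t: random.uniform(0, 1) for t in levels}
u2 = {t: random.uniform(0, 1) for t in levels}
B = max(abs(u1[t] - u2[t]) for t in levels)  # max difference between U1 and U2

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    joint = lambda t: p * u1[t] + (1 - p) * u2[t]
    t1 = max(levels, key=lambda t: u1[t])      # teacher 1's best response
    t2 = max(levels, key=lambda t: u2[t])      # teacher 2's best response
    t_star = max(levels, key=joint)            # joint best response
    # Utility of the combined strategy p*t1 + (1-p)*t2 under the joint objective.
    mix_value = p * joint(t1) + (1 - p) * joint(t2)
    gap = joint(t_star) - mix_value
    assert gap <= 2 * p * (1 - p) * B + 1e-9   # the teachers' gap stays bounded
```

The assertion holds for any bounded utility tables, since each individual best response can undershoot the joint objective by at most $\bar{B}$ on the other teacher's component.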
4 Robustifying PLR
In this section, we provide theoretical justification for the empirically observed effectiveness of PLR, and in the process, motivate a counterintuitive adjustment to the algorithm.
4.1 Achieving Robustness Guarantees with PLR
PLR provides strong empirical gains in generalization, but lacks any theoretical guarantees of robustness. One step towards achieving such a guarantee is to replace its L1 value-loss prioritization with regret prioritization, using the methods we discuss in Section 4.2: While the L1 value loss may be good for quickly training the value function, it can bias the long-term training behavior toward high-variance policies. However, even with this change, PLR holds weaker theoretical guarantees, because the randomly generating teacher can bias the student away from minimax regret policies and instead toward policies that sacrifice robustness in order to excel in unstructured levels. We formalize this intuitive argument in the following corollary of Theorem 1.

Corollary 1.
Let $\overline{G}$ be the dual curriculum game in which the first teacher maximizes regret, so $U^1_t = U^R_t$, and the second teacher plays randomly, so $U^2_t = U^U_t$. Let $V^\theta(\pi)$ be bounded in $[0, \bar{B}]$ for all $\pi, \theta$. Further, suppose that $(\pi, \theta^1, \theta^2)$ is a Nash equilibrium of $\overline{G}$. Let $R^* = \min_{\pi' \in \Pi} \max_{\theta \in \Theta} \{\textsc{Regret}^\theta(\pi')\}$ be the optimal worst-case regret. Then $\pi$ is close to having optimal worst-case regret, or formally, $\max_{\theta \in \Theta} \{\textsc{Regret}^\theta(\pi)\} \leq R^* + \frac{1-p}{p}\bar{B}$. Moreover, there exist environments for all values of $p$ within a constant factor of achieving this bound.
The proof of Corollary 1 follows from a direct application of Theorem 1 to show that a NE of $\overline{G}$ is an approximate NE for the base game of the first teacher, and through constructing a simple example where the student's best response in $\overline{G}$ fails to attain the minimax regret in $G$. These arguments are described in full in Appendix A. This corollary provides some justification for why PLR improves robustness of the equilibrium policy, as it biases the resulting policy toward a minimax regret policy. However, it also points a way towards further improving PLR: if the probability of training directly on a newly generated level was set to zero, then in equilibrium, the resulting policy converges to a minimax regret policy. Consequently, we arrive at the counterintuitive idea of avoiding gradient updates from trajectories collected from randomly sampled levels, to ensure that at NE, we find a minimax regret policy. From a robustness standpoint, it is therefore optimal to train on less data. The modified PLR algorithm with this counterintuitive adjustment is summarized in Algorithm 1, in which this small change relative to the original algorithm is highlighted in blue.
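A minimal sketch of the adjusted training loop follows. The helpers (`rollout`, `grad_update`, `score_regret`) are assumed stand-ins, and greedy selection stands in for PLR's prioritized sampling; the key property is that gradient updates occur only on replayed levels:

```python
import random

def robust_plr_step(policy, buffer, generate_level, rollout, grad_update,
                    score_regret, replay_prob=0.5):
    """One iteration of the robust PLR variant (sketch, not Algorithm 1 itself).

    rollout(policy, level) -> trajectory
    grad_update(policy, trajectory) -> updated policy
    score_regret(trajectory) -> float regret estimate
    """
    if buffer and random.random() < replay_prob:
        # Replay: sample a curated level and train on it.
        level = max(buffer, key=buffer.get)    # greedy stand-in for prioritized sampling
        traj = rollout(policy, level)
        policy = grad_update(policy, traj)     # gradient update ONLY on replay
    else:
        # New level: roll out to score it, but take NO gradient update.
        level = generate_level()
        traj = rollout(policy, level)
    buffer[level] = score_regret(traj)         # update the regret estimate either way
    return policy
```

Trajectories on newly generated levels are thus used only to compute level prioritization, never to update the policy.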
4.2 Estimating Regret
In general, levels may differ in maximum achievable returns, making it impossible to know the true regret of a level without access to an oracle. As the L1 value loss typically employed by PLR does not generally correspond to regret, we turn to alternative scoring functions that better approximate regret. Two approaches, both effective in practice, are discussed below.
Positive Value Loss. Averaging over all transitions with positive value loss amounts to estimating regret as the difference between maximum achieved return and predicted return on an episodic basis. However, this estimate is highly biased, as the value targets are tied to the agent's current, potentially suboptimal policy. Because it considers only positive value losses, this scoring function leads to optimistic sampling of levels with respect to the current policy. When using GAE (gae) to estimate bootstrapped value targets, this loss takes the following form, where $\lambda$ and $\gamma$ are the GAE and MDP discount factors respectively, and $\delta_t$ is the TD-error at timestep $t$:

$$\frac{1}{T}\sum_{t=0}^{T} \max\left(\sum_{k=t}^{T} (\gamma\lambda)^{k-t}\delta_k,\ 0\right)$$
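A direct implementation of this GAE-based score from a sequence of TD-errors might look as follows (our sketch; in practice the TD-errors would come from the agent's value network during the rollout):

```python
def positive_value_loss(td_errors, gamma=0.99, gae_lambda=0.95):
    """Average positive GAE advantage over an episode:
    (1/T) * sum_t max(sum_{k>=t} (gamma * lambda)^(k - t) * delta_k, 0)."""
    T = len(td_errors)
    score = 0.0
    for t in range(T):
        gae, coef = 0.0, 1.0
        for k in range(t, T):
            gae += coef * td_errors[k]
            coef *= gamma * gae_lambda
        score += max(gae, 0.0)   # only positive advantages contribute
    return score / T
```

Negative advantages are clipped to zero before averaging, which is what makes the resulting level sampling optimistic with respect to the current policy.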
Maximum Monte Carlo (MaxMC). We can mitigate some of the bias of the positive value loss by replacing the value target with the highest return achieved on the given level so far during training. By using this maximal return, the regret estimates no longer depend on the agent's current policy. This estimator takes the simple form $\frac{1}{T}\sum_{t=0}^{T}(R_{\max} - V(s_t))$. In our dense-reward experiments, we compute this score as the difference between the maximum achieved return and $V(s_0)$.
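Assuming the per-level maximum return is tracked during training, the MaxMC score is a one-liner over the episode's value predictions (a hypothetical helper, not the paper's code):

```python
def max_monte_carlo(values, max_return_so_far):
    """Average (R_max - V(s_t)) over the episode's visited states.

    values: the value network's predictions V(s_t) along the trajectory.
    max_return_so_far: the highest return achieved on this level during training.
    """
    return sum(max_return_so_far - v for v in values) / len(values)
```

Because the target is a fixed historical maximum rather than a bootstrapped estimate, this score does not chase the current policy's own value targets.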
5 Replay-Enhanced PAIRED (REPAIRED)
A straightforward extension of PLR is to replace the random teacher (i.e. the level generator) used by PLR with the PAIRED teacher. This addition then necessarily entails introducing a second student agent, the antagonist, also equipped with its own PLR level buffer. In each episode, with probability $1-p$, the student agents train on a newly generated level, and with probability $p$, they train on a level sampled from each student's own PLR buffer, prioritizing levels by highest estimated regret. We will refer to this extension as Replay-Enhanced PAIRED (REPAIRED). An overview of REPAIRED is provided by black arrows in Figure 2, with the students being both the protagonist and antagonist, while the full procedure is outlined in Appendix B.
Since REPAIRED's variant of PLR and PAIRED both promote regret in equilibrium, it would be reasonable to believe the combination of the two does the same. A straightforward corollary of Theorem 1, which we describe in Appendix A, shows that, in a theoretically ideal setting, combining these two algorithms as is done in REPAIRED indeed finds minimax regret strategies in equilibrium.
Corollary 2.
Let $\overline{G}$ be the dual curriculum game in which both teachers maximize regret, so $U^1_t = U^2_t = U^R_t$. Further, suppose that $(\pi, \theta^1, \theta^2)$ is a Nash equilibrium of $\overline{G}$. Then $\pi \in \arg\min_{\pi' \in \Pi} \max_{\theta \in \Theta} \{\textsc{Regret}^\theta(\pi')\}$.
This result gives us some amount of assurance that, if our method arrives at NE, then the protagonist has converged to a minimax regret strategy, which has the benefits outlined in (paired): Since a minimax regret policy solves all solvable environments, whenever this is possible and sufficiently welldefined, we should expect policies resulting from the equilibrium behavior of REPAIRED to be robust and versatile across all environments in the domain.
6 Experiments
Our experiments firstly aim to (1) assess the empirical performance of the theoretically motivated PLR⊥, and secondly, seek to better understand the effect of replay on unsupervised environment design, specifically (2) its impact on the zero-shot generalization performance of the induced student policies, and (3) the complexity of the levels designed by the teacher. To do so, we compare PLR⊥ and REPAIRED against their replay-free counterparts, DR and PAIRED, in the two highly distinct settings of discrete control with sparse rewards and continuous control with dense rewards. We provide environment descriptions alongside model and hyperparameter choices in Appendix D.

6.1 Partially-Observable Navigation
Each navigation level is a partially-observable maze requiring student agents to take discrete actions to reach a goal and receive a sparse reward. Our agents use PPO (schulman2017proximal) with an LSTM-based recurrent policy to handle partial observability. Before each episode, the teacher designs the level in this order: beginning with an empty maze, it places one obstructing block per time step up to a predefined block budget, and finally places the agent, followed by the goal.
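The design order described above amounts to a short decision sequence; a toy version over an arbitrary 5×5 grid (our illustration, with a random stand-in for the learned teacher) is:

```python
import random

def design_maze(teacher_choose, size=5, block_budget=6):
    """Build a level in the order used by the maze teacher:
    blocks first (one per time step), then the agent, then the goal."""
    free = [(r, c) for r in range(size) for c in range(size)]
    blocks = set()
    for _ in range(block_budget):
        cell = teacher_choose(free)   # one obstructing block per time step
        free.remove(cell)
        blocks.add(cell)
    agent = teacher_choose(free)      # agent placed after all blocks
    free.remove(agent)
    goal = teacher_choose(free)       # goal placed last
    free.remove(goal)
    return blocks, agent, goal

random.seed(0)
blocks, agent, goal = design_maze(lambda cells: random.choice(cells))
```

A learned teacher would replace the random `teacher_choose` with a policy over the remaining free cells.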
Zero-Shot Generalization. We train policies with each method for 250M steps and evaluate zero-shot generalization on several challenging OOD environments, in addition to levels from the full distribution of two procedurally-generated environments, PerfectMazes and LargeCorridor. We also compare against DR and minimax baselines. Our results in Figures 3 and 4 show that PLR⊥ and REPAIRED both achieve greater sample-efficiency and zero-shot generalization than their replay-free counterparts. The improved test performance achieved by PLR⊥ over both DR and PLR when trained for an equivalent number of gradient updates, aggregated over all test mazes, is statistically significant, as is the improved test performance of REPAIRED over PAIRED. Table 2 in Appendix C.1 reports the mean performance of each method over all test mazes. Notably, well before 250 million steps, both PLR and PLR⊥ significantly outperform the performance reached by PAIRED after 3 billion training steps as reported in (paired). Further, these two methods lead to policies exhibiting greater zero-shot transfer than both PAIRED and REPAIRED. The success of designing regret-maximizing levels via random search (curation) over learning a generator with RL suggests that for some UPOMDPs, the regret landscape, as a function of the free parameters $\theta$, has a low effective dimensionality (bergstra2012random). Foregoing gradient-based learning in favor of random search may then lead to faster adaptation to the changing regret landscape as the policy evolves throughout the course of training.
Zero-shot transfer performance during training for PAIRED and REPAIRED variants. The plots show mean and standard error across 10 runs. The dotted lines mark the mean performance of PAIRED after 3B training steps, as reported in (paired), while dashed lines indicate median returns.

Emergent Complexity. As the student agents improve, the teachers must generate more challenging levels to maintain regret. We measure the resultant emergent complexity by tracking the number of blocks in each level and the shortest path length to the goal (where unsolvable levels are assigned a length of 0). These results, summarized in Figure 4, show that PAIRED slowly adapts the complexity over training, while REPAIRED initially grows complexity quickly before being overtaken by PAIRED. The fast onset of complexity may be due to REPAIRED's fast replay mechanism, and the long-term slowdown relative to PAIRED may be explained by its less frequent gradient updates. Our results over an extended training period in Appendix C confirm that both PAIRED and REPAIRED slowly increase complexity over time, eventually matching that attained in just a fraction of the number of gradient steps by PLR and PLR⊥. This result shows that random search is surprisingly efficient at continually discovering levels of increasing complexity, given an appropriate curation mechanism such as PLR. Figure 5 shows that, similar to methods with a regret-maximizing teacher, PLR finds levels exhibiting complex structure.
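The shortest-path complexity metric can be computed by breadth-first search over the maze's free cells, assigning unsolvable levels a length of 0 as described above (a generic sketch over a set-of-blocked-cells grid representation):

```python
from collections import deque

def shortest_path_length(blocks, agent, goal, size):
    """BFS from agent to goal over free cells; returns 0 if unreachable."""
    if agent == goal:
        return 0
    frontier, seen = deque([(agent, 0)]), {agent}
    while frontier:
        (r, c), d = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < size and 0 <= nc < size
                    and (nr, nc) not in blocks and (nr, nc) not in seen):
                if (nr, nc) == goal:
                    return d + 1
                seen.add((nr, nc))
                frontier.append(((nr, nc), d + 1))
    return 0  # unsolvable level
```

Together with the block count, this gives a cheap proxy for the structural complexity of a generated maze.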
6.2 Pixel-Based Car Racing with Continuous Control
To test the versatility and scalability of our methods, we turn to an extended version of the CarRacing environment from OpenAI Gym (gym). This environment entails continuous control with dense rewards, a 3-dimensional action space, and partial, pixel-based observations, with the goal of driving a full lap around a track. To enable UED of any closed-loop track, we reparameterize CarRacing to generate tracks as Bézier curves (bezier_ref) with arbitrary control points. The teacher generates levels by choosing a sequence of up to 12 control points, which uniquely defines a Bézier track within specific, predefined curvature constraints. After 5M steps of training, we test the zero-shot transfer performance of policies trained by each method on 20 levels replicating official human-designed Formula One (F1) tracks (see Figure 19 in the Appendix for a visualization of the tracks). Note that these tracks are significantly OOD, as they cannot be defined with just 12 control points. In Figure 6 we show the progression of zero-shot transfer performance for the original CarRacing environment, as well as three F1 tracks of varying difficulty, while also including the final performance on the full F1 benchmark. For the final performance, we also evaluated the state-of-the-art CarRacing agent from (attentionagent) on our new F1 benchmark.
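Because each level is just a short list of control points, the teacher's parameterization is compact; the curve itself can be evaluated with De Casteljau's algorithm (a generic sketch omitting the environment's curvature constraints and track-closing logic):

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] via De Casteljau's algorithm."""
    pts = list(control_points)
    while len(pts) > 1:
        # Repeatedly interpolate between consecutive points until one remains.
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

def sample_track(control_points, n=100):
    """Discretize the curve into n waypoints for track construction."""
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]
```

A teacher proposing up to 12 control points thus fully determines a candidate track, which the simulator then renders and validates.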
Unlike in the sparse, discrete navigation setting, we find DR leads to moderately successful policies for zero-shot transfer in CarRacing. Dense rewards simplify the learning problem, and random Bézier tracks occasionally contain the challenges seen in F1 tracks, such as hairpin turns and observations showing parallel tracks due to high local curvature. Still, we see that policies trained by selectively sampling tracks to maximize regret significantly outperform those trained by uniformly sampling from randomly generated tracks, in terms of zero-shot transfer to the OOD F1 tracks. Remarkably, with a replay rate of 0.5, PLR⊥ sees statistically significant gains over PLR in zero-shot performance on the full F1 benchmark, despite directly training on only half the rollout data, using half as many gradient updates. Moreover, the robustness adjustment of PLR⊥ is crucial for attaining improved performance with respect to DR in this domain. Once again, we see that random search with curation via PLR produces a rich selection of levels and an effective curriculum.
We also observe that PAIRED struggles to train a robust protagonist in CarRacing. Specifically, PAIRED over-exploits the relative strengths of the antagonist over the protagonist, finding curricula that steer the protagonist towards policies that ultimately perform poorly even on simple tracks, leading to a gradual reduction in level complexity. We present training curves revealing this dynamic in Appendix C. As shown in Figure 6, REPAIRED improves upon this significantly, inducing a policy that outperforms both PAIRED and standard PLR in mean performance on the full F1 benchmark. Notably, PLR⊥ approaches the performance of the state-of-the-art AttentionAgent (attentionagent), despite not using a self-attention policy and training on less than 0.25% of the number of environment steps in comparison. These gains come purely from the induced curriculum. Figure 17 in Appendix C further reveals that PLR⊥ and REPAIRED induce CarRacing policies that tend to achieve higher minimum returns on average compared to other methods that do not have such a robustness guarantee, providing further evidence of the benefits of the minimax regret property.
7 Related Work
In inducing parallel curricula, DCD follows a rich lineage of curriculum learning methods (bengio_curriculum; schmidhuber_curriculum; curriculum_rl_survey2; curriculum_rl_survey1). Many previous curriculum learning algorithms resemble the curator in DCD, sharing underlying selective-sampling mechanisms similar to those of PLR⊥. Most similar is TSCL (tscl), which prioritizes levels based on return rather than value loss, and has been shown to overfit to training levels in some settings (plr). In our setting, replayed levels can be viewed as past strategies from a level-generating teacher. This links our replay-based methods to fictitious self-play (FSP, fictitious_sp) and, more closely, Prioritized FSP (vinyals2019grandmaster), which selectively samples opponents based on historic win ratios.
Recent approaches that make use of a generating adversary include Asymmetric Self-Play (sukhbaatar2018intrinsic; openai2021asymmetric), wherein one agent proposes tasks for another in the form of environment trajectories, and AMIGo (amigo), wherein the teacher is rewarded for proposing reachable goals. While our methods do not presuppose a goal-based setting, others have made progress here using generative modeling (goalgan; Racaniere2020Automated), latent skill learning (carml), and exploiting model disagreement (NEURIPS2020_566f0ea4). These methods are less generally applicable than PLR⊥, and unlike our DCD methods, they do not provide well-principled robustness guarantees.
Other recent algorithms can be understood as forms of UED and, like DCD, framed through the lens of decision theory. POET (poet; enhanced_poet), a coevolutionary approach (Popovici2012), uses a population of minimax (rather than minimax regret) adversaries to construct terrain for a BipedalWalker agent. In contrast to our methods, POET requires training a large population of both agents and environments, and consequently incurs a sizable compute overhead. APT-Gen (fang2021adaptive) also procedurally generates tasks, but requires access to target tasks, whereas our methods seek to improve zero-shot transfer.
The DCD framework also encompasses adaptive domain randomization methods (mehta2019activedomain; dr_evolutionary), which have seen success in assisting sim2real transfer for robotics (domain_randomization; james2017transferring; dexterity; rubics_cube). DR itself is subsumed by procedural content generation (risi_togelius_pcg), for which UED and DCD may be seen as providing a formal, decision-theoretic framework, enabling the development of provably optimal algorithms.
8 Conclusion
We developed a novel connection between PLR and minimax regret UED approaches like PAIRED. We demonstrated that PAIRED, a slow but generative method for level design, can be combined with PLR, a fast but non-generative method that instead relies on replay-based curation of the most promising existing levels. To theoretically ground this new setting, we introduced Dual Curriculum Design (DCD), in which a student policy is challenged by a team of two co-adapting, regret-maximizing teachers: one, a generator that creates new levels, and the other, a curator that selectively samples existing levels for replay. This formalism enabled us to prove robustness guarantees for PLR at NE, notably yielding the counterintuitive result that PLR can be made provably robust by training on less data, specifically, only the trajectories on levels sampled for replay. In addition, we developed Replay-Enhanced PAIRED (REPAIRED), the natural instantiation of DCD. Empirically, in two highly distinct environments, we found that PLR⊥ significantly improves zero-shot generalization over PLR, and REPAIRED over PAIRED.
Long-running UED processes in expansive UPOMDPs closely resemble continual learning in open-ended domains. The congruency of these settings suggests our contributions around DCD may extend to more general continual learning settings in which agents must learn to master a diverse sequence of tasks with predefined (or inferred) episode boundaries, if tasks are assumed to be designed by a regret-maximizing teacher. Thus, DCD-based methods like PLR⊥ may also yield more general policies for continual learning. We believe this to be a promising direction for future research.
We would like to thank Natasha Jaques, Patrick Labatut, and Heinrich Küttler for fruitful discussions that helped inform this work. Further, we are grateful to our anonymous reviewers for their valuable feedback. MJ is supported by the FAIR PhD program. This work was funded by Facebook.
References
Appendix A Theoretical Results
In this section we prove the theoretical results around the dual curriculum game and use these results to show approximation bounds for our methods, given that they have reached a Nash equilibrium (NE).
The first theorem is the main result that allows us to analyze dual curriculum games. The high-level result says that the NE of a dual curriculum game are approximate NE of the base game from the perspective of any of the individual players, or from the perspective of the joint strategy.
Theorem 1.
Let be the maximum difference between and , and let be a NE for . Then is an approximate NE for the base game with either teacher or for a teacher optimizing their joint objective. More precisely, it is a approximate NE when , a approximate NE when , and a approximate NE when .
At a high level, this holds because, for small values of the gap between the two utilities, the best-response strategies of the individual players can be viewed as approximate best-response strategies for the joint player, and vice versa. Since the Nash equilibrium consists of each player playing its own best response, each must also be playing an approximate best response for the joint player. We provide a formal proof below:
Proof.
Let be the maximum difference between and , and let be a Nash Equilibrium for . Then consider as a strategy in the base game for the joint player . Let be the best response for the joint player to . Since is a best response by assumption, it is sufficient to show that is an approximate best response. We then have
(6)  
(7)  
(8)  
(9)  
(10) 
Thus, we have shown that represents an approximate Nash equilibrium for the joint player. For the first teacher, the condition holds trivially: the teacher is playing a best response to the student. We must now show that the student is playing an approximate best response to the teacher.
Let be the best response to the first teacher (with utility ), and let be the best-response policy to the joint teacher. We start with an observation that holds by definition, then construct an upper bound on the performance of on and a lower bound on the performance of on . We obtain the desired result by combining these two bounds.
First we use to upper bound :
(11)  
(12)  
(13) 
Second we can use to lower bound :
(14)  
(15)  
(16) 
Putting this all together, we have
Which, after rearranging terms, gives
as desired. Repeating the symmetric argument shows the desired property for the second teacher. ∎
Following this main theorem, we can apply it to two of our methods. First, we apply it to naive PLR, which trains on a mixture of domain randomization (a teacher with utility ) and the PLR bandit (a teacher with utility ). This result shows that as we reduce the number of random episodes, the approximation to a minimax regret strategy improves. The intuition is a direct application of Theorem 1: first, the equilibrium is an approximate Nash equilibrium for the minimax regret player; second, the minimax regret player has access to a strategy that ensures small regret, so the regret ensured by the equilibrium must itself be approximately small.
Corollary 1.
Let be the dual curriculum game in which the first teacher maximizes regret, so , and the second teacher plays randomly, so . Let be bounded in for all . Further, suppose that is a Nash equilibrium of . Let be the optimal worst-case regret. Then is close to having optimal worst-case regret, or formally . Moreover, there exist environments, for all values of , within a constant factor of achieving this bound.
Proof.
Since is bounded in for all , we know that and are within of each other. Thus, by Theorem 1, is a Nash equilibrium of the base game when . Thus is an approximate best response to . However, since is a best response, it chooses a regret-maximizing parameter distribution. Thus the gap does not just measure the suboptimality of with respect to , but measures the worst-case regret of across all , as desired.
The intuition for the existence of examples in which this approximation of regret decays linearly in is that a random level and the maximal-regret level can be very different, so the two measures may diverge drastically. For an example environment where deviates strongly from the minimax regret strategy, consider the one-step UMDP described in Table 1.
is better in expectation under the uniform distribution. For large , it is especially clear that and have better expected value under the uniform distribution, though we show that even for , the optimal joint policy can mix between and , incurring high regret. Note that in Table 1, no policy has less than regret, since every policy must incur regret on one of them at least half the time. The minimax regret policy mixes uniformly between and to achieve regret of exactly . We can ignore in the regret calculations by assuming that , since every policy achieves less than regret on these levels.
Our claim is that, in equilibrium of in this environment, the student policy can incur regret more than the minimax regret policy. An example of such an equilibrium point is when the student policy uniformly randomizes between and (which we call ), the minimax teacher uniformly randomizes between and (which we call ), and the uniform teacher randomizes exactly (which we call ). To verify this, we must show that is in fact a NE of , and then show that incurs regret.
To show that is a NE of , first note that is trivially a best response for the uniform utility function. Also note that maximizes the regret of , since and are the only two parameters on which incurs regret, and both incur the same regret; thus, any mixture over them is optimal for the regret-based teacher. Finally, we need to show that is optimal for the student. To do this, we calculate the expected value of each policy and note that the expected values for and are higher than those for and . Thus any optimal policy places no weight on and , while any distribution over and is equivalently optimal. By symmetry, we show only the calculations for and :
(17)  
(18) 
Thus and achieve higher expected value under the joint distribution. Thus, we know that is a best response, and is in fact a NE of . Finally, we simply need to show that incurs regret. WLOG, we can evaluate its regret on . On , achieves reward while achieves . Thus incurs regret of , as desired. As discussed before, since the minimax regret policy achieves , this is more regret than optimal. ∎
Lastly, we can also apply Theorem 1 to prove that REPAIRED achieves a minimax regret strategy in equilibrium. The intuition behind this corollary is that, since the utility functions of both teachers are the same, the approximate NE ensured by Theorem 1 is actually a true NE; therefore, the minimax theorem applies.
Corollary 2.
Let be the dual curriculum game in which both teachers maximize regret, so . Further, suppose that is a Nash equilibrium of . Then, .
Proof.
Since both teachers share the same utility function, the joint objective coincides with it. Thus, by Theorem 1, is a Nash equilibrium of the base game with that teacher objective; therefore, by the minimax theorem, , as desired. ∎
Appendix B Algorithms
Although the PLR update rule for the level buffer of size in the case of unbounded training levels is described in [plr], we provide the pseudocode for this update rule in Algorithm 2 for completeness. Given staleness coefficient , temperature , a prioritization function (e.g. rank), level buffer scores , level buffer timestamps , and the current episode count (i.e. current timestamp), the update takes the form
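To make the quantities above concrete, the replay distribution these scores and timestamps induce can be sketched as follows. This is a minimal sketch assuming rank-based prioritization mixed with a staleness distribution; the function and variable names are ours, not taken from any released implementation:

```python
def plr_replay_distribution(scores, timestamps, episode_count,
                            rho=0.3, beta=0.1):
    """Mix a rank-prioritized score distribution with a staleness
    distribution, weighted by the staleness coefficient rho."""
    n = len(scores)

    # Rank prioritization: rank 1 is the highest-scoring level; the
    # temperature beta flattens (large beta) or sharpens (small beta)
    # the resulting distribution.
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    weights = [(1.0 / r) ** (1.0 / beta) for r in ranks]
    total_w = sum(weights)
    p_score = [w / total_w for w in weights]

    # Staleness: favor levels whose scores were updated least recently.
    staleness = [episode_count - t for t in timestamps]
    total_s = sum(staleness)
    if total_s > 0:
        p_stale = [s / total_s for s in staleness]
    else:
        p_stale = [1.0 / n] * n

    return [(1.0 - rho) * ps + rho * pc
            for ps, pc in zip(p_score, p_stale)]
```

With rho = 0 this reduces to pure score prioritization; with rho = 1 it samples purely by staleness.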
The pseudocode for Replay-Enhanced PAIRED (REPAIRED), the method described in Section 5, is presented in Algorithm 3.
Appendix C Additional Experimental Results
This section provides additional experimental results in the MiniGrid and CarRacing environments. Note that we determine the statistical significance of our results using a Welch t-test [welch1947generalization].

C.1 Extended Results for MiniGrid
Unlike the original maze experiments used to evaluate PAIRED in [paired], we conduct our main maze experiments with a block budget of 25 blocks (reported in Section 6.1), rather than 50 blocks. Following the environment parameterization in [paired], for a block budget of , the teacher attempts to place blocks that act as obstacles when designing each maze level. However, the teacher can place fewer than blocks, as placing a block in a location already occupied by a block results in a no-op. We found that PAIRED underperforms DR when both methods are given a budget of 50 blocks, a setting in which randomly sampled mazes exhibit enough structural complexity to allow DR to learn highly robust policies. Note that [paired] used a DR baseline with a 25-block budget. With a 50-block budget, DR and all replay-based methods fully solve almost all test mazes after around 500M steps of training, making UED of mazes with a 50-block budget too simple a setting to provide an informative comparison among the methods studied.
C.1.1 Mazes with a 25-block budget
We report the results of evaluating policies produced by each method after 250M training steps on each of the zero-shot transfer environments in Figure 8 and Table 2. Examples of each test environment are presented in Figure 7. All replay-based UED methods lead to policies with statistically significantly () higher test performance than PAIRED, and PLR⊥, after 500M training steps, similarly improves over PLR when trained for an equivalent number of gradient updates (as the replay rate is set to ). Note that for PAIRED and REPAIRED, we evaluate the protagonist policy.
To provide a further sense of the training dynamics, we present the per-agent training returns for each method in Figure 9. Notably, PAIRED results in antagonists that attain higher returns than the protagonist, as expected. This dynamic takes on a mild oscillation, visible in the training return curve of the generator (adversary): as the protagonist adapts to the adversarial levels, the generator's return drops, until the generator discovers new configurations that better exploit the relative differences between the two student policies. Moreover, the adversary under REPAIRED seems to propose more difficult levels for both the protagonist and antagonist, while the resulting protagonist policy exhibits improved test performance, as seen in Figure 4.
Environment  DR  Minimax  PAIRED  REPAIRED  PLR  (500M)  

Labyrinth  
Labyrinth2  
LargeCorridor  
Maze  
Maze2  
PerfectMaze  
SixteenRooms  
SixteenRooms2  
Mean 
Additional complexity metrics tracked during training are shown in Figure 10. Alongside the number of blocks and the shortest path length of levels seen during training, we also track solved path length and action complexity. Solved path length is the shortest path length from start position to goal in the levels successfully solved by the primary student agent (e.g. the protagonist in PAIRED). Action complexity is the Lempel-Ziv-Welch (LZW) complexity—a commonly used measure of string compressibility—of the action sequence taken during the primary student agent's trajectories. As expected, DR results in constant complexity for the number-of-blocks and path-length metrics. REPAIRED generates mazes with significantly greater complexity in terms of block count. The lower path lengths seen by REPAIRED suggest that it trains agents that more readily generalize to different path lengths, thereby pressuring the adversary to raise complexity in terms of block count. Further, given the high replay rates used, the REPAIRED adversary sees far fewer gradient updates with which to adjust its policy. Since its shortest path lengths exceed those of PAIRED after adjusting proportionally for the replay rate, the shortest path lengths generated by REPAIRED may, over a longer period, meet or exceed those of PAIRED. In all cases, action complexity decreases as the agent becomes more decisive, and we see that both PAIRED and REPAIRED lead to more decisive policies—as indicated by the simultaneously lower action complexity and greater level complexity in terms of higher block count (relative to DR) and, in the case of PAIRED, higher path-length metrics. Lastly, it is interesting to note that while the random generator used by PLR produces levels of average complexity, the complexity of curated levels, as revealed in Figure 4, is significantly higher and, in the case of path length, steadily increasing.
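For concreteness, the LZW complexity referred to above can be computed as the number of phrases produced by a greedy LZW-style dictionary parse of the action sequence. This is a minimal sketch of that measure (our own implementation, not necessarily the exact one used for the reported metrics):

```python
def lzw_complexity(actions):
    """Count dictionary entries built by a greedy LZW-style parse.

    Repetitive (compressible) action sequences produce fewer entries
    than varied ones, so the count serves as a complexity measure.
    """
    seq = [str(a) for a in actions]
    dictionary = {(s,) for s in seq}  # initialize with single symbols
    phrase = ()
    for symbol in seq:
        candidate = phrase + (symbol,)
        if candidate in dictionary:
            # Keep extending the current phrase while it is known.
            phrase = candidate
        else:
            # New phrase: add it and restart from the current symbol.
            dictionary.add(candidate)
            phrase = (symbol,)
    return len(dictionary)
```

Note that the count grows with sequence length, so comparisons are most meaningful between action sequences of similar length.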
C.1.2 Mazes with a 50-block budget
Similarly, Figures 12, 13, and 14 report the training dynamics and test performance of agents trained with each method under a 50-block budget for 500M steps. Figure 11 shows that DR and all replay-based methods reach near-perfect solve rates on most test mazes after 500M training steps, with the exception of the Maze and PerfectMazes environments, where test performance does not differ markedly across methods, making the 50-block setting uninformative for assessing performance differences among these methods. The example mazes generated by each method, presented in Figure 15, show that the larger block budget allows DR to sample mazes with greater structural complexity, leading to robust policies and diminishing the benefits of the UED methods studied. Therefore, in this work, we focus the main results for the maze domain on the more challenging setting with a 25-block budget. Note that the impact of the block budget on test performance further highlights the importance of properly adapting the training distribution for producing policies with high generality—a problem that our replay-based UED methods effectively address, as demonstrated by the results for the 25-block setting.
C.2 Extended Results for CarRacing
The training return plots for each agent, shown in Figure 16, reveal that PAIRED’s generator (adversary) overexploits the relative advantages of the antagonist over the protagonist, leading to a highly suboptimal protagonist policy. In fact, as shown in the rightmost plot of Figure 16, the resulting protagonist policies suffer such performance degradation from the adversarial curriculum that they can no longer even successfully drive on the original, simpler CarRacing tracks.
Additionally, we present per-track zero-shot transfer returns for the entire CarRacing-F1 benchmark after 5M training steps (equivalent to 40M environment interaction steps due to the use of action repeat) in Table 3. Results report the mean and standard deviation over 100 attempts per track across 10 seeds. While DR acts as a strong baseline in terms of zero-shot generalization in this setting, PLR⊥ either attains the highest mean return or matches the method achieving the highest return within standard error on all tracks. The mean performance of PLR⊥ across the full benchmark is statistically significantly higher () than that of all other methods. Notably, PAIRED sees poor results, likely because the generator is able to overexploit the differences between antagonist and protagonist to detrimental effect in this domain. We see that REPAIRED mitigates this effect to a degree, resulting in more competitive policies. Note that due to the high compute overhead of training the AttentionAgent (8.2 billion steps of training over a population of 256 agents) [attentionagent], we resorted to evaluating its mean F1 performance using the pretrained model weights provided by the authors with their public code release. As a result, we have only a single training run for AttentionAgent. This means we cannot reliably compute standard errors for this baseline, but we believe that showing the performance of a single training seed of AttentionAgent on the F1 benchmark alongside our methods, as done in Figure 6, nonetheless provides a useful comparison for further contextualizing the efficacy of our methods. This comparison highlights how, purely by modifying the training curriculum, our methods produce policies with test returns approaching those of AttentionAgent—which, in contrast, uses a powerful attention-based policy and a larger number of training steps.

As a further analysis of robustness, we inspect the minimum returns over 10 attempts per track, averaged over 10 runs per method. We present these results (mean and standard error) in Figure 17. PLR⊥ achieves consistently higher average minimum returns on many of the tracks compared to the other methods, including the challenging Russia and USA tracks. This indicates that PLR⊥ more reliably approaches a minimax regret policy than PAIRED and REPAIRED, which both share such a guarantee at NE.
Track  DR  PAIRED  REPAIRED  PLR  PLR⊥  AttentionAgent

Australia  826  
Austria  511  
Bahrain  372  
Belgium  668  
Brazil  145  
China  344  
France  153  
Germany  214  
Hungary  769  
Italy  798  
Malaysia  300  
Mexico  580  
Monaco  835  
Netherlands  131  
Portugal  606  
Russia  732  
Singapore  276  
Spain  759  
UK  729  
USA  192  
Mean  477 
Appendix D Experiment Details and Hyperparameters
Parameter  MiniGrid  CarRacing 

PPO  
Discount factor γ  0.995  0.99
GAE λ  0.95  0.9
PPO rollout length  256  125 
PPO epochs  5  8
PPO minibatches per epoch  1  4 
PPO clip range  0.2  0.2 
PPO number of workers  32  16 
Adam learning rate  1e-4  3e-4
Adam ε  1e-5  1e-5
PPO max gradient norm  0.5  0.5 
PPO value clipping  yes  no 
return normalization  no  yes 
value loss coefficient  0.5  0.5 
student entropy coefficient  0.0  0.0 
generator entropy coefficient  0.0  0.01 
PLR 

Replay rate,  0.5  0.5 
Buffer size,  4000  500 
Scoring function  MaxMC  positive value loss 
Prioritization  rank  rank 
Temperature,  0.3  0.1 
Staleness coefficient,  0.3  0.3 
PLR⊥

Scoring function  MaxMC  positive value loss 
REPAIRED 

Scoring function  MaxMC  MaxMC 

This section details the environments, agent architectures, and training procedures used in the experiments discussed in Section 6. We use PPO to train both student and generator policies in all experiments. Section 6 reports results for each method using the best hyperparameter settings, which we summarize in Table 4. Note that unless otherwise specified, PPO hyperparameters are shared between student and teacher, and PLR hyperparameters are shared between PLR⊥ and REPAIRED. The procedures for determining the hyperparameter choices for each environment are detailed below, in Sections D.1 and D.2.
D.1 Partially-Observable Navigation (MiniGrid)
Environment details Our mazes are based on MiniGrid [gym_minigrid]. Each maze consists of a grid, where each cell can contain a wall, the goal, the agent, or navigable space. The student agent receives a reward of upon reaching the goal, where is the episode length and is the maximum episode length (set to 250). Otherwise, the agent receives a reward of 0 if it fails to reach the goal. The observation space consists of the agent's orientation (facing north, south, east, or west) and the grid immediately in front of and including the agent, which takes the form of a 3-channel integer encoding. The action space consists of 7 total actions, though the mazes only make use of the first three: turn left, turn right, and forward. We do not mask out irrelevant actions.
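The sparse reward described above can be sketched as follows. The exact constants are our assumption, based on the standard MiniGrid success reward 1 − 0.9 · (t / T_max), and are not taken verbatim from the text:

```python
def maze_reward(episode_length, max_episode_length=250, reached_goal=True):
    """Sparse maze reward sketch: a time-discounted success bonus, 0 on
    failure. Assumes the standard MiniGrid formulation, where faster
    solutions earn rewards closer to 1."""
    if not reached_goal:
        return 0.0
    return 1.0 - 0.9 * (episode_length / max_episode_length)
```

Under this formulation, solving a maze quickly is strictly preferred, while timing out yields no reward at all.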
Level generation Each maze is fully surrounded by walls, resulting in cells in which the generator can place walls, the goal, and the agent. Starting from an initially empty maze (except for the bordering walls), the generator is given a budget of steps in which it can choose a grid cell in which to place a wall. Placing a wall in a cell already containing a wall results in a no-op. After wall placement, the generator then chooses cells for the goal and the agent's starting position. If either of these cells collides with an existing wall, a random empty cell is chosen instead. At each time step, the generator teacher receives the full grid observation of the developing maze, a one-hot encoding of the current time step, and a 50-dimensional random noise vector, with each component sampled from a uniform distribution.

Generator architecture We base the generator architecture on the original model used for the PAIRED adversary in [paired]. This model encodes the full grid observation using a convolution layer (128 filters) followed by a ReLU activation over the flattened convolution outputs. The current time step is embedded into a 10-dimensional space, which is concatenated to the grid embedding along with the random noise vector. This combined representation is then passed through an LSTM with hidden dimension 256, followed by two fully-connected layers, each with hidden dimension 32 and ReLU activations, to produce the action logits over the 169 possible cell choices. We further ablated the LSTM and found that its absence preserves the performance of the minimax generator in both the 25-block and 50-block settings, as well as that of the PAIRED generator in the 50-block setting, as expected given that the full grid and time step form a Markov state. However, the PAIRED generator struggles to learn without an LSTM in the 25-block setting. We believe PAIRED's improved performance with an LSTM-based generator in the 25-block setting is due to the additional network capacity provided by the LSTM. Therefore, in favor of less compute time, our experiments use an LSTM-based generator only for PAIRED in the 25-block setting.
Student architecture The student policy architecture resembles the LSTM-based generator architecture, except that the student model uses a convolution layer with 16 filters to embed its partial observation; does not use a random noise vector; and, instead of embedding the time step, embeds the student's current direction into a 5-dimensional latent space.
Choice of hyperparameters We base our choice of hyperparameters for the student agents and generator (i.e. the adversary) on [paired]. We also performed a coarse grid search over the student entropy coefficient, the generator entropy coefficient, and the number of PPO epochs for both students and generator, as well as over the choice of including an LSTM in the student and generator policies. We selected the best-performing settings based on average return on the validation levels of SixteenRooms, Labyrinth, and Maze over 3 seeds. Our final choices are summarized in Table 4. The main deviations from the settings in [paired] are the removal of the generator's LSTM (except for PAIRED with 25 blocks) and the use of fewer PPO epochs (5 instead of 20). For PLR, we searched over the replay rate, level buffer size, temperature, and choice of scoring function. The final PLR hyperparameter selection was then also used for PLR⊥ and REPAIRED, except for the scoring function, over which we conducted a separate search for each method.
Zero-shot levels We make use of the challenging test mazes from [paired]: SixteenRooms, requiring navigation through up to 16 rooms to find a goal; Labyrinth, requiring traversal of a spiral labyrinth; and Maze, requiring the agent to find a goal in a binary-tree maze, which demands successful backtracking from dead ends. To more comprehensively test the agent's zero-shot transfer performance on OOD classes of mazes, we introduce Labyrinth2, a rotated version of Labyrinth; Maze2, another variant of a binary-tree maze; PerfectMazes, a procedurally-generated maze environment; and LargeCorridor, another procedurally-generated maze environment in which the goal position is randomly chosen to lie at the end of one of the corridors, thereby testing the agent's ability to backtrack. Figure 3 provides screenshots of these mazes.
D.2 CarRacing
Environment details Each track consists of a closed loop around which the student agent must drive a full lap. To increase the expressiveness of the original CarRacing environment, we reparameterize the tracks using Bézier curves. In our experiments, each track consists of a Bézier curve [bezier_ref] based on 12 randomly sampled control points within a fixed radius of the center of the playfield. The track consists of a sequence of polygons. When driving over each previously unvisited polygon, the agent receives a reward equal to . The student additionally receives a reward of −0.1 at each time step. Aligning with the methodology of [carracing_ppo], we do not penalize the agent for driving out of the playfield boundaries, terminate episodes if the agent drives too far off track, and repeat every selected action for 8 steps. The student observation space consists of a pix