Replay-Guided Adversarial Environment Design

by Minqi Jiang, et al.
berkeley college

Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent's capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly-generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (training on less data), we can improve the convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR^⊥, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that PLR^⊥ improves the performance of PAIRED, from which it inherited its theoretical framework.




1 Introduction

While deep reinforcement learning (RL) approaches have led to many successful applications in challenging domains like Atari (mnih2015human), Go (silver2016mastering), Chess (silver2018general), Dota (berner2019dota) and StarCraft (vinyals2019grandmaster) in recent years, deep RL agents still prove to be brittle, often failing to transfer to environments only slightly different from those encountered during training (zhang2018dissection; coinrun). To ensure learning of robust and well-generalizing policies, agents must train on sufficiently diverse and informative variations of environments (e.g. see Section 3.1 of (procgen_benchmark)). However, it is not always feasible to specify an appropriate training distribution or a generator thereof. Agents may therefore benefit greatly from methods that automatically adapt the distribution over environment variations throughout training (paired; plr). Throughout this paper we will call a particular environment instance or configuration (e.g. an arrangement of blocks, race tracks, or generally any of the environment’s constituent entities) a level.

Two recent works (paired; plr) have sought to empirically demonstrate this need for a more targeted agent-adaptive mechanism for selecting levels on which to train RL agents, so as to ensure efficient learning and generalization to unseen levels, and to provide methods implementing such mechanisms. The first method, Protagonist Antagonist Induced Regret Environment Design (PAIRED) (paired), introduces a self-supervised RL paradigm called Unsupervised Environment Design (UED). Here, an environment generator (a teacher) is co-evolved with a student policy that trains on levels actively proposed by the teacher, leading to a form of adaptive curriculum learning. The aim of this coevolution is for the teacher to gradually learn to generate environments that exemplify properties of those that might be encountered at deployment time, and for the student to simultaneously learn a good policy that enables zero-shot transfer to such environments. PAIRED’s specific adversarial approach to environment design ensures a useful robustness characterization of the final student policy in the form of a minimax regret guarantee (savage1951theory), assuming that its underlying teacher-student multi-agent system arrives at a Nash equilibrium (NE, nash1950equilibrium). In contrast, the second method, Prioritized Level Replay (PLR) (plr), embodies an alternative form of dynamic curriculum learning that does not assume control of level generation, but instead, the ability to selectively replay existing levels. PLR tracks levels previously proposed by a black-box environment generator, and for each, estimates the agent’s learning potential in that level, in terms of how useful it would be to gather new experience from that level again in the future. The PLR algorithm exploits these scores to adapt a schedule for revisiting or replaying levels to maximize learning potential. PLR has been shown to produce scalable and robust results, improving both sample complexity of agent training and the generalization of the learned policy in diverse environments. However, unlike PAIRED, PLR is motivated with heuristic arguments and lacks a useful theoretical characterization of its learning behavior.

In this paper, we argue that PLR is, in and of itself, an effective form of UED: Through curating even randomly generated levels, PLR can generate novel and complex levels for learning robust policies. This insight leads to a natural class of UED methods which we call Dual Curriculum Design (DCD). In DCD, a student policy is challenged by a team of two co-evolving teachers. One teacher actively generates new, challenging levels, while the other passively curates existing levels for replaying, by prioritizing those estimated to be most suitably challenging for the student. We show that PAIRED and PLR are distinct members of the DCD class of algorithms and prove in Section 3 that all DCD algorithms enjoy similar minimax regret guarantees to that of PAIRED.

We make use of this result to provide the first theoretical characterization of PLR, which immediately suggests a simple yet highly counterintuitive adjustment to PLR: By only training on trajectories in replay levels, PLR becomes provably robust at NE. We call this resulting variant PLR^⊥ (Section 4). From this perspective, we see that, in a diametrically opposite manner to PAIRED, PLR effectively performs level design through prioritized selection rather than active generation. A second corollary to the provable robustness of DCD algorithms shows that PLR can be extended to make use of the teacher from PAIRED as a level generator while preserving the robustness guarantee of PAIRED, resulting in a method we call Replay-Enhanced PAIRED (REPAIRED) (Section 5). We hypothesize that in this arrangement, PLR plays a complementary role to PAIRED in robustifying student policies.

Our experiments in Section 6 investigate the learning dynamics of PLR^⊥, REPAIRED, and their replay-free counterparts on a challenging maze domain and a novel continuous control UED setting based on the popular CarRacing environment (gym). In both of these highly distinct settings, our methods provide significant improvements over PLR and PAIRED, producing agents that can perform out-of-distribution (OOD) generalization to a variety of human-designed mazes and Formula 1 tracks.

(a) DR (b) PAIRED (c) REPAIRED (d) PLR^⊥ (e) Human
Figure 1: Randomly drawn samples of CarRacing tracks produced by different methods. (a) Domain Randomization (DR) produces tracks of average complexity, with few sharp turns. (b) PAIRED often overexploits the difference in the students, leading to simple tracks that incidentally favor the antagonist. (c) Tracks generated by REPAIRED combine both challenging and easy sections. (d) PLR^⊥ selects the most challenging randomly generated tracks, resulting in tracks that more closely resemble human-designed tracks, such as (e) the Nürburgring Grand Prix.

In summary, we present the following contributions: (i) We establish a common framework, Dual Curriculum Design, that encompasses PLR and PAIRED. This allows us to develop new theory, which provides the first robustness guarantees for PLR at NE as well as for REPAIRED, which augments PAIRED with a PLR-based replay mechanism. (ii) Crucially, our theory suggests a highly counterintuitive improvement to PLR: the convergence to NE should be assisted by training on less data when using PLR—namely by only taking gradient updates from data that originates from the PLR buffer, using the samples from the environment distribution only for computing the prioritization of levels in the buffer. (iii) Our experiments across a maze domain and a novel car-racing domain show that this indeed improves the performance of PLR. Both this new variant of PLR and REPAIRED outperform alternative UED methods in these two challenging domains.

2 Background

2.1 Unsupervised Environment Design

Unsupervised Environment Design (UED), as introduced by (paired), is the problem of automatically designing a distribution of environments that adapts to the learning agent. UED is defined in terms of an Underspecified POMDP (UPOMDP), given by $\mathcal{M} = \langle A, O, \Theta, S^{\mathcal{M}}, \mathcal{T}^{\mathcal{M}}, \mathcal{I}^{\mathcal{M}}, \mathcal{R}^{\mathcal{M}}, \gamma \rangle$, where $A$ is a set of actions, $O$ is a set of observations, $S^{\mathcal{M}}$ is a set of states, $\mathcal{T}^{\mathcal{M}}$ is a transition function, $\mathcal{I}^{\mathcal{M}}$ is an observation (or inspection) function, $\mathcal{R}^{\mathcal{M}}$ is a reward function, and $\gamma$ is a discount factor. This definition is identical to a POMDP with the addition of $\Theta$ to represent the free parameters of the environment. These parameters can be distinct at every time step and incorporated into the transition function $\mathcal{T}^{\mathcal{M}}$. For example, $\Theta$ could represent the possible positions of obstacles in a maze. We will refer to the environment resulting from a fixed $\theta \in \Theta$ as $\mathcal{M}_{\theta}$, or with a slight abuse of notation, simply $\theta$ when clear from context. We define the value of $\pi$ in $\mathcal{M}_{\theta}$ to be $V^{\theta}(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} r_t \gamma^t\right]$, where $r_t$ are the rewards attained by $\pi$ in $\mathcal{M}_{\theta}$. Aligning with terminology from (plr), we refer to a fully-specified environment as a level.

2.2 Protagonist Antagonist Induced Regret Environment Design

Protagonist Antagonist Induced Regret Environment Design (PAIRED, paired) presents a UED approach consisting of simultaneously training agents in a three player game: the protagonist $\pi^P$ and the antagonist $\pi^A$ are trained in environments generated by the teacher $\tilde{\theta}$. The objective of this game is defined by $U(\pi^P, \pi^A, \tilde{\theta}) = \mathbb{E}_{\theta \sim \tilde{\theta}}\left[\mathrm{Regret}^{\theta}(\pi^P, \pi^A)\right]$, where regret is defined by $\mathrm{Regret}^{\theta}(\pi^P, \pi^A) = V^{\theta}(\pi^A) - V^{\theta}(\pi^P)$. The protagonist and antagonist are both trained to maximize their discounted environment returns while the teacher is trained to maximize $U$. Note that by maximizing regret, the teacher is disincentivized from generating unsolvable levels, which will have a maximum regret of $0$. As shorthand, we will sometimes refer to the protagonist and antagonist jointly as the student agents. The counterclockwise loop beginning at the student agents in Figure 2 summarizes this approach, with the students being both the protagonist and antagonist.

As both student agents grow more adept at solving different levels, the teacher continues to adapt its level designs to exploit the weaknesses of the protagonist in relation to the antagonist. As this dynamic unfolds, PAIRED produces an emergent curriculum of progressively more complex levels along the boundary of the protagonist’s capabilities. PAIRED is a creative method in the sense that the teacher may potentially generate an endless sequence of novel levels. However, as the teacher only adapts through gradient updates, it is inherently slow to adapt to changes in the student policies.

2.3 Prioritized Level Replay

Prioritized Level Replay (PLR, plr) is an active-learning strategy shown to improve a policy’s sample efficiency and generalization to unseen levels when training and evaluating on levels from a common UPOMDP, typically implemented as a seeded simulator. PLR maintains a level buffer $\Lambda$ of the top-$K$ visited levels with highest learning potential as estimated by the time-averaged L1 value loss of the learning agent over the last episode on each level. At the start of each training episode, with some predefined replay probability $p$, PLR uses a bandit to sample the level from $\Lambda$ to maximize the estimated learning potential; otherwise, with probability $1-p$, PLR samples a new level from the simulator. In contrast to the generative but slow-adapting PAIRED, PLR does not create new levels, but instead, acts as a fast-adapting curation mechanism for selecting the next training level among previously encountered levels. Also unlike PAIRED, PLR does not provide a robustness guarantee. By extending the theoretical foundation of PAIRED to PLR, we will show how PLR can be modified to provide a robustness guarantee at NE, as well as how PAIRED can exploit PLR’s complementary curation to quickly switch among generated levels to maximize the student’s regret.
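The replay decision described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the PLR implementation: the names `LevelBuffer`, `next_level`, `replay_prob`, and `new_level_fn` are hypothetical, and score-proportional sampling stands in for the bandit-style prioritization, whose exact form is not given in this section.

```python
import random


class LevelBuffer:
    """Hypothetical top-K buffer mapping each level to its latest score."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}  # level id -> estimated learning potential

    def update(self, level, score):
        # Insert or refresh the level, then evict the lowest-scoring level
        # if the buffer has grown beyond its capacity K.
        self.scores[level] = score
        if len(self.scores) > self.capacity:
            worst = min(self.scores, key=self.scores.get)
            del self.scores[worst]

    def sample(self, rng):
        # Assumption: sample proportionally to score, as a simple stand-in
        # for the prioritized (bandit-style) sampling described in the text.
        levels = list(self.scores)
        weights = [max(self.scores[lvl], 1e-8) for lvl in levels]
        return rng.choices(levels, weights=weights, k=1)[0]


def next_level(buffer, replay_prob, new_level_fn, rng):
    """Return (level, is_replay).

    With probability replay_prob a buffered level is replayed; otherwise
    (or when the buffer is still empty) a new level is requested.
    """
    if buffer.scores and rng.random() < replay_prob:
        return buffer.sample(rng), True
    return new_level_fn(rng), False
```

A caller would invoke `next_level` at the start of each episode and then call `buffer.update(level, score)` once the episode's score is available.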

3 The Robustness of Dual Curriculum Design

Figure 2: Overview of Dual Curriculum Design (DCD). The student learns in the presence of two co-adapting teachers that aim to maximize the student’s regret: The generator teacher designs new levels to challenge the agent, and the curator teacher prioritizes a set of levels already created, selectively sampling them for replay.

The previous approaches of PAIRED and PLR reveal a natural duality between approaches that gradually learn to generate levels, like PAIRED, and methods that cannot generate levels but instead quickly curate existing ones, like PLR. This duality suggests combining slow level generators with fast level curators. We call this novel class of UED algorithms Dual Curriculum Design (DCD). For instance, PLR can be seen as a curator with a prioritized sampling mechanism paired with a random generator, while PAIRED can be seen as a regret-maximizing generator without a curator. DCD can further consider Domain Randomization (DR) as a degenerate case of a random level generator without a curator.

To theoretically analyze this space of methods, we model DCD as a three player game among a student agent and two teachers called the dual curriculum game. However, to formalize this game, we must first formalize the single-teacher setting: Suppose the UPOMDP is clear from context. Then, given a utility function for a single teacher, $U_t(\pi, \theta)$, we can naturally define the base game between the student $s$ and teacher $t$ as $G = \langle S = S_s \times S_t, U = U_s \times U_t \rangle$, where $S_s = \Pi$ is the strategy set of the student, $S_t = \Theta$ is the strategy set of the teacher, and $U_s(\pi, \theta) = V^{\theta}(\pi)$ is the utility function of the student. In Sections 4 and 5, we will study settings corresponding to different choices of utility functions for the teacher agents, namely the maximum-regret objective $U_t^R(\pi, \theta)$ and the uniform objective $U_t^U(\pi, \theta)$. These two objectives are defined as follows (for any constant $C$):

$$U_t^R(\pi, \theta) = \max_{\pi^* \in \Pi} \mathrm{Regret}^{\theta}(\pi, \pi^*) = \max_{\pi^* \in \Pi} \left\{ V^{\theta}(\pi^*) - V^{\theta}(\pi) \right\}$$

$$U_t^U(\pi, \theta) = C$$

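In the dual curriculum game $\bar{G}$, the first teacher plays the game with probability $p$, and the second, with probability $(1-p)$. More formally, $\bar{G} = \langle \bar{S} = S_s \times S_t \times S_t, \bar{U} = \bar{U}_s \times \bar{U}_t^1 \times \bar{U}_t^2 \rangle$, where the utility functions for the student and two teachers respectively, $\bar{U}_s, \bar{U}_t^1, \bar{U}_t^2$, are defined as follows:

$$\bar{U}_s(\pi, \theta^1, \theta^2) = p\, U_s(\pi, \theta^1) + (1-p)\, U_s(\pi, \theta^2)$$

$$\bar{U}_t^1(\pi, \theta^1, \theta^2) = p\, U_t^1(\pi, \theta^1)$$

$$\bar{U}_t^2(\pi, \theta^1, \theta^2) = (1-p)\, U_t^2(\pi, \theta^2)$$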

Our main theorem is that NE in the dual curriculum game are approximate NE of both the base game for either of the original teachers and the base game with a teacher maximizing the joint reward $p\, U_t^1 + (1-p)\, U_t^2$, where the quality of the approximations depends on the mixing probability $p$.

Theorem 1.

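Let $B$ be the maximum difference between $U_t^1$ and $U_t^2$, and let $(\pi, \theta^1, \theta^2)$ be a NE for $\bar{G}$. Then $(\pi, p\theta^1 + (1-p)\theta^2)$ is an approximate NE for the base game with either teacher or for a teacher optimizing their joint objective. More precisely, it is a $2Bp(1-p)$-approximate NE when $U_t = p\, U_t^1 + (1-p)\, U_t^2$, a $2B(1-p)$-approximate NE when $U_t = U_t^1$, and a $2Bp$-approximate NE when $U_t = U_t^2$.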

The intuition behind this theorem is that, since the two teachers do not affect each other’s behavior, their best response to a fixed $\pi$ is to choose a strategy $\theta$ that maximizes $U_t^1$ and $U_t^2$ respectively. Moreover, the two teachers’ strategies can be viewed as a single combined strategy for the base game with the joint objective, or with each teacher’s own objective. In fact, the teachers provide an approximate best response to each case of the base game simply by playing their individual best responses. Thus, when we reach a NE of the dual curriculum game, the teachers arrive at approximate best responses for both the base game with the joint objective and with their own objectives, meaning they are also in an approximate NE of the base game with either teacher. The full details of this proof are outlined in Appendix A.
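To make the mixture structure of the dual curriculum game concrete, the short Python sketch below computes the student's and the two teachers' utilities when the first teacher supplies the level with some mixing probability (called `p` here) and the second teacher supplies it otherwise. The function and argument names are hypothetical, and the numeric inputs stand for already-computed base-game utilities.

```python
def student_utility(p, value_on_teacher1_level, value_on_teacher2_level):
    """Expected student return when teacher 1 supplies the level with
    probability p and teacher 2 supplies it with probability 1 - p."""
    return p * value_on_teacher1_level + (1.0 - p) * value_on_teacher2_level


def teacher_utilities(p, u_teacher1, u_teacher2):
    """Each teacher's base-game utility, weighted by how often it plays.

    The two terms do not interact, which is why each teacher's best
    response depends only on the student's policy.
    """
    return p * u_teacher1, (1.0 - p) * u_teacher2
```

For example, with `p = 0.25` a student that earns return 1.0 on the first teacher's level and 0.0 on the second teacher's level has expected utility 0.25.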

4 Robustifying PLR

In this section, we provide theoretical justification for the empirically observed effectiveness of PLR, and in the process, motivate a counterintuitive adjustment to the algorithm.

Randomly initialize policy $\pi(\phi)$ and an empty level buffer, $\Lambda$ of size $K$.
while not converged do
        Sample replay-decision Bernoulli, $d \sim P_D(d)$
        if $d = 0$ then
               Sample level $\theta$ from level generator
               Collect $\pi$’s trajectory $\tau$ on $\theta$, with a stop-gradient $\phi_{\perp}$, i.e. suppress policy update
        else
               Use PLR to sample a replay level from the level store, $\theta \sim \Lambda$
               Collect policy trajectory $\tau$ on $\theta$ and update $\pi$ with rewards $R(\tau)$
        end if
        Compute PLR score, $S = \mathrm{score}(\tau, \pi)$
        Update $\Lambda$ with $\theta$ using score $S$
end while
Algorithm 1 Robust PLR (PLR^⊥)
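The control flow of Algorithm 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the authors' code: `generate_level`, `collect`, `update_policy`, and `score_fn` are hypothetical callables supplied by the caller, the buffer is a plain dictionary, and score-proportional sampling is assumed in place of the actual PLR sampling rule.

```python
import random


def robust_plr(generate_level, collect, update_policy, score_fn,
               replay_prob=0.5, buffer_size=4, num_iters=100, seed=0):
    """Sketch of the robust PLR loop: the policy is updated only on replays.

    Hypothetical callables (not part of any library):
      generate_level(rng) -> level
      collect(level)      -> trajectory
      update_policy(traj) -> None
      score_fn(traj)      -> float, e.g. a regret estimate
    """
    rng = random.Random(seed)
    buffer = {}  # level -> score
    for _ in range(num_iters):
        replay = bool(buffer) and rng.random() < replay_prob
        if not replay:
            # Newly generated level: collect a trajectory for scoring only.
            # No policy update is taken on this data.
            level = generate_level(rng)
            trajectory = collect(level)
        else:
            # Replay level: assumed score-proportional sampling.
            levels = list(buffer)
            weights = [max(buffer[lvl], 1e-8) for lvl in levels]
            level = rng.choices(levels, weights=weights, k=1)[0]
            trajectory = collect(level)
            update_policy(trajectory)
        # Score the level and keep only the highest-scoring levels.
        buffer[level] = score_fn(trajectory)
        if len(buffer) > buffer_size:
            del buffer[min(buffer, key=buffer.get)]
    return buffer
```

The only difference from a standard replay loop in this sketch is the missing `update_policy` call in the first branch.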

4.1 Achieving Robustness Guarantees with PLR

PLR provides strong empirical gains in generalization, but lacks any theoretical guarantees of robustness. One step towards achieving such a guarantee is to replace its L1 value-loss prioritization with a regret prioritization, using the methods we discuss in Section 4.2: While L1 value loss may be good for quickly training the value function, it can bias the long-term training behavior toward high-variance policies. However, even with this change, PLR holds weaker theoretical guarantees because the random generating teacher can bias the student away from minimax regret policies and instead, toward policies that sacrifice robustness in order to excel in unstructured levels. We formalize this intuitive argument in the following corollary of Theorem 1.

Corollary 1.

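Let $\bar{G}$ be the dual curriculum game in which the first teacher maximizes regret, so $U_t^1 = U_t^R$, and the second teacher plays randomly, so $U_t^2 = U_t^U$. Let $V^{\theta}(\pi)$ be bounded in $[B^-, B^+]$ for all $\theta, \pi$. Further, suppose that $(\pi, \theta^1, \theta^2)$ is a Nash equilibrium of $\bar{G}$. Let $R^* = \min_{\pi_A \in \Pi} \max_{\theta, \pi_B \in \Theta, \Pi} \mathrm{Regret}^{\theta}(\pi_A, \pi_B)$ be the optimal worst-case regret. Then $\pi$ is $2(B^+ - B^-)(1-p)$ close to having optimal worst-case regret, or formally, $\max_{\theta, \pi_B \in \Theta, \Pi} \mathrm{Regret}^{\theta}(\pi, \pi_B) - R^* \le 2(B^+ - B^-)(1-p)$. Moreover, there exist environments for all values of $p$ within a constant factor of achieving this bound.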

The proof of Corollary 1 follows from a direct application of Theorem 1 to show that a NE of $\bar{G}$ is an approximate NE for the base game of the first teacher, and through constructing a simple example where the student’s best response in $\bar{G}$ fails to attain the minimax regret in $G$. These arguments are described in full in Appendix A. This corollary provides some justification for why PLR improves robustness of the equilibrium policy, as it biases the resulting policy toward a minimax regret policy. However, it also points a way towards further improving PLR: If the probability of using a teacher-generated level directly was set to $0$, then in equilibrium, the resulting policy converges to a minimax regret policy. Consequently, we arrive at the counterintuitive idea of avoiding gradient updates from trajectories collected from randomly sampled levels, to ensure that at NE, we find a minimax regret policy. From a robustness standpoint, it is therefore optimal to train on less data. The modified PLR algorithm with this counterintuitive adjustment is summarized in Algorithm 1; the small change relative to the original algorithm is the suppression of the policy update on newly sampled levels.

4.2 Estimating Regret

In general, levels may differ in maximum achievable returns, making it impossible to know the true regret of a level without access to an oracle. As the L1 value loss typically employed by PLR does not generally correspond to regret, we turn to alternative scoring functions that better approximate regret. Two approaches, both effective in practice, are discussed below.

Positive Value Loss. Averaging over all transitions with positive value loss amounts to estimating regret as the difference between maximum achieved return and predicted return on an episodic basis. However, this estimate is highly biased, as the value targets are tied to the agent’s current, potentially suboptimal policy. As it only considers positive value losses, this scoring function leads to optimistic sampling of levels with respect to the current policy. When using GAE (gae) to estimate bootstrapped value targets, this loss takes the following form, where $\lambda$ and $\gamma$ are the GAE and MDP discount factors respectively, and $\delta_t$, the TD-error at timestep $t$:

$$\frac{1}{T} \sum_{t=0}^{T} \max\left( \sum_{k=t}^{T} (\gamma\lambda)^{k-t} \delta_k,\ 0 \right)$$

Maximum Monte Carlo (MaxMC). We can mitigate some of the bias of the positive value loss by replacing the value target with the highest return achieved on the given level so far during training. By using this maximal return, the regret estimates no longer depend on the agent’s current policy. This estimator takes the simple form of $\frac{1}{T} \sum_{t=0}^{T} \left( R_{\max} - V(s_t) \right)$. In our dense-reward experiments, we compute this score as the difference between the maximum achieved return and $V(s_0)$.
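The two scoring functions can be sketched in Python as follows. The sketch assumes that the positive value loss is the episode average of GAE-style advantages clipped at zero, and that MaxMC is the average gap between the best return seen on the level and the per-step value predictions; the function names and the plain-list inputs are illustrative choices, not the authors' implementation.

```python
def positive_value_loss(td_errors, gamma, gae_lambda):
    """Average over timesteps of max(GAE advantage, 0).

    td_errors[t] is the TD-error at timestep t. The advantage at t is the
    (gamma * gae_lambda)-discounted sum of TD-errors from t to the end of
    the episode, computed with a backward recursion.
    """
    if not td_errors:
        raise ValueError("episode must contain at least one transition")
    advantage = 0.0
    clipped_sum = 0.0
    for delta in reversed(td_errors):
        advantage = delta + gamma * gae_lambda * advantage
        clipped_sum += max(advantage, 0.0)
    return clipped_sum / len(td_errors)


def max_mc(max_return, values):
    """Average gap between the highest return achieved on the level so far
    and the value prediction at each visited state."""
    if not values:
        raise ValueError("episode must contain at least one state")
    return sum(max_return - v for v in values) / len(values)
```

Either score could then be passed to the level buffer as the priority of the level on which the episode was collected.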

5 Replay-Enhanced PAIRED (REPAIRED)

A straightforward extension of PLR is to replace the random teacher (i.e. the level generator) used by PLR with the PAIRED teacher. This addition then necessarily entails introducing a second student agent, the antagonist, also equipped with its own PLR level buffer. In each episode, with some fixed probability, the student agents train on a newly generated level; otherwise, they train on a level sampled from each student’s own PLR buffer, prioritizing levels by highest estimated regret. We will refer to this extension as Replay-Enhanced PAIRED (REPAIRED). An overview of REPAIRED is provided by black arrows in Figure 2, with the students being both the protagonist and antagonist, while the full procedure is outlined in Appendix B.

Since REPAIRED’s variant of PLR and PAIRED both promote regret in equilibrium, it would be reasonable to believe the combination of the two does the same. A straightforward corollary of Theorem 1, which we describe in Appendix A, shows that, in a theoretically ideal setting, combining these two algorithms as is done in REPAIRED indeed finds minimax regret strategies in equilibrium.

Corollary 2.

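Let $\bar{G}$ be the dual curriculum game in which both teachers maximize regret, so $U_t^1 = U_t^2 = U_t^R$. Further, suppose that $(\pi, \theta^1, \theta^2)$ is a Nash equilibrium of $\bar{G}$. Then, $\pi \in \arg\min_{\pi_A \in \Pi} \max_{\theta, \pi_B \in \Theta, \Pi} \mathrm{Regret}^{\theta}(\pi_A, \pi_B)$.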

This result gives us some amount of assurance that, if our method arrives at NE, then the protagonist has converged to a minimax regret strategy, which has the benefits outlined in (paired): Since a minimax regret policy solves all solvable environments, whenever this is possible and sufficiently well-defined, we should expect policies resulting from the equilibrium behavior of REPAIRED to be robust and versatile across all environments in the domain.

6 Experiments

Our experiments firstly aim to (1) assess the empirical performance of the theoretically motivated PLR^⊥, and secondly, seek to better understand the effect of replay on unsupervised environment design, specifically (2) its impact on the zero-shot generalization performance of the induced student policies, and (3) the complexity of the levels designed by the teacher. To do so, we compare PLR and REPAIRED against their replay-free counterparts, DR and PAIRED, in the two highly distinct settings of discrete control with sparse rewards and continuous control with dense rewards. We provide environment descriptions alongside model and hyperparameter choices in the Appendix.


6.1 Partially-Observable Navigation

Each navigation level is a partially-observable maze requiring student agents to take discrete actions to reach a goal and receive a sparse reward. Our agents use PPO (schulman2017proximal) with an LSTM-based recurrent policy to handle partial observability. Before each episode, the teacher designs the level in this order: beginning with an empty maze, it places one obstructing block per time step up to a predefined block budget, and finally places the agent followed by the goal.

Figure 3: Zero-shot transfer performance in challenging test environments after 250M training steps. The plots show median and interquartile range of solved rates over 10 runs. An asterisk (*) next to the maze name indicates the maze is procedurally-generated, and thus each attempt corresponds to a random configuration of the maze.

Zero-Shot Generalization. We train policies with each method for 250M steps and evaluate zero-shot generalization on several challenging OOD environments, in addition to levels from the full distribution of two procedurally-generated environments, PerfectMazes and LargeCorridor. We also compare against DR and minimax baselines. Our results in Figures 3 and 4 show that PLR^⊥ and REPAIRED both achieve greater sample-efficiency and zero-shot generalization than their replay-free counterparts. The improved test performance achieved by PLR^⊥ over both DR and PLR when trained for an equivalent number of gradient updates, aggregated over all test mazes, is statistically significant, as is the improved test performance of REPAIRED over PAIRED. Table 2 in Appendix C.1 reports the mean performance of each method over all test mazes. Notably, well before 250 million steps, both PLR and PLR^⊥ significantly outperform PAIRED after 3 billion training steps as reported in (paired). Further, these two methods lead to policies exhibiting greater zero-shot transfer than both PAIRED and REPAIRED. The success of designing regret-maximizing levels via random search (curation) over learning a generator with RL suggests that for some UPOMDPs, the regret landscape, as a function of the free parameters $\theta$, has a low effective dimensionality (bergstra2012random). Foregoing gradient-based learning in favor of random search may then lead to faster adaptation to the changing regret landscape, as the policy evolves throughout the course of training.

Figure 4: Zero-shot transfer performance during training for PAIRED and REPAIRED variants. The plots show mean and standard error across 10 runs. The dotted lines mark the mean performance of PAIRED after 3B training steps, as reported in (paired), while dashed lines indicate median returns.
Figure 5: Examples of emergent structures generated by each method.

Emergent Complexity. As the student agents improve, the teachers must generate more challenging levels to maintain regret. We measure the resultant emergent complexity by tracking the number of blocks in each level and the shortest path length to the goal (where unsolvable levels are assigned a length of 0). These results, summarized in Figure 4, show that PAIRED slowly adapts the complexity over training while REPAIRED initially quickly grows complexity, before being overtaken by PAIRED. The fast onset of complexity may be due to REPAIRED’s fast replay mechanism, and the long-term slowdown relative to PAIRED may be explained by its less frequent gradient updates. Our results over an extended training period in Appendix C confirm that both PAIRED and REPAIRED slowly increase complexity over time, eventually matching that attained in just a fraction of the number of gradient steps by PLR and PLR^⊥. This result shows that random search is surprisingly efficient at continually discovering levels of increasing complexity, given an appropriate curation mechanism such as PLR. Figure 5 shows that, similar to methods with a regret-maximizing teacher, PLR finds levels exhibiting complex structure.
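The two complexity measures above (block count and shortest path length, with unsolvable levels assigned a length of 0) can be sketched in Python as follows. The grid encoding is an assumption made for illustration: a level is a list of equal-length strings in which `'#'` marks a block, movement is 4-connected, and the function names are hypothetical.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Breadth-first search on a 4-connected grid.

    Returns the number of moves on a shortest path from start to goal,
    or 0 if the goal cannot be reached (the convention used in the text).
    Note that start == goal also yields 0 under this convention.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return 0


def num_blocks(grid):
    """Count obstructing blocks in the level."""
    return sum(row.count('#') for row in grid)
```

Tracking both numbers over training gives a simple picture of how level complexity evolves for each method.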

6.2 Pixel-Based Car Racing with Continuous Control

To test the versatility and scalability of our methods, we turn to an extended version of the CarRacing environment from OpenAI Gym (gym). This environment entails continuous control with dense rewards, a 3-dimensional action space, and partial, pixel observations, with the goal of driving a full lap around a track. To enable UED of any closed-loop track, we reparameterize CarRacing to generate tracks as Bézier curves (bezier_ref) with arbitrary control points. The teacher generates levels by choosing a sequence of up to 12 control points, which uniquely defines a Bézier track within specific, predefined curvature constraints. After 5M steps of training, we test the zero-shot transfer performance of policies trained by each method on 20 levels replicating official human-designed Formula One (F1) tracks (see Figure 19 in the Appendix for a visualization of the tracks). Note that these tracks are significantly OOD, as they cannot be defined with just 12 control points. In Figure 6 we show the progression of zero-shot transfer performance for the original CarRacing environment, as well as three F1 tracks of varying difficulty, while also including the final performance on the full F1 benchmark. For the final performance, we also evaluated the state-of-the-art CarRacing agent from (attentionagent) on our new F1 benchmark.
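As a rough illustration of how a sequence of control points determines a curve, the Python sketch below evaluates a Bézier curve with De Casteljau's algorithm and samples points along a closed loop. It is not the track generator used in the experiments: closing the loop by repeating the first control point, the function names, and the omission of the curvature constraints are all assumptions made for this sketch.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bézier curve at parameter t in [0, 1] by repeated
    linear interpolation between consecutive control points."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]


def sample_track(control_points, num_samples=100):
    """Sample points along a closed curve.

    Assumption: the loop is closed by appending the first control point
    at the end, so the curve starts and finishes at the same location.
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    closed = list(control_points) + [control_points[0]]
    return [de_casteljau(closed, i / (num_samples - 1))
            for i in range(num_samples)]
```

A teacher in this simplified picture would output the list of control points, and the environment would turn the sampled points into a drivable track.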

Figure 6: Zero-shot transfer performance. Plots show mean and standard error over 10 runs.

Unlike in the sparse, discrete navigation setting, we find DR leads to moderately successful policies for zero-shot transfer in CarRacing. Dense rewards simplify the learning problem and random Bézier tracks occasionally contain the challenges seen in F1 tracks, such as hairpin turns and observations showing parallel tracks due to high local curvature. Still, we see that policies trained by selectively sampling tracks to maximize regret significantly outperform those trained by uniformly sampling from randomly generated tracks, in terms of zero-shot transfer to the OOD F1 tracks. Remarkably, with a replay rate of 0.5, PLR^⊥ sees statistically significant gains over PLR in zero-shot performance over the full F1 benchmark, despite directly training on only half the rollout data using half as many gradient updates. Moreover, the robustness adjustment of PLR^⊥ is crucial for attaining improved performance with respect to DR in this domain. Once again, we see that random search with curation via PLR produces a rich selection of levels and an effective curriculum.

We also observe that PAIRED struggles to train a robust protagonist in CarRacing. Specifically, PAIRED overexploits the relative strengths of the antagonist over the protagonist, finding curricula that steer the protagonist towards policies that ultimately perform poorly even on simple tracks, leading to a gradual reduction in level complexity. We present training curves revealing this dynamic in Appendix C. As shown in Figure 6, REPAIRED improves upon this significantly, inducing a policy that outperforms both PAIRED and standard PLR in mean performance on the full F1 benchmark. Notably, PLR^⊥ approaches the performance of the state-of-the-art AttentionAgent (attentionagent), despite not using a self-attention policy and training on less than 0.25% of the number of environment steps in comparison. These gains come purely from the induced curriculum. Figure 17 in Appendix C further reveals that PLR^⊥ and REPAIRED induce CarRacing policies that tend to achieve higher minimum returns on average compared to other methods that do not have such a robustness guarantee, providing further evidence of the benefits of the minimax regret property.

7 Related Work

In inducing parallel curricula, DCD follows a rich lineage of curriculum learning methods (bengio_curriculum; schmidhuber_curriculum; curriculum_rl_survey2; curriculum_rl_survey1). Many previous curriculum learning algorithms resemble the curator in DCD, sharing similar underlying selective-sampling mechanisms as PLR^⊥. Most similar is TSCL (tscl), which prioritizes levels based on return rather than value loss, and has been shown to overfit to training levels in some settings (plr). In our setting, replayed levels can be viewed as past strategies from a level-generating teacher. This links our replay-based methods to fictitious self-play (FSP, fictitious_sp), and more closely, Prioritized FSP (vinyals2019grandmaster), which selectively samples opponents based on historic win ratios.

Recent approaches that make use of a generating adversary include Asymmetric Self-Play (sukhbaatar2018intrinsic; openai2021asymmetric), wherein one agent proposes tasks for another in the form of environment trajectories, and AMIGo (amigo), wherein the teacher is rewarded for proposing reachable goals. While our methods do not presuppose a goal-based setting, others have made progress here using generative modeling (goalgan; Racaniere2020Automated), latent skill learning (carml), and exploiting model disagreement (NEURIPS2020_566f0ea4). These methods are less generally applicable than PLR^⊥, and unlike our DCD methods, they do not provide well-principled robustness guarantees.

Other recent algorithms can be understood as forms of UED and, like DCD, are framed through the lens of decision theory. POET (poet; enhanced_poet), a coevolutionary approach (Popovici2012), uses a population of minimax (rather than minimax regret) adversaries to construct terrain for a BipedalWalker agent. In contrast to our methods, POET requires training a large population of both agents and environments and consequently incurs a sizable compute overhead. APT-Gen (fang2021adaptive) also procedurally generates tasks, but requires access to target tasks, whereas our methods seek to improve zero-shot transfer.

The DCD framework also encompasses adaptive domain randomization methods (DR, mehta2019activedomain; dr_evolutionary), which have seen success in assisting sim2real transfer for robotics (domain_randomization; james2017transferring; dexterity; rubics_cube). DR itself is subsumed by procedural content generation (risi_togelius_pcg), for which UED and DCD may be seen as providing a formal, decision-theoretic framework, enabling development of provably optimal algorithms.

8 Conclusion

We developed a novel connection between PLR and minimax regret UED approaches like PAIRED. We demonstrated that PAIRED, a slow but generative method for level design, can be combined with PLR, a fast but non-generative method that instead relies on replay-based curation of the most promising existing levels. In order to theoretically ground this new setting, we introduced Dual Curriculum Design (DCD), in which a student policy is challenged by a team of two co-adapting, regret-maximizing teachers: a generator that creates new levels, and a curator that selectively samples existing levels for replay. This formalism enabled us to prove robustness guarantees for PLR at NE, notably yielding the counterintuitive result that PLR can be made provably robust by training on less data, specifically, only the trajectories on levels sampled for replay. In addition, we developed Replay-Enhanced PAIRED (REPAIRED), the natural instantiation of DCD. Empirically, in two highly distinct environments, we found that PLR^⊥ significantly improves zero-shot generalization over PLR, and REPAIRED over PAIRED.

Long-running UED processes in expansive UPOMDPs closely resemble continual learning in open-ended domains. The congruency of these settings suggests our contributions around DCD may extend to more general continual learning settings in which agents must learn to master a diverse sequence of tasks with predefined (or inferred) episode boundaries, if tasks are assumed to be designed by a regret-maximizing teacher. Thus, DCD-based methods like PLR^⊥ may also yield more general policies for continual learning. We believe this to be a promising direction for future research.

We would like to thank Natasha Jaques, Patrick Labatut, and Heinrich Küttler for fruitful discussions that helped inform this work. Further, we are grateful to our anonymous reviewers for their valuable feedback. MJ is supported by the FAIR PhD program. This work was funded by Facebook.


Appendix A Theoretical Results

In this section we prove the theoretical results around the dual curriculum game and use these results to show approximation bounds for our methods, given that they have reached a Nash equilibrium (NE).

The first theorem is the main result that allows us to analyze dual curriculum games. The high-level result says that the NE of a dual curriculum game are approximate NE of the base game from the perspective of any of the individual players, or from the perspective of the joint strategy.

Theorem 1.

Let be the maximum difference between and , and let be a NE for . Then is an approximate NE for the base game with either teacher or for a teacher optimizing their joint objective. More precisely, it is a -approximate NE when , a -approximate NE when , and a -approximate NE when .

At a high level, this is true because, for low values of , the best-response strategies for the individual players can be thought of as approximate best-response strategies for the joint player, and vice versa. Since the Nash equilibrium consists of each of the players playing their own best response, they must be playing an approximate best response for the joint player. We provide a formal proof below:


Let be the maximum difference between and , and let be a Nash Equilibrium for . Then consider as a strategy in the base game for the joint player . Let be the best response for the joint player to . Since is a best response by assumption, it is sufficient to show that is an approximate best response. We then have


Thus, we have shown that represents an -Nash equilibrium for the joint player. For the first teacher, we have the opposite condition: trivially, the teacher is playing a best response to the student. We must now show that the student is playing an approximate best response to the teacher.

Let be the best response to the first teacher (with utility ) and let be the best response policy to the joint teacher. In this argument we will start with the observation that by definition, and then argue that we can construct an upper bound on the performance of on , , and a lower bound on the performance of on , . We get the desired result by combining these two arguments.

First we use to upper bound :


Second we can use to lower bound :


Putting this all together, we have

Which, after rearranging terms, gives

as desired. Repeating the symmetric argument shows the desired property for the second teacher. ∎

Following this main theorem, we can apply it to two of our methods. First, we apply it to naive PLR, which trains on a mixture of domain randomization (a teacher with utility ) and the PLR bandit (a teacher with utility ). This result shows that as we reduce the number of random episodes, the approximation to a minimax regret strategy improves. The intuition is a direct application of Theorem 1: the equilibrium is an approximate Nash equilibrium for the minimax regret player, and the minimax regret player has access to a strategy that ensures small regret, so the regret that the equilibrium ensures must also be approximately small.

Corollary 1.

Let be the dual curriculum game in which the first teacher maximizes regret, so , and the second teacher plays randomly, so . Let be bounded in for all . Further, suppose that is a Nash equilibrium of . Let be optimal worst-case regret. Then is close to having optimal worst-case regret, or formally . Moreover, there exist environments for all values of within a constant factor of achieving this bound.


Since is bounded in for all , we know that and are within of each other. Thus by Theorem 1 we have that is a -Nash equilibrium of the base game when . Thus is a approximate best response to . However, since is a best response, it chooses a regret-maximizing parameter distribution. Thus the does not just measure the suboptimality of with respect to , but measures the worst-case regret of across all as desired.

The intuition for the existence of examples in which this approximation of regret decays linearly in is that a random level and the maximal regret level can be very different, and so the two measures may diverge drastically. For an example environment where deviates strongly from the minimax regret strategy, consider the one-step UMDP described in Table 1.

Table 1: In this environment all payoffs are between and (for and ), where is assumed to be positive. Randomizing between and minimizes regret, but choosing or is better in expectation under the uniform distribution. For large it is especially clear that and have better expected value under the uniform distribution, though we show that even for , the optimal joint policy can mix between and incurring high regret.

Note that in Table 1, no policy has less than regret, since every policy will have to incur regret on either at least half the time. The minimax regret policy mixes uniformly between and to achieve regret of exactly . We can ignore for the regret calculations by assuming that , since every policy achieves less than regret on these levels.

Our claim is that in equilibrium of in this environment, the student policy can incur regret, more than the minimax regret policy. An example of such an equilibrium point would be when the student policy uniformly randomizes between and , which we will call , when the minimax teacher uniformly randomizes between and which we will call , and when the uniform teacher randomizes exactly which we call . To check this we must show that is in fact a NE of . Then we must show that incurs regret.

To show that is a NE of first note that is trivially a best response for the uniform utility function. Also note that maximizes the regret of since and are the only two parameters on which incur regret, and they incur the same regret; thus, any mixture over them will be optimal for the regret-based teacher. Finally, we need to show that is optimal for the student. To do this we will calculate the expected value of each policy and notice that the expected values for and are higher than for and . Thus any optimal policy will place no weight on and , but any distribution over and will be equivalently optimal. By symmetry, we can show only the calculations for and :


Thus and achieve higher expected value by the joint distribution. Thus, we know that is a best response and is in fact a NE of .

Finally, we simply need to show that incurs regret. WLOG, we can evaluate its regret on . On , achieves reward while achieves . Thus incurs regret of as desired. As discussed before, since the minimax regret policy achieves , this is more regret than optimal. ∎

Lastly, we can also apply Theorem 1 to prove that REPAIRED achieves a minimax regret strategy in equilibrium. The intuition behind this corollary is that, since the utility functions of both teachers are the same, the approximate NE ensured by Theorem 1 is actually a true NE; therefore, the minimax theorem applies.

Corollary 2.

Let be the dual curriculum game in which both teachers maximize regret, so . Further, suppose that is a Nash equilibrium of . Then, .


Since the joint objective is . Note that since , . Thus by Theorem 1 is a -Nash Equilibrium of the base game with teacher objective , thus by the minimax theorem, as desired. ∎

Appendix B Algorithms

Although the PLR update rule for the level buffer of size in the case of unbounded training levels is described in [plr], we provide the pseudocode for this update rule in Algorithm 2 for completeness. Given staleness coefficient , temperature , a prioritization function (e.g. rank), level buffer scores , level buffer timestamps , and the current episode count (i.e. current timestamp), the update takes the form

The pseudocode for Replay-Enhanced PAIRED (REPAIRED), the method described in Section 5, is presented in Algorithm 3.

Input: Level buffer of size with scores and timestamps ; level ; level score ; and current episode count
if  then
       Insert into , and set ,
else
       Find level with minimal support,
       if  then
             Remove from
             Insert into , and set ,
             Update with latest scores and timestamps
       end if
end if
Algorithm 2 PLR level-buffer update rule
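The update rule in Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the class and method names are hypothetical, and the replay distribution is assumed to take the form given in the PLR paper, P_replay = (1 - ρ) P_S + ρ P_C, with rank prioritization h(S_i) = 1/rank(S_i) raised to the power 1/β and staleness weights proportional to the number of episodes since a level was last scored.

```python
from dataclasses import dataclass, field


@dataclass
class LevelBuffer:
    """Hypothetical PLR level buffer; names are illustrative only."""
    capacity: int                     # buffer size K
    rho: float = 0.3                  # staleness coefficient
    beta: float = 0.3                 # temperature
    scores: dict = field(default_factory=dict)      # level -> latest score
    timestamps: dict = field(default_factory=dict)  # level -> episode count when scored

    def replay_probs(self, c):
        """Return {level: P_replay} at current episode count c."""
        levels = list(self.scores)
        if not levels:
            return {}
        # Rank prioritization: the highest-scoring level has rank 1.
        order = sorted(levels, key=lambda lvl: self.scores[lvl], reverse=True)
        weight = {lvl: (1.0 / (r + 1)) ** (1.0 / self.beta)
                  for r, lvl in enumerate(order)}
        z_score = sum(weight.values())
        # Staleness: episodes elapsed since the level was last scored.
        stale = {lvl: c - self.timestamps[lvl] for lvl in levels}
        z_stale = sum(stale.values())
        probs = {}
        for lvl in levels:
            p_s = weight[lvl] / z_score
            # Assumption: fall back to uniform if no level is stale yet.
            p_c = stale[lvl] / z_stale if z_stale > 0 else 1.0 / len(levels)
            probs[lvl] = (1 - self.rho) * p_s + self.rho * p_c
        return probs

    def update(self, level, score, c):
        """Insert if there is room, else replace the minimal-support level."""
        # Assumption: a level already in the buffer simply has its score refreshed.
        if level in self.scores or len(self.scores) < self.capacity:
            self.scores[level] = score
            self.timestamps[level] = c
            return True
        probs = self.replay_probs(c)
        weakest = min(probs, key=probs.get)
        if self.scores[weakest] < score:
            del self.scores[weakest], self.timestamps[weakest]
            self.scores[level] = score
            self.timestamps[level] = c
            return True
        return False
```

A new level therefore displaces a buffered one only when the buffer is full and its score exceeds that of the level with the lowest replay probability.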
Randomly initialize Protagonist, Antagonist, and Generator policies , , and
Initialize Protagonist and Antagonist PLR level buffers and
while not converged do
       Sample replay-decision Bernoulli,
       if  then
             Teacher policy generates the next level,
             Collect trajectory on and on with stop-gradients ,
             Update with
       else
             PLR samples replay levels, and
             Collect trajectory on and on
             Update with rewards , and , with rewards
       end if
       Compute PLR score
       Compute PLR score
       Update with using score
       Update with using score
end while
Algorithm 3 REPAIRED

Appendix C Additional Experimental Results

This section provides additional experimental results in the MiniGrid and CarRacing environments. Note that we determine the statistical significance of our results using a Welch t-test.
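For readers unfamiliar with the test, the following is a minimal sketch of the Welch t statistic and its Welch-Satterthwaite degrees of freedom, using only the Python standard library. The function name and the sample values in the usage note are hypothetical and are not taken from our experiments.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples.

    Unlike Student's t-test, the two samples are not assumed to share
    a common variance.
    """
    na, nb = len(a), len(b)
    # Squared standard errors of each sample mean (unbiased sample variance).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

For example, `welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])` compares two samples of five runs each. A p-value additionally requires the CDF of the t distribution, which a statistics library such as SciPy provides via `scipy.stats.ttest_ind(a, b, equal_var=False)`.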


C.1 Extended Results for MiniGrid

Unlike the original maze experiments used to evaluate PAIRED in [paired], we conduct our main maze experiments with a block budget of 25 blocks (reported in Section 6.1), rather than 50 blocks. Following the environment parameterization in [paired], for a block budget of , the teacher attempts to place blocks that act as obstacles when designing each maze level. However, the teacher can place fewer than blocks, as placing a block in a location already occupied by a block results in a no-op. We found that PAIRED underperforms DR when both methods are given a budget of 50 blocks, a setting in which randomly sampled mazes exhibit enough structural complexity to allow DR to learn highly robust policies. Note that [paired] used a DR baseline with a 25-block budget. With a 50-block budget, DR and all replay-based methods are able to fully solve almost all test mazes after around 500M steps of training, making UED of mazes with a 50-block budget too simple a setting to provide an informative comparison among the methods studied.

C.1.1 Mazes with a 25-block budget

Figure 7: Test maze environments for evaluating zero-shot transfer. An asterisk (*) next to the maze name indicates the maze is procedurally-generated, and thus each attempt corresponds to a random configuration of the maze.

We report the results of evaluating policies produced by each method after 250M training steps on each of the zero-shot transfer environments in Figure 8 and Table 2. Examples of each test environment are presented in Figure 7. All replay-based UED methods lead to policies with statistically significantly higher test performance than PAIRED, and PLR^⊥, after 500M training steps, similarly improves over PLR when trained for an equivalent number of gradient updates (as the replay rate is set to 0.5). Note that for PAIRED and REPAIRED, we evaluate the protagonist policy.

To provide a further sense of the training dynamics, we present the per-agent training returns for each method in Figure 9. Notably PAIRED results in antagonists that attain higher returns than the protagonist as expected. This dynamic takes on a mild oscillation, visible in the training return curve of the generator (adversary). As the protagonist adapts to the adversarial levels, the generator’s return reduces, until the generator discovers new configurations that better exploit the relative differences between the two student policies. Notably, the adversary under REPAIRED seems to propose more difficult levels for both the protagonist and antagonist, while the resulting protagonist policy exhibits improved test performance, as seen in Figure 4.

Figure 8: Zero-shot test performance on OOD environments when trained with a 25-block budget. The plots report the median and interquartile range of solved rates over 10 runs.
Environment DR Minimax PAIRED REPAIRED PLR (500M)
Table 2: Mean test returns and standard errors on zero-shot transfer mazes for each method using a 25-block budget. Results are aggregated over 100 attempts for each maze across 10 runs per method. Bolded figures overlap in standard error with the method attaining the maximum mean test return in each row.
Figure 9: Training returns for each participating agent in each method, when trained with a 25-block budget. Plots show the mean and standard error over 10 runs.
Figure 10: Complexity metrics of environments generated by the teacher throughout training with a 25-block budget. Plots show the mean and standard error of 10 runs.

Additional complexity metrics tracked during training are shown in Figure 10. Alongside the number of blocks and shortest path length of levels seen during training, we also track solved path length and action complexity. Solved path length corresponds to the shortest path length from start position to goal in the levels successfully solved by the primary student agent (e.g. the protagonist in PAIRED). Action complexity corresponds to the Lempel-Ziv-Welch (LZW) complexity—a commonly used measure of string compressibility—of the action sequence taken during the primary student agent’s trajectories. As expected, DR results in constant complexity for number of blocks and path length metrics. REPAIRED generates mazes with significantly greater complexity in terms of block count. The lower path lengths seen by REPAIRED suggest that it trains agents that more readily generalize to different path lengths, thereby pressuring the adversary to raise complexity in terms of block count. Further, given the high replay rates used, the REPAIRED adversary sees far fewer gradient updates with which to adjust its policy. As its shortest path lengths exceed that of PAIRED after adjusting proportionately by replay rate, foreseeably, over a longer period, the shortest path lengths generated by REPAIRED may meet or exceed that of PAIRED. In all cases, the action complexity reduces as the agent becomes more decisive, and we see that both PAIRED and REPAIRED lead to more decisive policies—as indicated by the simultaneously lower action complexity and greater level complexity in terms of higher block count (relative to DR) and, in the case of PAIRED, higher path length metrics. Lastly, it is interesting to note that while the random generator used by PLR produces levels of average complexity, the complexity of curated levels, as revealed in Figure 4, is significantly higher and, in the case of path length, steadily increasing.
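As an illustration of the action complexity metric, the sketch below counts the number of codes emitted when an action sequence is compressed with LZW. Taking the number of emitted codes as the complexity value is an assumption made here for illustration, as is the function name; the exact normalization used for the plots may differ.

```python
def lzw_complexity(actions):
    """Number of codes LZW emits when compressing the action sequence.

    Repetitive sequences reuse long dictionary phrases and emit few codes;
    erratic sequences emit many.
    """
    actions = list(actions)
    # The dictionary starts with every single action that occurs.
    dictionary = {(a,) for a in set(actions)}
    phrase = ()
    emitted = 0
    for a in actions:
        candidate = phrase + (a,)
        if candidate in dictionary:
            # Keep extending the current phrase while it is already known.
            phrase = candidate
        else:
            # Emit the code for the known phrase and learn the longer one.
            emitted += 1
            dictionary.add(candidate)
            phrase = (a,)
    if phrase:
        emitted += 1  # flush the final phrase
    return emitted
```

Under this definition, a policy that repeats a single action for 16 steps yields far fewer codes than one that switches actions erratically over the same horizon, which matches the reading of lower action complexity as more decisive behavior.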

C.1.2 Mazes with a 50-block budget

Similarly, Figures 12, 13, and 14 report the training dynamics and test performance of agents trained using each method with a 50-block budget for 500M steps. Figure 11 shows that DR and all replay-based methods are able to reach near-perfect solve rates on most test mazes after 500M steps of training, with the exception of the Maze and PerfectMazes environments, where the test performances across methods are not markedly dissimilar, making the setting with a 50-block budget uninformative for assessing performance differences among these methods. The example mazes generated by each method, presented in Figure 15, show that the larger block budget allows DR to sample mazes with greater structural complexity, leading to robust policies and diminishing the benefits of the UED methods studied. Therefore, in this work, we focus the main results for the maze domain on the more challenging setting with a 25-block budget. Note that the impact of the block budget on test performance further highlights the importance of properly adapting the training distribution for producing policies exhibiting high generality, a problem that our replay-based UED methods effectively address, as demonstrated by the results for the 25-block setting.

Figure 11: Zero-shot test performance on OOD environments when trained with a 50-block budget. The plots show the median and interquartile range of solved rates over 10 runs.
Figure 12: Test performance as a function of number of training steps with a 50-block budget (left), and test performance and complexity metrics as a function of number of PPO updates (right). The plots show the mean and standard error over 10 runs.
Figure 13: Training returns for each participating agent in each method when training with a 50-block budget. Plots show the mean and standard error over 10 runs.
Figure 14: Complexity metrics of environments generated by the teacher throughout training with a 50-block budget. Plots show the mean and standard error of 10 runs.
Figure 15: Example mazes generated by each method when using a 50-block budget.

C.2 Extended Results for CarRacing

The training return plots for each agent, shown in Figure 16, reveal that PAIRED’s generator (adversary) overexploits the relative advantages of the antagonist over the protagonist, leading to a highly suboptimal protagonist policy. In fact, as shown in the right-most plot of Figure 16, the resulting protagonist policies suffer such performance degradation from the adversarial curriculum that they can no longer even successfully drive on the original, simpler CarRacing tracks.

Additionally, we present per-track zero-shot transfer returns for the entire CarRacing-F1 benchmark after 5M training steps (equivalent to 40M environment interaction steps due to the use of action repeat) in Table 3. Results report the mean and standard deviation over 100 attempts per track across 10 seeds. While DR acts as a strong baseline in terms of zero-shot generalization in this setting, PLR^⊥ either attains the highest mean return or matches the method achieving the highest return within standard error on all tracks. The mean performance of PLR^⊥ across the full benchmark is statistically significantly higher than that of all other methods. Notably, PAIRED sees poor results, likely as a result of how the generator is able to overexploit the differences between antagonist and protagonist to detrimental effect in this domain. We see that REPAIRED mitigates this effect to a degree, resulting in more competitive policies. Note that due to the high compute overhead of training the AttentionAgent (8.2 billion steps of training over a population of 256 agents) [attentionagent], we resorted to evaluating its mean F1 performance using the pre-trained model weights provided by the authors with their public code release. As a result, we only have a single training run for AttentionAgent. This means we cannot reliably compute standard errors for this baseline, but we believe that showing the performance for a single training seed of AttentionAgent on the F1 benchmark alongside our methods, as done in Figure 6, nonetheless provides a useful comparison for further contextualizing the efficacy of our methods. This comparison highlights how, by purely modifying the training curriculum, our methods produce policies with test returns approaching that of AttentionAgent, which in contrast uses a powerful attention-based policy and a larger number of training steps.

As a further analysis of robustness, we inspect the minimum returns over 10 attempts per track, averaged over 10 runs per method. We present these results (mean and standard error) in Figure 17. PLR^⊥ achieves consistently higher minimum returns on average for many of the tracks compared to the other methods, including on the challenging Russia and USA tracks. This indicates that PLR^⊥ more reliably approaches a minimax regret policy than PAIRED and REPAIRED, which both share such a guarantee at NE.

Australia 826
Austria 511
Bahrain 372
Belgium 668
Brazil 145
China 344
France 153
Germany 214
Hungary 769
Italy 798
Malaysia 300
Mexico 580
Monaco 835
Netherlands 131
Portugal 606
Russia 732
Singapore 276
Spain 759
UK 729
USA -192
Mean 477
Table 3: Mean test returns and standard errors of each method on the full F1 benchmark. Results are aggregated over 10 attempts for each track across 10 runs per method. Bolded figures overlap in standard error with the method attaining the maximum mean test return in each row. We see that PLR^⊥ consistently either outperforms the other methods or matches the best performing method. Note that we separately report the results of a single run for AttentionAgent due to its high compute overhead.
Figure 16: From left to right: Returns attained by the protagonist, antagonist, and generator (adversary) throughout training; the protagonist’s zero-shot transfer performance on the original CarRacing-v0 during training. The mean and standard error over 10 runs are shown.
Figure 17: Minimum returns attained across 10 test episodes per track per seed. Bars report mean and standard error over 10 training runs.
Figure 18: A randomly-selected set of CarRacing tracks generated by each method. (a) Domain Randomization (DR) produces tracks of average complexity, with few sharp turns. (b) PAIRED often overexploits the difference in the students, leading to simple tracks that incidentally favor the antagonist. (c) Tracks generated by REPAIRED combine both challenging and easy sections. (d) PLR and (e) PLR^⊥ similarly generate tracks of considerable complexity, but by prioritizing the most challenging randomly generated tracks.

Appendix D Experiment Details and Hyperparameters

Parameter MiniGrid CarRacing

PPO
γ 0.995 0.99
λ_GAE 0.95 0.9
PPO rollout length 256 125
PPO epochs 5 8
PPO minibatches per epoch 1 4
PPO clip range 0.2 0.2
PPO number of workers 32 16
Adam learning rate 1e-4 3e-4
Adam ε 1e-5 1e-5
PPO max gradient norm 0.5 0.5
PPO value clipping yes no
return normalization no yes
value loss coefficient 0.5 0.5
student entropy coefficient 0.0 0.0
generator entropy coefficient 0.0 0.01

PLR
Replay rate 0.5 0.5
Buffer size 4000 500
Scoring function MaxMC positive value loss
Prioritization rank rank
Temperature 0.3 0.1
Staleness coefficient 0.3 0.3

PLR^⊥
Scoring function MaxMC positive value loss

REPAIRED
Scoring function MaxMC MaxMC

Table 4: Hyperparameters used for training each method in the maze and car racing environments.

This section details the environments, agent architectures, and training procedures used in our experiments discussed in Section 6. We use PPO to train both student and generator policies in all experiments. Section 6 reports results for each method using the best hyperparameter settings, which we summarize in Table 4. Note that unless specified, PPO hyperparameters are shared between student and teacher, and PLR hyperparameters are shared between PLR^⊥ and REPAIRED. The procedures for determining the hyperparameter choices for each environment are detailed below, in Sections D.1 and D.2.

D.1 Partially-Observable Navigation (MiniGrid)

Environment details Our mazes are based on MiniGrid [gym_minigrid]. Each maze consists of a grid, where each cell can contain a wall, the goal, the agent, or navigable space. The student agent receives a reward of upon reaching the goal, where is the episode length and is the maximum episode length (set to 250). Otherwise, the agent receives a reward of 0 if it fails to reach the goal. The observation space consists of the agent’s orientation (facing north, south, east, or west) and the grid immediately in front of and including the agent. This grid takes the form of a 3-channel integer encoding. The action space consists of 7 total actions, though mazes only make use of the first three: turn left, turn right, and forward. We do not mask out irrelevant actions.

Level generation Each maze is fully surrounded by walls, resulting in 13 × 13 = 169 cells in which the generator can place walls, the goal, and the agent. Starting from an initially empty maze (except the bordering walls), the generator is given a budget of steps in which it can choose a grid cell in which to place a wall. Placing a wall in a cell already containing a wall results in a no-op. After wall placement, the generator then chooses cells for the goal and the agent’s starting position. If either of these cells collides with an existing wall, a random empty cell is chosen. At each time step, the generator teacher receives the full grid observation of the developing maze, the one-hot encoding of the current time step, as well as a 50-dimensional random noise vector, where each component is uniformly sampled from
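The placement rules above can be sketched as follows. This is a simplified illustration with hypothetical names, not the environment code: cells are indexed from 0 to 168 over a 13 × 13 interior (inferred from the 169 possible cell choices of the generator), and the sketch additionally assumes that the agent may not start on the goal cell.

```python
import random

N_CELLS = 13 * 13  # interior cells available to the generator (assumed 13 x 13)


def build_maze(wall_choices, goal_choice, agent_choice, rng=random):
    """Apply a sequence of generator actions to an initially empty interior.

    wall_choices: cell indices chosen during the wall-placement budget.
    goal_choice, agent_choice: cell indices chosen after wall placement.
    Returns (walls, goal, agent).
    """
    # Choosing a cell that already holds a wall is a no-op, so a set suffices
    # and the final wall count can be lower than the budget.
    walls = set(wall_choices)

    def resolve(choice, taken):
        # On a collision, fall back to a uniformly random empty cell.
        if choice in taken:
            empty = [c for c in range(N_CELLS) if c not in taken]
            return rng.choice(empty)
        return choice

    goal = resolve(goal_choice, walls)
    # Assumption: the agent may not start on the goal cell either.
    agent = resolve(agent_choice, walls | {goal})
    return walls, goal, agent
```

For instance, `build_maze([0, 0, 5, 5, 7], 5, 5, random.Random(0))` yields only three walls despite five placement steps, and both the goal and the agent are relocated to empty cells because cell 5 holds a wall.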


Generator architecture We base the generator architecture on the original model used for the PAIRED adversary in [paired]. This model encodes the full grid observation using a convolution layer ( kernel, stride length , 128 filters) followed by a ReLU activation layer over the flattened convolution outputs. The current time step is embedded into a 10-dimensional space, which is concatenated to the grid embedding, along with the random noise vector. This combined representation is then passed through an LSTM with hidden dimension 256, followed by two fully-connected layers, each with a hidden dimension 32 and ReLU activations, to produce the action logits over the 169 possible cell choices. We further ablated the LSTM and found that its absence preserves the performance of the minimax generator in both 25-block and 50-block settings, as well as that of the PAIRED generator in the 50-block setting, as expected given that the full grid and time step form a Markov state. However, the PAIRED generator struggles to learn without an LSTM in the 25-block setting. We believe PAIRED’s improved performance when using an LSTM-based generator in the 25-block setting is due to the additional network capacity provided by the LSTM. Therefore, in favor of less compute time, our experiments only used an LSTM-based generator for PAIRED in the 25-block setting.

Student architecture The student policy architecture resembles the LSTM-based generator architecture, except the student model uses a convolution with 16 filters to embed its partial observation; does not use a random noise vector; and instead of embedding the time step, embeds the student’s current direction into a 5-dimensional latent space.

Choice of hyperparameters We base our choice of hyperparameters for student agents and generator (i.e. the adversary) on [paired]. We also performed a coarse grid search over the student entropy coefficient in , generator entropy coefficient in , and number of PPO epochs in for both students and generator, as well as the choice of including an LSTM in the student and generator policies. We selected the best performing settings based on average return on the validation levels of SixteenRooms, Labyrinth, and Maze over 3 seeds. Our final choices are summarized in Table 4. The main deviations from the settings in [paired] are the choice of removing the generator’s LSTM (except for PAIRED with 25 blocks) and using fewer PPO epochs (5 instead of 20). For PLR, we searched over replay rate, , in and level buffer size, , in , temperature in , and choice of scoring function in . The final PLR hyperparameter selection was then also used for PLR^⊥ and REPAIRED, except for the scoring function, over which we conducted a separate search for each method.

Zero-shot levels We make use of the challenging test mazes in [paired]: SixteenRooms, requiring navigation through up to 16 rooms to find a goal; Labyrinth, requiring traversal of a spiral labyrinth; and Maze, requiring the agent to find a goal in a binary-tree maze, which requires the agent to successfully backtrack from dead ends. To more comprehensively test the agent’s zero-shot transfer performance on OOD classes of mazes, we introduce Labyrinth2, a rotated version of Labyrinth; Maze2, another variant of a binary-tree maze; PerfectMazes, a procedurally-generated maze environment; and LargeCorridor, another procedurally-generated maze environment, where the goal position is randomly chosen to lie at the end of one of the corridors, thereby testing the agent’s ability to perform backtracking. Figure 3 provides screenshots of these mazes.

D.2 CarRacing

Environment details Each track consists of a closed loop around which the student agent must drive a full lap. In order to increase the expressiveness of the original CarRacing, we reparameterized the tracks using Bézier curves. In our experiments, each track consists of a Bézier curve [bezier_ref] based on 12 randomly sampled control points within a fixed radius, , of the center of the playfield. The track consists of a sequence of polygons. When driving over each previously unvisited polygon, the agent receives a reward equal to . The student additionally receives a reward of -0.1 at each time step. Aligning with the methodology of [carracing_ppo], we do not penalize the agent for driving out of the playfield boundaries, terminate episodes if the agent drives too far off track, and repeat every selected action for 8 steps. The student observation space consists of a pix