Accompanying repository for Unsupervised Active Domain Randomization in Goal-Directed RL
Goal-directed Reinforcement Learning (RL) traditionally considers an agent interacting with an environment that prescribes a real-valued reward proportional to the completion of some goal. Goal-directed RL has seen large gains in sample efficiency, due to the ease of reusing or generating new experience by proposing goals. In this work, we build on the framework of self-play, allowing an agent to interact with itself in order to make progress on some unknown task. We use Active Domain Randomization and self-play to create a novel, coupled environment-goal curriculum, where agents learn through progressively more difficult tasks and environment variations. Our method, Self-Supervised Active Domain Randomization (SS-ADR), generates a growing curriculum, encouraging the agent to try tasks that are just outside of its current capabilities, while building a domain-randomization curriculum that enables state-of-the-art results on various sim2real transfer tasks. Our results show that a curriculum that co-evolves environment difficulty along with the difficulty of goals set in each environment provides practical benefits in the goal-directed tasks tested.
The classic Markov Decision Process (MDP)-based formulation of RL can be extended with goals to contextualize actions and enable higher sample efficiency. These methods work by allowing the agent to set its own goals, rather than relying exclusively on the environment to provide them. However, when setting new goals, the onus falls on the experimenter to decide which goals to use; not all experience is equally useful for learning. As a result, past works have resorted to simple random sampling, or to learning an expensive generative model to generate relevant goals.
In the framework of self-play, the agent can set goals for itself, using only unlabelled interactions with the environment (i.e., no evaluation of the true reward function). While many heuristics for this self-play goal curriculum exist, we focus on the framework of Asymmetric Self-Play, which learns a goal-setting policy via time-based heuristics. The idea is that the most "productive" goals for an agent to see are just beyond the agent's current understanding, or horizon. If goals are too easy or too hard, the experience will not be useful, making the horizon approach a strong option to pursue.
However, in certain cases, learning a goal curriculum via self-play alone is not enough. In robotic RL, policies trained purely in simulation have proved difficult to transfer to the real world, a problem known as the "reality gap". One leading approach for this sim2real transfer is Domain Randomization (DR), where a simulator's parameters are perturbed, generating a space of related-but-different environments, all of which an agent tries to solve before transferring to a real robot. Nevertheless, like the goal-curriculum issue, the question once again becomes which environments to show the agent. Recent work empirically showed that not all generated environments are equally useful for learning, leading to Active Domain Randomization (ADR). ADR defines a curriculum learning problem in the environment randomization space, using learned rewards to search for an optimal curriculum.
As our work deals with both robotics and goal-directed RL, we combine ADR and Asymmetric Self-Play to propose Self-Supervised Active Domain Randomization (SS-ADR). SS-ADR couples the environment and goal spaces, learning a curriculum across both simultaneously. SS-ADR can transfer to real-world robotic tasks without ever evaluating the true reward function during training, learning a policy entirely via self-supervised reward signals. We show that this coupling generates strong robotic policies in all environments tested, even across multiple robots and simulation settings.
We consider a Markov Decision Process (MDP), M, defined by (S, A, T, r, γ), where S is the state space, A is the action space, T is the transition function, r is the reward function, and γ is the discount factor. Formally, the agent receives a state s_t at timestep t and takes an action a_t based on the policy π(a_t | s_t). The environment gives a reward r_t and the agent transitions to the next state s_{t+1}. The goal of RL is to find a policy π which maximizes the expected return from each state, where the return is given by G_t = Σ_{k≥0} γ^k r_{t+k}. Goal-directed RL often appends a goal g (from some goal space G) to the state, and requires the goal when evaluating the reward function (i.e., r(s_t, a_t, g)).
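As a concrete sketch of the two definitions above, the discounted return and the goal-appended observation can be written in a few lines of Python (the array shapes are illustrative, not the paper's):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return G_t = sum_k gamma^k * r_{t+k}, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def goal_conditioned_obs(state, goal):
    """Goal-directed RL appends the goal g to the state s."""
    return np.concatenate([state, goal])
```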
Curriculum learning is a strategy of training machine learning models on a series of tasks of gradually increasing difficulty (from easy to hard). In curriculum learning, the focus lies on the ordering of tasks, often abstracting away the particular learning of the task itself. In general, task curricula are crafted such that the next task is just beyond the agent's current capabilities. However, when an explicit ordering of task difficulty is not available, careful design of the curriculum is required to avoid optimization failures.
We consider the self-play framework which proposes an unsupervised way of learning to explore the environment. In this method, the agent has two brains: Alice, which sets a task, and Bob, which finishes the assigned task. The novelty of this method can be attributed to the elegant reward design given by Equations 1 and 2:

R_A = γ_s · max(0, t_B − t_A)    (1)
R_B = −γ_s · t_B    (2)

where t_A is the number of timesteps taken by Alice to set a task, t_B is the number of timesteps taken by Bob to finish the task set by Alice, and γ_s is a scaling factor. This reward design allows self-regulating feedback between the two agents, as Alice focuses on tasks that are just beyond Bob's horizon: Alice tries to propose tasks that are easy for her, yet difficult for Bob. This evolution of tasks forces the two agents to automatically construct a curriculum for exploration.
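A minimal Python sketch of this reward design, assuming t_A and t_B are measured in environment steps and γ_s is the scaling factor:

```python
def self_play_rewards(t_alice, t_bob, gamma_s=0.1):
    """Asymmetric self-play rewards (Equations 1 and 2)."""
    r_alice = gamma_s * max(0, t_bob - t_alice)  # Alice: propose tasks she finishes fast but Bob cannot
    r_bob = -gamma_s * t_bob                     # Bob: finish the proposed task as quickly as possible
    return r_alice, r_bob
```

Note that Alice's reward is zero whenever Bob matches or beats her time, which is exactly the self-regulating feedback described above.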
However, in the original work, the unsupervised self-play is used only as supplementary experience. In order to learn better policies on a target task, Bob still requires a majority of trajectories where the reward is evaluated from the environment.
Domain Randomization is a technique that provides enough variability at training time that, at test time, the model generalizes well to potentially unseen data. It requires the explicit definition of a set of simulation parameters (friction, damping, etc.) and a randomization space Ξ. During every episode, a set of parameters ξ ∈ Ξ is sampled and passed through the simulator to generate a new MDP M_ξ. If R(π; ξ) is the cumulative return of the policy π in the MDP parameterized by ξ, then the goal is to maximize the expected return E_{ξ∼Ξ}[R(π; ξ)] across the distribution of such MDPs. The hope is that, when this model is deployed in an unseen environment, such as on a real robot (in a zero-shot transfer scenario), the policy generalizes well enough to maintain strong performance.
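A minimal sketch of uniform-sampling DR; the parameter names and ranges below are illustrative placeholders, not the paper's actual randomization space:

```python
import random

# Hypothetical randomization space Xi: parameter name -> (low, high)
RANDOMIZATION_SPACE = {
    "puck_friction": (0.1, 1.0),
    "puck_damping": (0.01, 0.5),
}

def sample_environment_params(space, rng=random):
    """Uniform DR: independently sample each simulator parameter xi from Xi."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
```

Each sampled dictionary would then parameterize one episode's simulator instance M_ξ.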
ADR is a framework that searches for the most informative environment instances, unlike the uniform sampling of DR. ADR formulates this search as an RL problem, where the sampling policy is parameterized via Stein Variational Policy Gradient (SVPG), learning a set of particles that control which environments are shown to the agent. The particles undergo interacting updates, which can be written as:

μ_i ← μ_i + (ε/n) Σ_{j=1}^{n} [ (1/T) ∇_{μ_j} J(μ_j) k(μ_j, μ_i) + ∇_{μ_j} k(μ_j, μ_i) ]    (3)

where J(μ_j) denotes the sampled return from particle μ_j, ε the learning rate, T the temperature, and k a kernel between particles.
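A NumPy sketch of this interacting update, assuming an RBF kernel between particles (a common choice for SVPG; the paper's exact kernel and hyperparameters may differ):

```python
import numpy as np

def svpg_step(particles, grads_J, eps=0.05, temp=10.0, h=1.0):
    """One interacting SVPG update over n particles of dimension d.

    particles: (n, d) particle positions mu_i
    grads_J:   (n, d) sampled return gradients grad J(mu_j)
    """
    n = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]  # diff[j, i] = mu_j - mu_i
    k = np.exp(-(diff ** 2).sum(-1) / h)                  # RBF kernel k(mu_j, mu_i)
    grad_k = -2.0 / h * diff * k[..., None]               # grad wrt mu_j of k(mu_j, mu_i)
    # phi_i = (1/n) sum_j [ (1/T) grad J(mu_j) k(mu_j, mu_i) + grad_{mu_j} k(mu_j, mu_i) ]
    phi = ((k[..., None] * grads_J[:, None, :]).sum(0) / temp + grad_k.sum(0)) / n
    return particles + eps * phi
```

The kernel term drives exploitation (particles follow high-return gradients weighted by similarity), while the kernel-gradient term repels particles from one another, keeping the proposed environments diverse.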
The particles are trained using learned discriminator-based rewards, which measure the discrepancies between trajectories from the reference environment instance and the randomized environment instances.
The authors claim that ADR finds environments which are difficult for the current agent policy to solve via learnable discrepancies between the reference (generally, easier) environment, and a proposed randomized instance.
While the formulation benefits from learned rewards, ADR also suffers from an exploitability problem, as the authors mention in the paper's appendix. Equation 4 finds (and rewards) environments where the discrepancy can be maximized, leading to situations where the method exploits the physics of the simulation by generating "impossible to solve" environments. The original work proposed iteratively adjusting the bounds of the randomization space as a workaround for this exploitability issue.
The idea of curriculum learning was first shown to be beneficial in language processing, and was later extended to various vision and language tasks, yielding faster learning and better convergence. While many of these approaches require some human specification, automatic task generation has recently gained interest in the RL community. This body of work includes automatic curricula produced by adversarial training, reverse curriculum generation, teacher-student curriculum learning, etc. However, many of these works exploit (a) distinct tasks rather than continuous task spaces, or (b) state- or reward-based "progress" heuristics. Our work builds upon the naturally growing curriculum formulation of asymmetric self-play, fixing some of its issues with stability-inducing properties.
Curriculum learning has also been studied through the lens of self-play. Self-play has been successfully applied to many games, such as checkers and Go. Recently, an interesting asymmetric self-play strategy has been proposed, which models a game between two variants of the same agent, Alice and Bob, enabling exploration of the environment without requiring any extrinsic reward. In this work, however, we use the self-play framework for learning a curriculum of goals, rather than for its traditional exploration-driven use case.
Despite the success of deep RL, training RL algorithms on physical robots remains a difficult problem, and is often impractical due to safety concerns. Simulators have played a huge role in transferring policies to real robots safely, and many different methods have been proposed for this purpose. DR is one of the popular methods; it generates a multitude of environment instances by uniformly sampling the environment parameters from a fixed range. However, subsequent work showed that DR suffers from high variance due to its unstructured task space, and instead proposed a novel algorithm that learns to sample the most informative environment instances. In our work, we use the ADR formulation while mitigating some of its critical issues, such as exploitability, by substituting the learned reward with a self-supervised reward.
ADR allows for curriculum learning in an environment space: given some black box agent, trajectories are used to differentiate between the difficulty of environments, regardless of the goal set in the particular environment instance. In goal-directed RL, the goal itself may be the difference between a useful episode and a useless one. In particular, certain goals within the same environment instance may vary in difficulty; on the other hand, the same goal may vary in terms of reachability in different environments. ADR provides a curriculum in environment space, but with goal-directed environments, we have a new dimension to consider; one that the standard ADR formulation does not account for.
In order to build proficient, generalizable agents, we need to evolve a curriculum in goal space alongside a curriculum in environment space; otherwise, we may find degenerate solutions by proposing impossible goals in any environment, or vice versa. Self-play provides a way for policies to learn without evaluating the environment's reward function, but when used only for goal curricula, it requires interleaving self-play trajectories with reward-evaluated rollouts for best performance.
To this end, we propose Self-Supervised Active Domain Randomization (SS-ADR), summarized in Algorithm 1. SS-ADR learns a curriculum in the joint goal-environment space, producing strong, generalizable policies without ever evaluating an environment reward function during training.
SS-ADR learns two additional policies: Alice and Bob, trained in the format described in Algorithm 1. Alice sets a goal in the environment and eventually signals a STOP action. The environment is reset to the starting state, and Bob's policy then attempts to achieve the goal Alice has set. Bob sees Alice's goal state appended to the current state, while Alice sees the current state appended to its initial state. Alice and Bob are trained via DDPG, using Equations 1 and 2 to generate rewards for each trajectory based on the time each agent took to complete the task (denoted by t_A and t_B).
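The observation construction described above can be sketched as follows (array-valued states are an assumption for illustration):

```python
import numpy as np

def alice_observation(state, initial_state):
    """Alice sees the current state appended to the episode's initial state."""
    return np.concatenate([state, initial_state])

def bob_observation(state, alice_goal_state):
    """Bob sees Alice's final (goal) state appended to his current state."""
    return np.concatenate([state, alice_goal_state])
```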
The reward structure forces Alice to focus on horizons: her reward is maximized when she can do something quickly that Bob cannot do at all. Considering the synchrony of policy updates for each agent, we presume that the goal set by Alice is not far out of Bob’s current reach.
However, before Bob operates in the environment, the environment is randomized (e.g., object frictions are perturbed or robot torques are changed). Alice, who operates in the reference environment (the environment given as the "default"), tries to find goals that are easy in the reference environment but difficult in the randomized ones. Since the randomizations themselves are prescribed by the ADR particles, training the ADR particles with Alice's reward (i.e., Equation 1 evaluated separately for each randomization tested) yields co-evolution at both curriculum levels. The curricula in goal and environment space grow in difficulty simultaneously, leading to state-of-the-art performance in goal-directed, real-world robotic tasks.
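The co-evolution described above can be sketched as the following toy loop, where the rollout functions are random stand-ins for the actual simulator and DDPG policies; only the reward bookkeeping mirrors the method:

```python
import random

def alice_rollout():
    """Alice acts in the reference environment; returns (goal, t_alice)."""
    return random.random(), random.randint(1, 10)

def bob_rollout(goal, env_params):
    """Bob attempts Alice's goal in a randomized environment; returns t_bob."""
    return random.randint(1, 20)

def ss_adr_step(particles, gamma_s=0.1):
    """One outer-loop step: each particle proposes a randomized environment,
    and Alice's reward (Equation 1), computed against Bob's time in that
    environment, is also the reward used to train the particle."""
    particle_rewards = []
    for xi in particles:
        goal, t_a = alice_rollout()             # goal set in the reference env
        t_b = bob_rollout(goal, env_params=xi)  # goal attempted in the randomized env
        r_alice = gamma_s * max(0, t_b - t_a)   # Equation 1, per randomization
        particle_rewards.append(r_alice)
    return particle_rewards
```

Because the same quantity rewards both Alice and the particles, goals and environments that are easy in the reference setting but hard in the randomized one are jointly reinforced, which is the coupling the section describes.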
Across all experiments, all networks share the same architecture and hyperparameters. For the Alice and Bob policies, we use Deep Deterministic Policy Gradients. For each network (including the one that signals Alice's STOP action), we use a multi-layer perceptron with two hidden layers of 300 neurons each. All networks use the Adam optimizer
with standard hyperparameters from the PyTorch implementation (https://pytorch.org/). We use the same learning rate, discount factor, reward scaling factor, and number of ADR/SVPG particles throughout. In all our self-play experiments, we consider 1 million unlabelled self-play interactions and plot mean learning curves across 4 seeds. (This is unlike the original self-play results, where the x-axis counts only labeled interactions.) All of the corresponding code and experiments can be found in the supplementary material.
ErgoReacher: a 4-DoF robotic arm where the end-effector has to reach the goal (Figure 2).
ErgoPusher: a robotic arm that has to push a puck to a goal position.
For the sim-to-real experiments, we recreated the simulation environment on the real Poppy Ergo Jr robots shown in the corresponding figures. All simulated experiments are run across 4 seeds. We evaluate the policy on (a) the default environment and (b) an intuitively hard environment lying outside the training domain, every 5,000 timesteps, for a total of 200 evaluations over 1 million timesteps. Unlike the original self-play framework, we do not explicitly train Bob on the target task with extrinsic rewards from the environment. Instead, we evaluate the policy trained only with intrinsic rewards, making the approach completely self-supervised.
We compare our method against two different baselines:
Uniform Domain Randomization (UDR): Our first baseline is UDR, which generates a multitude of tasks by uniformly sampling parameters from a given range. The environment space generated by UDR is unstructured, and its difficulty varies greatly. No goal-space curriculum is considered here.
Unsupervised Default: We use the self-play framework to generate a naturally growing curriculum of goals as our second baseline. Here, only the goal curriculum (and not the coupled environment curriculum) is considered.
We explore the significance of SS-ADR's performance on the ErgoPusher and ErgoReacher tasks. In the ErgoPusher task, we vary the puck damping and puck friction. To create an intuitively hard environment, we lower the values of these two parameters, which creates an "icy" surface, ensuring that the puck needs to be hit carefully to complete the difficult task.
For the ErgoReacher task, we increase the number of randomization dimensions, making it hard to intuitively infer environment complexity. However, for demonstration purposes, we create an intuitively hard environment by assigning extremely low torques and gains to each joint. We adapt the parameter ranges from the GitHub repository accompanying the ADR paper.
From Figures 4 and 6, we see that both Unsupervised-Default and SS-ADR significantly outperform UDR, both in terms of variance and average final distance. This highlights that the uniform sampling of UDR can lead to unpredictable and inconsistent behaviour. To isolate the benefit of the environment-goal curriculum over a goal-only curriculum, we evaluate on the intuitively hard environments (outside the training parameter distribution, as described above). From Figures 5 and 7, we see that our method, SS-ADR, which co-evolves the environment and goal curricula, outperforms Unsupervised-Default, which omits the environment curriculum. This shows that the coupled curriculum enables stronger generalization than the standard self-play formulation.
In this section, we explore the zero-shot transfer performance of the policies trained in the simulator. To test our policies on real robots, we take the four independently trained policies for both ErgoReacher and ErgoPusher and deploy them onto the real robots without any fine-tuning. We roll out each policy per seed for 25 independent trials and report the average final distance across those trials. To evaluate generalization, we change the task definitions (and therefore the MDPs): the puck friction (across low, high, and standard friction in a box-pushing environment) in the case of ErgoPusher, and the joint torques (across a wide spectrum of values) for ErgoReacher. In general, lower values in both settings correspond to harder tasks, due to the construction of the robot and the intrinsic difficulty of the task itself.
From Figures 8 and 9, we see that SS-ADR outperforms both baselines in terms of accuracy and consistency, leading to robust performance across all environment variants tested. Zero-shot policy transfer is a difficult and potentially dangerous setting, meaning that low spread (i.e., consistent performance) is required for deployed robotic RL agents. As the plots show, simulation alone is not enough (leading to the poor performance of UDR), while self-play alone sometimes fails to generate curricula that allow for strong, generalizable policies. By utilizing both methods together and co-evolving the two curriculum spaces, we obtain multiplicative benefits over using curriculum learning in each space separately.
In this work, we proposed Self-Supervised Active Domain Randomization (SS-ADR), which co-evolves curricula in a joint goal-environment task space to create strong, robust policies that can transfer zero-shot onto real world robots. Our method requires no evaluation of training environment reward functions, and learns this joint curriculum entirely through self-play. SS-ADR is a feasible approach to train new policies in goal-directed RL settings, and outperforms all baselines in both tasks (in simulated and real variants) tested.
The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Quebec (FQRNT), Calcul Quebec, Compute Canada, the Canada Research Chairs, Canadian Institute for Advanced Research (CIFAR) and Nvidia for donating a DGX-1 for computation. BM would like to thank IVADO for financial support. FG would like to thank MITACS for their funding and support.