Code for 'Dynamics-Aware Unsupervised Discovery of Skills' (DADS). Enables skill discovery without supervision, which can be combined with model-based control.
Reinforcement learning provides a general framework for learning robotic skills while minimizing engineering effort. However, most reinforcement learning algorithms assume that a well-designed reward function is provided, and learn a single behavior for that single reward function. Such reward functions can be difficult to design in practice. Can we instead develop efficient reinforcement learning methods that acquire diverse skills without any reward function, and then repurpose these skills for downstream tasks? In this paper, we demonstrate that a recently proposed unsupervised skill discovery algorithm can be extended into an efficient off-policy method, making it suitable for performing unsupervised reinforcement learning in the real world. Firstly, we show that our proposed algorithm provides substantial improvement in learning efficiency, making reward-free real-world training feasible. Secondly, we move beyond the simulation environments and evaluate the algorithm on real physical hardware. On quadrupeds, we observe that locomotion skills with diverse gaits and different orientations emerge without any rewards or demonstrations. We also demonstrate that the learned skills can be composed using model predictive control for goal-oriented navigation, without any additional training.
Reinforcement learning (RL) has the potential of enabling autonomous agents to exhibit intricate behaviors and solve complex tasks from high-dimensional sensory input without hand-engineered policies or features [54, 49, 37, 35, 17]. These properties make this family of algorithms particularly applicable to the field of robotics where hand-engineering features and control policies have proven to be challenging and difficult to scale [30, 29, 48, 28, 16, 26]. However, applying RL to real-world robotic problems has not fully delivered on its promise. One of the reasons for this is that the assumptions that are required in a standard RL formulation are not fully compatible with the requirements of real-world robotics systems. One of these assumptions is the existence of a ground truth reward signal, provided as part of the task. While this is easy in simulation, in the real world this often requires special instrumentation of the setup, as well as the ability to reset the environment after every learning episode, which often requires tailored reset mechanisms or manual labor. If we could relax some of these assumptions, we may be able to fully utilize the potential of RL algorithms in real-world robotic problems.
In this context, recent work in unsupervised learning becomes relevant for robotics: we can learn skills without any external reward supervision and then repurpose those skills to solve downstream tasks using only a limited amount of interaction. Of course, when learning skills without any reward supervision, we have limited control over the kinds of skills that emerge. Therefore, it is critical for unsupervised skill learning frameworks to optimize for diversity, so as to produce a repertoire of skills large enough that potentially useful skills are likely to be part of it. In addition, such a framework needs to offer the user some degree of control over the dimensions along which the algorithm explores. Prior works in unsupervised reinforcement learning [43, 38, 45, 2, 13, 14, 52] have demonstrated that interesting behaviors can emerge from reward-free interaction between the agent and environment. In particular, [13, 14, 52] demonstrate that the skills learned from such unsupervised interaction can be harnessed to solve downstream tasks. However, due to their sample-inefficiency, these prior works in unsupervised skill learning have been restricted to simulation environments (with a few exceptions such as Baranes and Oudeyer, Pong et al., and Lee et al.), and the feasibility of running them on real robots remains unexplored.
In this paper, we address the limiting sample-inefficiency challenges of previous reward-free, mutual-information-based learning methods and demonstrate that it is indeed feasible to carry out unsupervised reinforcement learning for acquisition of robotic skills. To this end, we build on the work of Sharma et al.  and derive a sample-efficient, off-policy version of a mutual-information-based, reward-free RL algorithm, Dynamics-Aware Discovery of Skills (DADS), which we refer to as off-DADS. Our method uses a mutual-information-based objective for diversity of skills and specification of task-relevant dimensions (such as x-y location of the robot) to specify where to explore. Moreover, we extend off-DADS to be able to efficiently collect data on multiple robots, which together with the efficient, off-policy nature of the algorithm, makes reward-free real-world robotics training feasible. We evaluate the asynchronous off-DADS method on D’Kitty, a compact cost-effective quadruped, from the ROBEL robotic benchmark suite . We demonstrate that diverse skills with different gaits and navigational properties can emerge, without any reward or demonstration. We present simulation experiments that indicate that our off-policy algorithm is up to 4x more efficient than its predecessor. In addition, we conduct real-world experiments showing that the learned skills can be harnessed to solve downstream tasks using model-based control, as presented in .
Our work builds on a number of recent works [16, 32, 26, 36, 19, 41] that study end-to-end reinforcement learning of neural policies on real-world robot hardware, which poses significant challenges such as sample-efficiency, reward engineering and measurements, resets, and safety [16, 11, 57]. Gu et al. , Kalashnikov et al. , Haarnoja et al. , Nagabandi et al.  demonstrate that existing off-policy and model-based algorithms are sample efficient enough for real world training of simple manipulation and locomotion skills given reasonable task rewards. Eysenbach et al. , Zhu et al.  propose reset-free continual learning algorithms and demonstrate initial successes in simulated and real environments. To enable efficient reward-free discovery of skills, our work aims to address the sample-efficiency and reward-free learning jointly through a novel off-policy learning framework.
Reward engineering has been a major bottleneck not only in robotics, but also in general RL domains. There are two kinds of approaches to alleviate this problem. The first kind involves recovering a task-specific reward function with alternative forms of specifications, such as inverse RL [42, 1, 58, 22] or preference feedback ; however, these approaches still require non-trivial human effort. The second kind proposes an intrinsic motivation reward that can be applied to different MDPs to discover useful policies, such as curiosity for novelty [50, 43, 51, 6, 44, 9], entropy maximization [23, 46, 33, 15], and mutual information [27, 25, 10, 14, 13, 38, 52]. Ours extends the dynamics-based mutual-information objective from Sharma et al.  to sample-efficient off-policy learning.
An off-policy extension of DADS poses challenges beyond those in standard RL [47, 24, 56, 39]. Since we learn an action abstraction that can be related to a low-level policy in hierarchical RL (HRL) [55, 53, 31, 4, 40], we encounter difficulties similar to those in off-policy HRL [40, 34]. We took inspiration from the techniques introduced in  for stable off-policy learning of a high-level policy; however, on top of the non-stationarity in the policy, we also need to deal with the non-stationarity in the reward function, as our DADS rewards are continually updated during policy learning. We successfully derive a novel off-policy variant of DADS that exhibits stable and sample-efficient learning.
In this section, we set up the notation to formally introduce the reinforcement learning problem and the algorithmic foundations of our proposed approach. We work in a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ denotes the state space of the agent, $\mathcal{A}$ denotes the action space of the agent, $p(s_{t+1} \mid s_t, a_t)$ denotes the underlying (stochastic) dynamics of the agent-environment, which can be sampled starting from the initial state distribution $p_0(s_0)$, and $r(s_t, a_t)$ is a reward function. The goal of the optimization problem is to learn a controller $\pi(a_t \mid s_t)$ which maximizes $\mathbb{E}\left[\sum_t \gamma^t r(s_t, a_t)\right]$ for a discount factor $\gamma \in [0, 1)$.
Off-policy methods are known to be suitable for reinforcement learning on robots. At a high level, these algorithms estimate $Q^\pi(s_t, a_t) = \mathbb{E}\left[\sum_{i \geq t} \gamma^{i-t} r(s_i, a_i)\right]$, where the expectation is taken over state-action trajectories generated by executing the policy $\pi$ in the MDP after taking action $a_t$ in the state $s_t$. Crucially, $Q^\pi$ can be estimated from data collected by arbitrary policies using temporal-difference learning (hence off-policy learning). For continuous or large discrete action spaces, a parametric policy can be updated to maximize $Q^\pi$, which can be done approximately using stochastic gradient descent when $Q^\pi$ is differentiable with respect to $a$ [35, 21, 18]. While off-policy methods differ in the specifics of each step, they alternate between estimating $Q^\pi$ and updating $\pi$ using $Q^\pi$ until convergence. The ability to use trajectories sampled from arbitrary policies enables these algorithms to be sample efficient.
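As a minimal illustration of estimating action values from data gathered by an arbitrary policy, the sketch below runs tabular Q-learning (one particular temporal-difference method, chosen here for brevity) on a toy three-state chain. The MDP, the learning rate, and the function name are all assumptions made for this example and are not taken from the paper.

```python
import random

def q_learning_from_logged_data(num_samples=2000, gamma=0.9, lr=0.5, seed=0):
    """Tabular off-policy TD learning on a hypothetical 3-state chain.

    States 0 and 1 are non-terminal, state 2 is terminal.
    Action 0 moves left (clamped at state 0), action 1 moves right.
    Reaching state 2 gives reward 1; every other transition gives 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    for _ in range(num_samples):
        # Data comes from a uniformly random behavior policy,
        # not from the greedy policy whose values are being learned.
        s, a = rng.randrange(2), rng.randrange(2)
        s_next = max(s - 1, 0) if a == 0 else s + 1
        done = s_next == 2
        reward = 1.0 if done else 0.0
        bootstrap = 0.0 if done else max(q[s_next])
        # Temporal-difference update toward the one-step target.
        q[s][a] += lr * (reward + gamma * bootstrap - q[s][a])
    return q

q_table = q_learning_from_logged_data()
greedy_actions = [row.index(max(row)) for row in q_table]
```

Although every transition was generated by random actions, the greedy policy read off the learned table moves right in both states, which is the sense in which off-policy methods can reuse data from other policies.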
In an unsupervised learning setup, we assume an MDP without any reward function $r$, retaining the previous definitions and notations. The objective is to systematically acquire a diverse set of behaviors using autonomous exploration, which can subsequently be used to solve downstream tasks efficiently. To this end, a skill space $\mathcal{Z}$ is defined such that a behavior $z \in \mathcal{Z}$ is defined by the policy $\pi(a \mid s, z)$. To learn these behaviors in a reward-free setting, the information-theoretic concept of mutual information is generally employed. Intuitively, the mutual information $\mathcal{I}(x; y)$ between two random variables $x, y$ is high when, given $x$, the uncertainty in the value of $y$ is low, and vice versa. Formally,
$$\mathcal{I}(x; y) = \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy.$$
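For discrete variables the integral becomes a sum, which makes the intuition easy to check numerically. The sketch below is illustrative only; the two joint distributions are made-up examples, one where the variables are independent and one where each variable fully determines the other.

```python
import math

def mutual_information(joint):
    """Mutual information I(x; y) in nats for a discrete joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal over x
    py = [sum(col) for col in zip(*joint)]      # marginal over y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing x says nothing about y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # knowing x determines y

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table yields zero mutual information, while the fully dependent table yields log 2 nats, the entropy of a fair binary variable.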
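Dynamics-aware Discovery of Skills (DADS) uses the concept of mutual information to encourage skill discovery with predictable consequences. It uses the conditional mutual information $\mathcal{I}(s'; z \mid s)$ to motivate the algorithm, where $s'$ denotes the next state observed after executing the behavior $z$ from the state $s$. The joint distribution can be factorized as $p(z, s, s') = p(z)\, p(s \mid z)\, p(s' \mid s, z)$, where $p(z)$ denotes the prior distribution over $\mathcal{Z}$, $p(s \mid z)$ denotes the stationary distribution induced by $\pi(a \mid s, z)$ under the MDP, and $p(s' \mid s, z) = \int p(s' \mid s, a)\, \pi(a \mid s, z)\, da$ denotes the transition dynamics. The conditional mutual information can be written as
$$\mathcal{I}(s'; z \mid s) = \mathbb{E}_{z, s, s' \sim p}\left[\log \frac{p(s' \mid s, z)}{p(s' \mid s)}\right].$$
At a high level, optimizing for $\mathcal{I}(s'; z \mid s)$ encourages $\pi$ to generate trajectories such that $s'$ can be determined from $(s, z)$ (predictability) and simultaneously encourages $\pi$ to generate trajectories where $s'$ cannot be determined well from $s$ without $z$ (diversity). Note that computing $\mathcal{I}(s'; z \mid s)$ is intractable due to the intractability of $p(s' \mid s, z)$ and $p(s' \mid s)$. However, one can motivate the following reinforcement learning maximization for $\pi$ using variational inequalities and approximations, as discussed by Sharma et al.:
$$J(\pi) = \mathbb{E}_{z, s, s' \sim p}\left[r(s, z, s')\right], \qquad r(s, z, s') = \log \frac{q_\phi(s' \mid s, z)}{\sum_{i=1}^{L} q_\phi(s' \mid s, z_i)} + \log L$$
for $z_i \sim p(z)$, where $q_\phi$ maximizes
$$J(q_\phi) = \mathbb{E}_{z, s, s' \sim p}\left[\log q_\phi(s' \mid s, z)\right].$$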
Sharma et al. propose an on-policy alternating optimization: at iteration $i$, collect a batch $\mathcal{B}^{(i)}$ of trajectories from the current policy $\pi^{(i)}$ to simulate samples from $p(z, s, s')$, update $q_\phi$ on $\mathcal{B}^{(i)}$ using stochastic gradient descent to approximately maximize $J(q_\phi)$, label the transitions with the reward $r(s, z, s')$, and update $\pi$ on $\mathcal{B}^{(i)}$ using any reinforcement learning algorithm to approximately maximize $J(\pi)$. Note that the optimization encourages the policy to produce behaviors predictable under $q_\phi(s' \mid s, z)$, while rewarding the policy for producing diverse behaviors for different $z$. This can be seen from the definition of $r(s, z, s')$: the numerator will be high when the transition $s \to s'$ has a high log probability under the current skill $z$ (high $q_\phi(s' \mid s, z)$ implies high predictability), while the denominator will be lower if the transition has low probability under $z_i$ (low $q_\phi(s' \mid s, z_i)$ implies $q_\phi$ is expecting a different transition under the skill $z_i$).
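The reward described above can be sketched for a toy case. The example assumes a hypothetical one-dimensional skill-dynamics model in which skill z shifts the state by z with Gaussian noise, a uniform prior on [-1, 1], and 100 prior samples; none of these values come from the paper. The reward is written as the log-likelihood under the executed skill minus the log of the average likelihood under prior samples, which equals the ratio form plus log L.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def log_q(s_next, s, z, std=0.1):
    """Toy skill dynamics (an assumption): skill z shifts a 1-D state by z."""
    return log_normal_pdf(s_next, s + z, std)

def dads_reward(s, z, s_next, prior_samples):
    """log q(s'|s,z) - log( (1/L) * sum_i q(s'|s,z_i) ), using log-sum-exp for stability."""
    log_num = log_q(s_next, s, z)
    logs = [log_q(s_next, s, z_i) for z_i in prior_samples]
    m = max(logs)
    log_den = m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(len(logs))
    return log_num - log_den

rng = random.Random(0)
prior_samples = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # hypothetical L = 100

# Transition that matches what the executed skill predicts: s' = s + z.
r_predictable = dads_reward(0.0, 0.8, 0.8, prior_samples)
# Transition that the executed skill does not explain: s' = s - z.
r_unpredictable = dads_reward(0.0, 0.8, -0.8, prior_samples)
```

The predictable transition receives a positive reward because it is far more likely under the executed skill than under a typical skill from the prior, whereas the unexplained transition is heavily penalized.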
Interestingly, the variational approximation $q_\phi(s' \mid s, z)$, called skill dynamics, can be used for model-predictive control. Given a reward function at test time, the sequence of skills $z$ can be determined online using model-predictive control by simulating trajectories using the skill dynamics $q_\phi$.
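A heavily simplified planner illustrates how a learned skill-dynamics model can be used at test time. The sketch assumes a hypothetical model whose mean prediction treats each 2-D skill as a velocity, a fixed grid of candidate skills, and a planner that holds each candidate constant over a short horizon and replans after every step. This exhaustive search is a stand-in chosen for readability, not the planner used in the paper.

```python
import itertools
import math

STEP = 0.1  # hypothetical displacement per unit of skill per step

def predict_next(state, skill):
    """Mean prediction of a toy skill-dynamics model: the skill acts as a 2-D velocity."""
    return (state[0] + STEP * skill[0], state[1] + STEP * skill[1])

def goal_reward(state, goal):
    """Test-time reward supplied by the user: negative distance to the goal."""
    return -math.hypot(state[0] - goal[0], state[1] - goal[1])

def plan_skill(state, goal, candidates, horizon=5):
    """Return the candidate skill with the best simulated return when held for `horizon` steps."""
    best_skill, best_return = None, float("-inf")
    for skill in candidates:
        s, total = state, 0.0
        for _ in range(horizon):
            s = predict_next(s, skill)       # simulate with the model, not the robot
            total += goal_reward(s, goal)
        if total > best_return:
            best_skill, best_return = skill, total
    return best_skill

levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
candidates = list(itertools.product(levels, levels))

state, goal = (0.0, 0.0), (2.0, 1.0)
for _ in range(40):
    skill = plan_skill(state, goal, candidates)
    state = predict_next(state, skill)       # stand-in for executing the skill
final_distance = -goal_reward(state, goal)
```

No policy or model parameters change during planning; only the choice of skill changes, which is why goal navigation requires no additional training.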
The broad goal of this section is to motivate and present the algorithmic choices required for accomplishing reward-free reinforcement learning in the real world. We address the sample efficiency of learning algorithms, which is the main bottleneck to running current unsupervised learning algorithms in the real world. In the same vein, an asynchronous data-collection setup with multiple actors can substantially accelerate real-world execution. We exploit off-policy learning to demonstrate unsupervised learning in the real world, as it allows for both sample-efficient learning and asynchronous data collection through multiple actors.
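We develop the off-policy variant of DADS, which we call off-DADS. For clarity, we can restate $J(\pi)$ in the more conventional form of an expected discounted sum of rewards. Using the definition of the stationary distribution $p(s \mid z) = \sum_{t=0}^{T} \gamma^t p(s_t = s \mid z)$ for a $\gamma$-discounted episodic setting of horizon $T$, we can write:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r(s_t, z, s_{t+1})\right]$$
where the expectation is taken with respect to trajectories generated by $\pi(a \mid s, z)$ for $z \sim p(z)$. This is shown explicitly in Appendix -A. Now, we can write the $Q$-value function as
$$Q^\pi(s_t, a_t, z) = \mathbb{E}\left[\sum_{i \geq t} \gamma^{i-t} r(s_i, z, s_{i+1})\right].$$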
For problems with a fixed reward function, we can use off-the-shelf off-policy reinforcement learning algorithms like soft actor-critic [20, 21] or deep deterministic policy gradient. At a high level, we use the current policy $\pi$ to sample a sequence of transitions from the environment and add it to the replay buffer $\mathcal{R}$. We uniformly sample a batch of transitions from $\mathcal{R}$ and use it to update $\pi$ and $Q^\pi$.
However, in this setup: (a) the reward is non-stationary, as $r(s, z, s')$ depends upon $q_\phi$, which is learned simultaneously with $\pi$ and $Q^\pi$, and (b) learning $q_\phi$ involves maximizing $J(q_\phi)$, which implicitly relies on the current policy $\pi$ and the induced stationary distribution $p(s \mid z)$. For (a), we recompute the reward for the batch using the current iterate $q_\phi$. For (b), we propose two alternative methods:
Use samples from the current policy $\pi$ to maximize $J(q_\phi)$. While this does not introduce any additional bias, it does not take advantage of the off-policy data available in the replay buffer.
Reuse off-policy data while maximizing $J(q_\phi)$.
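To reuse off-policy data for learning $q_\phi$, we have to consider importance sampling corrections, as the data has been sampled from a different distribution. While we can derive an unbiased gradient estimator, as discussed in Appendix -B, we motivate an alternate estimator which is simpler and more stable numerically, albeit biased. Consider the definition of $J(q_\phi)$:
$$J(q_\phi) = \mathbb{E}_{z, s, s' \sim p}\left[\log q_\phi(s' \mid s, z)\right] = \int p(z)\, p(s \mid z)\, \pi(a \mid s, z)\, p(s' \mid s, a) \log q_\phi(s' \mid s, z)\, dz\, ds\, da\, ds'$$
where we have used the fact that $p(s' \mid s, z) = \int p(s' \mid s, a)\, \pi(a \mid s, z)\, da$. Now, consider that the samples have been generated by a behavior policy $\pi_c(a \mid s, z)$. The corresponding generating distribution can be written as $p^{\pi_c}(z, s, a, s') = p(z)\, p_c(s \mid z)\, \pi_c(a \mid s, z)\, p(s' \mid s, a)$, where the prior $p(z)$ over $\mathcal{Z}$ and the dynamics $p(s' \mid s, a)$ are shared across all policies, and $p_c(s \mid z)$ denotes the stationary state distribution induced by $\pi_c$. We can rewrite $J(q_\phi)$ as
$$J(q_\phi) = \int p(z)\, p_c(s \mid z)\, \pi_c(a \mid s, z)\, p(s' \mid s, a)\, \frac{p(s \mid z)\, \pi(a \mid s, z)}{p_c(s \mid z)\, \pi_c(a \mid s, z)} \log q_\phi(s' \mid s, z)\, dz\, ds\, da\, ds'$$
which is equivalent to
$$J(q_\phi) = \mathbb{E}_{z, s, a, s' \sim p^{\pi_c}}\left[\frac{p(s \mid z)\, \pi(a \mid s, z)}{p_c(s \mid z)\, \pi_c(a \mid s, z)} \log q_\phi(s' \mid s, z)\right].$$
Thus, the gradient of $J(q_\phi)$ with respect to $\phi$ can be written as:
$$\nabla_\phi J(q_\phi) = \mathbb{E}_{z, s, a, s' \sim p^{\pi_c}}\left[\frac{p(s \mid z)\, \pi(a \mid s, z)}{p_c(s \mid z)\, \pi_c(a \mid s, z)} \nabla_\phi \log q_\phi(s' \mid s, z)\right] \approx \frac{1}{B} \sum_{i=1}^{B} w_i\, \nabla_\phi \log q_\phi(s'_i \mid s_i, z_i).$$
The estimator is biased because we compute the importance sampling correction as $w_i = \mathrm{clip}\left(\frac{\pi(a_i \mid s_i, z_i)}{\pi_c(a_i \mid s_i, z_i)}, \frac{1}{\alpha}, \alpha\right)$, which ignores the intractable state-distribution correction $\frac{p(s \mid z)}{p_c(s \mid z)}$. This considerably simplifies the estimator while keeping it numerically stable (enhanced by clipping) as compared to the unbiased estimator derived in Appendix -B. In the context of off-policy learning, the bias due to state-distribution shift can be reduced by using a shorter replay buffer.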
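The clipped correction can be sketched in a few lines. The example assumes the action-probability ratio is clipped to the range [1/α, α], a choice consistent with the later statement that a clipping parameter of 1 sets every weight to 1; the batch entries and field names are made up for illustration.

```python
import math

def clipped_weight(log_prob_current, log_prob_behavior, alpha):
    """Importance weight pi(a|s,z) / pi_c(a|s,z), clipped to [1/alpha, alpha] (assumed range)."""
    ratio = math.exp(log_prob_current - log_prob_behavior)
    return min(max(ratio, 1.0 / alpha), alpha)

def weighted_objective(batch, alpha):
    """Batch average of w_i * log q(s'_i | s_i, z_i); weights are treated as constants."""
    total = 0.0
    for t in batch:
        w = clipped_weight(t["log_pi"], t["log_pi_behavior"], alpha)
        total += w * t["log_q"]
    return total / len(batch)

# Hypothetical logged transitions with action-probability ratios of 50, 0.01 and 2.
batch = [
    {"log_pi": math.log(0.5), "log_pi_behavior": math.log(0.01), "log_q": -1.0},
    {"log_pi": math.log(0.01), "log_pi_behavior": math.log(1.0), "log_q": -2.0},
    {"log_pi": math.log(0.4), "log_pi_behavior": math.log(0.2), "log_q": -0.5},
]

weights_alpha_1 = [clipped_weight(t["log_pi"], t["log_pi_behavior"], 1.0) for t in batch]
weights_alpha_10 = [clipped_weight(t["log_pi"], t["log_pi_behavior"], 10.0) for t in batch]
objective_alpha_1 = weighted_objective(batch, 1.0)
```

With a clipping parameter of 1 the objective reduces to the plain average log-likelihood, while a value of 10 lets the ratios of 50 and 0.01 contribute as 10 and 0.1, bounding the variance that extreme ratios would otherwise introduce.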
Our final proposed algorithm is summarized in Algorithm 1. At a high level, we use multiple actors in the environment, which use the latest copy of the policy $\pi$ to collect episodic data. The centralized training script keeps adding new episodes to the shared replay buffer $\mathcal{R}$. When a certain threshold of new experience has been added to $\mathcal{R}$, the buffer is uniformly sampled to train $q_\phi$ to maximize $J(q_\phi)$. To update $\pi$, we sample the buffer uniformly again and compute $r(s, z, s')$ for all the transitions using the latest $q_\phi$. The labelled transitions can then be passed to any off-the-shelf off-policy reinforcement learning algorithm to update $\pi$ and $Q^\pi$.
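The control flow of this loop can be sketched as follows. This is a schematic skeleton under stated assumptions, not the authors' implementation: the episode generator, the thresholds, the batch size, and the counter standing in for a model update are all hypothetical, and the actual gradient steps are only indicated in comments.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity shared buffer; the oldest transitions are discarded first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add_episode(self, episode):
        self.data.extend(episode)

    def sample(self, batch_size, rng):
        return [rng.choice(self.data) for _ in range(batch_size)]

def collect_episode(policy_version, rng, length=20):
    """Stand-in for one actor's rollout; records which policy copy produced it."""
    z = rng.uniform(-1.0, 1.0)  # skill sampled from the prior and held for the episode
    return [{"s": rng.random(), "z": z, "s_next": rng.random(),
             "policy_version": policy_version} for _ in range(length)]

def train(num_episodes=50, new_steps_threshold=100, capacity=1000, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity)
    policy_version, new_steps, updates = 0, 0, 0
    for _ in range(num_episodes):
        # Asynchronous actors may hold a slightly stale copy of the policy.
        actor_version = max(0, policy_version - rng.randint(0, 1))
        episode = collect_episode(actor_version, rng)
        buffer.add_episode(episode)
        new_steps += len(episode)
        if new_steps >= new_steps_threshold:
            q_batch = buffer.sample(32, rng)   # would train skill dynamics on this batch
            pi_batch = buffer.sample(32, rng)  # would be relabelled with the latest reward
            assert len(q_batch) == len(pi_batch) == 32
            policy_version += 1                # stands in for the policy/critic update
            updates += 1
            new_steps = 0
    return updates, len(buffer.data)

updates, buffer_size = train()
_, small_buffer_size = train(capacity=300)
```

The key structural points are that updates are triggered by the amount of new experience rather than by episode boundaries, and that rewards are recomputed at sampling time because the reward model keeps changing.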
In this section, we experimentally evaluate our robotic learning method, off-DADS, for unsupervised skill discovery. First, we evaluate the off-DADS algorithm itself in isolation, on a set of standard benchmark tasks, to understand the gains in sample efficiency when compared to DADS as proposed by Sharma et al., while ablating the role of hyperparameters and variants of off-DADS. Then, we evaluate our robotic learning method on D’Kitty from ROBEL, a real-world robotic benchmark suite. We also provide preliminary results on D’Claw from ROBEL, a manipulation-oriented robotic setup, in Appendix -D.
We benchmark off-DADS and its variants on continuous control environments from OpenAI gym, similar to Sharma et al. We use the HalfCheetah, Ant, and Humanoid environments, with state-dimensionality 18, 29, and 47 respectively. We also consider the setting where the skill-dynamics only observes the global $x$-$y$ coordinates of the Ant. This encourages the agent to discover skills which diversify in the $x$-$y$ space, yielding skills which are more suited for locomotion [13, 52].
To evaluate the performance of off-DADS and the role of hyperparameters, we consider the following variations:
Replay Buffer Size: We consider two sizes for the replay buffer $\mathcal{R}$: 10,000 (s) and 1,000,000 (l). As alluded to, this controls how on-policy the algorithm is. A smaller replay buffer will have lower bias due to state-distribution shift, but can lose sample efficiency as it discards samples faster.
Importance Sampling: We consider two settings for the clipping parameter in the importance sampling correction: $\alpha = 1$ and $\alpha = 10$. The former implies that there is no correction, as all the weights are clipped to 1. This helps evaluate whether the suggested importance sampling correction gives any gains in terms of sample efficiency.
This gives us four variants abbreviated as s1, s10, l1 and l10. We also evaluate against the off-DADS variant where the skill-dynamics is trained on on-policy samples from the current policy. This helps us evaluate whether training skill-dynamics on off-policy data can benefit the sample efficiency of off-DADS. Note, while this ablation helps us understand the algorithm, this scheme would be wasteful of samples in asynchronous off-policy real world training, where the data from different actors could potentially be coming from different (older) policies. Finally, we benchmark against the baseline DADS, as formulated in . The exact hyperparameters for each of the variants are listed in Appendix -C. We record curves for five random seeds for the average intrinsic reward as a function of samples from the environment and report the average curves in Figure 2.
We observe that all variants of off-DADS consistently outperform the on-policy baseline DADS on all the environments. The gain in sample efficiency can be as high as four times, as is the case for the Ant (x-y) environment, where DADS takes 16 million samples to converge to the same levels as shown for off-DADS (about 0.8 average intrinsic reward). We also note that, irrespective of the size of the replay buffer, the importance sampling correction with $\alpha = 10$ outperforms or matches $\alpha = 1$ on all environments. This positively indicates that the devised importance sampling correction makes a better bias-variance trade-off than no importance sampling. The best performing variant on every environment except Ant (x-y) is s10. While training skill-dynamics on-policy provides a competitive baseline, the short replay buffer and the clipped importance sampling counteract the distribution shift enough to benefit the overall sample efficiency of the algorithm. Interestingly, on Ant (x-y), the best performing variant is l10. The long replay buffer variants are slower than the short replay buffer variants but reach a higher average intrinsic reward. This can be attributed to the smaller state space for skill-dynamics (only 2-dimensional): the state-distribution correction required is potentially negligible, while at the same time the off-policy data helps learn better policies.
We now demonstrate that off-DADS can be deployed for real-world reward-free reinforcement learning. To this end, we choose the ROBEL benchmark. In particular, we deploy off-DADS on D’Kitty, shown in Figure 3. D’Kitty is a 12 DOF compact quadruped capable of executing diverse gaits. We also provide preliminary results for D’Claw, a manipulation-oriented setup from ROBEL, in Appendix -D.
To run real-world training, we constructed a walled, cordoned-off area, shown in Figure 4. The area is equipped with PhaseSpace Impulse X2 cameras that are equidistantly mounted along two bounding orthogonal walls. These cameras are connected to a PhaseSpace Impulse X2E motion capture system which performs 6 DOF rigid body tracking of the D’Kitty robots’ chassis at 480Hz. We use two D’Kitty robots for data collection and training in the experiment. To each D’Kitty, we attach one PhaseSpace LED controller which controls 8 active LED markers that are attached to the top surface of the D’Kitty chassis, as shown in Figure 3. Each D’Kitty is tethered via 3 cables: USB serial to the computer running off-DADS, 12V/20A power to the D’Kitty robot, and USB power to the LED controller. To reduce wire entanglement with the robot, we also have an overhead support for the tethering wires.
| Control Mode | Position Control |
| PWM Limit | 450 (50.85%) |
| Voltage Range | 9.5V to 16V |
We first test the off-DADS algorithm variants in simulation. For the D’Kitty observation space, we use the Cartesian position and Euler orientation (3 + 3), joint angles and velocities (12 + 12), the last action executed by the D’Kitty (12) and the upright (1), which is the cosine of the orientation with the global $z$-axis. The concatenated observation space is 43-dimensional. Hyperparameter details for off-DADS (common to all variants) are as follows: The skill space is D with support over . We use a uniform prior $p(z)$ over $\mathcal{Z}$. We parameterize $\pi$ and $q_\phi$ using neural networks with two hidden layers of size . The output of $\pi$ is parameterized by a normal distribution with a diagonal covariance which is scaled to using transformation. For $q_\phi$, we reduce the observation space to the D’Kitty $x$-$y$ coordinates. This encourages skill discovery for locomotion behaviors [52, 13]. We parameterize $q_\phi$ to predict $\Delta s = s' - s$, a general trick in model-based control which does not cause any loss in representational power, as the next state can be recovered by adding the prediction to the current state. We use soft actor-critic to optimize $\pi$. To learn $q_\phi$, we sample batches of size and use the Adam optimizer with a fixed learning rate of for steps. For soft actor-critic, we again use the Adam optimizer with a fixed learning rate of while sampling batches of size for steps. Discount factor , with a fixed entropy coefficient of . For computing the DADS reward $r(s, z, s')$, we set samples from the prior $p(z)$. We set the episode length to be , which terminates prematurely if the upright coefficient falls below (that is, the D’Kitty is tilting more than 25 degrees from the global $z$-axis).
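A few of the quantities in this setup can be checked with simple arithmetic. The sketch below sums the listed observation components, evaluates the upright coefficient at the 25-degree tilt quoted above (the exact termination threshold is not reproduced here), and shows how a next state is recovered from a predicted state difference. The function names and the sample coordinates are illustrative assumptions.

```python
import math

# Observation components as listed in the text.
OBSERVATION_PARTS = {
    "cartesian_position": 3,
    "euler_orientation": 3,
    "joint_angles": 12,
    "joint_velocities": 12,
    "last_action": 12,
    "upright": 1,
}
observation_dim = sum(OBSERVATION_PARTS.values())

def upright_coefficient(tilt_degrees):
    """Cosine of the tilt angle between the robot's vertical axis and the global z-axis."""
    return math.cos(math.radians(tilt_degrees))

def recover_next_state(state_xy, predicted_delta):
    """Recover s' from a model that predicts the difference s' - s."""
    return tuple(s + d for s, d in zip(state_xy, predicted_delta))

upright_at_25_degrees = upright_coefficient(25.0)
next_xy = recover_next_state((0.5, -0.2), (0.1, 0.05))
```

The component sizes add up to the stated 43 dimensions, and a 25-degree tilt corresponds to an upright coefficient of roughly 0.906.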
In terms of off-DADS variants, we evaluate the four variants discussed in the previous section. For all the variants, we collect at least steps in the simulation before updating $q_\phi$ and $\pi$. The observations for the variants resemble those of the Ant (x-y) environment. We observe that the variants with a replay buffer of size 10,000 learn much faster than those with a replay buffer of size 1,000,000. Asymptotically, however, we observe that the long replay buffer outperforms the short replay buffer. We also observe that setting $\alpha = 10$ benefits sample efficiency.
For the real robotic experiment, we choose the replay buffer to be of size 10,000 and we set $\alpha = 10$. While asymptotically better performance is nice, we prioritized sample efficiency. For the real experiment, we slightly modify the collection condition. For every update of $q_\phi$ and $\pi$, we ensure there are new steps and at least new episodes in the replay buffer $\mathcal{R}$.
With the setup and hyperparameters described in the previous sections, we run the real-world experiment. The experiment was run over 3 days, with the effective training time on the robot being 20 hours (including time spent in maintaining the hardware). We collected around samples in total as shown in the learning curve in Figure 6. We capture the emergence of locomotion skills in our video supplement. Figure 1 and Figure 7 show some of the diversity which emerges in skills learned by D’Kitty using off-DADS, in terms of orientation and gaits.
Broadly, the learning occurs in the following steps: (a) D’Kitty first tries to learn how to stay upright to prolong the length of the episode. This happens within the first hour of training. (b) It spends the next few hours trying to move around while trying to stay upright. During these few hours, the movements are mostly random and the intrinsic reward is relatively low, as they do not correlate well with $z$. (c) About 5-6 hours into the training, it starts showing a systematic gait which it uses to move in relatively random directions. This is when the intrinsic reward starts to rise. (d) After a few more hours of training, this gait is exploited to predictably move in specific directions. At this point the reward starts rising rapidly as it starts diversifying the directions the agent can move in predictably. Interestingly, D’Kitty uses two different gaits to capture and grow in two different directions of motion, which can be seen in the video supplement. (e) At about 16 hours of training, it can reliably move in different directions and it is trying to further increase the directions it can move in predictably. Supplementary videos are available here: https://sites.google.com/view/dads-skill
One interesting difference from simulation where the D’Kitty is unconstrained, is that the harnesses and tethering despite best attempts restrain the movement of the real robot to some extent. This encourages the agent to invest in multiple gaits and use simpler, more reliable motions to move in different directions.
We discuss some of the challenges encountered during real-world reinforcement learning, particularly in context of locomotive agents.
Reset & Autonomous operation: A good initial state distribution is necessary for the exploration to proceed towards the desirable state distribution. In the context of locomotion, a good reset consists of being in an upright position and relocating away from the extremities of the area. For the former, we tried two reset mechanisms: (a) a scripted mechanism, which is shown in the supplementary video, and (b) a reset detector which would continue training if the D’Kitty was upright (based on height and tilt with the z-axis), and otherwise wait for a human to reset it. However, (a), being programmatic, is not robust and does not necessarily succeed in every configuration, in addition to being slow. (b) can be really fast considering that D’Kitty is reasonably compact, but requires human oversight. Despite human oversight, the reset detector can falsely assume the reset is complete and initiate the episode, which requires an episode filter to be written. Relocating from the extremities back to the center is a harder challenge. It is important because the tracker becomes noisy in those regions, which also curbs the exploration of the policy. However, this problem only arises when the agent shows significant skills to navigate. There are other challenges besides reset which mandate human oversight of the operation. Primarily, random exploration can be tough on the robot, requiring maintenance in terms of tightening of screws and sometimes replacing motors. We found that the latter can be avoided by keeping the motor PWMs low (450 is good). While we make progress towards reward-free learning in this work, we leave it to future work to resolve problems on the way to fully autonomous learning.
Constrained space: For the full diversity of skills to emerge, an ideal operation would have unconstrained space with accurate tracking. However, realistically that is infeasible. Moreover, real operation adds unmodelled constraints over simulation environments. For example, the use of a harness to reduce wire entanglement with the robot adds tension depending on the location. When operating multiple D’Kitties in the same space, there can be collisions during the operation. The likelihood of such an event grows progressively through the training. About two-fifths through the training, we started collecting with only one D’Kitty in the space to avoid future collisions. Halfway through the training, we decided to expand the area to its limits and re-calibrate our tracking system for the larger area. Despite the expansion, we were still short on space for the operation of just one D’Kitty. To remedy the situation, we started decreasing the episode length. We went from 200 steps to 100, then to 80 and then to 60. However, we observed that short episodes started affecting the training curve adversely (at about 300k samples). Constrained by the physical limits, we finished the training. While a reasonable skill diversity already emerges in terms of gaits and orientations within the training we conduct, as shown in Figure 1, more skills should be discovered with more training (as suggested by the simulation results as well as the fact that the reward curve has not converged). Nonetheless, we made progress towards conveying the larger point that reward-free learning can be realized in the real world.
| Percentage of Falls | 5% |
Qualitatively, we see that a diverse set of locomotive skills can emerge from reward-free training. Moreover, as discussed by Sharma et al., these skills can be harnessed for downstream tasks using model-based control on the learned skill-dynamics $q_\phi$. First, we partly quantify the learned skills in Figure 8. We execute skills randomly sampled from the prior and collect statistics for these runs. In particular, we find that despite limited training, the skills are relatively robust and fall in only 5% of the runs, despite being proficient in covering distance. Interestingly, the learned skills can also be harnessed for model-based control, as shown in Figure 9. The details for model-predictive control follow directly from Sharma et al., which explains how to do so using the skill-dynamics $q_\phi$ and the policy $\pi$. We have included video supplements showing model-predictive control in the learned skill space for goal navigation.
In this work, we derived off-DADS, a novel off-policy variant of a mutual-information-based, reward-free reinforcement learning framework. The improved sample efficiency from off-policy learning enabled the algorithm to be applied on real hardware, a quadruped with 12 DoFs, to learn various locomotion gaits in under 20 hours without human-designed reward functions or hard-coded primitives. Given our dynamics-based formulation from Sharma et al., we further demonstrate that the acquired skills are directly useful for solving downstream tasks such as navigation using online planning with no further learning. We detail the successes and challenges encountered in our experiments, and hope our work can offer an important foundation toward the goal of unsupervised, continual reinforcement learning of robots in the real world for many days with zero human intervention.
Proceedings of the twenty-first international conference on Machine learning, pp. 1.
Thirty-First AAAI Conference on Artificial Intelligence.
A divergence minimization perspective on imitation learning methods. Conference on Robot Learning (CoRL).