An Imitation from Observation Approach to Sim-to-Real Transfer

08/04/2020 ∙ by Siddarth Desai, et al. ∙ The University of Texas at Austin

The sim-to-real transfer problem deals with leveraging large amounts of inexpensive simulation experience to help artificial agents learn behaviors intended for the real world more efficiently. One approach to sim-to-real transfer is using interactions with the real world to make the simulator more realistic, called grounded sim-to-real transfer. In this paper, we show that a particular grounded sim-to-real approach, grounded action transformation, is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. After establishing this relationship, we hypothesize that recent state-of-the-art approaches from the IfO literature can be effectively repurposed for such grounded sim-to-real transfer. To validate our hypothesis, we derive a new sim-to-real transfer algorithm, generative adversarial reinforced action transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several simulation domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the real world compared to existing black-box sim-to-real methods.




1 Introduction

In the robot learning community, sim-to-real approaches seek to leverage inexpensive simulation experience to more efficiently learn control policies that perform well in the real world. This paradigm allows us to utilize powerful machine learning techniques without extensive real-world testing, which can be expensive, time-consuming, and potentially dangerous. Sim-to-real transfer has been used effectively to learn a fast humanoid walk Hanna and Stone [2017], dexterous manipulation OpenAI et al. [2019], and agile locomotion skills Peng et al. [2020]. In this work, we focus on the paradigm of simulator grounding Farchy et al. [2013]; Hanna and Stone [2017]; Chebotar et al. [2019], which modifies a simulator’s dynamics to more closely match the real world dynamics using some real world data. Policies then learned in such a grounded simulator transfer better to the real world.

Separately, the machine learning community has also devoted attention to imitation learning Bakker and Kuniyoshi [1996], i.e. the problem of learning a policy to mimic demonstrations provided by another agent. In particular, recent work has considered the specific problem of imitation from observation (IfO) Liu et al. [2018], in which an imitator mimics the expert’s behavior without knowing which actions the expert took, only the outcomes of those actions (i.e. state-only demonstrations). While the lack of action information presents an additional challenge, recently-proposed approaches have suggested that this challenge may be addressable Torabi et al. [2018a, 2019].

In this paper, we show that a particular grounded sim-to-real technique, called grounded action transformation (gat) Hanna and Stone [2017], can be seen as a form of IfO. We therefore hypothesize that recent, state-of-the-art approaches for addressing the IfO problem might also be effective for grounding the simulator, leading to improved sim-to-real transfer. Specifically, we derive a distribution-matching objective similar to ones that have been used with considerable empirical success in adversarial approaches for generative modeling Goodfellow et al. [2014], imitation learning Ho and Ermon [2016], and IfO Torabi et al. [2018b]. Based on this objective, we propose a novel algorithm, generative adversarial reinforced action transformation (garat), to ground the simulator by reducing the distribution mismatch between the simulator and the real world.

Our experiments confirm our hypothesis by showing that garat reduces the difference in the dynamics between two environments more effectively than gat. Moreover, our experiments show that, in several domains, this improved grounding translates to better transfer of policies from one environment to the other.

In summary, our contributions are as follows: (1) we show that grounded action transformation can be seen as an IfO problem, (2) we derive a novel adversarial imitation learning algorithm, garat, to learn an action transformation policy for sim-to-real transfer, and (3) we experimentally evaluate the efficacy of garat for sim-to-real transfer.

2 Background

We begin by introducing notation, reviewing the sim-to-real problem formulation, and describing the action transformation approach for sim-to-real transfer. We also provide a brief overview of imitation learning and imitation from observation.

2.1 Notation

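We consider here sequential decision processes formulated as Markov decision processes (MDPs) Sutton and Barto [2018]. An MDP $\mathcal{M}$ is a tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma, \rho_0 \rangle$ consisting of a set of states, $\mathcal{S}$; a set of actions, $\mathcal{A}$; a reward function, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \Delta([r_{min}, r_{max}])$ (where $\Delta([r_{min}, r_{max}])$ denotes a distribution over the interval $[r_{min}, r_{max}] \subset \mathbb{R}$); a discount factor, $\gamma \in [0, 1)$; a transition function, $T: \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$; and an initial state distribution, $\rho_0 \in \Delta(\mathcal{S})$. An RL agent uses a policy $\pi: \mathcal{S} \mapsto \Delta(\mathcal{A})$ to select actions in the environment. In an environment with transition function $T \in \mathcal{T}$, the agent aims to learn a policy $\pi \in \Pi$ to maximize its expected discounted return $\eta_T[\pi] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$, where $R_t \sim R(s_t, a_t, s_{t+1})$, $s_{t+1} \sim T(s_t, a_t)$, $a_t \sim \pi(s_t)$, and $s_0 \sim \rho_0$.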
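As a small illustration of the discounted return that the agent maximizes, the following sketch evaluates $\sum_t \gamma^t r_t$ for one finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite sequence of sampled rewards."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode with reward 1 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The expected return in the text is the average of this quantity over trajectories sampled from the policy and the transition function.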

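Given a fixed $\pi$ and a specific transition function $T_q$, the marginal transition distribution is $\rho_q(s, a, s') := (1 - \gamma)\, \pi(a|s)\, T_q(s'|s, a) \sum_{t=0}^{\infty} \gamma^t p(s_t = s \mid \pi, T_q)$, where $p(s_t = s \mid \pi, T_q)$ is the probability of being in state $s$ at time $t$. The marginal transition distribution is the probability of being in state $s$ marginalized over time $t$, taking action $a$ under policy $\pi$, and ending up in state $s'$ under transition function $T_q$ (laid out more explicitly in Appendix A). We can denote the expected return under a policy $\pi$ and a transition function $T_q$ in terms of this marginal distribution as:

$$\eta_{T_q}[\pi] = \frac{1}{1 - \gamma} \sum_{s, a, s'} \rho_q(s, a, s')\, \mathbb{E}\left[R(s, a, s')\right] \tag{1}$$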
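To make the marginal transition distribution concrete, the sketch below computes it for a small tabular MDP by accumulating the discounted state occupancy over a truncated horizon. The two-state, two-action MDP, the function name, and the horizon of 2000 steps are assumptions chosen for illustration.

```python
def marginal_transition_distribution(rho0, pi, T, gamma, horizon=2000):
    """Tabular rho[s][a][s2] = (1-gamma) * pi[s][a] * T[s][a][s2] * sum_t gamma^t p(s_t = s)."""
    n_s, n_a = len(rho0), len(pi[0])
    occupancy = [0.0] * n_s  # discounted state occupancy, truncated at `horizon`
    p_t = list(rho0)         # state distribution at the current time step
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_s):
            occupancy[s] += discount * p_t[s]
        # Propagate the state distribution one step under pi and T.
        p_next = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                for s2 in range(n_s):
                    p_next[s2] += p_t[s] * pi[s][a] * T[s][a][s2]
        p_t = p_next
        discount *= gamma
    return [[[(1 - gamma) * occupancy[s] * pi[s][a] * T[s][a][s2]
              for s2 in range(n_s)] for a in range(n_a)] for s in range(n_s)]


# Hypothetical MDP: pi[s][a] and T[s][a][s2] are probabilities.
rho0 = [1.0, 0.0]
pi = [[0.5, 0.5], [0.2, 0.8]]
T = [[[0.9, 0.1], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]]
rho = marginal_transition_distribution(rho0, pi, T, gamma=0.9)
total_mass = sum(x for by_a in rho for by_s2 in by_a for x in by_s2)
```

Because of the $(1 - \gamma)$ factor, the entries sum to one up to the truncation error, so the result can be treated as a probability distribution over transitions.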


2.2 Sim-to-real Transfer and Grounded Action Transformation

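Let $T_{sim}, T_{real} \in \mathcal{T}$ be the transition functions for two otherwise identical MDPs, $\mathcal{M}_{sim}$ and $\mathcal{M}_{real}$, representing the simulator and real world respectively. Sim-to-real transfer aims to train an agent policy $\pi$ to maximize return in $\mathcal{M}_{real}$ with limited trajectories from $\mathcal{M}_{real}$, and as many as needed in $\mathcal{M}_{sim}$.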

The work presented here is specifically concerned with a particular class of sim-to-real approaches known as simulator grounding approaches Allevato et al. [2019]; Chebotar et al. [2019]; Farchy et al. [2013]. These approaches modify the simulator dynamics by using real-world interactions to ground them to be closer to the dynamics of the real world. Because it may sometimes be difficult or impossible to modify the simulator itself, the recently-proposed grounded action transformation (gat) approach Hanna and Stone [2017] seeks to instead induce grounding by modifying the agent’s actions before using them in the simulator. This modification is accomplished via an action transformation function $\pi_g: \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{A})$ that takes as input the state and action of the agent, and produces an action to be presented to the simulator. From the agent’s perspective, composing the action transformation with the simulator changes the simulator’s transition function. We call this modified simulator the grounded simulator, and its transition function is given by

$$T_g(s'|s, a) = \sum_{\tilde{a} \in \mathcal{A}} T_{sim}(s'|s, \tilde{a})\, \pi_g(\tilde{a}|s, a) \tag{2}$$

The action transformation approach aims to learn function $\pi_g \in \Pi_g$ such that the resulting transition function $T_g$ is as close as possible to $T_{real}$. We denote the marginal transition distributions in sim and real by $\rho_{sim}$ and $\rho_{real}$ respectively, and $\rho_g \in \mathcal{P}_g$ for the grounded simulator.
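For finite state and action sets, composing an action transformation with the simulator amounts to averaging the simulator's transition probabilities over the transformed action. The sketch below assumes a tabular setting with hypothetical arrays `T_sim[s][a_tilde][s2]` and `pi_g[s][a][a_tilde]`; the names and numbers are illustrative only.

```python
def grounded_transition(T_sim, pi_g):
    """T_g[s][a][s2] = sum over a_tilde of T_sim[s][a_tilde][s2] * pi_g[s][a][a_tilde]."""
    n_s, n_a = len(T_sim), len(T_sim[0])
    return [[[sum(T_sim[s][at][s2] * pi_g[s][a][at] for at in range(n_a))
              for s2 in range(n_s)] for a in range(n_a)] for s in range(n_s)]


# Hypothetical simulator with two states and two actions.
T_sim = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
# A deterministic action transformation that swaps the two actions in every state.
swap = [[[0.0, 1.0], [1.0, 0.0]] for _ in range(2)]
T_g = grounded_transition(T_sim, swap)
```

With the swapping transformation, the grounded simulator behaves as if the agent's two actions had been exchanged, which shows how the transition function seen by the agent changes without touching the simulator itself.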

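gat learns a model of the real world dynamics $\hat{T}_{real}(s'|s, a)$, an inverse model of the simulator dynamics $\hat{T}^{-1}_{sim}(a|s, s')$, and uses the composition of the two as the action transformation function, i.e. $\pi_g(\tilde{a}|s, a) = \hat{T}^{-1}_{sim}(\tilde{a} \mid s, \hat{T}_{real}(s'|s, a))$.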
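The composition of a forward model and an inverse model can be illustrated with a toy one-dimensional system. In gat both models are learned from data; in this sketch they are written in closed form, and the dynamics (`sim_step`, `real_step`) are invented purely for illustration.

```python
def sim_step(s, a):
    """Hypothetical simulator dynamics: the action is applied at full strength."""
    return s + a


def real_step(s, a):
    """Hypothetical real-world dynamics: the actuator is half as strong."""
    return s + 0.5 * a


def forward_real(s, a):
    """Stand-in for a learned forward model of the real world."""
    return s + 0.5 * a


def inverse_sim(s, s_next):
    """Stand-in for a learned inverse model of the simulator."""
    return s_next - s


def action_transform(s, a):
    """Predict the real next state, then find the simulator action that reaches it."""
    return inverse_sim(s, forward_real(s, a))


s, a = 2.0, 1.0
grounded_next = sim_step(s, action_transform(s, a))
real_next = real_step(s, a)
```

When both models are exact, as assumed here, the simulator driven by the transformed action reproduces the real next state. With learned models the match is only approximate.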

2.3 Imitation Learning

In parallel to advances in sim-to-real transfer, the machine learning community has also made considerable progress on the problem of imitation learning. Imitation learning Bakker and Kuniyoshi [1996]; Ross et al. [2011]; Schaal [1997] is the problem setting where an agent tries to mimic trajectories $\{\tau_0, \tau_1, \ldots\}$ where each $\tau_i$ is a demonstrated trajectory $\{(s_0, a_0), (s_1, a_1), \ldots\}$ induced by an expert policy $\pi_{exp}$.

Various methods have been proposed to address the imitation learning problem. Behavioral cloning Bain and Sammut [1995] uses the expert’s trajectories as labeled data and uses supervised learning to recover the maximum likelihood policy. Another approach instead relies on reinforcement learning to learn the policy, where the required reward function is recovered using inverse reinforcement learning (IRL) Ng et al. [2000]. IRL aims to recover a reward function under which the demonstrated trajectories would be optimal.
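Behavioral cloning, as described above, reduces to supervised learning on state-action pairs. For a tabular policy over discrete states and actions, the maximum likelihood estimate is simply the empirical action frequency in each state. The following minimal sketch assumes that tabular setting; the demonstrations are made up.

```python
from collections import Counter, defaultdict


def behavioral_cloning(demonstrations):
    """Return the maximum likelihood tabular policy: policy[s][a] = count(s, a) / count(s)."""
    counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            counts[state][action] += 1
    return {
        state: {action: n / sum(action_counts.values())
                for action, n in action_counts.items()}
        for state, action_counts in counts.items()
    }


# Hypothetical expert demonstrations as lists of (state, action) pairs.
demos = [
    [("s0", "left"), ("s1", "right")],
    [("s0", "left"), ("s1", "left")],
    [("s0", "right"), ("s1", "right")],
]
cloned_policy = behavioral_cloning(demos)
```

This relies on the demonstrations containing actions. That is exactly the information missing in imitation from observation, which is why that setting needs different techniques.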

A related setting to learning from state-action demonstrations is the imitation from observation (IfO) Liu et al. [2018]; Pavse et al. [2019]; Torabi et al. [2018a, b] problem. Here, an agent observes an expert’s state-only trajectories $\{\zeta_0, \zeta_1, \ldots\}$ where each $\zeta_i$ is a sequence of states $\{s_0, s_1, \ldots\}$. The agent must then learn a policy to imitate the expert’s behavior, without being given labels of which actions to take.

3 gat as Imitation from Observation

We now show that the underlying problem of gat, i.e., learning an action transformation for sim-to-real transfer, can also be seen as an IfO problem. Adapting the definition by Liu et al. [2018], an IfO problem is a sequential decision-making problem where the policy imitates state-only trajectories produced by a Markov process, with no information about what actions generated those trajectories. To show that the action transformation learning problem fits this definition, we must show that it (1) is a sequential decision-making problem and (2) aims to imitate state-only trajectories produced by a Markov process, with no information about what actions generated those trajectories.

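Starting with (1), it is sufficient to show that the action transformation function $\pi_g$ is a policy in an MDP Puterman [1990]. This action transformation MDP can be seen clearly if we combine the target task MDP and the fixed agent policy $\pi$. Let the joint state and action space $\mathcal{X} := \mathcal{S} \times \mathcal{A}$ with $x := (s, a) \in \mathcal{X}$ be the state space of this new MDP. The combined transition function is $T^x_{sim}(x'|x, \tilde{a}) = T_{sim}(s'|s, \tilde{a})\, \pi(a'|s')$, where $x' = (s', a')$, and initial state distribution is $\rho^x_0(x) = \rho_0(s)\, \pi(a|s)$. For completeness, we consider a reward function $R^x: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto \Delta([r_{min}, r_{max}])$ and discount factor $\gamma^x \in [0, 1)$, which are not essential for an IfO problem. With these components, the action transformation environment is an MDP $\langle \mathcal{X}, \mathcal{A}, R^x, T^x_{sim}, \gamma^x, \rho^x_0 \rangle$. The action transformation function $\pi_g(\tilde{a}|s, a)$, now $\pi_g(\tilde{a}|x)$, is then clearly a mapping from states to a distribution over actions, i.e. it is a policy in an MDP. Thus, the action transformation learning problem is a sequential decision-making problem.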
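The construction above can be pictured as an environment wrapper whose state is the pair (simulator state, agent action) and whose action is the transformed action passed to the simulator. The class below is a sketch under simplifying assumptions: deterministic dynamics, a deterministic agent policy, and a reset/step interface invented for illustration.

```python
class ActionTransformationEnv:
    """Environment seen by the action transformation policy.

    State: x = (s, a), where a is the action chosen by the fixed agent policy in s.
    Action: a_tilde, the transformed action actually passed to the simulator.
    """

    def __init__(self, sim_step, agent_policy, initial_state):
        self.sim_step = sim_step          # hypothetical callable: (s, a_tilde) -> s_next
        self.agent_policy = agent_policy  # hypothetical callable: s -> a
        self.initial_state = initial_state
        self.x = None

    def reset(self):
        s = self.initial_state
        self.x = (s, self.agent_policy(s))
        return self.x

    def step(self, a_tilde):
        s, _ = self.x
        s_next = self.sim_step(s, a_tilde)
        # The fixed agent policy picks its next action; it becomes part of the next state.
        self.x = (s_next, self.agent_policy(s_next))
        return self.x


# Toy usage: additive dynamics and an agent that pushes the state toward zero.
env = ActionTransformationEnv(lambda s, a: s + a, lambda s: -s, initial_state=1.0)
x0 = env.reset()
x1 = env.step(-0.5)
```

From the wrapper's point of view the agent policy is part of the environment dynamics, which is why the action transformation function plays the role of a policy.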

We now consider the action transformation objective to show (2). When learning the action transformation policy, we have trajectories $\tau_{real} = \{(s_0, a_0 \sim \pi(s_0)), (s_1, a_1 \sim \pi(s_1)), \ldots\}$, where each trajectory is obtained by sampling actions from agent policy $\pi$ in the real world. Re-writing $\tau_{real}$ in the above MDP, $\tau_{real} = \{x_0, x_1, \ldots\}$. If an expert action transformation policy $\pi^*_g \in \Pi_g$ is capable of mimicking the dynamics of the real world, $T_{real}(s'|s, a) = \sum_{\tilde{a} \in \mathcal{A}} T_{sim}(s'|s, \tilde{a})\, \pi^*_g(\tilde{a}|s, a)$, then we can consider the above trajectories to be produced by a Markov process with dynamics $T^x_{sim}(x'|x, \tilde{a})$ and policy $\pi^*_g(\tilde{a}|s, a)$. The action transformation aims to imitate the state-only trajectories produced by a Markov process, with no information about what actions generated those trajectories.

The problem of learning the action transformation thus satisfies the conditions we identified above, and so it is an IfO problem.

4 Generative Adversarial Reinforced Action Transformation

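Input: Real world with $T_{real}$, simulator with $T_{sim}$, number of update steps $N$
Agent policy $\pi$ with parameters $\eta$, pretrained in simulator;
Initialize action transformation policy $\pi_g$ with parameters $\theta$;
Initialize discriminator $D_\phi$ with parameters $\phi$;
while performance of policy $\pi$ in real world not satisfactory do
       Rollout policy $\pi$ in real world to obtain trajectories $\tau_{real}$;
       for $i = 1, \ldots, N$ do
             Rollout policy $\pi$ in grounded simulator and obtain trajectories $\tau_g$;
             Update parameters $\phi$ of $D_\phi$ using gradient descent to minimize $-\left(\mathbb{E}_{\tau_g}[\log D_\phi(s, a, s')] + \mathbb{E}_{\tau_{real}}[\log(1 - D_\phi(s, a, s'))]\right)$;
             Update parameters $\theta$ of $\pi_g$ using policy gradient with reward $-\log D_\phi(s, a, s')$;
       end for
       Optimize parameters $\eta$ of $\pi$ in simulator grounded with action transformer $\pi_g$;
end while
Algorithm 1 GARAT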

The insight above naturally leads to the following question: if learning an action transformation for sim-to-real transfer is equivalent to IfO, might recently-proposed IfO approaches lead to better sim-to-real approaches? To investigate the answer, we derive a novel generative adversarial approach inspired by gaifo Torabi et al. [2018b] that can be used to train the action transformation policy using IfO. A simulator grounded with this action transformation policy can then be used to train an agent policy which can be expected to transfer effectively to the real world. We call our approach generative adversarial reinforced action transformation (garat), and Algorithm 1 lays out its details.
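The control flow of Algorithm 1 can be sketched as a loop over caller-supplied components. This is only a structural outline: every callable argument below is a hypothetical placeholder for a rollout or learning routine, and nothing here reproduces the authors' implementation.

```python
def garat_loop(rollout_real, rollout_grounded, update_discriminator,
               update_transformer, optimize_agent, is_satisfactory, n_updates):
    """Run the outer and inner loops of Algorithm 1 with caller-supplied components."""
    outer_iterations = 0
    while not is_satisfactory():
        # Collect state transitions by running the agent policy in the real world.
        tau_real = rollout_real()
        for _ in range(n_updates):
            # Run the same agent policy in the simulator grounded by the current transformer.
            tau_grounded = rollout_grounded()
            # Train the discriminator to tell grounded-simulator and real transitions apart.
            update_discriminator(tau_grounded, tau_real)
            # Policy-gradient step on the action transformation policy.
            update_transformer(tau_grounded)
        # Re-train the agent policy inside the grounded simulator.
        optimize_agent()
        outer_iterations += 1
    return outer_iterations
```

Keeping the components as arguments makes the dependency structure explicit: real-world data is collected once per outer iteration, while the discriminator and the action transformation policy alternate updates in the inner loop.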

The rest of this section details our derivation of the objective used in garat. First, in Section 4.1, we formulate a procedure for action transformation using a computationally expensive IRL step to extract a reward function and then learning an action transformation policy based on that reward. Then, in Section 4.2, we show that this entire procedure is equivalent to directly reducing the marginal transition distribution discrepancy between the real world and the grounded simulator. This is important, as recent work Goodfellow et al. [2014]; Ho and Ermon [2016]; Torabi et al. [2018b] has shown that adversarial approaches are a promising algorithmic paradigm to reduce such discrepancies. Thus, in Section 4.3, we explicitly formulate a generative adversarial objective upon which we build the proposed approach.

4.1 Action Transformation Inverse Reinforcement Learning

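We first lay out a procedure to learn the action transformation policy by extracting the appropriate cost function, which we term action transformation IRL (ATIRL). We use the cost function formulation in our derivation, similar to previous work Ho and Ermon [2016]; Torabi et al. [2018b]. ATIRL aims to identify a cost function such that the observed real world transitions yield higher return than any other possible transitions. We consider the set of cost functions $\mathcal{C}$ as all functions $\mathbb{R}^{\mathcal{S} \times \mathcal{A} \times \mathcal{S}} = \{c: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}\}$.

$$\text{ATIRL}_\psi(T_{real}) := \arg\max_{c \in \mathcal{C}} -\psi(c) + \left(\min_{\pi_g \in \Pi_g} \mathbb{E}_{\rho_g}[c(s, a, s')]\right) - \mathbb{E}_{\rho_{real}}[c(s, a, s')] \tag{3}$$

where $\psi: \mathbb{R}^{\mathcal{S} \times \mathcal{A} \times \mathcal{S}} \mapsto \overline{\mathbb{R}}$ is a (closed, proper) convex reward function regularizer, and $\overline{\mathbb{R}}$ denotes the extended real numbers $\mathbb{R} \cup \{\infty\}$. This regularizer is used to avoid overfitting the expressive set $\mathcal{C}$. Note that $\pi_g$ influences $\rho_g$ (Equation 10 in Appendix A) and $T_{real}$ influences $\rho_{real}$. Similar to gaifo, we do not use causal entropy in our ATIRL objective due to the surjective mapping from $\Pi_g$ to $\mathcal{P}_g$.

The action transformation then uses this per-step cost function as a reward function in an RL procedure: $\text{RL}(c) := \arg\min_{\pi_g \in \Pi_g} \mathbb{E}_{\rho_g}[c(s, a, s')]$. We assume here for simplicity that there is an action transformation policy that can mimic the real world dynamics perfectly. That is, there exists a policy $\pi_g \in \Pi_g$, such that $T_g(s'|s, a) = T_{real}(s'|s, a)$ for all $s, a, s'$. We denote the RL procedure applied to the cost function recovered by ATIRL as $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$.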

4.2 Characterizing the Policy Induced by ATIRL

This section shows that it is possible to bypass the ATIRL step and learn the action transformation policy directly from data. We show that $\psi$-regularized $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$ implicitly searches for policies that have a marginal transition distribution close to the real world’s, as measured by the convex conjugate of $\psi$, which we denote as $\psi^*$. As a practical consequence, we will then be able to devise a method for minimizing this divergence through the use of generative adversarial techniques in Section 4.3. But first, we state our main theoretical claim:

Theorem 1.

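$\text{RL} \circ \text{ATIRL}_\psi(T_{real})$ and $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ induce policies that have the same marginal transition distribution, $\rho_g$.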

To reiterate, the agent policy $\pi$ is fixed. So the only decisions affecting the marginal transition distributions are of the action transformation policy $\pi_g$. We can now state the following proposition:

Proposition 4.1.

For a given $\rho_{real}$ generated by a fixed policy $\pi$, $T_{real}$ is the only transition function whose marginal transition distribution is $\rho_{real}$.

Proof in Appendix B.1. We can also show that if two transition functions are equal, then the optimal policy in one will be optimal in the other.

Proposition 4.2.

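If $T_{real} = T_g$, then $\arg\max_{\pi \in \Pi} \eta_{T_g}[\pi] = \arg\max_{\pi \in \Pi} \eta_{T_{real}}[\pi]$.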

Proof in Appendix B.2. We now prove Theorem 1, which characterizes the policy learned by $\text{RL}(\tilde{c})$ on the cost function $\tilde{c}$ recovered by $\text{ATIRL}_\psi(T_{real})$.

Proof of Theorem 1.

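To prove Theorem 1, we prove that $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$ and $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ result in the same marginal transition distribution. This proof has three parts, two of which are proving that both objectives above can be formulated as optimizing over marginal transition distributions. The third is to show that these equivalent objectives result in the same distribution.

The outputs of both $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$ and $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ are policies. To compare the marginal distributions, we first establish a different objective that we argue has the same marginal transition distribution as $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$. We define

$$\overline{\text{ATIRL}}_\psi(T_{real}) := \arg\max_{c \in \mathcal{C}} -\psi(c) + \left(\min_{\rho_g \in \mathcal{P}_g} \mathbb{E}_{\rho_g}[c(s, a, s')]\right) - \mathbb{E}_{\rho_{real}}[c(s, a, s')] \tag{4}$$

with the same $\psi$ and $\mathcal{C}$ as Equation 3, and similar except the internal optimization for Equation 3 is over $\pi_g \in \Pi_g$, while it is over $\rho_g \in \mathcal{P}_g$ for Equation 4. We define an RL procedure $\overline{\text{RL}}(\bar{c}) := \arg\min_{\rho_g \in \mathcal{P}_g} \mathbb{E}_{\rho_g}[\bar{c}(s, a, s')]$ that returns a marginal transition distribution $\rho_g \in \mathcal{P}_g$ which minimizes the given cost function $\bar{c}$. $\overline{\text{RL}} \circ \overline{\text{ATIRL}}_\psi(T_{real})$ will output the marginal transition distribution $\bar{\rho}_g$.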

Lemma 4.1.

$\overline{\text{RL}} \circ \overline{\text{ATIRL}}_\psi(T_{real})$ outputs a marginal transition distribution $\bar{\rho}_g$ which is equal to $\tilde{\rho}_g$ induced by $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$.

Proof in Appendix B.3. The mapping from $\Pi_g$ to $\mathcal{P}_g$ is not injective, and there could be multiple policies that lead to the same marginal transition distribution. The above lemma is sufficient for proof of Theorem 1, however, since we focus on the effect of the policy on the transitions.

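Lemma 4.2.

$\overline{\text{RL}} \circ \overline{\text{ATIRL}}_\psi(T_{real}) = \arg\min_{\rho_g \in \mathcal{P}_g} \psi^*(\rho_g - \rho_{real})$.

The proof in Appendix B.4 relies on the optimal cost function and the optimal policy forming a saddle point, leading to a minimax objective, and these objectives being the same.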

Lemma 4.3.

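The marginal transition distribution of $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ is equal to $\arg\min_{\rho_g \in \mathcal{P}_g} \psi^*(\rho_g - \rho_{real})$.

Proof in Appendix B.5. With these three lemmas, we have proved that $\text{RL} \circ \text{ATIRL}_\psi(T_{real})$ and $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ induce policies that have the same marginal transition distribution. ∎

Theorem 1 thus tells us that the objective $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ is equivalent to the procedure from Section 4.1. In the next section, we choose a function $\psi$ which leads to our adversarial objective.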

4.3 Forming the Adversarial Objective

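Section 4.2 laid out the objective we want to minimize. To solve $\arg\min_{\pi_g} \psi^*(\rho_g - \rho_{real})$ we require an appropriate regularizer $\psi$. gail Ho and Ermon [2016] and gaifo Torabi et al. [2018b] optimize similar objectives and have shown a regularizer similar to the following to work well:

$$\psi(c) = \begin{cases} \mathbb{E}_{\rho_{real}}[g(c(s, a, s'))] & \text{if } c < 0 \\ +\infty & \text{otherwise} \end{cases} \quad \text{where} \quad g(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases}$$

It is closed, proper, convex and has a convex conjugate leading to the following minimax objective:

$$\min_{\pi_g \in \Pi_g} \max_{D}\ \mathbb{E}_{\rho_g}[\log D(s, a, s')] + \mathbb{E}_{\rho_{real}}[\log(1 - D(s, a, s'))]$$

where the reward for the action transformer policy $\pi_g$ is $-\log D(s, a, s')$, and $D: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto (0, 1)$ is a discriminative classifier. These properties have been shown in previous works Ho and Ermon [2016]; Torabi et al. [2018b]. Algorithm 1 lays out the steps for learning the action transformer using the above procedure, which we call generative adversarial reinforced action transformation (garat).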
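The two quantities used in the adversarial updates can be sketched as follows. The code assumes the GAIL-style sign convention in which the discriminator outputs values near 1 on grounded-simulator transitions and near 0 on real transitions, so that the action transformer is rewarded for transitions the discriminator attributes to the real world. The discriminator is represented only by its output probabilities; the function names are illustrative.

```python
import math


def discriminator_loss(d_on_grounded, d_on_real):
    """Negative of E_grounded[log D] + E_real[log(1 - D)], to be minimized by gradient descent.

    Both arguments are lists of discriminator outputs in the open interval (0, 1).
    """
    grounded_term = sum(math.log(d) for d in d_on_grounded) / len(d_on_grounded)
    real_term = sum(math.log(1.0 - d) for d in d_on_real) / len(d_on_real)
    return -(grounded_term + real_term)


def transformer_reward(d):
    """Per-transition reward for the action transformation policy: -log D(s, a, s')."""
    return -math.log(d)


# An uninformative discriminator (output 0.5 everywhere) has loss 2 * log(2).
chance_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

Under this convention the reward grows as the discriminator output on a grounded-simulator transition approaches 0, that is, as the transition becomes indistinguishable from real-world data.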

5 Related Work

In this section, we discuss the variety of sim-to-real methods, work more closely related to garat, and some related methods in the IfO literature. Sim-to-real transfer can be improved by making the agent’s policy more robust to variations in the environment or by making the simulator more accurate w.r.t. the real world. The first approach, which we call policy robustness methods, encompasses algorithms that train a robust policy that performs well on a range of environments Jakobi [1997]; Peng et al. [2018, 2020]; Pinto et al. [2017]; Rajeswaran et al. [2016]; Sadeghi and Levine [2016]; Tobin et al. [2017, 2018]. Robust adversarial reinforcement learning (rarl) Pinto et al. [2017] is such an algorithm that learns a policy robust to adversarial perturbations Szegedy et al. [2013]. While primarily focused on training with a modifiable simulator, a version of rarl treats the simulator as a black-box by adding the adversarial perturbation directly to the protagonist’s action. Additive noise envelope (ane) Jakobi et al. [1995] is another black-box robustness method which adds an envelope of Gaussian noise to the agent’s action during training.

The second approach, known as domain adaptation or system identification, grounds the simulator using real world data to make its transitions more realistic. Since hand engineering accurate simulators Tan et al. [2018]; Xie et al. [2019] can be expensive and time consuming, real world data can be used to adapt low-fidelity simulators to the task at hand. Most simulator adaptation methods Allevato et al. [2019]; Chebotar et al. [2019]; Farchy et al. [2013]; Hwangbo et al. [2019] rely on access to a parameterized simulator.

garat, on the other hand, does not require a modifiable simulator and relies on an action transformation policy applied in the simulator to bring its transitions closer to the real world. gat Hanna and Stone [2017] learns an action transformation function similar to garat. It was shown to have successfully learned and transferred one of the fastest known walk policies on the humanoid robot, Nao.

garat draws from recent generative adversarial approaches to imitation learning (gail Ho and Ermon [2016]) and IfO (gaifo Torabi et al. [2018b]). airl Fu et al. [2018], fairl Ghasemipour et al. [2019], and wail Xiao et al. [2019] are related approaches which use different divergence metrics to reduce the marginal distribution mismatch. garat can be adapted to use any of these metrics, as we show in the appendix.

One of the insights of this paper is that grounding the simulator using action transformation can be seen as a form of IfO. bco Torabi et al. [2018a] is an IfO technique that utilizes behavioral cloning. i2l Gangwani and Peng [2020] is an IfO algorithm that aims to learn in the presence of transition dynamics mismatch in the expert and agent’s domains, but requires millions of real world interactions to be competent.

6 Experiments

In this section, we conduct experiments to verify our hypothesis that garat leads to improved sim-to-real transfer compared to previous methods. We also show that it leads to better simulator grounding compared to the previous action transformation approach, gat.

We validate garat for sim-to-real transfer by transferring the agent policy between OpenAI Gym Brockman et al. [2016] simulated environments with different transition dynamics. We highlight the Minitaur domain (Figure 2) as a particularly useful test since there exist two simulators, one of which has been carefully engineered for high fidelity to the real robot Tan et al. [2018]. For other environments, the “real” environment is the simulator modified in different ways such that a policy trained in the simulator does not transfer well to the “real” environment. Details of these modifications are provided in Appendix C.1. Apart from enabling a thorough evaluation of sim-to-real algorithms across multiple different domains, this sim-to-“real” setup also allows us to compare garat and other algorithms against a policy trained directly in the target domain with millions of interactions, which is otherwise prohibitively expensive on a real robot. Throughout this section, we refer to the target environment as the “real” environment and the source environment as the simulator. We focus here on answering the following questions:

  1. How well does garat ground the simulator to the “real” environment?

  2. Does garat lead to improved sim-to-“real” transfer, compared to other related methods?

6.1 Simulator Grounding


Figure 1: Evaluation of simulator grounding with garat in the InvertedPendulum domain.
(a) Norm of per-step transition errors (lower is better) between different simulator environments and the target environment, shown over the number of action transformation policy updates for garat.
(b) Example trajectories of the same agent policy deployed in different environments, plotted using the pendulum angle across time. The response of the garat-grounded simulator is the most like the “real” environment.

In Figure 1, we evaluate how well garat grounds the simulator to the “real” environment both quantitatively and qualitatively. This evaluation is in the InvertedPendulum domain, where the “real” environment has a heavier pendulum than the simulator; implementation details are in Appendix C.1. In Figure 1(a), we plot the average error in transitions in simulators grounded with garat and gat with different amounts of “real” data, collected by deploying $\pi$ in the “real” environment. In Figure 1(b), we deploy the same policy from the same start state in the different environments (simulator, “real” environment, and grounded simulators). From both these figures it is evident that garat leads to a grounded simulator with lower error on average, and responses qualitatively closer to the “real” environment compared to gat. Details of how we obtained these plots are in Appendix C.2.
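One plausible way to compute an average per-step transition error is sketched below. The exact procedure is in Appendix C.2, which is not reproduced here, so the choice of the Euclidean (L2) norm and of comparing next states reached from the same state-action pairs are assumptions for illustration.

```python
import math


def mean_transition_error(next_states_a, next_states_b):
    """Average Euclidean distance between paired next states from two environments.

    Each argument is a list of state vectors; entry i in both lists is assumed to be
    the next state reached from the same state-action pair in each environment.
    """
    errors = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(sa, sb)))
        for sa, sb in zip(next_states_a, next_states_b)
    ]
    return sum(errors) / len(errors)


# Hypothetical next states from a grounded simulator and from the "real" environment.
grounded_next_states = [[0.0, 0.0], [1.0, 1.0]]
real_next_states = [[3.0, 4.0], [1.0, 1.0]]
example_error = mean_transition_error(grounded_next_states, real_next_states)
```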

6.2 Sim-to-“Real” Transfer


Figure 2: The Minitaur Domain

We now validate the effectiveness of garat at transferring a policy from sim to “real”. For various MuJoCo Todorov et al. [2012] environments, we pretrain the agent policy $\pi$ in the ungrounded simulator, collect real world data with $\pi$, use garat to ground the simulator, re-train the agent policy until convergence in these grounded simulators, and then evaluate mean return across 50 episodes for the updated agent policy in the “real” environment.

The agent policy $\pi$ and action transformation policy $\pi_g$ are trained with trpo Schulman et al. [2015] and ppo Schulman et al. [2017] respectively. The specific hyperparameters used are provided in Appendix C. We use the implementations of trpo and ppo provided in the stable-baselines library Hill et al. [2018]. For every update of the action transformation policy, we update the garat discriminator once as well. Results here use the losses detailed in Algorithm 1. However, we find that garat is just as effective with other divergence measures Fu et al. [2018]; Ghasemipour et al. [2019]; Xiao et al. [2019] (Appendix C).

garat is compared to gat Hanna and Stone [2017], rarl Pinto et al. [2017] adapted for a black-box simulator, and action-noise-envelope (ane) Jakobi et al. [1995]. $\pi_{real}$ and $\pi_{sim}$ denote policies trained in the “real” environment and simulator respectively until convergence. We use the best performing hyperparameters for these methods, specified in Appendix C.


Figure 3: Performance of different techniques evaluated in “real” environment. Environment return on the $y$-axis is scaled such that $\pi_{real}$ achieves 1 and $\pi_{sim}$ achieves 0.
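The scaling described in the caption corresponds to a linear rescaling of the raw return between two reference values. The sketch below assumes exactly that linear form; the function name and the numeric returns are hypothetical.

```python
def scaled_return(ret, ret_sim_policy, ret_real_policy):
    """Linearly rescale a return so the simulator-trained policy maps to 0
    and the policy trained in the "real" environment maps to 1."""
    return (ret - ret_sim_policy) / (ret_real_policy - ret_sim_policy)


# Hypothetical raw returns measured in the "real" environment.
example_scaled = scaled_return(150.0, ret_sim_policy=100.0, ret_real_policy=200.0)
```

Values above 1 are possible under this scaling whenever a transferred policy outperforms the policy trained directly in the “real” environment.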

Figure 3 shows that, in most of the domains, garat with just a few thousand transitions from the “real” environment facilitates transfer of policies that perform on par with policies trained directly in the “real” environment using 1 million transitions. garat also consistently performs better than previous methods on all domains, except HopperHighFriction, where most of the methods perform well. The shaded envelope denotes the standard error across 5 experiments with different random seeds for all the methods. Apart from the MuJoCo simulator, we also show successful transfer in the PyBullet simulator Coumans and Bai [2016] using the Ant domain. Here the “real” environment has gravity twice that of the simulator, resulting in purely simulator-trained policies collapsing ineffectually in the “real” environment. In this relatively high dimensional domain, as well as in Walker, we see garat still transfers a competent policy while the related methods fail.
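The standard error behind a shaded envelope of this kind can be computed per evaluation point from the returns of the individual seeds. The sketch below assumes the usual definition (sample standard deviation divided by the square root of the number of runs); the text does not state which estimator was used, and the returns are made up.

```python
import math
import statistics


def standard_error(values):
    """Sample standard deviation (n - 1 denominator) divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical scaled returns from 5 runs with different random seeds.
seed_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
envelope_half_width = standard_error(seed_returns)
```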

In the Minitaur domain Tan et al. [2018] we use the high fidelity simulator as our “real” environment. Here as well, a policy trained in simulation does not directly transfer well to the “real” environment Yu et al. [2018]. We see in this realistic setting that garat learns a policy that obtains more than 80% of the optimal “real” environment performance with just “real” environment transitions while the next best baseline (gat) obtains at most 50%, requiring ten times more “real” environment data.

7 Conclusion

In this paper, we have shown that grounded action transformation, a particular kind of grounded sim-to-real transfer technique, can be seen as a form of imitation from observation. We use this insight to develop garat, an adversarial imitation from observation algorithm for grounded sim-to-real transfer. We hypothesized that such an algorithm would lead to improved grounding of the simulator as well as better sim-to-real transfer compared to related techniques. This hypothesis is validated in Section 6 where we show that garat leads to better grounding of the simulator as compared to gat, and improved transfer to the “real” environment on various mismatched environment transfers, including the realistic Minitaur domain.

Broader Impact

Reinforcement learning Sutton and Barto [2018] is being considered as an effective tool to train autonomous agents in various important domains like robotics, medicine, etc. A major hurdle to deploying learning agents in these environments is the massive exploration and data requirements Hanna [2019] to ensure that these agents learn effective policies. Real world interactions and exploration in these situations could be extremely expensive (wear and tear on expensive robots), or dangerous (treating a patient in the medical domain).

Sim-to-real transfer aims to address this hurdle and enables agents to be trained mostly in simulation and then transferred to the real world based on very few interactions. Reducing the requirement for real world data for autonomous agents might open up the viability for autonomous agents in other fields as well.

Improved sim-to-real transfer will also reduce the pressure for high fidelity simulators, which require significant engineering effort [Chebotar et al., 2019; Tan et al., 2018]. Simulators are also developed with a task in mind, and are generally not reliable outside their specifications. Sim-to-real transfer might enable simulators that learn to adapt to the task that needs to be performed, a potential direction for future research.

Sim-to-real research needs to be handled carefully, however. Grounded simulators might lead to a false sense of confidence in a policy trained in such a simulator, even though a simulator grounded with real world data will still perform poorly in situations outside the data distribution. As has been noted in the broader field of machine learning Amodei et al. [2016], situations outside the training distribution might lead to unexpected consequences. Simulator grounding must be done carefully in order to guarantee that the grounding is applied over all relevant parts of the environment.

Improved sim-to-real transfer could increase reliance on compute and reduce incentives for sample efficient methods. The field should be careful not to abandon this thread of research as the increasing cost and impact of computation used by machine learning become more apparent Amodei and Hernandez [2018].


  • A. Allevato, E. S. Short, M. Pryor, and A. L. Thomaz (2019) TuneNet: one-shot residual tuning for system identification and sim-to-real robot task transfer. In Conference on Robot Learning (CoRL). Cited by: §2.2, §5.
  • D. Amodei and D. Hernandez (2018) AI and compute. Cited by: Broader Impact.
  • D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. Cited by: Broader Impact.
  • M. Bain and C. Sammut (1995) A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129. Cited by: §2.3.
  • P. Bakker and Y. Kuniyoshi (1996) Robot see, robot do: an overview of robot imitation. In AISB96 Workshop on Learning in Robots and Animals, pp. 3–11. Cited by: §1, §2.3.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §6.
  • Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox (2019) Closing the sim-to-real loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979. Cited by: §1, §2.2, §5, Broader Impact.
  • E. Coumans and Y. Bai (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository. Cited by: §6.2.
  • A. Farchy, S. Barrett, P. MacAlpine, and P. Stone (2013) Humanoid robots learning to walk faster: from the real world to simulation and back. In Proc. of 12th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS). Cited by: §1, §2.2, §5.
  • J. Fu, K. Luo, and S. Levine (2018) Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations. Cited by: Appendix C, §5, §6.2.
  • T. Gangwani and J. Peng (2020) State-only imitation with transition dynamics mismatch. In International Conference on Learning Representations. Cited by: §5.
  • S. K. S. Ghasemipour, R. Zemel, and S. Gu (2019) A divergence minimization perspective on imitation learning methods. arXiv:1911.02256. Cited by: Appendix C, §5, §6.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §4.
  • J. P. Hanna and P. Stone (2017) Grounded action transformation for robot learning in simulation. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Appendix C, §1, §1, §2.2, §5, §6.2.
  • J. P. Hanna (2019) Data efficient reinforcement learning with off-policy and simulated data. Ph.D. Thesis, University of Texas at Austin. Cited by: Broader Impact.
  • A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018) Stable Baselines. GitHub. Cited by: §6.2.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4565–4573. Cited by: §1, §4.1, §4.3, §4, §5.
  • J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26), pp. eaau5872. Cited by: §5.
  • N. Jakobi, P. Husbands, and I. Harvey (1995) Noise and the reality gap: the use of simulation in evolutionary robotics. In Advances in Artificial Life, F. Morán, A. Moreno, J. J. Merelo, and P. Chacón (Eds.), Berlin, Heidelberg, pp. 704–720. ISBN 978-3-540-49286-3. Cited by: §5, §6.2.
  • N. Jakobi (1997) Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive behavior 6 (2), pp. 325–368. Cited by: §5.
  • Y. Liu, A. Gupta, P. Abbeel, and S. Levine (2018) Imitation from observation: learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. Cited by: §1, §2.3, §3.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3481–3490. Cited by: Appendix C.
  • A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 663–670. Cited by: §2.3.
  • OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving Rubik’s cube with a robot hand. arXiv:1910.07113. Cited by: §1.
  • B. S. Pavse, F. Torabi, J. P. Hanna, G. Warnell, and P. Stone (2019) RIDM: reinforced inverse dynamics modeling for learning from a single observed demonstration. arXiv preprint arXiv:1906.07372. Cited by: §2.3.
  • X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8. Cited by: §5.
  • X. B. Peng, E. Coumans, T. Zhang, T. Lee, J. Tan, and S. Levine (2020) Learning agile robotic locomotion skills by imitating animals. arXiv preprint arXiv:2004.00784. Cited by: §1, §5.
  • L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2817–2826. Cited by: §5, §6.2.
  • M. L. Puterman (1990) Markov decision processes. Handbooks in operations research and management science 2, pp. 331–434. Cited by: §3.
  • A. Rajeswaran, S. Ghotra, S. Levine, and B. Ravindran (2016) EPOpt: learning robust neural network policies using model ensembles. CoRR abs/1610.01283. Cited by: §5.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2.3.
  • F. Sadeghi and S. Levine (2016) CAD2RL: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201. Cited by: §5.
  • S. Schaal (1997) Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046. Cited by: §2.3.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. Cited by: §6.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §6.2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.1, Broader Impact.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §5.
  • J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. CoRR abs/1804.10332. Cited by: Table 3, §5, §6.2, §6, Broader Impact.
  • J. Tobin, L. Biewald, R. Duan, M. Andrychowicz, A. Handa, V. Kumar, B. McGrew, A. Ray, J. Schneider, P. Welinder, et al. (2018) Domain randomization and generative models for robotic grasping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489. Cited by: §5.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. Cited by: §5.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.2.
  • F. Torabi, G. Warnell, and P. Stone (2018a) Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4950–4957. Cited by: §1, §2.3, §5.
  • F. Torabi, G. Warnell, and P. Stone (2018b) Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158. Cited by: §1, §2.3, §4.1, §4.3, §4, §4, §5.
  • F. Torabi, G. Warnell, and P. Stone (2019) Recent advances in imitation learning from observation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. Cited by: §1.
  • H. Xiao, M. Herman, J. Wagner, S. Ziesche, J. Etesami, and T. H. Linh (2019) Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113. Cited by: §5, §6.2.
  • Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, and M. van de Panne (2019) Learning locomotion skills for Cassie: iterative design and sim-to-real. In Proc. Conference on Robot Learning (CoRL 2019), Vol. 4. Cited by: §5.
  • W. Yu, C. K. Liu, and G. Turk (2018) Policy transfer with strategy optimization. CoRR abs/1810.05751. Cited by: §6.2.

Appendix A Marginal Distributions and Returns

We expand the definition of the marginal transition distribution to be more explicit below.


where is the starting state distribution. Written in a single equation:

The expected return can be written more explicitly to show its dependence on the transition function, which also makes the connection to Equation 1 more explicit.
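The relationship between the marginal transition distribution and the expected return can be checked numerically in a small tabular MDP. The sketch below is illustrative only: it assumes the common discounted form, in which the marginal over $(s, a, s')$ is the normalized discounted state visitation times the policy times the transition function, and the return equals $\frac{1}{1-\gamma}$ times the marginal-weighted reward. The function names, array layout, and random test MDP are hypothetical and do not come from the paper.

```python
import numpy as np

def marginal_transition_distribution(T, pi, rho0, gamma):
    """Discounted marginal distribution over (s, a, s') in a tabular MDP.

    T[s, a, s2] : transition probabilities
    pi[s, a]    : agent policy probabilities
    rho0[s]     : starting state distribution
    (Assumed standard form; the array layout is a choice made for this sketch.)
    """
    n_states = T.shape[0]
    # State-to-state kernel under pi: P[s, s2] = sum_a pi[s, a] * T[s, a, s2].
    P = np.einsum("sa,sat->st", pi, T)
    # Normalized discounted state visitation d solves
    # d = (1 - gamma) * rho0 + gamma * P^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P.T, (1.0 - gamma) * rho0)
    # rho[s, a, s2] = d[s] * pi[s, a] * T[s, a, s2].
    return d[:, None, None] * pi[:, :, None] * T

def rollout_return(T, pi, rho0, R, gamma, horizon=2000):
    """Expected discounted return, obtained by propagating the state distribution."""
    P = np.einsum("sa,sat->st", pi, T)
    # Expected one-step reward from each state under pi and T.
    r_state = np.einsum("sa,sat,sat->s", pi, T, R)
    p_t, total = rho0.copy(), 0.0
    for t in range(horizon):
        total += (gamma ** t) * (p_t @ r_state)
        p_t = p_t @ P
    return total

# Small random MDP used only to exercise the two computations.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
pi = rng.dirichlet(np.ones(A), size=S)       # shape (S, A)
rho0 = rng.dirichlet(np.ones(S))             # shape (S,)
R = rng.normal(size=(S, A, S))               # reward for each (s, a, s')

rho = marginal_transition_distribution(T, pi, rho0, gamma)
return_from_rho = (rho * R).sum() / (1.0 - gamma)
return_from_rollout = rollout_return(T, pi, rho0, R, gamma)
```

Under these assumptions the two return estimates agree up to the truncation error of the finite horizon, which is negligible for the discount factor used here.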

In the grounded simulator, the action transformer policy transforms the transition function as specified in Section 2.2. Ideally, such a exists. We denote the marginal transition distributions in sim and real by and respectively, and for the grounded simulator. The distribution relies on as follows:


The marginal transition distribution of the simulator after action transformation, , differs in Equation 7 as follows:
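As a rough illustration of how an action transformer changes the simulator's dynamics, the sketch below assumes the grounded transition function takes the form $T_g(s' \mid s, a) = \sum_{\tilde{a}} \pi_g(\tilde{a} \mid s, a)\, T_{sim}(s' \mid s, \tilde{a})$. This form, the symbol names, and the tabular representation are assumptions made for the example; the equation from Section 2.2 is not reproduced in this appendix.

```python
import numpy as np

def grounded_transition(T_sim, pi_g):
    """Compose a tabular action transformer with simulator dynamics.

    T_sim[s, b, s2] : simulator transition probabilities
    pi_g[s, a, b]   : probability that the transformer replaces the agent's
                      action a with action b in state s (hypothetical layout)
    Returns T_g[s, a, s2] = sum_b pi_g[s, a, b] * T_sim[s, b, s2].
    """
    return np.einsum("sab,sbt->sat", pi_g, T_sim)

rng = np.random.default_rng(1)
S, A = 3, 2
T_sim = rng.dirichlet(np.ones(S), size=(S, A))          # shape (S, A, S)

# An identity transformer passes every action through unchanged.
pi_identity = np.broadcast_to(np.eye(A), (S, A, A))
# A transformer that always swaps the two actions.
pi_swap = np.broadcast_to(np.eye(A)[::-1], (S, A, A))
# A random stochastic transformer.
pi_random = rng.dirichlet(np.ones(A), size=(S, A))      # shape (S, A, A)

T_identity = grounded_transition(T_sim, pi_identity)
T_swap = grounded_transition(T_sim, pi_swap)
T_random = grounded_transition(T_sim, pi_random)
```

In this sketch the identity transformer leaves the simulator unchanged, the swapping transformer permutes the action axis, and any stochastic transformer still yields a valid transition function because each row of `T_random` sums to one.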


Appendix B Proofs

B.1 Proof of Proposition 4.1

See Proposition 4.1.


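We prove the above statement by contradiction. Consider two transition functions $T_1$ and $T_2$ that have the same marginal distribution under the same policy $\pi$, but differ in their likelihood for at least one transition $(s, a, s')$:

$$T_1(s' \mid s, a) \neq T_2(s' \mid s, a) \tag{12}$$

Let us denote the marginal distributions for $T_1$ and $T_2$ under policy $\pi$ as $\rho_1^\pi$ and $\rho_2^\pi$. Thus, $\rho_1^\pi(s) = \rho_2^\pi(s)$ and $\rho_1^\pi(s, a, s') = \rho_2^\pi(s, a, s')$.

The marginal likelihood of the above transition for both $T_1$ and $T_2$ is:

$$\rho_1^\pi(s, a, s') = \rho_1^\pi(s)\, \pi(a \mid s)\, T_1(s' \mid s, a), \qquad \rho_2^\pi(s, a, s') = \rho_2^\pi(s)\, \pi(a \mid s)\, T_2(s' \mid s, a)$$

Since the marginal distributions match, and the policy is the same, this leads to the equality:

$$T_1(s' \mid s, a) = T_2(s' \mid s, a) \tag{13}$$

Equation 13 contradicts Equation 12, proving our claim. ∎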
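The argument above can be illustrated numerically: when every state-action pair has nonzero probability under the policy, the transition function can be read back from the marginal transition distribution by conditioning on $(s, a)$, so two different transition functions cannot share a marginal. The sketch below is a minimal check under assumed notation. The helper names, the uniform starting distribution, and the discounted form of the marginal are choices made for this example and are not taken from the paper.

```python
import numpy as np

def occupancy(T, pi, rho0, gamma):
    """Discounted marginal over (s, a, s') for tabular T[s, a, s2] and pi[s, a]."""
    P = np.einsum("sa,sat->st", pi, T)
    d = np.linalg.solve(np.eye(T.shape[0]) - gamma * P.T, (1.0 - gamma) * rho0)
    return d[:, None, None] * pi[:, :, None] * T

def recover_transition(rho):
    """Condition on (s, a): T[s, a, s2] = rho[s, a, s2] / sum_s2 rho[s, a, s2].

    Only defined where the pair (s, a) has nonzero marginal probability.
    """
    return rho / rho.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.95
pi = rng.dirichlet(np.ones(A), size=S)    # full-support policy
rho0 = np.full(S, 1.0 / S)                # every state can be a start state
T1 = rng.dirichlet(np.ones(S), size=(S, A))
T2 = rng.dirichlet(np.ones(S), size=(S, A))

rho1 = occupancy(T1, pi, rho0, gamma)
rho2 = occupancy(T2, pi, rho0, gamma)
T1_recovered = recover_transition(rho1)
T2_recovered = recover_transition(rho2)
```

Because the recovered transition functions match the originals, equal marginals under the same policy would force equal transition functions on every visited state-action pair, which is the content of the contradiction.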

B.2 Proof of Proposition 4.2

See Proposition 4.2.


We overload the notation slightly and refer to as the marginal transition distribution in the real world while following agent policy . Proposition 4.1 still holds under this expanded notation.

From Proposition 4.1, if , we can say that . From Equation 1, , and . ∎

B.3 Proof of Lemma 4.1

See Lemma 4.1.


For every , there exists at least one action transformer policy , from our definition of . Let lead to a policy , with a marginal transition distribution . The marginal transition distribution induced by is .

We need to prove that , and we do so by contradiction. We assume that . For this inequality to be true, the marginal transition distribution of the result of must be different than the result of , or the cost functions and must be different.

Let us compare the procedures first. Assume that .

which leads to a contradiction.

Now let’s consider the cost functions presented by and . Since and lead to the same marginal transition distributions, for the inequality we assumed at the beginning of this proof to be true, and must return different cost functions.

which leads to another contradiction. Therefore, we can say that . ∎

B.4 Proof of Lemma 4.2

We prove convexity under a particular agent policy, but across action transformer (AT) policies.

Lemma B.1.

is compact and convex.


We first prove convexity of for and , by means of induction.

Base case: , for .

is convex and hence is a valid distribution, meaning is convex.

Induction Step: If is convex, is convex.

If is convex, is a valid distribution. This is true simply by summing the distribution at time over states and actions.