I Introduction
One of the most widely recognized goals of robotics research is to develop autonomous agents that can perform a wide variety of tasks in complex environments. Recently, numerous deep reinforcement learning (RL) and imitation learning (IL) based approaches have sought to achieve good performance in complex robotic tasks with minimal supervision. However, a major concern in experimenting directly in the real environment is safety, both of the robot and of the environment. Safety concerns, together with the issue of reproducibility, have drawn robotics research extensively to simulation environments.
An important benefit of simulators is that we can not only reset as many times as needed, varying the initial state and/or injecting stochastic noise such as observation noise and motor noise, but also configure the environment arbitrarily. This enables us to change dynamics parameters such as the mass, shape, size, and inertia of the agent, the friction between the agent and the environment, damping coefficients, and gravitational acceleration. We leverage this so that a wide ensemble of simulation configurations can be used in training to achieve robustness to a new environment. We especially focus on adaptation to the unknown dynamics of the new environment.
Most previous approaches for transfer to different environments [10, 2, 7, 29] have not explicitly taken advantage of the fact that we can dynamically sample a variety of environments in simulation, and some that have done so [16, 22, 32, 30, 27] do not attempt to learn an efficient off-policy scheme for inferring the dynamics of the environment. To remedy this, we adopt a two-fold approach, and claim the following contributions:

We learn a good latent space by encoding observations with appropriate regularization and explicitly concatenating to it an encoding of the vector of dynamics parameter configurations; the policy decoder is conditioned on this latent representation.

We develop a Bayesian meta-learning scheme to infer the dynamics parameter configuration of a given environment from a dataset of off-policy rollouts in that environment.
We demonstrate that by randomly sampling the parameters of the simulation environments, and adapting the policy to these varied configurations in training, we can achieve successful transfer at test time to a completely unseen dynamics configuration of the environment. An important point to note is that at test time, we do not have access to the ground-truth system parameters. So, we develop a scheme to learn system parameters from random off-policy state transition data.
A desirable property of a transfer learning method is that it should be zero-shot, in the sense that the transferred policy should not require any fine-tuning in the target environment, so that the safety of the real robot is not compromised when the method is applied to sim2real transfer. Our proposed approach realizes this property.
Although we evaluate our model in simulation only and study transfer across different simulation environments, the approach can be extended to sim2real transfer settings as well, provided there is access to a real robot, and we can mimic the real dynamics well when we set appropriate dynamics parameters in the simulator.
II Related Works
Training robots directly in the real environment is unsafe, especially for domains like navigation/locomotion [17, 25, 18], and hence training in simulation and deploying in the real world has become a common trend in robotics, under the theme of sim2real transfer [13, 34]. An important first step to sim2real transfer is sim2sim transfer [10, 31]. Numerous recent works have tackled a problem similar to ours and studied transfer of policies across different simulation environments, across dynamics models, and from simulation to real environments. Universal Planning Network (UPN) [21] trains for goal-directed tasks and in the process tries to capture ‘transferable representations’ such that the trained encoders can be used for reward-shaping an RL algorithm for a similar, albeit slightly more complicated, task. It is important to note that complete on-policy training of the RL algorithm still needs to be performed in the new environment, and the only ‘transfer’ benefit provided by UPN is in reward shaping. Hence, a good transfer cannot be achieved zero-shot.
Learning a policy that is robust to dynamics change can be naively done by training a policy architecture in different domain-randomized configurations. This has been done in the Domain Randomization [16, 22] approaches, the main drawback of which is that they learn an ‘average’ policy that performs reasonably well in a wide range of test environments but is not ‘very good’ for each environment. Motivated by this drawback, we do not aim to develop ‘robust’ policies, but policies that can ‘adapt’ to a given new test environment. A simple way to do this, as shown in [30], could be to maintain a repertoire of policies corresponding to different dynamics configurations (this ensemble is called the strategy) and choose the best policy for the test environment by running a few episodes and selecting the policy that yields the most reward. However, since this method requires a number of execution episodes in the test environment that is linearly proportional to the number of policies in the strategy, the approach is not scalable.
In [1], the authors use LSTM [11] value and policy networks that implicitly learn the dynamics parameters of the environment during policy learning via dynamics randomization. However, learning a dynamics model together with policy learning affords less control over what is learned in the latent space and may make convergence sensitive to hyperparameter choices. Hence, it is advisable to decouple the two procedures as advocated in [32]. Another issue is the convergence of dynamics parameter estimation. Since an LSTM assumes time-varying latent variables, and the observations change every timestep while the environmental dynamics remain fixed within an episode, achieving convergence for both a good policy and a good estimate of the dynamics may be difficult.
Meta-learning [8, 28, 9, 20] attempts to develop general models that can adapt to new tasks with very few model updates. FastMAML [35] modifies MAML [8] by separating the model parameters into general and task-specific parameters. Only the task-specific parameters need to be updated when a new ‘test’ task is given. No-Reward MAML [27] extends MAML [8] to handle tasks defined by different dynamics configurations of the environment. The main difference from vanilla MAML is that the authors meta-learn the advantage function, which is used to appropriately bias the Monte Carlo sampling estimates of the policy during learning. An important drawback of this approach is that by considering non-temporal state-action transition sequences (just static data of the form $(s_t, a_t, s_{t+1})$), important dynamics parameters like friction, gravity etc. cannot be appropriately modeled. Another drawback is that the method requires fine-tuning with some data samples in the test environment, and hence is not zero-shot.
Learning a domain-invariant latent space on which the policy is conditioned is another line of domain-adaptation based approaches for policy transfer. Zhang et al. [33] adapt the encoder from sim to real by performing adversarial domain adaptation (ADA) [4, 24] to match the latent space of the encoding in sim and real, but require intermediate supervision in the form of the positions of robotic joints for the latent state while training the encoders. Bharadhwaj et al. [2] do this end-to-end without requiring intermediate supervision. However, both of these approaches suffer from the drawback of not being able to transfer effectively to different dynamics configurations, as ADA cannot capture non-visual changes. Hence, they require fine-tuning in the real environment for aligning the dynamics modules, and so are not zero-shot approaches.
Yu et al. [29] adopt a two-stage process in which system identification is followed by policy transfer. The novelty of their method is in training a policy architecture conditioned on the roughly identified model parameters. However, a major concern with this approach is that on-policy state-transition data from the intermediately trained model is required in the target environment for system-parameter identification, which is not safe (since the model has not yet been fully trained). Also, the method can be used for transfer only to a ‘fixed’ target environment: when the target environment is altered, i.e. the system dynamics parameters are altered, the entire method including ‘pre-SysID’ needs to be retrained. In contrast, our method, once trained, can be deployed on any test environment with unknown dynamics parameters and does not need to be retrained when the test environment changes.
III The Proposed Approach
The components of the proposed approach are described below:
III-A The basic model
Our basic model consists of an encoder for observations and a policy (or ‘action’) decoder. The Markov chain corresponding to the model is $s \rightarrow z \rightarrow a$, where $s \in \mathcal{S}$ is the input state (which can either be fully observable or partially observable). $\mathcal{A}$ is the action space, a sample $a \in \mathcal{A}$ from which is what the model outputs. We consider $p(a \mid z)$ to be a normal distribution whose mean and variance are predicted by the decoder from the latent $z$. Our training scheme is end-to-end and hence we do not need intermediate supervision for the latent $z$. In the subsequent sections, we denote the observation encoder by $f_{\mathrm{obs}}$ and the action decoder by $f_{\mathrm{act}}$. Later, we also introduce the dynamics encoder, inverse dynamics model, and state (reconstruction) decoder, respectively denoted by $f_{\mathrm{dyn}}$, $f_{\mathrm{inv}}$, and $f_{\mathrm{rec}}$. The Dynamics Conditioned Policy module in Fig. 1 describes the basic architecture of MANGA. All the model components are realized by feedforward neural networks.
III-B Dynamics conditioned policy (DCP)
We condition our policy decoder both on the current observation frame in the environment and on an encoding of the dynamics parameters of the agent and the environment. Several previous papers [16, 7] fed the raw dynamics parameters to the policy model (i.e. without encoding them separately from the input observations). However, it is important to encode the parameters separately so that they scale well and are in sync with the latent encoding of the input observations. This is also important because the observations change at each timestep while the dynamics parameter vector, and thus the dynamics encoding, remains fixed within each episode of training.
Consider the process of training our model in a simulation environment $\mathcal{E}$ and let the dynamics parameters of $\mathcal{E}$ be denoted by a $d$-dimensional vector $\eta$. We encode the ground-truth dynamics parameters through an encoder $f_{\mathrm{dyn}}$ and feed its output to the bottleneck layer of our basic model at timestep $t$. The bottleneck layer is the concatenation of $f_{\mathrm{obs}}(s_t)$ and $f_{\mathrm{dyn}}(\eta)$, where $s_t$ is the observation in $\mathcal{E}$ at time $t$, i.e. the vector $z_t = [f_{\mathrm{obs}}(s_t), f_{\mathrm{dyn}}(\eta)]$. The policy decoder $f_{\mathrm{act}}$ then takes as input the vector $z_t$ and outputs the mean and covariance matrix for the action distribution. So,
$$(\mu_t, \Sigma_t) = f_{\mathrm{act}}(z_t), \qquad a_t \sim \mathcal{N}(\mu_t, \Sigma_t).$$
Here $a_t$ is the output action of the model corresponding to the input observation $s_t$ and the dynamics vector $\eta$. The policy learned in $\mathcal{E}$ is not likely to work well in another environment $\mathcal{E}'$, even if we provided its dynamics parameters $\eta'$, because we have not trained the policy to distinguish between the dependence on $s_t$ and on $\eta$. To remedy this, we borrow the idea of dynamics randomization from Peng et al. [16].
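The conditioning described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the class name `DynamicsConditionedPolicy`, the single-layer encoders, the latent size and the random untrained weights are all hypothetical, and the decoder predicts a per-dimension standard deviation rather than a full covariance matrix.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix (n_out x n_in) and zero bias; stands in for a trained layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def tanh_vec(x):
    return [math.tanh(v) for v in x]


class DynamicsConditionedPolicy:
    """Hypothetical minimal policy: encode obs and dynamics separately, then decode."""

    def __init__(self, obs_dim, dyn_dim, act_dim, latent_dim=8, seed=0):
        rng = random.Random(seed)
        self.obs_enc = make_linear(obs_dim, latent_dim, rng)       # observation encoder
        self.dyn_enc = make_linear(dyn_dim, latent_dim, rng)       # dynamics encoder
        self.decoder = make_linear(2 * latent_dim, 2 * act_dim, rng)  # action decoder
        self.act_dim = act_dim

    def forward(self, obs, eta):
        z_obs = tanh_vec(linear(obs, self.obs_enc))
        z_dyn = tanh_vec(linear(eta, self.dyn_enc))
        z = z_obs + z_dyn                      # list concatenation: the bottleneck vector
        out = linear(z, self.decoder)
        mean = out[:self.act_dim]
        std = [math.exp(v) for v in out[self.act_dim:]]  # exp keeps the std positive
        return mean, std


policy = DynamicsConditionedPolicy(obs_dim=3, dyn_dim=2, act_dim=1)
obs = [0.1, -0.2, 0.3]
mean_a, std_a = policy.forward(obs, eta=[1.0, 0.5])
mean_b, std_b = policy.forward(obs, eta=[2.0, 0.1])
```

The same observation yields a different action distribution under the two dynamics vectors, which is the dependence the randomized training is meant to teach.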
III-C Training the DCP: Improving Generalization through Dynamics Randomization
To implicitly learn the dependence of the policy on input observations and the dynamics of the environment, we train our model across different simulation environments by choosing random values for the dynamics parameters (within appropriate ranges) across the environments. At the start of each episode we sample a certain $\eta$ that defines an environment, choose a random initial pose (state) from the distribution of all states, and train the model in that environment; we then sample another $\eta$ randomly at the start of the next episode. This explicit conditioning over a wide ensemble of dynamics parameters enables transfer to unseen dynamics parameters.
Our proposed method is agnostic to the type of training procedure, and both Reinforcement Learning (RL) and Imitation Learning (IL) approaches can be used. However, in the experiments we consider a specific RL algorithm for training, for the sake of consistency in comparison. The detailed training procedure for the dynamics conditioned policy is described through Algorithm 1.
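The per-episode randomization can be sketched as a thin loop around whichever learner is chosen. This is not the paper's Algorithm 1: `train_dcp` and its four callables (`sample_eta`, `make_env`, `rollout`, `update`) are hypothetical placeholders used only to show where the dynamics vector is resampled.

```python
import random


def train_dcp(num_episodes, sample_eta, make_env, rollout, update, seed=0):
    """Resample the dynamics at every episode and hand the rollout to a generic learner.

    sample_eta(rng) -> dynamics vector
    make_env(eta)   -> environment configured with those dynamics
    rollout(env, eta) -> trajectory from the policy conditioned on the true eta
    update(trajectory) -> scalar loss reported by the chosen RL or IL algorithm
    """
    rng = random.Random(seed)
    history = []
    for _ in range(num_episodes):
        eta = sample_eta(rng)            # new dynamics at the start of each episode
        env = make_env(eta)              # the environment is defined by eta
        trajectory = rollout(env, eta)
        loss = update(trajectory)
        history.append((eta, loss))
    return history


# Toy stand-ins so that the loop can be executed end to end.
history = train_dcp(
    num_episodes=5,
    sample_eta=lambda rng: [rng.uniform(0.5, 1.5)],
    make_env=lambda eta: {"mass": eta[0]},
    rollout=lambda env, eta: [(env["mass"], eta[0])],
    update=lambda trajectory: 0.0,
)
```

Keeping the learner behind `update` reflects the method-agnostic claim: the randomization logic does not change when the training algorithm does.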
III-D Inferring the dynamics parameters at test time
Since we do not have access to the system’s dynamics parameters at test time, we propose a scheme to learn the system parameters from random off-policy state transition data. During training, we have access to the simulation environments $\mathcal{E}_i$ and their corresponding dynamics parameters $\eta_i$ ($i = 1, \dots, N$). We consider a random policy that samples a random action in the range of allowed actions (or any pre-trained ‘safe’ policy) and allow it to run for a few episodes in each environment. We collect state transition data in tuples of the form (state, action, next state), i.e. $(s_t, a_t, s_{t+1})$. Let $T$ be a forward dynamics model of the simulator such that $\hat{s}_{t+1} = T(s_t, a_t; \eta)$, where the true next state is given as
$$s_{t+1} = T(s_t, a_t; \eta^{*}) + \epsilon \tag{1}$$
when we have the true value of the system parameters $\eta^{*}$. Here $\epsilon$ is a noise term, and for the sake of analytic simplicity, we assume $\epsilon$ is Gaussian with zero mean and variance $\sigma_\epsilon^2$. The above defines the likelihood model for state transitions, and our aim is to estimate $\eta$ through its posterior $p(\eta \mid \mathcal{D})$, where $\mathcal{D}$ is the collected state-transition dataset.
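A small numerical illustration of this Gaussian likelihood model follows. The one-dimensional `forward_model` (a velocity damped by a friction-like parameter), the noise level and the candidate grid are assumptions made for the example and are not taken from the paper; the point is only that random off-policy transitions already carry enough signal to rank candidate parameters.

```python
import math
import random


def forward_model(state, action, eta, dt=0.1):
    """Toy 1-D dynamics: a velocity damped by a friction-like parameter eta."""
    return state + dt * (action - eta * state)


def log_likelihood(transitions, eta, sigma):
    """Sum of log N(s_next; T(s, a; eta), sigma^2) over (s, a, s_next) tuples."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for s, a, s_next in transitions:
        residual = s_next - forward_model(s, a, eta)
        total += const - 0.5 * (residual / sigma) ** 2
    return total


# Off-policy data: uniformly random actions in an environment with eta_true.
rng = random.Random(0)
eta_true, sigma = 0.8, 0.01
transitions, s = [], 1.0
for _ in range(200):
    a = rng.uniform(-5.0, 5.0)
    s_next = forward_model(s, a, eta_true) + rng.gauss(0.0, sigma)
    transitions.append((s, a, s_next))
    s = s_next

candidates = [0.2, 0.5, 0.8, 1.1, 1.4]
best_eta = max(candidates, key=lambda e: log_likelihood(transitions, e, sigma))
```

A grid search is used here purely for transparency; the approach in the text replaces it with a learned posterior over the parameters.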
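Although some previous approaches [27] try to estimate system dynamics parameters from uncorrelated, standalone state-transition tuples, we postulate that to correctly estimate dynamics parameters, we must consider correlated state-transition data within episodes. We divide the horizon length $H$ of the episodes in each $\mathcal{E}_i$ into $K$ chunks of length $L$ each and estimate $\eta$ for each chunk $k$ in the form of a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$ by using an elemental dynamics estimator. If we denote the observation sequence and action sequence within the $k$-th chunk in $\mathcal{E}_i$ as $s^{(k)}$ and $a^{(k)}$, this amounts to the estimate of the following posterior:
$$p(\eta \mid s^{(k)}, a^{(k)}) \approx \mathcal{N}\big(\eta;\ \mu_k,\ \mathrm{diag}(\sigma_k^2)\big), \qquad \mu_k = \mu(s^{(k)}, a^{(k)}), \quad \sigma_k = \sigma(s^{(k)}, a^{(k)}).$$
The chunk length $L$ should be large, but not too large. To aggregate the estimates of $\eta$, we exploit the relationship between the posterior of $\eta$ conditioned on a single pair of datapoints, $p(\eta \mid s^{(k)}, a^{(k)})$, and the posterior of $\eta$ conditioned on the entire dataset $\mathcal{D} = \{(s^{(k)}, a^{(k)})\}_{k=1}^{K}$, $p(\eta \mid \mathcal{D})$. We note that:
$$p(\eta \mid \mathcal{D}) \propto p(\eta) \prod_{k=1}^{K} p(s^{(k)}, a^{(k)} \mid \eta) \propto p(\eta)^{1-K} \prod_{k=1}^{K} p(\eta \mid s^{(k)}, a^{(k)}).$$
Because we assume $p(\eta)$ and $p(\eta \mid s^{(k)}, a^{(k)})$ to both be independent Gaussian distributions, $p(\eta \mid \mathcal{D})$ can be obtained as a Gaussian distribution after some elementary computations,
$$q_\theta(\eta \mid \mathcal{D}) = \prod_{j=1}^{d} \mathcal{N}\big(\eta_j;\ \mu_j,\ \sigma_j^2\big) \tag{2}$$
where
$$\frac{1}{\sigma_j^2} = \frac{1-K}{\sigma_{0,j}^2} + \sum_{k=1}^{K} \frac{1}{\sigma_{k,j}^2}, \qquad \mu_j = \sigma_j^2 \left( \frac{(1-K)\,\mu_{0,j}}{\sigma_{0,j}^2} + \sum_{k=1}^{K} \frac{\mu_{k,j}}{\sigma_{k,j}^2} \right).$$
Here the subscript $j$ denotes the $j$-th element of the vector, and $\mu_{0,j}$ and $\sigma_{0,j}^2$ are the mean and variance of the Gaussian prior $p(\eta)$. The posterior is parameterized by $\theta$ as $q_\theta(\eta \mid \mathcal{D})$ through the parameterization of the functions $\mu$ and $\sigma$, which are realized by deep neural networks. Also, $\theta$ includes the scalars $\mu_{0,j}$ and $\sigma_{0,j}$ ($j = 1, \dots, d$). The parameters $\theta$ are optimized so as to approximate the true posterior well.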
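The aggregation step can be illustrated for one scalar parameter. The sketch below assumes that the chunks are conditionally independent given the parameter, so that the full posterior is proportional to the prior raised to the power $1-K$ times the product of the per-chunk posteriors. The function name `aggregate_gaussians` and the numbers are hypothetical, and this is one reading of the aggregator rather than the authors' implementation.

```python
def aggregate_gaussians(chunk_means, chunk_vars, prior_mean, prior_var):
    """Combine K per-chunk Gaussian posteriors over one scalar parameter.

    Works in precision (inverse-variance) form: precisions add, and the
    prior enters with weight (1 - K) because each chunk posterior already
    contains one copy of it.
    """
    num_chunks = len(chunk_means)
    precision = (1 - num_chunks) / prior_var + sum(1.0 / v for v in chunk_vars)
    if precision <= 0.0:
        raise ValueError("aggregated precision must be positive")
    weighted_mean = (1 - num_chunks) * prior_mean / prior_var + sum(
        m / v for m, v in zip(chunk_means, chunk_vars)
    )
    return weighted_mean / precision, 1.0 / precision


# Four chunk estimates of equal confidence under a broad prior.
mean, var = aggregate_gaussians(
    chunk_means=[0.9, 1.1, 1.0, 1.0],
    chunk_vars=[0.04, 0.04, 0.04, 0.04],
    prior_mean=0.0,
    prior_var=100.0,
)
```

With a broad prior the result is close to the precision-weighted average of the chunk means, and the variance shrinks roughly as one over the number of chunks.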
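A popular way of approximating the posterior is to minimize the KL divergence between the true posterior $p(\eta \mid \mathcal{D})$ and its approximation $q_\theta(\eta \mid \mathcal{D})$, i.e.,
$$\min_\theta\ \mathrm{KL}\big(q_\theta(\eta \mid \mathcal{D})\ \|\ p(\eta \mid \mathcal{D})\big).$$
This posterior approximation problem for each environment can be solved without explicitly evaluating the true posterior $p(\eta \mid \mathcal{D})$ when we consider the following evidence lower bound optimization [12]:
$$\max_\theta\ \mathbb{E}_{q_\theta(\eta \mid \mathcal{D})}\big[\log p(\mathcal{D} \mid \eta)\big] - \mathrm{KL}\big(q_\theta(\eta \mid \mathcal{D})\ \|\ p(\eta)\big).$$
Here we can use the reparameterization trick [12], which replaces the expectation with respect to $q_\theta(\eta \mid \mathcal{D})$ with an expectation over a standard Gaussian variable $\xi$ by interpreting the Gaussian posterior as the result of the elementwise variable transformation $\eta = \mu + \sigma \odot \xi$ with $\xi \sim \mathcal{N}(0, I)$.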
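The two ingredients of such an objective, reparameterized sampling and the closed-form KL term between diagonal Gaussians, can be sketched as follows. The quadratic `toy_log_lik` is a hypothetical stand-in for the state-transition likelihood, and all function names are illustrative assumptions.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """eta = mu + sigma * xi with xi ~ N(0, 1), applied elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def elbo_estimate(log_lik, mu_q, sigma_q, mu_p, sigma_p, rng, num_samples=32):
    """Monte Carlo estimate of E_q[log p(D | eta)] - KL(q || prior)."""
    expected = sum(
        log_lik(reparameterize(mu_q, sigma_q, rng)) for _ in range(num_samples)
    ) / num_samples
    return expected - kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)


def toy_log_lik(eta):
    """Hypothetical log-likelihood peaked at eta = 0.8."""
    return -50.0 * (eta[0] - 0.8) ** 2


rng = random.Random(0)
good = elbo_estimate(toy_log_lik, [0.8], [0.1], [0.0], [1.0], rng)
bad = elbo_estimate(toy_log_lik, [0.2], [0.1], [0.0], [1.0], rng)
```

A posterior centred on the likelihood peak scores a higher bound than one centred away from it. In a framework with automatic differentiation, the same sampling form lets gradients flow to the mean and standard deviation.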
It is important to consider a chunk of temporal sequences instead of standalone tuples (unlike [27]) so as to effectively realize the posterior of complex dynamics parameters like friction, gravity etc. In general, the posterior of the dynamics parameters can be a complex multimodal distribution, but it approaches a Gaussian when the number of samples in the temporal chunk increases and the statistical model is ‘regular’, according to the central limit theorem [26]. The different estimates of $\eta$ from temporal chunks of length $L$ each in each episode of the rollouts are the ‘Elemental Dynamics Estimator’ in Fig. 1. Their ‘Aggregator’ is described by the optimization problem above.
Given state-transition data, we can use the trained model to infer the dynamics parameter vector of the test environment. It is important to note that collecting data in the test environment for this system parameter identification is inexpensive because we only need off-policy data, which can be collected by simply running a random policy or a different pre-trained ‘safe’ policy.
III-E Test Time inference
At test time we are given a simulation environment $\mathcal{E}_{\mathrm{test}}$ with dynamics parameters $\eta_{\mathrm{test}}$ which are unknown. Let $M$ denote our trained model components that have been adapted through training in different dynamics configurations. Our aim now is to transfer the policy learned in training without any fine-tuning in the test environment, i.e. we are not allowed to train again in $\mathcal{E}_{\mathrm{test}}$. We can do this by using the inferred $\hat{\eta}_{\mathrm{test}}$ in lieu of the ground-truth $\eta_{\mathrm{test}}$ and running forward inference through the trained model $M$. This scheme is demonstrated in Algorithm 2. Although our approach learns a very good zero-shot initialization in the test environment, we also show comparisons with other models that require on-policy fine-tuning in the test environment (Section IV). Fine-tuning corresponds to updating the parameters of the policy architecture while executing in the test environment. For MANGA, to achieve good zero-shot initialization, the only execution needed in the test environment is running a random policy (or some external trained policy) to collect state transition data for feeding into the trained dynamics estimation module.
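The test-time procedure can be sketched as three steps: probe with a random policy, infer the dynamics, then act with the frozen policy. This is not the paper's Algorithm 2. `zero_shot_deploy` and its arguments are hypothetical, the environment is a toy one-dimensional system, and a least-squares fit stands in for the trained dynamics estimator.

```python
import random


def zero_shot_deploy(env_step, reset, estimate_eta, policy, action_range,
                     num_probe_steps=100, num_eval_steps=50, seed=0):
    """Probe with random actions, estimate the dynamics, then act without fine-tuning."""
    rng = random.Random(seed)
    # 1) Off-policy data collection with uniformly random actions.
    transitions, s = [], reset()
    for _ in range(num_probe_steps):
        a = rng.uniform(*action_range)
        s_next = env_step(s, a)
        transitions.append((s, a, s_next))
        s = s_next
    # 2) Infer the dynamics parameter from the probe data.
    eta_hat = estimate_eta(transitions)
    # 3) Run the frozen policy conditioned on the estimate; no parameter updates.
    s, visited = reset(), []
    for _ in range(num_eval_steps):
        s = env_step(s, policy(s, eta_hat))
        visited.append(s)
    return eta_hat, visited


ETA_TRUE = 0.8  # unknown to the agent; used only inside the toy environment


def env_step(s, a):
    return s + 0.1 * (a - ETA_TRUE * s)


def estimate_eta(transitions):
    """Least-squares stand-in for the learned estimator (toy model only)."""
    num = sum(s * (s_next - s - 0.1 * a) for s, a, s_next in transitions)
    den = 0.1 * sum(s * s for s, _, _ in transitions)
    return -num / den


def policy(s, eta_hat):
    """Cancels the estimated damping and steers the state towards 1.0."""
    return eta_hat * s + 2.0 * (1.0 - s)


eta_hat, visited = zero_shot_deploy(env_step, lambda: 0.0, estimate_eta, policy, (-1.0, 1.0))
```

The only interaction with the test environment before deployment is the random probing phase, which is the property the text emphasizes.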
III-F Adapting to variations in motor noise
In this section we discuss a scheme to make our model robust to motor noise, which is an important consideration for real robotic tasks [7]. We interpret the addition of motor noise as a form of domain randomization, and consider that in reality we have some specific state-dependent deviation. The implication of motor noise is the same as adding a disturbance to the output action of our policy model. In order to infer a model for the disturbance $\delta_t$, we assume it to be a function of the current state weighted by an environment-dependent parameter $w$. Hence, $\delta_t = w^{\top} \phi(s_t)$, where $\phi$ is a nonlinear mapping, specifically a feedforward neural network whose parameters have been randomly assigned and fixed (similar random networks have been used for exploration and uncertainty estimation in RL [6]). When $w$ is randomly set, with a large enough output dimension of $\phi$, during the training of the policy, the training scheme under this motor noise is similar to a form of domain randomization. However, we actively identify the perturbation caused by this in environment $\mathcal{E}_i$ through the estimation of $w_i$.
Let the original predicted action at timestep $t$ be $a_t$. The action that is fed to the simulator, because of the motor noise, now becomes $a_t + \lambda \delta_t$. Since $w$ is an environment-dependent parameter just like $\eta$, we estimate the concatenated vector $[\eta, w]$ through the scheme described in Section III-D for estimating $\eta$. Here $\lambda$ is a scalar multiplier to the noise.
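A minimal sketch of such a state-dependent disturbance is given below, assuming a scalar action and a single-layer random feature map. The names `make_random_features` and `noisy_action` are hypothetical, as are the layer size and the tanh nonlinearity.

```python
import math
import random


def make_random_features(state_dim, num_features, seed=0):
    """Fixed random one-layer network phi: state -> features (never trained)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)]
               for _ in range(num_features)]

    def phi(state):
        return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]

    return phi


def noisy_action(action, state, w, phi, lam):
    """Scalar action plus the disturbance lam * <w, phi(state)>."""
    disturbance = sum(wi * fi for wi, fi in zip(w, phi(state)))
    return action + lam * disturbance


phi = make_random_features(state_dim=3, num_features=4, seed=1)
state = [0.2, -0.1, 0.4]
env_rng = random.Random(7)
w_env = [env_rng.uniform(-1.0, 1.0) for _ in range(4)]   # resampled per environment
executed = noisy_action(0.5, state, w_env, phi, lam=0.3)
```

Because the feature map is fixed, only the weight vector differs between environments, so it can be appended to the dynamics vector and estimated by the same procedure.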
III-G State reconstruction and ignoring nuisance correlates
Since we are training policies adaptable to variations in the environment, we need to ensure that our agent’s policy does not unfairly make correlations with state changes, like changes in brightness, direction of light, location of shadow etc., that do not occur as a result of the policy. Previous works like [16] do not consider this issue; however, we argue that it is important for the very reason that we consider randomized environments. To tackle this and avoid learning nuisance correlates, we enforce an inverse dynamics model based regularization, which was previously used in the Intrinsic Curiosity Module of [15]. Let $z_t$ be the latent state at timestep $t$, and $f_{\mathrm{inv}}$ be the inverse dynamics model such that the predicted action is $\hat{a}_t = f_{\mathrm{inv}}(z_t, z_{t+1})$. The loss function is:
$$\mathcal{L}_{\mathrm{inv}} = \lVert a_t - \hat{a}_t \rVert_2^2.$$
In addition to the regularization via an inverse dynamics model, we also enforce input state reconstruction from the learned latent representation. This is important because we do not want our policy to get conditioned on a latent state which can never be reached from the observation space. Thus we aim to learn a reconstruction $\hat{s}_t = f_{\mathrm{rec}}(z_t)$ such that $\lVert s_t - \hat{s}_t \rVert$ is minimized. The loss function is:
$$\mathcal{L}_{\mathrm{rec}} = \lVert s_t - \hat{s}_t \rVert_2^2.$$
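The two auxiliary terms can be sketched as follows. The squared-error form of both losses, and the toy inverse model and decoder, are assumptions for illustration; the actual models are neural networks trained jointly with the policy.

```python
def squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def auxiliary_losses(inverse_model, state_decoder, z_t, z_next, action, state):
    """Return (inverse-dynamics loss, reconstruction loss) for one transition."""
    action_pred = inverse_model(z_t, z_next)   # action predicted from consecutive latents
    state_recon = state_decoder(z_t)           # state reconstructed from the latent
    return squared_error(action_pred, action), squared_error(state_recon, state)


# Toy stand-ins: the action is the latent displacement, the decoder is the identity.
def toy_inverse(z_t, z_next):
    return [b - a for a, b in zip(z_t, z_next)]


def toy_decoder(z_t):
    return list(z_t)


inv_loss, rec_loss = auxiliary_losses(
    toy_inverse, toy_decoder,
    z_t=[0.0, 1.0], z_next=[1.0, 1.0],
    action=[1.0, 0.0], state=[0.0, 1.0],
)
```

The inverse term only rewards latent features that explain the agent's own actions, which is why it discourages encoding changes the agent did not cause.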
IV Experiments
Through a series of experiments, we demonstrate the necessity of the different components of the proposed MANGA approach (ablation study) and compare against some external baselines for adaptation to different dynamics at test time. We also examine how adaptive MANGA is to changes in the range of the dynamics parameter variations and how it adapts to motor noise variations at test time.
IV-A MuJoCo Environments (OpenAI Gym)
We consider three different MuJoCo environments [23] of varying complexity: Humanoid-v2, HalfCheetah-v2, and Hopper-v2, where the task in each environment is to move the agent as fast as possible without toppling over [5]. For consistency in comparison with external baselines, we use the default reward setting for each environment as specified in [5] and alter the following dynamics variables for evaluation: mass and inertia of the agent, gravitational acceleration, friction coefficient between the agent and the environment, stiffness coefficient of joints, and damping coefficient.
Each dynamics variable for MuJoCo is in general a vector (for example, the mass vector consists of the masses of the different parts of the HalfCheetah body), and the vectors have different dimensions. We consider $\eta$ to be the linearized concatenation of all these variables. Let $\eta_j$ denote the $j$-th dynamics variable in $\eta$, whose base value is $\bar{\eta}_j$.
During training, we randomize $\eta$ such that each component $\eta_j$ gets perturbed in the range $[(1 - r)\,\bar{\eta}_j,\ (1 + r)\,\bar{\eta}_j]$. Here $r$ denotes the randomization range, and we perform experiments with $\eta$ randomly chosen in the respective range specified by $r$.
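A minimal sampler for this kind of perturbation is sketched below, assuming that every component is drawn uniformly and independently within a fraction `r` of its base value. The function name `sample_dynamics` and the base values are placeholders, not the MuJoCo defaults.

```python
import random


def sample_dynamics(base_values, r, rng):
    """Perturb each base value independently within +/- r (a fraction) of itself."""
    return [b * (1.0 + rng.uniform(-r, r)) for b in base_values]


rng = random.Random(0)
base_values = [6.36, 1.0, -9.81, 0.4]   # placeholder mass, inertia, gravity, friction
eta = sample_dynamics(base_values, r=0.3, rng=rng)
```

Scaling relative to the base value keeps the sign of quantities such as gravity and gives every component the same relative spread regardless of its units.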
IV-B Training details
Although our proposed approach is method-agnostic, and $\mathcal{L}_{\mathrm{algo}}$ in Algorithm 1 can be any RL or IL algorithm, for our specific implementation we used the RL algorithm Proximal Policy Optimization (PPO) [19]. We used the SGD [3] optimizer for optimization and the PyTorch [14] library in Python for the implementation. For training the dynamics estimator in Fig. 1, we found that choosing a temporal chunk of length timesteps (for each elemental dynamics estimator) performed well. All the functions described in Fig. 1 and Section III are realized by feedforward neural networks. Other details including the baselines are described in the subsequent sections.
IV-C Ablation study
We postulate that the auxiliary modules, namely the inverse dynamics model and the state-reconstruction decoder, are needed to learn a good latent space for effective transfer. MANGA refers to the proposed approach with all components present. For reference, we compare MANGA with an Oracle. Oracle refers to an untrained agent of the same architecture as MANGA that has access to the ground-truth system parameters and is trained from scratch directly in the test environment. In Fig. 2, we show results by selectively ablating different components of the proposed model when the dynamics parameters are perturbed in the range of of base values. There is clearly a drop in performance on the test environment when we remove either or both of the auxiliary modules. Interestingly, removing the inverse dynamics model causes a very sharp decrease in performance across all three MuJoCo domains. Hence, it is clear that ignoring nuisance correlates between states and actions is important for quick and effective transfer.
IV-D Comparisons with existing methods in literature
We compare the performance of MANGA with existing approaches in a new environment whose dynamics parameters are perturbed in the range of of base values. Note that perturbation is large enough to cause significant performance drop when a version of MANGA was trained in the base environment only and then tested in the randomized test environment without any adaptation (, , respectively for HalfCheetah, Hopper, and Humanoid) . The results are in Fig. 2.
We consider two external baselines, namely Domain Randomization (DR) and meta-learning. For DR, we followed the implementation of the state-of-the-art dynamics randomization paper [16] with two variants, LSTM and FF. LSTM is the variant that uses an LSTM policy and value architecture while implicitly identifying the system dynamics parameters during policy learning [1]. FF uses the same feedforward policy architecture as our MANGA model, without any system parameter identification. Since the LSTM variant is computationally very expensive and takes a long time to train, we perform only one type of comparison against it. For meta-learning, we implemented No-Reward MAML [27], which performs significantly better than vanilla MAML [8] for the scenario of transfer to different environment dynamics. To ensure a fair comparison, all models were trained for the same number of episodes, with the same maximum number of timesteps per episode, and the same optimizer was used for all experiments.
IV-E Analysis with different randomization ranges
The extent to which we need to randomize the dynamics parameters during training depends on how different the test environment is likely to be with respect to the default setting. We experimented with different test environments in the range of maximum variation of dynamics parameters from the default value. The range of randomized environment during training was also the same ( respectively) in each case.
As evident from Fig. 3, the performance of all the compared models decreases when the range of parameter variations is increased. However, the drop in performance of MANGA is the smallest among the compared methods. We attribute this favourable behavior primarily to the fact that we have separated the processes of system parameter identification and policy learning with regularization. Hence, the latent space learned for conditioning the policy is not negatively affected by the training of the system parameter identification module.
IV-F Quick Adaptation: Rollouts in the test environment
Although most approaches for policy transfer [27, 30, 32] need rollouts in the test environment for reasonably good transfer, our proposed approach adapts a good policy zero-shot by estimating dynamics parameters based on the observation of random off-policy state transition data, as shown by the reward at episode 0 of the plots in Fig. 4. Furthermore, we observe that if allowed to update model parameters in the test environment (i.e. fine-tuning), MANGA quickly converges and achieves reward equivalent to the Oracle within only a few hundred episodes.
IV-G Evaluation of performance in the presence of Motor Noise
We consider two variants of MANGA here: MANGA-Noise and MANGA-NoNoise. MANGA-Noise has been trained by considering random values of $w$ corresponding to each randomized environment, and a fixed $\phi$ (i.e. the weights of the random network are fixed) during training. We learn a model for estimating the value of $w$ along with $\eta$ as described in Sections III-D and III-F. At test time we consider two situations of motor noise: known noise and unknown noise. Known noise corresponds to the case when, at test time, the value of $\lambda$ is the same as that during training, while unknown noise corresponds to the case when the value of $\lambda$ at test time is different from that during training.
It is evident from Fig. 5 that MANGA-Noise effectively estimates the weight vectors and achieves much higher reward than MANGA-NoNoise in the presence of noise in the test environment. This suggests the effectiveness of the noise estimation technique described in Section III-F.
V Conclusion
In this paper, we introduced a general framework for policy transfer that decouples the processes of policy learning and system identification, is agnostic to the algorithm used for training it and can quickly adapt to an environment at test time with variations in dynamics and motor noise. We compared the proposed approach with existing algorithms for policy transfer and demonstrated its efficacy with respect to robustness to the range of dynamics variations, variation in motor noise, quick adaptation to a test environment and learning of a transferable latent space for policy conditioning.
VI Acknowledgement
We would like to acknowledge the support of Crissman Loomis-san, Takashi Abe-san, Masanori Koyama-san, Yasuhiro Fujita-san and many other colleagues at Preferred Networks Tokyo who helped shape this work through the amazing research discussions during Homanga Bharadhwaj’s internship. We also thank Florian Shkurti (University of Toronto) for his valuable feedback on the draft and help in editing it.
References
[1] (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
[2] (2019-05) A data-efficient framework for training and sim-to-real transfer of navigation policies. In 2019 International Conference on Robotics and Automation (ICRA), pp. 782–788. ISSN 2577-087X.
[3] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186.
[4] (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731.
[5] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
[6] (2018) Exploration by random network distillation. CoRR abs/1810.12894.
[7] (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.
[8] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
[9] (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527.
[10] (2019) SplitNet: sim2sim and task2task transfer for embodied visual navigation. In International Conference on Computer Vision (ICCV).
[11] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[12] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[13] (2019) Learning to augment synthetic images for sim2real policy transfer. arXiv preprint arXiv:1903.07740.
[14] (2017) Automatic differentiation in PyTorch.
[15] (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
[16] (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
[17] (2018) The Lyapunov neural network: adaptive stability certification for safe learning of dynamical systems. In Conference on Robot Learning, pp. 466–476.
[18] Safe visual navigation via deep learning and novelty detection.
[19] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[20] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
[21] (2018) Universal planning networks: learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pp. 4739–4748.
[22] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. CoRR abs/1703.06907.
[23] (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
[24] (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
[25] (2018) Intervention aided reinforcement learning for safe and practical policy optimization in navigation. In Conference on Robot Learning, pp. 410–421.
[26] (2009) Algebraic geometry and statistical learning theory. Cambridge University Press.
[27] (2019) NoRML: no-reward meta learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 323–331.
[28] (2018) Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 7332–7342.
[29] (2019) Sim-to-real transfer for biped locomotion. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[30] (2019) Policy transfer with strategy optimization. In International Conference on Learning Representations (ICLR).
[31] (2017) Preparing for the unknown: learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453.
[32] (2018) Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689.
[33] (2017) Adversarial discriminative sim-to-real transfer of visuomotor policies. arXiv preprint arXiv:1709.05746.
[34] (2019) Sim-real joint reinforcement transfer for 3D indoor navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11388–11397.
[35] (2018) CAML: fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642.