## I Introduction

Robot learning [yu2017preparing, peng2018sim, andrychowicz2020learning] has been shown to work not only in simulation but also in real-world robotic control. In this domain, reinforcement learning (RL) [sutton1998introduction, duan2016benchmarking, dong2020deep] methods are typically applied to robotic control via sim-to-real transfer [peng2018sim, yu2019sim, valassakis2020crossing]. As one of the most widely applied methods in the sim-to-real literature, domain randomization (DR) helps to learn policies that are general and robust enough to be applied in various environments that do not permit direct access to their underlying dynamics. System identification (SI) is another main category of methods that help to bridge the domain gap. SI methods usually infer the physical parameters from historical transition data, either explicitly or implicitly. Both approaches have been shown to be feasible for some control tasks [andrychowicz2020learning, yu2017preparing] when facing sim-to-real transfer problems or, more generally, domain transfer problems. However, successful execution of a task does not necessarily indicate that an optimal control strategy has been achieved, which leaves room for further improvement by addressing the defects of the above approaches.

The problems of existing methods have been investigated previously. Conventional DR methods [tobin2017domain] provide a policy that generalizes better across various environments, but not the optimal policy for a specific environment. The optimization objective is taken with respect to an expectation over the distribution of system dynamics properties, rather than directly on the true dynamics of the testing environment. Therefore, awareness of the true dynamics of a new environment is necessary for optimal control on a task, and this is lacking in present DR work.
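The contrast between the DR objective and the environment-specific objective can be written out explicitly. The following is only a sketch in notation assumed here for illustration (it is not taken from the source): $\mu$ denotes the dynamics parameters, $p(\mu)$ the randomization distribution, $\mu^{*}$ the true dynamics of the testing environment, and $J(\pi; \mu)$ the expected discounted return of policy $\pi$ under dynamics $\mu$. Both maximizations are assumed to range over the same policy class.

```latex
\begin{align}
\pi_{\mathrm{DR}} &= \arg\max_{\pi} \; \mathbb{E}_{\mu \sim p(\mu)} \big[ J(\pi; \mu) \big], \\
\pi^{*}_{\mu^{*}} &= \arg\max_{\pi} \; J(\pi; \mu^{*}), \\
J(\pi_{\mathrm{DR}}; \mu^{*}) &\le J(\pi^{*}_{\mu^{*}}; \mu^{*}).
\end{align}
```

The last inequality holds by definition of the maximizer and is strict whenever the policy that is best on average differs in return from the policy that is best for $\mu^{*}$.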

SI can help bridge this gap by actively inferring the system dynamics from the collected data. In [yu2017preparing], an online system identification (OSI) module is used to explicitly identify the dynamics parameters of environments, and a subsequent universal policy (UP) conditioned on the identified parameters generates adaptive actions in new environments. However, both the UP and OSI modules can be infeasible in practice for general robot learning tasks, according to the discussion in [valassakis2020crossing]. OSI may not be feasible because the trajectories may not provide enough information to accurately predict the dynamics parameters as a point estimate; distributions can therefore better represent the uncertainty in dynamics identification. Directly predicting the dynamics parameters can also be an ill-posed problem, according to [zhu2018efficient]: because the dynamics parameters are entangled, the mapping from parameters to trajectories is non-injective, and multiple candidate dynamics models can lead to the same trajectories. The UP, in turn, can be hard to train when the dimension of the dynamics parameters is high.

Present methods for system dynamics identification usually require aligned trajectories to construct a contrastive loss [zhu2018efficient, chebotar2019closing], which we find is not a necessity. In previous methods, trajectory alignment is usually achieved by setting the same initial states and executing the same policy, or by rolling out the simulation environment from a given state and a given action for a single timestep at a time. However, not all simulators support being set directly to an arbitrary state of the environment. Setting only the initial states leads to compounding errors along the trajectories due to differences in dynamics or random seeds, or requires undesirable engineering effort. Our proposed method removes the trajectory alignment procedure by leveraging a well-trained forward dynamics prediction model.

This paper makes the following contributions: 1) We demonstrate the limited performance of typical DR methods when the dynamics of the testing environment are uncertain or varied, compared against our proposed method; 2) We propose a method that handles dynamics uncertainty and unmodeled effects [andrychowicz2020learning] separately, with an SI module for implicitly identifying the dynamics parameters and DR for unmodeled effects, including observation noise, observation delay and action noise; 3) For system dynamics identification, our method does not depend on aligned trajectories collected in the source and target domains, but only on randomly sampled trajectories generated by the same control policy; 4) Our method learns a regularized dynamics embedding rather than using the oracle dynamics parameters as the representation of system dynamics, in order to handle the entangled effects or redundant information within the dynamics parameters. A universal policy conditioned on the learned dynamics embedding is trained with DR for both dynamics and unmodeled effects.

## II Related Work

Existing methods for bridging the domain transfer gap (*e.g.*, sim-to-real) can be broadly categorized into the following classes: 1) Domain Randomization [tobin2017domain] randomizes either visual features [james2017transferring] or dynamics parameters [peng2018sim] in a source domain to train policies with better adaptability to the target domain. This branch of work also includes the recent progress in structured DR [prakash2019structured] and active DR [mehta2020active]. 2) System Identification usually requires an extensive data collection and calibration process to mitigate the gap between the source and target environments. Explicit SI can be combined with a universal policy [yu2017preparing] to achieve adaptive control of the real robot under various environment settings. Implicit SI usually uses recurrent units such as long short-term memory (LSTM) networks with sequential inputs to preserve information about the environment dynamics [peng2018sim, andrychowicz2020learning]. 3) Domain Adaptation (DA) [wang2018deep, tanwani2020domain] applies transfer learning techniques to match the distribution of source domain data with that of the target domain data, and is often applied in visuomotor control with images. 4) Strategy Optimization (SO) [yu2018policy, yu2019sim, yu2020learning] requires evaluating a family of policies (called a strategy) in the target domain and selecting the one with the best performance, usually with sampling-based methods such as Bayesian optimization (BO) [mockus2012bayesian] and CMA-ES [hansen1995adaptation].

Our work focuses on the SI approach with universal policies trained for domain transfer. DR in general cannot provide an optimal policy because of the noise introduced to increase model generality. The optimization objective for the control policy is taken with respect to an expectation over the randomized environment parameters. Thus only sub-optimal policies can be obtained in the target domain, even when the target data fall within the randomized distributions of the source. Traditional explicit SI methods [yu2017preparing] directly identify the system parameters, which has been shown to be an ill-posed problem [zhu2018efficient, valassakis2020crossing] because the entangled effects of multiple parameters lead to non-unique identification results. This problem no longer exists for implicit SI [peng2018sim, andrychowicz2020learning]. However, implicit SI does not leverage the true system parameters available in the source domain, which hinders its learning efficiency. To this end, we find that an embedding of the system dynamics parameters, obtained with encoders, can improve on the above approaches.

## III Preliminaries

### III-A Notations

A typical formulation of RL problems follows a standard Markov decision process (MDP), which can be represented as $(\mathcal{S}, \mathcal{A}, R, \mathcal{T}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the feasible sets of states and actions, and $R$ is the reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. $\mathcal{T}$ defines the transition probability from the current state $s$ and action $a$ to the next state $s'$ based on a fixed dynamics setting, $\mathcal{T}(s' | s, a)$, and $\gamma$ is a discount factor. However, this formulation is not sufficient for a learned policy to be applied in a transferred domain. We consider a partially observable MDP (POMDP) with randomized dynamics, $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{M}, R, \mathcal{T}, \mathcal{E}, \gamma)$, with the additional $\mathcal{M} \subseteq \mathbb{R}^{n}$ as the space of dynamics parameters ($n$ is the number of dynamics parameters), $\mathcal{O}$ as the observation space and $\mathcal{E}$ as the emission probability distribution $\mathcal{E}(o | s)$. Moreover, the transition probability then differs from that of the standard MDP, since it further depends on the dynamics parameters $\mu \in \mathcal{M}$, as $\mathcal{T}(s' | s, a, \mu)$. Note that we define $\mathcal{E}$ to be independent of the dynamics parameters $\mu$: we disentangle the observed transition process into the transition in the underlying state space and the emission of observations from states, so only the transition of the underlying states depends on the dynamics parameters $\mu$.

## IV Methodology

In this paper, we propose a method for domain transfer that uses not only DR but also SI in an embedding space. The overview of our method is shown in Fig. 1. Three key components are involved in the framework:

Dynamics Encoder:

Forward Dynamics Predictor:

Universal Policy:

The Dynamics Encoder is an embedding network that generates a low-dimensional embedding from the system parameters. In practical robot learning tasks, dozens of system parameters can be involved in simulation [valassakis2020crossing, andrychowicz2020learning], which would severely increase the difficulty of universal policy learning. Moreover, explicit identification of each parameter value is neither necessary nor possible in general [zhu2018efficient]. For example, an increase in the friction coefficient of a specific joint may be counteracted by a decrease in the mass of some link bodies when we look only at the trajectories executed by the robot with the same torques applied. Such entangled effects of system parameters are considerable in the SI process. The Forward Dynamics Predictor is learned to mirror both the transition function of the environment with randomized dynamics and the observation emission function that maps states to observations. By leveraging the conditional dependency of this predictor on the embedding, we propose to directly optimize the embedding with Bayesian optimization when given data from the target domain, thereby achieving SI in the embedding space. The Universal Policy basically follows the standard setting in [yu2017preparing], where a concatenation of the observation and the identified parameter embedding is taken as the policy input. By conditioning the policy on the system properties, it can adaptively select optimal actions under different system settings and randomized parameters. We give details in later sections.
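The friction-versus-mass example can be made concrete with a toy simulation. The sketch below is an illustration only, not the authors' setup: it assumes a one-dimensional body driven by a constant torque-like input against a constant friction force, and the function `rollout` and all numerical values are hypothetical.

```python
def rollout(mass, friction, torque, steps=50, dt=0.01):
    """Simulate a 1-D body pushed by a constant input against a constant friction force."""
    pos, vel, traj = 0.0, 0.0, []
    for _ in range(steps):
        acc = (torque - friction) / mass  # net force divided by mass
        vel += acc * dt                   # explicit Euler integration
        pos += vel * dt
        traj.append((pos, vel))
    return traj


# Higher friction combined with lower mass reproduces the same trajectory
# when the same torque is applied in both systems.
traj_a = rollout(mass=1.0, friction=0.5, torque=1.5)
traj_b = rollout(mass=0.5, friction=1.0, torque=1.5)
max_gap = max(abs(pa - pb) + abs(va - vb)
              for (pa, va), (pb, vb) in zip(traj_a, traj_b))

# A different torque separates the two systems again.
differs_under_new_torque = rollout(1.0, 0.5, 2.0) != rollout(0.5, 1.0, 2.0)
```

In this toy system the two parameter sets are indistinguishable from the first pair of trajectories, so a point estimate of mass and friction from that data is not unique, while a representation that only captures their combined effect would still be sufficient for prediction.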

### IV-A Theoretical Insights

In the following, we provide the theoretical justification of our method. Since we assume the observation emission process has no impact on the dynamics of the underlying state space, the analysis below is based on a fully observable MDP with randomized dynamics but no observations. Apart from this, another difference between our practical implementation and the theoretical analysis is the use of the dynamics embedding, *i.e.*, using the embedding instead of the dynamics parameters themselves. These two differences do not cause a loss of generality in our analysis.

Following a similar setup in [yang2019imitation], we extend the definitions of different occupancy measures for MDP with randomized dynamics as in Table IV-A.

Occupancy Measure | State-Action | State Transitions | Joint w/o Dynamics | Joint w/ Dynamics |
---|---|---|---|---|
Denotation | | | | |
Support | | | | |
Definition | | | | |

###### Definition 1 (Estimated Dynamics).

The estimated dynamics distribution is defined as the distribution of the dynamics parameters estimated from given transition tuples, and it can be calculated as:

(1) |

In practice, the estimated dynamics can be represented as a parameterized model with learnable parameters and optimized with dataset from the source domain. During training or evaluation, the predicted distribution and true distribution are denoted as and respectively. The optimization objective for SI can be formulated as minimizing the discrepancy of estimated and true dynamics distributions, *e.g.*, . We will prove that the method of leveraging the dynamics prediction model to achieve SI (step (c). in the scheme as in Fig. 1) is a valid approach for this objective.

###### Lemma 2.

Given the same dataset distribution , the difference between the estimated dynamics distribution and the true distribution can be characterized as:

(2) | |||

(3) |

###### Definition 3 (Forward Dynamics Prediction).

The forward dynamics prediction model is defined as the distribution of next state given the current tuple in a MDP with randomized dynamics, as an approximation of the true underlying dynamics in the source/target domain:

(4) |

In practice, this model approximation process can be achieved with optimization on collected dataset in the source domain. We denote the parameters of forward dynamics prediction as , which are learned in the approximation process with objective (step (a)), where are given and are the true distributions of the forward prediction. After training, is evaluated on the testing dataset (*e.g.*, the target domain dataset), with its predicted distribution denoted as and true distribution as . In our proposed method leveraging the forward dynamics prediction, the SI process is characterized as (step (c)), where and are given. In conventional SI, without , this optimization objective actually becomes , where and are the transition functions of the source and target domains respectively, and it requires the simulator to set the same as the real world does all the time for a sim-to-real transfer example.

Now we have another lemma of discrepancy between and as follows.

###### Lemma 4.

The distance between the distribution given by the forward dynamics prediction and the true distribution can be formulated with the KL-divergence; it thus follows:

(5) |

where are shorten for , and are shorten for .

###### Theorem 5.

The optimization of forward dynamics prediction is increasing the lower bound of the optimization objective for improving the estimated dynamics, i.e.,

(6) |

where are and are .

The above theorem tells us that direct optimization of the parameters of the estimated dynamics can be approximately achieved by first optimizing the parameters of the forward dynamics prediction model and then leveraging that model to optimize the dynamics estimate. Proofs of the above theorem and lemmas are all provided in App. VI-A.

In further analysis, we note that steps (a) and (c) in our method constitute an SI process, carried out not in the original parameter space but in an embedding space. As proposed in [tanwani2020domain], domain-invariant representation learning can be achieved by minimizing both the marginal discrepancy (called marginal distributions alignment) and the conditional discrepancy (called conditional distributions alignment) for an object classification task. Here we try to achieve a domain-invariant forward prediction model by optimizing the embedding of the system dynamics. We will show how our method implements the minimization of the conditional discrepancy. Marginal distribution alignment is not achieved in the current setting because it requires the true system dynamics parameters in the target domain, which are unavailable.

For conditional distributions alignment, we first modify the definition in [tanwani2020domain] by adding a conditioning variable. We then narrow the definition from general domain-transfer models to the specific forward dynamics prediction model in our method.

###### Definition 6 (Conditional Distributions Alignment (modified from [tanwani2020domain])).

Given two domains and drawn from random variables and with different output conditional probability distributions , conditional alignment corresponds to finding the transformation such that the discrepancy between the transformed conditional distributions is minimized, *i.e.*, . Note that the number of additional variables is not limited to one.

In our case, with the forward dynamics prediction model and the dynamics encoder, conditional alignment for the forward dynamics prediction model optimizes the dynamics embedding such that the discrepancy between the resulting prediction distributions is minimized. In the source domain, the embedding is optimized by back-propagating through the dynamics encoder, while in the target domain the embedding is directly optimized with methods like BO because the true dynamics parameters are unavailable. From the above analysis, our method (specifically steps (a) and (c)) accomplishes an extended conditional distribution alignment for identifying the dynamics embedding.

### IV-B Universal Policy with Embedding System Identification

In previous methods, SI is usually achieved with aligned trajectories [zhu2018efficient, chebotar2019closing], which we find to be unnecessary in our method. We detail the formulation of the two approaches as follows. Suppose datasets are collected in both the source and target domains, with the source dataset collected under dynamics randomization (without observation noise and action noise). Apart from deriving actions from the same policy in both domains, most present methods assume the underlying states to be the same in the source and target domains, so that a loss function can be defined for optimizing the identified system dynamics parameters. The trajectory in the underlying state space must be aligned between the source domain and the target domain to form a valid loss function, especially the initial state. In our experiments, we find this assumption in dataset collection both unnecessary and inconvenient. On the one hand, due to potential dynamics differences, the underlying states may not be well aligned between the datasets collected in the source and target domains, especially for the later states in trajectories. On the other hand, manually setting the states of the source domain to match those of the target domain may be infeasible or inconvenient in practice, *e.g.*, when it is hard to set a simulator to a certain state. Therefore, in our proposed method, trajectory alignment is no longer required for the SI process. Specifically, we assume that a forward dynamics prediction model trained on the source domain dataset will accurately predict the next observations in the target domain dataset only when the input dynamics embedding of the target environment is accurate. We can therefore construct a loss/score function for optimizing the dynamics embedding based on the forward prediction results in the target domain.

In our method, the forward dynamics prediction model is trained on the source domain dataset and further applied to the target domain dataset to fit the dynamics embedding, assuming that the target data lie within the distribution of the source data. This can be achieved by widening the randomization ranges in the source domain until the coverage is satisfactory. For learning the forward prediction model and the dynamics encoder, we have the following objectives:

(7) | |||

(8) |

where both loss terms are mean squared error (MSE) losses in our experiments, and the weighting term is a trade-off coefficient for balancing the dynamics prediction performance against the reconstruction of the dynamics parameters through encoding and decoding. This corresponds to step (a) in Fig. 1.
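The structure of this training objective can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' implementation: the names `mse`, `training_loss` and `tradeoff` are hypothetical, plain Python lists stand in for network outputs, and placing the trade-off coefficient on the reconstruction term is an assumption, since the source only states that the coefficient balances the two terms.

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(pred_next_obs, true_next_obs, decoded_params, true_params, tradeoff=0.1):
    """Forward-prediction loss plus a weighted parameter-reconstruction loss."""
    prediction_loss = mse(pred_next_obs, true_next_obs)    # forward dynamics term
    reconstruction_loss = mse(decoded_params, true_params)  # encode-decode term
    return prediction_loss + tradeoff * reconstruction_loss


# Toy numbers: prediction MSE is 2.0, reconstruction MSE is 0.25.
loss = training_loss([1.0, 2.0], [1.0, 4.0], [0.5], [1.0], tradeoff=0.1)
```

In a real training loop both terms would be computed on mini-batches from the source domain and minimized jointly with respect to the predictor and encoder parameters.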

The objective for optimizing the embedding with the learned forward dynamics prediction function is:

(9) |

where the loss is also MSE in our experiments. This corresponds to step (c) in Fig. 1.
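The embedding identification step can be illustrated with a toy search over the embedding space. The sketch below makes several assumptions that are not from the source: `predict_next` is a hypothetical closed-form stand-in for a trained forward model, the embedding is two-dimensional and bounded in $[0, 1]^2$, and plain random search is used in place of Bayesian optimization to keep the example dependency-free. The target transitions start from arbitrary states, so no trajectory alignment with the source domain is involved.

```python
import random


def predict_next(obs, action, z):
    """Hypothetical stand-in for a trained forward model conditioned on embedding z."""
    return obs + 0.1 * (z[0] * action - z[1] * obs)


def score(z, transitions):
    """Mean squared one-step prediction error under candidate embedding z."""
    errors = [(predict_next(o, a, z) - o_next) ** 2 for o, a, o_next in transitions]
    return sum(errors) / len(errors)


rng = random.Random(0)
TRUE_Z = (0.8, 0.3)  # unknown in practice; used here only to generate toy target data

# Target-domain transitions collected from arbitrary, non-aligned states and actions.
target_data = []
for _ in range(200):
    o, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    target_data.append((o, a, predict_next(o, a, TRUE_Z)))

# Random search over the embedding box (a simple substitute for BO).
best_z, best_score = None, float("inf")
for _ in range(2000):
    z = (rng.uniform(0, 1), rng.uniform(0, 1))
    s = score(z, target_data)
    if s < best_score:
        best_z, best_score = z, s
```

A BO library would replace the random proposals with queries chosen by an acquisition function, but the score being minimized would have the same form.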

The universal policy : in our method is trained in the source domain and applied for inference in the target domain. During the training process of the universal policy, observation noise and action noise are applied with randomized parameters, together with the randomization of dynamics parameters. The objective for learning the universal policy is:

(10) |

where the optimization variables are the parameters of the policy. This corresponds to step (b) in Fig. 1.
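How the unmodeled effects and the policy input fit together can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the class `UnmodeledEffects`, the function `policy_input`, the Gaussian noise model and the fixed-length delay buffer are all hypothetical choices; the source only states that observation noise, observation delay and action noise are randomized and that the policy takes the observation concatenated with the embedding.

```python
import random
from collections import deque


class UnmodeledEffects:
    """Applies observation noise, observation delay and action noise (hypothetical settings)."""

    def __init__(self, obs_noise_std, act_noise_std, delay_steps, seed=0):
        self.rng = random.Random(seed)
        self.obs_noise_std = obs_noise_std
        self.act_noise_std = act_noise_std
        # Keeps the last delay_steps + 1 observations; index 0 is the oldest one.
        self.buffer = deque(maxlen=delay_steps + 1)

    def observe(self, true_obs):
        """Return a delayed, noisy version of the true observation."""
        self.buffer.append(list(true_obs))
        delayed = self.buffer[0]
        return [x + self.rng.gauss(0.0, self.obs_noise_std) for x in delayed]

    def act(self, action):
        """Return the action perturbed by Gaussian noise."""
        return [u + self.rng.gauss(0.0, self.act_noise_std) for u in action]


def policy_input(obs, embedding):
    """Concatenate the observation with the dynamics embedding."""
    return list(obs) + list(embedding)


effects = UnmodeledEffects(obs_noise_std=0.0, act_noise_std=0.0, delay_steps=1)
seen = [effects.observe([float(t)]) for t in (1, 2, 3)]
x = policy_input([1.0, 2.0], [0.5])
```

During training, the noise scales and delay length would themselves be resampled per episode, alongside the dynamics parameters.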

## V Experiment

### V-A Comparison Methods

The comparison involves the following methods:

(1). No DR. A conservative policy is trained in a certain fixed environment without DR, as a comparison baseline.

(2). DR only. A general policy is trained with randomized environments on specified parameters and distributions.

(3). DR+UP (True). An adaptive universal policy is trained with true system parameters as additional inputs in randomized environments.

(4). DR+UP+SI. The policy is trained in the same manner as (3), but tested with a learned SI module to predict dynamics parameters from historical transitions.

(5). DR+UP+Encoding (BO). As the proposed method, an adaptive universal policy is trained with the embedding of system parameters as additional inputs in randomized environments, also with a BO process for configuring the embedding in new environment using the learned forward prediction model.

(6). DR+UP+Encoding (True). As an oracle for the proposed method, the universal policy is trained in the same way as (5), but the embedding is given by the true system parameters going through the learned encoder.

Method (1) works as a baseline for all other methods, which basically represents the optimal policy for a certain training environment and also the most conservative policy when testing in various environments. The comparison of (3) and (4) will imply the potential effects caused by the deficiency in SI module. The comparison of (5) and (6) will show the effects of embedding configuration based on BO when there are no true system parameters but only with samples of transition provided.

### V-B Experimental Setup

The environments InvertedDoublePendulum-v2 and HalfCheetah-v3 in OpenAI Gym MuJoCo are used for testing the above methods. For InvertedDoublePendulum-v2, the environment is randomized over five important dynamics parameters (damping, gravity, two geometry lengths and density), which are detailed in App. VII Tab. VII. The encoded latent space is chosen to have a dimension of 2 as a disentangled, compact representation of the system dynamics. Fig. 2 visualizes three different configurations of the task scene with randomized parameters. For HalfCheetah-v3, 13 dynamics parameters are randomized, as detailed in App. VII Tab. VII, with an embedding space of dimension 4. For training both the conservative and the adaptive universal policies, we use the twin delayed deep deterministic policy gradient (TD3) algorithm [fujimoto2018addressing], with 4 MLP layers for both the policy and the value networks. The dynamics networks and SI networks also have 4 MLP layers. The SI model predicts the system dynamics parameters from a stack of 5 frames of transitions, using the same training data as the dynamics prediction models. In our experiments, each method is tested with three runs using different random seeds.
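The 5-frame input of the SI baseline can be sketched as a small rolling buffer. This is a hypothetical illustration: the class `TransitionStack` and the choice to store each frame as the concatenation of observation, action and next observation are assumptions, since the source does not specify the exact frame layout.

```python
from collections import deque


class TransitionStack:
    """Rolling window over the most recent transitions, flattened for an SI network."""

    def __init__(self, num_frames=5):
        self.frames = deque(maxlen=num_frames)

    def push(self, obs, action, next_obs):
        """Store one transition as a flat list (assumed frame layout)."""
        self.frames.append(list(obs) + list(action) + list(next_obs))

    def ready(self):
        """True once the window holds num_frames transitions."""
        return len(self.frames) == self.frames.maxlen

    def flat(self):
        """Concatenate all stored frames, oldest first."""
        return [x for frame in self.frames for x in frame]


stack = TransitionStack(num_frames=5)
for t in range(6):
    stack.push([t, t], [0.5], [t + 1, t + 1])
features = stack.flat()
```

With a two-dimensional observation and a one-dimensional action, the flattened input has 25 entries, and the oldest transition is discarded once a sixth one arrives.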

### V-C Experimental Results

Fig. 3 shows the average learning performance in terms of episode rewards during training for all methods on the two environments. The policies trained without DR are evaluated in their specific training environments as optimal baselines. The shaded areas indicate the standard deviations. Each run takes 2000 or 25000 episodes of policy training with the TD3 algorithm for InvertedDoublePendulum-v2 and HalfCheetah-v3, respectively. The universal policies with either embedding SI or true system parameters significantly outperform the DR baselines in both environments, although they do not reach the rewards of the no-DR policies, since the episode reward is evaluated directly in the training environments (*i.e.*, non-randomized for the no-DR case but randomized for the others).

For the embedding SI process, we take InvertedDoublePendulum-v2 as an example. Fig. 4 shows the BO process for identifying the embeddings of two sets of randomly sampled system parameters, following step (c) of our proposed method (as in Fig. 1), with the dynamics prediction model trained on 10000 episodes of policy rollouts and BO for each parameter configuration run on 1000 episodes of data. After 500 iterations of BO for embedding identification, the best query point is already very close to the true embedding values, as shown in Fig. 4. Results are similar for HalfCheetah-v3, only with a higher-dimensional embedding. 2000 episodes of data are leveraged for training the dynamics prediction model and 100 episodes of data are used for BO in identifying each new testing environment.

Env | Method | Episode Reward |
---|---|---|
pendulum | No DR | |
pendulum | DR only | |
pendulum | DR+UP (True) | |
pendulum | DR+UP+SI | |
pendulum | DR+UP+Encoding (BO) | |
pendulum | DR+UP+Encoding (True) | |
halfcheetah | No DR | |
halfcheetah | DR only | |
halfcheetah | DR+UP (True) | |
halfcheetah | DR+UP+SI | |
halfcheetah | DR+UP+Encoding (BO) | |
halfcheetah | DR+UP+Encoding (True) | |

Fig. 5 and Tab. V-C display the test performance of the different methods on system parameters randomly sampled from the same distributions as used for DR, corresponding to step (d) in Fig. 1. We show the episode reward distributions of the different methods as a violin plot in Fig. 5. The policies of each method are tested on 10 randomly sampled environment dynamics for 10 episodes each, for a total of 100 episodes per method. Although the method without DR achieves good rewards in training, where the dynamics are fixed rather than randomized, in the test cases its performance distribution has heavy mass at both the head and the tail. The reason is that, without randomized dynamics in training, the policy encounters unseen cases at test time that it cannot handle well. The method with only DR also does not work well because of the difficulty of training an optimal policy in randomized environments, which has been shown to be an ill-posed problem. For both environments, the heavy tails of the no-DR method are greatly alleviated in the methods with UP, owing to its awareness of the system dynamics. However, the true dynamics parameters of the testing environment are usually not accessible, so an SI module is required to identify them. Our experiments show that the SI process with a stack of 5 frames of historical transitions (see similar settings in [yu2017preparing]) cannot provide an accurate estimate of the system dynamics, which severely degrades the performance of the UP methods. For both environments, our proposed method with embedding SI using the BO process performs better than the other methods, and even as well as the one with true system parameters.

## VI Conclusions and Discussions

We propose to optimize the dynamics embedding rather than directly identifying the system parameters from historical transitions, and demonstrate its advantage over standard DR and UP methods in a general domain transfer setting, with preliminary tests on both a low-dimensional and a high-dimensional simulated environment. These tests reveal the inability of DR policies to achieve optimal actions and the difficulty that conventional SI methods have in directly identifying the system parameters. Our future work involves extending the current framework to more complex robot learning tasks, as well as applying it to the sim-to-real transfer problem, which is a subset of domain transfer in general. The difficulty of UP learning caused by the expanded input space of high-dimensional system parameters is not exposed in the current experiments; higher-dimensional tasks will therefore be investigated as well.

## Appendix

### VI-A Theoretical Proofs

###### Lemma 2.

Given the same dataset distribution , the difference between the estimated dynamics distribution and the true distribution can be characterized as:

(12) | |||

(13) |

###### Proof.

RHS | (14) | |||

(15) | ||||

(16) | ||||

(17) | ||||

(18) | ||||

(19) | ||||

(20) |

∎

###### Lemma 4.

The distance between the distribution given by the forward dynamics prediction and the true distribution can be formulated with the KL-divergence; it thus follows:

(21) |

where are shorten for , and are shorten for .

###### Proof.

###### Theorem 5.

The optimization of forward dynamics prediction is increasing the lower bound of the optimization objective for improving the estimated dynamics, i.e.,

(29) |

where are and are .

The last line of above proof is due to that is non-negative.

## VII Randomized Parameters

Randomized dynamics parameters for InvertedDoublePendulum-v2:

Variable | Range |
---|---|
damping | [0.02, 0.3] |
gravity | [8.5, 11.0] |
length1 | [0.3, 0.9] |
length2 | [0.3, 0.9] |
density | [0.5, 1.5] |

Randomized dynamics parameters for HalfCheetah-v3:

Variable | Range |
---|---|
gravity | [5.5, 14.0] |
bthigh damping | [3.0, 9.0] |
bshin damping | [1.5, 7.5] |
bfoot damping | [1.0, 5.0] |
fthigh damping | [1.5, 7.5] |
fshin damping | [1.0, 5.0] |
ffoot damping | [0.2, 2.8] |
bthigh stiffness | [100, 380] |
bshin stiffness | [20, 340] |
bfoot stiffness | [10, 230] |
fthigh stiffness | [20, 340] |
fshin stiffness | [20, 220] |
ffoot stiffness | [10, 110] |
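Sampling environment dynamics from these ranges can be sketched as below, using the InvertedDoublePendulum-v2 ranges listed in this appendix. Uniform sampling within each range is an assumption made for illustration (the source gives ranges but not the distribution shape), and the names `PENDULUM_RANGES` and `sample_dynamics` are hypothetical. Applying the sampled values to a MuJoCo model is simulator-specific and is omitted here.

```python
import random

# Ranges copied from the InvertedDoublePendulum-v2 table in this appendix.
PENDULUM_RANGES = {
    "damping": (0.02, 0.3),
    "gravity": (8.5, 11.0),
    "length1": (0.3, 0.9),
    "length2": (0.3, 0.9),
    "density": (0.5, 1.5),
}


def sample_dynamics(ranges, rng):
    """Draw one value per parameter, uniformly within its range (assumed distribution)."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)
# Ten sampled dynamics settings, matching the number of test environments per method.
test_envs = [sample_dynamics(PENDULUM_RANGES, rng) for _ in range(10)]
```

The same helper would work for the HalfCheetah-v3 table by passing a dictionary with its 13 ranges.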