I Introduction
Reinforcement learning (RL) [1] algorithms have demonstrated potential in many simulated robotics domains such as locomotion [2], manipulation [3] and navigation [4]. Yet these algorithms still require large amounts of data, making learning from scratch on real robots challenging [5, 6, 7]. Recently, there have been many successes using simulation to speed up robot learning [8, 9, 10, 11]. Recent works have demonstrated that with enough domain randomization [8], zeroshot transfer of simtoreal policies are feasible, even in contactrich, hybrid discretecontinuous dynamical systems [8, 10].
A successful simtoreal transfer method balances the expensive realworld data and inexpensive simulated data to yield sampleefficient performance in the real world. Dynamics randomization [8] can yield successful policy transfer, but relies on manual selection of simulation parameters to randomize and does not incorporate realworld data in the learning process. Modelbased RL methods learn the dynamics model from data and tend to be more sample efficient than modelfree RL methods, but learning a model still requires high number of real world samples [12, 13], and may be prohibitive for complex dynamics. Adversarial domain adaptation methods have the potential to improve the simulation, but have only been demonstrated for adaptation of observations, rather than direct adaptation the system dynamics [14]. Classical system identification methods provide a sample efficient way to improve the model, but are limited in expressiveness by the model parameters exposed from the analytical models or simulators [15].
Nonprehensile robotic manipulation has been studied previously in the form of planar pushing with dynamics randomization and zeroshot simtoreal transfer [8]. Policies trained with dynamics randomization has been shown to produce robust policies which can successfully perform realworld nonprehensile robotic manipulation tasks. However, zeroshot simtoreal transfer using dynamics randomization does not use any real world data to improve the simulator and requires manual tuning of the choices and ranges of the physical parameters to vary in simulation. The chosen parameters are often very conservative, resulting in a policy that trades robustness to large variation of the model parameters at the expense of overall performance.
Alternatively, the simulated model can be improved to resemble the real world more closely. System identification techniques have been applied to nonprehensile manipulation through joint system identification and state estimation procedure that accurately models the contact dynamics
[15]. However, system identification methods require the true system dynamics to be in the realm of expressiveness of the analytical model. To get around this, others modelled the complex dynamics of planar pushing by combining both the learned and analytical models [16].Modelling complex dynamics is important for nonprehensile manipulation but is also critical for locomotion. Recent work performed simtoreal transfer of locomotion tasks by improving the simulation fidelity through system identification, modelling the actuators and latency of the system, which was then used in conjunction with dynamics randomization to train a robust policy for better simtoreal transfer [10]. The idea of using both the learned and the analytical model has also been applied to locomotion where the complex actuator dynamics were learned for a quadruped robot [17].
Ground Action Transformation (GAT) [18] is a more general method which modifies the actions from the policy trained in simulation. The modified actions are chosen such that, when applied in simulation, the resulting next state matches the next state observed on the real robot. Our proposed approach can be seen as a generalization of the Ground Action Transformation method. We introduce additive generalized forces to the environment instead of modifying the agent’s actions. The optimization of GAT and our method is also different, in that GAT learns an inverse dynamics model in both simulation and the real robot to compute the action transformation, while our method optimizes the objective directly over a discounted infinite horizon through the use of reinforcement learning.
This paper focuses on simtoreal transfer of robotic manipulation by efficiently utilizing a small amount of real world data to improve the simulation. Our contribution is twofold. First, we provide a novel simtoreal method which learns a state dependent generalized force model using neural networks as expressive function approximators to model the constraint forces of a dynamical system. Second, we efficiently optimize the statedependent generalized force model over a discounted infinite horizon through the use of RL. We validate the proposed approach on a nonprehensile robotic manipulation task on the Saywer robot.
This paper is organized as follows. Section II introduces the mathematical background on mechanical systems and reinforcement learning problems. Section III describes how modelling the generalized forces can be used to improve simtoreal transfer and how these forces can be learned efficiently using reinforcement learning. Finally, Section IV discusses the experimental validation of the proposed approach. Conclusions and perspectives conclude the paper.
Ii Background
Iia Mechanical Systems and Equations of Motion
In classical mechanics, any rigid body system subject to external forces can be described with the following equations of motion [19, Ch. 13.5]:
(1) 
where and are the system configuration and velocity respectively. is the mass matrix, represents the bias forces, including Coriolis, centrifugal, joint friction and gravitational forces. denotes the internal, actuation, torques, and
is a vector of size
of zeros denoting the unactuated variables. External 6D forces are mapped to generalized forces by the corresponding contact Jacobian .Note that Eq. (1) is quite general. Depending on the definition of the state variables and , it can be used to represent an articulated system, e.g. a robot arm, a generic free floating object, or both systems. If the systems are interacting with each other or with the environment, Eq. (1) needs to be complemented with additional equations describing the contact constraints. Without making any assumption on the kind of contacts, we can express these constraints as a generic function of the state of the involved systems:
(2) 
Note that, in the case of rigid contacts, (2) is greatly simplified, but in the following we will consider its most general form.
IiB Reinforcement Learning
We consider the reinforcement learning setting with a discounted infinite horizon Markov Decision Process (MDP). An MDP consists of the tuple
: the set of valid states, set of valid actions, the reward function, transition dynamics and the initial state distribution. Let denote the reward obtained in state of an MDP when executing action . The reinforcement learning objective is to find the optimal policy such that:(3) 
where is the discount factor and is the set of all possible policies and the expectation with respect to is , and the states follow the system dynamics of the MDP, i.e. we have where
denotes the transition probability.
With this notation in place, we can define the actionvalue function, which evaluates the expected sum of rewards for a policy with a onestep Bellman equation of :
(4) 
The actionvalue function is learned for offpolicy reinforcement learning during the policy evaluation step to evaluate the policy. The policy itself is improved using the result of policy evaluation in the policy improvement step. Iteratively applying these two steps is referred to as policy iteration [1].
Iii Modelling Generalized Forces with RL
Our method uses a hybrid approach: the analytical model of a mechanical system is supplemented with a parametric function representing additional generalized forces that are learned using real world data. Consider the dynamics equation in Eq. (1). In addition to the actuation and external forces, we add a statedependent generalized force term, i.e.
(5) 
We represent the contactaware system dynamics in Eqs. (5) and (2) as a hybrid discrete dynamics model , parameterized by vector , which takes as input the current state and action and outputs the prediction of the next state . We want to find the parameter vector that minimizes the dynamics gap along a horizon , that is:
(6) 
where represents the resulting next state in the real system. In contrast to other hybrid modelling approaches that learn the residual state difference, we choose to model the gap with generalized forces. This choice has two benefits: i) generalized forces are applied directly to the dynamics equation, that is, the next state resulting from the application of Eqs. (5), (2) is a physically consistent state; ii) it is easier to impose constraints on the learned generalized forces as they have a physical correspondence.
It is worth noting that the resulting hybrid model is nondifferentiable, as Eq. (2) is in general nondifferentiable. By consequence, the function is not differentiable with respect to its arguments and the optimal generalized forces are not trivial to obtain using classical methods. However, the formulation of optimizing a statedependent model interacting with a nondifferentiable component over a horizon strongly resembles the reinforcement learning objective in Equation 3. We therefore reformulate the original problem of optimizing the objective in Equation 6 as a reinforcement learning problem.
Define a trajectory consisting of a sequence of states and actions collected on the real system, and the corresponding dataset of all the acquired real world trajectories. Let us define the following reward function:
(7) 
with a suitable distance function for measuring the distance between and . At the beginning of each training episode, we randomly sample an initial state , and we initialize the internal dynamics model to . The sum of the rewards in Equation 7 is maximized with the RL objective in Equation 3, resulting in the following objective:
(8) 
where
is the transition probability distribution of the unmodified dynamic model given in Eqs. (
1),(2).The resulting generalized force model (GFM), , is then used as the transition model for the hybrid model (5) and (2). We then train the policy for a control task of interest using the hybrid model with the transition probability distribution , resulting in the following reinforcement learning objective:
(9) 
To optimize both Equation 8 and 9, we chose the policy iteration algorithm called Maximum a Posteriori Policy Optimization (MPO) [20]
, which uses an expectation maximization style
policy improvement with an approximate offpolicy policy evaluation algorithm Retrace [21] which estimates the actionvalue function given in Equation 4.Fig. 2 shows the proposed method, highlighting the different steps from the first policy training in the unmodified simulator (, step 1) to the final policy trained on the updated simulator model (, step 4).
Iv Experimental Results and Discussion
In this section we present the results of the experiments we performed to validate the proposed simtoreal transfer method. The experiments have been devised to answer to the following questions. i) Does the use of the GFM improve the modelling capacity with respect to an analytical model with parameters obtained from system identification? ii) How does the training of GFMs scale with an increase in real world data? iii) What is the impact of GFMs on policy training in simulation? iv) Does the use of GFMs improve simtoreal transfer for nonprehensile robotic manipulation tasks?
Referring to Eq. (5), the configuration of our experimental system is with the velocity . The actuated torques are with the learned generalized forces .
The position and velocity of the robot joints are obtained by using the local robot sensors, i.e. the joint encoders. The object is tracked by using a fiduciary marker system based on AR tags, see Fig. 4. We track the object independently with three cameras and then merge the information into a single 6D estimation value. To obtain the velocity, we perform a finite differentiation of the estimated position. The robot is controlled at while the simulation is integrated every .
With reference to the algorithm outlined in Sect. III, we define the distance function as follows:
(10) 
Iva Experimental Setup
Fig. 1
shows the real and simulated setup used in our experiments. The platform is a 7 degrees of freedom (DoF) Rethink Robotics Sawyer robotic arm. A basket, where the objects lie and move, is placed in front of the robot. Depending on the specific experiment, the object and the end effector of the robot can change. The full hardware setup, i.e. robot, basket and objects, are modelled in simulation using the MuJoCo simulator
[22] where we represented the orientation component by using unitary quaternions, is the quaternion product and denotes the complex conjugate of the quaternion. We define the following limits for the learned generalized forces:
for the forces acting on the robot joints,

for the forces acting on the translational part of the object and

for the forces acting on the rotational part of the object.
While learning the GFMs, we also provide as observation to the RL algorithm the 5 previous states of the dynamics simulation.
Both the GFM and the control policy are feedforward neural networks with two layers, each with a size of 200. The neural network outputs a Gaussian distribution with the mean
and diagonal Cholesky factors , such that . The diagonal factor has positive diagonal elements enforced by the softplus transform to ensure positive definiteness of the diagonal covariance matrix.IvB Joint Position Control
To examine the GFM’s modelling capabilities, we collect trajectories of the robot controlled in position mode, following a predefined waypointbased trajectory. No interactions with the object or the basket took place in this experiment. MuJoCo by default exposes two parameters for modelling the actuator, namely the actuator gain and damping. We identify these parameters by using SciPy [23] Sequential Least Squares Programming. We then train our proposed hybrid model using the approach described Sect. III.
We report the comparison of the trajectories resulting from the different models in Fig. 3. We can notice that both the GFM and the parametersidentified model improve upon the default simulation model. However, the GFM hybrid model is able to capture the complexity of the real data better than the simulator’s simplified actuator model alone. This also demonstrates the generality of our GFM hybrid model approach, as no specific knowledge about actuator dynamics are needed for training the GFM. However, it is worth noting that the GFM hybrid model is a local approximation around the real data used for training.
IvC Position Matching Task
To validate and assess quality of the proposed method, we use a position matching task for the freemoving object. The goal of this task is to move the object to a specific target location. The robotic arm is equipped with an acrylic plate as its end effector as can be seen in Fig. 4.
We provide the state of the robot, the pose of the object, and the target position of the brick as observations to the agent, as described in Sect. IVA. We provide a history of observations by providing to the agent the last , including the current, observations. Each episode is composed of a maximum of transitions (corresponding to ). The episode is terminated, and considered a failure, if the endeffector of the robot leaves its workspace.
The agent’s actions consist of joint velocities, representing reference setpoints for the robot’s lowlevel joint velocity controller. We decided to limit the maximum joints velocity to 5 percent of the allowed range of velocities. Two reasons motivated this choice: i) lower velocity implies safer action from the robot while interacting with the object and the environment, and, ii)
lower velocity for the object whilst under interaction with the robot allows better pose estimation from the AR tags system.
We evaluate the agent’s performance on the task with a binary successfailure criterion for 5 trials. An episode is a success if the error between the object’s position and the target is less than for consecutive time steps, i.e. . Each trial consists of 10 target positions, making the total number of target positions attempted for each policy to . The initial pose of the object is reset to the center of the basket at the beginning of every attempt. The target locations are fixed and are radially positioned around the center of the basket – See Figure 5.
IvD Ensemble of Hybrid Models
We first examine how the training of GFMs scales with the increase in real world data.
We train a policy for the position matching task using the default simulator model as shown in step 1 of Fig. 2, and we use this policy to collect a dataset of trajectories on the real robot.
We then train two GFMs using different amounts of real world trajectories. The first one uses 5 trajectories, while the second one uses all 10 trajectories. The second GFM faces a more challenging modelling problem as it has to learn to fit more data points from the real world. In our GFM training, using 5 trajectories on average resulted in around 10% higher average return when compared to the GFM that used 10 trajectories.
In practice, given that we would like to use more than 5 real world trajectories to improve the dynamic model, we decide to employ an ensemble of GFMs instead of increasing the number of trajectories used during the GFM learning. All the following experiments are trained with the fixed number of 5 trajectories per model.
IvE Training Policies on Hybrid Models
The additional complexity introduced by the hybrid model has an impact in the learning performance of the final policy. Fig. 6 shows that the additional term significantly slows down training when compared to the original unmodified model. This is expected, as GFM essentially injects force perturbations which makes the task harder to solve. To reliably obtain the optimal policy when training with the hybrid model, we employ a simple curriculum learning strategy. At the beginning of the training, we allow higher control authority to facilitate exploration and task completion, while progressively lowering it towards the end. We implemented the curriculum by introducing an additional multiplicative gain to the controlled joint velocities. We started with a gain of , progressively reducing it to by unitary steps every episodes. Fig. 6 shows the training curves for policy learning on both the unmodified dynamic model and the hybrid model when using the curriculum strategy. We can clearly see that the curriculum improves the speed and the final performance of the policy.
Models  Task Success  Std. Dev. 

Original model  38%  3.4 
Original model with Curriculum  62%  3.4 
Random Force Perturbations  70%  3.2 
Hybrid Model (1 model in ensemble)  58%  3.5 
Hybrid Model (3 models in ensemble)  84%  2.6 
Hybrid Model (5 models in ensemble)  74%  3.1 
IvF Position Matching Task: Experimental Results
This section reports the results of the simtoreal policy transfer for the nonprehensile position matching task when using both the model ensemble and curriculum learning strategy discussed in the previous sections.
Table I
summarizes the results, reporting the average success percentage and the standard deviation. We compare our method to 3 baseline methods which are
i) policy trained with the original dynamic model, ii) policy trained with the original model and using curriculum learning and, iii) policy trained with random perturbations of generalized forces and using curriculum learning. As expected, the policy trained with the original model performs poorly due to the dynamics gap resulting in 38% success. However, the default simulation trained with our curriculum learning performs significantly better with 62% success. Given that the curriculum learning induces timevarying actuator gains, this has a similar effect to applying domain randomization. Lastly, the addition of random generalized force perturbations on top of our curriculum learning results in 70% success.The lower part of Table I reports the performance of transferred policies trained with the hybrid model for a different number of model ensembles (in parentheses). While the use of one GFM leads to an improvement over the original model, it does not perform better than the other two baselines. Given that a single GFM only uses 5 real world trajectories, it is plausible that the hybrid model is overfitting to those particular trajectories. The use of 3 GFMs significantly improves the simtoreal transfer and results in 84% success. Using 5 GFMs still outperforms all of the baselines but results in a degradation of performance when compared to using only 3 GFMs. Indeed, while using more GFMs allows us to utilize more real world data, it also makes the task more challenging, even compared to the real world. This results in a more conservative controller which is not as optimal as when we have the right balance of GFMs to model the real world.
Models  Task Success  Std. Dev. 

Original model  18%  2.7 
Hybrid Model (3 models in ensemble)  44%  3.5 
IvG 6D Pose Matching Task
Finally, we provide the results for simtoreal policy transfer for a much more challenging nonprehensile manipulation task. The task is similar to the one described in Sect. IVC, but this time we consider also the orientation of the object. The task and learning setup is exactly the same as the one described in Sect. IVC, except for the following changes. The episode duration is transitions, i.e. . During the evaluation procedure we consider an attempt as successful if the position error is less than and the orientation error around the vertical axis is less than degrees.
Table II presents the results of the simtoreal transfer for the 6D pose matching task. As for the previous experiment, our method is able to significantly improve over the baseline of 18% to 44% success. The success rate is lower when compared to the previous experiment, but this applies to both the policy learned on the original model and on the hybrid model, showing the complexity of the task.
V Conclusions
In this work we presented a novel simtoreal method that aims to close the simtoreal gap by modelling and learning the state dependent generalized forces that capture this discrepancy. We validated the proposed approach by performing simtoreal transfer for nonprehensile robotic manipulation tasks. We showed that our method improves the performance of simtoreal transfer for 3D position matching and 6D pose matching task of different objects. While the results reported in Sect. IVG on 6D pose matching showed that our approach yields a considerable improvement with respect to the default model, there are still rooms for improvements. We thus want to analyze and improve on this task further.
When controlling interaction forces, it is common to choose torque control as the low level controller. Despite this classic choice, the use of torque control is not common in reinforcement learning. Future work could investigate if the use of our proposed generalized force model helps in learning an interaction task while controlling the robot by using torque control.
References
 [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [2] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, and D. Silver, “Emergence of locomotion behaviours in rich environments,” CoRR, vol. abs/1707.02286, 2017.
 [3] A. Rajeswaran*, V. Kumar*, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” in Proceedings of Robotics: Science and Systems (RSS), 2018.
 [4] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: scalable distributed deeprl with importance weighted actorlearner architectures,” CoRR, vol. abs/1802.01561, 2018.
 [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “Endtoend training of deep visuomotor policies,” CoRR, vol. abs/1504.00702, 2015.
 [6] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning handeye coordination for robotic grasping with deep learning and largescale data collection,” CoRR, vol. abs/1603.02199, 2016.

[7]
M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih,
N. Heess, and J. T. Springenberg, “Learning by playing solving sparse reward
tasks from scratch,” in
International Conference on Machine Learning
, 2018, pp. 4341–4350.  [8] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Simtoreal transfer of robotic control with dynamics randomization,” CoRR, vol. abs/1710.06537, 2017.
 [9] F. Sadeghi and S. Levine, “Cad2rl: Real singleimage flight without a single real image,” CoRR, vol. abs/1611.04201, 2016.
 [10] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Simtoreal: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018.
 [11] Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, “Reinforcement and imitation learning for diverse visuomotor skills,” CoRR, vol. abs/1802.09564, 2018.
 [12] I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel, “Modelbased reinforcement learning via metapolicy optimization,” 2018.
 [13] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning,” 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566, 2018.
 [14] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” CoRR, vol. abs/1709.07857, 2017.
 [15] K. Lowrey, S. Kolev, J. Dao, A. Rajeswaran, and E. Todorov, “Reinforcement learning for nonprehensile manipulation: Transfer from simulation to physical system,” CoRR, vol. abs/1803.10371, 2018.
 [16] N. F. M. B. L. P. K. J. B. T. Anurag Ajay, Jiajun Wu and A. Rodriguez, “Augmenting physical simulators with stochastic neural networks: Case study of planar pushing and bouncing,” in International Conference on Intelligent Robots and Systems (IROS), 2018.
 [17] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019.
 [18] J. P. Hanna and P. Stone, “Grounded action transformation for robot learning in simulation,” in AAAI, 2017.
 [19] J. E. Marsden and T. S. Ratiu, Introduction to Mechanics and Symmetry: A Basic Exposition of Classical Mechanical Systems. Springer Publishing Company, Incorporated, 2010.
 [20] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. A. Riedmiller, “Maximum a posteriori policy optimisation,” CoRR, vol. abs/1806.06920, 2018.
 [21] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare, “Safe and efficient offpolicy reinforcement learning,” CoRR, vol. abs/1606.02647, 2016.
 [22] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for modelbased control,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.

[23]
E. Jones, T. Oliphant, P. Peterson, et al.
, “SciPy: Open source scientific tools for Python,” 2001.