I Introduction
The ability to learn new tasks from non-robotic experts would be of great benefit for future robots in industry, as well as in everyday applications. To this end, Learning from Demonstration (LfD) [1], and particularly movement primitives, offer a way to learn generalizable trajectory generators from a few demonstrated trajectories. However, in precise or contact-rich manipulation tasks, pure imitation of demonstrated behaviors often does not work well [2]. Additionally, in real robotic applications controller accuracy might differ between robots or tasks, and lower controller gains, desirable e.g. to ensure safety in close cooperation with humans, can lead to tracking inaccuracies of trajectories learned from demonstrations.
Recently introduced Residual Reinforcement Learning (RRL) [3, 4, 5] has shown success in combining classical controllers with residual policies learned through direct interactions with simulated or real environments. In this paper, we explore the direction of combining RRL with movement primitives to use the advantages of both in a symbiotic way. This has already been investigated for Dynamic Movement Primitives (DMPs) [6] and Gaussian Mixture Model (GMM) based primitives [7]. Inspired by these works, we investigate here the benefit of RRL for another movement primitive representation, namely Probabilistic Movement Primitives (ProMPs) [8, 9]. ProMPs encode demonstrated trajectories as distributions over weighted basis functions and allow capturing variability and correlations within demonstrated trajectories. Probabilistic operators such as conditioning on different goals and start points can then be used to generalize from demonstrations to unseen situations. While this works well for tasks demonstrated with joint-space kinesthetic teaching, e.g. assisting in coffee making [10], these conditioning operations are often too imprecise for insertion tasks, in particular when executed on a real robot with limited controller gains. To overcome these limitations, we propose to combine ProMPs with RRL, such that during task execution a robot can refine and iteratively improve trajectories. In particular, we investigate in an experimental study whether ProMPs, which were learned from external observations, can be used in combination with RRL to solve a precision 3D block insertion task. The main contributions of this paper are the following:

On top of a nominal trajectory generated with a ProMP, we propose to learn a residual that accounts for corrections in both position and orientation with Soft Actor-Critic (SAC) [11] on a real robotic system.

We use the variability in the demonstrations as a decision variable to reduce the search space for RRL, and compare this approach to a distance-based strategy for weighting the nominal and residual policies.

We evaluate the proposed method on a 3D block insertion task on a 7-DoF Franka Emika Panda robotic arm. Contrary to peg-in-hole, this task is not invariant to rotations around all axes.
The rest of this paper is structured as follows. In Section II we discuss related work on RRL with classical controllers and LfD. In Section III we introduce our approach to combining RRL with an object-centric formulation of ProMPs. Here we also discuss three different options for combining the residual and nominal policies, which we then evaluate on a 3D block insertion task with a real robot in Section IV. In Section V we draw conclusions, discuss current limitations of the proposed approach, and give an outlook on future work.
II Related Work
Residual reinforcement learning [3, 4, 5] has been proposed as a way to solve challenging robotic manipulation tasks by adapting the control actions of a conventional (model-based) controller with a policy learned through model-free RL, which significantly reduces the search space and thus improves sample efficiency. [3] introduced RRL as learning a residual on top of an initial controller to improve non-differentiable policies using model-free deep RL. [4] uses a residual policy and a hand-engineered base controller in a robotic task to insert a block between two others that can topple, using a dense reward and a fine-tuned vision setup. [12] extends RRL to learn from sparse rewards and visual input demonstrations as image sequences in simulation: through behavioral cloning, they first learn task-relevant visual features, and afterwards the RL policy is optimized using the pretrained state features. [5] applies deep RRL to industrial tasks in the real world, using state features from images, a sparse reward, and a hand-designed P-controller as the nominal policy; the Cartesian actions are transformed to joint space via inverse kinematics. [13] learns assembly tasks with a real robot in a few minutes, combining Cartesian impedance control with a recurrent-policy version of TD3 [14] to learn in the presence of position uncertainties. Variable Impedance End-Effector Space (VICES) [15] studied the effects of different action spaces and argued for impedance control in end-effector space; besides learning a policy for small changes in pose, they also learn state-dependent gains for the impedance controller. Older work [16] learned impedance gains of a robotic arm based on equilibrium point control theory and the natural actor-critic algorithm [17] for contact tasks.
[18] learns force control for rigid position-controlled robots, and its follow-up work [19] solves peg-in-hole tasks with hole-position uncertainty using an off-policy, model-free RL method and several sim2real techniques, such as domain randomization. [20] imitates human assembly skills through hybrid trajectory and force learning with hierarchical imitation learning. A scheme to learn an optimal force control policy with goal-conditioned imitation learning is presented in [21], which is closely connected to [22]. Guided Uncertainty-Aware Policy Optimization (GUAPO) [23] quantifies uncertainty in pose estimation to define a binary switching strategy that determines where to use model-based or RL policies. [24] evaluates Q-graph-bounded DDPG for improving model-free RL to solve a peg-in-hole task with a force-torque action space. [25] proposes a hybrid RRL scheme that modifies the signals used by the RL policy, preventing internal feedback signals of the low-level controller from limiting the RL agent's ability to improve its policy and thus harming learning. Their approach is demonstrated in a contact-rich peg-insertion task. [26] combines visual-servoing-based LfD and force-based Learning by Exploration (LbE), and proposes a region-limited residual RL (RRRL) policy that acts only in a region close to the goal, determined by the Euclidean distance. [27] builds upon Deep Deterministic Policy Gradients (DDPG) to incorporate existing base controllers into the stages of exploration, value learning, and policy update, and presents a straightforward way of synthesizing different base controllers to integrate their strengths. [7] proposes Soft Actor-Critic Gaussian Mixture Model (SAC-GMM), a hybrid approach that learns robot skills through a dynamical system modeled in state space with GMMs and adapts the learned skills through interactions with the environment using RRL. They present results in simulation for peg-insertion and power-lever-sliding skills, and real-world results for a door-opening task, using a camera image as policy input. [28] proposes InsertionNet, which uses visual and wrench inputs to learn a residual policy in position and orientation. Demonstrations are provided with a carefully designed procedure of backwards learning: first physically moving the robot to the final pose, e.g. the hole in the peg-in-hole task, and then generating collisions in order to collect data. Data augmentation of images is used for robustness. Given this dataset, the insertion problem reduces to learning function parameters in a regression task.
To reach the insertion area they use a PD controller to follow a precomputed trajectory.

The closest work to ours is the combination of Dynamic Movement Primitives (DMPs) with residual learning in [6], where it was shown that RRL performs better than learning the DMP parameters with RL. Additionally, it is stated that learning orientation is important for reliable insertion tasks (several other works learn only the deviation in position). Our work differs from [6] in that we make use of the variance in the demonstrations to check whether the current position is inside the confidence interval, to decide if another demonstration is needed, and to determine when to adapt the nominal controller, while [6] specifies a time period after which the RL policy contributes to solving the task. In [29] insertion tasks have been learned with DMPs by tuning the parameters with episodic RL using Policy learning by Weighting Exploration with the Returns (PoWER) [30], but the trajectories are demonstrated with kinesthetic teaching, which facilitates learning in joint space. In contrast, we work in Cartesian space, which is inherently more difficult.

III Residual RL for Object-Centric ProMPs
In this section, we explain the different components of our proposed approach to combining ProMPs and RRL. An overview of the resulting method is shown in Fig. 2. We discuss the underlying control structure in Section III-A and provide a short recap on ProMPs in Section III-B and on RL and SAC in Section III-C. Afterwards, we introduce our approach for adapting Cartesian ProMPs with residual robot learning in Section III-D and explain the action space and policy parametrization in Section III-E.
III-A Cartesian Impedance Control
When performing contact-rich tasks such as an insertion, Cartesian impedance control [31] is an appropriate choice, because it allows specifying the desired compliant behavior of the robot in the presence of external forces, e.g. collisions. This aspect is particularly relevant not only to prevent damaging the robot but also in human-robot collaboration. Given a desired setpoint (with zero velocity) of the end-effector pose $(\mathbf{p}_d, \mathbf{o}_d)$, with position $\mathbf{p}_d$ and orientation represented as a quaternion $\mathbf{o}_d$, the torques applied at the robot joints are computed as

$$\boldsymbol{\tau} = J(\mathbf{q})^\top F_{\text{ext}} + C(\mathbf{q}, \dot{\mathbf{q}})\dot{\mathbf{q}} + g(\mathbf{q}), \qquad F_{\text{ext}} = \begin{bmatrix} K_t (\mathbf{p}_d - \mathbf{p}) \\ K_o (\mathbf{o}_d \ominus \mathbf{o}) \end{bmatrix} - D\, J(\mathbf{q})\dot{\mathbf{q}},$$

where $\mathbf{q}, \dot{\mathbf{q}}$ are the current robot joint positions and velocities, $J$ is the Jacobian matrix, $C(\mathbf{q}, \dot{\mathbf{q}})\dot{\mathbf{q}}$ and $g(\mathbf{q})$ are the Coriolis forces and gravity compensation terms, $F_{\text{ext}}$ is the simulated external wrench, $K_t$ and $K_o$ are the position and orientation stiffness matrices, $D$ is the damping matrix, and $\ominus$ denotes a difference in quaternion space translated to axis-angle. Lowering the gains $K_t$ and $K_o$ allows for safer robot interaction and exploration, but results in a larger tracking error.
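As a concrete illustration, the control law above can be written in a few lines of NumPy. This is a minimal sketch under the symbol definitions given in the text; the function and variable names (`impedance_torques`, `p_err`, `o_err`) are chosen for illustration and are not from the paper's implementation.

```python
import numpy as np

def impedance_torques(q_dot, J, C_qdot, g, p_err, o_err, K_t, K_o, D):
    """Cartesian impedance control law (sketch).

    q_dot  : joint velocities (n,)
    J      : end-effector Jacobian (6, n)
    C_qdot : Coriolis term C(q, q_dot) @ q_dot (n,)
    g      : gravity compensation torques (n,)
    p_err  : desired minus current end-effector position (3,)
    o_err  : orientation error as axis-angle, i.e. the quaternion
             difference translated to axis-angle (3,)
    """
    # Simulated external wrench: stiffness terms damped by the
    # end-effector twist J @ q_dot.
    wrench = np.concatenate([K_t @ p_err, K_o @ o_err]) - D @ (J @ q_dot)
    # Map the wrench to joint torques and add dynamics compensation.
    return J.T @ wrench + C_qdot + g
```

With zero pose error and zero velocity, the commanded torques reduce to the Coriolis and gravity compensation terms, as expected from the formula.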
III-B Probabilistic Movement Primitives
Movement primitives are a convenient way to represent time-based smooth robot and object movements [32]. In particular, probabilistic formulations also allow capturing the variance in the demonstrations [33, 34, 8]. Here, we use ProMPs, which are able to construct distributions conditioned on arbitrary time steps and points inside a confidence interval of the demonstrations, while relying on a small amount of training data compared to other state-based representations [35].
Formally, a ProMP is a compact representation of a trajectory, where a point in the trajectory $\mathbf{y}_t = \Phi_t^\top \mathbf{w}$ is assumed to be a linear combination of basis functions, with $\Phi_t$ a basis function matrix, $\mathbf{w}$ the learnable weights, and $z_t$ a phase variable. A distribution over the weights is learned from multiple demonstrations. Assuming the weights are Gaussian distributed, $\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$, the mean $\boldsymbol{\mu}_w$ and covariance matrix $\boldsymbol{\Sigma}_w$ are obtained via maximum likelihood estimation. For more details on the exact training procedure, we refer the reader to [8, 10]. Let $\mathbf{y}_t^*$ be a point to reach at step $t$ with covariance $\boldsymbol{\Sigma}_y^*$. The conditional distribution over weights is computed with Bayes' rule for Gaussian distributions as

$$\boldsymbol{\mu}_w^{\text{new}} = \boldsymbol{\mu}_w + K\left(\mathbf{y}_t^* - \Phi_t^\top \boldsymbol{\mu}_w\right), \quad \boldsymbol{\Sigma}_w^{\text{new}} = \boldsymbol{\Sigma}_w - K \Phi_t^\top \boldsymbol{\Sigma}_w, \quad K = \boldsymbol{\Sigma}_w \Phi_t \left(\boldsymbol{\Sigma}_y^* + \Phi_t^\top \boldsymbol{\Sigma}_w \Phi_t\right)^{-1},$$

and the resulting trajectory distribution is $\mathbf{y}_t \sim \mathcal{N}(\Phi_t^\top \boldsymbol{\mu}_w^{\text{new}}, \Phi_t^\top \boldsymbol{\Sigma}_w^{\text{new}} \Phi_t)$.

III-C Reinforcement Learning and Soft Actor-Critic
Let an MDP be defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \mu_0)$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a continuous action space, $\mathcal{P}$ is a transition probability function, with $\mathcal{P}(s' \mid s, a)$ the density of landing in state $s'$ when taking action $a$ in state $s$, $r$ is a reward function, $\gamma$ is a discount factor, and $\mu_0$ is the initial state distribution. A policy $\pi(a \mid s)$ is a (stochastic) mapping from states to actions. The state-action value function ($Q$-function) is the discounted sum of rewards collected from a given state-action pair following the policy $\pi$, $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. In general, the goal of an RL agent is to maximize the expected sum of discounted rewards $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. In high-dimensional and continuous action spaces, typically a policy $\pi_\theta$ with parameters $\theta$ is updated iteratively with a gradient ascent step on $J$, using a variation of the policy gradient theorem [36]. The SAC algorithm is a sample-efficient method to compute an off-policy gradient estimate of $J$ [37, 11], with several improvements over previous approaches, namely entropy regularization and a squashed Gaussian policy. The entropy term encourages exploration by preventing the policy from becoming too deterministic during learning. The surrogate objective optimized by SAC is

$$J(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[Q_\phi(s, a) - \alpha \log \pi_\theta(a \mid s)\right],$$

where $d^\pi$ is an off-policy state distribution, the $Q$-function is a neural network parameterized by $\phi$, and $\alpha$ weighs the entropy regularization term. The (unbiased) policy gradient of $J(\theta)$ is computed by sampling from a replay buffer containing off-policy samples and using the reparametrization trick [38] to differentiate the expectation over actions.

III-D Adapting Cartesian ProMPs with Residual Robot Learning
Residual learning is commonly formulated as a combination of policies, which can be both time and state dependent, as

$$a_t = \alpha_n\, \pi_n(s_t, t) \oplus \alpha_r\, \pi_r(s_t, t),$$

where $\pi_n$ is a nominal (or model-based) policy, $\pi_r$ is a learnable policy, and $\alpha_n$ and $\alpha_r$ we call adaptation parameters. The operation $\oplus$ depends on the action space: if the action represents a translation it can be a sum, but if it is an orientation it can be a quaternion multiplication. In [3], the authors assume $\alpha_n = \alpha_r = 1$ with $\oplus$ a sum, and hence $\nabla_\theta \pi = \nabla_\theta \pi_r$, meaning one can use the policy gradient to optimize $\pi_r$ without knowing $\pi_n$. However, this is equivalent to writing the transition function of a residual MDP with the policy transformation absorbed into it. Because the transformation is now part of the environment, the agent is unaware of it. Note that in the original formulation of [3] the policy is only state-dependent, not time-dependent. We augment this definition with a time dependency, to include policies that result from time-dependent movement primitives such as ProMPs. This does not break the MDP assumption, since $t$ can be seen as part of the state (in an episodic task).
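The residual-MDP view described above, in which the nominal policy becomes part of the environment's transition function, can be sketched as a thin environment wrapper. The `env` and `nominal` interfaces below are assumptions for illustration, not the paper's code.

```python
import numpy as np

class ResidualEnv:
    """Wrap an environment so the agent learns only a residual action.

    The nominal policy becomes part of the transition function: from the
    agent's perspective, the wrapped environment already contains the
    nominal controller.
    """

    def __init__(self, env, nominal):
        self.env = env          # underlying environment (assumed interface)
        self.nominal = nominal  # time-dependent nominal action, e.g. a ProMP mean
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, residual_action):
        # Translation case: the combination operator is a plain sum.
        action = np.asarray(self.nominal(self.t)) + np.asarray(residual_action)
        self.t += 1
        return self.env.step(action)
```

Keeping the internal step counter `t` makes the nominal policy time-dependent without breaking the MDP assumption, since in an episodic task the time index can be treated as part of the state.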
In our approach, we learn ProMPs from external observations of demonstrated object trajectories in Cartesian space. For the experiments in this paper, we used a motion capturing system and markers on the objects to obtain the demonstrated trajectories. An example of such a demonstration can be seen in Fig. 2 on the top left. We assume that for an insertion task we collect a set of trajectories of the pose of an interest object I (red object in Fig. 1) in the reference frame of a target object T (blue object), where each trajectory consists of positions and orientations (as quaternions). Representing the trajectory in the target frame allows capturing the intention of the demonstration, i.e. classifying whether it is an insertion task, and additionally, if the target or interest objects move to a different pose, their relation is maintained. For a compact representation of the trajectory, we encode the position with an object-centric ProMP in Cartesian space. The orientation representation is the average over trajectories. Learning ProMPs for orientation spaces is an ongoing topic of research [39], and we leave this for future work.

A key advantage of movement primitives is their ability to generalize from demonstrated trajectories to new situations. In particular, ProMPs can be used here to compute nominal trajectories for varying start positions of the object, and they can also distinguish between starting points covered by the provided demonstrations and starting points outside this region. We focus here on generalization to starting points within the demonstrated distribution over trajectories only. However, as a future direction it would also be possible to include active requests for additional demonstrations or multimodal ProMPs using incremental Gaussian Mixture Models [40].
For our approach, we compute the nominal trajectory by conditioning the ProMP on the initial time step and initial pose with a small covariance, and computing the resulting mean trajectory. Afterwards, the desired trajectory is translated to the end-effector in the target frame, where the transformation from the end-effector to the interest object is given via the grasping pose.
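The conditioning used here is standard Gaussian conditioning on the ProMP weights; a minimal sketch follows, where the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def condition_promp(mu_w, Sigma_w, Phi_t, y_star, Sigma_y):
    """Condition a ProMP on reaching y_star at one time step.

    mu_w, Sigma_w : prior mean and covariance over weights
    Phi_t         : basis matrix at time step t (n_weights, n_out)
    y_star        : desired point, Sigma_y its (small) covariance
    """
    # Bayes' rule for linear-Gaussian models.
    S = Sigma_y + Phi_t.T @ Sigma_w @ Phi_t          # innovation covariance
    K = Sigma_w @ Phi_t @ np.linalg.inv(S)           # gain
    mu_new = mu_w + K @ (y_star - Phi_t.T @ mu_w)
    Sigma_new = Sigma_w - K @ Phi_t.T @ Sigma_w
    return mu_new, Sigma_new
```

Conditioning with a very small `Sigma_y` forces the conditioned mean trajectory to pass (almost) exactly through the desired point, which is how the nominal trajectory is anchored to the observed initial pose.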
To follow this trajectory, we could compute the inverse kinematics and track it with an inverse dynamics controller in joint space. However, this strategy would require large gains to ensure the low tracking error necessary for an insertion task with limited tolerance, and these large gains could damage the robot and the environment in case of interaction. For this reason, a low-gain controller in Cartesian space is better suited for this task. However, due to the low gains, velocity constraints, and other factors, the controller will not perfectly follow the desired trajectory and thus cannot complete the task, as depicted in the experiment of Fig. 4.
An important decision is where/when to activate the residual part of the policy. In the original formulation of residual RL the residual is always active, which for longer trajectories can lead to exploring in free-space regions far away from the insertion goal. To prevent this unnecessary exploration, [23] and [26] activate the residual only inside a region in the vicinity of the goal and deactivate it otherwise. [23] defines this region based on an uncertainty quantification in pose estimation, and [26] defines it via the distance to the goal, both being hyperparameters. [13] uses a time-based weighting that activates the learned policy after the settling time of the nominal controller, which in practice means that the learned policy only acts if the nominal controller fails, and the residual part of the policy then acts alone in the environment. In [6] the residual becomes active after executing the nominal controller for a fixed amount of time, and is inactive before.

In our approach, we make use of the covariance over the originally demonstrated trajectories. For our particular insertion task, we motivate this with the intuition that exploration is more beneficial closer to the insertion location, which has lower entropy (less variance), as can be seen from the last time steps of the learned ProMPs in Fig. 3. We propose a variance-based adaptation scheme with $\alpha_n = 1$ and

$$\alpha_r(t) = \begin{cases} 1 & \text{if } \sigma_d(t) \leq \epsilon_d \ \forall d, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $\sigma_d(t)$ is the standard deviation of the $d$-th dimension of the original ProMP at time step $t$.

III-E Action Space and Policy Parametrization
For the RL agent, the Cartesian impedance controller is part of the environment and takes as input a desired pose computed with the adaptation scheme. The policy is learned with SAC and encodes the mean and variance of a Gaussian distribution over position and orientation deltas. The functions encoding the mean and covariance are neural networks that share the same features up to the last linear layer. The delta in orientation parametrizes the coordinates of an axis-angle representation. Both positions and orientations are squashed to the Cartesian controller limits with a hyperbolic tangent operator. The nominal and learned policies are combined as follows. For the position it is a simple addition. For orientations, we first compute the quaternion representation of the axis-angle delta, and afterwards apply a quaternion multiplication to obtain the desired rotation.
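The combination of the nominal and residual actions described above, together with the variance-based gate of Eq. 1, can be sketched as follows. All function names are illustrative, and the quaternion convention (w, x, y, z) is an assumption.

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def axis_angle_to_quat(aa):
    """Convert an axis-angle vector to a unit quaternion (w, x, y, z)."""
    angle = np.linalg.norm(aa)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation
    axis = aa / angle
    return np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])

def variance_gate(sigma_t, eps):
    """Variance-based adaptation (Eq. 1): activate the residual only
    where the ProMP standard deviation is below the thresholds eps."""
    return 1.0 if np.all(sigma_t <= eps) else 0.0

def combine_actions(p_nom, q_nom, dp, d_aa, alpha_r=1.0):
    """Combine the nominal pose with the (gated) residual action.

    Position: simple addition. Orientation: the residual axis-angle
    delta is converted to a quaternion and applied by multiplication.
    """
    p_des = p_nom + alpha_r * dp
    q_des = quat_mul(axis_angle_to_quat(alpha_r * d_aa), q_nom)
    return p_des, q_des
```

With a zero residual (or a gate value of zero) the desired pose reduces to the nominal one, so the scheme degrades gracefully to pure ProMP tracking outside the low-variance region.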
IV Experimental Results
As a proof of concept, we evaluate our proposed method in a block insertion task with a 7-DoF Franka Panda robot. In these experiments, we investigate whether the task, which we could not solve with basic ProMPs alone, benefits from the combination of ProMPs and RRL, and compare different ways of combining the nominal and residual policies to assess which one works best on the real system.
IV-A Experiment Setup
We evaluate the proposed method on an approximation of the Ubongo 3D game [41], which consists of different shapes that have to be assembled together in a limited space. For the experiments in this paper, we use 3 different shapes (the red, green and blue elements in Fig. 1) and a base plate (in black). The tolerance for insertion is on the order of millimeters. Each shape is a custom 3D-printed structure consisting of cubes with centimeter-scale sides. The goal is to build an a priori unknown structure with a height of two cubes such that the base plate is covered. The game has a planning and a manipulation part. Solving the planning problem involves deciding the pose of each shape and could e.g. be done using Mixed-Integer Programming [42]. On the other hand, stacking/inserting the shapes together can be seen as a fine manipulation task, which is known to be difficult for robots [2].
We see the Ubongo 3D task as a proxy for more complex assembly scenarios with small tolerances. Due to the multiple points of contact involved in the insertion, we also consider it a particularly suitable task for model-free RL approaches, since they do not rely on accurate models, which would be hard to obtain in practice. Additionally, the insertion here is not invariant to changes in orientation and therefore requires learning orientations as part of the residual policy.
For the experiments in this paper, we assume the blue and green shapes are already placed and fixed, and learn how to insert the red shape from human-demonstrated trajectories. Demonstrations are recorded starting from different initial configurations, using a motion capturing system and markers attached to the objects. We record the positions and orientations of the interest object (the red shape) in the reference frame of a target object (the blue shape). Notably, these demonstrations are recorded from external observations in Cartesian space, i.e. a human demonstrator moving the objects, and not with kinesthetic teaching on the robot. Fig. 3 shows the recorded trajectories, the resulting learned ProMP (orange), and an example of conditioning on a new (not initially demonstrated) initial position (blue). For learning the ProMP, we chose the number of basis functions after comparing the data log-likelihood in a grid search.
The Cartesian impedance controller and the learnable policy each run at a fixed rate. The task is executed episodically with a fixed number of steps, with each episode lasting on the order of seconds and a full learning trial on the order of minutes. An episode terminates if the interest object (red shape) is within a small position and orientation distance of the goal. The state is the position and orientation (as quaternion) of the end-effector in the target object frame. While recent works [13] also use the external wrench expressed in the target frame, we found in our experiments that the measurement provided by the Panda robot was too unreliable and not useful for learning.
We experimented with a sparse reward, but found that the task was too difficult to learn from it. We hypothesize that this was due to the termination condition being too strict: quite often the red shape was already nearly in place, but not close enough to trigger termination and obtain a reward. For a peg-in-hole task, many works use a sparse reward, but once the peg is inside the hole the region of exploration is small, especially with a fixed orientation, and the agent simply has to push down the peg. In our case, we have a combination of an insertion and a precise placing task, where after starting the insertion the red shape can still rotate around one axis. Since an approximation of the final pose is known, we decided to use instead a per-step dense reward function that weighs the absolute distances of the error in position and orientation. Using dense rewards is also common in other works that learn on the real system [4]. The discount factor $\gamma$ was fixed.
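A minimal sketch of such a per-step dense reward is given below; the weights `w_p` and `w_o` are hypothetical placeholders, since the exact values used in the experiments are not reproduced here.

```python
import numpy as np

def dense_reward(p_err, o_err, w_p=1.0, w_o=1.0):
    """Per-step dense reward: negative weighted sum of the absolute
    position error (3-vector) and orientation error (axis-angle 3-vector).
    """
    return -(w_p * np.abs(p_err).sum() + w_o * np.abs(o_err).sum())
```

The reward is zero only at the exact goal pose and grows more negative with both position and orientation error, so every step gives the agent a gradient toward the insertion pose.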
The policy and $Q$-functions are two-layer neural networks with ReLU activations, and the entropy regularization weight in SAC ($\alpha$) is also learned. All parameters are optimized with ADAM [43]. The initial replay size and the number of samples collected before policy updates are equal to the batch size used for actor and critic learning. In practice, we did not find the need to use a recurrent policy as in [13, 6].

IV-B Results
We report the results of executing the nominal controller and three adaptation strategies:

Residual RL: the residual policy is always active.

Residual RL with distance-based adaptation: the residual policy is only active in a region close to the goal, defined by a distance threshold.

Residual RL with variance-based adaptation (ours): detailed in Eq. 1.
Figures 4 and 5 show snapshots of the execution of the nominal controller and of the learned residual policy with the variance-based adaptation strategy, respectively. Notice that the initial position is far away from the goal, which makes it more difficult to precisely track the object trajectory until the insertion point.
The average reward, number of time steps to succeed, and success rate during training of the three adaptation strategies are depicted in Fig. 6. We can observe that all methods show an improving trend in the average reward curve (leftmost plot). The results show that while the nominal controller (red dashed line) cannot solve the task due to the low controller gains, our method (yellow line) and residual RL (blue line) can improve on it, as seen in the average reward plot. Using the residual RL policy for the whole trajectory can lead to exploration far from the insertion point, and small deviations can accumulate over time, leading to a point where the red shape gets stuck and cannot finish the insertion. In the rightmost plot, the success rate is already nonzero at the start of training, because some trials could already perform an insertion just by adding small Gaussian noise around the nominal trajectory. Even though we initialize the policy neural network to output means close to zero, as is common in RRL, by construction the network computing the variance outputs a nonzero value, which is sufficient to move the shape slightly around the nominal trajectory. For the distance-based adaptation (green lines) we made sure to only switch fully to the learnable controller when the red shape is already in contact with the green shape (see Fig. 4). While this strategy is slowly learning (note the increase in reward in the leftmost plot of Fig. 6), meaning the policy is successfully bringing the red shape towards the goal location, it could not complete the task in any of the trials, as was expected due to the random exploration. Lastly, the number of steps it takes for an episode to terminate is correlated with the success rate.
V Conclusion and Future Work
In this paper, we studied how to use human demonstrations of object-centric trajectories in combination with ProMPs and residual learning. Because the Cartesian impedance controller has low gains to guarantee safe interaction, simply following the trajectory from a conditioned ProMP resulted in task failure. We overcome this problem by learning a residual policy in position and orientation with model-free reinforcement learning. Making use of the variability in the demonstrations, we presented an adaptation strategy based on the variance of the ProMP as an indication of the region where the insertion takes place, and thus where the policy needs to be learned, thereby increasing sample efficiency. The experimental evaluations on the real robot showed that our method is able to learn a policy that corrects the ProMP trajectory to perform a block insertion task in the Ubongo 3D game.
In future work, we plan to evaluate our approach on other tasks with different objects, include a formulation of orientation ProMPs on Riemannian manifolds [44], learn the adaptation strategy parameters as part of the policy learning, and study other trajectory representation methods using state-space information [45, 35].
References
 [1] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Survey: Robot programming by demonstration,” Springer, Tech. Rep., 2008.
 [2] O. Kroemer, S. Niekum, and G. D. Konidaris, “A review of robot learning for manipulation: Challenges, representations, and algorithms,” Journal of Machine Learning Research, vol. 22, no. 30, 2021.
 [3] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling, “Residual policy learning,” arXiv preprint arXiv:1812.06298, 2018.
 [4] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine, “Residual reinforcement learning for robot control,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6023–6029.
 [5] G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine, “Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5548–5555.
 [6] T. B. Davchev, K. S. Luck, M. Burke, F. Meier, S. Schaal, and S. Ramamoorthy, “Residual learning from demonstration: Adapting DMPs for contact-rich manipulation,” IEEE Robotics and Automation Letters, 2022.
 [7] I. Nematollahi, E. Rosete-Beas, A. Röfer, T. Welschehold, A. Valada, and W. Burgard, “Robot skill adaptation via soft actor-critic Gaussian mixture models,” arXiv preprint arXiv:2111.13129, 2021.
 [8] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., vol. 26. Curran Associates, Inc., 2013.
 [9] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Using probabilistic movement primitives in robotics,” Autonomous Robots, vol. 42, no. 3, pp. 529–551, 2018.
 [10] S. Gomez-Gonzalez, G. Neumann, B. Schölkopf, and J. Peters, “Adaptation and robust learning of probabilistic movement primitives,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 366–379, Mar. 2020.
 [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the International Conference on Machine Learning (ICML), vol. 80, 2018, pp. 1856–1865.
 [12] M. Alakuijala, G. Dulac-Arnold, J. Mairal, J. Ponce, and C. Schmid, “Residual reinforcement learning from demonstrations,” arXiv preprint arXiv:2106.08050, 2021.
 [13] P. Kulkarni, J. Kober, R. Babuška, and C. Della Santina, “Learning assembly tasks in a few minutes by combining impedance control and residual recurrent reinforcement learning,” Advanced Intelligent Systems, p. 2100095, 2021.
 [14] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018, pp. 1587–1596.
 [15] R. Martín-Martín, M. A. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg, “Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 1010–1017.
 [16] B. Kim, J. Park, S. Park, and S. Kang, “Impedance learning for robotic contact tasks using natural actor-critic algorithm,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 40, no. 2, pp. 433–443, 2009.
 [17] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, Mar. 2008.
 [18] C. C. Beltran-Hernandez, D. Petit, I. G. Ramirez-Alpizar, T. Nishi, S. Kikuchi, T. Matsubara, and K. Harada, “Learning force control for contact-rich manipulation tasks with rigid position-controlled robots,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5709–5716, 2020.
 [19] C. C. Beltran-Hernandez, D. Petit, I. G. Ramirez-Alpizar, and K. Harada, “Variable compliance control for robotic peg-in-hole assembly: A deep-reinforcement-learning approach,” Applied Sciences, vol. 10, no. 19, p. 6923, 2020.
 [20] Y. Wang, C. C. Beltran-Hernandez, W. Wan, and K. Harada, “Robotic imitation of human assembly skills using hybrid trajectory and force learning,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11278–11284.
 [21] Y. Ding, C. Florensa, P. Abbeel, and M. Phielipp, “Goal-conditioned imitation learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
 [22] Y. Wang, C. C. Beltran-Hernandez, W. Wan, and K. Harada, “Hybrid trajectory and force learning of complex assembly tasks: A combined learning framework,” IEEE Access, vol. 9, pp. 60175–60186, 2021.
 [23] M. A. Lee, C. Florensa, J. Tremblay, N. Ratliff, A. Garg, F. Ramos, and D. Fox, “Guided uncertainty-aware policy optimization: Combining learning and model-based strategies for sample-efficient policy learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 7505–7512.
 [24] S. Hoppe, M. Giftthaler, R. Krug, and M. Toussaint, “Sample-efficient learning for industrial assembly using Q-graph-bounded DDPG,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 9080–9087.
 [25] A. Ranjbar, N. A. Vien, H. Ziesche, J. Boedecker, and G. Neumann, “Residual feedback learning for contact-rich manipulation tasks with uncertainty,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 2383–2390.
 [26] Y. Shi, Z. Chen, Y. Wu, D. Henkel, S. Riedel, H. Liu, Q. Feng, and J. Zhang, “Combining learning from demonstration with learning by exploration to facilitate contact-rich tasks,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1062–1069.
 [27] G. Wang, M. Xin, W. Wu, Z. Liu, and H. Wang, “Learning of long-horizon sparse-reward robotic manipulator tasks with base controllers,” arXiv e-prints, 2020.
 [28] O. Spector and D. Di Castro, “InsertionNet – a scalable solution for insertion,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5509–5516, 2021.
 [29] N. J. Cho, S. H. Lee, J. B. Kim, and I. H. Suh, “Learning, improving, and generalizing motor skills for the peg-in-hole tasks based on imitation learning and self-learning,” Applied Sciences, vol. 10, no. 8, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/8/2719
 [30] J. Kober and J. Peters, “Policy search for motor primitives in robotics,” Machine Learning, vol. 84, no. 1–2, pp. 171–203, 2011.
 [31] A. Albu-Schäffer, C. Ott, U. Frese, and G. Hirzinger, “Cartesian impedance control of redundant robots: recent results with the DLR-light-weight-arms,” in 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), vol. 3, 2003, pp. 3704–3709.
 [32] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal, “Dynamical movement primitives: Learning attractor models for motor behaviors,” Neural Computation, vol. 25, no. 2, pp. 328–373, 2013.
 [33] S. Calinon, F. Guenter, and A. Billard, “On learning, representing, and generalizing a task in a humanoid robot,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 37, no. 2, pp. 286–298, 2007.
 [34] Y. Huang, L. Rozo, J. Silvério, and D. Caldwell, “Kernelized movement primitives,” The International Journal of Robotics Research, vol. 38, pp. 833–852, 2019.
 [35] J. Urain, M. Ginesi, D. Tateo, and J. Peters, “ImitationFlow: Learning deep stable stochastic dynamic systems by normalizing flows,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020.
 [36] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NIPS), 1999, pp. 1057–1063.
 [37] T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” in Proceedings of the 29th International Conference on Machine Learning, ser. ICML’12. Madison, WI, USA: Omnipress, 2012, pp. 179–186.
 [38] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
 [39] L. Rozo and V. Dave, “Orientation probabilistic movement primitives on riemannian manifolds,” CoRR, vol. abs/2110.15036, 2021. [Online]. Available: https://arxiv.org/abs/2110.15036
 [40] D. Koert, J. Pajarinen, A. Schotschneider, S. Trick, C. Rothkopf, and J. Peters, “Learning intention aware online adaptation of movement primitives,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3719–3726, 2019.
 [41] “Ubongo 3D,” May 2021. [Online]. Available: https://www.kosmosgames.co.uk/games/ubongo3d/
 [42] M. Conforti, G. Cornuéjols, and G. Zambelli, Integer Programming. Springer Publishing Company, Incorporated, 2014.
 [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), San Diego, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
 [44] L. Rozo* and V. Dave*, “Orientation probabilistic movement primitives on riemannian manifolds,” in Conference on Robot Learning, vol. 5, 2021, p. 11. [Online]. Available: https://cps.unileoben.ac.at/wp/orientation_probabilistic_move.pdf
 [45] S. M. Khansari-Zadeh and A. Billard, “A dynamical system approach to real-time obstacle avoidance,” Autonomous Robots, vol. 32, no. 4, pp. 433–454, 2012. [Online]. Available: http://infoscience.epfl.ch/record/174759