Bayesian Gaussian mixture model for robotic policy imitation

04/24/2019
by   Emmanuel Pignat, et al.
Idiap Research Institute

A common approach to learn robotic skills is to imitate a policy demonstrated by a supervisor. One of the existing problems is that, due to the compounding of small errors and perturbations, the robot may leave the states where demonstrations were given. If no strategy is employed to provide a guarantee on how the robot will behave when facing unknown states, catastrophic outcomes can occur. An appealing approach is to use Bayesian methods, which offer a quantification of the action uncertainty given the state. Bayesian methods are usually more computationally demanding and require more complex design choices than their non-Bayesian alternatives, which limits their application. In this work, we present a Bayesian method that is simple to set up, computationally efficient, and able to adapt to a wide range of problems. These advantages make this method very convenient for imitation of robotic manipulation tasks in the continuous domain. We exploit the provided uncertainty to fuse the imitation policy with other policies. The approach is validated on a Panda robot with three tasks using different control input/state pairs.


I Introduction

Many learning modalities exist to acquire robot manipulation tasks. Reward-based methods, such as optimal control (OC) or reinforcement learning (RL), either require accurate models or a large number of samples. An appealing alternative is behavior cloning (or policy imitation), where the robot learns to imitate a policy, a conditional model $\pi(\bm{u}|\bm{x})$ from state $\bm{x}$ to control command $\bm{u}$. Due to modeling errors, perturbations or different initial conditions, executing such a policy can quickly lead the robot far from the distribution of states visited during the learning phase. This problem is often referred to as the distributional shift [1]. When applied to a real system, the resulting actions can be dangerous and the consequences catastrophic. Many approaches, such as [1], have addressed this problem in the general case. A subset of these approaches focuses on learning manipulation tasks from a small set of demonstrations [2][3][4]. In order to guarantee safe actions, these techniques typically add constraints to the policy, by introducing time-dependence structures or by developing hybrid, less general approaches.

We propose to keep the flexibility of policy imitation without overly constraining the policy to be learned, by relying on Bayesian models, which have the advantage of quantifying their uncertainty. Uncertainty quantification is very important in behavior cloning due to the problems mentioned above. This capability generally comes at the expense of being computationally demanding, which often reduces the applicability of such models. In this work, we propose a computationally efficient and simple-to-apply Bayesian model that can be used for policy imitation. It allows active learning or fusion of policies, which will be detailed in Sec. IV. The flexibility of the proposed model enables its use in wide-ranging data problems, in which the robot can start learning from a small set of data, without limiting its capability to increase the complexity of the task when more data become available.

Throughout the article, we will consider a simple 2D velocity-controlled system, with the position as state $\bm{x}$ and the velocity as control command $\bm{u}$, for didactic and visualization purposes (see e.g., Fig. 1). However, the approach is developed for higher-dimensional systems, which will be demonstrated in Sec. V with velocity and force control of a 7-axis manipulator.

Fig. 1: Comparison of Bayesian and non-Bayesian GMM conditional distributions. The flow field represents the expectation of $\pi(\bm{u}|\bm{x})$ and the colormap its entropy; yellow indicates lower entropy, meaning higher certainty. The black lines are the demonstrations. (a) Bayesian model: Certainty is localized in the vicinity of the demonstrations. The policy retrieved further away can result in poor generalization, but the system is aware of this through an uncertainty estimate. (b) Non-Bayesian model: The entropy only relates to the variations of the demonstrated policy instead of a Bayesian uncertainty.

II Related Work

Several works have tackled the distributional shift problem in the general case. In [1], Ross et al. used a combination of the expert and learner policies. In [5], perturbations are added in order to force the expert to show how to recover from them, resulting in a more robust policy.

More closely related to our manipulation tasks, Khansari-Zadeh et al. have also used GMM conditioning to learn a policy of the form $\dot{\bm{x}} = \bm{f}(\bm{x})$ [3]. They impose a structure on the parameters to guarantee asymptotic convergence. Dynamic movement primitives (DMP) are a popular approach that combines a stable controller (spring-damper system) with a non-linear modulation (forcing term) decaying over time through the use of a phase variable, ensuring convergence at the end of the motion [6]. A similar approach is used in [2], with GMM conditioning for the non-linear part. Due to their underlying time dependence, these approaches are often limited to either point-to-point motions or cyclic motions of known period, with limited temporal and spatial robustness to perturbations.

Some approaches avoid these problems by modeling distributions of states or trajectories instead of learning policies [7, 4], which consequently limits these techniques to mainly replicate trajectories.

Inverse optimal control [8], which tries to recover the objective minimized by the demonstrations, is another direction to increase robustness and generalization. It often requires more training data and is computationally very demanding.

If a reward function for the task is accessible, an interesting alternative is to combine policy imitation with reinforcement learning [9, 10]. The imitation loss helps reduce the duration of the exploration phase, while reinforcement learning overcomes the limits of imitation.

III Bayesian Gaussian Mixture Model Conditioning

Numerous regression techniques exist and have been applied to robotics. In this section, we derive a Bayesian version of Gaussian mixture conditioning [11], which was already used in [2, 3]. We start by listing all the desirable features for our task, besides uncertainty quantification, to motivate the need for a different approach.

Multimodal conditional

Policies from human demonstrations are never optimal but exhibit some stochasticity. In some tasks (e.g., involving obstacles), the policy can even be clearly multimodal. Our approach should be able to encode arbitrarily complex conditional distributions. Existing approaches such as LWPR [12] or GP [13] only model unimodal conditional distributions. In its original form, GP regression also assumes homoscedasticity (constant covariance over the state).

Efficient and robust computation

Most Bayesian models are computationally more demanding than their non-Bayesian counterparts, both at learning and at prediction time. To keep learning fast and interactive, we seek a low training time (below 5 s for the tasks shown in the experiments). For reactivity and stability, the prediction time should be far below 1 ms. We also seek to avoid difficult model selection and hyperparameter tuning, for direct application to a wide range of tasks, controllers and robotic platforms.

Wide-ranging data

The approach should be able to encode a very simple policy from a few datapoints, and more complex ones when more datapoints are available.

Representing a joint distribution as a Gaussian mixture model and computing conditional distributions is simple and fast [11], with various applications in robotics [2, 3]. In this section, we present the components required to derive its Bayesian version. Bayesian regression methods approximate the posterior distribution of the model parameters $\bm{\theta}$ given the input and output datasets $\bm{X}$ and $\bm{Y}$. Here we refer to the output as $\bm{y}$; it corresponds to the control command $\bm{u}$ in the imitation problem. Predictions are then made by marginalizing over the parameter posterior distribution, so as to propagate model uncertainties to the predictions,

$$p(\bm{y} \mid \bm{x}^*, \bm{X}, \bm{Y}) = \int p(\bm{y} \mid \bm{x}^*, \bm{\theta})\, p(\bm{\theta} \mid \bm{X}, \bm{Y})\, d\bm{\theta}, \quad (1)$$

where $\bm{x}^*$ is a new query. This distribution is called the predictive posterior. Parametric non-Bayesian methods typically rely on a single point estimate of the parameters, such as maximum likelihood. A variety of methods, adapted to the model at hand, exist for approximating $p(\bm{\theta} \mid \bm{X}, \bm{Y})$: for example variational methods, Monte Carlo methods or expectation propagation. For its scalability and efficiency, we present here a variational method using conjugate priors and a mean-field approximation; see [14] for additional details.
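
To make (1) concrete, here is a minimal, hypothetical sketch (the function and the `predict` callback are illustrative, not from the paper) that approximates the predictive posterior by Monte Carlo, averaging per-parameter predictions over posterior samples instead of using a single point estimate:

```python
import numpy as np

def predictive_posterior_mc(x_query, theta_samples, predict):
    """Monte Carlo approximation of (1): average p(y | x*, theta) over
    samples theta ~ p(theta | X, Y). `predict(x_query, theta)` is assumed
    to return an object with a .pdf(y) method (e.g., a frozen SciPy dist)."""
    def pdf(y):
        return np.mean([predict(x_query, th).pdf(y) for th in theta_samples])
    return pdf
```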

III-A Bayesian analysis of the multivariate normal distribution

We first focus on the Bayesian analysis of the multivariate normal distribution (MVN), also detailed in [15], and then treat the mixture case. As a standard notation, we first use $\bm{x}$ alone to denote the observations; we will later encode a joint distribution of $\bm{x}$ and $\bm{y}$.

Prior

The conjugate prior of the MVN is the normal-Wishart distribution. The convenience of a conjugate prior is the closed-form expression of the posterior. The normal-Wishart distribution is a distribution over the mean $\bm{\mu}$ and the precision matrix $\bm{\Lambda}$,

$$p(\bm{\mu}, \bm{\Lambda}) = \mathrm{NW}(\bm{\mu}, \bm{\Lambda} \mid \bm{\mu}_0, \kappa_0, \bm{T}_0, \nu_0) \quad (2)$$
$$= \mathcal{N}\big(\bm{\mu} \mid \bm{\mu}_0, (\kappa_0 \bm{\Lambda})^{-1}\big)\, \mathcal{W}(\bm{\Lambda} \mid \bm{T}_0, \nu_0), \quad (3)$$

where $\bm{T}_0$ and $\nu_0$ are the scale matrix and the degrees of freedom of the Wishart distribution, and $\kappa_0$ is the precision of the mean.

Posterior

The closed-form expression for the posterior is

$$p(\bm{\mu}, \bm{\Lambda} \mid \bm{X}) = \mathcal{N}\big(\bm{\mu} \mid \bm{\mu}_n, (\kappa_n \bm{\Lambda})^{-1}\big)\, \mathcal{W}(\bm{\Lambda} \mid \bm{T}_n, \nu_n), \quad (4)$$
$$\bm{\mu}_n = \frac{\kappa_0 \bm{\mu}_0 + n \bar{\bm{x}}}{\kappa_0 + n}, \quad (5)$$
$$\kappa_n = \kappa_0 + n, \quad (6)$$
$$\nu_n = \nu_0 + n, \quad (7)$$
$$\bm{T}_n = \bm{T}_0 + \sum_{i=1}^{n} (\bm{x}_i - \bar{\bm{x}})(\bm{x}_i - \bar{\bm{x}})^{\top} + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{\bm{x}} - \bm{\mu}_0)(\bar{\bm{x}} - \bm{\mu}_0)^{\top}, \quad (8)$$

where $n$ is the number of observations and $\bar{\bm{x}}$ is the empirical mean. As with other conjugate priors, the prior distribution can be interpreted as pseudo-observations to which the dataset is added.
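
As a minimal sketch of the update (4)–(8), assuming the normal-Wishart parameterization above (variable names are illustrative):

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, T0, nu0):
    """Update the normal-Wishart hyperparameters given observations X of shape (n, D),
    following (4)-(8)."""
    n, _ = X.shape
    xbar = X.mean(axis=0)                      # empirical mean
    diff = X - xbar
    S = diff.T @ diff                          # scatter matrix around the empirical mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    d0 = (xbar - mu0)[:, None]
    T_n = T0 + S + (kappa0 * n / kappa_n) * (d0 @ d0.T)
    return mu_n, kappa_n, T_n, nu_n
```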

Posterior predictive

The posterior predictive distribution is a multivariate t-distribution with $\nu_n - D + 1$ degrees of freedom,

$$p(\bm{x} \mid \bm{X}) = t_{\nu_n - D + 1}\!\left(\bm{x} \;\Big|\; \bm{\mu}_n,\ \frac{(\kappa_n + 1)\, \bm{T}_n}{\kappa_n (\nu_n - D + 1)}\right), \quad (9)$$

where $D$ is the dimension of $\bm{x}$. The multivariate t-distribution has heavier tails than the MVN. The MVN is a special case of the latter, recovered when the degrees of freedom tend to infinity, which corresponds to having infinitely many observations.
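
Assuming SciPy ≥ 1.6 (which provides `scipy.stats.multivariate_t`), the posterior predictive (9) can be built from the updated hyperparameters as in this sketch:

```python
from scipy.stats import multivariate_t

def posterior_predictive(mu_n, kappa_n, T_n, nu_n):
    """Posterior predictive t-distribution (9) of an MVN with a normal-Wishart posterior."""
    D = mu_n.shape[0]
    df = nu_n - D + 1                          # degrees of freedom of the predictive
    scale = (kappa_n + 1.0) / (kappa_n * df) * T_n
    return multivariate_t(loc=mu_n, shape=scale, df=df)
```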

Conditional posterior predictive

For our application, we are interested in computing conditional distributions in the joint distribution of our input $\bm{x}$ and output $\bm{y}$. We rewrite the result from (9) as

$$p\!\left(\begin{bmatrix} \bm{x} \\ \bm{y} \end{bmatrix} \;\Big|\; \bm{X}, \bm{Y}\right) = t_{\nu}\!\left(\begin{bmatrix} \bm{x} \\ \bm{y} \end{bmatrix} \;\Big|\; \bm{\mu}, \bm{\Sigma}\right). \quad (10)$$

Following [16], the conditional distribution of a multivariate t-distribution is also multivariate t,

$$p(\bm{y} \mid \bm{x}) = t_{\nu_{y|x}}\big(\bm{y} \mid \bm{\mu}_{y|x}, \bm{\Sigma}_{y|x}\big), \quad (11)$$

with

$$\bm{\mu}_{y|x} = \bm{\mu}_y + \bm{\Sigma}_{yx} \bm{\Sigma}_{xx}^{-1} (\bm{x} - \bm{\mu}_x), \quad (12)$$
$$\bm{\Sigma}_{y|x} = \frac{\nu + (\bm{x} - \bm{\mu}_x)^{\top} \bm{\Sigma}_{xx}^{-1} (\bm{x} - \bm{\mu}_x)}{\nu + d_x} \big(\bm{\Sigma}_{yy} - \bm{\Sigma}_{yx} \bm{\Sigma}_{xx}^{-1} \bm{\Sigma}_{xy}\big), \quad (13)$$
$$\nu_{y|x} = \nu + d_x, \quad (14)$$

where $d_x$ is the dimension of $\bm{x}$, and $\bm{\mu}$ and $\bm{\Sigma}$ can be decomposed as

$$\bm{\mu} = \begin{bmatrix} \bm{\mu}_x \\ \bm{\mu}_y \end{bmatrix}, \qquad \bm{\Sigma} = \begin{bmatrix} \bm{\Sigma}_{xx} & \bm{\Sigma}_{xy} \\ \bm{\Sigma}_{yx} & \bm{\Sigma}_{yy} \end{bmatrix}. \quad (15)$$

The conditional mean follows a linear trend in $\bm{x}$, and the conditional scale matrix increases as the query point gets far from the input marginal distribution, at a rate depending on the degrees of freedom. These expressions are similar to the MVN conditional, but the scale ($\bm{\Sigma}_{yy} - \bm{\Sigma}_{yx} \bm{\Sigma}_{xx}^{-1} \bm{\Sigma}_{xy}$ for the MVN) has an additional factor, increasing the uncertainty as $\bm{x}$ moves away from the known distribution.
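
A minimal sketch of (12)–(15), conditioning a joint multivariate t-distribution on its first $d_x$ dimensions (names are illustrative):

```python
import numpy as np

def condition_t(x, mu, Sigma, nu, dx):
    """Condition a multivariate t (location mu, scale Sigma, dof nu), whose first
    dx dimensions are the input, on the query x, following (12)-(14)."""
    mu_x, mu_y = mu[:dx], mu[dx:]
    Sxx, Sxy = Sigma[:dx, :dx], Sigma[:dx, dx:]
    Syx, Syy = Sigma[dx:, :dx], Sigma[dx:, dx:]
    Sxx_inv = np.linalg.inv(Sxx)
    d = x - mu_x
    mu_c = mu_y + Syx @ Sxx_inv @ d                                   # (12)
    maha = float(d @ Sxx_inv @ d)                                     # Mahalanobis term
    Sigma_c = (nu + maha) / (nu + dx) * (Syy - Syx @ Sxx_inv @ Sxy)   # (13)
    nu_c = nu + dx                                                    # (14)
    return mu_c, Sigma_c, nu_c
```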

III-B Bayesian analysis of the mixture model

Using the conjugate prior for the MVN leads to very efficient training of mixtures with mean-field approximation or Gibbs sampling. Efficient algorithms similar to expectation maximization (EM) can be derived. For brevity, we will here only summarize the results relevant to our application (see e.g., [14] for additional details). Using mean-field approximation and variational inference, the posterior predictive distribution of a mixture of MVNs is a mixture of multivariate t-distributions,

$$p(\bm{x} \mid \bm{X}) = \sum_{k=1}^{K} \pi_k\, t_{\nu_k}\big(\bm{x} \mid \bm{\mu}_k, \bm{\Sigma}_k\big). \quad (16)$$

The conditional posterior distribution is then also a mixture,

$$p(\bm{y} \mid \bm{x}, \bm{X}, \bm{Y}) = \sum_{k=1}^{K} h_k(\bm{x})\, t_{\nu_{k,y|x}}\big(\bm{y} \mid \bm{\mu}_{k,y|x}, \bm{\Sigma}_{k,y|x}\big), \quad (17)$$

where we need to compute the marginal probability of each component $k$ given the input $\bm{x}$,

$$h_k(\bm{x}) = \frac{\pi_k\, t_{\nu_k}\big(\bm{x} \mid \bm{\mu}_{k,x}, \bm{\Sigma}_{k,xx}\big)}{\sum_{j=1}^{K} \pi_j\, t_{\nu_j}\big(\bm{x} \mid \bm{\mu}_{j,x}, \bm{\Sigma}_{j,xx}\big)}, \quad (18)$$

and apply the conditioning (12)–(14) in each component,

$$t_{\nu_{k,y|x}}\big(\bm{y} \mid \bm{\mu}_{k,y|x}, \bm{\Sigma}_{k,y|x}\big). \quad (19)$$

Equation (18) exploits the property that marginals of multivariate-t distributions are of the same family [16].
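
Combining (17)–(19), a sketch of the conditional mixture (reusing `condition_t` from the sketch above; SciPy's `multivariate_t` evaluates the input marginals appearing in (18)):

```python
import numpy as np
from scipy.stats import multivariate_t

def gmm_t_condition(x, weights, mus, Sigmas, nus, dx):
    """Conditional mixture of t-distributions: returns the responsibilities h_k(x)
    of (18) and the per-component conditional (mu, Sigma, nu) of (19)."""
    K = len(weights)
    log_h = np.empty(K)
    components = []
    for k in range(K):
        # Input marginal of component k: same dof, location mu_x, scale Sigma_xx.
        marginal = multivariate_t(loc=mus[k][:dx], shape=Sigmas[k][:dx, :dx], df=nus[k])
        log_h[k] = np.log(weights[k]) + marginal.logpdf(x)            # numerator of (18)
        components.append(condition_t(x, mus[k], Sigmas[k], nus[k], dx))
    h = np.exp(log_h - log_h.max())                                   # normalize safely
    return h / h.sum(), components
```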

Fig. 2: Conditional distribution where the joint distribution of input and output is modeled as a mixture of Gaussians. The mean and the standard deviation are obtained by applying moment matching on the multimodal conditional distribution. (blue) The joint distribution has a Dirichlet process prior. (red) The prior is a Dirichlet distribution, with a fixed number of clusters. (yellow) The prediction is done without integrating over the posterior distribution of the parameters, which results in a lack of uncertainty estimation far from the training data.

Dirichlet distribution or Dirichlet process

In [14], the Bayesian Gaussian mixture model is presented with a Dirichlet distribution prior over the mixing coefficients $\pi_k$. An alternative is to use a Dirichlet process, a non-parametric prior with an unbounded number of clusters. For the learning part, this allows the model to cope with an increasing number of datapoints and to adapt the model complexity. Very efficient online learning strategies exist [17]. With a Dirichlet process, the posterior predictive distribution is similar to (16), but with an additional mixture component corresponding to the prior predictive distribution of the MVN. Conditional distributions given points far from the training data thus have the shape of the marginal predictive prior, similarly to GP regression. Fig. 2 illustrates the difference, when conditioning, between the Dirichlet process and the Dirichlet distribution priors. In terms of policy, it means that when diverging from the distribution of visited states, the control command distribution will match the given prior.

With a Dirichlet process, the model presented here is a particular case of [18], having the advantages of faster training with variational techniques and faster retrieval with closed-form integration.
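
For the model fitting itself, a hedged illustration of the two prior choices using scikit-learn's `BayesianGaussianMixture` (it fits the variational posterior over the joint distribution but does not expose the t-distributed conditional, which would be computed as in the sketches above; the toy data and parameter values are illustrative):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy joint dataset (state, command); in practice these come from demonstrations.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 2))
commands = -states + 0.05 * rng.normal(size=(200, 2))     # a simple converging policy
data = np.hstack([states, commands])

# Truncated Dirichlet-process prior: superfluous components receive negligible weight.
dp_model = BayesianGaussianMixture(
    n_components=20,              # truncation level, not the final model complexity
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
).fit(data)

# Finite Dirichlet-distribution prior with a fixed number of clusters, for comparison.
dir_model = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_distribution",
    covariance_type="full",
    max_iter=500,
).fit(data)

print(np.round(dp_model.weights_, 3))   # many weights collapse toward zero
```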

IV Product of Policy Distributions

In cognitive science, it is known that two heads are better than one if they can provide an evaluation of their uncertainty when bringing their knowledge together [19]. In machine learning, fusing multiple sources is referred to as a product of experts (PoE) [20]. In this section, we propose to exploit the uncertainty quantification presented above by fusing multiple policies $\pi_i(\bm{u}|\bm{x})$, coming from multiple sources or learning strategies.

In the general case, computing the mode of a PoE requires optimization, and estimating the full distribution requires methods similar to posterior distribution approximation. This limits the use of such methods in applications where the policy should be computed fast. However, when the experts are MVNs, $\pi_i(\bm{u}|\bm{x}) = \mathcal{N}\big(\bm{u} \mid \bm{\mu}_i, \bm{\Sigma}_i\big)$, the product distribution has the closed-form expression

$$\hat{\pi}(\bm{u}|\bm{x}) = \mathcal{N}\Big(\bm{u} \;\Big|\; \big(\textstyle\sum_i \bm{\Lambda}_i\big)^{-1} \textstyle\sum_i \bm{\Lambda}_i \bm{\mu}_i,\ \big(\textstyle\sum_i \bm{\Lambda}_i\big)^{-1}\Big), \quad (20)$$

where $\bm{\Lambda}_i = \bm{\Sigma}_i^{-1}$ denotes the precision matrix (inverse of the covariance). This result has an intuitive interpretation: the estimate is an average of the sources weighted by their precisions. To be able to use this formula with the Bayesian GMM conditioning as one of the experts, moment matching can be applied to approximate the conditional mixture of multivariate t-distributions by an MVN.
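
A sketch of the fusion rule (20), together with the moment matching used to approximate the conditional mixture of t-distributions by an MVN (assuming each component's degrees of freedom exceed 2, so its covariance exists):

```python
import numpy as np

def moment_match(h, components):
    """Approximate a mixture of t-distributions by a single MVN.
    `components` holds (mu, Sigma, nu); the covariance of a t is nu/(nu-2) * Sigma."""
    means = np.array([mu for mu, _, _ in components])
    covs = np.array([nu / (nu - 2.0) * Sigma for _, Sigma, nu in components])
    mu = h @ means
    diff = means - mu
    Sigma = np.einsum('k,kij->ij', h, covs) + np.einsum('k,ki,kj->ij', h, diff, diff)
    return mu, Sigma

def fuse_policies(mus, Sigmas):
    """Product of Gaussian experts (20): precision-weighted fusion of policies."""
    Lambdas = [np.linalg.inv(S) for S in Sigmas]        # precision matrices
    Lambda = np.sum(Lambdas, axis=0)
    Sigma = np.linalg.inv(Lambda)
    mu = Sigma @ np.sum([L @ m for L, m in zip(Lambdas, mus)], axis=0)
    return mu, Sigma
```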

IV-A Examples of controllers

We present a set of policies that can be combined to increase robustness and that will be used in the experiments.

Optimal control

If the task can be formulated as a cost (e.g., attaining a given state), an interesting strategy is to use optimal control (OC). Classically, these techniques require an accurate model of the system and are subject to local minima (for example in an environment with obstacles). We propose to use, in combination with imitation, OC with crude model approximations (e.g., without modeling obstacles).

In order to combine the policies, we need the OC solver to retrieve a distribution of commands $p(\bm{u}|\bm{x})$. A first way is to use the OC solution as the mean of an MVN and to fix its precision heuristically, such that it dominates the imitation policy outside of the training data. A more rigorous way is to use the maximum entropy principle [21], retrieving a near-optimal stochastic policy. However, this technique is much more computationally demanding than the standard OC problem. When using linear dynamics and a quadratic cost, the maximum entropy solution can be retrieved very efficiently [22], using a linear quadratic regulator (LQR), as

$$\hat{\pi}_t(\bm{u}_t|\bm{x}_t) = \mathcal{N}\big(\bm{u}_t \mid \bm{K}_t \bm{x}_t + \bm{k}_t,\ \bm{Q}_{\bm{u}\bm{u},t}^{-1}\big), \quad (21)$$

where $\bm{K}_t$ and $\bm{k}_t$ are the time-varying feedback gains and offsets, and $\bm{Q}_{\bm{u}\bm{u},t}$ is the Hessian of the cost-to-go with respect to the control command. It should also be investigated whether uncertainties about the dynamics or the parameters of the cost function can be propagated to the policy.
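
A minimal sketch of a finite-horizon LQR under the linear-quadratic assumptions above, returning time-varying gains together with the maximum-entropy covariances $\bm{Q}_{\bm{u}\bm{u},t}^{-1}$ (dynamics, costs and names are illustrative, not the paper's implementation):

```python
import numpy as np

def max_entropy_lqr(A, B, Q_list, R, T):
    """Finite-horizon discrete LQR for x_{t+1} = A x_t + B u_t with stage costs
    x^T Q_t x + u^T R u. Returns feedback gains K_t and covariances Quu_t^{-1},
    as used in the maximum-entropy policy (21)."""
    V = Q_list[T]                                  # terminal cost-to-go Hessian
    gains, covariances = [], []
    for t in reversed(range(T)):
        Quu = R + B.T @ V @ B                      # Hessian of the Q-function w.r.t. u
        Qux = B.T @ V @ A
        K = -np.linalg.solve(Quu, Qux)             # feedback gain
        V = Q_list[t] + A.T @ V @ A + Qux.T @ K    # Riccati recursion
        gains.append(K)
        covariances.append(np.linalg.inv(Quu))     # max-entropy policy covariance
    return gains[::-1], covariances[::-1]
```

Setting all intermediate state costs `Q_list[t]` to zero and keeping only the terminal one reproduces the time-dependent behavior discussed next, with gains and precisions growing toward the end of the motion.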

Time-dependent policy

For discrete, point-to-point tasks, it is possible to use a policy that ensures convergence at the end of the motion. This structure is already given in dynamical movement primitives (DMP) [6] or in [2], where a phase variable is responsible for the switch between the controllers. Our approach allows for a more complex and better-motivated fusion of policies. This stable controller can either be engineered or computed with OC. For this paper, we use an LQR where the cost on the state only applies at the end of the task. The controller, in the form of (21), has gains and precision matrices increasing over time, as shown in Fig. 3b.

Fig. 3: Mixing imitation and LQR, with the final target specified. (a) Bayesian GMM policy: Time-independent policy learned with the presented method. (b) LQR policy: Only the final state target distribution is specified. This results in a time-dependent policy whose control gains increase over time. (c) Combination: The combination is a time-dependent policy. At the beginning, it applies the imitation policy in zones of high certainty and converges only lightly outside them. At the end of the task, it converges more strongly over the whole state space.

Conservative policy

We also propose to use a policy that brings the system back to the distribution of states where the policy is known, as in [23]. In their case, this policy was designed as a PD controller, but we propose to use OC for more generality. In the proposed GMM model, the conservative policy can either be optimized to minimize the uncertainty of the imitation policy (e.g., given by the entropy of the conditional distribution) or to converge to the marginal distribution of states. For the experiments, we choose the latter, solved with an LQR and a local quadratic approximation of the marginal distribution. Fig. 4b illustrates this policy and shows that this technique can encode cyclic motions, which would have been difficult to achieve with the previous propositions. We also note that the combination given by (20) is richer than the scalar combination proposed in [23], because it can take into account that policies may have different precisions along different axes.
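
As a sketch of the conservative policy under these choices (reusing `max_entropy_lqr` from above; the single-integrator model, horizon and numerical values are illustrative assumptions):

```python
import numpy as np

# Marginal state distribution N(mu_x, Sigma_x) taken from the learned joint model
# (illustrative values here).
mu_x = np.zeros(2)
Sigma_x = np.array([[0.20, 0.05],
                    [0.05, 0.10]])

dt = 0.01
A = np.eye(2)                      # single-integrator dynamics: x_{t+1} = x_t + dt * u_t
B = dt * np.eye(2)
Q_state = np.linalg.inv(Sigma_x)   # quadratic approximation of -log N(x | mu_x, Sigma_x)
R = 1e-2 * np.eye(2)               # small control cost
T = 200

gains, covariances = max_entropy_lqr(A, B, [Q_state] * (T + 1), R, T)
# The conservative feedback u_t = K_t (x_t - mu_x) pulls the system back toward the
# region of known states; covariances[t] gives its precision for the fusion rule (20).
```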

Fig. 4: Encoding a limit cycle can be done without any modification of our approach. (a) Bayesian GMM policy: By only applying this policy, the robot quickly diverges from the cycle and enters areas where the policy is unknown. (b) Conservative policy: This policy forces the robot to go back to zones where the policy is known, with a stabilizing effect. (c) Product of policies: The product of the two policies follows the imitated policy in known zones and converges back to these zones outside of them.

V Experiments

Fig. 5: (left) Obstacle navigation with a joint velocity controller. (center) Vision-based peg-in-hole with a velocity-controlled robot. (right) Whiteboard wiping with a force-controlled robot.

Three experiments are presented to demonstrate that the proposed approach can be used in a variety of tasks. They are performed on a 7-axis Panda robot.

V-A Obstacle navigation

In the first experiment, the robot should navigate through fixed obstacles from various initial configurations in order to grasp an object, see Fig. 5 (left). The system is defined with the joint angles as states and the joint velocities as control commands. In total, 13 demonstrations were performed (totaling 115 s of recording). Five of them were only partial, providing a precise policy only at the end of the motion. We chose to fuse the imitation policy with a conservative policy converging to the marginal distribution of joint angles, as illustrated in Fig. 4. As expected, the non-Bayesian policy quickly diverged and accelerated dangerously outside of the known area.

While an optimal control/planning approach would require modeling the obstacles and the robot volume, ours was able to propose a solution, robust to a wide range of starting points, that could be set up within a few minutes. We also avoided the problems of trajectory-based approaches and were able to perform partial demonstrations in areas where precise motions needed to be demonstrated.

V-B Vision-based peg-in-hole

In the second task, the robot should insert a peg in a moving hole by looking from two side cameras, see Fig. 5 (center). The center of the red peg and the center of the blue tube (top part) are extracted by image processing. The diameter of the peg is 15% smaller than that of the hole, such that the task can be solved with an imperfect vision system and without impedance control. The state is the pixel displacement between the hole and the peg in the two camera images. The control command is the Cartesian velocity of the gripper in the robot frame (the orientation was held fixed), but joint angle velocities would have been possible as well, likely requiring a few more demonstrations.

As an evaluation, the robot was initialized at 10 random postures (with the peg still being seen by the two cameras). We evaluated its capacity to insert the peg without touching the border of the hole or hitting the support. Multiple combinations of policies were used (among imitation, conservative and optimal control, as presented in Sec. IV). The conservative policy was similar to the one in the previous experiment, but converging to the marginal distribution in pixel space. Computing this policy requires the dynamics model, unknown in this experiment, which is learned by recording 10 s of random motions (using the same method as for policy imitation). The cost for the optimal control policy is defined as the negative log-likelihood of the distribution of desired final states, encoded as an MVN. The cost on the state only applies after some time, matching the duration of the longest demonstrations; this allows variability during the task and forces convergence at the end of the task. The cost on the control command is the negative log-likelihood of the distribution of control commands observed during the demonstrations. Results are reported in Table I.

Policies        | success | success with touching | failure
Imitation only  |   0.3   |          0.1          |   0.6
Im. + OC        |   0.4   |          0.4          |   0.2
Im. + CS        |   0.7   |          0.1          |   0.2
Im. + CS + OC   |   0.8   |          0.1          |   0.1
OC              |   0.0   |          0.1          |   0.9

TABLE I: Results of the peg-in-hole task with different combinations of policies (imitation, optimal control (OC) and conservative (CS)). The task is considered a success if the peg enters the hole without touching the border, or with only a slight touch. Hitting the border or getting stuck is considered a failure.

Imitation alone shows the effect of an accumulation of errors: it often fails if brought far from the known areas, and only stops slowly. Having neither learned nor modeled the obstacles, the OC policy results in a straight-line motion to the target and thus almost always hits the border (as illustrated in Fig. 3). It only works when initialized very close to the goal; in that context, it provides a good correction when used in combination with the other policies. The conservative policy effectively brings the system back to the known areas and provides a good improvement of the results.

In this experiment, we are in a similar situation as [3] with velocity control. The difference is that, as we do not directly control the pixel velocity, another strategy is required to increase stability.

V-C Force-based board wiping

In the third task, we demonstrate learning of a force policy within a cyclic motion. The robot has to wipe a board by applying a force against it while performing circular motions. The state is composed of the position and linear velocity of the end-effector (defined at the gripper) in the robot base frame. The control command is the force $\bm{f}$ to apply at the end-effector, which is then transformed to joint torques, to which Coriolis and gravity compensation torques are added. To be able to record forces during the demonstrations, teleoperation is used with a second 7-axis Panda robot, the two end-effectors being linked to each other using a virtual spring-damper system with a position offset. One demonstration of the circular cyclic motion is performed; 7 others show how to start or recover when not in contact with the board or outside of the wiping area. As in the first experiment, we used a combination of the imitation policy and the conservative policy, which forces convergence to the marginal distribution of positions and velocities. By executing only the imitation policy, the robot presses against the board but applies a very imprecise force tangentially to the wiping path to compensate for friction, and quickly diverges from the original path. Except perpendicularly to the board, the applied forces are quite clumsy and show a wide variance, which explains the failure of this policy. The conservative policy alone executes circle-shaped motions very robustly without applying any force against the board, resulting in no cleaning. The fusion rule with full precision matrices, as in (20), allows the imitation policy to be used perpendicularly to the board and the conservative policy along the other directions, according to their respective precisions. The resulting policy is very robust: even when the user strongly perturbs the robot, it slowly converges back to the board and to the circular motion before starting to wipe again. This robustness and this combination of periodic (wiping) and discrete (converging back to the wiping area) behaviors would have been very difficult (if not impossible) to achieve with trajectory-based or time-dependent approaches.

VI Conclusion

In this paper, we presented a regression technique with several features of interest for robotic applications. We applied this technique to the common problem of distributional shift in policy imitation, where the uncertainty can be exploited for an intelligent fusion of controllers. Many approaches for learning robotic manipulation tasks impose structure or restrictions on the policy, which limits their range of applications. We showed in three distinct experiments that our approach can be applied without modification to many state-control systems and to a variety of tasks (discrete and/or periodic).

References

  • [1] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proc. Intl Conf. on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635.
  • [2] M. Hersch, F. Guenter, S. Calinon, and A. Billard, “Dynamical system modulation for robot learning via kinesthetic demonstrations,” IEEE Trans. on Robotics, vol. 24, no. 6, pp. 1463–1467, 2008.
  • [3] S. M. Khansari-Zadeh and A. Billard, “Learning stable nonlinear dynamical systems with gaussian mixture models,” IEEE Trans. on Robotics, vol. 27, no. 5, pp. 943–957, 2011.
  • [4] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2616–2624.
  • [5] M. Laskey, J. Lee, R. Fox, A. D. Dragan, and K. Y. Goldberg, “DART: Noise injection for robust imitation learning,” in Conference on Robot Learning (CoRL), 2017.
  • [6] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert, “Learning movement primitives,” in Intl Journal of Robotic Research.   Springer, 2005, pp. 561–572.
  • [7] S. Calinon, D. Bruno, and D. G. Caldwell, “A task-parameterized probabilistic model with minimal intervention control,” in Proc. IEEE Intl Conf. on Robotics and Automation (ICRA).   IEEE, 2014, pp. 3339–3344.
  • [8] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on Machine Learning, 2016, pp. 49–58.
  • [9] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” arXiv preprint arXiv:1709.10087, 2017.
  • [10] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in Proc. IEEE Intl Conf. on Robotics and Automation (ICRA).   IEEE, 2018, pp. 6292–6299.
  • [11] H. G. Sung, “Gaussian mixture regression and classification,” PhD thesis, Rice University, Houston, Texas, 2004.
  • [12] S. Schaal, C. G. Atkeson, and S. Vijayakumar, “Scalable techniques from nonparametric statistics for real time robot learning,” Applied Intelligence, vol. 17, no. 1, pp. 49–60, 2002.
  • [13] C. E. Rasmussen, “Gaussian processes in machine learning,” in Advanced lectures on machine learning.   Springer, 2004, pp. 63–71.
  • [14] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).   Secaucus, NJ, USA: Springer, 2006.
  • [15] K. P. Murphy, “Conjugate Bayesian analysis of the Gaussian distribution,” 2007.
  • [16] M. Roth, On the Multivariate t Distribution.   Linköping University Electronic Press, 2013.
  • [17] M. C. Hughes and E. Sudderth, “Memoized online variational inference for Dirichlet process mixture models,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 1133–1141.
  • [18] L. A. Hannah, D. M. Blei, and W. B. Powell, “Dirichlet process mixtures of generalized linear models,” Journal of Machine Learning Research, vol. 12, no. Jun, pp. 1923–1953, 2011.
  • [19] B. Bahrami, K. Olsen, P. E. Latham, A. Roepstorff, G. Rees, and C. D. Frith, “Optimally interacting minds,” Science, vol. 329, no. 5995, pp. 1081–1085, 2010.
  • [20] G. E. Hinton, “Products of experts,” in Proc. Intl Conf. on Artificial Neural Networks (ICANN), 1999.
  • [21] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in Proc. AAAI Conference on Artificial Intelligence, vol. 8.   Chicago, IL, USA, 2008, pp. 1433–1438.
  • [22] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1071–1079.
  • [23] A. Paraschos, E. Rueckert, J. Peters, and G. Neumann, “Model-free probabilistic movement primitives for physical interaction,” in Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2015, pp. 2860–2866.