I. Introduction
Intelligent and cooperative robots must be capable of adapting to novel tasks in dynamic, unstructured environments. This is a challenging problem to address; it requires a robot to possess a diverse set of skills that may be difficult to hand-specify or pre-program. Learning from demonstration (LfD) has proven an effective tool in approaching such problems [1]. To acquire a desired skill, LfD approaches generally involve learning a skill model from a set of demonstrations provided by a human. The model can then be queried to reproduce the skill in novel reproduction environments with additional skill constraints. Common examples of constraints include new start/goal states, or new obstacle configurations that constrain the set of possible trajectories. LfD techniques generally differ in the manner in which the skill is represented, learned, and reproduced.
Most prior LfD approaches [2, 3, 4, 5] are based on the assumption that demonstrations can be performed in uncluttered, minimally constrained environments. The presence of clutter in the demonstration environments can introduce additional constraints on human demonstrations that are unrelated to the target skill or the underlying human intent. If unaccounted for, this can lead to suboptimal skill models. However, restructuring the world to remove clutter is often impractical, which limits the viability of such approaches.
In this work, we tackle the problem of learning skills from a set of demonstrations, which can be partially or fully influenced by the presence of obstacles (see Fig. 1).
To contend with obstacles during training, we present importance weighted skill learning. Specifically, we adopt and extend the inference-based view of skill reproduction proposed by Rana et al. [6] in Combined Learning from Demonstration And Motion Planning (CLAMP). CLAMP provides a principled approach for generalizing robot skills to novel situations, including avoiding unknown obstacles present in the reproduction environment. When reproducing a desired skill, trajectories are generated to be optimal with respect to the demonstrations while remaining feasible in the given reproduction environment.
We extend CLAMP to utilize demonstrations from cluttered environments through importance weighted skill learning (see Fig. 2), which rates the importance of demonstration trajectories while learning the skill model. We propose an importance weighting function that assigns lower importance to parts of demonstrations that are more likely to be influenced by obstacles. We present batch and incremental versions of our algorithm: batch learning is useful when the set of initial demonstrations is sufficient for learning a reasonable skill model, while incremental learning is useful in scenarios that require refinement of the skill model as new demonstrations in new environments become available.
We validate our approach on a 7-DOF JACO2 manipulator with reaching and placing skills. In all the experiments, we evaluate the approach by providing demonstrations in cluttered environments and then changing the environments for reproduction.
II. Related Work
Many existing approaches to trajectory-based LfD address the problem of avoiding obstacles in the reproduction scenario. Some approaches add obstacle avoidance in the skill reproduction phase as a reactive strategy [7, 8, 9], while others carry out motion planning or trajectory optimization [10, 11, 12, 6]. In all these approaches, the skill model is learned from demonstrations that are not affected by obstacles. Any constraints or costs associated with obstacles are typically present during reproduction only. However, in an obstacle-rich environment, the demonstrations themselves are likely to be influenced by the presence of obstacles, which could have repercussions during skill reproduction.
There have been a few attempts to address the problem of learning skills from demonstrations in cluttered environments. For example, [13, 14] learn a dynamic movement primitive (DMP) as well as a coupling term for obstacle avoidance from demonstrations. These approaches suffer from two major problems. First, since DMPs follow a single demonstration, they fail to learn potentially different ways of executing the skill, thereby limiting robustness in new scenarios. Second, due to the reactive nature of the obstacle avoidance strategy, the reproduced trajectory does not necessarily preserve the shape of the motion in the presence of obstacles. Ghalamzan et al. [15] proposed an approach based on learning a cost functional from human demonstrations. This cost functional depends on two components: the deviation from the mean of the demonstrations, and the distance from obstacles in the environment. Parameters of both components are estimated from human demonstrations. A major drawback of this approach is the assumption that the mean of the demonstrations sufficiently expresses the demonstrated skill. This assumption is invalid for skills that can be executed in multiple ways, which require a more expressive skill model.
Our proposed method is based on learning an underlying stochastic dynamical system from demonstrations. Depending on the part of the state-space the robot lies in, this dynamical system is able to generate different ways of executing a learned skill. We make use of importance weighting to discount the effect of obstacles that are present when the demonstrations are provided. Specifically, the parts of demonstrations in the vicinity of obstacles are penalized to account for their deviation from the desired skill or the human intention.
III. Combined Learning from Demonstration and Motion Planning
We adopt the probabilistic inference view of learning from demonstration previously employed in CLAMP [6].
III-A Skill Reproduction as Probabilistic Inference
Skill reproduction using CLAMP is performed by maximum a posteriori (MAP) inference given a trajectory prior and event likelihoods in the reproduction environment.
Trajectory Prior
The trajectory prior, or the skill model, represents a distribution over robot trajectories. A trajectory $\boldsymbol{\theta} = \{\boldsymbol{\theta}_0, \dots, \boldsymbol{\theta}_N\}$ is defined as a finite collection of $D$-dimensional robot states $\boldsymbol{\theta}_i$ at times $t_i$. The prior is given by a joint Gaussian distribution over the robot states,

$P(\boldsymbol{\theta}) \propto \exp\Big\{ -\frac{1}{2} \|\boldsymbol{\theta} - \boldsymbol{\mu}\|^2_{\mathcal{K}} \Big\}$  (1)

where $\boldsymbol{\mu}$ is the trajectory mean and $\mathcal{K}$ is the trajectory covariance. The prior enforces optimality by penalizing deviation of the trajectory from the prior mean during inference. The trajectory prior is learned from demonstrations.
Event Likelihood
The likelihood function encodes the constraints in the skill reproduction scenario. The constraints are represented as random events $\boldsymbol{e}$ that the optimal trajectory should satisfy, thus enforcing feasibility during inference, i.e., reproduction. These events, for example, may include obstacle avoidance, or a new start/goal state or via-point. The likelihood function is defined as a distribution in the exponential family,

$L(\boldsymbol{\theta}; \boldsymbol{e}) \propto \exp\Big\{ -\frac{1}{2} \|h(\boldsymbol{\theta})\|^2_{\Sigma} \Big\}$  (2)

where $h(\boldsymbol{\theta})$ is a vector-valued cost function with covariance matrix $\Sigma$. The reader is referred to [16, 6] for more details on these likelihood functions.

MAP Inference
The desired optimal and feasible trajectory that reproduces the skill is then given by,

$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}} \; P(\boldsymbol{\theta})\, L(\boldsymbol{\theta}; \boldsymbol{e})$  (3)
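For the special case of a linear-Gaussian event likelihood, the MAP problem in (3) reduces to standard Gaussian conditioning. The following minimal numpy sketch illustrates this; the event model $h(\boldsymbol{\theta}) = G\boldsymbol{\theta} - e$ and all sizes and values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def map_trajectory(mu, K, G, e, Sigma):
    """Condition the trajectory prior N(mu, K) on the linear event G @ theta ~ N(e, Sigma)."""
    S = G @ K @ G.T + Sigma                       # innovation covariance
    return mu + K @ G.T @ np.linalg.solve(S, e - G @ mu)  # posterior (MAP) mean

# Toy example: a prior over 4 correlated scalar states, conditioned on a new start state.
mu = np.zeros(4)
K = np.eye(4) + 0.5 * np.ones((4, 4))             # correlated prior covariance
G = np.array([[1.0, 0.0, 0.0, 0.0]])              # "event": observe the first state
e = np.array([2.0])                               # desired start state
Sigma = 1e-6 * np.eye(1)                          # near-hard constraint
theta_star = map_trajectory(mu, K, G, e, Sigma)   # first entry pulled to ~2.0
```

Because the prior correlates the states, conditioning the start state also shifts the remaining states toward the new start, which is exactly how the prior propagates constraints along the trajectory.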
III-B Trajectory Prior Formulation
CLAMP assumes that robot trajectories for a desired skill are governed by underlying linear stochastic skill dynamics,

$\boldsymbol{\theta}_{i+1} = A_i \boldsymbol{\theta}_i + b_i + w_i, \quad w_i \sim \mathcal{N}(0, Q_i)$  (4)

where $A_i$ and $b_i$ are a time-varying transition matrix and bias term, respectively, and $w_i$ is additive white noise with time-varying covariance $Q_i$. The trajectory prior can be generated by taking the first- and second-order moments of the solution to these dynamics. The Markovian dynamics yield an exactly sparse precision matrix (inverse covariance) [17, 18], inducing structure in the trajectory prior in (1), which enables efficient learning and inference. The problem of learning the trajectory prior is thus equivalent to estimating the underlying stochastic dynamics.

While learning the trajectory prior, CLAMP assumes all available demonstrations are free from external influences and therefore capture the true human intent and skill constraints. In the presence of such influences, this assumption no longer holds and the learned prior is suboptimal.
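Generating the prior from (4) amounts to propagating the first and second moments of the state distribution step by step. A small sketch follows; the constant integrator dynamics and noise values are illustrative stand-ins for the time-varying, learned parameters.

```python
import numpy as np

def rollout_moments(mu0, Sigma0, A, b, Q, steps):
    """Propagate mean and covariance of theta_{i+1} = A theta_i + b + w_i, w_i ~ N(0, Q)."""
    means, covs = [mu0], [Sigma0]
    for _ in range(steps):
        means.append(A @ means[-1] + b)           # first moment
        covs.append(A @ covs[-1] @ A.T + Q)       # second (central) moment
    return means, covs

A = np.array([[1.0, 0.1],                         # toy position-velocity integrator
              [0.0, 1.0]])
b = np.zeros(2)
Q = 0.01 * np.eye(2)                              # process noise covariance
means, covs = rollout_moments(np.array([0.0, 1.0]), 0.0 * np.eye(2), A, b, Q, 10)
```

The per-step marginals `(means[i], covs[i])` are exactly the time-evolving state distributions of the Gaussian trajectory prior; uncertainty grows along the rollout as process noise accumulates.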
IV. Importance Weighted Skill Learning
In this section, we introduce importance weighting when learning the prior, in order to exclude the effects of unwanted influences during demonstrations. We seek to estimate the parameters of the skill dynamics model in (4) from demonstrations. As a preliminary step, let us rewrite (4) as follows,
$y_i = \Theta_i x_i + w_i, \quad w_i \sim \mathcal{N}(0, Q_i)$  (5)

where $y_i = \boldsymbol{\theta}_{i+1}$, $x_i = [\boldsymbol{\theta}_i^\top \; 1]^\top$, and $\Theta_i = [A_i \; b_i]$.
We additionally define an importance weighting function $\gamma(\boldsymbol{\theta}) \in (0, 1]$ over robot states. The importance weighting function should give higher weights to robot states that are less likely to deviate from the skill constraints or the true human intent. While this importance weighting formulation can also be used in other contexts, in this paper we define a specific form of importance weighting to account for the influence of unwanted obstacles in the demonstration environment. The exact form of this environment-dependent obstacle weighting function is presented in Section V.
IV-A Batch Skill Learning
Assume the availability of $M$ trajectory demonstrations, with the $m$-th demonstration defined as $\{\boldsymbol{\theta}_i^m\}_{i=0}^{N}$. For each discrete time interval $t_i$, the inputs $x_i^m$ are collected into a matrix $X_i$ (one row per demonstration) while the corresponding targets $y_i^m$ are collected into a matrix $Y_i$. Furthermore, the matrix $\Gamma_i = \mathrm{diag}\big(\gamma(\boldsymbol{\theta}_i^1), \dots, \gamma(\boldsymbol{\theta}_i^M)\big)$ defines a state-dependent importance weight matrix.
The batch skill learning formulation seeks to find $\Theta_i$ and $Q_i$ which minimize a regularized squared norm over the provided demonstrations,

$\Theta_i^* = \operatorname*{argmin}_{\Theta_i} \; \mathrm{tr}\big(E_i^\top \Gamma_i E_i\big) + \lambda \|\Theta_i\|_F^2$  (6)

where $E_i = Y_i - X_i \Theta_i^\top$ defines the error matrix, and $\lambda$ is a regularization coefficient.

The solution to the batch skill learning problem in (6) is given by the weighted ridge regression estimate,

$\Theta_i^* = Y_i^\top \Gamma_i X_i \big(X_i^\top \Gamma_i X_i + \lambda I\big)^{-1}$  (7)

$Q_i^* = \frac{1}{\mathrm{tr}(\Gamma_i)} \, E_i^{*\top} \Gamma_i E_i^*$  (8)

where $E_i^* = Y_i - X_i \Theta_i^{*\top}$.
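The closed-form solution in (7)–(8) is a weighted ridge regression, which can be sketched in a few lines of numpy; the shapes, synthetic data, and variable names below are illustrative.

```python
import numpy as np

def weighted_ridge(X, Y, gamma, lam):
    """X: (M, D+1) inputs, Y: (M, D) targets, gamma: (M,) importance weights."""
    G = np.diag(gamma)
    d_in = X.shape[1]
    # Theta^T = (X^T G X + lam I)^{-1} X^T G Y
    Theta_T = np.linalg.solve(X.T @ G @ X + lam * np.eye(d_in), X.T @ G @ Y)
    E = Y - X @ Theta_T                           # residuals
    Q = (E.T @ G @ E) / np.trace(G)               # weighted noise covariance estimate
    return Theta_T.T, Q

# Synthetic check: recover known dynamics parameters [A b] from noisy transitions.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])  # inputs [theta; 1]
true_Theta = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.95, 0.1]])
Y = X @ true_Theta.T + 0.01 * rng.normal(size=(20, 2))
Theta, Q = weighted_ridge(X, Y, np.ones(20), lam=1e-6)       # Theta ~ true_Theta
```

Lowering the weight `gamma[m]` of a demonstration shrinks its contribution to both the parameter estimate and the noise covariance, which is exactly how obstacle-influenced transitions are discounted.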
IV-B Incremental Skill Learning
The batch skill learning procedure assumes that there are enough demonstrations available to learn an optimal skill model. However, as more demonstrations are aggregated over time, possibly in different environments, it is desirable to refine the model since more data provides a better estimate of the skill. To achieve this, we propose incremental weighted skill learning.
Our incremental skill learning procedure is based on Bayesian inference. In this formulation, we maintain a joint probability distribution over the unknown skill dynamics parameters. Every time a new demonstration is collected, a posterior over the skill dynamics parameters is calculated,

$p(\Theta_i, Q_i \mid \mathcal{D}_i^{1:m}) \propto p(\mathcal{D}_i^m \mid \Theta_i, Q_i)\, p(\Theta_i, Q_i \mid \mathcal{D}_i^{1:m-1})$  (9)

where $\mathcal{D}_i^m = \{x_i^m, y_i^m\}$ denotes the data extracted from the $m$-th demonstration at time interval $t_i$. At any stage, the mode of the posterior distribution provides an estimate of the unknown parameters.
Skill Dynamics Distribution
The joint probability distribution over the unknown parameters $\Theta_i$ and $Q_i$ is given by

$p(\Theta_i, Q_i) = p(\Theta_i \mid Q_i)\, p(Q_i)$  (10)

where,

$p(\Theta_i \mid Q_i) = \mathcal{MN}(\Theta_i;\, M_i, Q_i, V_i)$  (11)

$p(Q_i) = \mathcal{IW}(Q_i;\, \Psi_i, \nu_i)$  (12)

$\mathcal{MN}(M_i, Q_i, V_i)$ refers to a matrix-normal distribution with matrix-valued mean $M_i$ and covariances $Q_i$ and $V_i$ for the rows and columns, respectively. $\mathcal{IW}(\Psi_i, \nu_i)$ refers to an inverse-Wishart distribution with positive definite scale matrix $\Psi_i$ and $\nu_i$ degrees of freedom. Note that the matrix-normal and inverse-Wishart distributions are generalizations of the normal and inverse-gamma distributions, respectively, to the multivariate case.

Demonstration Likelihood
The likelihood of observing the input-target pair $\{x_i^m, y_i^m\}$ from the $m$-th demonstration under the stochastic dynamics (5) is given by

$p(y_i^m \mid x_i^m, \Theta_i, Q_i) \propto \mathcal{N}\big(y_i^m;\, \Theta_i x_i^m,\, Q_i\big)^{\gamma_i^m}$  (13)

where $\mathcal{N}(\cdot)$ denotes a Gaussian density and $\gamma_i^m = \gamma(\boldsymbol{\theta}_i^m)$. Note that the likelihood is scaled by the weight in order to incorporate the importance weighting.
Skill Dynamics Inference
The skill dynamics parameters after assimilation of $m$ demonstrations are given by the mode of the joint posterior distribution (maximum a posteriori),

$\{\Theta_i^*, Q_i^*\} = \operatorname*{argmax}_{\Theta_i, Q_i} \; p(\Theta_i, Q_i \mid \mathcal{D}_i^{1:m})$  (14)

Due to the properties of matrix-normal and inverse-Wishart distributions, the mode of the joint distribution coincides with the modes of the two conditional distributions [19],

$\Theta_i^* = M_i$  (15)

$Q_i^* = \frac{\Psi_i}{\nu_i + D + 1}$  (16)
Furthermore, the parameters of the conditional distributions are governed by the following update laws upon assimilating the weighted pair $\{x_i^m, y_i^m\}$,

$\Lambda_i' = \Lambda_i + \gamma_i^m x_i^m (x_i^m)^\top$

$M_i' = \big(M_i \Lambda_i + \gamma_i^m y_i^m (x_i^m)^\top\big) (\Lambda_i')^{-1}$

$\Psi_i' = \Psi_i + \gamma_i^m y_i^m (y_i^m)^\top + M_i \Lambda_i M_i^\top - M_i' \Lambda_i' (M_i')^\top$

$\nu_i' = \nu_i + \gamma_i^m$

where $\Lambda_i = V_i^{-1}$ and primed quantities denote the updated parameters.

The incremental learning procedure is initialized with a prior joint distribution $p(\Theta_i, Q_i)$. The Gaussian component of the joint prior is selected to be the ridge regression prior, that is, $M_i = \mathbf{0}$ and $V_i = \lambda^{-1} I$. The inverse-Wishart component is selected to be an uninformed prior, with $\Psi_i = \psi I$ and $\nu_i = \nu_0$, where $\psi$ and $\nu_0$ are positive scalars. Note that smaller values of these scalars make the prior too strict, which restrains the skill model from fitting the data well.
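Assuming the update laws take the standard weighted Bayesian linear regression form of [19] (a reconstruction; the paper's exact parameterization may differ), one incremental update per input-target pair can be sketched as follows. Variable names and the prior values are illustrative.

```python
import numpy as np

class SkillDynamicsPosterior:
    """Matrix-normal / inverse-Wishart posterior over (Theta, Q), updated one pair at a time."""

    def __init__(self, d_in, d_out, lam=1e-3, psi=1.0, nu0=1.0):
        self.M = np.zeros((d_out, d_in))          # matrix-normal mean
        self.Lam = lam * np.eye(d_in)             # column precision Lambda = V^{-1}
        self.Psi = psi * np.eye(d_out)            # inverse-Wishart scale
        self.nu = nu0                             # degrees of freedom
        self.d_out = d_out

    def update(self, x, y, gamma):
        """Assimilate one importance-weighted input/target pair (1-D arrays x, y)."""
        Lam_new = self.Lam + gamma * np.outer(x, x)
        M_new = np.linalg.solve(
            Lam_new.T, (self.M @ self.Lam + gamma * np.outer(y, x)).T).T
        self.Psi = (self.Psi + gamma * np.outer(y, y)
                    + self.M @ self.Lam @ self.M.T - M_new @ Lam_new @ M_new.T)
        self.nu += gamma
        self.M, self.Lam = M_new, Lam_new

    def mode(self):
        """MAP estimate: Theta* = M, Q* = Psi / (nu + D + 1)."""
        return self.M, self.Psi / (self.nu + self.d_out + 1)
```

Because the update telescopes (`M` times `Lam` accumulates the weighted sums), feeding the demonstrations one at a time with uniform weights recovers the batch estimate without storing past data.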
V. Environment-Dependent Importance Weighting Function
In this section, we define the importance weighting function to enable skill learning from demonstrations that may be provided in the presence of obstacles in the environment. The weighting function gives lower importance to the parts of a demonstration that are more likely to be influenced by the presence of an obstacle and therefore deviate from the intent of the human.
We hypothesize that the parts of demonstrations closer to obstacles are influenced by the obstacles and therefore fail to satisfy the skill constraints. Conversely, partial trajectories farther away from obstacles are more likely to satisfy the skill constraints and should be given more importance. For a given state $\boldsymbol{\theta}$, we define the importance weight to be equivalent to the likelihood of staying collision-free [16]. For this likelihood function, we first define a hinge loss function

$c(\boldsymbol{\theta}) = \begin{cases} \epsilon - d(\boldsymbol{\theta}) & \text{if } d(\boldsymbol{\theta}) \leq \epsilon \\ 0 & \text{otherwise} \end{cases}$

where $d(\boldsymbol{\theta})$ is the signed distance from the closest obstacle in an environment and $\epsilon$ specifies the ‘danger area’ around the obstacle. With this hinge loss, we assume that an obstacle affects a state only when it is within the danger area around the obstacle. Outside of this danger area, the obstacle has no influence on the state. The importance weight itself is given by a function in the exponential family,

$\gamma(\boldsymbol{\theta}) = \exp\Big\{ -\frac{c(\boldsymbol{\theta})^2}{2\sigma^2} \Big\}$  (17)

where the parameter $\sigma$ dictates the rate of decay of the importance weight for states within the ‘danger area’. The smaller the value of $\sigma$, the faster the importance weight decays to zero (see Fig. 3).
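The composition of the hinge loss with (17) can be sketched directly; here `signed_dist` stands in for an environment-specific distance-field query, and the exponential with a squared loss and scale $\sigma$ is one consistent choice of the exponential-family form.

```python
import numpy as np

def importance_weight(signed_dist, eps, sigma):
    """Importance weight in (0, 1]: 1 outside the danger area, decaying inside it."""
    c = np.maximum(eps - signed_dist, 0.0)        # hinge loss on signed distance
    return np.exp(-c ** 2 / (2.0 * sigma ** 2))   # exponential-family weight

# A state far from obstacles keeps full importance; one inside the
# danger area (signed distance < eps) is discounted.
far = importance_weight(signed_dist=0.5, eps=0.1, sigma=0.05)
near = importance_weight(signed_dist=0.02, eps=0.1, sigma=0.05)
```

Shrinking `sigma` sharpens the discount: weights inside the danger area fall toward zero faster, while states outside it are always left untouched by construction of the hinge.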
VI. Experiments
We evaluate the performance of our method on two different skills (accompanying video: https://youtu.be/03r8Tblhq7k): 1) the reaching skill, and 2) the placing skill. For both skills, a human provides multiple demonstrations via kinesthetic teaching on a 7-DOF JACO2 manipulator. The end-effector positions are recorded, and the corresponding instantaneous velocities are estimated by fitting a cubic spline to each demonstration and taking its time derivative. Furthermore, the demonstrations are time-aligned using dynamic time warping (DTW). To set up the trajectory prior in (1), we define the robot states as the vector concatenation of instantaneous robot positions and velocities.
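The state construction described above can be sketched as follows. The paper differentiates a fitted cubic spline; this sketch substitutes a simpler finite-difference derivative (`np.gradient`) to stay dependency-free, and the trajectory data are illustrative.

```python
import numpy as np

def build_states(t, positions):
    """t: (N,) timestamps, positions: (N, 3) recorded end-effector positions.

    Returns (N, 6) states: [position; velocity], with velocity from
    finite differences (a stand-in for the paper's cubic-spline derivative).
    """
    velocities = np.gradient(positions, t, axis=0)
    return np.hstack([positions, velocities])

# Toy demonstration: a straight-line motion, so velocity is constant.
t = np.linspace(0.0, 1.0, 11)
positions = np.stack([t, 2.0 * t, np.zeros_like(t)], axis=1)
states = build_states(t, positions)               # velocity columns ~ [1, 2, 0]
```

On real, noisy recordings the spline-based derivative of the paper is preferable, since finite differences amplify sensor noise.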
For the reaching skill, the goal is to reach an object from different locations. Hence, all the demonstrations share the same goal state while the initial state varies. In the absence of any obstacles in the path, a demonstration follows a nearly straight-line path to the goal. In the presence of obstacles in the path, the demonstrations deviate from this desired path in order to avoid collision with the obstacles. Fig. 6 shows the demonstration environment and the corresponding demonstrations.
In order to learn the trajectory prior for this skill, we use importance weighted skill learning, as described in Section IV-A. The demonstrations reaching the target from the uncluttered part of the environment represent the true human intent. Therefore, we expect our trajectory prior to be biased towards these demonstrations. Fig. 9 shows the trajectory distributions (i.e., time-evolving state distributions) encoded in the trajectory priors learned with and without importance weighting. The trajectory distributions are generated by rolling out the stochastic skill dynamics in (5) with an initial state distribution given by a Gaussian over the initial demonstration states. The mean of the trajectory distribution generated with importance weighting deviates less from the intended straight-line path, exhibiting the true underlying skill, as compared to the distribution without importance weighting. To enable this, we empirically selected the parameters of the importance weight function in (17) such that the parts of state-space likely to be under obstacle influence are successfully downplayed while learning the prior. The selected values of $\epsilon$ and $\sigma$ provided a sufficient bounding region around the obstacles in most cases.
Fig. 9: Trajectory priors learned for the reaching skill. The blue line is the mean of the prior, and the blue shaded region shows one standard deviation around the mean.
Fig. 15 shows multiple instances of reproduction for the reaching skill. The skill is reproduced with (3) by conditioning the learned trajectory prior on the likelihood of starting from a desired initial state and the likelihood of staying clear of arbitrarily placed obstacles. We show the trajectories generated from two different initial states in three different environments. When the obstacles are placed at the same location as in the demonstration phase, or are displaced, the trajectories reproduced from the prior without importance weighting take the longer path to the target around the obstacles. This is because the demonstrations on average took a longer path while avoiding obstacles, and the prior shown in Fig. 9(a) forces the reproduced trajectories to exhibit similar behavior. For the same reason, deviant, non-smooth trajectories are also observed when no obstacles are present in the vicinity of the robot in the reproduction environment.
The placing skill involves placing an object at different locations on a table. All the demonstrations start from the same location since the object's initial location is fixed. The end state of each demonstration varies with the target placement location. Initially, there is an obstacle present in the desired path; hence all the demonstrations go above the obstacle and are thereby influenced by it. Fig. 18 (left) plots the human demonstrations provided in this scenario. Since only the influenced demonstrations are available at this stage, the trajectory prior learned from them also encodes the influence of the obstacle, which is undesirable. However, as the environment changes and more demonstrations become available in an uncluttered environment, as shown in Fig. 18 (right), the prior is updated using the incremental weighted learning procedure described in Section IV-B.
Fig. 23 shows the evolution of the prior as demonstrations are assimilated. The prior initially enforces highly constrained motion, causing the trajectories to avoid the obstacle even when it is not present. As more demonstrations are made available in an obstacle-free environment, their high importance weight relative to the influenced demonstrations enables adaptation to the desired underlying motion after just three updates. On the other hand, when importance weighting is not considered in the incremental learning procedure, the trajectory prior still exhibits the obstacle influence even after all the demonstrations are incorporated, as shown in Fig. 26. The utility of the incremental learning procedure is high in such scenarios: keeping all the demonstrations and relearning the prior on arrival of each new demonstration is undesirable, since doing so is both time-consuming and memory-intensive.
VII. Conclusion
We have presented importance weighted skill learning, a novel technique for learning skills from demonstrations in cluttered environments and generalizing them to new scenarios. Our importance weighting function assigns lower weights to parts of demonstrations that lie close to obstacles. We conjecture that demonstrations in close proximity to obstacles are more susceptible to violating the constraints of the skill being learned; hence, those parts should be given less importance during the skill learning stage. Our learning approach is also capable of incrementally updating and refining the skill model to incorporate new demonstrations without relearning the model from scratch. Since our learning method is based on extracting the underlying stochastic skill dynamics, it does not share the disadvantages of approaches that assume a mean trajectory to encode the skill. Furthermore, our reproduction method is capable of generalizing the skill efficiently across various scenarios, as demonstrated in the experiments.
Acknowledgements
This research is supported in part by NSF NRI 1637758, NSF CAREER 1750483, NSF IIS 1637562, and ONR N000141612844.
References
 [1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
 [2] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal, “Dynamical movement primitives: learning attractor models for motor behaviors,” Neural computation, vol. 25, no. 2, pp. 328–373, 2013.
 [3] S. Calinon, F. Guenter, and A. Billard, “On learning, representing, and generalizing a task in a humanoid robot,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 37, no. 2, pp. 286–298, 2007.

 [4] S. M. Khansari-Zadeh and A. Billard, “Learning stable nonlinear dynamical systems with Gaussian mixture models,” IEEE Transactions on Robotics, vol. 27, no. 5, pp. 943–957, 2011.
 [5] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems, 2013, pp. 2616–2624.
 [6] M. A. Rana, M. Mukadam, S. R. Ahmadzadeh, S. Chernova, and B. Boots, “Towards robust skill generalization: Unifying learning from demonstration and motion planning,” in Conference on Robot Learning, 2017, pp. 109–118.
 [7] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in 2009 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2009, pp. 763–768.
 [8] D.-H. Park, H. Hoffmann, P. Pastor, and S. Schaal, “Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields,” in 8th IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2008, pp. 91–98.
 [9] S. M. Khansari-Zadeh and A. Billard, “A dynamical system approach to real-time obstacle avoidance,” Autonomous Robots, vol. 32, no. 4, pp. 433–454, 2012.
 [10] G. Ye and R. Alterovitz, “Demonstration-guided motion planning,” in International Symposium on Robotics Research (ISRR), vol. 5, 2011.
 [11] T. Osa, A. M. G. Esfahani, R. Stolkin, R. Lioutikov, J. Peters, and G. Neumann, “Guiding trajectory optimization by demonstrated distributions,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 819–826, 2017.
 [12] D. Koert, G. Maeda, R. Lioutikov, G. Neumann, and J. Peters, “Demonstration based trajectory optimization for generalizable robot motions,” in 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). IEEE, 2016, pp. 515–522.
 [13] A. Rai, F. Meier, A. Ijspeert, and S. Schaal, “Learning coupling terms for obstacle avoidance,” in 14th IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2014, pp. 512–518.
 [14] A. Gams, M. Denisa, and A. Ude, “Learning of parametric coupling terms for robot-environment interaction,” in IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). IEEE, 2015, pp. 304–309.
 [15] A. Ghalamzan, C. Paxton, G. D. Hager, and L. Bascetta, “An incremental approach to learning generalizable robot tasks from human demonstration,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 5616–5621.
 [16] M. Mukadam, J. Dong, X. Yan, F. Dellaert, and B. Boots, “Continuous-time Gaussian process motion planning via probabilistic inference,” arXiv preprint arXiv:1707.07383, 2017.
 [17] M. Mukadam, X. Yan, and B. Boots, “Gaussian process motion planning,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 9–15.
 [18] T. Barfoot, C. H. Tong, and S. Sarkka, “Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression,” in Proceedings of Robotics: Science and Systems, Berkeley, USA, 2014.

 [19] T. Minka, “Bayesian linear regression,” Citeseer, Tech. Rep., 2000.