Metric-Based Imitation Learning Between Two Dissimilar Anthropomorphic Robotic Arms

The development of autonomous robotic systems that can learn from human demonstrations to imitate a desired behavior - rather than being manually programmed - has huge technological potential. One major challenge in imitation learning is the correspondence problem: how to establish corresponding states and actions between expert and learner, when the embodiments of the agents are different (morphology, dynamics, degrees of freedom, etc.). Many existing approaches in imitation learning circumvent the correspondence problem, for example, kinesthetic teaching or teleoperation, which are performed on the robot. In this work we explicitly address the correspondence problem by introducing a distance measure between dissimilar embodiments. This measure is then used as a loss function for static pose imitation and as a feedback signal within a model-free deep reinforcement learning framework for dynamic movement imitation between two anthropomorphic robotic arms in simulation. We find that the measure is well suited for describing the similarity between embodiments and for learning imitation policies by distance minimization.



I Introduction

Fig. 1: Static pose imitation between dissimilar anthropomorphic robotic arms using a 7-DOF-Panda expert. (a) 4-DOF-Panda learner; (b) 3-DOF-Panda learner. Dissimilar robots are generated by locking DOFs (expert is shown on the left, the learner on the right, locked joints in red).

Approaches to imitation learning in robotics have delivered huge success, ranging from helicopter acrobatics [2], high-speed arm skills [22], haptic control [23, 12], gestures [13], and manipulation [10, 24, 33] to legged locomotion [14, 33]. The machine learning algorithms that make imitation learning possible are well studied and have recently been summarized [30]. Surprisingly, despite all of these impressive successes in the acquisition of new robot motor skills, fundamental research questions in imitation learning of central importance have remained open for decades. Among such core questions is the correspondence problem: how can one agent (the learner or imitator) produce a similar behavior - in some aspect - to behavior it perceives in another agent (the expert or demonstrator), given that the two agents obey different kinematics and dynamics (body morphology, degrees of freedom (DOFs), constraints, joints and actuators, torque limits), i.e., occupy different state spaces [11]?

Existing algorithmic approaches towards imitation learning can be divided into two groups: behavioral cloning (BC) and inverse reinforcement learning (IRL) or inverse optimal control (IOC). Both can be further subdivided into model-based and model-free approaches, depending on whether the system dynamics are available or not [30]. BC and IRL make different assumptions about the correspondence of learner and expert. In BC, a mapping from states to actions is generated from the demonstrations using supervised learning methods. This mapping can then be used by the learner to reproduce similar behavior, provided that the embodiments of expert and learner are alike; otherwise the method will fail due to lack of correspondence.

Successful implementations of model-based BC algorithms have been obtained for a hitting-a-ball task with an underactuated robot [15], playing video games [35] and controlling a UAV through a real forest [36]. Model-free BC algorithms have been implemented for autonomous RC helicopter flight using expert demonstrations [2] and for learning tasks such as tennis swings [21], ball-paddling [22], human-robot collaborative motions in tool-handover tasks [26], autonomous knot-tying using a surgical robot [29], and character animation [32].

In an IRL framework, the learner infers a reward function for a given task from expert demonstrations of the task. The underlying assumption is that the reward function is a parsimonious and portable representation of the task, which can be transferred and generalized to agents with different embodiments. Thus, IRL implicitly resolves the correspondence problem, but it has the disadvantages of being computationally expensive and of requiring reinforcement learning in an inner loop. IRL has been implemented mostly in a model-based approach for tasks such as learning to drive a car in a simulator [3] and path planning [34, 38, 39]. A few model-free IRL algorithms have been proposed and used to learn policies of robot motion [12, 16, 20].

The correspondence problem results in the following question: what action sequence in the learner is required to produce behavior that is similar to the expert, given that learner and expert have different embodiments and given that a measure of similarity is defined? If we denote the state-action pairs of the expert and learner as $(x_t^E, u_t^E)$ and $(x_t^L, u_t^L)$, respectively, with $t = 1, \dots, T$, then we can formulate the correspondence problem in its simplest form as follows: for a given set of demonstrations $\{(x_t^E, u_t^E)\}_{t=1}^{T}$ find actions $u_t^L$, so that $x_t^L$ is similar to $x_t^E$ for all $t$, where similarity (negative loss) is defined as $-d(x_t^E, x_t^L)$ for a distance measure $d$. Note that the states depend on the actions via the system dynamics, thus, it is $x_{t+1}^E = f^E(x_t^E, u_t^E)$ and $x_{t+1}^L = f^L(x_t^L, u_t^L)$, where $f^E$ and $f^L$ describe the system dynamics of expert and learner, respectively. Several levels of complexity can be added to this formulation. (1) Generally, the dynamics for real systems are stochastic, thus, $x_{t+1}^E = f^E(x_t^E, u_t^E, \epsilon_t^E)$ and $x_{t+1}^L = f^L(x_t^L, u_t^L, \epsilon_t^L)$, where $\epsilon_t^E$ and $\epsilon_t^L$ are noise. In this case the distance can be measured between probability distributions of state vectors, $d(p^E, p^L)$, where $p^E$ and $p^L$ denote the probability distribution for expert and learner, respectively. (2) The system dynamics of expert and/or learner are often not known, leading to model-free vs. model-based approaches. (3) The demonstrations are often given in the form $\{x_t^E\}_{t=1}^{T}$, i.e., the actions of the expert are not available. (4) The states of the expert may be only partially observable, for instance, if the environment is observed by cameras. The states must then be inferred from observations only.

In this work we study imitation tasks between two dissimilar anthropomorphic robot arms, which are generated by locking degrees of freedom (DOFs) in the learner. Throughout this work we assume that the dynamics of the learner are not available to the learning agent, thus, we are in a model-free setting. We first introduce our definition of an embodiment state and provide a distance measure to assess the similarity between embodiments. This distance measure is then used to imitate static poses using neural networks (Fig. 1) and as a feedback signal for movement imitation using reinforcement learning.

II Related Work

Metric approaches to the correspondence problem in imitation have been developed in a series of studies [28, 5, 7, 6, 8, 27]. In these studies, the correspondence problem was formulated in state-action space with separate metrics for states and actions. Simple global metrics based on the Hamming norm, Euclidean distance and infinity norm ($p$-norms) were used to measure the similarity between expert and learner. Another approach to the correspondence problem is to explicitly learn the forward dynamics of the learner. The actions of the learner are then adapted to the given demonstrations by using the learned forward dynamics. Within this framework, Englert et al. [15] have used the Kullback-Leibler divergence as a similarity measure to compare trajectory distributions from an expert and a robot learner. Similarly, Grimes et al. [17, 18] have used Gaussian Mixture Models to learn a forward model and infer optimal actions of the learner. The model has been used to transfer human motions to a humanoid robot.

III Methods

III-A Definition of an embodiment

In this study an embodiment consists of a chain of links, which are represented by frames that are attached to each link. Frames are commonly used in robotics to describe the orientation and translation (pose) of rigid bodies with respect to a laboratory frame. Frames are elements of the special Euclidean group $SE(3)$, which is a non-Euclidean manifold and group (Lie group). Frames can be represented as homogeneous matrices defined as

$$T = \begin{pmatrix} R & p \\ 0 & 1 \end{pmatrix}, \tag{1}$$

where $R \in SO(3)$ is a rotation matrix ($R^{\top} R = I$, $\det R = 1$) and $p \in \mathbb{R}^3$ is a column vector, describing the orientation and translation of a frame, respectively, with respect to a reference frame. For simplicity we write $T = (R, p)$. The inverse is then defined as $T^{-1} = (R^{\top}, -R^{\top} p)$. The configuration space of an embodiment consisting of $n$ links with attached frames can be described by an element of the direct product space $SE(3) \times \dots \times SE(3)$ ($n$ copies). The velocity of a frame is described by a twist, which encodes the rotational and translational velocity of the rigid body. A twist is an element of the Lie algebra $se(3)$, which defines a vector space. The velocity of the embodiment consisting of $n$ frames is described by an element of the direct product space $se(3) \times \dots \times se(3)$ ($n$ copies). A twist can be represented by matrices of the form

$$[V_b] = T^{-1} \dot{T} = \begin{pmatrix} [\omega_b] & v_b \\ 0 & 0 \end{pmatrix}, \tag{2}$$

$$[V_s] = \dot{T}\, T^{-1} = \begin{pmatrix} [\omega_s] & v_s \\ 0 & 0 \end{pmatrix}, \tag{3}$$

where (2) defines the body twist and (3) the spatial twist. The notation $[\omega]$ denotes a skew-symmetric $3 \times 3$ matrix composed of the components of the angular velocity $\omega = (\omega_1, \omega_2, \omega_3)^{\top}$, that is

$$[\omega] = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix}. \tag{4}$$

Specifically, and define the angular velocity and translational velocity of the origin with respect to the base frame, respectively, both expressed in coordinates of the body frame. A similar but less intuitive physical interpretation can be given to the spatial twist. For simplicity we write . The joint of the first link is always attached to the origin of the base frame, which serves as a reference frame in which all comparisons will be performed. Each joint rotates around one axis. The forward kinematic map is a map from joint angles to frames , where denotes the number of DOFs. For simplicity we first consider a planar manipulator with DOFs. We assume that all links have cylindrical shape and constant mass density. We attach a frame to each link with its origin at a distance of from joint , and with the -axis that is pointing along the link direction. The transformation from link frame to the base frame can then be described by a product of matrix exponentials




for . The homogeneous matrix describes a constant shift of the frame by along the -axis. The screw is a matrix and describes the rotation axis of the revolute joint . Here denotes a unit vector in the direction of the joint axis , is a vector from the base to any point on the joint axis (both expressed in coordinates with respect to the base frame) and denotes the vector cross product. Common choices for are , which corresponds to attaching frames to the center of mass (COM) of each link (our choice) and , which corresponds to attaching frames to the end of each link. Joint angles and joint velocities determine the twists, thus . The (body) twists follow from (2) and (5) as ()


which can be determined recursively leading to


where the adjoint map for frame and twist is defined as and [31]. Note that from the (body) twists the angular velocity and translational velocity of frame with respect to the base frame can be easily obtained by


For further details using frames and twists we refer to [25]. After these derivations we can define the state of an embodiment as the state of all frames


where we suppressed the arguments and denotes the state of frame . Note that via the forward kinematic map, which is assumed to be known in this work, an embodiment state is fully determined by its joint angles and joint velocities , i.e., . A special case of (11) is obtained when ignoring rotational information. In this case, an embodiment can be described by the position and (translational) velocity of a set of candidate points, defined by the origin of each frame (by setting in (1), (2)). The embodiment state is then described by


where denotes the state vector of candidate point . The definition of an embodiment in terms of frames/twists and candidate points is generic and can be applied to any robot. A disadvantage of using frames is that they are not elements of a vector space, but define a non-Euclidean manifold.
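The planar case described above can be illustrated with a short sketch. It is not the authors' implementation: the function names and the parameter `alpha` (the fraction of the link length at which the frame is attached, with 0.5 for the link's center of mass and 1.0 for the link end) are assumptions introduced here for illustration. Each frame is reduced to its origin and the unit vector of its x-axis, which points along the link.

```python
import math


def planar_link_frames(joint_angles, link_lengths, alpha=0.5):
    """Frames of a planar revolute chain, expressed in the base frame.

    Returns one (origin, x_axis) pair per link. `alpha` is a hypothetical
    parameter: 0.5 puts the frame at the link's center of mass (uniform
    cylindrical link), 1.0 puts it at the end of the link.
    """
    frames = []
    joint_x, joint_y, heading = 0.0, 0.0, 0.0  # first joint sits at the base origin
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle  # absolute link orientation is the sum of joint angles
        ux, uy = math.cos(heading), math.sin(heading)
        origin = (joint_x + alpha * length * ux, joint_y + alpha * length * uy)
        frames.append((origin, (ux, uy)))
        joint_x += length * ux  # position of the next joint
        joint_y += length * uy
    return frames


def candidate_points(joint_angles, link_lengths, alpha=0.5):
    """Special case that ignores rotation: keep only the frame origins."""
    return [origin for origin, _ in planar_link_frames(joint_angles, link_lengths, alpha)]


if __name__ == "__main__":
    for origin, x_axis in planar_link_frames([math.pi / 2, -math.pi / 2], [1.0, 1.0]):
        print(origin, x_axis)
```

Representing a planar frame by an origin and a heading vector keeps the example free of matrix code; in three dimensions the same information would be read from the homogeneous matrix of each link.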

III-B Similarity between embodiments

Fig. 2: Pose imitation task between planar manipulators. (a) Demonstrator pose; (b, c) Learner poses. Learner (b) generates perfect imitation if similarity is defined by candidate points, e.g., end-effector position. However, for a similarity measure based on frames, the two embodiments do not resemble each other. Learner (c) – consisting only of two links – provides better imitation than learner (b) if similarity is measured between frames attached to each link.

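Similarity of embodiments can be assessed by defining a distance measure between frames/twists or candidate points, that is

$$d(s_1, s_2), \qquad s_1 \in \mathcal{S}_1,\; s_2 \in \mathcal{S}_2, \tag{12}$$

where $s_1$ and $s_2$ result from different state spaces. We first consider the distance

$$d(T_1, T_2) = w_p\, d_p(T_1, T_2) + w_R\, d_R(T_1, T_2) \tag{13}$$

between two frames, $T_1 = (R_1, p_1)$ and $T_2 = (R_2, p_2)$. The distance consists of a translational and rotational part, which are weighted with factors $w_p$ and $w_R$. The weights can be either constants or functions of other variables. For the translational part we set the Euclidean distance between the two frame origins

$$d_p(T_1, T_2) = \lVert p_1 - p_2 \rVert_2. \tag{14}$$

There are various ways to define the rotational distance between frames. We choose to take the angle between the unit vectors $\hat{x}_1$ and $\hat{x}_2$ pointing along the $x$-axes of the frames, i.e., into the directions of the links. Thus, we define

$$d_R(T_1, T_2) = \arccos\left(\hat{x}_1 \cdot \hat{x}_2\right), \tag{15}$$

leading to values in the interval $[0, \pi]$. This definition results in numerical problems when performing gradient descent because the derivative of the $\arccos$-function is

$$\frac{d}{du} \arccos(u) = -\frac{1}{\sqrt{1 - u^2}}, \tag{16}$$

which is singular at $u = \pm 1$, i.e., for parallel or antiparallel axes.
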
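To avoid singularities, a modified rotational distance can be defined by shifting the negated cosine of the angle between the unit $x$-axis vectors $\hat{x}_1$ and $\hat{x}_2$ of the two frames into the interval $[0, 2]$ as

$$d_R(T_1, T_2) = 1 - \hat{x}_1 \cdot \hat{x}_2. \tag{17}$$
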
Note that the direction of the -axis of frame with respect to the laboratory frame can be easily extracted as the first column of the rotation matrix . The distance measure introduced in (13), (14) and (17) includes only the static pose of the embodiment and frames might be considered similar, even though they move into different directions. To include also motion information into the distance measure, the twists of the frames need to be taken into consideration. For dynamic motion imitation we therefore augment the distance measure between two states and by including the translational and angular velocity (9) and (10), that is




Note that the distance measure between two states and , can be extended over all state spaces by defining the sum of all mutual distances


In the next section a weighted version of (20) will be introduced by incorporating link correspondences.
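As a minimal sketch of the static frame distance discussed in this section, the following code combines a Euclidean translational part with a rotational part based on the x-axis unit vectors. The function names and the default weights are assumptions, and the shifted rotational distance is taken here to be one minus the cosine of the angle (values between 0 and 2), which is one plausible reading of the modified distance rather than a confirmed formula. The helper `acos_derivative` only illustrates why the angle-based variant is numerically awkward for gradient descent.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def translational_distance(p1, p2):
    """Euclidean distance between two frame origins."""
    return math.dist(p1, p2)


def rotational_distance_angle(x1, x2):
    """Angle between the x-axis unit vectors, in [0, pi]."""
    c = max(-1.0, min(1.0, _dot(x1, x2)))  # guard against rounding outside [-1, 1]
    return math.acos(c)


def rotational_distance_shifted(x1, x2):
    """Assumed singularity-free variant: 1 - cos(angle), in [0, 2]."""
    return 1.0 - _dot(x1, x2)


def frame_distance(frame1, frame2, w_p=1.0, w_r=1.0,
                   rotational=rotational_distance_shifted):
    """Weighted sum of translational and rotational distance.

    A frame is an (origin, x_axis) pair; w_p and w_r are the weights.
    """
    (p1, x1), (p2, x2) = frame1, frame2
    return w_p * translational_distance(p1, p2) + w_r * rotational(x1, x2)


def acos_derivative(c):
    """d/dc arccos(c); its magnitude grows without bound as |c| -> 1."""
    return -1.0 / math.sqrt(1.0 - c * c)


if __name__ == "__main__":
    a = ((0.0, 0.0), (1.0, 0.0))
    b = ((3.0, 4.0), (0.0, 1.0))
    print(frame_distance(a, b))
    print(acos_derivative(0.0), acos_derivative(1.0 - 1e-12))
```

The shifted variant has a bounded gradient everywhere, which is the property the text asks for; the price is that it no longer measures the angle itself.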

III-C Link Correspondences

To measure similarity between links of two embodiments, we first define how correspondence between links of different embodiments can be established. Embodiments may differ in the number and length of links, and thus, a one-to-one assignment between links is often not possible. To establish correspondence between links of different embodiments with possibly different overall size, we first rescale each embodiment by the sum of its link lengths, $L = \sum_i l_i$, resulting in a chain length of $1$. To establish correspondence, we assign a weight $c_{ij}$ for every possible link-pair combination. Thus, for two embodiments 1 and 2 with $n_1$ and $n_2$ links, respectively, link correspondence can be represented by a correspondence matrix $C \in \mathbb{R}^{n_1 \times n_2}$. Irrelevant combinations result in zero or close to zero entries and higher values indicate higher relevance. Each row of the correspondence matrix contains the correspondence weights of one link of embodiment 1 to all links of embodiment 2, where the highest value indicates which link of embodiment 2 is the most relevant. The elements of the correspondence matrix can either be calculated as a function of embodiment states or pre-calculated for a pair of embodiments, independent of their current state.

State-Dependent Assignment. For state-dependent calculations of the correspondence matrix $C$, weights are calculated using the distance measure between frames. The closer the distance, the higher the weight for this pair of frames. To obtain the correspondence matrix $C$, the mutual distance matrix $D$ between all links of the two embodiments is computed using the distance measure in (13). A correspondence matrix $C_{\mathrm{row}}$ can be generated by replacing the smallest element of each row of $D$ with 1 and all other elements with 0, resulting in a binary matrix that assigns to each link of embodiment 1 exactly one link of embodiment 2 with a weight 1. The same operation can be applied for each column of $D$, resulting in $C_{\mathrm{col}}$. Adding $C_{\mathrm{row}}$ to $C_{\mathrm{col}}$ results in a correspondence matrix $C$. A correspondence matrix that only uses the minimum for each row and each column is very selective and ignores the fact that more than one link of the other embodiment may lie at a similar distance and should be taken into consideration. This effect can be mitigated by applying a softmax function to the rows and columns after multiplying the distances with a negative constant factor, so that soft minima instead of maxima are found; the magnitude of the factor adjusts the distinctness of the minimum.
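The state-dependent assignment can be sketched as follows. This is an illustration under assumptions, not the original code: the function names and the sharpness parameter `beta` are hypothetical, and the soft variant applies a softmax to the negated, scaled distances along rows and columns before adding both results.

```python
import math


def binary_correspondence(D):
    """Mark the minimum of every row and every column of D with 1 and add both."""
    n_rows, n_cols = len(D), len(D[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):  # closest link of embodiment 2 for each link of embodiment 1
        j = min(range(n_cols), key=lambda col: D[i][col])
        C[i][j] += 1.0
    for j in range(n_cols):  # closest link of embodiment 1 for each link of embodiment 2
        i = min(range(n_rows), key=lambda row: D[row][j])
        C[i][j] += 1.0
    return C


def softmin(values, beta):
    """Softmax of -beta * values; larger beta gives a more distinct minimum."""
    smallest = min(values)  # shift for numerical stability
    weights = [math.exp(-beta * (v - smallest)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_correspondence(D, beta=10.0):
    """Soft version: row-wise plus column-wise soft minima of D."""
    n_rows, n_cols = len(D), len(D[0])
    rows = [softmin(D[i], beta) for i in range(n_rows)]
    cols = [softmin([D[i][j] for i in range(n_rows)], beta) for j in range(n_cols)]
    return [[rows[i][j] + cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


if __name__ == "__main__":
    # Hypothetical distances between 3 links (rows) and 2 links (columns).
    D = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]]
    print(binary_correspondence(D))
    print(soft_correspondence(D))
```

In the binary variant, entries take the values 0, 1 or 2; a value of 2 marks a pair that is the mutual nearest neighbor in both directions.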

III-D Calculating Distance Between Embodiments

We define the distance between embodiments as the element-wise multiplication of the distance matrix with the correspondence matrix, i.e., $D \odot C$, where $\odot$ denotes the Hadamard product. Only distances between corresponding link pairs remain because non-corresponding pairs are weighted with zero or near-zero values. To obtain a single scalar, the mean of all entries of the resulting matrix is taken. Matrix norms, such as the Frobenius norm, can also be used. For the evaluation of the correspondence matrix and the distance matrix, suitable weights, $w_p$ and $w_R$, need to be chosen. Different settings are possible here. The pseudo-code for calculating the distance measure is shown in Algorithm 1.

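Function distance_measure($q_1$, $q_2$, $w_p$, $w_R$)
       Calculate frames $T^{(1)}$, $T^{(2)}$ from $q_1$, $q_2$ using forward kinematics.
       Calculate the distance matrix $D$ between all frame pairs using $w_p$, $w_R$.
       Either calculate correspondence matrix $C$ from $D$ or use static correspondence matrix $C$.
       return mean($D \odot C$)
Algorithm 1 Calculating the Distance Measure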
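A runnable sketch of Algorithm 1 for planar chains is given below. It is written under several assumptions that are not taken from the original implementation: frames are attached at the link centers, the rotational part uses one minus the cosine of the angle between link directions, the state-dependent correspondence uses hard row and column minima, and the function signature (including the explicit link-length arguments) is hypothetical.

```python
import math


def link_frames(q, lengths, alpha=0.5):
    """(origin, x_axis) per link of a planar chain rescaled to chain length 1."""
    total = sum(lengths)
    frames, jx, jy, heading = [], 0.0, 0.0, 0.0
    for angle, length in zip(q, lengths):
        heading += angle
        ux, uy = math.cos(heading), math.sin(heading)
        l = length / total  # normalize by the sum of link lengths
        frames.append(((jx + alpha * l * ux, jy + alpha * l * uy), (ux, uy)))
        jx, jy = jx + l * ux, jy + l * uy
    return frames


def frame_distance(f1, f2, w_p, w_r):
    (p1, x1), (p2, x2) = f1, f2
    d_p = math.dist(p1, p2)                          # translational part
    d_r = 1.0 - (x1[0] * x2[0] + x1[1] * x2[1])      # assumed rotational part
    return w_p * d_p + w_r * d_r


def row_col_min_correspondence(D):
    n_rows, n_cols = len(D), len(D[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        C[i][min(range(n_cols), key=lambda j: D[i][j])] += 1.0
    for j in range(n_cols):
        C[min(range(n_rows), key=lambda i: D[i][j])][j] += 1.0
    return C


def distance_measure(q1, q2, lengths1, lengths2, w_p=1.0, w_r=1.0, C=None):
    """Mean of the Hadamard product of distance and correspondence matrix."""
    F1, F2 = link_frames(q1, lengths1), link_frames(q2, lengths2)
    D = [[frame_distance(a, b, w_p, w_r) for b in F2] for a in F1]
    if C is None:  # state-dependent correspondence
        C = row_col_min_correspondence(D)
    weighted = [D[i][j] * C[i][j] for i in range(len(F1)) for j in range(len(F2))]
    return sum(weighted) / len(weighted)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # static correspondence for two 2-link arms
    print(distance_measure([0.0, 0.0], [math.pi / 2, 0.0], [1, 1], [1, 1],
                           w_p=0.0, w_r=1.0, C=identity))
```

Passing a precomputed `C` corresponds to the static branch of the algorithm, while leaving it as `None` recomputes the correspondence from the current state.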

IV Results

We studied static and dynamic imitation tasks in simulation between two dissimilar embodiments using the previously derived distance measure. First, we present results of static pose imitation tasks between planar manipulators with different links using gradient descent. Second, we examined whether a neural network can learn the optimal static pose between two planar manipulators and, furthermore, between two Franka Emika Panda robots for a given expert pose. In addition, we investigated how well the learner generalized to poses it had not seen during training. Third, we present results for a dynamic imitation task between two Franka Emika Panda robotic arms. For this purpose, a simulation environment was built in the physics simulator Gazebo. In all simulations, we assumed that the learner has no knowledge of the robot dynamics.

IV-A Static Pose Imitation Task

In a static pose imitation task, the optimal pose of the learner, $q_L^*$, is obtained for a given pose of the expert, $q_E$, by minimization of the distance measure

$$q_L^* = \arg\min_{q_L} d(q_E, q_L). \tag{21}$$

Minimizing the distance function is a nonlinear optimization problem for which generally no analytical solution exists, in particular for embodiments with a large number of links. Using mathematical libraries such as TensorFlow, the gradient of the distance function can be computed and local minima can be found numerically via gradient descent. Instead of trying to solve the optimization problem repeatedly for each input, we can learn a mapping $f: q_E \mapsto q_L$ from joint angles of the expert to joint angles of the learner, where $q_E \in \mathbb{R}^{n_E}$ and $q_L \in \mathbb{R}^{n_L}$. The function $f$ can be approximated by a neural network with weight parameters $\theta$. The distance measure was implemented as a computation graph in TensorFlow [1].
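To make the optimization step concrete, here is a minimal gradient descent sketch for planar chains. It deliberately swaps TensorFlow's automatic differentiation for central finite differences so that it runs with the standard library only, and it uses a rotational-only distance with a fixed correspondence matrix. The step size, iteration count, and the example correspondence matrix are assumptions chosen for illustration.

```python
import math


def headings(q):
    """Absolute link orientations of a planar chain (cumulative joint angles)."""
    total, out = 0.0, []
    for angle in q:
        total += angle
        out.append(total)
    return out


def pose_distance(q_expert, q_learner, C):
    """Mean of C[i][j] * (1 - cos(angle between expert link i and learner link j))."""
    h_e, h_l = headings(q_expert), headings(q_learner)
    terms = [C[i][j] * (1.0 - math.cos(h_e[i] - h_l[j]))
             for i in range(len(h_e)) for j in range(len(h_l))]
    return sum(terms) / len(terms)


def numerical_gradient(f, q, eps=1e-6):
    """Central finite differences; stands in for automatic differentiation."""
    grad = []
    for k in range(len(q)):
        q_plus, q_minus = list(q), list(q)
        q_plus[k] += eps
        q_minus[k] -= eps
        grad.append((f(q_plus) - f(q_minus)) / (2.0 * eps))
    return grad


def imitate_pose(q_expert, C, n_learner, step=0.5, iterations=500):
    """Plain gradient descent on the learner joint angles, starting at zero."""
    q = [0.0] * n_learner

    def loss(q_l):
        return pose_distance(q_expert, q_l, C)

    for _ in range(iterations):
        g = numerical_gradient(loss, q)
        q = [qi - step * gi for qi, gi in zip(q, g)]
    return q, loss(q)


if __name__ == "__main__":
    # Hypothetical static correspondence between a 3-link expert and a 2-link learner.
    C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    print(imitate_pose([0.3, 0.5, -0.2], C, n_learner=2))
```

As in the text, the result is only a local minimum in general; a different starting pose can end in a different learner pose when the distance function has several minima.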


Fig. 3: (a-c) Effects of using different distance weighting factors on pose imitation between two planar manipulators (expert: blue, learner: orange).
(d-f) Distance function between planar manipulators with two links. (d) State-dependent weight matrix with ; (e) State-dependent weight matrix with distance-dependent weighting factors; (f) State-independent weight matrix, considering only rotational distance between corresponding links (). The distance is plotted over all possible joint angles of one embodiment, while the other manipulator remains fixed at .

Comparing Link Correspondences. Before training the neural network, we analyzed the behavior of the distance function for a simple toy model. Fig. 3a-c shows an imitation task between planar manipulators with two links. The distance was measured using a state-dependent correspondence matrix , using varying weight factors . Each pose of the learner was found via gradient descent. The choice of weight factors clearly has a strong influence on the quality of the result. Balancing between translational and rotational weights is challenging. One possibility to overcome this difficulty may be to simply use the translational distance as the translational weight and to redefine the rotational weight by subtracting the translational distance from its maximum value and rescaling from to . This way, translational and rotational distance are in the same value range, i.e.,


The maximum distance results from the fact that both embodiments are normalized to be in a sphere of radius or diameter , which is the maximum Euclidean distance between two points in this sphere. Equation (22) ensures that the sum of and  is always , excluding the transformation factor . The problem with the above approaches is that there exist local minima in the distance function, as can be observed in Fig. 3(d) and Fig. 3(e) for two-link manipulators. Fig. 3(f), on the other hand, shows that when considering only the orientational distance between frames (setting the translational weight to zero) and using a static, precalculated correspondence matrix, only one minimum remains. This approach is less flexible but more robust and resulted in parallel alignment of corresponding links. Therefore, all following experiments were conducted using this distance measure.

Pose Imitation Mapping Using a Neural Network. We next implemented a neural network to map joint angles of the expert to corresponding joint angles of the learner, leading to a more efficient method for static pose imitation than conducting a gradient descent search for each state. To generate this nonlinear map, network parameters


need to be determined, which minimize the distance between the states of the expert and the states of the learner for a given training set , where the map is represented by a neural network and  are the network parameters. The training dataset can be generated randomly because it contains only expert angles that do not need to be labeled. The network structure consists of three hidden layers of size

with LReLU activation functions. The output layer uses the

tanh as activation function, resulting in output values in . These values can then be mapped to angular values in

. Having generated a training dataset and another dataset for validation, the network was trained using a minibatch-based stochastic gradient descent method. After dividing the training set into minibatches, the update step is performed for each of the minibatches. Afterwards, the whole training set is shuffled and the procedure repeated.

Fig. 4: Imitation between two planar manipulators using a neural network. (a-b) Expert: 7-DOF, learner: 4-DOF; (c-d) Expert: 4-DOF, learner: 7-DOF.

We trained a neural network to find a mapping from joint angles of a 7-DOF expert manipulator to angles of a 4-DOF learner manipulator and another network for the same pair of manipulators but with switched expert/learner roles. Each time we used a training set of 1024 expert demonstrations, dividing it into 32 minibatches in each episode. Fig. 4 shows the learned poses of the trained network for given expert angles that were not included in the training set.
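The data handling of the training procedure described above can be sketched without a deep learning library. The snippet only shows the reshuffling into minibatches (1024 unlabeled expert poses split into 32 minibatches, as reported) and the rescaling of a tanh output to a joint-angle range. The range of minus pi to pi, the function names, and the fixed random seed are assumptions; the actual parameter update is left as a placeholder.

```python
import math
import random


def scale_tanh_output(y, low=-math.pi, high=math.pi):
    """Map a tanh output in [-1, 1] linearly to the angle range [low, high]."""
    return low + 0.5 * (y + 1.0) * (high - low)


def minibatches(dataset, batch_size, rng):
    """Shuffle the whole training set, then yield consecutive minibatches."""
    order = list(range(len(dataset)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


if __name__ == "__main__":
    rng = random.Random(0)
    # Unlabeled training set: random joint angles of a 7-DOF expert.
    expert_angles = [[rng.uniform(-math.pi, math.pi) for _ in range(7)]
                     for _ in range(1024)]
    for epoch in range(2):
        n_updates = 0
        for batch in minibatches(expert_angles, 32, rng):
            # Placeholder: one gradient step on the mean distance between each
            # expert pose in `batch` and the network's predicted learner pose.
            n_updates += 1
        print("epoch", epoch, "updates", n_updates)
```

Because the loss is the distance measure itself, no target joint angles are needed, which is why the training set can consist of randomly drawn expert poses.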

Pose Imitation Between Three-Dimensional Embodiments. We next applied the method from the previous section to two Franka Emika Panda robotic arms in simulation. Dissimilar embodiments were generated by locking individual DOFs of the learner at a fixed value. For example, to simulate a four-link Panda robot with four DOFs, joints 3, 6, and 7 were locked. To simulate a three-link Panda robot, joint 2 was additionally locked (see Fig. 1). We first studied static pose imitation between identical embodiments with all 7 DOFs enabled (Video 1). Fig. 1(a) shows an example of how the trained network solves the task between dissimilar embodiments (expert: 7 DOFs, learner: 4 DOFs). The locked joints 6 and 7 (indicated in red) lie at the very end of the embodiment and therefore do not contribute much to the overall configuration of the embodiment, in contrast to the locked joint 3 at the center of the embodiment. In Fig. 1(b), the learner only has 3 DOFs. The learner tries to establish similarity by rotating joint 4, as the second joint is locked. The results can be seen in Video 2.

The experiments have shown that training a neural network with a distance-based loss function worked reasonably well for static pose imitation. Local minima in the distance function and over-fitting on the training set posed some problems. While the former is more difficult to solve, the latter can be solved by increasing the size of the training set and stopping the training process at a suitable time. Another possibility to improve the network’s performance may lie in the structure and training of the network. We used the same structure for all pose imitation tasks without employing techniques that decrease the probability of over-fitting, such as dropout or regularization [4].

IV-B Dynamic Motion Imitation

In this section, we apply a reinforcement learning algorithm for motion imitation by using the online distance measure between embodiments as a feedback signal. Reinforcement learning has shown great success in learning motor tasks, for example, [19, 9]. We study the transfer of motions from one Panda robot with maximally seven DOFs to another one in simulation. As before, different embodiments are generated by locking DOFs in the learner. We assume that the dynamics of the robots are unavailable (model-free) and that the learner is controlled by joint torques. Consequently, the agent needs to control the joint positions but also, implicitly, learn the robot dynamics.

Simulation Environment.

The manufacturer (Franka Emika) of the Panda robot provided a good integration of the robot into the ROS ecosystem, which we augmented by the Gazebo physics simulator. Unfortunately, no exact inertia values for the Panda were provided, which were needed to simulate the dynamics. The CoR-Lab group from the Universität Bielefeld published some estimates of inertia values on their GitHub repository. We used these estimates and manually adjusted them by using the guide given in the Gazebo manual. The simulated robot is controlled via joint torque commands. To create trajectories of the expert, simple PID-controllers were configured for each joint. Due to the lack of sophisticated controllers and to facilitate the task for the RL agent, gravity was turned off in the simulation.

RL Environment and Agent. The next task consisted in the implementation of the reinforcement learning agent and the interface for interaction with the simulation. The state space consisted of the expert and learner states, thus, the environment's state is defined by the tuple $(x^E, x^L)$. The actions are given by the torque commands of the learner, $\tau^L$, subject to the torque limits for each corresponding joint. The control of the expert is not observable by the RL agent. For training and testing, multiple random trajectories of similar duration were recorded. The step size was set to $\Delta t = 0.1$ s and a step consisted of the following transitions: The agent executed an action in the environment by sending a torque command to the learner. The torques were then applied for the duration of the simulation time $\Delta t$. The simulation then paused and returned the next observed state together with the reward $r$, which was calculated from the state as the negative distance measure between the embodiment states. To train the agent, we used Proximal Policy Optimization (PPO) [37], which is a state-of-the-art actor-critic DRL algorithm. One big advantage of PPO is its robustness towards hyperparameter tuning. We employed the GPU-optimized version (PPO2) of the Stable-Baselines repository, based on OpenAI's implementations.
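The interaction loop described above can be sketched as a small environment class. Everything in it is a stand-in: the class name, its interface, and the toy dynamics (unit-inertia joints without gravity, integrated with explicit Euler) are assumptions that replace the Gazebo simulation, and no PPO agent is included. The sketch only shows the order of operations in one step: clip the torque command, advance the simulation by the step size, then return the observation together with the negative distance as reward.

```python
class ImitationEnv:
    """Toy stand-in for the simulation interface (hypothetical, not Gazebo)."""

    def __init__(self, expert_trajectory, distance_fn, torque_limits, dt=0.1):
        self.expert_trajectory = expert_trajectory  # list of (q, dq) expert states
        self.distance_fn = distance_fn              # maps (expert, learner) to a distance
        self.torque_limits = torque_limits
        self.dt = dt
        self.reset()

    def reset(self):
        n = len(self.torque_limits)
        self.t = 0
        self.q = [0.0] * n    # learner starts in its zero pose
        self.dq = [0.0] * n
        return self._observation()

    def _observation(self):
        learner_state = (tuple(self.q), tuple(self.dq))
        return self.expert_trajectory[self.t], learner_state

    def step(self, torques):
        # Enforce the per-joint torque limits.
        clipped = [max(-lim, min(lim, tau))
                   for tau, lim in zip(torques, self.torque_limits)]
        # Toy dynamics: unit inertia, no gravity, explicit Euler over one step.
        self.dq = [dq + tau * self.dt for dq, tau in zip(self.dq, clipped)]
        self.q = [q + dq * self.dt for q, dq in zip(self.q, self.dq)]
        self.t += 1
        expert_state, learner_state = self._observation()
        reward = -self.distance_fn(expert_state, learner_state)
        done = self.t >= len(self.expert_trajectory) - 1
        return (expert_state, learner_state), reward, done


if __name__ == "__main__":
    # A 5 s expert trajectory sampled every 0.1 s: 51 states, 50 steps.
    trajectory = [((0.0,), (0.0,))] * 51
    env = ImitationEnv(trajectory, lambda e, l: abs(e[0][0] - l[0][0]),
                       torque_limits=[2.0])
    print(env.step([100.0]))
```

Keeping the distance function as a constructor argument makes it easy to exchange a toy distance, as used here, for the frame-based measure.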

Simulation 1: Imitation of a Single Trajectory. We first tested whether motion imitation using reinforcement learning is feasible by transferring a single trajectory between two Panda robots with all 7 DOFs activated, i.e., expert and learner had identical embodiments. Both expert and learner started in their zero pose, in which all joint angles and joint velocities are set to zero. The trajectory of the expert was recorded off-line by moving each joint to an arbitrarily chosen goal position. The environment was reset whenever the trajectory ended. Each trajectory had a total duration of 5 s, which leads to 50 steps per episode. The discount factor was set to implying that the agent acted myopically. This value was chosen because high values led to very slow training progress. The weights for the frame-distance function were set to in all simulation experiments. Training time was about hours on a desktop computer using GPU-accelerated computations. The simulation showed that the learned trajectory resembles the expert's trajectory very closely and that imitation of motions is possible using a reinforcement learning framework with a distance-related reward function. The results can be seen in Video 3.

Simulation 2: Generalization Between Trajectories.

Fig. 5: Generalization capabilities of two different agents. Shown is the distance measure for two 7-DOF-Panda learners while imitating the same, previously unseen trajectory. One learner ("Specialized") was trained on a single trajectory, whereas the other ("Generalized") was trained on 124 different trajectories.

We next examined whether the agent can imitate trajectories it had not seen before. For this purpose, we trained the agent with different amounts of training data (one vs. 124 trajectories). Trajectories were recorded as before, but each time with different final poses. The trajectory of the agent trained on a single trajectory barely resembles the trajectory of the expert. The imitation of the agent trained on the larger data set is not perfect, but resembles the trajectories from the expert more closely (see Video 4). This improvement shows in a significantly smaller distance between expert and learner along the trajectory (see Fig. 5).

Simulation 3: Imitation Between Dissimilar Embodiments. In the next experiment, we studied motion transfer between dissimilar Panda robots. Towards this goal, we trained the agent again on the same 124 trajectories, but this time the learner had only 4 or 3 DOFs, respectively. As the learning robot was restricted in its DOFs, some trajectories could not be imitated well, resulting in higher values of the distance measure. We found that the restricted learner moved its links in similar directions as the expert, but the restrictions prevented a more similar imitation. Examples can be seen in Video 5.

V Conclusions

The main contributions of this work are threefold. First, we have introduced a definition of embodiment states in terms of frames/twists and candidate points. Second, we have provided a distance measure between dissimilar embodiments using correspondences between frames of expert and learner. Third, we have applied this distance measure to static pose and movement imitation tasks between manipulators. All tasks were performed in simulation. In all experiments, the agent was able to learn the imitation task, even though no dynamic model was provided to the learner. The framework that we have developed is generic and flexible and not limited to our choice of parameters, distance measures, and type of robots. Depending on how the correspondence matrix is calculated, the topology of the embodiments is not crucial. Possibly even free topologies, such as swarms of flying objects, could be compared and brought into similarity.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. External Links: ISBN 978-1-931971-33-1 Cited by: §IV-A.
  • [2] P. Abbeel, A. Coates, and A. Y. Ng (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639. Cited by: §I, §I.
  • [3] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, Cited by: §I.
  • [4] C. C. Aggarwal (2018) Neural networks and deep learning. Vol. 10, Springer. Cited by: §IV-A.
  • [5] A. Alissandrakis, C. Nehaniv, and K. Dautenhahn (2002-07) Imitating with alice: learning to imitate corresponding actions across dissimilar embodiments. IEEE Transactions on Systems, Man, & Cybernetics, Part A: Systems and Humans 32, pp. 482–496. External Links: Document Cited by: §II.
  • [6] A. Alissandrakis, C. L. Nehaniv, K. Dautenhahn, and J. Saunders (2005) Achieving corresponding effects on multiple robotic platforms: imitating in context using different effect metrics. In Proceedings of the Third International Symposium on Imitation in Animals and Artifacts, Cited by: §II.
  • [7] A. Alissandrakis, C. L. Nehaniv, and K. Dautenhahn (2002) Do as I do: correspondences across different robotic embodiments. In Procs. 5th German Workshop on Artificial Life. Lubeck, Cited by: §II.
  • [8] A. Alissandrakis, C. L. Nehaniv, and K. Dautenhahn (2007) Correspondence mapping induced state and action metrics for robotic imitation. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37, pp. 299–307. Cited by: §II.
  • [9] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §IV-B.
  • [10] T. Asfour, P. Azad, F. Gyarfas, and R. Dillmann (2008) Imitation learning of dual-arm manipulation tasks in humanoid robots. International Journal of Humanoid Robotics 5 (02), pp. 183–202. Cited by: §I.
  • [11] A. Billard, S. Calinon, and R. Dillmann (2016-01) Learning from humans. Springer Handbook of Robotics, pp. 1995–2014. External Links: Document Cited by: §I.
  • [12] A. Boularias, J. Kober, and J. Peters (2011) Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189. Cited by: §I, §I.
  • [13] S. Calinon, F. D’halluin, E. L. Sauser, D. G. Caldwell, and A. G. Billard (2010) Learning and reproduction of gestures by imitation. IEEE Robotics & Automation Magazine 17 (2), pp. 44–54. Cited by: §I.
  • [14] R. Chalodhorn, D. B. Grimes, K. Grochow, and R. P. Rao (2010) Learning to walk by imitation in low-dimensional subspaces. Advanced Robotics 24 (1-2), pp. 207–232. Cited by: §I.
  • [15] P. Englert, A. Paraschos, M. P. Deisenroth, and J. Peters (2013) Probabilistic model-based imitation learning. Adaptive Behavior 21 (5), pp. 388–403. Cited by: §I, §II.
  • [16] C. Finn, S. Levine, and P. Abbeel (2016-03) Guided cost learning: deep inverse optimal control via policy optimization. Proceedings of the 33rd International Conference on Machine Learning. Cited by: §I.
  • [17] D. B. Grimes, R. Chalodhorn, and R. P. Rao (2006) Dynamic imitation in a humanoid robot through nonparametric probabilistic inference. In Robotics: science and systems, pp. 199–206. Cited by: §II.
  • [18] D. B. Grimes and R. P. Rao (2009) Learning actions through imitation and exploration: towards humanoid robots that learn from humans. In Creating Brain-Like Intelligence, pp. 103–138. Cited by: §II.
  • [19] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, et al. (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §IV-B.
  • [20] J. Ho, J. Gupta, and S. Ermon (2016) Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769. Cited by: §I.
  • [21] A. J. Ijspeert, J. Nakanishi, and S. Schaal (2002) Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 2, pp. 1398–1403. Cited by: §I.
  • [22] J. Kober and J. Peters (2010) Imitation and reinforcement learning. IEEE Robotics & Automation Magazine 17 (2), pp. 55–62. Cited by: §I, §I.
  • [23] P. Kormushev, S. Calinon, and D. G. Caldwell (2011) Imitation learning of positional and force skills demonstrated via kinesthetic teaching and haptic input. Advanced Robotics 25 (5), pp. 581–603. Cited by: §I.
  • [24] M. Lopes, F. S. Melo, and L. Montesano (2007) Affordance-based imitation learning in robots. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1015–1021. Cited by: §I.
  • [25] K. M. Lynch and F. C. Park (2017) Modern robotics. Cambridge University Press. Cited by: §III-A.
  • [26] G. Maeda, M. Ewerton, G. Neumann, R. Lioutikov, and J. Peters (2017) Phase estimation for fast action recognition and trajectory generation in human–robot collaboration. The International Journal of Robotics Research 36 (13-14), pp. 1579–1594. Cited by: §I.
  • [27] C. L. Nehaniv and K. E. Dautenhahn (2007) Imitation and social learning in robots, humans and animals: behavioural, social and communicative dimensions. Cambridge University Press. Cited by: §II.
  • [28] C. L. Nehaniv and K. Dautenhahn (2001) Like me?- Measures of correspondence and imitation. Cybernetics and Systems 32, pp. 11–51. Cited by: §II.
  • [29] T. Osa, K. Harada, N. Sugita, and M. Mitsuishi (2014) Trajectory planning under different initial conditions for surgical task automation by learning from demonstration. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 6507–6513. Cited by: §I.
  • [30] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. Cited by: §I, §I.
  • [31] F.C. Park, J.E. Bobrow, and S.R. Ploen (1995) A lie group formulation of robot dynamics. The International Journal of Robotics Research 14 (6), pp. 609–618. Cited by: §III-A.
  • [32] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne (2018) Deepmimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–14. Cited by: §I.
  • [33] N. Ratliff, J. A. Bagnell, and S. S. Srinivasa (2007) Imitation learning for locomotion and manipulation. In 2007 7th IEEE-RAS International Conference on Humanoid Robots, pp. 392–397. Cited by: §I.
  • [34] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich (2006) Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. Cited by: §I.
  • [35] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §I.
  • [36] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert (2013) Learning monocular reactive uav control in cluttered natural environments. 2013 IEEE International Conference on Robotics and Automation. Cited by: §I.
  • [37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-B.
  • [38] D. Silver, J. Bagnell, and A. Stentz (2010-10) Learning from demonstration for autonomous navigation in complex unstructured terrain. I. J. Robotic Res. 29, pp. 1565–1592. External Links: Document Cited by: §I.
  • [39] B. Ziebart, A. Maas, J. Bagnell, and A. Dey (2008) Maximum entropy inverse reinforcement learning. pp. 1433–1438. Cited by: §I.