I Introduction
Approaches to imitation learning in robotics have delivered notable successes, ranging from helicopter acrobatics [2], high-speed arm skills [22], haptic control [23, 12], gestures [13], and manipulation [10, 24, 33] to legged locomotion [14, 33]. The machine learning algorithms that make imitation learning possible are well studied and have recently been summarized [30]. Surprisingly, despite these successes in the acquisition of new robot motor skills, fundamental research questions of central importance in imitation learning have remained open for decades. Among these core questions is the correspondence problem: how can one agent (the learner or imitator) produce behavior that is similar, in some aspect, to behavior it perceives in another agent (the expert or demonstrator), given that the two agents obey different kinematics and dynamics (body morphology, degrees of freedom (DOFs), constraints, joints and actuators, torque limits), i.e., occupy different state spaces [11]?
Existing algorithmic approaches to imitation learning can be divided into two groups: behavioral cloning (BC) and inverse reinforcement learning (IRL) or inverse optimal control (IOC). Both can be further subdivided into model-based and model-free approaches, depending on whether the system dynamics are available or not [30]. BC and IRL make different assumptions about the correspondence of learner and expert. In BC, a mapping from states to actions is generated from the demonstrations using supervised learning methods. The learner can then use this mapping to reproduce similar behavior, provided that the embodiments of expert and learner are alike; otherwise the method fails due to lack of correspondence.
Successful implementations of model-based BC algorithms have been obtained for a hitting-a-ball task with an underactuated robot [15], playing video games [35] and controlling a UAV through a real forest [36]. Model-free BC algorithms have been implemented for autonomous RC helicopter flight using expert demonstrations [2] and for learning tasks such as tennis swings [21], ball-paddling [22], human-robot collaborative motions in tool-handover tasks [26], autonomous knot-tying using a surgical robot [29] and character animation [32].
In an IRL framework, the learner infers a reward function for a given task from expert demonstrations of the task. The underlying assumption is that the reward function is a parsimonious and portable representation of the task, which can be transferred and generalized to agents with different embodiments. Thus, IRL implicitly resolves the correspondence problem, but it has the disadvantage of being computationally expensive because it requires reinforcement learning in an inner loop. IRL has mostly been implemented in a model-based approach for tasks such as learning to drive a car in a simulator [3] and path planning [34, 38, 39]. A few model-free IRL algorithms have been proposed and used to learn policies of robot motion [12, 16, 20].
The correspondence problem results in the following question: what action sequence in the learner is required to produce behavior that is similar to the expert, given that learner and expert have different embodiments and given that a measure of similarity is defined? If we denote the state-action pairs of the expert and learner as $(x^E_t, u^E_t)$ and $(x^L_t, u^L_t)$, respectively, with $t = 1, \dots, T$, then we can formulate the correspondence problem in its simplest form as follows: for a given set of demonstrations $\mathcal{D} = \{(x^E_t, u^E_t)\}_{t=1}^{T}$ find actions $u^L_t$, so that $x^L_t$ is similar to $x^E_t$ for all $t$, where similarity (negative loss) is defined as $-d(x^E_t, x^L_t)$. Note that the states depend on the actions via the system dynamics, thus, $x^E_{t+1} = f^E(x^E_t, u^E_t)$ and $x^L_{t+1} = f^L(x^L_t, u^L_t)$, where $f^E$ and $f^L$ describe the system dynamics of expert and learner, respectively. Several levels of complexity can be added to this formulation. (1) Generally, the dynamics of real systems are stochastic, thus, $x^E_{t+1} = f^E(x^E_t, u^E_t, \epsilon^E_t)$ and $x^L_{t+1} = f^L(x^L_t, u^L_t, \epsilon^L_t)$, where $\epsilon^E_t$ and $\epsilon^L_t$ are noise. In this case the distance can be measured between probability distributions of state vectors, $d(p^E, p^L)$, where $p^E$ and $p^L$ denote the probability distributions for expert and learner, respectively. (2) The system dynamics of expert and/or learner are often not known, leading to model-free vs. model-based approaches. (3) The demonstrations are often given in the form $\mathcal{D} = \{x^E_t\}_{t=1}^{T}$, i.e., the actions of the expert are not available. (4) The states of the expert may be only partially observable, for instance, if the environment is observed by cameras. The states must then be inferred from observations only.
In this work we study imitation tasks between two dissimilar anthropomorphic robot arms, which are generated by locking degrees of freedom (DOFs) in the learner. Throughout this work we assume that the dynamics of the learner are not available to the learning agent; thus, we are in a model-free setting. We first introduce our definition of an embodiment state and provide a distance measure to assess the similarity between embodiments. This distance measure is then used to imitate static poses using neural networks (Fig. 1) and as a feedback signal for movement imitation using reinforcement learning.
II Related Work
Metric approaches to the correspondence problem in imitation have been developed in a series of studies [28, 5, 7, 6, 8, 27]. In these studies, the correspondence problem was formulated in state-action space with separate metrics for states and actions. Simple global metrics based on the Hamming norm, Euclidean distance and infinity norm were used to measure the similarity between expert and learner. Another approach to the correspondence problem is to explicitly learn the forward dynamics of the learner. The actions of the learner are then adapted to the given demonstrations by using the learned forward dynamics. Within this framework, Englert et al. [15] have used the Kullback-Leibler divergence as a similarity measure to compare trajectory distributions from an expert and a robot learner. Similarly, Grimes et al. [17, 18] have used Gaussian Mixture Models to learn a forward model and infer optimal actions of the learner. The model has been used to transfer human motions to a humanoid robot.
III Methods
III-A Definition of an embodiment
In this study an embodiment consists of a chain of links, which are represented by frames that are attached to each link. Frames are commonly used in robotics to describe the orientation and translation (pose) of rigid bodies with respect to a laboratory frame. Frames are elements of the special Euclidean group $SE(3)$, which is a non-Euclidean manifold and group (Lie group). Frames can be represented as homogeneous matrices defined as
$$T = \begin{pmatrix} R & p \\ 0 & 1 \end{pmatrix} \tag{1}$$
where $R$ is a rotation matrix ($R \in SO(3)$) and $p \in \mathbb{R}^3$ is a column vector, describing the orientation and translation of a frame, respectively, with respect to a reference frame. For simplicity we write $T = (R, p)$. The inverse is then defined as $T^{-1} = (R^\top, -R^\top p)$. The configuration space of an embodiment consisting of $N$ links with attached frames can be described by an element of the direct product space $SE(3) \times \dots \times SE(3)$ ($N$ copies). The velocity of a frame is described by a twist, which encodes the rotational and translational velocity of the rigid body. A twist is an element of the Lie algebra $se(3)$, which defines a vector space. The velocity of the embodiment consisting of $N$ frames is described by an element of the direct product space $se(3) \times \dots \times se(3)$ ($N$ copies). A twist can be represented by matrices of the form
$$[\mathcal{V}_b] = T^{-1}\dot{T} = \begin{pmatrix} [\omega_b] & v_b \\ 0 & 0 \end{pmatrix} \tag{2}$$
or
$$[\mathcal{V}_s] = \dot{T}\,T^{-1} = \begin{pmatrix} [\omega_s] & v_s \\ 0 & 0 \end{pmatrix} \tag{3}$$
where (2) defines the body twist and (3) the spatial twist. The notation $[\omega]$ denotes a skew-symmetric $3 \times 3$ matrix composed of the components of the angular velocity $\omega = (\omega_1, \omega_2, \omega_3)^\top$, that is
$$[\omega] = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix} \tag{4}$$
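The frame and twist definitions above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`make_frame`, `inverse_frame`, `hat`, `body_twist`) are hypothetical, and the body twist is assumed to be the standard $T^{-1}\dot{T}$.

```python
import numpy as np

def make_frame(R, p):
    """Build a 4x4 homogeneous matrix T = (R, p)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def inverse_frame(T):
    """Closed-form inverse T^{-1} = (R^T, -R^T p)."""
    R, p = T[:3, :3], T[:3, 3]
    return make_frame(R.T, -R.T @ p)

def hat(w):
    """Skew-symmetric matrix [w], so that hat(w) @ u == np.cross(w, u)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def body_twist(T, T_dot):
    """Body twist (omega_b, v_b) read off from T^{-1} T_dot (assumed convention)."""
    V = inverse_frame(T) @ T_dot
    omega_b = np.array([V[2, 1], V[0, 2], V[1, 0]])
    v_b = V[:3, 3]
    return omega_b, v_b

def rot_z(a):
    """Rotation by angle a about the z axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

For a frame that spins about the $z$ axis at a constant rate while its origin stays fixed, `body_twist` returns that rate as the third component of `omega_b` and a zero translational velocity.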
Specifically, $\omega_b$ and $v_b$ define the angular velocity and the translational velocity of the frame origin with respect to the base frame, respectively, both expressed in coordinates of the body frame. A similar but less intuitive physical interpretation can be given to the spatial twist. For simplicity we write $\mathcal{V} = (\omega, v)$. The joint of the first link is always attached to the origin of the base frame, which serves as the reference frame in which all comparisons are performed. Each joint rotates around one axis. The forward kinematic map is a map from joint angles $\theta \in \mathbb{R}^n$ to frames $T_1, \dots, T_n$, where $n$ denotes the number of DOFs. For simplicity we first consider a planar manipulator with $n$ DOFs. We assume that all links have cylindrical shape and constant mass density. We attach a frame to each link $i$ with its origin at a distance of $d_i$ from joint $i$, and with the $x$ axis pointing along the link direction. The transformation from link frame $i$ to the base frame can then be described by a product of matrix exponentials
$$T_i(\theta) = e^{[S_1]\theta_1} \cdots e^{[S_i]\theta_i} M_i \tag{5}$$
where
$$[S_i] = \begin{pmatrix} [\omega_i] & -\omega_i \times q_i \\ 0 & 0 \end{pmatrix} \tag{6}$$
for $i = 1, \dots, n$. The homogeneous matrix $M_i$ describes a constant shift of the frame along the $x$ axis. The screw $[S_i]$ is a $4 \times 4$ matrix and describes the rotation axis of the revolute joint $i$. Here $\omega_i$ denotes a unit vector in the direction of the joint axis $i$, $q_i$ is a vector from the base to any point on the joint axis (both expressed in coordinates with respect to the base frame) and $\times$ denotes the vector cross product. Common choices for $d_i$ are half the link length, which corresponds to attaching frames to the center of mass (COM) of each link (our choice), and the full link length, which corresponds to attaching frames to the end of each link. Joint angles and joint velocities determine the twists, thus $\mathcal{V}_i = \mathcal{V}_i(\theta, \dot{\theta})$. The (body) twists follow from (2) and (5) as ($i = 1, \dots, n$)
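$$[\mathcal{V}_i] = T_i^{-1}\dot{T}_i \tag{7}$$
which can be determined recursively leading to
$$[\mathcal{V}_i] = \mathrm{Ad}_{T_{i-1,i}^{-1}}\big([\mathcal{V}_{i-1}]\big) + \mathrm{Ad}_{M_i^{-1}}\big([S_i]\big)\,\dot{\theta}_i \tag{8}$$
with $T_{i-1,i} = T_{i-1}^{-1} T_i$, $T_0 = I$ and $\mathcal{V}_0 = 0$, where the adjoint map for frame $T$ and twist $\mathcal{V}$ is defined as $\mathrm{Ad}_T([\mathcal{V}]) = T\,[\mathcal{V}]\,T^{-1}$ [31]. Note that from the (body) twists the angular velocity and translational velocity of frame $i$ with respect to the base frame can be easily obtained by
$$\omega_i = R_i\,\omega_{b,i} \tag{9}$$
$$v_i = R_i\,v_{b,i} \tag{10}$$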
For further details on frames and twists we refer to [25]. With these derivations we can define the state of an embodiment as the state of all frames
$$X = (x_1, \dots, x_N), \qquad x_i = (T_i, \mathcal{V}_i) \tag{11}$$
where we suppressed the arguments and $x_i$ denotes the state of frame $i$. Note that via the forward kinematic map, which is assumed to be known in this work, an embodiment state is fully determined by its joint angles $\theta$ and joint velocities $\dot{\theta}$, i.e., $X = X(\theta, \dot{\theta})$. A special case of (11) is obtained when ignoring rotational information. In this case, an embodiment can be described by the position and (translational) velocity of a set of candidate points, defined by the origin of each frame (by setting $R = I$ and $\omega = 0$ in (1), (2)). The embodiment state is then described by
$$X = (x_1, \dots, x_N), \qquad x_i = (p_i, v_i) \tag{12}$$
where $x_i$ denotes the state vector of candidate point $i$. The definition of an embodiment in terms of frames/twists and candidate points is generic and can be applied to any robot. A disadvantage of using frames is that they are not elements of a vector space, but define a non-Euclidean manifold.
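To illustrate the forward kinematic map for the planar case, the sketch below builds link frames as a product of matrix exponentials and extracts the candidate points (frame origins). It is a sketch under assumptions rather than the authors' code: all joints are taken to rotate about the $z$ axis, the zero configuration has all links along the $x$ axis, and the function names and the `frac` parameter (0.5 for COM frames, 1.0 for end-of-link frames) are hypothetical.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix [w]."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_revolute(omega, q, theta):
    """exp([S] theta) for a revolute joint with unit axis omega through point q."""
    W = hat(omega)
    # Rodrigues' formula for the rotation part.
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = (np.eye(3) - R) @ q  # a rotation about an axis through q keeps q fixed
    return T

def planar_frames(theta, lengths, frac=0.5):
    """Frames T_i = exp([S_1] th_1) ... exp([S_i] th_i) M_i of a planar chain."""
    theta = np.asarray(theta, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    z = np.array([0.0, 0.0, 1.0])
    # x position of each joint in the zero configuration.
    joint_x = np.concatenate(([0.0], np.cumsum(lengths)[:-1]))
    P = np.eye(4)
    frames = []
    for th, length, qx in zip(theta, lengths, joint_x):
        P = P @ exp_revolute(z, np.array([qx, 0.0, 0.0]), th)
        M = np.eye(4)
        M[0, 3] = qx + frac * length  # constant shift along x in the zero pose
        frames.append(P @ M)
    return frames

def candidate_points(frames):
    """Origins of all frames, one row per link."""
    return np.array([T[:3, 3] for T in frames])
```

For example, `candidate_points(planar_frames([np.pi / 2, -np.pi / 2], [1.0, 1.0]))` gives the COM positions of a two-link arm whose first link points along $y$ and whose second link points along $x$ again.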
III-B Similarity between embodiments
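Similarity of embodiments can be assessed by defining a distance measure between frames/twists or candidate points, that is, $d(x_1, x_2)$, where $x_1$ and $x_2$ result from different state spaces. We first consider the distance
$$d(T_1, T_2) = w_t\, d_t(T_1, T_2) + w_r\, d_r(T_1, T_2) \tag{13}$$
between two frames, $T_1 = (R_1, p_1)$ and $T_2 = (R_2, p_2)$. The distance consists of a translational and a rotational part, which are weighted with factors $w_t$ and $w_r$. The weights can be either constants or functions of other variables. For the translational part we use the Euclidean distance between the two frame origins
$$d_t(T_1, T_2) = \lVert p_1 - p_2 \rVert_2 \tag{14}$$
There are various ways to define the rotational distance between frames. We choose to take the angle between the unit vectors $\hat{x}_1$ and $\hat{x}_2$ pointing along the $x$ axes of the frames, i.e., in the directions of the links. Thus, we define
$$d_r(T_1, T_2) = \arccos\big(\hat{x}_1^\top \hat{x}_2\big) \tag{15}$$
leading to values in the interval $[0, \pi]$. This definition results in numerical problems when performing gradient descent because the derivative of the $\arccos$ function is
$$\frac{d}{dz}\arccos(z) = -\frac{1}{\sqrt{1 - z^2}} \tag{16}$$
which diverges for $z \to \pm 1$, i.e., for parallel or antiparallel axes. To avoid singularities, a modified rotational distance can be defined by shifting the negated cosine into the interval $[0, 2]$ as
$$d_r(T_1, T_2) = 1 - \hat{x}_1^\top \hat{x}_2 \tag{17}$$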
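The frame distance can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' TensorFlow implementation: the smooth rotational distance is taken to be one minus the dot product of the two $x$-axis directions (range $[0, 2]$), and the function names and default weights are hypothetical.

```python
import numpy as np

def translational_distance(T1, T2):
    """Euclidean distance between the two frame origins."""
    return np.linalg.norm(T1[:3, 3] - T2[:3, 3])

def rotational_distance_angle(T1, T2):
    """Angle between the frames' x axes; its gradient is singular at 0 and pi."""
    c = np.clip(T1[:3, 0] @ T2[:3, 0], -1.0, 1.0)
    return np.arccos(c)

def rotational_distance_smooth(T1, T2):
    """Assumed smooth variant: 1 - cos(angle), with values in [0, 2]."""
    return 1.0 - T1[:3, 0] @ T2[:3, 0]

def frame_distance(T1, T2, w_t=1.0, w_r=1.0):
    """Weighted sum of translational and (smooth) rotational distance."""
    return (w_t * translational_distance(T1, T2)
            + w_r * rotational_distance_smooth(T1, T2))
```

The smooth variant is a polynomial in the entries of the rotation matrices, so its gradient stays bounded where the `arccos` form diverges.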
Note that the direction of the $x$ axis of frame $T$ with respect to the laboratory frame can be easily extracted as the first column of the rotation matrix $R$. The distance measure introduced in (13), (14) and (17) includes only the static pose of the embodiment, so frames might be considered similar even though they move in different directions. To also include motion information in the distance measure, the twists of the frames need to be taken into consideration. For dynamic motion imitation we therefore augment the distance measure between two states $x_1$ and $x_2$ by including the translational and angular velocity (9) and (10), that is
(18) 
with
(19) 
Note that the distance measure between two states and , can be extended over all state spaces by defining the sum of all mutual distances
(20) 
In the next section a weighted version of (20) will be introduced by incorporating link correspondences.
III-C Link Correspondences
To measure similarity between links of two embodiments, we first define how correspondence between links of different embodiments can be established. Embodiments may differ in the number and length of links, and thus, a one-to-one assignment between links is often not possible. To establish correspondence between links of different embodiments with possibly different overall size, we first rescale each embodiment by the sum of its link lengths, resulting in a chain length of 1. To establish correspondence, we assign a weight to every possible link-pair combination. Thus, for two embodiments 1 and 2 with $N_1$ and $N_2$ links, respectively, link correspondence can be represented by a correspondence matrix $W \in \mathbb{R}^{N_1 \times N_2}$. Irrelevant combinations result in zero or close to zero entries and higher values indicate higher relevance. Each row of the correspondence matrix contains the correspondence weights of one link of embodiment 1 to all links of embodiment 2, where the highest value indicates which of the links of embodiment 2 is the most relevant. The elements of the correspondence matrix can either be calculated as a function of embodiment states or precalculated for a pair of embodiments, independent of their current state.
State-Dependent Assignment. For state-dependent calculations of the correspondence matrix $W$, weights are calculated using the distance measure between frames. The smaller the distance, the higher the weight for this pair of frames. To obtain the correspondence matrix $W$, the mutual distance matrix $D$ between all links of the two embodiments is computed using the distance measure in (13). A correspondence matrix $W_r$ can be generated by replacing the smallest element of each row of $D$ with 1 and all other elements with 0, resulting in a binary matrix that assigns to each link of embodiment 1 exactly one link of embodiment 2 with weight 1. The same operation can be applied to each column of $D$, resulting in $W_c$. Adding $W_r$ to $W_c$ results in a correspondence matrix $W$. A correspondence matrix that only uses the minimum of each row and each column is very selective and ignores the fact that more than one link of the other embodiment may lie at a similar distance and should be taken into consideration. This effect can be mitigated by applying a softmax function to the rows and columns of the distance matrix after multiplying it with a negative constant factor, which yields soft minima instead of maxima and adjusts the distinctness of the minimum.
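The hard and soft state-dependent assignments can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names are hypothetical, and the sharpness parameter `beta` stands in for the constant factor mentioned above, whose value is not given in the text.

```python
import numpy as np

def hard_correspondence(D):
    """W_r + W_c: mark the row-wise and the column-wise minima of D with 1."""
    D = np.asarray(D, dtype=float)
    n1, n2 = D.shape
    W_row = np.zeros_like(D)
    W_row[np.arange(n1), D.argmin(axis=1)] = 1.0  # closest link of embodiment 2 per row
    W_col = np.zeros_like(D)
    W_col[D.argmin(axis=0), np.arange(n2)] = 1.0  # closest link of embodiment 1 per column
    return W_row + W_col

def softmin(D, beta, axis):
    """Softmax of -beta * D along one axis (numerically stabilised)."""
    Z = -beta * np.asarray(D, dtype=float)
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def soft_correspondence(D, beta=10.0):
    """Soft version: row-wise plus column-wise soft minima of D."""
    return softmin(D, beta, axis=1) + softmin(D, beta, axis=0)
```

A large `beta` recovers the binary assignment, while a small `beta` spreads weight over all links that lie at a similar distance.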
III-D Calculating Distance Between Embodiments
We define the distance between embodiments via the element-wise multiplication of the distance matrix with the correspondence matrix, i.e., $D \circ W$, where $\circ$ denotes the Hadamard product. Only distances between corresponding link pairs remain because non-corresponding pairs are weighted with zero or near-zero values. To obtain a single scalar number, the mean of all entries of the resulting matrix is taken. Matrix norms, such as the Frobenius norm, can also be used. For the evaluation of the correspondence matrix and the distance matrix, suitable weights, $w_t$ and $w_r$, need to be chosen. Different settings are possible here. The pseudocode for calculating the distance measure is shown in Algorithm 1.
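The computation described in this subsection can be sketched as below. This is an illustrative NumPy version under assumptions, not a transcription of Algorithm 1: frames are 4x4 homogeneous matrices, the rotational part is assumed to be one minus the dot product of the $x$ axes, and the function names are hypothetical.

```python
import numpy as np

def pairwise_frame_distances(frames1, frames2, w_t=1.0, w_r=1.0):
    """Mutual distance matrix D between all frames of two embodiments."""
    D = np.zeros((len(frames1), len(frames2)))
    for i, Ta in enumerate(frames1):
        for j, Tb in enumerate(frames2):
            d_t = np.linalg.norm(Ta[:3, 3] - Tb[:3, 3])  # translational part
            d_r = 1.0 - Ta[:3, 0] @ Tb[:3, 0]            # assumed smooth rotational part
            D[i, j] = w_t * d_t + w_r * d_r
    return D

def embodiment_distance(frames1, frames2, W, w_t=1.0, w_r=1.0):
    """Mean over all entries of the Hadamard product of D and W."""
    D = pairwise_frame_distances(frames1, frames2, w_t, w_r)
    return float(np.mean(D * W))

def frame_at(p):
    """Helper for the example: identity orientation with origin p."""
    T = np.eye(4)
    T[:3, 3] = p
    return T
```

Replacing `np.mean(D * W)` with `np.linalg.norm(D * W)` would give the Frobenius-norm variant mentioned above.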
IV Results
We studied static and dynamic imitation tasks in simulation between two dissimilar embodiments using the previously derived distance measure. First, we present results of static pose imitation tasks between planar manipulators with different links using gradient descent. Second, we examined whether a neural network can learn the optimal static pose between two planar manipulators and, furthermore, between two Franka Emika Panda robots for a given expert pose. In addition, we investigated how well the learner generalized to poses it had not seen during training. Third, we present results for a dynamic imitation task between two Franka Emika Panda robotic arms. For this purpose, a simulation environment was built in the physics simulator Gazebo. In all simulations, we assumed that the learner has no knowledge of the robot dynamics.
IV-A Static Pose Imitation Task
In a static pose imitation task, the optimal pose of the learner, $\theta_L^*$, is obtained for a given pose of the expert, $\theta_E$, by minimization of the distance measure
$$\theta_L^* = \arg\min_{\theta_L} d\big(X_E(\theta_E), X_L(\theta_L)\big) \tag{21}$$
Minimizing the distance function is a nonlinear optimization problem for which generally no analytical solution exists, in particular for embodiments with a large number of links. Using mathematical libraries such as TensorFlow, the gradient of the distance function can be computed and local minima can be found numerically via gradient descent. Instead of solving the optimization problem repeatedly for each input, we can learn a mapping $f$ from joint angles of the expert to joint angles of the learner, $\theta_L = f(\theta_E)$, where $\theta_E \in \mathbb{R}^{n_E}$ and $\theta_L \in \mathbb{R}^{n_L}$. The function $f$ can be approximated by a neural network with weight parameters $\phi$. The distance measure was implemented as a computation graph in TensorFlow [1].
Figure caption, panels (d)-(f): Distance function between planar manipulators with two links. (d) State-dependent weight matrix; (e) state-dependent weight matrix with distance-dependent weighting factors; (f) state-independent weight matrix, considering only rotational distance between corresponding links. The distance is plotted over all possible joint angles of one embodiment, while the other manipulator remains fixed.
Comparing Link Correspondences. Before training the neural network, we analyzed the behavior of the distance function for a simple toy model. Fig. 3a-c shows an imitation task between planar manipulators with two links. The distance was measured using a state-dependent correspondence matrix $W$ with varying weight factors $w_t$, $w_r$. Each pose of the learner was found via gradient descent. The choice of weight factors clearly has a strong influence on the quality of the result. Balancing translational and rotational weights is challenging. One possibility to overcome this difficulty may be to simply use the translational distance as the translational weight and to redefine the rotational weight by subtracting the translational distance from its maximum value and rescaling the result. This way, translational and rotational distance are in the same value range, i.e.,
(22) 
The maximum distance results from the fact that both embodiments are normalized to lie in a sphere of radius 1, i.e., diameter 2, which is the maximum Euclidean distance between two points in this sphere. Equation (22) ensures that the sum of $w_t$ and $w_r$ is always 2, excluding the rescaling factor. The problem with the above approaches is that there exist local minima in the distance function, as can be observed in Fig. 2(d) and 2(e) for two-link manipulators. Fig. 2(f), on the other hand, shows that when considering only the orientational distance between frames (setting $w_t = 0$) and using a static, precalculated correspondence matrix, only one minimum remains. This approach is less flexible but more robust and resulted in parallel alignment of corresponding links. Therefore, all following experiments were conducted using this distance measure.
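A static pose search with the rotation-only distance and a precalculated correspondence matrix can be sketched for planar arms as follows. This is a simplified stand-in, not the authors' setup: it uses an analytic gradient in NumPy instead of TensorFlow's automatic differentiation, assumes the rotational distance $1 - \cos$ of the difference between absolute link angles, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def rotational_loss(theta_l, theta_e, W):
    """Mean of W ∘ D, with D_ij = 1 - cos(alpha_i^L - alpha_j^E) for planar links."""
    a_l, a_e = np.cumsum(theta_l), np.cumsum(theta_e)  # absolute link directions
    D = 1.0 - np.cos(a_l[:, None] - a_e[None, :])
    return float(np.mean(W * D))

def rotational_grad(theta_l, theta_e, W):
    """Analytic gradient of rotational_loss with respect to the learner's joint angles."""
    a_l, a_e = np.cumsum(theta_l), np.cumsum(theta_e)
    g_alpha = (W * np.sin(a_l[:, None] - a_e[None, :])).sum(axis=1) / W.size
    # Joint k rotates link k and every link after it, hence the reversed cumulative sum.
    return np.cumsum(g_alpha[::-1])[::-1]

def imitate_pose(theta_e, W, lr=0.5, steps=2000):
    """Plain gradient descent from the zero pose of the learner."""
    theta_l = np.zeros(W.shape[0])
    for _ in range(steps):
        theta_l = theta_l - lr * rotational_grad(theta_l, np.asarray(theta_e, float), W)
    return theta_l
```

With a three-link expert and a two-link learner, a static matrix such as `W = np.array([[1, 1, 0], [0, 0, 1]])` asks the first learner link to align with the first two expert links and the second learner link with the third.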
Pose Imitation Mapping Using a Neural Network. We next implemented a neural network to map joint angles of the expert to corresponding joint angles of the learner, leading to a more efficient method for static pose imitation than conducting a gradient descent search for each state. To generate this nonlinear map, network parameters
$$\phi^* = \arg\min_{\phi} \sum_{\theta_E \in \mathcal{D}} d\big(X_E(\theta_E), X_L(f_\phi(\theta_E))\big) \tag{23}$$
need to be determined, which minimize the distance between the states of the expert and the states of the learner for a given training set $\mathcal{D}$, where the map $f_\phi$ is represented by a neural network and $\phi$ are the network parameters. The training dataset can be generated randomly because it contains only expert angles that do not need to be labeled. The network consists of three hidden layers with LReLU activation functions. The output layer uses tanh as its activation function, resulting in output values in $(-1, 1)$, which can then be rescaled to joint angles. Having generated a training dataset and another dataset for validation, the network was trained using a minibatch-based stochastic gradient descent method. After dividing the training set into minibatches, the update step is performed for each of the minibatches. Afterwards, the whole training set is shuffled and the procedure is repeated.
We trained a neural network to find a mapping from joint angles of a 7-DOF expert manipulator to angles of a 4-DOF learner manipulator and another network for the same pair of manipulators but with switched expert/learner roles. Each time we used a training set of 1024 expert demonstrations, dividing it into 32 minibatches of 32 samples in each epoch. Fig. 4 shows the learned poses of the trained network for given expert angles that were not included in the training set.
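The data handling described above (unlabeled random expert angles, 32 minibatches per pass, reshuffling after each pass) can be sketched as follows. This is a minimal illustration only: the sampling range $[-\pi, \pi]$ for the expert angles is an assumption, the function name is hypothetical, and the network update itself is left out.

```python
import numpy as np

def minibatches(data, n_batches, rng):
    """Yield n_batches disjoint minibatches from a freshly shuffled copy of data."""
    perm = rng.permutation(len(data))
    for idx in np.array_split(perm, n_batches):
        yield data[idx]

rng = np.random.default_rng(0)
# Unlabeled training set: 1024 random joint configurations of a 7-DOF expert
# (uniform sampling in [-pi, pi] is an assumption).
expert_angles = rng.uniform(-np.pi, np.pi, size=(1024, 7))

for epoch in range(2):
    for batch in minibatches(expert_angles, 32, rng):
        pass  # one gradient step on the distance-based loss would go here
```

Calling `minibatches` again at the start of each pass draws a new permutation, which corresponds to the reshuffling step.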
Pose Imitation Between Three-Dimensional Embodiments. We next applied the method from the previous section to two Franka Emika Panda robotic arms in simulation. Dissimilar embodiments were generated by locking individual DOFs of the learner. For example, to simulate a four-link Panda robot with four DOFs, joints 3, 6, and 7 were locked. To simulate a three-link Panda robot, joint 2 was locked in addition (see Fig. 1). We first studied static pose imitation between identical embodiments with all 7 DOFs enabled (Video 1, https://youtu.be/UPZclkFoFXQ). Panel (a) shows an example of how the trained network solves the task between dissimilar embodiments (expert: 7 DOFs, learner: 4 DOFs). The locked joints 6 and 7 (indicated in red) lie at the very end of the embodiment and therefore do not contribute much to the overall configuration of the embodiment, in contrast to the locked joint 3 at the center of the embodiment. In panel (b), the learner has only 3 DOFs. The learner tries to establish similarity by rotating joint 4 because the second joint is locked. The results can be seen in Video 2 (https://youtu.be/BmFH6Nr9F1Y).
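Joint locking can be expressed as a mapping from the learner's free joint angles to a full 7-DOF configuration. The sketch below is one hypothetical way to do this: the function name is invented, joints are numbered from 1 as in the text, and the locked value of zero is an assumption.

```python
import numpy as np

def expand_locked(theta_free, locked, n_joints=7, locked_value=0.0):
    """Fill a full joint vector from the free joints; locked joints stay fixed."""
    free = [j for j in range(1, n_joints + 1) if j not in locked]
    if len(theta_free) != len(free):
        raise ValueError("expected %d free joint angles" % len(free))
    theta = np.full(n_joints, locked_value, dtype=float)
    theta[np.array(free) - 1] = theta_free  # joints are 1-based, arrays 0-based
    return theta

# Four-DOF learner: joints 3, 6 and 7 locked.
print(expand_locked([0.1, 0.2, 0.3, 0.4], locked={3, 6, 7}))
# Three-DOF learner: joint 2 locked in addition.
print(expand_locked([0.1, 0.2, 0.3], locked={2, 3, 6, 7}))
```

The same forward kinematics can then be used for every learner variant, because the full joint vector always has seven entries.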
The experiments have shown that training a neural network with a distance-based loss function works reasonably well for static pose imitation. Local minima in the distance function and overfitting on the training set posed some problems. While the former is more difficult to solve, the latter can be addressed by increasing the size of the training set and stopping the training process at a suitable time. Another possibility to improve the network's performance may lie in the structure and training of the network. We used the same structure for all pose imitation tasks without employing techniques that decrease the probability of overfitting, such as dropout or regularization [4].
IV-B Dynamic Motion Imitation
In this section, we apply a reinforcement learning algorithm for motion imitation by using the online distance measure between embodiments as a feedback signal. Reinforcement learning has shown great success in learning motor tasks, for example, [19, 9]. We study the transfer of motions from one Panda robot with at most seven DOFs to another one in simulation. As before, different embodiments are generated by locking DOFs in the learner. We assume that the dynamics of the robots are unavailable (model-free) and that the learner is controlled by joint torques $\tau$. Consequently, the agent needs to control the joint positions but also, implicitly, learn the robot dynamics.
Simulation Environment. The manufacturer of the Panda robot (Franka Emika) provides a good integration of the robot into the ROS ecosystem, which we augmented with the Gazebo physics simulator. Unfortunately, no exact inertia values for the Panda were provided, although they are needed to simulate the dynamics. The CoRLab group at Universität Bielefeld published some estimates of inertia values in their GitHub repository (https://github.com/corlab/cogimongazebomodels/blob/master/franka/robots/panda_arm.xacro). We used these estimates and manually adjusted them following the guide given in the Gazebo manual. The simulated robot is controlled via joint torque commands. To create trajectories of the expert, simple PID controllers were configured for each joint. Due to the lack of sophisticated controllers and to facilitate the task for the RL agent, gravity was turned off in the simulation.
RL Environment and Agent. The next task was to implement the reinforcement learning agent and the interface for interaction with the simulation. The state space consisted of the expert and learner states; thus, the environment's state is defined by the tuple of expert and learner embodiment states. The actions are given by the torque commands of the learner, $\tau$, subject to the torque limits of each corresponding joint. The control of the expert is not observable by the RL agent. For training and testing, multiple random trajectories of similar duration were recorded. The step size was set to $\Delta t = 0.1$ s and a step consisted of the following transitions: The agent executed an action in the environment by sending a torque command to the learner. The torques were then applied for the duration of the simulation time $\Delta t$. The simulation then paused and returned the next observed state together with the reward, which was calculated from the state as the negative distance measure between the embodiment states. To train the agent, we used Proximal Policy Optimization (PPO) [37], which is a state-of-the-art actor-critic deep reinforcement learning (DRL) algorithm. One big advantage of PPO is its robustness with respect to hyperparameter tuning. We employed the GPU-optimized version (PPO2) from the Stable Baselines repository, which is based on OpenAI's implementations.
Simulation 1: Imitation of a Single Trajectory. We first tested whether motion imitation using reinforcement learning is feasible by transferring a single trajectory between two Panda robots with all 7 DOFs activated, i.e., expert and learner had identical embodiments. Both expert and learner started in their zero pose, in which all joint angles and joint velocities are set to zero. The trajectory of the expert was recorded offline by moving each joint to an arbitrarily chosen goal position. The environment was reset whenever the trajectory ended. Each trajectory had a total duration of 5 s, which leads to 50 steps per episode. The discount factor was chosen such that the agent acted myopically, because high values led to very slow training progress. The weights for the frame-distance function were set to the same values in all simulation experiments. Training took hours on a desktop computer using GPU-accelerated computations. The simulation showed that the learned trajectory resembles the expert's trajectory very closely and that imitation of motions is possible using a reinforcement learning framework with a distance-related reward function. The results can be seen in Video 3 (https://youtu.be/dLN314VJTHg).
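The interaction loop described under "RL Environment and Agent" can be sketched with a small stand-in environment. Everything below is hypothetical scaffolding rather than the authors' Gazebo/ROS interface: the dynamics are a unit-inertia double integrator without gravity, the reward uses a placeholder joint-space distance instead of the frame-based measure, and the class and function names are invented. Only the step size of 0.1 s, the 50 steps per 5 s trajectory, the torque limits and the negative-distance reward are taken from the text.

```python
import numpy as np

def placeholder_distance(q_expert, q_learner):
    """Stand-in for the embodiment distance: mean of 1 - cos(angle difference)."""
    return float(np.mean(1.0 - np.cos(q_expert - q_learner)))

class ImitationEnvSketch:
    """Toy environment: the state holds the expert pose and the learner's joint state."""

    def __init__(self, expert_traj, torque_limit, dt=0.1, substeps=100):
        self.expert_traj = np.asarray(expert_traj, dtype=float)  # shape (steps + 1, n)
        self.torque_limit = np.asarray(torque_limit, dtype=float)
        self.dt, self.substeps = dt, substeps
        self.reset()

    def reset(self):
        n = self.expert_traj.shape[1]
        self.t = 0
        self.q = np.zeros(n)   # learner starts in its zero pose
        self.qd = np.zeros(n)
        return self._obs()

    def _obs(self):
        return np.concatenate([self.expert_traj[self.t], self.q, self.qd])

    def step(self, torque):
        # Respect the per-joint torque limits.
        tau = np.clip(torque, -self.torque_limit, self.torque_limit)
        h = self.dt / self.substeps
        for _ in range(self.substeps):  # apply the torque for dt seconds
            self.qd = self.qd + h * tau  # placeholder dynamics: unit inertia, no gravity
            self.q = self.q + h * self.qd
        self.t += 1
        reward = -placeholder_distance(self.expert_traj[self.t], self.q)
        done = self.t == len(self.expert_traj) - 1
        return self._obs(), reward, done
```

An expert trajectory of 51 poses then yields episodes of 50 steps, and any policy-gradient method can be trained against `step`; the reward is never positive and reaches zero only when the learner matches the expert.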
Simulation 2: Generalization Between Trajectories. We next examined whether the agent can imitate trajectories it had not seen before. For this purpose, we trained the agent with different amounts of training data (one vs. 124 trajectories). Trajectories were recorded as before, but each time with different final poses. The trajectory of the agent trained on a single trajectory barely resembles the trajectory of the expert. The imitation of the agent trained on the larger data set is not perfect, but resembles the trajectories of the expert more closely (see Video 4, https://youtu.be/2Q7jiY9DRUg). This improvement shows in a significantly smaller distance between expert and learner along the trajectory (see Fig. 5).
Simulation 3: Imitation Between Dissimilar Embodiments. In the next experiment, we studied motion transfer between dissimilar Panda robots. To this end, we trained the agent again on the same 124 trajectories, but this time the learner had only 4 or 3 DOFs, respectively. As the learning robot was restricted in its DOFs, some trajectories could not be imitated well, resulting in higher values of the distance measure. We found that the restricted learner moved its links in similar directions as the expert, but the restrictions prevented a closer imitation. Examples can be seen in Video 5 (https://youtu.be/Fytw8sz0pG0).
V Conclusions
Our main contributions in this work are threefold: First, we have introduced a definition of embodiment states in terms of frames/twists and candidate points. Second, we have provided a distance measure between dissimilar embodiments using correspondences between frames of expert and learner. Third, we have applied this distance measure to static pose and movement imitation tasks between manipulators. All tasks have been performed in simulation. In all experiments we could show that the agent was able to learn the imitation task, even though no dynamic model was provided to the learner. The framework that we have developed is generic and flexible and not limited to our choice of parameters, distance measures and type of robots. Depending on the correspondence matrix calculation, the topology of the embodiments is not crucial. Possibly even free topologies like swarms of flying objects could be compared and brought into similarity.
References
[1] (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. ISBN 9781931971331.
[2] (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639.
[3] (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning.
[4] (2018) Neural networks and deep learning. Vol. 10, Springer.
[5] (2002-07) Imitating with ALICE: learning to imitate corresponding actions across dissimilar embodiments. IEEE Transactions on Systems, Man, & Cybernetics, Part A: Systems and Humans 32, pp. 482–496.
[6] (2005) Achieving corresponding effects on multiple robotic platforms: imitating in context using different effect metrics. In Proceedings of the Third International Symposium on Imitation in Animals and Artifacts.
[7] (2002) Do as I do: correspondences across different robotic embodiments. In Procs. 5th German Workshop on Artificial Life, Lübeck.
[8] (2007) Correspondence mapping induced state and action metrics for robotic imitation. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37, pp. 299–307.
[9] (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20.
[10] (2008) Imitation learning of dual-arm manipulation tasks in humanoid robots. International Journal of Humanoid Robotics 5 (02), pp. 183–202.
[11] (2016-01) Learning from humans. Springer Handbook of Robotics, pp. 1995–2014.
[12] (2011) Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189.
[13] (2010) Learning and reproduction of gestures by imitation. IEEE Robotics & Automation Magazine 17 (2), pp. 44–54.
[14] (2010) Learning to walk by imitation in low-dimensional subspaces. Advanced Robotics 24 (1-2), pp. 207–232.
[15] (2013) Probabilistic model-based imitation learning. Adaptive Behavior 21 (5), pp. 388–403.
[16] (2016-03) Guided cost learning: deep inverse optimal control via policy optimization. Proceedings of the 33rd International Conference on Machine Learning.
[17] (2006) Dynamic imitation in a humanoid robot through nonparametric probabilistic inference. In Robotics: Science and Systems, pp. 199–206.
[18] (2009) Learning actions through imitation and exploration: towards humanoid robots that learn from humans. In Creating Brain-Like Intelligence, pp. 103–138.
[19] (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286.
[20] (2016) Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769.
[21] (2002) Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 2, pp. 1398–1403.
[22] (2010) Imitation and reinforcement learning. IEEE Robotics & Automation Magazine 17 (2), pp. 55–62.
[23] (2011) Imitation learning of positional and force skills demonstrated via kinesthetic teaching and haptic input. Advanced Robotics 25 (5), pp. 581–603.
[24] (2007) Affordance-based imitation learning in robots. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1015–1021.
[25] (2017) Modern robotics. Cambridge University Press.
[26] (2017) Phase estimation for fast action recognition and trajectory generation in human–robot collaboration. The International Journal of Robotics Research 36 (13-14), pp. 1579–1594.
[27] (2007) Imitation and social learning in robots, humans and animals: behavioural, social and communicative dimensions. Cambridge University Press.
[28] (2001) Like me? Measures of correspondence and imitation. Cybernetics and Systems 32, pp. 11–51.
[29] (2014) Trajectory planning under different initial conditions for surgical task automation by learning from demonstration. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 6507–6513.
[30] (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179.
[31] (1995) A Lie group formulation of robot dynamics. The International Journal of Robotics Research 14 (6), pp. 609–618.
[32] (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–14.
[33] (2007) Imitation learning for locomotion and manipulation. In 2007 7th IEEE-RAS International Conference on Humanoid Robots, pp. 392–397.
[34] (2006) Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 729–736.
[35] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635.
[36] (2013) Learning monocular reactive UAV control in cluttered natural environments. 2013 IEEE International Conference on Robotics and Automation.
[37] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[38] (2010-10) Learning from demonstration for autonomous navigation in complex unstructured terrain. I. J. Robotic Res. 29, pp. 1565–1592.
[39] (2008) Maximum entropy inverse reinforcement learning. pp. 1433–1438.