Nowadays, most of the robots used in the industry are pre-programmed and necessitate a well-defined and controlled environment. Reprogramming these robots is often an expensive process necessitating an expert. Enabling the robot to learn tasks by demonstrating them would streamline the robot installation and task reprogramming. That is why, Robot Learning from Demonstrations (RLfD)  is one of the key research areas in the field of robotics. However, constructing a robot that is able to learn by observation is still a challenging problem. Although, prototype platforms for robot learning by demonstration have been around for more than a decade, many complications have restrained the robots to operate only in restricted laboratory environments. Some of the key challenges are perception, task recognition, task generalization, motion planning, and object manipulation.
RLfD refers to the technique of teaching skills to a robot by giving examples of the desired behavior through human demonstrations. It is similar to the way human beings learn a new skill from demonstrations performed by the teacher. One of the great advantage of this technique is that it eliminates the need of expert level technical knowledge to program a robot. Moreover, it captures key features from the demonstration provided by a teacher who is expert in the specific task, which a robotic programmer might not be able to program. Thus, this technique has great potential in industrial as well as home robotic applications.
Learning the mapping between world state and actions is called as a policy. This allows a robot to choose an action based on its current state. In RLfD, a robot learns a new policy from demonstrations provided by the human teacher. A demonstration is defined as sequence of state–action pairs that are recorded throughout the performance of the required robot behavior by a human teacher. RLfD uses the dataset of recorded demonstrations to form policy to reproduce the demonstrated behavior.
We focus on learning, representing, and generalizing of trajectory based skills from demonstrations. We propose a novel RLfD framework by which a collaborative robot can learn trajectory based skills from kinesthetic demonstrations performed by a human teacher. The prosed RLfD framework has three fundamental phases: data collection from demonstrations, learning behavior by deriving a policy, and robot execution or reproduction of the demonstrated behavior. It extracts key information from multiple demonstrations to learn a state-action policy, which is used to produce a generalized trajectory based skill.
RLfD is a technique to enable a robot to perform new tasks autonomously. Instead of necessitating users to logically decompose and manually program a robot for a desired behavior, RLfD enables to derive robot controller by observing human’s own performance. The goal is to easily extend and adapt robot capabilities to novel situations, even by users who lag programming ability. Robot learns a model of a task based on the demonstrations performed by the teacher. In the last two decades, several RLfD approaches have been developed . However, until date most of the robots are tediously hand programmed for each of the task they have to perform. RLfD tries to minimize this challenging step by allowing users to teach their robot to fit in their specific needs. RLfD techniques are user friendly and enable robots to be employed more widely in daily activities with no need of specialist humans. Moreover, by using expert knowledge of human teacher, in form of demonstrations, the overall learning process would be faster as compared to hand-engineered control policy and learning by trial-and-error.
, in which a probabilistic representation of the demonstrations is built using Gaussian Mixture Model (GMM) and a smooth trajectory is reproduced using Gaussian Mixture Regression. In another work by Schneideret al. introduced a Gaussian process regression model that clusters the input space from performed demonstrations into smaller subsets and work individually on these subsets . Both of these Gaussian techniques generalized over a set of demonstrated trajectories, but the GMM is computationally expensive.
Rozo et al. proposed an end-to-end RLfD framework for teaching force-based manipulation tasks . The demonstrations are encoded using a Hidden Markov Model (HMM), and the reproduction is done using a modified version of Gaussian Mixture Regression. Asfour et al. used continuous HMM to generalize movements demonstrated to a robot multiple times. The experiments were performed using a kinematics model of the human upper body to simulate the reproduction of arm movements .
In a recent work, Ahmadzadeh et al. proposed an approach that generates a continuous representation of the demonstrated trajectory based on the concept of canal surfaces . Their experimental results show that the proposed approach can reproduce a wide range of trajectories that achieve the goal of the task. Koskinopoulou et al. developed a framework for Human Robot Collaborative (HRC) task execution . Akgun and Thomaz developed an algorithm to simultaneously learning actions and goals from naïve human teacher demonstrations .
Konidaris et al. presented an online algorithm for constructing skill trees from demonstration trajectories in a dynamic continuous domain . They evaluated their algorithm on the uBot-5 mobile manipulator and showed that it was able to learn skills from both expert demonstration and learned control sequences. Other techniques like linguistic transfer of an assembly task from human to robot have also been developed .
Iii Problem Statement
We can consider RLfD as a subset of Supervised Learning. In supervised learning, the agent takes the labelled training data and the algorithm learns an approximate model to fit this data. Similarly, in RLfD, the training data is collected from demonstrations provided by the teacher. Figure2 (top) shows training data being acquired from demonstrations by teacher to derive a policy.
The robot world comprises of states and actions . Mapping between the states, is defined by a probabilistic transition function called state transition matrix given by:
The set A contains high-level behaviors as well as low-level motions. We make an assumption that the states are not fully-observable or hidden. However, the learning algorithm has access to the observed state , with the mapping . The policy selects the next action based on the current robot state. Figure 2 (bottom) shows one cycle of policy execution.
Iv Prposed Approach
The proposed RLfD system consists of three fundamental phases. Figure 3 shows a block diagram of the proposed approach. First phase is the data collection phase, in which the data is acquired from the demonstrations performed by the teacher. Real-time joint angle values and gripper state are received by the system, which applies Forward Kinematics to find the position and orientation of the manipulator. This process is repeated for all the demonstrations and the time series of position and orientation are stored.
The second phase is the task learning phase, in which the stored data from the first phase is used to learn a model of the task or the behavior. From each demonstration, key-points of trajectories are extracted and then k
-means clustering is used to cluster the key-points from all demonstrations. Centers of each cluster is calculated and mapped to a unique symbol representing a state. Baum-Welch algorithm is used to learn a policy from the demonstrations to find the probability matrix of state-action pairs. Then, Viterbi algorithm is used to find the most probable sequence of states. Dynamic Time Warping (DTW) is used to align the time vector of the learned trajectory.
Third phase is the robot execution phase, in which the saved model is used to produce the learned behavior. Cubic smoothing-spline regression is used to determine a generalized trajectory from the most probable sequence of states. Inverse kinematics is applied to obtain a generalized trajectory suitable for the robot.
V Data Collection
Data collection is the first phase of the system, where the robot collects data from the physical demonstrations performed by human teacher. In the next three sub-sections, we discuss the main steps of this phase.
In our work, we have used kinesthetic teaching to demonstrate the task or trajectories to the robot. Kinesthetic teaching is a way of teaching in which, the teacher physically holds the robot arm or manipulator, and demonstrates the task by moving the robotic arm. The main advantage of this technique is that there is no need to map the real world space to robot world space, as the performed demonstrations are with respect to robot world.
One major roadblock for this type of teaching is that it requires a robot that has a feedback system to record the joint position in real-time. However, we do not face this problem as we are using one of the most advanced research robot called Baxter, which has inbuilt feedback system to measure the joint angles from all fourteen joints and manipulator state of the dual-arm robot. This allows us to read these joint values from joint trajectory server in real time. To get more precision for fast movements, we have increased the default rate (100Hz) of reading current joint angles from the robot to 1000 Hz. These values are stored with a timestamp and sent to the forward kinematics function.
Figure 1 shows an example demonstration being performed using kinesthetic teaching to demonstrate a pick and place task to Baxter. In this, the teacher holds the force sensor at the manipulator of each robotic arm and guides the robot through the task. To change the state of the gripper, physical buttons near the manipulator are used. One button closes the gripper and the other opens the gripper. The system records the gripper states along with the joint angles throughout the demonstration, and for each demonstration.
V-B Forward Kinematics
‘Forward kinematics refers to the use of the kinematic equations of a robot to compute the position of the end-effector from specified values for the joint parameters’. Kinematics equations are series of transformation matrices to characterize the relative movement at each joint with respect to the previous joint. Instead of going through the complex matrix multiplications for calculating the relative movements at each joint, we use Denavit-Hartenberg parameters or DH parameters.
Each individual homogeneous transformation is calculated as:
where and are known as DH parameters associated with link and joint .
We use Baxter as our robot for learning, thus we calculated the DH parameters for both the limbs of the robot to perform the forward kinematics. We follow the standard convention of placing the frames at each joint to calculate the DH table. Figure 4 shows an example of placing the frames on first two revolute joints and the physical parameters used to mathematically modeling Baxter’s arm. Table I shows the derived DH parameters for all the joints and links for Baxter. These parameters are used to calculate the individual transformation matrices for each of the link.
Now, we use the DH parameters to calculate . The homogeneous transformation matrix that expresses the position and orientation of the manipulator is called a transformation matrix , and is calculated as:
For each measurement, the calculated position and orientation are stored into a data file, along with the time stamp and the gripper state. This marks the end of the first phase.
Vi Task Learning
Task learning is the second phase of the system. In this phase, the data stored during the first phase is used to learn a model of the task.
Vi-a Key-point Extraction
We detect the characteristic features of each demonstration, and we call it key-points. By doing this, we avoid having large number of hidden states to be trained by HMM. We represent every recorded movement during the demonstration by a set of time-discrete sequence. For each arm we have a sequence that defines the positions of manipulator, and a sequence that defines the orientations of manipulator over time, where is defined as the length of demonstration . is a three dimensional vector, whereas is a four dimensional vector. A point is selected as a key-point in the sequence of points if:
where, denotes the time-stamp of demonstration and key-point. This means that if the angle between the vector that goes from point to its successor and the vector that goes from point to its predecessor is less than , then point is selected as a key-point. Figure 5 shows an example of finding angles between the two vectors in a two dimensional plane. To detect only sharp corners in the manipulator trajectory as key-points should be high.
To detect the key-points of the orientation angles of the manipulator, similar selection criterion for key-points is used. A point ,i is selected as a key-point in the sequence of points if:
A major challenge in detection of the key-points using this approach is to tune the values of thresholds , which will decide the number of key-points extracted from the recorded task or trajectories. If the number of detected key-points is low, some relevant characteristics of the task or trajectories can be missed from the generalization step. Whereas, if the detection of key-points is high, over fitting of the recorded task or trajectories can take place.
Keeping in mind the trade-off between generalization and over-fitting, reasonable values for were experimentally determined. Figure 6 shows the result of key-point extraction algorithm on multiple demonstrations. It shows the key-points extracted from three demonstrations. The key-points are marked with circles on each original trajectory of the demonstration.
In contrast to the common key point approach used by Asfour et al. , we cluster the key-points derived from all the demonstrations and calculated the centroids of each cluster. To solve this problem, we use k-means clustering algorithm. The k-means clustering algorithm takes the number of clusters as an input to generate k clusters, from a set of observation vectors. It returns the centroids for each of the k clusters formed. For our system, we set the number of clusters k equal to average number of key-point in each demonstration.
A key-point vector is classified with a cluster if the centroid of the cluster is closest to it, i.e. closer to centroid than any other centroids.k-means clustering algorithm attempts to minimize the distortion, which is the sum of the squared distances between each key-point vector and its dominating centroid. At each step, k-means refines the choice of centroids and tries to reduce the distortion. When this distortion change gets below a threshold, the algorithm stops. Figure 6 shows the result of the implement k-means clustering algorithm to cluster the key-points and find the centroids of the clusters. In the plot, each centroid is marked with big red circles. In the next section, we discuss how these centroids are used to learn a model of the task or trajectories using a HMM.
Vi-C Model Learning using HMM
A HMM (Hidden Markov Model) is a directed graphical model that is a statistical Markov chain of a sequence of unobserved or hidden states and a corresponding sequence of observation variables. It has been widely applied in the field of handwriting recognition, speech recognition, DNA sequence analysis, etc.
Figure 7 shows a graphical representation of HMM, where denotes the hidden states and denotes the observed variables at time instants . The probability of going to state at time , given that the current is state at time , is denoted by:
Using these probabilities, we form the state transition matrix, given by:
where, denotes the number of hidden states in the HMM. In this work, the number of hidden states was set equal to the number of centroids derived from the demonstrations.
The initial state probabilities is defined as the probability of model being in state at time , and is given by:
The observation probability is defined as the probability of observing a symbol at time given the model is in state , and is denoted as:
Using these probabilities, we form the observation probability matrix, given by:
where, is the number of observation symbols. A complete HMM is described as:
In our research, we used a discrete-HMM for learning a model of the demonstrated trajectories, which required the recorded continuous trajectories to be mapped to discrete values. The key-point extraction and clustering technique applied in section VI-A Key-point Extraction and section VI-B Clustering were for vector quantization. As it is required to use most of the k clusters in order to preserve most of the features in the continuous trajectory, we mapped each cluster into a discrete symbol to from a codebook of symbols , where k is the number of clusters formed. The sequences of these observations form the set, and is given by:
where, M denotes the total number of demonstrations.
The efficiency of learning with HMM depends on the number of available observations. As in RLfD, it is preferred to keep the number of observations or demonstrations low, an appropriate initialization of the model parameters is imperative. Next, we discuss the initialization process of the HMM parameters.
We implemented the Bakis left–right topology  to model the demonstrated task. The forward-transition probabilities and self-transition probabilities were initialized in the state transition matrix as:
where, is the amount of time spent in state , and is a normalizing constant to make sure .
All other state transition probabilities were set as zero, to make sure that transition to those states is impossible. Output probabilities were initialized as:
where, denotes the number of times is observed in state of . The state probabilities , were initialized as . This ensures that the learning always starts from the starting point.
Once the model parameters were initialized, Baum–Welch algorithm  was used to train on all demonstrations or observations . Later, Viterbi algorithm  was used to determine the most probable sequence of hidden state. Due to the nature of the implemented Bakis left-right topology, some hidden states or centroid points were not present in all the observed trajectories. We called them as zero-points and were ignored in the generalized trajectory.
Vi-D Dynamic Time Warping
Since the length and velocities of the demonstrated trajectories differ, the time frames of extracted key-points of each demonstration are different. To produce the learned trajectory, we need to align the set of key-points along a common time vector. To solve this problem, we use DTW , in which the temporal alignment of the clusters made by key-points is done by aligning the entire trajectory with respect to a reference trajectory.
DTW sequence alignment technique forms a matrix that consists of distances between two time-series, which is then used to find an optimal path that minimizes the distance between the two time-series. For a given test sequence of length , and a reference sequence of length , the distance matrix is calculated as:
In our research, to calculate the distance between two time-‘series, Euclidean l2-norm is used. The optimal alignment path is calculated as:
For selection of the reference sequence, forward algorithm was used to find the demonstration that has the highest probability for the learned model. As our data received from demonstration was eight dimensional data, we implemented multidimensional DTW, and the distance matrix is calculated as:
where, is the number of dimensions, and in our case, .
Vi-E Generalized Trajectory
After we have a common time-line for the model learned for the sequence of centroids, we need to connect these centroids to generate a generalized trajectory. To solve this problem, we used cubic smoothing-spline regression to determine a generalized trajectory from the most probable sequence of centroids. This method is widely applied for fitting a smooth curve to large set of scattered data. In our case, we want to generate a smooth trajectory from the set of scattered centroid points. The spline curve was interpolated at intervals equal to the period size of clusters, which produced a generalized trajectory suitable for Baxter.
Vii Robot Execution
Once the model of the task is saved, we use the stored files to perform the task. The files contains the centroids, which are the position and orientation of the manipulator. For the robot to reproduce the learned task, we need to convert these position and orientation values into joint angles values that are suitable for the robot. To solve this problem, we use Inverse Kinematics. This technique is just the opposite of Forward Kinematics. It provides the set of joint angle values for a given the Cartesian pose of the manipulator.
We use the Baxter’s IK server to perform the inverse kinematics during the execution of the task. It is a build-in service from Baxter SDK. It takes the position and orientation values as the input and returns seven valid joint angles values to achieve it. These values are sent to the robot and it moves to that state using Joint Trajectory Action Server (JTAS).
To evaluate the approach, we performed the experiments on the research robot by Rethink Robotics called Baxter. It is a dual-arm 7-DoF robot. A modular approach was used to create the framework and all the algorithms were developed as separate packages in python. This made it easier to test and debug each part of the framework separately. In this section, we discuss experiments performed and the results of those experiments.
The key-point extraction algorithm and clustering algorithm plays a vital role in the efficiency of the model. The variables for tuning the key-point extraction algorithm were experimentally determined. To make sure this algorithm works well on wide variety of task and complex trajectories, we validated them on different tasks. In the next two sub-sections, we discuss the details of the two complex tasks that the robot learned through multiple demonstrations.
Viii-a Stacking cup task
The first task is to place a cup on a stack of cups. In this task, the robot takes the cup from a human and puts it on a stack of cups. Figure 1 shows robot learning the stacking cup task from a human teacher through demonstration. Three demonstrations were performed to teach this task. Table II shows the data statistics for this task.
|Demonstration #||# of recorded values||# of key-points extracted|
The robot was able to reproduce the task successfully. This task involved learning of precise motion of the manipulator to precisely drop the cup on top of the stacked cups. Figure 9 shows the manipulator position for the demonstrated task and the learned trajectory. The small triangles on the trajectories marks the key-points extracted from that demonstration, and the red circles show the centroids of the clusters formed. In total, 19 clusters were formed and they acted as states for the HMM. Numbers from 0-18 named these centroids. The cyan line shows the generalized trajectory. This plot only shows the position of the manipulator only. They complete learning process involved the learning of the orientation of the manipulator as well. Robot performing the learned stacking cup task can be seen in figure 8.
Viii-B Pick and place task
The second task is a pick and place task, in which the robot picks up a block of wood and places it at some other spot and then pick the block from that spot and places it back to the original spot. This is a much more complex task than the previous task. Table 4 shows the data statistics for this task.
It can observed in Figure 10 that the three demonstrations were a lot different from each other, though they were performing the same task. The small triangles on the trajectories marks the key-points extracted from that trajectory, and the red circles show the centroids of the clusters formed. In total, 21 clusters were formed and they acted as states for the HMM. Numbers from 0-20 named these centroids. The cyan line shows the generalized trajectory. Please note that the trajectory in the graph is not the actual trajectory the robot went through. The derived trajectory in the plot is just by connecting the states with linear lines to illustrate the flow of states. The actual trajectory of the robot used JTAS spline interpolation and was much smoother than the one in the plot.
|Demonstration #||# of recorded values||# of key-points extracted|
In both the experiments, the robot was able to learn the complete task and successfully reproduce the learned task.
In this paper, we presented a framework for RLfD. This framework has three phases: Data collection, Task Learning and Robot Execution. The data collection module collects and stores the eight dimensional time-series data from each demonstration. The Task Learning module then uses this data to learn a model of the task. As we can observe from the experiments, the key-points are extracted from each of the recorded demonstration. This was done to limit the number of observed states for the HMM. As we can observe from the data statistics of the two experiments, the number of extracted key-points is much less than the recorded values. This reduction in the number of states lead to a faster learning of the model. Initially, we tried to learn the model of the task based on the joint angles values, and it avoided the implementation of the forward and inverse kinematics. Then, we implemented FK and IK and used the position and orientation of the manipulator, which made it easier to visualize the trajectories of the task. In addition, the key-points extracted from both the methods were the same in most of the task. Thus, we decided to move forward with the second approach using the position and orientation, as it is easier to plot and visualize the position of the manipulator.
The model learning part of our approach is similar to , but the key point extraction technique we implemented allowed us to precisely extract key-points from the demonstrations. By adjusting the threshold values , we were able to fine-tune the key point extraction algorithm to extract only sharp corners from the demonstrated trajectory. The k-means clustering technique that we used for clustering the key points allowed us to include all the key points from all the demonstrations performed. Thus, we were able to preserve the key features of the task that were missing from some demonstrations. Thus, we got more key features of the task as compared to Asfour et al. , where they used the common key points concept, i.e. they just use the key points that are found in all demonstrations. Therefore, the generalized trajectory only includes the features that are present in all demonstrations.
It was experimentally determined that three or four demonstrations for most of the task, tends to produce a generalized trajectory closest to the most probable demonstrated trajectory. The learning of the orientation played a crucial role as well. For many tasks like picking up the wooden block, the orientation of the gripper has to be perfectly aligned with the object to grab the object from the right place. The model learning also involved learning of the gripper states. The model learned should not only learn whether the gripper should be closed or open, but also the width it should open or close to grasp or release an object. This ensures that the robot does not break delicate objects by learning the grasping width of the gripper during the task.
In this paper, a novel RLfD framework was introduced, which used HMM to model the trajectory skill demonstrated to a collaborative robot. Our results show that the robot can collaborate with the human teacher and learn variety of trajectory based task in short time and reproduce the task while working with a human user.
In future, other collaborative robots can be trained to learn complex task using the developed RLfD framework. With the advancements in the field of image processing and visual servoing, RLfD can be made more robust by allowing the robot to learn the position and orientation of the task relevant objects in addition to the position and orientation of the manipulator. This way, the robot will be able to adapt to the changing environment, as it will learn the environment parameters like distance from task relevant objects. This will enable the robot to learn a task and then reproduce it in a dynamic environment.
C. G. Atkeson and S. Schaal, “Robot Learning From Demonstration,” in
Proceedings of the Fourteenth International Conference on Machine Learning, pp. 12–20, 1997.
-  B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
-  S. Calinon, F. Guenter, and A. Billard, “On Learning, Representing, and Generalizing a Task in a Humanoid Robot,” IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol. 37, pp. 286–298, 4 2007.
-  S. Calinon, I. Sardellitti, and D. G. Caldwell, “Learning-based control strategy for safe human-robot interaction exploiting task and robot redundancies,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 249–254, IEEE, 10 2010.
-  S. Calinon, “Robot Learning with Task-Parameterized Generative Models,” in Proc. Intl Symp. on Robotics Research (ISRR), 2015.
-  M. Schneider and W. Ertel, “Robot Learning by Demonstration with local Gaussian process regression,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 255–260, 10 2010.
-  L. Rozo, P. Jiménez, and C. Torras, “A robot learning from demonstration framework to perform force-based manipulation tasks,” Intelligent Service Robotics, vol. 6, pp. 33–51, 1 2013.
T. Asfour, P. Azad, F. Gyarfas, and R. Dillmann, “Imitation learning of dual-arm manipulation tasks in humanoid robots,”International Journal of Humanoid Robotics, vol. 5, no. 02, pp. 183–202, 2008.
-  S. R. Ahmadzadeh, R. Kaushik, and S. Chernova, “Trajectory learning from demonstration with canal surfaces: A parameter-free approach,” in 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 544–549, IEEE, 11 2016.
-  M. Koskinopoulou, S. Piperakis, and P. Trahanias, “Learning from Demonstration facilitates Human-Robot Collaborative task execution,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 59–66, IEEE, 3 2016.
-  B. Akgun and A. Thomaz, “Simultaneously learning actions and goals from demonstration,” Autonomous Robots, vol. 40, pp. 211–227, 2 2016.
-  G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learning from demonstration by constructing skill trees,” The International Journal of Robotics Research (IJRR), vol. 31, pp. 360–375, 3 2012.
-  N. Dantam, I. Essa, and M. Stilman, “Linguistic transfer of human assembly tasks to robots,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 237–242, 10 2012.
-  L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
-  E. Keogh and C. A. Ratanamahatana, “Exact indexing of dynamic time warping,” Knowledge and Information Systems, vol. 7, pp. 358–386, 3 2005.
-  A. Vakanski, I. Mantegh, A. Irish, and F. Janabi-Sharifi, “Trajectory learning for robot programming by demonstration using hidden Markov model and dynamic time warping,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 4, pp. 1039–1052, 2012.