I Introduction
Robot learning has facilitated the programming of a robot using Learning from Demonstrations (LfD) and therefore has gained its popularity in the past decades [atkeson1997robot, argall2009survey]. One objective of robot learning is to endow the robot with the ability to learn tasks through its own observations. By performing a task by a teacher, a robot can learn, and thus replicate what has been demonstrated. Compared to approaches that learn from scratch, higher learning efficiency can be realized through this approach. Provision of human demonstration transfers the knowledge/skills from the teacher to the learner, avoiding the robot to relearn those already acquired by and can be transferred from the demonstrator.
However, providing a high quality demonstration can be expensive and may not always be possible, not to mention the necessity of large number of demonstrations required to cover enough stateaction pairs for policy learning. LfD implies that the robot policy may inherit the underlying motion characteristics of the teacher and as such, the poor demonstrations may affect the final performance of the robot. Besides, due to the difference in the morphological structure between the teacher and the learner, direct replication of the demonstrated task by the robot may not be efficient and sometimes not feasible. Hence, with the advances in robot learning, many researches have moved from simply teaching the robot, to exceeding the performance of human [van2010superhuman].
Improving the learning performance from imperfect demonstration can be formulated as an optimization problem. Reinforcement Learning (RL) can be a good candidate to serve for this purpose. By defining a reward function, the learning agent can find the optimal policy such that it maximizes the cumulative rewards received. However, a typical problem is to find a good balance between exploitation and exploration during the learning process. A good exploitation of data may not be possible without sufficient exploration while too much exploration may degrade the learning efficiency. Furthermore, for a surgical robot, as an example, improper or too wild exploration may raise in safety concerns. Therefore, in this context, we propose a framework that can learn from suboptimal demonstrations using RL and performs policy learning within a bounded discretized space, which is constrained by variance of the human demonstrated trajectory. This is shown to shorten the exploration time and enhance the policy learning performance.
Ii Related Work
In robot learning, LfD has been used in an intuitive way to transfer human knowledge to a robot. For example, Mueller et. al [mueller2018robust] proposed an approach to repair errors in acquired skills through additional demonstrations while Osa et. al [osa2017guiding] incorporated human demonstrated trajectories in addition to handcrafted constraints to avoid the need for motion planning from scratch. In this way, not only the learning process through LfD can be made more efficient, but also the robot has the potential to perform a task with quality that is on par with that of an expert.
A general technique for policy derivation is to train a classifier or regressor on a set of stateaction pairs obtained from the demonstrated examples
[argall2009survey]. Taking the state as training input and the action as the output label, this approach aims to approximate a mapping function directly from a state to an action. Gaussian Mixture Models and Gaussian Mixture Regression (GMM/GMR) derive the approximation by first encoding each motion primitive using GMM then applying GMR to concatenate models and generate smooth trajectories
[reiley2010motion, lin2012learning]. Dynamic Movement Primitives (DMP) proposed by Ijspeert et al. [ijspeert2002movement], on the other hand, use a set of differential equations and a nonlinear term to model each motion primitive as a nonlinear dynamical system. Multiple equations describing motion primitives are combined to form a smooth trajectory [pastor2009learning, ijspeert2013dynamical]. Other work used variations of local regression based on kNearest Neighbors, locally weighted regression [atkeson1991using], receptive field weighted regression [schaal1997receptive] and locally weighted projection regression [vijayakumar2000locally] to map nonlinear high dimensional problems. These approximators attempt to model and learn the demonstrator’s behaviour to a state from the demonstrations, so they could carry out an action similar to that of the teacher. However, to fully capture a demonstrator’s motion characteristics, an approximator would have to rely heavily on multiple critical demonstrations, which in practice, are hard to obtain.Various application problems in robotics can be formed and solved in the realm of RL [kober2013reinforcement, rusu2016sim]
. Through iterative interaction with an environment, different states, actions and the associated rewards can be explored, and hence, an optimal policy can be learned by solving the corresponding Markov Decision Process (MDP)
[sutton1998introduction]. Without the need for excessive demonstrations, RL has shown its potential in finding an optimal policy in many applications [silver2016mastering, ebrahimi2017gradient, gao2018reinforcement, hester2018deep]. Vecerik et. al [vevcerik2017leveraging] leveraged this concept and proposed a Deep Deterministic Policy Gradient from Demonstrations (DDPGfD). The framework incorporated the experience of demonstrations to its replay buffer and prioritized it such that the more important transitions are sampled more frequently. Similar work proposed by Nair et. al [nair2018overcoming], also included the experience of demonstrations in replay buffer, but additionally, introduced an auxiliary loss and Qfilter to enhance the learning performance. An introduction of RL avoids the need for large scale demonstrations, but to some extent, it still requires sufficient high quality demonstrations in order to find the optimal policy.Instead of using the demonstrations to initialize the policy, Kim et. al [kim2013learning], on the other hand, proposed Approximate Policy Iteration with Demonstration (APID) using the information from a few and/or suboptimal demonstrations as imposed constraints to an optimization problem, making it less prone to noisy demonstrations. Kang et. al [kang2018policy] proposed a Policy Optimization from Demonstration (POfD), which utilized the scarce and imperfect demonstrations to limit the exploration region to be around that of the expert policy. By including the demonstrationguided exploration term to its learning objective, the author has shown that a better policy could be achieved. Considering the merits of RL and taking the idea of imposed constraints [osa2017guiding] from the scarce and suboptimal demonstrations, we proposed in this paper a framework based on Deep Reinforcement Learning (DRL) approach to find an optimal policy from these demonstrations without losing the safety guarantees. The contributions of the proposed DRL approach are summarized as follows:

The proposed DRL framework learns an optimal skill for a sewing task from scarce and imperfect human demonstrations

The statistical distributions of the multiple human demonstrations are utilized to construct the confined space that satisfies the safety requirement for optimization.

The RL agent is trained within the constrained space to improve the learning performance and efficiency.
The organization of the rest of the paper is as follows. In Section III and IV, the preliminaries and the methodology of the proposed framework are introduced. The experimental setup, the validation of the method and the discussion are then presented in Section V followed by the conclusions and future works in Section VI.
Iii Preliminaries
In this section, the preliminaries of MDPs, Qlearning and deep Qlearning will be presented.
Iiia Markov Decision Processes
A MDP is a mathematical framework based on state set associated with finite action set . If and only if state in captures all the relative information from history, state is Markovian. The transition function between two Markov states can be defined as: . In addition, an Markov Reward Process (MRP) is defined as the tuple , where is a scalar reward function and is a discount factor. The scalar reward function is defined as . Parameter determines how one values the importance of current reward and future rewards. Besides, also stabilizes the total return in infinite timestep case. An MRP with control input/action is defined as an MDP: . The objective of an RL agent is to find the optimal policy that maximizes the accumulated rewards (return) . Considering the observed states and admissible control input , the overall learning problem can be formed into MDPs. To evaluate the value function of the policy, the RL agent will try to solve the MDP. When the value function is obtained, the RL agent can make the optimal decisions according to it, and thus to fulfill the decisionmaking objective.
IiiB Qlearning
In RL, the behavior of the RL agent is determined by the policy defined as . The policy
represents the probability distribution of the action picking
according to the observed system state . is the reward obtained after yielding the action during the th timestep. The RL agent takes the observed state as input and gives out the action , at the same time, receives the reward at timestep . Associated with the reward function, the stateaction value in RL is defined as the expectation of return which starting with state and action . The stateaction value function under policy can be calculated as follows:(1) 
and the stateaction value function can be also rewritten as the Bellman Equation (BE):
(2) 
The goal of Qlearning is to find the optimal decisionmaking policy , which maximizes the stateaction value function. The optimal stateaction value function can be defined as:
(3) 
To maximize the return obtained by the RL agent, the Qlearning algorithm updates the stateaction values according to the reward function in an offpolicy style as:
(4) 
where is the learning rate, which determines the learning speed of Qlearning.
To ensure that Qlearning explores extensively without diverging, the adaptive greedy policy is adopted in this paper. To change the value of the exploration rate in an adaptive way, we start from large value of and then reduce it gradually after episodes being completed.
(5) 
To summarize, the Qlearning algorithm can be viewed in Algorithm 1.
IiiC Deep Qlearning
When the state set and action set are too large or continuous, calculation of the exact stateaction value function
becomes difficult and may be impossible in many cases. One alternative is to use a function approximator to replace the exact stateaction value function. As an universal approximator, neural networks with nonlinear activation functions can be used for the Qnetwork to approximate
[hornik1990uaf]. When the stateaction value function is approximated, RL now becomes approximate reinforcement learning.In the above algorithm, the observed state is the input of the deep neural network while the output approximates the exact stateaction value , , , , . By choosing the action according to the largest stateaction value , the RL agent is able to make the optimal decision.
To simplify the notations, the set of all parameters (including all the weights and biases) in the deep Qnetwork will be written as in the rest of this paper. The deep Qnetwork implicitly determines the policy, in which the action is picked according to the observation at timestep . The chosen action drives the current state to the next state , and the reward will be received from the environment according to the reward function. Then the target network takes the state and the control input set as the input to calculate the output
to estimate largest stateaction value for the next state
. When and are available, the stateaction value of the current stateaction pair can be updated according to the target network and obtained reward. Considering the importance of experience replay during the training of the RL agent [mnih2015human], the state transitions associated with rewards are store in the experience buffer as the tuple . is the current state, is the chosen action according to the current state, is the next state and is the reward received. The algorithm for updating of the weights of the deep Qnetwork is presented in detail in Algorithm 2.Iv Methodology
Based on the DRL algorithm presented above, we propose a framework that extends from the existing LfD approach to enhance the overall robot performance. From a small set of human demonstrations, the modelling is done at the trajectory level to capture the underlying characteristics and distributions of human motions. To further refine the skill learned from human demonstrations, the DRL approach mentioned above is adopted for the optimization. Considering the safety requirement on the robot, the mean and the variance of the trajectory serve as the constraints to the unexplored state space within which the optimal trajectory will be found. The RL agent is then trained and the corresponding policy is optimized through interaction with the physical simulator based on the premodelled trajectories from the previous step.
Iva Human Demonstration and Problem Formulation
For this paper, the setup of human demonstration emulates a robot manufacturing task for stent graft [huang2017vision, tsai2019transfer], which consists of three components: a stereo camera, a suturing device [hu2019design] and a mandrel device as illustrated in Fig. 2. The stereo camera is to capture the motions of the suturing device and the mandrel device during the demonstration. The suturing device is a hand held device, that facilitates singlehanded stitching by passing its needle between two ends at the tip. The mandrel device consists of a cylindrical supporting structure, covered by a piece of fabric and a metal stent, with slot windows to allow for needle piercing.
The task is to teach a robot to stitch on a specified slot such that it binds the stent and the fabric tightly. A human demonstration manipulates the suturing device around the desire location while its motion is tracked using an ArUco marker and the stereo camera. The stitching process consists of three motion primitives, approach the stitch slot, pierce in and pass the needle, and return back to the initial pose. The overall trajectory is illustrated in Fig. 3.
The recording of demonstrations was done at a fixed sampling rate and it captured the movements of the suturing device and the mandrel device. As the mandrel is subject to movements under the camera frame, by capturing its motion allows finding the relative pose between the two devices.
Each demonstration records the trajectory, , which forms from the poses of the mandrel and the suturing device under the camera frame. The action performs a stitch on the same location of the mandrel over time. is a 6 DoF trajectory, which consists of the translation component, and the rotation component, , in EulerRodrigues representation. A collection of human demonstrations is recorded and used to capture the underlying characteristics of demonstrator’s hand motions.
IvB Trajectory Modelling
Multiple trajectories are preprocessed before modelling. The first part of the preprocessing is to filter the noises and sudden hand movements through smoothing the trajectories. Then, the trajectory of the suturing device is transformed to the mandrel frame. A suturing device’s trajectory is oriented around the mandrel and is subject to changes if the pose of the mandrel changes in the camera frame. Therefore, for modelling, the trajectory of interest is the device’ trajectory with respect to that of the mandrel device. This forms a 6 DoF trajectory consists of the translation component, and the rotation component, , in EulerRodrigues representation for each demonstration.
Trajectory modelling is done temporally and the purpose is to find the mean and variance throughout all the demonstrated data. Multiple trajectories possess different temporal length resulted from variations in device’s manipulation. They are aligned temporally using Dynamic Time Warping (DTW) and all the DoF. DTW is a commonly used distancebased method to compute similarity among data set and perform temporal sequences alignment such that the distance score is minimized. After which, the mean and standard deviation is found at each temporal point throughout the aligned trajectories. This forms a 3D space bounded by a standard deviation and within which the optimal trajectory will be found.
IvC Trajectory Optimization
In practice, motion redundancy in a human demonstrated trajectory and the morphological differences between the demonstrator as well as the learner make the demonstration suboptimal and prevent it from direct replay on a robot. Motion redundancy is unnecessary movement from tracking errors or human factors that can result in sudden unexpected motion or longer trajectory execution duration. The morphological difference, on the other hand, is the imperfect mapping between the performer and the learner, hence without optimization, the demonstrated trajectory may result in unsmoothed joint movements in a robot even if the demonstration was perfect, which may raise safety and collision concern.
The simulation shown in Fig. 5 is the replicate setup of the real environment where the demonstrated trajectory will be applied after the optimization. The simulation provides collision detection in addition to visualization and inverse kinematic computation. For a given pose, the simulation computes the associated joint movement and detects potential collisions if the action was to be carried out. This information will be used to determine the optimal policy.
The modelled trajectory is first used to create a 3D discretized space within which encloses all the demonstrated trajectories. The discretization is defined by a userspecified step size, , and it represents the unit length for each grid in space. This creates a 3D space of size L, W, and H in x, y and z directions respectively, where .
The mean and variance of the trajectory are adopted to construct the constrained space for the optimization problem. Motion primitives with larger variance imply multiple possible trajectories and therefore the policy will aim to find the optimal path within the bounded space. On the other hand, the ones with smaller variance imply movement with higher precision, hence the policy should deviate less from the mean trajectory. By constraining the search area within the variance and applying a teacher’s understanding of the subtrajectory to the learning policy, it can avoid the exploration of suboptimal region and action and thus improving the learning efficiency.
The goal of the RL agent is to find the optimal path such that it can complete a demonstrated task with minimal overall joint and endeffectors movements and without collision while ensuring the smoothness of the trajectory. To this end, the RL agent is trained under the guidance of the defined reward function as follows
The reward function consists of 4 terms. The first term is the average of which is the sum of the 7 joint angle difference of the robot between the current state and the next state. The second term compares the pose length between the current state, , and the next states, , to that of the mean trajectory, . The third term,
, is the angle between two vectors, obtained from the previous state, the current state and the next state.
As the optimal trajectory is defined based on its overall trajectory length in both the joint space and the endeffector as well as the smoothness of the endeffector. The first two terms aim to penalize for action that resulted in higher joint changes and larger displacement between two consecutive states. The third term penalizes actions that resulted in sharp turn between states. The last term was determined heuristically to facilitate the convergence of the agent.
The RL agent is represented by a deep neural network which takes the current state as the input and outputs the 126 possible actions. The state considered is only the translation term. The inclusion of orientation in the optimization will largely increase the number of possible states and the complexity of the policy and thus will be addressed in the future work. Each action corresponds to a point in space around the mean of the next point. 125 of them are uniformly sampled within the standard deviations of axes around the next mean point and the additional action represents the current state, i.e. stationary. For each step, the RL agent either takes a random valid action or the optimal action according the current policy. The rotation component is realized by finding the closest position on the mean trajectory to the next state and using the corresponding orientation as its rotation component. The location and the orientation are the pose which is used to calculate the inverse kinematics for the robot, before running in simulation for visualization and collision detection. Episode begins from the starting pose and ends at the terminal pose of the mean trajectory. The RL agent is trained at every step through minibatch of samples from the experience buffer. When collision is detected, the current iteration is restarted.
After training, the optimal trajectory is generated by continuously taking the optimal action of the current state from the start point to the end of the mean trajectory. The corresponding joint trajectory is determined through inverse kinematics.
V Experiments and Results
Va Experimental Setup
To validate the proposed framework, we first collected a set of human demonstrated trajectories on singlehanded stitching. During the demonstration, suturing was performed multiple times on the same location of the mandrel while the entire motion was recorded through stereo camera as described in the previous context. In total, 10 demonstrations were performed, each of which contained 6 DoF temporal poses of the mandrel and the suture device. A simple moving average smoothing with a window size of 10 was applied to smooth out sudden hand movements and/or vision misdetection.
Demonstrations were first transformed to the mandrel frame of reference, and the DTW was performed to align temporally all the trajectories. The mean and variance of the trajectories were computed to construct a constrained space for exploration and exploitation during optimization. After which, the modelled mean trajectory was served as the basis for the optimization while the variance was used as the constraints for exploration.
The optimization process was done using Gazebo simulation and visualizing using Rviz on the Robotics Operating System (ROS). The physical properties and location of different components in the simulation were mimicked to the real world setting. Gazebo was used to simulate the physical properties of the systems. In specific, it was mainly used to simulate the interaction between objects, i.e. collision. The robot used to carry out the single handed task was chosen as the ABB IRB14000 YuMi robot. For a given trajectory in the frame of its endeffectors, the inverse kinematic of the suturing hand was computed and the joint movement and objects interactions were visualized through Rviz.
During optimization, the RL training algorithm started from the mean trajectory, and the RL agent interacted with the simulation world to learn the policy at every step based on the reward function defined earlier. It identified the optimal path by exploring within the space where the variance was large while strictly following the mean of all the demonstrated trajectory when the variance was small.
There were a few hyperparameters to be determined for tuning purpose during policy learning, in which the process was done mostly through trial and error and with some experience. This included, the learning rate, discount factor, minibatch size, number of episodes and number of timesteps in one episode. The parameters of the neural network are summarized in Tab.
I. During policy learning, if a collision occurred, or the boundary condition, i.e. the variance, was met, the simulation would restart. The whole learning process stopped until convergence.The RL agent was represented by a deep neural network to map between states and actions. It consisted of 2 fullyconnected hidden layers with 400 neurons in the first layer and 200 in the second, both with ReLU activation functions. The input to the agent was the current state, while the outputs were the Qvalues of all possible actions. The update of the parameters in the neural network was done through minibatch gradient fashion. A minibatch of training data was randomly selected from the experience buffer to update the network until convergence. The reward and the target Qvalue was given and estimated based on its current state and the defined reward function. At the end of the process, the optimal path was generated by continuously taking the action with the highest Qvalue until the end pose was reached. The start and the end pose were mainly given such that they corresponded to the first and the last poses of the mean trajectory.
VB Results and Discussion
Parameter  Symbol  Values 

Step Size  0.002  
Discount Factor  0.99  
Episode  200  
Steps  284  
Exploration Rate  
MiniBatch Size  32 
To validate the performance of the proposed framework, we performed quantitatively analyses on two spaces, one in the Cartesian space and the other in the joint space, in terms of the overall trajectory length and the smoothness. The target robot was the ABB IRB14000 YuMi robot and we utilized the Gazebo simulation to detect for collision during the execution.
We compared the pose and joint trajectories of the optimal one with all demonstrations from the training set and the three best human demonstrations from the testing dataset as well as the mean trajectory. The results are shown in Fig. 6. The optimized trajectory showed the best results among all the demonstrations. The optimized had trajectory lengths of 0.21m in Pose and 796.98 in the joint, while the best human demonstration had that of 0.24m in Pose and 814.68 in joint angle.
In addition to the trajectory length, we also compared the smoothness of a pose trajectory. The smoothness of a pose trajectory was found by computing the mean of the change in the angle between three consecutive points along a trajectory. From the definition, the lower the changes, the smoother a trajectory is. We also compared the smoothness with the three best demonstrations. The results for the them were 27.19, 27.56 and 28.68 respectively while the optimized one was 22.55.
The visual comparisons between the best demonstrations and the optimized trajectory are shown in Fig. 7. As clearly shown, the optimized trajectory not only appeared to have shorter trajectory, but also appeared to be smoother among the three. The quantitative analysis has validated the proposed framework is capable of learning an optimal trajectory from the suboptimal trajectories. In the results, it has shown that the learned trajectory outperforms any of the demonstrated ones in terms of the trajectory length and the smoothness. The shorter length and the smoother trajectory in the optimized trajectory implied shorter and smoother execution of the learned task.
Vi Conclusions
In summary, we proposed a framework that extends from the general LfD to teach a robot to perform a task using scarce and suboptimal demonstrations. In this paper, we wanted a robot to learn a task from human demonstrated trajectories. Instead of directly learn the stateaction mapping from them, the demonstrations were served as constraints to our optimization framework. During the training process, the RL agent interacted with the simulation environment. The training of the RL agent was guided by a carefully engineered reward function. In our experiments, we have shown that the proposed framework was able to find the optimal trajectory that outperformed any of the human demonstrated trajectory in terms of trajectory length and smoothness. In addition, the trajectory also took into account the collision when performed joint trajectory planning which would not be generally considered if the conventional mapping method was adopted to the learn a trajectory.
Using only a limited amount of demonstrations, not only we avoided the need to learn from scratch, but also shortened the amount of time needed for RL agent training by constraining the unexplored space. The current work can be served as a basis for a range of potential extensions. We validated the framework on a suturing task, which involves fine movement, but this could be easily extended to other domain. Besides, the robot adopted was the ABB YuMi robot and the proposed framework has shown its applicability to work on other types of robots.
In this work, we used a discretized approach to perform the optimization and we only considered the positional information during training. Finer discretization and taking the orientation into consideration would result in longer training time. Therefore, in the future work, it is necessary to consider continuous RL approaches such as DDPG, using the orientation information for training and incorporating bimanual task execution.
Comments
There are no comments yet.