Imitation Learning (IL) is a machine learning technique in which an agent learns to perform a task using example demonstrations. This eliminates the need for humans to pre-program the required behavior for a task, instead utilizing the more intuitive mechanism of demonstrating it. Advancements in Imitation Learning techniques have led to successes in learning tasks such as robot locomotion, helicopter flight and game playing. There have also been research efforts to make training easier for demonstrators by allowing them to interact with the agent and provide feedback as it performs the task, an approach known as Interactive IL [3, 6, 5].
One limitation of current IL and Interactive IL techniques is that they typically require demonstrations or feedback in the action-space of the agent. Humans commonly learn behaviors by understanding the required state transitions of a task, not the precise actions to be taken. Additionally, providing demonstration or feedback in the action-space can be difficult for demonstrators. For instance, teaching a robotic arm manipulation task with joint level action commands requires considerable demonstrator expertise. It would be easier to instead provide state-space information such as the Cartesian position of either the end effector or the object to be manipulated (although this requires an inverse kinematics/dynamics model).
In this paper, a novel Interactive Learning method is proposed that utilizes feedback in state-space to learn behaviors. The performance of the proposed method (TIPS) is evaluated on various control tasks from the OpenAI Gym toolkit and on manipulation tasks using a KUKA LBR iiwa robot arm. The method compares favorably to other Imitation and Interactive Learning methods in non-expert demonstration scenarios.
2 Related Work
In recent literature, several Interactive Imitation Learning methods have been proposed that enable demonstrators to guide agents by providing corrective action labels, corrective feedback [2, 5] or evaluative feedback [13, 7]. For non-expert demonstrators, providing corrective feedback in the form of adjustments to the current states/actions being visited/executed by the agent is easier than providing exact state/action labels. Moreover, evaluative feedback methods require demonstrators to score good and bad behaviors, which could be ambiguous when scoring multiple sub-optimal agent behaviors.
Among corrective feedback learning techniques, a typical approach is to utilize corrections in the action-space [5, 18] or to use predefined advice-operators [3, 2] to guide agents. However, providing feedback in the action-space is often not intuitive for demonstrators (e.g., action-space as joint torques or angles of a robotic arm). Further, defining advice-operators requires significant prior knowledge about the environment as well as the task to be performed, thus limiting the generalizability of such methods. This work proposes an alternative approach of using corrective feedback in state-space to advise agents.
There has been recent interest in Imitation Learning methods that learn using state/observation information only. This problem is termed Imitation from Observation (IfO) and enables learning from state trajectories of humans performing the task. To compute the requisite actions, many IfO methods propose using a learnt Inverse Dynamics Model (IDM) [15, 21], which maps state transitions to the actions that produce them. However, teaching agents using human interaction in an IfO setting has not been studied.
In our approach, we combine the concept of state transition to action mapping by learning inverse dynamics with an Interactive Learning framework. The demonstrator provides state-space corrective feedback to guide the agent’s behavior towards desired states. Meanwhile, an inverse dynamics scheme is used to ensure the availability of the requisite actions to learn the policy.
3 Teaching Imitative Policies in State-space (TIPS)
The principle of TIPS is to allow the agent to execute its policy while a human demonstrator observes and suggests modifications to the state visited by the agent at any given time. This feedback is used to update the agent’s policy online, i.e., during the execution itself.
3.1 Corrective Feedback
Human feedback ($h_t$, at time step $t$) is in the form of binary signals implying an increase/decrease in the value of a state (i.e., $h_t \in \{-1, 0, +1\}$, where zero implies no feedback). Each dimension of the state has a corresponding feedback signal. The assumption is that non-expert human demonstrators, who may not be able to provide accurate correction values, could still provide binary signals which show the trend of state modification. To convert these signals to a modification value, an error constant hyper-parameter $e$ is chosen for each state dimension. Thus, the human desired state ($s^{des}_{t+1}$) is computed as:
$$s^{des}_{t+1} = s_t + h_t \cdot e$$
The feedback ($h_t$), the error constant and the desired modification can be defined either over the full state or over a partial state. Thus, the demonstrator may suggest modifications only in the state dimensions that are well understood or easy to observe. Moreover, even though the change in state computed using binary feedback may be larger/smaller than what the demonstrator is suggesting, previous methods [5, 18] have shown that it is sufficient to capture the trend of modification. If a sequence of feedback provided in a state is in the same direction (increase/decrease), the demonstrator is suggesting a large magnitude change. Conversely, if the feedback alternates between increase/decrease, a smaller change around a set-point is suggested. To obtain this effect when updating the policy, information from past feedback is also used via a replay memory mechanism as in [18, 17].
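The conversion from binary feedback to a desired state can be illustrated with a few lines of Python. This is only a minimal sketch of the rule described above, not the authors' implementation: the function name `desired_state` and the example values are hypothetical, and a feedback value of zero is used to mark state dimensions for which no feedback is given.

```python
from typing import List, Sequence


def desired_state(state: Sequence[float],
                  feedback: Sequence[int],
                  error_constant: Sequence[float]) -> List[float]:
    """Shift each state dimension by h * e, with h in {-1, 0, +1}."""
    if not (len(state) == len(feedback) == len(error_constant)):
        raise ValueError("state, feedback and error_constant must have equal length")
    if any(h not in (-1, 0, 1) for h in feedback):
        raise ValueError("feedback signals must be -1, 0 or +1")
    # Dimensions with h == 0 (no feedback) are left unchanged.
    return [s + h * e for s, h, e in zip(state, feedback, error_constant)]


# Hypothetical 3-dimensional state; feedback is given on dimensions 0 and 2 only.
s_des = desired_state([0.5, -1.0, 2.0], [+1, 0, -1], [0.1, 0.1, 0.05])
```

Repeated feedback in the same direction moves the desired state further with every step, while alternating feedback keeps it near a set-point, which matches the intuition given in the text.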
3.2 Mapping state transitions to actions
To realize the transition from the current state to the desired state, i.e., $s_t \to s^{des}_{t+1}$, an appropriate action ($a_t$) needs to be computed. For this, some methods have proposed using or learning an Inverse Dynamics Model (IDM) [15, 21]. In this work we assume that an IDM is not already available, which can be the case in environments with dynamics that are unknown or difficult to model. Moreover, IDMs are ill-suited in our case for two main reasons. Firstly, the feedback provided by the demonstrator can be in the partial state-dimension, leading to ambiguity regarding the desired state transition in the remaining dimensions. Secondly, the desired state transition ($s_t \to s^{des}_{t+1}$) may be infeasible: there may not exist an action that leads to the human suggested state transition in a single time step.
We propose to instead use an indirect inverse dynamics method to compute requisite actions. Possible actions ($a$) are sampled and a learnt Forward Dynamics Model (FDM) ($f$) is used to predict the next states ($\hat{s}_{t+1} = f(s_t, a)$) for these actions. The action that results in a subsequent state that is closest to the desired state is chosen. The desired and predicted states can be in the full or partial state dimensions. Mathematically, we can write the action computation as:
$$a_t = \arg\min_{a} \left\lVert f(s_t, a) - s^{des}_{t+1} \right\rVert$$
where $a$ ranges over $N_a$ uniform samples of the action-space.
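The sampling-based action computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: the function `select_action`, its arguments, the use of a Euclidean distance, and the toy forward model are all hypothetical, and the learnt FDM is replaced by a hand-written function so the example is self-contained.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def select_action(state: Sequence[float],
                  desired: Sequence[float],
                  fdm: Callable[[Sequence[float], Sequence[float]], List[float]],
                  low: Sequence[float],
                  high: Sequence[float],
                  n_samples: int,
                  feedback_dims: Optional[Sequence[int]] = None,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Return the sampled action whose predicted next state is closest to `desired`.

    Only the dimensions in `feedback_dims` are compared, which allows
    feedback in a partial state (all dimensions are used if it is None).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    rng = rng or random.Random()
    dims = range(len(desired)) if feedback_dims is None else feedback_dims
    best_action: List[float] = []
    best_dist = math.inf
    for _ in range(n_samples):
        # Uniform sample from the box [low, high] in action-space.
        action = [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
        predicted = fdm(state, action)
        dist = math.sqrt(sum((predicted[i] - desired[i]) ** 2 for i in dims))
        if dist < best_dist:
            best_action, best_dist = action, dist
    return best_action


def toy_fdm(state, action):
    """Hypothetical stand-in for a learnt FDM: a 1-D integrator."""
    return [state[0] + 0.1 * action[0]]


# Desired state 0.05 from state 0.0 corresponds to an action close to 0.5.
chosen = select_action([0.0], [0.05], toy_fdm, low=[-1.0], high=[1.0],
                       n_samples=500, rng=random.Random(0))
```

Because the search only evaluates sampled candidates, an infeasible desired state still yields the closest achievable action instead of failing, which is the property the text relies on.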
3.3 Training Mechanism
We represent the policy using a feed-forward artificial neural network and use a training mechanism inspired by D-COACH. This involves an immediate training step using the current state-action sample as well as a training step using a batch sampled from a demonstration replay memory. Lastly, to ensure sufficient learning iterations to train the neural network, a batch replay training step is also carried out periodically, once every fixed number of time-steps (the periodic policy update interval listed in Table 1).
Crucially, the computed action is also executed immediately by the agent. This helps speed up the learning process since further feedback can be received in the demonstrator requested state to learn the next action to be taken. The overall learning framework of TIPS can be seen in Figure 1.
The overall TIPS method consists of two phases:
- In an initial model-learning phase, samples are generated by executing an exploration policy (random policy implementation) and used to learn an initial FDM. The samples are added to an experience buffer that is used later when updating the model.
- In the teaching phase, the policy is trained using an immediate update step every time feedback is given, as well as a periodic update step using past feedback from a demonstration buffer. Moreover, to improve the FDM, it is trained after every episode using the consolidated new and previous experience gathered in the experience buffer.
The pseudo-code of TIPS can be seen in Algorithm 1. In our implementation of TIPS (github.com/sjauhri/Interactive-Learning-in-State-space), the FDM and policy are represented using neural networks and the Adam variant of stochastic gradient descent is used for training.
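The schedule of policy updates in the teaching phase (an immediate step on the new sample, a replay step on a batch from the demonstration buffer, and a periodic replay step) can be sketched as below. This is a simplified illustration, not the released code: the neural network policy is replaced by a hypothetical one-parameter linear policy, the FDM and its experience buffer are omitted, and the names `LinearPolicy`, `teaching_step`, and all numeric values are assumptions made for the example.

```python
import random
from collections import deque


class LinearPolicy:
    """Toy 1-D policy a = w * s, a stand-in for the neural network policy."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.learning_rate = learning_rate

    def act(self, state):
        return self.w * state

    def update(self, batch):
        # One gradient step on the mean squared error over (state, action) pairs.
        grad = sum(2.0 * (self.w * s - a) * s for s, a in batch) / len(batch)
        self.w -= self.learning_rate * grad


def sample_batch(buffer, batch_size, rng):
    """Draw up to batch_size samples from the demonstration buffer."""
    return rng.sample(list(buffer), min(batch_size, len(buffer)))


def teaching_step(policy, demo_buffer, state, action_label, t,
                  update_interval, batch_size, rng):
    """Apply the update schedule for one time step.

    `action_label` is the action computed from the demonstrator's feedback,
    or None when no feedback was given at this step.
    """
    if action_label is not None:
        policy.update([(state, action_label)])           # immediate update
        demo_buffer.append((state, action_label))
        policy.update(sample_batch(demo_buffer, batch_size, rng))  # replay update
    if t % update_interval == 0 and len(demo_buffer) > 0:
        policy.update(sample_batch(demo_buffer, batch_size, rng))  # periodic update


rng = random.Random(0)
policy = LinearPolicy()
demo_buffer = deque(maxlen=1000)
for t in range(1, 301):
    state = rng.uniform(-1.0, 1.0)
    # Hypothetical teacher: feedback at every step, desired behavior a = 2 * s.
    teaching_step(policy, demo_buffer, state, 2.0 * state, t,
                  update_interval=10, batch_size=16, rng=rng)
```

In this toy setting the policy parameter converges to the teacher's gain of 2. The periodic step matters most when feedback becomes sparse, since it keeps training on past corrections even when no new feedback arrives.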
4 Experimental Setting
Experiments are set up to evaluate TIPS and compare it to other methods when teaching simulated tasks with non-expert human participants as demonstrators (section 4.1). We also validate the method on a real robot by designing two manipulation tasks with a robotic arm (section 4.2).
For the evaluation of TIPS, we use three simulated tasks from the OpenAI Gym toolkit, namely: CartPole, Reacher and LunarLanderContinuous. The action-space is discrete in the CartPole environment and continuous in the others. A simplified version of the Reacher task with a fixed target position is used. The cumulative reward obtained by the agent during execution is used as a performance metric. The parameter settings for each of the domains/tasks in the experiments can be seen in Table 1.
The performance of a TIPS agent is compared against the demonstrator’s own performance when executing the task via tele-operation, and against other agents trained via IL techniques using the tele-operation data. It is also of interest to highlight the differences between demonstration in state-space versus action-space. For this, both tele-operation and corrective feedback learning techniques in state and action spaces are compared.
The following techniques are used for comparison.
- Tele-operation in Action-space: Demonstrator executes task using action commands.
- Tele-operation in State-space: Demonstrator executes task by providing state-space information (as per Table 1) with actions computed using inverse dynamics in a similar way as TIPS.
- Behavioral Cloning (BC): Supervised learning to imitate the demonstrator using state-action demonstration data recorded during tele-operation. (Only successful demonstrations are used, i.e., those with a return of at least 40% in the min-max range).
- D-COACH: Interactive IL method that uses binary corrective feedback in the action space. The demonstrator suggests modifications to the current actions being executed to train the agent as it executes the task.
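The success criterion used to select demonstrations for Behavioral Cloning (a return of at least 40% in the min-max range) can be made concrete with a short sketch. The source does not state how the minimum and maximum returns are obtained, so the bounds below, the episode records, and the function names are hypothetical.

```python
def normalized_return(ret, ret_min, ret_max):
    """Map a return to [0, 1] relative to the given min-max range."""
    if ret_max <= ret_min:
        raise ValueError("ret_max must be greater than ret_min")
    return (ret - ret_min) / (ret_max - ret_min)


def filter_successful(episodes, ret_min, ret_max, threshold=0.4):
    """Keep episodes whose normalized return reaches the threshold."""
    return [ep for ep in episodes
            if normalized_return(ep["return"], ret_min, ret_max) >= threshold]


# Hypothetical tele-operation episodes for a task with returns in [0, 200].
episodes = [{"id": 0, "return": 50.0},
            {"id": 1, "return": 100.0},
            {"id": 2, "return": 150.0}]
kept = filter_successful(episodes, ret_min=0.0, ret_max=200.0)
```

Here the normalized returns are 0.25, 0.5 and 0.75, so only the last two episodes would be used as training data.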
Table 1: Parameter settings for each task.
| Parameter | CartPole | Reacher | LunarLander | Fishing | Laser Drawing |
|---|---|---|---|---|---|
| Number of exploration samples | 500 | 10000 | 20000 | 4000 | 4000 |
| States for feedback | Pole tip position | x-y position of end effector | Vertical, angular position | x-z position of end effector | x-y position of laser point |
| Error constant | 0.1 | 0.008 | 0.15 | 0.05 | 0.02 |
| Number of action samples | 10 | 500 | 500 | 1000 | 1000 |
| Periodic policy update interval | 10 | 10 | 10 | 10 | 10 |
| FDM network layer sizes | 16, 16 | 64, 64 | 64, 64 | 32, 32 | 32, 32 |
| Policy network layer sizes | 16, 16 | 32, 32 | 32, 32 | 32, 32 | 32, 32 |
Experiments were run with non-expert human participants (age group 25-30 years) who had no prior knowledge of the tasks. A total of 22 sets of trials were performed (8, 8 and 6 participants for the CartPole, Reacher, and LunarLander tasks respectively). Participants performed four experiments: tele-operation in action-space, tele-operation in state-space, training an agent using D-COACH and training an agent using TIPS. To compensate for the learning effect, the order of the experiments was changed for every participant. Participants used a keyboard input interface to provide demonstration/feedback to the agent. When performing tele-operation, the demonstrated actions and the corresponding states were recorded. Tele-operation was deemed to be complete once no new demonstrative information could be provided. When training interactively using D-COACH and TIPS, the demonstrators provided feedback until no more agent performance improvement was observed.
To compare the demonstrator’s task load, participants were also asked to fill out the NASA Task Load Index Questionnaire after each experiment, in which ratings on the mental demand, physical demand, time pressure, performance, effort and frustration during the experiment were obtained.
4.2 Validation tasks on robot
To validate the application of TIPS on a real robot, two manipulation tasks were designed: a ‘Fishing’ and a ‘Laser Drawing’ task. The tasks were performed with a velocity controlled KUKA LBR iiwa 7 robot arm.
In the Fishing task, a ball is attached to the end-effector of the robot by a thread, and the objective is to move the swinging ball into a nearby cup (similar to placing a bait attached to a fishing rod). To reduce the complexity of the task, the movement of the robot end-effector (and ball) is restricted to a 2-D x-z Cartesian plane. An illustration of the task can be seen in Figure 2(a). To teach the task using TIPS, a keyboard interface is used to provide feedback in the x-z Cartesian robot end-effector position. A learnt forward dynamics model is used to predict the position of the end-effector based on the joint commands (actions) sent to the robot. To measure task performance, a reward function is defined which penalizes large actions as well as the distance between the ball and the center of the cup.
In the Laser Drawing task, a laser pointer is attached to the end-effector of the robot and the objective is to ‘draw’ characters on a whiteboard (i.e. move the camera-tracked laser point in an appropriate trajectory) by moving two of the robot’s joints (3rd and 5th). An illustration of the task can be seen in Figure 2(b). To teach the task, feedback is provided in the x-y position of the laser point in the plane of the whiteboard. In this case, learning the dynamics/kinematics of just the robot joints is insufficient. We thus learn a forward dynamics model that predicts the position of the laser point on the whiteboard based on the joint commands (actions), but with coordinates in the frame of the whiteboard image observed by the camera. The reward function used to measure task performance is based on the Hausdorff distance between the shape drawn by the robot and a reference shape/trajectory.
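For readers unfamiliar with the Hausdorff distance, the following sketch computes it between two sets of 2-D points by brute force. It illustrates only the distance itself; how the authors turn it into a reward is not specified in the text, and the point sets below are hypothetical.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in points_a to its nearest point in points_b."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    return max(
        min(math.hypot(ax - bx, ay - by) for bx, by in points_b)
        for ax, ay in points_a
    )


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two 2-D point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Hypothetical reference stroke and a drawn trajectory that overshoots it.
reference = [(0.0, 0.0), (1.0, 0.0)]
drawn = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
distance = hausdorff(reference, drawn)
```

The distance is 1.0 here because the overshoot point (1, 1) lies one unit from the nearest reference point, even though every reference point is matched exactly. This sensitivity to the worst deviation is what makes the measure suitable for comparing drawn shapes.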
Note that since these experiments are run only to validate the application of TIPS to a real system, comparisons are not made with other learning methods.
Performance: Figure 3(a) shows the performance obtained for the tasks (averaged over all participants) using tele-operation, agents trained via IL techniques and agents trained using TIPS. Tele-operation proves challenging for the demonstrator, especially for time-critical tasks such as CartPole and LunarLander where the system is inherently unstable. Agents trained using IL techniques (BC and GAIL) suffer from inconsistency as well as lack of generalization of the demonstrations. For the CartPole task, this problem is not as significant given the small state-action space. Interactively learning via TIPS enables continuous improvement over time and leads to the highest agent performance.
Figure 3(b) compares the performance of state-space interactive learning (TIPS) and action-space interactive learning (D-COACH) over training episodes. The advantage of state-space feedback is significant in terms of learning efficiency for the CartPole and Reacher tasks, and a significant increase in final performance is observed for the Reacher task. For the LunarLander task, no performance improvement is observed, although training with TIPS takes less time to achieve similar performance. While state-space feedback provides a stabilizing effect on the lander and leads to fewer crashes, participants struggle to teach it to land and thus the agent ends up flying out of the frame.
Demonstrator Task Load: The NASA Task Load Index ratings are used to capture demonstrator task load when teaching using state-space (TIPS) and action-space (D-COACH) feedback and the results can be seen in Table 2 (Significant differences in rating are highlighted).
When teaching using TIPS, participants report lower ratings for the CartPole and Reacher tasks with the mental demand rating reduced by about 40% and 16% and participant frustration reduced by about 35% and 25% respectively. Thus, the merits of state-space interactive learning are clear. However, these advantages are task specific. For the LunarLander task, demonstration in state and action-spaces is equally challenging, backed up by little change in the ratings.
It is noted that actions computed based on feedback using TIPS can be irregular due to inaccuracies in model learning. This was observed for the Reacher and LunarLander tasks where model learning is relatively more complex as compared to CartPole. Since handling such irregular action scenarios requires demonstrator effort, this can diminish the advantage provided by state-space feedback.
5.2 Validation Tasks
The agent performance and demonstrator feedback rate over learning episodes can be seen in Figure 4.
In our experiments for the Fishing task, the demonstrator’s strategy is to move the end effector towards a position above the cup and choose the appropriate moment to bring the end effector down such that the swinging ball falls into the cup. The agent successfully learns to reliably place the ball in the cup after 60 episodes of training (each episode is 30 seconds long). After about 90 episodes, the agent performance is further improved in terms of speed at which the task is completed (improvement in return from -15 to -10). The feedback rate reduces over time as the agent performs better and only some fine-tuning of the behavior is needed after 60 episodes (Figure 4).
For the Laser Drawing task, the demonstrator teaches each character separately and uses a reference drawn on the whiteboard as the ground truth. The agent successfully learns to draw characters that closely resemble the reference (Figure 2(c)) after 80 episodes of training (each episode is 5 seconds long). The feedback rate reduces over time as the basic character shape is learnt and the behavior is fine-tuned to closely match the reference character.
A video of the training and learnt behavior for both tasks is available at: youtu.be/uTcOMPN-sFw.
In experiments with non-expert human demonstrators, our proposed method TIPS outperforms IL techniques such as BC and GAIL as well as Interactive Learning in action-space (D-COACH). Moreover, the state-space feedback mechanism also leads to a significant reduction in demonstrator task load. With these results we have illustrated the viability of TIPS for non-expert demonstration scenarios and have also highlighted the merits of state-space Interactive Learning. Our method also has the benefit of being applicable to both continuous and discrete action problems, unlike action-space feedback methods such as COACH (continuous actions only).
To compute actions, we learn an FDM and assume no prior knowledge of the dynamics of the environment. While this is advantageous in environments with dynamics that are unknown or difficult to model, a caveat is that learning the FDM from experience can be challenging and can require a large number of environment interactions. A solution to this could be to use smarter exploration strategies when acquiring experience samples to learn a more accurate FDM. Another consideration is that, for action selection, candidate actions are sampled from the entire action-space. This is not scalable to high-dimensional continuous action spaces. Despite these caveats, we have demonstrated the practical viability of TIPS on a real system by training agents for two robotic manipulation tasks.
- [1] (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639.
- [2] (2011) Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robotics and Autonomous Systems 59 (3-4), pp. 243–255.
- [3] (2008) Learning robot motion control with demonstration and advice-operators. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 399–404.
- [4] (2016) OpenAI Gym.
- [5] (2019) An interactive framework for learning continuous actions policies based on corrective feedback. Journal of Intelligent & Robotic Systems 95 (1), pp. 77–97.
- [6] Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research 34, pp. 1–25.
- [7] Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307.
- [8] (1988) Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. In Advances in Psychology, Vol. 52, pp. 139–183.
- [9] (2018) Stable baselines. GitHub. https://github.com/hill-a/stable-baselines
- [10] (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4565–4573.
- [11] (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (9), pp. 850–863.
- [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [13] (2009) Interactively shaping agents via human reinforcement: the TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, K-CAP ’09, New York, NY, USA, pp. 9–16.
- [14] (2018) Imitation from observation: learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125.
- [15] (2017) Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2146–2153.
- [16] (2018) An algorithmic perspective on imitation learning. arXiv preprint arXiv:1811.06711.
- [17] (2020) Interactive learning of temporal features for control: shaping policies and state representations from human feedback. IEEE Robotics & Automation Magazine.
- [18] (2019) Continuous control for high-dimensional state spaces: an interactive learning approach. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7611–7617.
- [19] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635.
- [20] (2016-01) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–489.
- [21] (2018-07) Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4950–4957.
- [22] (2011) Optimization and learning for rough terrain legged locomotion. The International Journal of Robotics Research 30 (2), pp. 175–191.