A handheld robot shares properties of a handheld tool while being enhanced with autonomous motion as well as the ability to process task-relevant information and user signals. Earlier work in this field explored the communication between user and robot to improve cooperation. Such one-way communication of task planning, however, is limited in that the robot has to lead the user. As users exert their own will and decisions, task conflicts emerge, which in turn cause user frustration and decrease cooperative task performance.
As a starting point for addressing this problem, extended user perception can be introduced to allow the robot to estimate the user’s point of attention via eye gaze in 3D space during task execution. An estimate of the user’s visual attention informs the robot about the user’s areas of interest. While introducing attention was preferred, particularly for temporally demanding tasks, it is still limiting. What is needed is a model that captures not only where the user is attending but also what the user intends to do. A model of intention would allow the robot to infer the user’s goal in the near future and go beyond reacting to immediate decisions only.
Intention inference has caught researchers’ attention in recent years, and promising solutions have been achieved through observing the user’s eye gaze, body motion or task objects. These contributions target safe interactions between humans and sedentary robots with shared workspaces. Thus, the question remains open whether there is a model that suits the setup of a handheld robot, which is characterised by close shared physical dependency and a working-together rather than a turn-taking cooperative strategy.
Our work is guided by the following research questions:
1. How can user intention be modelled in the context of a handheld robot task?
2. To what extent does intention prediction affect the cooperation with a handheld robot?
For our study, we use the open robotic platform introduced in earlier work (3D CAD models available from handheldrobotics.org) in combination with a previously reported eye tracking system. Within a simulated assembly task, eye gaze information is used to predict subsequent user actions. The study has two principal parts: modelling user intention, followed by testing the model in an assistive pick-and-place task. Our contribution is an intention prediction model with real-time capabilities that allows for human-robot collaboration through online plan adaptation in assistive tasks. Figure 2 shows an overview of our proposed system.
II Background and Related Work
In this section, we summarise earlier work on handheld robots and their control based on user perception. Furthermore, we review existing methods for intention inference with a focus on human gaze behaviour.
II-A Handheld Robots
Early handheld robot work used a trunk-shaped robot with 4 DoF to explore issues of autonomy and task performance. This was later upgraded to a 6-DoF (joint space) mechanism and used gestures, such as pointing, to study user guidance. These earlier works demonstrate how users benefit from the robot’s quick and accurate movement while the robot profits from the human’s tactical motion. Most importantly, cooperative performance increased with the robot’s level of autonomy. It was furthermore found that cooperative performance significantly increases when the robot communicates its plans, e.g. via a robot-mounted display.
Within this series of work, another problem was identified: the robot does not sense the user’s intention and thus potential conflicts with the robot’s plan remain unresolved. For example, when the user pointed the robot towards a valid subsequent goal, the robot might have already chosen a different one and kept pointing towards it rather than adapting its task plan. This led to irritation and frustration in users on whom the robot’s plan was imposed.
Efforts towards involving user perception in the robot’s task planning were made in our recent work on estimating user attention. The method was inspired by work from Land et al. on how human eye gaze is closely related to manual actions. The attention model measures the current visual attention to bias the robot’s decisions. In a simulated Space Invaders-style task, different levels of autonomy were tested over varying configurations of speed demands. It was found that both the fully autonomous mode (robot makes every decision) and the attention-driven mode (robot decides based on gaze information) outperform manual task execution. Notably, for high speed levels, the increased performance was most evident for the attention-driven mode, which was also rated as more helpful and perceived as more cooperative than the fully autonomous mode.
As opposed to an intention model, the attention model would react to the current state of eye gaze information only, rather than using its history to make predictions about the user’s future goals. We suggest that this would be required for cooperative task solving for complex tasks like assembly where there is an increased depth of subtasks.
II-B Intention Prediction
Intention estimation in robotics is in part driven by the demand for safe human-robot interaction and efficient cooperation.
Ravichandar et al. investigated intention inference based on human body motion. Using Microsoft Kinect motion tracking as an input for a neural network, reaching targets were successfully predicted ahead of the hand touching the object. Similarly, Saxena et al. introduced a measure of affordance to make predictions about human actions and reached accuracies of 84.1% and 74.4% for two anticipation times, respectively. Later, Ravichandar et al. added human eye gaze tracking to their system and used the additional data for pre-filtering to merge it with the existing motion-based model, which increased the anticipation time.
Huang et al. used gaze information from a head-mounted eye tracker to predict customers’ choices of ingredients for sandwich making. Using a support vector machine (SVM), an accuracy of approximately 76% was achieved, with predictions made ahead of the verbal request. In subsequent work, Huang & Mutlu used the model as a basis for a robot’s anticipatory behaviour, which led to more efficient collaboration compared to following verbal commands only.
We note that the above work targets intention inference for external robots, which share a workspace with a human but can move independently. It is unclear whether these methods are suitable for the close cooperation found in the handheld robot setup.
II-C Human Gazing Behaviour
The intention model presented in this paper is mainly driven by eye gaze data. Therefore, we review work on human gaze behaviour to inform the underlying assumptions of our model.
Land et al. found that fixations towards an object often precede a subsequent manual interaction. Subsequent work revealed that the latency between eye and hand varies between different tasks. Similarly, Johansson et al. found that objects are most salient for humans when they are relevant for task planning, and preceding saccades have been linked to short-term memory processes.
III Prediction of User Intention
In this section, we describe how intention prediction is modelled for the context of a handheld robot on the basis of an assembly task.
III-A Data Collection
We chose a simulated version of a block copying task which has been used in the context of work in hand-eye coordination [15, 16]. Participants of the data collection trials were asked to use the handheld robot (cf. figure 3) to pick blocks from a stock area and place them in the workspace area at one of the associated spaces indicated by a shaded model pattern. The task was simulated on an LCD TV display and the robot remained motionless during the data collection task to avoid distraction. We drew inspiration from a block design IQ test and decided to use black and white patterns instead of colours. That way, a match with the model additionally depends on the block’s orientation, which adds further complexity. An overview of the task can be seen in figure 4; figure 5 shows examples of possible picking and placing moves.
In order to pick or place pieces, users have to point the robot’s tip towards and close to the desired location and pull/release a trigger in the handle. The position of the robot and its tip is measured via a motion tracking system (OptiTrack: https://optitrack.com). The handle houses another button which can be used to rotate the grabbed piece. The opening or closing process of the virtual gripper is animated on the screen. If the participant tries to place a mismatching piece, the piece goes back to the stock and has to be picked up again. Participants are asked to solve the task swiftly; it is completed when all model pieces are copied. Throughout the task execution, we kept track of the user’s eye gaze using a robot-mounted remote eye tracker in combination with a previously reported 3D gaze model. Figure 1 shows an example of a participant solving the puzzle.
For the data collection, 16 participants (7 females, mean age = 25, SD = 4) were recruited. Each completed one practice trial to get familiar with the procedure, followed by another three trials for data collection, where stock pieces and model pieces were randomised prior to execution. The pattern consists of 24 parts with an even count of the 4 types. The task starts with 5 pre-completed pieces to increase the diversity of solving sequences, leaving 19 pieces to be completed by the participant. That way, a total of 912 episodes of picking and dropping were recorded (16 participants × 3 trials × 19 pieces).
III-B User Intention Model
In the context of our handheld robot task, we define intention as the user’s choice of which object to interact with next i.e. which stock piece to pick and on which pattern field to place it.
Based on our literature review, our modelling is guided by the following assumptions:
1. An intended object attracts the user’s visual attention prior to interaction.
2. During task planning, the user’s visual attention is shared between the intended object and other (e.g. subsequent) task-relevant objects.
As a first step towards feature construction, the gaze information for an individual object was used to extract a visual attention profile (VAP), which is defined as the continuous probability of an object being gazed at. Let $\mathbf{x}_{gaze}(t)$ be the 2D point of intersection between the gaze ray and the TV screen surface and $\mathbf{x}_i$ the 2D position of the $i$-th object on the screen. Then the gaze position can be compared to each object using the Euclidean distance:
$$d_i(t) = \lVert \mathbf{x}_{gaze}(t) - \mathbf{x}_i \rVert$$
As a decrease of $d_i(t)$ implies increased visual attention, the distance profile can be converted to a visual attention profile (VAP) using the following equation:
$$P_{gazed,i}(t) = \exp\left(-\frac{d_i(t)^2}{2\sigma^2}\right)$$
where $\sigma$ defines the gaze distance resulting in a significant drop of $P_{gazed,i}$; it was set based on the pieces’ size and tracking tolerance. The intention model uses the VAP of a fixed time window before the point in time of the prediction. Given the data update frequency, the profile is discretised into a vector of 300 entries (cf. example in figure 6).
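The conversion from gaze distance to an attention profile can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes a Gaussian-shaped falloff with a width parameter `sigma`, and the function names, the sample coordinates and the value of `sigma` are hypothetical.

```python
import math

def gaze_distance(gaze_xy, object_xy):
    """Euclidean distance between the 2D gaze point and an object position on the screen."""
    return math.hypot(gaze_xy[0] - object_xy[0], gaze_xy[1] - object_xy[1])

def visual_attention_profile(gaze_track, object_xy, sigma):
    """Map each gaze sample to a value in (0, 1]; 1 means the gaze lies exactly on the object.

    Assumption: a Gaussian falloff exp(-d^2 / (2 * sigma^2)), where sigma is the
    gaze distance at which the attention value has dropped noticeably.
    """
    return [
        math.exp(-0.5 * (gaze_distance(g, object_xy) / sigma) ** 2)
        for g in gaze_track
    ]

# Hypothetical example: three gaze samples approaching an object at (100, 50).
gaze_track = [(220.0, 50.0), (160.0, 50.0), (100.0, 50.0)]
vap = visual_attention_profile(gaze_track, (100.0, 50.0), sigma=60.0)
```

In this toy example the profile rises monotonically as the gaze approaches the object and reaches 1 once the gaze point coincides with it.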
The predictions for picking and placing actions were modelled separately as they require different feature sets. As mentioned above, earlier studies of gaze behaviour during block copying and assembly suggest that the eye gathers information about both what to pick and where to place it prior to picking actions. For this reason, we combined pattern and stock information for picking predictions for each available candidate, resulting in the following feature selection:
1. The VAP of the object itself.
2. The VAP of the matching piece in the pattern. If there are several, the one with the maximum visual attention is picked.
This is in line with our assumptions 1 and 2. Both features are vectors of real numbers between 0 and 1 with a length of 300. For the prediction of the dropping location, feature 2 is not applicable as the episode finishes with the placing of the part; hence only feature 1 (a vector of length 300) is used for prediction. Note that this feature contains information about fixation durations as well as saccade counts.
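The feature construction described above could look roughly like the following sketch. Two points are assumptions not stated in the text: the two profiles are concatenated into one input vector, and "maximum visual attention" is read as the highest mean VAP value. All names and numbers are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def picking_features(stock_vap, matching_pattern_vaps):
    """Combine a stock piece's own VAP with the VAP of its most-attended matching pattern field.

    Assumption: the most-attended field is the one with the highest mean VAP,
    and both profiles are concatenated into a single feature vector.
    """
    best_match = max(matching_pattern_vaps, key=mean)
    return list(stock_vap) + list(best_match)

def placing_features(pattern_vap):
    """Placing predictions use only the VAP of the candidate pattern field itself."""
    return list(pattern_vap)

# Hypothetical short profiles (the study uses 300 samples per profile).
stock_vap = [0.1, 0.8, 0.9]
pattern_vaps = [[0.2, 0.1, 0.0], [0.6, 0.7, 0.1]]
features = picking_features(stock_vap, pattern_vaps)
```

With 300-sample profiles, a picking feature vector built this way would have 600 entries and a placing feature vector 300.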
An SVM was chosen as the prediction model, as this type of supervised machine learning model has been used for similar classification problems in the past. We divided the sets of VAPs into two categories, one where the associated object was the intended object (labelled as chosen = 1) and another one for the objects that were not chosen for interaction (labelled as chosen = 0). Training and validation of the models were done through 5-fold cross-validation.
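As a rough illustration of the validation scheme, the sketch below splits a labelled sample set into five folds so that every sample is used for validation exactly once. The classifier itself is omitted; any SVM implementation could be trained on `train_idx` and scored on `val_idx`. The function name, the seed and the toy labels are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs so that every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Hypothetical labelled set: 1 = chosen object, 0 = object not chosen.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
splits = list(k_fold_indices(len(labels), k=5))
```

The reported accuracy would then be the average of the per-fold validation accuracies.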
The accuracy of predicting the chosen label for individual objects is 89.6% for picking actions and 98.3% for placing. However, the combined decision is sometimes conflicting, e.g. when several stock pieces are predicted to be the intended ones. This is resolved by selecting the one with the highest probability of being chosen in a one-vs-all setup. This configuration was tested for scenarios with the biggest choice, e.g. when all 4 stock parts (random chance = 25%) would be a reasonable choice to pick or when the piece to be placed matches 4 to 6 different pattern pieces (random chance = 17-25%). This results in a correct prediction rate of 87.9% for picking and 93.25% for placing actions when the VAPs of the time up to just before the action are used.
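The conflict resolution step amounts to an argmax over the per-object probabilities. A minimal sketch, with hypothetical object names and probability values:

```python
def resolve_intended_object(chosen_probabilities):
    """Return the candidate whose classifier output has the highest 'chosen' probability."""
    return max(chosen_probabilities, key=chosen_probabilities.get)

# Hypothetical outputs for four stock pieces; two exceed 0.5 and would otherwise conflict.
probabilities = {"stock_A": 0.62, "stock_B": 0.81, "stock_C": 0.10, "stock_D": 0.33}
intended = resolve_intended_object(probabilities)

# Chance level when all four pieces are equally plausible choices.
chance_level = 1 / len(probabilities)
```

Here `stock_B` is selected even though `stock_A` was also classified as chosen, and the chance level of 0.25 corresponds to the four-candidate picking scenario.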
IV Results of Intention Modelling
Having trained and validated the intention prediction model for the case where the VAPs end at the time of interaction with the associated object, we are now interested in knowing to what extent the intention model predicts accurately at some time prior to interaction. To answer this question, we extend our model analysis by calculating the prediction accuracy as a function of the time offset between the end of the anticipation window and the interaction. Within a 5-fold cross-validation setup, the anticipation window is iteratively moved away from the time of interaction and the associated VAPs are used to make a prediction about the subsequent user action using the trained SVM models. The validation is based on the aforementioned low-chance subsets, so that the chance of a correct prediction through randomly selecting a piece would be at most 25%. The shift of the anticipation window over the data set is done with a step width of 1 frame. This is done both for predicting which piece is picked up next and for inferring where it is going to be placed. For the time offsets 0, 0.5 and 1 seconds, the prediction of picking actions yields an accuracy of 87.94%, 72.36% and 58.07%, respectively. The placing intention model maintains a high accuracy over a longer time span, with an accuracy of 93.25%, 80.06% and 63.99% for the time offsets 0, 1.5 and 3 seconds. In order to interpret these differences in performance, we investigated whether there is a difference between the mean duration of picking and placing actions. We applied a two-sample t-test and found that the picking time is significantly shorter than the placing time.
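The offset-dependent evaluation can be sketched as a window that slides backwards from the moment of interaction, one frame at a time. The code below is an illustration under assumptions: the episode layout, the function names and the threshold-based `toy_predict` (a stand-in for the trained SVM) are hypothetical.

```python
def window_before(profile, action_index, window_len, offset):
    """Return the `window_len` samples that end `offset` frames before the action.

    Returns None if the shifted window would start before the recording.
    """
    end = action_index - offset
    start = end - window_len
    if start < 0:
        return None
    return profile[start:end]

def accuracy_per_offset(episodes, predict, window_len, max_offset):
    """Fraction of correct predictions for every offset from 0 to max_offset frames.

    Each episode is (profile, action_index, true_label); `predict` maps a window to a label.
    """
    results = []
    for offset in range(max_offset + 1):
        outcomes = []
        for profile, action_index, true_label in episodes:
            window = window_before(profile, action_index, window_len, offset)
            if window is not None:
                outcomes.append(predict(window) == true_label)
        results.append(sum(outcomes) / len(outcomes) if outcomes else None)
    return results

def toy_predict(window):
    """Stand-in for the trained classifier: 'chosen' if attention ever exceeds 0.5."""
    return int(max(window) > 0.5)

# Hypothetical toy episode: attention rises only in the last two frames before the action.
episodes = [([0.0, 0.0, 0.0, 0.0, 0.9, 0.9], 6, 1)]
curve = accuracy_per_offset(episodes, toy_predict, window_len=3, max_offset=3)
```

In this toy case the prediction is correct for offsets of 0 and 1 frames and fails once the window no longer covers the late fixation, which mirrors the drop in accuracy with growing anticipation time.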
As the prediction model for picking actions implements the novel aspect of adding the VAPs of related objects, its comparison to existing methods is of particular interest. Figure 7 shows a comparison of our proposed model (where both the object’s own VAP and the VAP of the matching pattern piece are used) to the case where the object’s own VAP is the single basis for a prediction, such as the model recently explored by Huang et al. It can be seen that both models clearly exceed the chance of picking randomly. Notably, the proposed model outperforms the existing one shortly after the subject ends the preceding move and presumably starts planning the next one. To further investigate the effect of the chosen model on the prediction performance, a two-factorial ANOVA was applied with the prediction time relative to the action and the model as independent factors and the performance as the dependent variable. It reveals that the correct prediction rate of the proposed model is significantly higher than that of the existing model.
IV-A Qualitative Analysis
For an in-depth understanding of how the intention models respond to different gaze patterns, we investigate the prediction profile i.e. the change of the prediction over time, for a set of typical scenarios.
IV-A1 One Dominant Type
A common observation was that the target object received most of the user’s visual attention prior to interaction, which is in line with our assumption 1. An example of these one-type-dominant samples can be seen in figure 7(a). A subset of this category is the case where the user’s eye gaze alternates between the piece to pick and the matching place in the pattern, i.e. where to put it (cf. figure 7(b)), which supports our assumption 2.
For the majority of these one-type-dominant samples, both the picking and placing prediction models predict correctly.
IV-A2 Trending Choice
While the anticipation time of the pick-up prediction model lies within a second and is thus rather reactive, the placing intention model is characterised by a slow increase of likelihood during the task, i.e. it shows a low-pass characteristic. Figure 9 demonstrates that the model is robust against small attention gaps and intermediate glances at competitors; however, it requires a longer time window to build up confidence.
IV-A3 Incorrect Predictions
There are a number of reasons for an incorrect prediction. Most commonly, a close-by neighbour received more visual attention and was falsely classified as the intended object. In other cases, it was impossible to predict the intended object using our model due to missing saccades towards it or faulty gaze tracking.
V Discussion of Intention Modelling
In addressing research question 1, we proposed a user intention model based on gaze cues for the prediction of actions, which was assessed in a pick-and-place task. As a novel aspect introduced through this study, the predictions are not only based on saccades and fixation durations of an individual object but also on those of related objects. In other words, assessing the attention on objects in the workspace helps to predict which piece outside the current workspace is needed next. When the subject turns their attention towards the piece, the model interprets this as a confirmation rather than the start of a selection process. This helps to cut the time required for the model to gather relevant gaze information and makes predictions more reliable than those of traditional models.
We showed that, within this task, the prediction of different actions has different anticipation times, i.e. dropping targets are identified earlier than picking targets. This can partially be explained by the fact that picking episodes are shorter than placing episodes. More importantly, we observed that users planned the entire pick-place cycle rather than planning picking and placing actions separately. This becomes evident through the qualitative analysis, which shows alternating fixations between the piece to pick and where to place it. That way, the placing prediction model is able to gather information already at the time of picking.
The proposed model allows predictions prior to picking actions (71.6% accuracy) and 1.5 seconds prior to dropping actions (80.06% accuracy). These numbers are encouraging for testing the prediction model in a real-time application. Therefore, we proceed with an experimental study where the intention model is used for cooperative behaviour.
VI Intention Prediction Model Validation
In the second part of our study, we validate the proposed intention model for the case where it is used to control the robot’s behaviour and motion. While the aforementioned experiments and analysis demonstrate that the intention model is capable of predicting users’ short-term goals while the user has full control over the robot’s tip, it is unclear whether this holds for the case where the robot reacts to these predictions. For example, users might adapt their intention to the robot’s plans just by seeing it moving towards a target which might differ from their initially intended move. In that case, labelling the robot’s predictions as correct or incorrect in the same way as in the first study becomes invalid due to the lack of ground truth. For this reason, we propose to assess the intention model indirectly by observing users’ reactions to the predictions with a focus on frustration. We hypothesise that a mismatch between the robot’s and the user’s plans would cause user frustration and that frustration is reduced when the robot follows the true user intention compared to avoiding it.
VI-A Intention Affected Robot Behaviour
For the experimental validation of the intention model, we used the aforementioned block copy task and introduced an assistive behaviour to the robot which is controlled based on the predictions of a user’s intended subsequent move, i.e. which piece the user wants to pick up next or at which location the user wants to drop it. We created 3 different behaviour modes: Follow Intention, Rebel and Random. In each mode, the robot retreats to a crouched position while there is a low probability for each available target. When the probability of the most likely target reaches a threshold, the robot reacts as follows in the different modes:
Follow Intention: The robot moves towards the target with the highest predicted intention.
Rebel: The robot avoids the target with the highest prediction and moves towards the target with the lowest predicted intention instead.
Random: The robot moves towards a random target.
We set a maximum decision time after which the robot executes the above-mentioned behaviour for the rare case where no probability exceeds the threshold. This prevents the robot from getting stuck in the crouched position, e.g. when there is a time gap in the gaze tracking stream.
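The decision logic of the three behaviour modes, including the timeout, can be summarised in a short sketch. It is an illustration only: the function name, the mode strings, and the threshold and timeout values used in the example are hypothetical and not taken from the study.

```python
import random

def select_target(probabilities, mode, threshold, elapsed, max_decision_time, rng=random):
    """Return the target to move to, or None while the robot should stay crouched.

    `probabilities` maps each available target to its predicted intention probability.
    """
    best = max(probabilities, key=probabilities.get)
    confident = probabilities[best] >= threshold
    if not confident and elapsed < max_decision_time:
        return None  # stay in the crouched position and keep observing
    if mode == "follow":
        return best
    if mode == "rebel":
        return min(probabilities, key=probabilities.get)
    if mode == "random":
        return rng.choice(sorted(probabilities))
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical predictions for three targets.
p = {"A": 0.2, "B": 0.7, "C": 0.1}
follow_target = select_target(p, "follow", threshold=0.6, elapsed=0.0, max_decision_time=2.0)
rebel_target = select_target(p, "rebel", threshold=0.6, elapsed=0.0, max_decision_time=2.0)
waiting = select_target(p, "follow", threshold=0.9, elapsed=0.5, max_decision_time=2.0)
timed_out = select_target(p, "follow", threshold=0.9, elapsed=2.5, max_decision_time=2.0)
```

In the last call no probability exceeds the threshold, but the elapsed time is past the maximum decision time, so the robot still commits to a target instead of remaining crouched.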
VI-B Experiment Execution
We recruited 20 new participants (6 females, mean age = 26, SD = 4) for the validation study, of which 2 were later removed from the set for data analysis due to malfunctioning gaze tracking. Each was asked to first complete the task without the robot moving for familiarisation with the rules and the robot handling. This practice session was followed by 3 trials, each with the robot set to a different behaviour mode. The block pattern to complete as well as the order of the behaviour modes were randomised. Furthermore, 5 (out of 24) randomly chosen blocks were pre-completed to stimulate some diversity in solving strategies, e.g. to prevent repeated line-by-line completion.
The participants were told to solve the trial tasks swiftly and that their performance was recorded. They did not receive any information about the behaviour modes but were told that the robot would move and try to help them with the task. Each trial was followed by the completion of a NASA Task Load Index (TLX) form and resting time.
VII Results and Discussion: Model Validation
To determine the effect of the robot’s behaviour mode on the subjects’ frustration level, we performed an analysis of variance (ANOVA) with the mode as the independent variable and the frustration component of the TLX as the dependent variable. As the analysis yielded a significant effect, it was further explored using post-hoc pairwise t-tests with Bonferroni correction. The frustration mean for the Rebel group was identified as being significantly higher than in the Follow Intention group. No significant mean differences were found when comparing the Random group to the others. The results can be seen in table 1 and figure 10.
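For readers unfamiliar with the procedure, the sketch below computes a one-way ANOVA F statistic and the Bonferroni-adjusted significance level for the three pairwise comparisons. It treats the groups as independent samples, which is a simplifying assumption since the text does not specify the exact ANOVA design; the ratings are made-up illustration values, not study data.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of independent groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons

# Made-up frustration ratings for the modes Follow Intention, Random and Rebel.
ratings = [[2, 3, 4], [4, 5, 6], [6, 7, 8]]
f_statistic = one_way_anova_f(ratings)
alpha_pairwise = bonferroni_alpha(0.05, n_comparisons=3)
```

With three modes there are three pairwise comparisons, so each post-hoc test would be evaluated against 0.05 / 3 rather than 0.05.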
We extended our analysis to both the combined TLX results, which serve as an indicator of perceived task load, and the measured performance, which is defined as the number of completed blocks per minute. However, an ANOVA did not yield an effect of the robot’s behaviour mode on either the combined TLX or the performance.
As part of a qualitative review of the robot’s behaviour, we found that in the Rebel mode, participants perform an increased number of corrective moves compared to the Follow Intention scenario. Figure 11 shows how the robot’s aim matches the user’s intention in the Follow Intention mode, whereas in the Rebel example, the user rushes towards the intended aim but needs to correct their move as the robot aims for a different piece.
Some participants commented on the behaviour modes. The Follow Intention mode was often preferred (e.g. “I liked being in charge and the robot was helpful” and “The robot followed my decisions”) whereas the Random mode led to irritation in some users (e.g. “First I thought it would go where I wanted but then it started moving in an unpredictable way”). For the Rebel mode, we observed divergent reactions. While some subjects struggled because of the mismatch between the robot’s motion and their plans, others started following the robot’s lead. This was also reflected in the comments, e.g. “Now the robot does its own thing, I don’t like it” versus “It was easier because I did not have to think much”.
The observed difference in frustration ratings between the mode where the robot supports the user’s predicted intention and the mode where it avoids it is evidence that most of the intention predictions matched the true intention. With regard to research question 2, our interpretation of the results is that during the Follow Intention trials, the robot did follow the users’ preferred sequence rather than the users adapting it to the robotic motion, which validates the proposed intention model and its application in assisted reaching.
The fact that the mean frustration for the Random mode lies between the other two modes is expected given their effect on frustration outlined above. However, the effect is too subtle to be compared to random motion and the sample size too small for a reliable distinction.
Our analysis furthermore shows that user frustration is more sensitive to the robot’s intention prediction than perceived task load or performance. We suggest that robotic systems should follow user intention when there are subtasks with similar priorities for enhanced cooperation.
VIII Conclusion
We investigated the use of gaze information to infer user intention within the context of a handheld robot. A pick-and-place task was used to collect gaze data as a basis for an SVM-based prediction model. Results show that, depending on the anticipation time, picking actions can be predicted with up to 87.94% accuracy and dropping actions with an accuracy of up to 93.25%. Furthermore, the model allows action anticipation prior to picking and prior to dropping. We show that merging gaze information with respect to objects that are linked to the same task in a single model helps to increase the prediction performance.
The developed intention model can be used to make predictions in real-time enabling the robot to align its plans to the user’s preferred goals making it a cooperative tool for complex tasks.
The proposed model performs particularly well for tasks where several objects connect to the same subtasks. This opens its applicability to other tasks in assembly and assisted living.
Acknowledgements: Thanks to the German Academic Scholarship Foundation and the UK’s EPSRC. The opinions expressed are those of the authors and not of the funding organisations.
References
[1] Austin Gregg-Smith and Walterio W Mayol-Cuevas. The design and evaluation of a cooperative handheld robot. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1968–1975. IEEE, 2015.
[2] Austin Gregg-Smith and Walterio W Mayol-Cuevas. Investigating spatial guidance for a cooperative handheld robot. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3367–3374. IEEE, 2016.
[3] Janis Stolzenwald and Walterio Mayol-Cuevas. I Can See Your Aim: Estimating User Attention From Gaze For Handheld Robot Collaboration. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–8, July 2018.
[4] Chien-Ming Huang and Bilge Mutlu. Anticipatory robot control for efficient human-robot collaboration. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 83–90. IEEE, 2016.
[5] Harish chaandar Ravichandar and Ashwin Dani. Human intention inference through interacting multiple model filtering. In 2015 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 2015.
[6] Tingting Liu, Jiaole Wang, and Max Q H Meng. Evolving hidden Markov model based human intention learning and inference. In 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 206–211. IEEE, 2015.
[7] Austin Gregg-Smith and Walterio W Mayol-Cuevas. Inverse Kinematics and Design of a Novel 6-DoF Handheld Robot Arm. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2102–2109. IEEE, 2016.
[8] Michael Land, Neil Mennie, and Jennifer Rusted. The Roles of Vision and Eye Movements in the Control of Activities of Daily Living. Perception, 28(11):1311–1328, 1999.
[9] Hema S Koppula and Ashutosh Saxena. Anticipating Human Activities Using Object Affordances for Reactive Robotic Response. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):14–29, 2016.
[10] Harish chaandar Ravichandar, Avnish Kumar, and Ashwin Dani. Bayesian Human Intention Inference Through Multiple Model Filtering with Gaze-based Priors. In International Conference on Information Fusion (FUSION), pages 1–7, June 2016.
[11] Chien-Ming Huang, Sean Andrist, Allison Sauppé, and Bilge Mutlu. Using gaze patterns to predict task intent in collaboration. Frontiers in Psychology, 6(1016):3, July 2015.
[12] Michael F Land and Mary Hayhoe. In what ways do eye movements contribute to everyday activities? Vision Research, 41(25-26):3559–3565, November 2001.
[13] Roland S Johansson, Göran Westling, Anders Bäckström, and J Randall Flanagan. Eye–Hand Coordination in Object Manipulation. Journal of Neuroscience, 21(17):6917–6932, September 2001.
[14] Neil Mennie, Mary Hayhoe, and Brian Sullivan. Look-ahead fixations: anticipatory eye movements in natural tasks. Experimental Brain Research, 179(3):427–442, December 2006.
[15] Dana H Ballard, Mary M Hayhoe, and Jeff B Pelz. Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience, 7(1):66–80, 1995.
[16] Jeff Pelz, Mary Hayhoe, and Russ Loeber. The coordination of eye, head, and hand movements in a natural task. Experimental Brain Research, 139(3):266–277, 2001.
[17] Joseph C Miller, Joelle C Ruthig, April R Bradley, Richard A Wise, Heather A Pedersen, and Jo M Ellison. Learning effects in the block design task: A stimulus parameter-based approach. Psychological Assessment, 21(4):570–577, 2009.
[18] M A Hearst, S T Dumais, E Osuna, J Platt, and B Scholkopf. Support Vector Machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998.
[19] Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1–7, 1995.
[20] Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, pages 101–141, June 2004.
[21] Sandra G Hart and Lowell E Staveland. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload, pages 139–183. Elsevier, 1988.