A handheld robot shares properties of a handheld tool while being enhanced with autonomous motion as well as the ability to process task-relevant information and user signals. Earlier work in this new field introduced first prototypes which demonstrate that robotic guiding gestures as well as on-display visual feedback lead to a level of cooperative performance that exceeds manual performance. This one-way communication of task planning, however, is limited by the constraint that the robot has to lead the user. As a result, the introduction of user decisions can lead to conflicts with the robot’s plan, which in turn can cause frustration in users and decrease cooperative task performance. Furthermore, this concept is not in line with the users’ idea of cooperation as the robot’s behaviour was sometimes hard to predict, e.g. users would not know where the robot would move next.
As a starting point for addressing this problem, extended user perception was introduced to the robot, which allows the estimation of the user’s eye gaze in 3D space during task execution. An estimate of users’ visual attention was then used to inform the robot’s decisions when there was an alternative, e.g. the user could prioritise subsequent goals. While this feature was preferred, particularly for temporally demanding tasks, we lack a more sophisticated model that could be used for tasks with higher complexity. Such a model would allow the robot to infer user intention, i.e. predict users’ goals in the near future rather than reacting to immediate decisions only.
Intention inference has caught researchers’ attention in recent years and promising solutions have been achieved through observing users’ eye gaze, body motion or task objects. These contributions target safe interactions between humans and stationary robots with shared workspaces. Thus, the question remains open whether there is a model which suits the setup of a handheld robot, which is characterised by close shared physical dependency and a working-together rather than turn-taking cooperative strategy.
Our work is guided by the following research questions:
1. How can user intention be modelled in the context of a handheld robot task?
2. To what extent can handheld robot user actions like picking and placing be predicted in advance?
For our study, we use the open robotic platform (3D CAD models available from handheldrobotics.org) introduced in earlier work, in combination with a previously reported eye tracking system. Within a simulated assembly task, eye gaze information is used to predict subsequent user actions. The study consists of two principal parts: first, the design of the experiment used for data collection and, second, the method of modelling user intention, followed by a detailed evaluation.
II Background and Related Work
In this section, we summarise earlier work on handheld robots and their control based on user perception. Furthermore, we review existing methods for intention inference with a focus on human gaze behaviour.
II-A Handheld Robots
Early handheld robot work used a trunk-shaped robot with 4 DoF to explore issues of autonomy and task performance. This was later upgraded to a 6-DoF (joint space) mechanism and used gestures, such as pointing, to study user guidance. These earlier works demonstrate how users benefit from the robot’s quick and accurate movement while the robot profits from the human’s tactical motion. Most importantly, cooperative performance increased with the level of the robot’s autonomy. They furthermore found that cooperative performance significantly increases when the robot communicates its plans, e.g. via a robot-mounted display.
Within this series of work, another problem was identified: the robot does not sense the user’s intention and thus potential conflicts with the robot’s plan remain unsolved. For example, when the user would point the robot towards a valid subsequent goal, the robot might have already chosen a different one and keep pointing towards it rather than adapting its task plan. This led to irritation and frustration in users on whom the robot’s plan was imposed.
Efforts towards involving user perception in the robot’s task planning were made in our recent work on estimating user attention. The method was inspired by work from Land et al., who found that humans’ eye gaze is closely related to their manual actions. The attention model measures the current visual attention to bias the robot’s decisions. In a simulated Space Invaders-style task, different levels of autonomy were tested over varying configurations of speed demands. It was found that both the fully autonomous mode (robot makes every decision) and the attention-driven mode (robot decides based on gaze information) outperform manual task execution. Notably, for high-speed levels, the increased performance was most evident for the attention-driven mode, which was also rated more helpful and perceived as more cooperative than the fully autonomous mode.
As opposed to an intention model, the attention model would react to the current state of eye gaze information only, rather than using its history to make predictions about the user’s future goals. We suggest that this would be required for cooperative task solving for complex tasks like assembly where there is an increased depth of subtasks.
II-B Intention Prediction
Human intention estimation in the field of robotics is in part driven by the demand for safe human-robot interaction and efficient cooperation. Here, we review recent contributions covering a broad variety of approaches.
Ravichandar et al. investigated intention inference based on human body motion. Using Microsoft Kinect motion tracking as an input for a neural network, reaching targets were successfully predicted within an anticipation time of approximately prior to the hand touching the object. Similarly, Saxena et al. introduced a measure of affordance to make predictions about human actions and reached 84.1%/74.4% accuracy / in advance, respectively. Later, Ravichandar et al. added human eye gaze tracking to their system and used the additional data for pre-filtering to merge it with the existing motion-based model. The anticipation time was increased to .
Huang et al. used gaze information from a head-mounted eye tracker to predict customers’ choices of ingredients for sandwich making. Using a support vector machine (SVM), an accuracy of approximately 76% was achieved with an average prediction time of prior to the verbal request. In subsequent work, Huang & Mutlu used the model as a basis for a robot’s anticipatory behaviour, which led to more efficient collaboration compared to following verbal commands only.
We note that the above work targets intention inference for external robots, which are characterised by a shared workspace with a human but can move independently. It is unclear whether these methods are suitable for close cooperation as found in the handheld robot setup.
II-C Human Gazing Behaviour
The intention model we present in this paper is mainly driven by eye gaze data. Therefore, we review work on human gaze behaviour to inform the underlying assumptions of our model.
One of the main contributions in this field is the work by Land et al., who found that fixations towards an object often precede a subsequent manual interaction by around . Subsequent work revealed that the latency between eye and hand varies between different tasks. Similarly, Johansson et al. found that objects are most salient for humans when they are relevant for task planning, and preceding saccades were linked to short-term memory processes in .
III Prediction of User Intention
In this section, we describe how intention prediction is modelled for the context of a handheld robot on the basis of an assembly task. The first part is about how users’ gaze behaviour is captured and quantified within an experimental study. In the second part, we describe how this data is converted into features and how these were used to predict user intent.
III-A Data Collection
As an example task for data collection, we chose a simulated version of a block copying task which has been used in the context of work on hand-eye motion before [16, 17]. Participants of the data collection trials were asked to use the handheld robot (cf. figure 2) to pick blocks from a stock area and place them in the workspace area at one of the associated spaces indicated by a shaded model pattern. The task was simulated on an LCD TV display and the robot remained motionless to avoid distraction. Rather than using coloured blocks, we drew inspiration from a block design IQ test and decided to use ones that are distinguished by a primitive black and white pattern. That way, a match with the model additionally depends on the block’s orientation, which adds some complexity to the task; in addition, the absence of colours aims to level the challenge for people with colour blindness. An overview of the task can be seen in figure 3, and figure 4 shows examples of possible picking and placing moves.
In order to pick or place pieces, users have to point the robot’s tip towards and close to the desired location and pull/release a trigger in the handle. The position of the robot and its tip is measured via a motion tracking system (OptiTrack: https://optitrack.com). The handle houses another button which users can use to turn the grabbed piece by for each activation. The opening or closing process of the virtual gripper takes which is animated on the screen. If the participant tries to place a mismatch, the piece goes back to the stock and has to be picked up again. Participants are asked to solve the task swiftly, and it is completed when all model pieces are copied. Figure 1 shows an example of a participant solving the puzzle.
For the data collection, 16 participants (7 females, mean age = 25, SD = 4), mostly students from different fields, were recruited to complete the block copy task. Participation was on a voluntary basis and there was no financial compensation for their time; however, many considered the task a fun game. Each participant completed one practice trial to get familiar with the procedure, followed by another three trials for data collection, where stock pieces and model pieces were randomised prior to execution. The pattern consists of 24 parts with an even count of the 4 types. The task starts with 5 pre-completed pieces to increase the diversity of solving sequences, leaving 19 pieces to be completed by the participant. That way, a total of 912 episodes of picking and dropping were recorded.
Throughout the task execution, we kept track of the user’s eye gaze using a robot-mounted remote eye tracker in combination with a 3D gaze model introduced in earlier work. That information was used to measure the Euclidean gaze distance for each object in the scene over time, that is, the distance between an object’s centre point and the intersection between eye gaze and screen. In the following, we call the record of this distance over time for an object its gaze history. Moreover, we recorded the times for picking and placing actions for later use in auto-segmentation.
With 912 recorded episodes, 4 available stock pieces and 24 pattern parts, we collected 3648 gaze histories for stock parts prior to picking actions and 21888 for pattern pieces.
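The counts above follow directly from the study design. As a quick check, the short sketch below reproduces the arithmetic using only the numbers stated in the text; the variable names are illustrative.

```python
# Dataset bookkeeping, using only the numbers reported in the text.
participants = 16
recorded_trials = 3      # trials per participant used for data collection
pieces_per_trial = 19    # 24 pattern parts minus 5 pre-completed ones

# One episode = one pick followed by one drop.
episodes = participants * recorded_trials * pieces_per_trial

stock_pieces = 4         # candidates available for each picking action
pattern_parts = 24       # candidates available for each placing action

# One gaze history per candidate object per episode.
stock_histories = episodes * stock_pieces
pattern_histories = episodes * pattern_parts

print(episodes, stock_histories, pattern_histories)  # 912 3648 21888
```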
III-B User Intention Model
In the context of our handheld robot task, we define intention as the user’s choice of which object to interact with next i.e. which stock piece to pick and on which pattern field to place it.
Based on our literature review, our modelling is guided by the following assumptions:
1. An intended object attracts the users’ visual attention prior to interaction.
2. During task planning, the users’ visual attention is shared between the intended object and other (e.g. subsequent) task-relevant objects.
As a first step towards the feature construction, the recorded gaze history (measured as a series of distances as described above) is converted into a visual attention profile (VAP) through the following equation:
where is the time-dependent gaze distance of the i-th object (cf. figure 1). defines the gaze distance resulting in a significant drop of and was set to based on the pieces’ size and tracking tolerance. Here, we define an object’s gaze profile at the time as the collection of over an anticipation window prior to . Due to the data update frequency of the profile is discretised into a vector of 300 entries. An example can be seen in figure 5.
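To make the conversion from gaze history to a VAP more concrete, the sketch below maps a series of gaze distances to attention values and cuts out a fixed-length window before a given sample. It is only an illustration: the Gaussian falloff, the parameter name `sigma`, and both function names are assumptions made here and are not taken from the text, which only states that attention drops significantly beyond a characteristic gaze distance and that a profile has 300 entries.

```python
import math


def visual_attention_profile(gaze_distances, sigma):
    """Map gaze distances to attention values in (0, 1].

    ASSUMPTION: a Gaussian falloff exp(-d^2 / (2 sigma^2)) is used purely
    for illustration; `sigma` stands in for the characteristic distance
    at which attention drops significantly.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in gaze_distances]


def anticipation_window(profile, end_index, length=300):
    """Return the `length` samples that precede `end_index` (exclusive)."""
    start = end_index - length
    if start < 0 or end_index > len(profile):
        raise ValueError("window does not fit into the recorded profile")
    return profile[start:end_index]
```

With this form, a gaze point on the object centre yields an attention value of 1, and the value decays smoothly as the gaze moves away. The window length of 300 samples matches the vector size given above.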
The prediction for picking and placing actions was modelled separately as they require different feature sets. As studies about gaze behaviour during block copying suggest that the eye gathers information about both what to pick and where to place it prior to picking actions, we combined pattern and stock information for picking predictions for each potential candidate to choose, resulting in the features:
1. The VAP of the object itself.
2. The VAP of the matching piece in the pattern. If there are several, the one with the maximum visual attention is picked.
For the prediction of the dropping location, the second picking feature (the VAP of the matching pattern piece) is not applicable as the episode finishes with the placing of the part. However, we hypothesise that it would be unlikely that a participant would drop the piece on a mismatch or on a match that is already completed. This results in the following feature set, which is calculated for each dropping location:
1. The VAP of the object itself (vector).
2. Whether or not the location matches what is held by the gripper (boolean).
3. Whether or not this pattern space is already completed (boolean).
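The two feature sets described above could be assembled along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, "maximum visual attention" is interpreted here as the largest summed VAP, and the booleans are encoded as 0.0/1.0; none of these details are specified in the text.

```python
def picking_features(stock_vap, matching_pattern_vaps):
    """Feature vector for one stock piece.

    Concatenates the piece's own VAP with the VAP of its matching pattern
    piece. ASSUMPTION: if several pattern pieces match, the one with the
    largest summed attention is used.
    """
    if matching_pattern_vaps:
        best_match = max(matching_pattern_vaps, key=sum)
    else:
        # No matching pattern piece left: fall back to zero attention.
        best_match = [0.0] * len(stock_vap)
    return list(stock_vap) + list(best_match)


def placing_features(location_vap, matches_held_piece, already_completed):
    """Feature vector for one dropping location.

    Concatenates the location's VAP with the two boolean flags,
    encoded as 0.0 or 1.0 (an encoding chosen here for illustration).
    """
    return list(location_vap) + [float(matches_held_piece), float(already_completed)]
```

For a 300-entry VAP, this yields 600 values per picking candidate and 302 values per placing candidate.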
As prediction models for picking and placing intention, we used SVMs, as this type of supervised machine learning model was used for similar classification problems in the past. We divided the sets of VAPs into two categories, one where the associated object was the intended object (labelled as chosen = 1) and another one for the objects that were not chosen for interaction (labelled as chosen = 0). Training and validation of the models were done through 5-fold cross validation.
The accuracy of predicting the chosen label for individual objects is 89.6% for picking actions and 98.3% for placing intent. However, sometimes the combined decision is conflicting, e.g. when several stock pieces are predicted to be the intended ones. This is resolved by selecting the one with the highest probability of being chosen in a one-vs-all setup. This configuration was tested for scenarios with the biggest choice, e.g. when all 4 stock parts (random chance = 25%) would be a reasonable choice to pick or when the piece to be placed matches 4 to 6 different pattern pieces (random chance = 17-25%). This results in a correct prediction rate of 87.9% for picking and 93.25% for placing actions when the VAPs of the time up to just before the action time are used.
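The conflict resolution step can be illustrated with a few lines of code. The sketch assumes that each candidate already has a probability of being chosen, e.g. from a probabilistic output of its binary classifier; the function name and the example values are hypothetical.

```python
def resolve_intended(probabilities):
    """Pick the single most likely candidate in a one-vs-all setup.

    `probabilities` maps a candidate id to the estimated probability that
    this candidate is the intended object (chosen = 1).
    """
    if not probabilities:
        raise ValueError("at least one candidate is required")
    return max(probabilities, key=probabilities.get)


# Two stock pieces exceed 0.5, so the per-object labels conflict;
# the candidate with the highest probability is selected.
example = {"stock_0": 0.62, "stock_1": 0.81, "stock_2": 0.10, "stock_3": 0.05}
print(resolve_intended(example))  # stock_1
```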
Having trained and validated the intention prediction model for the case where VAPs range from -4 to 0 seconds prior to the interaction with the associated object, we are now interested in the extent to which the intention model predicts accurately at some time prior to interaction. To answer this question, we extend our model analysis by calculating the prediction accuracy as a function of the time offset prior to interaction. Within a 5-fold cross validation setup, the anticipation window is iteratively moved away from the time of interaction and the associated VAPs are used to make a prediction about the subsequent user action using the trained SVM models. The validation is based on the aforementioned low-chance subsets, so that the chance of a correct prediction through randomly selecting a piece would be at most 25%. The shift of the anticipation window over the data set is done with a step width of 1 frame. This is done both for predicting which piece is picked up next and for inferring where it is going to be placed. For the time offsets of 0, 0.5 and 1 seconds, the prediction of picking actions yields an accuracy of 87.94%, 72.36% and 58.07%, respectively. The placing intention model maintains a high accuracy over a longer time span, with an accuracy of 93.25%, 80.06% and 63.99% for the time offsets of 0, 1.5 and 3 seconds, respectively. In order to interpret these differences in performance, we investigated whether there is a difference between the mean duration of picking and placing actions. We applied a two-sample t-test and found that the picking time (mean = , SD = ) is significantly shorter than the placing time (mean = , SD = ), with . A detailed profile of the time-dependent prediction performance for each model can be seen in figure 6.
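The offset-dependent evaluation described above amounts to sliding a fixed-length window backwards from the action time and scoring the predictions at each offset. The following sketch shows one way this loop might look. The episode layout (a dict with `profiles`, `action_index` and `chosen`), the function names, and the `predict` callback standing in for the trained SVM models are all assumptions for illustration.

```python
def shifted_window(profile, action_index, offset_frames, length):
    """Window of `length` samples ending `offset_frames` before the action."""
    end = action_index - offset_frames
    start = end - length
    if start < 0 or end > len(profile):
        raise ValueError("window does not fit into the recorded profile")
    return profile[start:end]


def accuracy_by_offset(episodes, predict, offsets, length):
    """Fraction of correctly predicted episodes for each frame offset.

    `predict` is a placeholder for the trained model: it receives a dict
    {candidate id: windowed VAP} and returns the predicted candidate id.
    """
    results = {}
    for offset in offsets:
        correct = 0
        for episode in episodes:
            windows = {
                cid: shifted_window(vap, episode["action_index"], offset, length)
                for cid, vap in episode["profiles"].items()
            }
            if predict(windows) == episode["chosen"]:
                correct += 1
        results[offset] = correct / len(episodes) if episodes else float("nan")
    return results
```

Stepping `offsets` by one frame at a time gives the kind of accuracy-over-time profile discussed in the text, with offset 0 corresponding to the window that ends at the moment of interaction.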
IV-A Qualitative Analysis
For an in-depth understanding of how the user intention models respond to different gaze patterns, we investigate the prediction profile i.e. the change of the prediction over time, for a set of typical scenarios.
IV-A1 One Dominant Type
A common observation was that the target object received most of the user’s visual attention prior to interaction, which is in line with our assumption 1. We call this pattern one type dominant and a set of examples can be seen in Figure 7. A subset of this category is the case where the user’s eye gaze alternates between the piece to pick and the matching place in the pattern, i.e. where to put it (cf. figure 6(c)), which supports our assumption 2.
Furthermore, we note that the prediction remains stable in the event of a short break of visual attention, i.e. the user glances away and back to the same object (cf. figure 6(b)). This is in contrast to an intention inference based on the last state of visual attention only, which would result in an immediate change of the prediction.
For the majority of these one type dominant samples both the picking and placing prediction models predict accurately.
IV-A2 Trending Choice
While the anticipation time of the pick-up prediction model lies within a second and is thus rather reactive, the placing intention model is characterised by an increase of prediction during the task, i.e. a low-pass characteristic. Figure 8 shows examples with different fixation durations (figure 7(a),7(b)) and how the user’s gaze alternates between two competing places (figure 7(c)). The prediction model is robust for these cases; however, the anticipation time is reduced in comparison to the one type dominant samples.
IV-A3 Incorrect Predictions
A common reason for an incorrect prediction is that a close competitor is chosen, for example when the user’s gaze goes back and forth between two potential placing candidates (figure 8(a)) and the incorrect choice is favoured by the model. In some rare cases, no fixations on the intended candidate were recorded prior to the interaction (cf. figure 8(b)). In a few other samples that led to faulty predictions, the eye tracker could not recognise the eyes, e.g. when the robot was held so that the head was outside the trackable volume or outside the head angle range. In that case, the tracking system is unable to update the gaze model, which led to over- or underestimation of perceived visual attention, as can be seen in 7(c).
In addressing research question 1, we proposed a user intention model based on gaze cues for the prediction of actions within a pick and place task user study as an example for handheld robot interaction. 3D user gaze was used to quantify visual attention for task-relevant objects in the scene. The derived visual attention profiles were used as features for SVMs to predict which object will be picked up next and where it will be placed with an accuracy of 87.94 and 93.25 percent, respectively.
The prediction performance was furthermore investigated with respect to the time distance prior to the time of action to answer research question 2. The proposed model allows predictions prior to picking actions (71.6% accuracy) and prior to dropping actions (80.06% accuracy).
A qualitative analysis was conducted which shows that the prediction model performs robustly for long gaze fixations on the intended object as well as for the case where users divide their attention between the intended object and related ones. Furthermore, the analysis shows the growth of the model’s confidence about the prediction while the user’s decision process unfolds as indicated by glances among a set of competing candidates to choose from.
We showed that, within this task, the prediction of different actions has different anticipation times, i.e. dropping targets are identified earlier than picking targets. This can partially be explained by the fact that picking episodes are shorter than placing episodes. More importantly, we observed that users planned the entire work cycle rather than planning picking and placing actions separately. This becomes evident through the qualitative analysis, which shows alternating fixations between the picking targets and where to place them. That way, the placing prediction model is already able to gather information at the time of picking.
Within this work, we investigated the use of gaze information to infer user intention within the context of a handheld robot scenario. A pick and place task was used to collect gaze data as a basis for an SVM-based prediction model. The results show that, depending on the anticipation time, picking actions can be predicted with up to 87.94% accuracy and dropping actions with an accuracy of 93.25%. Furthermore, the model allows action anticipation prior to picking and prior to dropping.
The proposed model could be used in online anticipation scenarios to infer user intention in real time for more fluid human-robot collaboration, particularly, for the case where objects can be related to a task sequence e.g. pick and place or assembling.
Acknowledgement This work was partially supported by the German Academic Scholarship Foundation and by the UK’s Engineering and Physical Sciences Research Council. Opinions are those of the authors and not of the funding organisations.
-  Austin Gregg-Smith and Walterio W Mayol-Cuevas. The design and evaluation of a cooperative handheld robot. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1968–1975. IEEE, 2015.
-  Austin Gregg-Smith and Walterio W Mayol-Cuevas. Investigating spatial guidance for a cooperative handheld robot. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3367–3374. IEEE, 2016.
-  Janis Stolzenwald and Walterio Mayol-Cuevas. I Can See Your Aim: Estimating User Attention From Gaze For Handheld Robot Collaboration. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–8, July 2018.
-  Chien-Ming Huang and Bilge Mutlu. Anticipatory robot control for efficient human-robot collaboration. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 83–90. IEEE, 2016.
-  Harish Chaandar Ravichandar and Ashwin Dani. Human intention inference through interacting multiple model filtering. In 2015 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 2015.
-  Tingting Liu, Jiaole Wang, and Max Q H Meng. Evolving hidden Markov model based human intention learning and inference. In 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 206–211. IEEE, 2015.
-  Austin Gregg-Smith and Walterio W Mayol-Cuevas. Inverse Kinematics and Design of a Novel 6-DoF Handheld Robot Arm. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2102–2109. IEEE, 2016.
-  Michael Land, Neil Mennie, and Jennifer Rusted. The Roles of Vision and Eye Movements in the Control of Activities of Daily Living. Perception, 28(11):1311–1328, 1999.
-  Hema S Koppula and Ashutosh Saxena. Anticipating Human Activities Using Object Affordances for Reactive Robotic Response. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):14–29, 2016.
-  Harish Chaandar Ravichandar, Avnish Kumar, and Ashwin Dani. Bayesian Human Intention Inference Through Multiple Model Filtering with Gaze-based Priors. In International Conference on Information Fusion (FUSION), pages 1–7, June 2016.
-  Chien-Ming Huang, Sean Andrist, Allison Sauppé, and Bilge Mutlu. Using gaze patterns to predict task intent in collaboration. Frontiers in Psychology, 6(1016):3, July 2015.
-  Tingting Liu, Jiaole Wang, and Max Q H Meng. Human robot cooperation based on human intention inference. In 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 350–355. IEEE, 2014.
-  Michael F Land and Mary Hayhoe. In what ways do eye movements contribute to everyday activities? Vision Research, 41(25-26):3559–3565, November 2001.
-  Roland S Johansson, Göran Westling, Anders Bäckström, and J Randall Flanagan. Eye–Hand Coordination in Object Manipulation. Journal of Neuroscience, 21(17):6917–6932, September 2001.
-  Neil Mennie, Mary Hayhoe, and Brian Sullivan. Look-ahead fixations: anticipatory eye movements in natural tasks. Experimental Brain Research, 179(3):427–442, December 2006.
-  Dana H Ballard, Mary M Hayhoe, and Jeff B Pelz. Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience, 7(1):66–80, 1995.
-  Jeff Pelz, Mary Hayhoe, and Russ Loeber. The coordination of eye, head, and hand movements in a natural task. Experimental Brain Research, 139(3):266–277, 2001.
-  Joseph C Miller, Joelle C Ruthig, April R Bradley, Richard A Wise, Heather A Pedersen, and Jo M Ellison. Learning effects in the block design task: A stimulus parameter-based approach. Psychological Assessment, 21(4):570–577, 2009.
-  M A Hearst, S T Dumais, E Osuna, J Platt, and B Scholkopf. Support Vector Machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998.
-  Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1–7, 1995.
-  Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, pages 101–141, June 2004.