Gesture Recognition for Initiating Human-to-Robot Handovers

by Jun Kwan, et al.

Human-to-Robot handovers are useful for many Human-Robot Interaction scenarios. It is important to recognize when a human intends to initiate a handover, so that the robot does not try to take objects from humans when a handover is not intended. We pose handover gesture recognition as a binary classification problem on a single RGB image. Three separate neural network modules, for detecting the object, the human body key points, and the head orientation, are implemented to extract relevant features from the RGB image; the resulting feature vectors are then passed into a deep neural net to perform binary classification. Our results show that handover gestures are correctly identified with an accuracy of over 90%. Our approach is modular and generalizable to different objects and human body types.






I Introduction

Robots are under rapid development in sectors such as manufacturing, automation and hospitality. Collaborative robots are increasingly being used in industry, and are expected to be useful in home environments in the future. One of the expected capabilities of collaborative robots is the ability to perform object handovers. Object handovers can happen in two directions: Robot-to-Human, where the robot delivers a requested object to a human, and Human-to-Robot, where the robot acquires an object from a human. The object handover problem has been studied extensively in the robotics literature, with most of the published work focusing on the Robot-to-Human scenario. The Human-to-Robot handover scenario is arguably the more difficult problem because, as opposed to the Robot-to-Human scenario, the robot is responsible for grasping the object from the human partner’s hand, which raises safety issues. This makes the perception problem critical for the Human-to-Robot handover scenario. The perception system is responsible for detecting when and where a human partner wants to engage in a handover, detecting the object the human wants to pass, and ensuring that a finger or body part of the human is not grasped. For example, a human might be holding a cellphone near the robot, and the robot should not unexpectedly reach out and grasp it from the human’s hand. In this paper, we focus on detecting whether a human is intending to initiate a Human-to-Robot object handover.

Fig. 1: We detect whether a user is intending to initiate a handover from a single RGB image. Handover gesture is correctly detected (left column). No handover activity is detected (right column).

Human-to-Robot handovers have been demonstrated on physical robot systems by a handful of researchers [16, 10, 8, 11, 17, 13]. Even though successful demonstrations have been shown, in these works the robot is programmed for a single task: object handovers. In real scenarios where the robot is expected to engage in many tasks, the robot must first recognize the action of the human handing an object over to the robot. There are many communication cues that humans use to recognize handover intent, including direct cues such as verbal communication, and indirect cues such as eye gaze and body gestures. In this work, we focus on the binary classification of whether a human partner is intending to hand over an object, based on body gestures only.

Our approach is based on extracting features relevant to the task from three independent modules, and learning a classifier to detect the existence of a handover gesture, based on the extracted features. Each of these modules is a neural network, described below:

  • Object Detector: detects the presence and pixel coordinates of the bounding box of an object

  • Human Body Pose Detector: detects the human body pose with specific key points

  • Head Orientation Detector: detects the head orientation of the person, represented in Euler angles with respect to the camera frame.

We train a deep neural network that converts the resulting feature vector generated by the modules into a binary classification result. Our system uses the Faster Region-Based Convolutional Neural Network (Faster R-CNN) [12] for object detection, Keypoint R-CNN [6] to estimate the human body pose, the multi-loss ResNet50 architecture [14] to estimate the head pose of the person, and a fully connected deep net to produce the final recognition result. We represent the human body key points in the object coordinate frame, so that the detection is independent of where the human is in the image frame. Our approach is 1) modular, because each module can be replaced as long as the output type is the same, and 2) generalizable, because it is independent of the object and human appearance, and of the relative position of the human in the image.

II Approach

Our approach utilizes a total of four modules, as shown in the system diagram (Fig. 2). The input RGB image is passed into three neural networks to obtain relevant features in parallel: an object detector detects the pixel coordinates of the object of interest, multiple keypoints of the person’s body are detected, and the head pose of the person is estimated. The resulting features are processed and compressed into a feature vector, which is then passed into the final multi-layer perceptron (MLP) to generate a binary output: the estimation of whether the human intends to initiate a handover.

Fig. 2: System Diagram

II-A Object Detection

Faster R-CNN [12] is implemented to detect the pixel coordinates of an object held by a person. When an object is detected, a bounding box is generated by this module. In the case where multiple objects are detected, the object with the largest bounding box area is selected as the target. Faster R-CNN is a modification of the original Region-Based CNN [4] and the improved Fast R-CNN [5]. R-CNN introduced a selective search algorithm to propose regions of interest in the image. These regions are then individually passed into a CNN which extracts the features of the regions. These features are subsequently passed into an SVM which determines the presence of the object in the proposed regions. Fast R-CNN improves the performance of R-CNN by passing the image through the CNN first to extract a feature map, which is then used to generate the region proposals. Fast R-CNN is faster than basic R-CNN because the image is passed into the CNN only once, whereas in R-CNN the large number of region proposals are all passed into the CNN, slowing down the entire network. Faster R-CNN takes this improvement further by replacing the selective search algorithm used by R-CNN and Fast R-CNN with a region proposal network.
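The largest-box selection rule described above can be sketched as follows; the (x1, y1, x2, y2) corner format is an assumption for illustration, as the paper does not specify the box encoding.

```python
# Minimal sketch of the target-selection rule: when the detector returns
# several candidate boxes, keep the one with the largest pixel area.
# Box format (x1, y1, x2, y2) is assumed here.

def select_target_box(boxes):
    """Return the detection with the largest bounding-box area, or None."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

boxes = [(10, 10, 50, 40), (0, 0, 200, 150), (30, 30, 60, 60)]
target = select_target_box(boxes)  # the 200x150 box has the largest area
```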

II-B Body Keypoints Detection

Human body keypoints are commonly used in robotics applications, for example to ensure safety in Human-Robot Interaction [2, 13] and to recognize gestures such as pointing [3].

Mask R-CNN [6] is implemented to detect various keypoints of a human body, such as shoulders, elbows, wrists, etc. Mask R-CNN is a modification of Faster R-CNN which, in parallel to the class and bounding box prediction, adds a branch that outputs a binary mask for each Region of Interest (RoI). Mask R-CNN is able to perform semantic segmentation on each separate RoI, allowing the model to identify the boundaries of objects at the pixel level.

For keypoint detection, each keypoint is treated as a separate class during the training process. The segmentation branch of Mask R-CNN outputs k binary masks representing k different keypoints. In each binary mask, only one pixel is labelled as foreground by the model, representing the location of the keypoint. The coordinates of each of these points are then taken as the estimate of the pose. This modification of the Mask R-CNN model is called Keypoint R-CNN.
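Decoding the k single-foreground-pixel masks into pixel coordinates, as described above, amounts to an argmax per mask. A minimal sketch, assuming the masks arrive as a (k, H, W) array:

```python
import numpy as np

# Sketch of decoding Keypoint R-CNN's output: each of the k masks marks a
# single foreground pixel whose coordinates give the keypoint location.
# The (k, H, W) mask layout is an assumption for illustration.

def masks_to_keypoints(masks):
    """masks: (k, H, W) array; returns a (k, 2) array of (x, y) pixel coords."""
    k, h, w = masks.shape
    flat = masks.reshape(k, -1).argmax(axis=1)   # index of the hottest pixel
    ys, xs = np.unravel_index(flat, (h, w))
    return np.stack([xs, ys], axis=1)

masks = np.zeros((2, 4, 5))
masks[0, 1, 2] = 1.0   # keypoint 0 at (x=2, y=1)
masks[1, 3, 0] = 1.0   # keypoint 1 at (x=0, y=3)
keypoints = masks_to_keypoints(masks)
```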

II-C Head Pose Estimation

The multi-loss ResNet50 architecture, proposed by Ruiz et al. [14], is utilised for head pose estimation. A separate loss is designated for each Euler angle, so there are three losses in total, for yaw, pitch and roll. Each loss combines a binned pose classification term and a regression term. ResNet50 is used as the backbone of the network, and three fully connected layers are attached to it for the Euler angle predictions. A softmax layer and a cross-entropy loss are implemented for bin classification, giving three cross-entropy losses in total; these losses are back-propagated through the network. A mean squared error loss is added to the network as the regression term. In addition, the multi-task cascaded convolutional network (MTCNN) [18], proposed by Zhang et al., is used for face detection before computing the head pose estimate of the person.
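The binned-classification half of the head pose estimator can be sketched as computing the expected bin under the softmax and mapping it back to degrees. The bin width (3 degrees) and range (roughly ±99 degrees) follow the released Hopenet code of Ruiz et al. and are assumptions here, not values stated in this paper:

```python
import numpy as np

# Sketch of recovering a continuous angle from binned-classification logits:
# softmax over angle bins -> expected bin index -> degrees. Bin width and
# offset are assumptions based on the released Hopenet implementation.

def binned_logits_to_degrees(logits, bin_width=3.0, offset=-99.0):
    """logits: 1D array over angle bins; returns the expected angle in degrees."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    expected_bin = (probs * np.arange(len(logits))).sum()
    return expected_bin * bin_width + offset

logits = np.full(66, -10.0)
logits[40] = 10.0                         # nearly all mass on bin 40
yaw = binned_logits_to_degrees(logits)    # close to 40 * 3 - 99 = 21 degrees
```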

II-D Gesture Recognition from Extracted Features

We used a Multi-Layer Perceptron that consists of an input layer, four hidden layers, and an output layer. The network architecture is illustrated in Fig. 3. Features extracted by the previous modules are processed and concatenated into a feature vector before being passed into the MLP. Only keypoints of the upper body of the person are included in the feature vector. We implemented two ways to construct the feature vector, according to how the body keypoints are represented: relative to the object, or as absolute pixel coordinates. Below we explain each method.

Fig. 3: Neural network architecture for the MLP to detect handover gestures
  • Absolute Pixels: The input feature vector has a size of 29.

    • Object (4 values): the pixel coordinates of the centroid, width and height of the object bounding box

    • Absolute Human body keypoints (22 values): pixel coordinates of each of the 11 keypoints belonging to upper body

    • Head orientation (3 values): yaw, pitch and roll angles in degrees, represented in camera frame

  • Relative to Object: The input feature vector has a size of 26.

    • Object presence (1 value): Binary value of whether an object of interest is detected.

    • Relative human body keypoints (22 values): pixel coordinates of each of the 11 keypoints belonging to upper body, represented in the 2D coordinate frame attached to the target object centroid

    • Head orientation (3 values): yaw, pitch and roll angles in degrees, represented in camera frame

The output of the MLP is passed through a sigmoid function, resulting in a likelihood of gesture detection in [0,1], which is compared against a threshold (set to 0.5 in the current implementation) to obtain the binary classification result.
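Assembling the 26-value "relative to object" feature vector and thresholding the sigmoid output can be sketched as below. The dummy value of -999 for a missing object follows Sec. III-A; `mlp_score` would be the output of the trained network and is not implemented here:

```python
import math

# Sketch of building the 26-element "relative to object" feature vector:
# 1 object-presence flag + 11 upper-body keypoints x 2 coords (relative to
# the object centroid) + 3 head-orientation angles. The -999 dummy encoding
# for a missing object follows Sec. III-A of the paper.

def build_relative_features(obj_box, keypoints, head_angles):
    """obj_box: (cx, cy, w, h) or None; keypoints: 11 (x, y) pairs;
    head_angles: (yaw, pitch, roll) in degrees. Returns a 26-element list."""
    if obj_box is None:
        # invalid-feature encoding: angles -> -999, everything else -> 0
        return [0.0] * 23 + [-999.0] * 3
    cx, cy = obj_box[0], obj_box[1]
    rel = [c for (x, y) in keypoints for c in (x - cx, y - cy)]
    return [1.0] + rel + list(head_angles)

def classify(mlp_score, threshold=0.5):
    """Map an MLP output logit to a binary handover decision via sigmoid."""
    likelihood = 1.0 / (1.0 + math.exp(-mlp_score))
    return likelihood >= threshold

feats = build_relative_features((100, 80, 30, 30),
                                [(100 + i, 80 - i) for i in range(11)],
                                (5.0, -2.0, 0.0))
```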

III Experiments

III-A Datasets and Training

For the object detection model, Faster R-CNN is implemented with Facebook AI Research’s Detectron2 engine [15]. Detectron2 contains a collection of state-of-the-art object detection algorithms, as well as APIs to train and deploy multiple algorithms with ease. The Faster R-CNN model is loaded with COCO 2017 pre-trained weights and retrained on a subset of the COCO 2017 dataset [9]. For this implementation, the object considered in the handover process is an apple, and the Faster R-CNN model is retrained on only the images of apples in the COCO 2017 dataset. Even though our system is implemented for a single object class, our approach would also work with a generic object detector, as long as an object bounding box is generated.
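Restricting a COCO-style annotation file to a single category before retraining, as described above, can be sketched as follows. The function operates on the standard COCO JSON structure (`categories`, `annotations`, `images`); looking the category up by name avoids hard-coding a numeric id:

```python
# Sketch of filtering COCO-format annotations down to one category
# (e.g. "apple") before retraining a detector. Works on the in-memory dict
# produced by json.load() of a COCO annotation file.

def filter_coco_to_category(coco, category_name):
    """coco: dict in COCO annotation format; returns a filtered copy."""
    cats = [c for c in coco["categories"] if c["name"] == category_name]
    cat_ids = {c["id"] for c in cats}
    anns = [a for a in coco["annotations"] if a["category_id"] in cat_ids]
    img_ids = {a["image_id"] for a in anns}
    imgs = [im for im in coco["images"] if im["id"] in img_ids]
    return {"categories": cats, "annotations": anns, "images": imgs}

tiny_coco = {
    "categories": [{"id": 1, "name": "apple"}, {"id": 2, "name": "banana"}],
    "annotations": [{"id": 10, "image_id": 100, "category_id": 1},
                    {"id": 11, "image_id": 101, "category_id": 2}],
    "images": [{"id": 100}, {"id": 101}],
}
apples_only = filter_coco_to_category(tiny_coco, "apple")
```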

For the body keypoints detection, the Keypoint R-CNN model is also implemented using the Detectron2 engine [15]. The model uses the pre-trained weights provided by the engine itself. The pre-trained weights are found to be sufficiently accurate at detecting the human poses usually encountered in a handover scenario.

For head pose estimation, we used the 300W across Large Poses (300W-LP) dataset [19] to train the network; a pre-trained model is loaded during the training phase. This dataset was originally used for 3D Dense Face Alignment (3DDFA) [19], in which a convolutional neural network (CNN) fits a dense 3D model to an image. Moreover, 300W-LP contains various 3D landmarks, which are useful for training to obtain Euler angles such as yaw, pitch and roll.

For the handover gesture detection, we created a custom dataset for training the MLP model. A total of 25 videos were recorded in a lab setting, containing a total of 2506 images. Each image was manually annotated as to whether or not it depicts a handover event. We use an 81%/19% training/testing split, while enforcing balance between positive and negative examples. We also store all features extracted from the RGB images in JSON format.

The training process begins by slicing a raw RGB video into images and feeding them into the object detection, body pose estimation, and head pose estimation modules. The respective feature outputs are then processed for the MLP model: they are vectorised, and some computation is performed to express the keypoints in a frame relative to the object. In total, 26 parameters are used for training. When no object is detected, the yaw, pitch and roll in the feature vector are set to a dummy value (-999 in this case) and the rest are set to 0, to indicate an invalid feature. Binary Cross-Entropy (BCE) loss is used to train the MLP.
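The BCE objective used to train the MLP can be written out explicitly; the small epsilon clamp for numerical stability is an implementation detail assumed here, not stated in the paper:

```python
import math

# Sketch of the binary cross-entropy loss used to train the MLP.
# Predictions are sigmoid outputs in (0, 1); labels are 0 or 1.
# The eps clamp guarding log(0) is an assumed implementation detail.

def bce_loss(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy over a batch."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)

loss = bce_loss([0.9, 0.2, 0.7], [1, 0, 1])  # small loss for good predictions
```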

III-B Baselines

In addition to our approach with two variations (absolute and relative human body key points), as explained in Sec. II-D, we implemented two baselines:

  • End-to-end: AlexNet [1] is used for end-to-end image classification. A raw RGB image is provided as the input to the model for feature extraction. However, AlexNet ends up classifying all images as 0, and hence completely fails at classifying the custom dataset.

  • CNN on skeleton image: ResNet50 [7] is used for skeleton image classification. A skeleton image is generated by colouring the coordinates of each estimated keypoint, the bounding box, and the head orientation on a black background. The image is then passed into ResNet50 to perform feature extraction. ResNet50 is only able to correctly classify a portion of the images in the custom dataset.
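Rasterising the extracted features onto a black canvas for the skeleton-image baseline can be sketched as below. The marker size and single-channel layout are assumptions; the paper only states that the points, bounding box, and head orientation are drawn on a black background:

```python
import numpy as np

# Sketch of rendering keypoints onto a black canvas for the
# "CNN on skeleton image" baseline. Marker radius and the single-channel
# (grayscale) layout are illustrative assumptions.

def render_skeleton_image(keypoints, size=(224, 224), radius=2):
    """keypoints: iterable of (x, y) pixel coords; returns a uint8 image."""
    img = np.zeros(size, dtype=np.uint8)
    h, w = size
    for x, y in keypoints:
        x, y = int(round(x)), int(round(y))
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        img[y0:y1, x0:x1] = 255   # white square marker per keypoint
    return img

img = render_skeleton_image([(50, 60), (120, 90)])
```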

III-C Quantitative Results

The detection accuracy of our method (averaged over 5 random train/test splits), as well as the baselines, is shown in Table I.

Method Accuracy (%)
End-to-end 50.0
CNN on skeleton image 83.3
Ours (absolute pixels) 90.1
Ours (relative to object) 90.6
TABLE I: Handover gesture detection accuracy. Our method, with human pose represented relative to object, performs the best.
(a) Standing at the corner
(b) No object is detected (Left). Object is not in hand (Right).
Fig. 4: Some successful examples of the classification results. X and O indicate the correct label for the image.
Fig. 5: Some failure cases. False positive (Left). False negative (Right)

Our results show that the End-to-End method, which uses the raw RGB image, was no better than random guessing. This shows the difficulty of the problem, as the difference between positive and negative examples is very subtle. The CNN on Skeleton Image approach achieved a respectable 83.3% accuracy.

Both variations of our approach outperformed the baselines. Our method that uses the feature vector relative to the object had 90.6% detection accuracy, performing slightly better than the feature representation that uses absolute pixel values.

III-D Qualitative Results

Some qualitative results are shown in Fig. 4. In the experiments, the object-centric approach performs better due to its robustness. When a subject stands at the corner of the frame (Fig. 3(a)), the model can still classify the handover event accurately thanks to its object-centric nature. In addition, when the object is either not detected or not in the hand (Fig. 3(b)), the model does not classify the scene as a handover event. Nonetheless, the model is not free of errors; some failure cases are shown in Fig. 5. These may be due to the limited size of our custom dataset, in which some scenarios are not represented. The dataset can be expanded in the future, involving more subjects, to make the training more robust.

IV Conclusion

We demonstrated an approach for recognizing human-to-robot handovers with neural networks. The system takes in a single RGB frame and outputs a binary classification of whether a handover is being initiated. Several modules are incorporated: object detection, keypoint detection, head pose estimation, and a multi-layer perceptron. The system transforms the body keypoints into an object-centric frame, which allows handover situations to be detected regardless of the object’s location in the image and its surrounding environment. Overall, the system performs well, as the feature vectors can be classified accurately. Nonetheless, some errors were observed during deployment due to the limited dataset used to train the multi-layer perceptron, so further development is required. Future work includes investigating the use of other communication cues, such as verbal and gaze cues, as well as temporal information, and detection in the presence of multiple objects and people.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, Cited by: 1st item.
  • [2] A. Cosgun, M. Bunger, and H. I. Christensen (2013) Accuracy analysis of skeleton trackers for safety in hri. In Proceedings of the Workshop on Safety and Comfort of Humanoid Coworker and Assistant (HUMANOIDS), pp. 15–17. Cited by: §II-B.
  • [3] A. Cosgun, A. J. Trevor, and H. I. Christensen (2015) Did you mean this object?: detecting ambiguity in pointing gesture targets. In ACM/IEEE international conference on Human-Robot Interaction (HRI) workshop on Towards a Framework for Joint Action, Cited by: §II-B.
  • [4] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition, Cited by: §II-A.
  • [5] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §II-A.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §I, §II-B.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: 2nd item.
  • [8] J. Konstantinova, S. Krivic, A. Stilli, J. Piater, and K. Althoefer (2017) Autonomous object handover using wrist tactile information. In Towards Autonomous Robotic Systems, Y. Gao, S. Fallah, Y. Jin, and C. Lekakou (Eds.), pp. 450–463. Cited by: §I.
  • [9] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, Cited by: §III-A.
  • [10] G. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters (2016) Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks. Autonomous Robots 41, pp. . External Links: Document Cited by: §I.
  • [11] M. K. X. J. Pan, E. A. Croft, and G. Niemeyer (2018) Exploration of geometry and forces occurring within human-to-robot handovers. In 2018 IEEE Haptics Symposium (HAPTICS), Vol. , pp. 327–333. Cited by: §I.
  • [12] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I, §II-A.
  • [13] P. Rosenberger, A. Cosgun, R. Newbury, J. Kwan, V. Ortenzi, P. Corke, and M. Grafinger (2020) Object-independent human-to-robot handovers using real time robotic vision. arXiv preprint arXiv:2006.01797. Cited by: §I, §II-B.
  • [14] N. Ruiz, E. Chong, and J. M. Rehg (2018) Fine-grained head pose estimation without keypoints. In IEEE conference on computer vision and pattern recognition workshops, Cited by: §I, §II-C.
  • [15] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2, accessed July 17, 2020. Cited by: §III-A, §III-A.
  • [16] K. Yamane, M. Revfi, and T. Asfour (2013) Synthesizing object receiving motions of humanoid robots with human motion database. In IEEE International Conference on Robotics and Automation, Cited by: §I.
  • [17] W. Yang, C. Paxton, M. Cakmak, and D. Fox (2020) Human grasp classification for reactive human-to-robot handovers. arXiv preprint arXiv:2003.06000. Cited by: §I.
  • [18] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §II-C.
  • [19] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li (2016) Face alignment across large poses: a 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 146–155. Cited by: §III-A.