Robots are under rapid development in sectors such as manufacturing, automation and hospitality. Collaborative robots are increasingly being used in industry, and are expected to be useful in home environments in the future. One of the expected capabilities for collaborative robots is the ability to perform object handovers. Object handovers can happen in two directions: Robot-to-Human, where the robot delivers a requested object to a human, and Human-to-Robot, where the robot acquires an object from a human. The object handover problem has been studied extensively in the robotics literature, with most of the published work focusing on the Robot-to-Human handover scenario. The Human-to-Robot handover scenario is arguably the more difficult problem because, as opposed to the Robot-to-Human scenario, the robot is responsible for grasping the object from the human partner’s hand, which raises safety issues. This makes perception critical for the Human-to-Robot handover scenario. The perception system is responsible for detecting when and where a human partner wants to engage in a handover, detecting the object the human wants to pass, and ensuring that a finger or body part of the human is not grasped. For example, a human might be holding a cellphone near the robot without intending to hand it over, and the robot should not unexpectedly reach out and grasp it from the human’s hand. In this paper, we focus on detecting whether a human intends to initiate a Human-to-Robot object handover.
Human-to-Robot handovers have been demonstrated on a physical robot system by a handful of researchers [16, 10, 8, 11, 17, 13]. Even though successful demonstrations have been shown, in these works the robot is programmed for a single task: object handovers. In real scenarios where the robot is expected to engage in many tasks, the robot must first recognize the action of the human handing an object over to the robot. There are many communication cues that humans use to recognize handover intent, including direct cues such as verbal communication, and indirect cues such as eye gaze and body gestures. In this work, we focus on the binary classification of whether a human partner intends to hand over an object, using body gestures only.
Our approach is based on extracting features relevant to the task from three independent modules, and learning a classifier to detect the existence of a handover gesture, based on the extracted features. Each of these modules is a neural network, described below:
Object Detector: detects the presence and pixel coordinates of the bounding box of an object
Human Body Pose Detector: detects the human body pose with specific key points
Head Orientation Detector: estimates the head orientation of the person, represented as Euler angles with respect to the camera frame
We train a deep neural network that converts the feature vector generated by the modules into a binary classification result. Our system uses the Faster Region-Based Convolutional Neural Network (Faster R-CNN) for object detection, Keypoint R-CNN to estimate the human body pose, a multi-loss ResNet50 architecture to estimate the head pose of a person, and a fully connected deep network to produce the final recognition result. We represent the human body keypoints in the object coordinate frame so that the detection is independent of where the human is in the image frame. Our approach is 1) modular, because each module can be replaced as long as the output type is the same, and 2) generalizable, because it is independent of the object and human appearance, and of the relative position of the human in the image.
Our approach uses a total of four modules, as shown in the system diagram (Fig. 2). The input RGB image is passed into three neural networks in parallel to obtain the relevant features: an object detector detects the pixel coordinates of the object of interest, multiple keypoints of a person’s body are detected, and the head pose of the person is estimated. The resulting features are processed and compressed into a feature vector, which is then passed into the final multi-layer perceptron (MLP) to generate a binary output: the estimate of whether the human intends to initiate a handover.
II-A Object Detection
Faster R-CNN is used to detect the pixel coordinates of an object held by a person. When an object is detected, this module generates a bounding box. If multiple objects are detected, the object with the largest bounding box area is selected as the target. Faster R-CNN is a modification of the original Region-Based CNN (R-CNN) and the improved Fast R-CNN. R-CNN uses a selective search algorithm to propose regions of interest in the image. These regions are then individually passed into a CNN, which extracts the features of each region. These features are subsequently passed into an SVM, which determines the presence of the object in the proposed regions. Fast R-CNN improves on R-CNN by passing the whole image through the CNN first to extract a feature map, from which the features of each region proposal are then pooled. Fast R-CNN is faster than R-CNN because the image is passed through the CNN only once, whereas in R-CNN each of the many region proposals is passed through the CNN separately, slowing down the entire network. Faster R-CNN takes this improvement further by replacing the selective search algorithm that R-CNN and Fast R-CNN use to generate region proposals with a region proposal network.
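The target selection rule (largest bounding box area wins) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the `(x_min, y_min, x_max, y_max)` box layout and the function names `box_area` and `select_target` are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def select_target(boxes: List[Box]) -> Optional[Box]:
    """Return the detection with the largest area, or None if nothing was detected."""
    if not boxes:
        return None
    return max(boxes, key=box_area)


# Three hypothetical detections; the 120 x 120 box has the largest area.
detections = [(10, 10, 50, 40), (100, 80, 220, 200), (5, 5, 15, 15)]
target = select_target(detections)
print(target)
```

Returning `None` for an empty detection list keeps the "no object" case explicit for whatever consumes the result.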
II-B Body Keypoints Detection
Mask R-CNN is used to detect various keypoints of a human body, such as the shoulders, elbows and wrists. Mask R-CNN is a modification of Faster R-CNN that, in parallel to the class and bounding box prediction, adds a branch that outputs a binary mask for each Region of Interest (RoI). Because a mask is predicted for each RoI separately, Mask R-CNN performs instance segmentation, allowing the model to identify the boundaries of objects at the pixel level.
For keypoints detection, each keypoint is treated as a separate class during the training process. The segmentation branch of Mask R-CNN outputs k binary masks representing k different keypoints. In each binary mask, only one pixel is labelled as the foreground by the model representing the location of the keypoint. The coordinates of each of these points are then taken as the estimation of the pose. This modification of the Mask R-CNN model is called Keypoint R-CNN.
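As an illustration of how k masks become k coordinates, the sketch below takes the highest-scoring pixel of each mask as the keypoint location. It assumes each mask is available as a 2D grid of scores indexed `[row][col]`; with a strictly one-hot binary mask, the maximum is simply the single foreground pixel. The names `keypoint_from_map` and `pose_from_maps` are hypothetical.

```python
from typing import List, Tuple

ScoreMap = List[List[float]]  # one map per keypoint, indexed [row][col]


def keypoint_from_map(score_map: ScoreMap) -> Tuple[int, int]:
    """Return (x, y) of the highest-scoring pixel in one keypoint map."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y


def pose_from_maps(maps: List[ScoreMap]) -> List[Tuple[int, int]]:
    """One (x, y) estimate per keypoint map."""
    return [keypoint_from_map(m) for m in maps]


# Two tiny 3 x 4 one-hot masks standing in for k = 2 keypoints.
masks = [
    [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]],
]
pose = pose_from_maps(masks)
print(pose)
```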
II-C Head Pose Estimation
The multi-loss ResNet50 architecture proposed by Ruiz et al. is utilised for head pose estimation. The network uses one loss for each Euler angle, giving three losses in total for yaw, pitch and roll. Each loss combines a binned pose classification term and a regression term. ResNet50 is used as the backbone of the network, and three fully connected layers are attached to it for the Euler angle predictions. A softmax layer and a cross-entropy loss are used for bin classification, so there are three cross-entropy losses in total. A mean squared error loss is added to each as the regression term, and the combined losses are back-propagated through the network. In addition, the multi-task cascaded convolutional network (MTCNN), proposed by Zhang et al., is used for face detection before the head pose of the person is estimated.
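The per-angle loss described above can be sketched as a cross-entropy term over angle bins plus a mean squared error term on a continuous angle. Several details here are assumptions not stated in the text: the bin layout (66 bins of 3 degrees covering -99 to 99 degrees), the weight `ALPHA` on the regression term, and the choice to obtain the continuous angle as the expectation of the bin centres under the softmax distribution.

```python
import math
from typing import List, Sequence

NUM_BINS = 66      # assumed number of angle bins
BIN_WIDTH = 3.0    # assumed bin width in degrees
ANGLE_MIN = -99.0  # assumed lower edge of the first bin
ALPHA = 1.0        # assumed weight of the regression term


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def angle_to_bin(angle_deg: float) -> int:
    """Index of the bin containing the angle, clamped to the valid range."""
    idx = int((angle_deg - ANGLE_MIN) // BIN_WIDTH)
    return min(max(idx, 0), NUM_BINS - 1)


def expected_angle(probs: Sequence[float]) -> float:
    """Expectation of the bin centres under the predicted distribution."""
    return sum(p * (ANGLE_MIN + (i + 0.5) * BIN_WIDTH) for i, p in enumerate(probs))


def angle_loss(logits: Sequence[float], target_deg: float) -> float:
    """Cross-entropy on the target bin plus weighted squared error on the angle."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[angle_to_bin(target_deg)])
    squared_error = (expected_angle(probs) - target_deg) ** 2
    return cross_entropy + ALPHA * squared_error


def head_pose_loss(logits_per_angle: Sequence[Sequence[float]],
                   targets_deg: Sequence[float]) -> float:
    """Sum of the three per-angle losses (yaw, pitch, roll)."""
    return sum(angle_loss(l, t) for l, t in zip(logits_per_angle, targets_deg))


# A confident, correct prediction should cost less than a uniform one.
uniform = [0.0] * NUM_BINS
peaked = [0.0] * NUM_BINS
peaked[angle_to_bin(1.5)] = 20.0
print(angle_loss(peaked, 1.5) < angle_loss(uniform, 1.5))
```

Keeping the three angles in separate calls to `angle_loss` mirrors the one-loss-per-angle structure of the text.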
II-D Gesture Recognition from Extracted Features
We used a Multi-Layer Perceptron that consists of an input layer, four hidden layers, and an output layer. The network architecture is illustrated in Fig. 3. Features extracted by the previous modules are processed and concatenated into a feature vector before being passed into the MLP. Only keypoints of the upper body of a person are included in the feature vector. We implemented two ways to build the feature vector, according to how the body keypoints are represented: relative to the object, or in absolute pixel coordinates. Below we explain each method.
Absolute Pixels: The input feature vector has a size of 29.
Object (4 values): the pixel coordinates of the centroid, width and height of the object bounding box
Absolute Human body keypoints (22 values): pixel coordinates of each of the 11 keypoints belonging to upper body
Head orientation (3 values): yaw, pitch and roll angles in degrees, represented in camera frame
Relative to Object: The input feature vector has a size of 26.
Object presence (1 value): Binary value of whether an object of interest is detected.
Relative human body keypoints (22 values): pixel coordinates of each of the 11 keypoints belonging to upper body, represented in the 2D coordinate frame attached to the target object centroid
Head orientation (3 values): yaw, pitch and roll angles in degrees, represented in camera frame
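The two feature layouts can be sketched as follows for the case where an object has been detected. The ordering of values inside each vector follows the listing order above, which is an assumption, as are the function names and the `(cx, cy, width, height)` box layout.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
NUM_KEYPOINTS = 11  # upper-body keypoints, as stated in the text


def flatten(points: Sequence[Point]) -> List[float]:
    """[(x0, y0), (x1, y1), ...] -> [x0, y0, x1, y1, ...]."""
    return [coord for point in points for coord in point]


def check_inputs(keypoints: Sequence[Point], head_pose: Sequence[float]) -> None:
    if len(keypoints) != NUM_KEYPOINTS or len(head_pose) != 3:
        raise ValueError("expected 11 keypoints and 3 head angles")


def absolute_features(box: Sequence[float], keypoints: Sequence[Point],
                      head_pose: Sequence[float]) -> List[float]:
    """4 box values + 22 absolute keypoint values + 3 angles = 29 values."""
    check_inputs(keypoints, head_pose)
    return list(box) + flatten(keypoints) + list(head_pose)


def relative_features(box: Sequence[float], keypoints: Sequence[Point],
                      head_pose: Sequence[float]) -> List[float]:
    """1 presence flag + 22 object-relative keypoint values + 3 angles = 26 values."""
    check_inputs(keypoints, head_pose)
    cx, cy = box[0], box[1]
    shifted = [(x - cx, y - cy) for x, y in keypoints]
    return [1.0] + flatten(shifted) + list(head_pose)


# Hypothetical detections for one frame.
box = (320.0, 240.0, 40.0, 40.0)  # assumed layout: (cx, cy, width, height)
keypoints = [(300.0 + 5 * i, 200.0 + 3 * i) for i in range(NUM_KEYPOINTS)]
head_pose = (10.0, -5.0, 2.0)     # yaw, pitch, roll in degrees

print(len(absolute_features(box, keypoints, head_pose)))
print(len(relative_features(box, keypoints, head_pose)))
```

Shifting the object and the person by the same pixel offset leaves the relative vector unchanged, which is the property the object-centric representation is meant to provide.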
The output of the MLP is passed through a sigmoid function, resulting in a likelihood of gesture detection in the range [0, 1]. This value is compared against a threshold (set to 0.5 in the current implementation) to obtain the binary classification result.
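A forward pass through such a network, followed by the sigmoid and the 0.5 threshold, might look like the sketch below. The hidden layer widths, the ReLU activation, the random initialisation, and the use of `>=` at the threshold are all assumptions; the text only fixes the number of hidden layers (four), the sigmoid output, and the threshold value.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], bias[out])

# 26 inputs, four hidden layers (widths are assumptions), one output.
LAYER_SIZES = [26, 64, 64, 32, 16, 1]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """Affine map: weights @ x + bias."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def build_mlp(seed: int = 0) -> List[Layer]:
    """Randomly initialised layers; a trained model would load learned weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict_proba(layers: List[Layer], features: Sequence[float]) -> float:
    """ReLU hidden layers (assumed), sigmoid on the single output."""
    x = list(features)
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, layer)]
    return sigmoid(dense(x, layers[-1])[0])


def is_handover(layers: List[Layer], features: Sequence[float],
                threshold: float = 0.5) -> bool:
    return predict_proba(layers, features) >= threshold


mlp = build_mlp()
example = [0.0] * 26
print(predict_proba(mlp, example), is_handover(mlp, example))
```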
III-A Datasets and Training
For the object detection model, Faster R-CNN is implemented with Facebook AI Research’s Detectron2 engine. The Detectron2 engine contains a collection of state-of-the-art object detection algorithms, as well as APIs to train and deploy multiple algorithms with ease. The Faster R-CNN model is loaded with the COCO 2017 pre-trained weights and retrained on a subset of the COCO 2017 dataset. For this implementation, the object considered in the handover process is an apple, and the Faster R-CNN model is retrained on just the images of apples in the COCO 2017 dataset. Even though our system is implemented for a single object class, it will also work with generic object detectors as long as an object bounding box is generated.
For the body keypoints detection, the Keypoint R-CNN model is also implemented using the Detectron2 engine. The model uses the pre-trained weights provided by the engine itself. The pre-trained weights are found to be sufficiently accurate at detecting the human poses usually encountered in a handover scenario.
For head pose estimation, we used the 300W across Large Poses (300W-LP) dataset to train the network. The pre-trained model is loaded during the training phase. This dataset was originally used for 3D Dense Face Alignment (3DDFA), in which a convolutional neural network (CNN) is used to fit a dense 3D model to an image. Moreover, 300W-LP contains various 3D landmarks, which are useful for training the network to predict the yaw, pitch and roll Euler angles.
For the handover gesture detection, we created a custom dataset for training the MLP model. A total of 25 videos were recorded in a lab setting, containing a total of 2506 images. Each image was manually annotated as to whether or not it shows a handover event. We use an 81%/19% training/testing split, while enforcing a balance between positive and negative examples. We also store all features extracted from the RGB images in JSON format.
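One way to obtain an 81%/19% split that keeps positive and negative examples balanced is to split each class separately, as sketched below. This is only one reading of the balancing described above; whether the split was made per frame or per video is not stated, and the function name `stratified_split` and the toy label counts are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_fraction: float = 0.81,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split example indices so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train += indices[:cut]
        test += indices[cut:]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy labels: 100 handover frames and 200 non-handover frames.
labels = [1] * 100 + [0] * 200
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Changing `seed` yields a different random split, which is how several random train/test splits could be generated for averaging.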
The training process begins with slicing a raw RGB video into images and feeding them into the object detection, body pose estimation, and head pose estimation modules sequentially. The respective feature outputs are then vectorised and passed into the MLP model. At this stage, the body keypoints are also transformed into the coordinate frame of the object. In total, the feature vector used for training has 26 values. When no object is detected, the yaw, pitch and roll in the feature vector are set to a dummy value (-999 in this case) and the rest are set to 0, to indicate an invalid feature. The MLP is trained with a Binary Cross Entropy (BCE) loss.
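The invalid-feature convention and the BCE loss can be sketched as follows. The placement of the three head angles at the end of the 26-value vector and the clipping constant `eps` are assumptions made for the example; the loss itself is the standard mean binary cross entropy.

```python
import math
from typing import List, Sequence

FEATURE_SIZE = 26
INVALID_ANGLE = -999.0  # dummy value named in the text


def invalid_feature() -> List[float]:
    """Vector used when no object is detected: zeros, then three dummy angles.

    Placing the angles in the last three slots is an assumption.
    """
    return [0.0] * (FEATURE_SIZE - 3) + [INVALID_ANGLE] * 3


def bce_loss(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


print(invalid_feature()[-3:])
print(bce_loss([0.9, 0.1], [1, 0]) < bce_loss([0.1, 0.9], [1, 0]))
```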
In addition to our approach with two variations (absolute and relative human body key points), as explained in Sec. II-D, we implemented two baselines:
End-to-end: AlexNet is used for end-to-end image classification. A raw RGB image is provided as the input to the model, which performs the feature extraction itself. However, AlexNet ends up classifying all images as 0, and hence it completely fails at classifying the custom dataset.
CNN on skeleton image: ResNet50 is used for skeleton image classification. A skeleton image is generated by colouring the coordinates of each estimated keypoint, the bounding box, and the head orientation on a black background. The image is then passed into ResNet50 to perform feature extraction. ResNet50 classifies only part of the custom dataset correctly.
III-C Quantitative Results
The detection accuracy of our method (averaged over 5 random train/test splits) and of the baselines is shown in Table I.
|Method|Accuracy (%)|
|---|---|
|CNN on skeleton image|83.3|
|Ours (absolute pixels)|90.1|
|Ours (relative to object)|90.6|
Our results show that the End-to-End method, which uses the raw RGB image, was no better than random guessing. This shows the difficulty of the problem, as the difference between positive and negative examples is very subtle. The CNN on Skeleton Image approach achieved a respectable 83.3% accuracy.
Both variations of our approach outperformed the baselines. Our method that uses the feature vector relative to the object had 90.6% detection accuracy, performing slightly better than the feature representation that uses absolute pixel values.
III-D Qualitative Results
Some qualitative results are shown in Fig. 4. In our experiments, the object-centric approach performs better because it is robust to where the person appears in the image. When a subject stands at the corner (Fig. 3(a)), the model can still classify the handover event accurately due to its object-centric nature. In addition, when the object is neither detected nor in hand (Fig. 3(b)), the model does not classify the frame as a handover event. Nonetheless, the model is not free from errors. Some failure cases are shown in Fig. 5. These may be due to the limited size of our custom dataset, which does not cover some scenarios. In the future, the dataset can be enlarged with more subjects to make training more robust.
We demonstrated an approach for recognizing human-to-robot handovers with neural networks. The system takes in a single RGB frame and outputs a binary classification of whether a human-to-robot handover is being initiated. It combines several modules: object detection, keypoint detection, head pose estimation and a multi-layer perceptron. The system expresses the body keypoints in an object-centric frame, which allows handover situations to be detected regardless of the object’s location and its surrounding environment. Overall, the system performs well, classifying the feature vectors with about 90% accuracy. Nonetheless, some errors can be observed during deployment due to the limited dataset used to train the multi-layer perceptron, and hence further development will be required. Future work includes investigating the use of other communication cues such as speech and gaze, the use of temporal information, and detection in the presence of multiple objects and people.
- [1] (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems.
- [2] (2013) Accuracy analysis of skeleton trackers for safety in hri. In Proceedings of the Workshop on Safety and Comfort of Humanoid Coworker and Assistant (HUMANOIDS), pp. 15–17.
- [3] (2015) Did you mean this object?: detecting ambiguity in pointing gesture targets. In ACM/IEEE international conference on Human-Robot Interaction (HRI) workshop on Towards a Framework for Joint Action.
- [4] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation.
- [5] (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448.
- [6] (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
- [7] (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- [8] (2017) Autonomous object handover using wrist tactile information. In Towards Autonomous Robotic Systems, Y. Gao, S. Fallah, Y. Jin, and C. Lekakou (Eds.), pp. 450–463.
- [9] (2014) Microsoft coco: common objects in context. In European conference on computer vision.
- [10] (2016) Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks. Autonomous Robots 41.
- [11] (2018) Exploration of geometry and forces occurring within human-to-robot handovers. In 2018 IEEE Haptics Symposium (HAPTICS), pp. 327–333.
- [12] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
- [13] (2020) Object-independent human-to-robot handovers using real time robotic vision. arXiv preprint arXiv:2006.01797.
- [14] (2018) Fine-grained head pose estimation without keypoints. In IEEE conference on computer vision and pattern recognition workshops.
- [15] (2019) Detectron2. https://github.com/facebookresearch/detectron2 (Accessed: July 17, 2020).
- [16] (2013) Synthesizing object receiving motions of humanoid robots with human motion database. In IEEE International Conference on Robotics and Automation.
- [17] (2020) Human grasp classification for reactive human-to-robot handovers. arXiv preprint arXiv:2003.06000.
- [18] (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503.
- [19] (2016) Face alignment across large poses: a 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 146–155.