Recognition and Localisation of Pointing Gestures using a RGB-D Camera

by   Naina Dhingra, et al.

Non-verbal communication is part of our regular conversation, and multiple gestures are used to exchange information. Among those gestures, pointing is the most important one. If such gestures cannot be perceived by other team members, e.g. by blind and visually impaired people (BVIP), they lack important information and can hardly participate in a lively workflow. Thus, this paper describes a system for detecting such pointing gestures to provide input for suitable output modalities to BVIP. Our system employs an RGB-D camera to recognize the pointing gestures performed by the users. The system also locates the target of pointing, e.g. on a common workspace. We evaluated the system by conducting a user study with 26 users. The results show success rates of 89.59% and 79.92% for the left and right arm on the 2 x 3 grid, and 73.57% and 68.99% for the left and right arm on the 3 x 4 grid, respectively.






1 Introduction

In a meeting environment where sighted people and BVIP work together, sighted people tend to perform habitual gestures, the most common being facial expressions, hand gestures, pointing gestures, and eye gaze. In total, 136 gestures [1] are considered part of non-verbal communication (NVC). They need to be understood together with the verbal communication to grasp the complete meaning of a conversation. However, BVIP miss the information conveyed by visual gestures [4]. To understand the meaning of a pointing gesture, it is crucial to know where a person is pointing. Pointing gestures are the most common ones in non-verbal communication, and they become important in meetings where speakers point towards objects in the room, or at artefacts on a whiteboard, as a reference to their speech. However, these pointing gestures are not accessible to BVIP, who thus lack important information during a conversation within a team meeting. To address this issue, we developed a system that automatically detects pointing gestures and determines the position a person is pointing at. However, although NVC is easily understood by humans, it is difficult for machines to recognize and interpret it reliably [3] and to avoid false alerts to the BVIP.

The main contributions of this paper are as follows: (1) We developed an autonomous system using OpenPTrack and ROS (Robot Operating System) to detect and localise the position of a pointing gesture. (2) We designed our system to work in real time and performed experiments using 2 x 3 and 3 x 4 grids. (3) We conducted a user study with 26 users to evaluate our system. We expect that our work will help researchers integrate BVIP into team meetings.

This paper is organized as follows: Section 2 describes the state of the art in the OpenPTrack software and in pointing and related gestures. Section 3 describes the methods and techniques used in our system, while Section 4 gives an overview of the user study conducted and the setup of the built system. Finally, Section 5 discusses the results obtained from the user study as well as the accuracies achieved by our system in detecting and localizing pointing gestures.

2 State of the Art

2.1 OpenPTrack

OpenPTrack is an open-source software for tracking people and calibrating a multi-RGB-D camera setup [7]. It can track multiple people at the frame rate of the sensor, and it can employ groups of heterogeneous 3D sensors. OpenPTrack uses a calibration procedure that relies on the networking and communication capabilities of ROS. Past detection and tracking systems exploited the color and depth information of a user, since cheap RGB-D sensors are available, but such software was limited to single-camera tracking; these systems did not use multiple cameras and could not be deployed in distributed settings. For our system, we use OpenPTrack since it allows expanding our pointing gesture system to a multiple-camera setup.

2.2 Pointing Gesture Recognition

Pointing gestures can be measured in different ways. Initially, glove-based techniques were used to sense the gesture performed by the hand [10]. Nowadays, computer vision based techniques [3, 11] or Hidden Markov Models (HMMs) [12] are used for detection. In particular, cascaded HMMs along with a particle filter were used for pointing gesture detection in [9]. Their first-stage HMM takes an estimate of the hand position and maps it to a precise position by modeling the kinematic features of the pointing finger. The output 3D coordinates are fed into a second-stage HMM that differentiates pointing gestures from other types of gestures. This technique requires a long processing time and a large training dataset.

Deep learning [6] has been successfully used in various applications of computer vision, which has inspired its use for gesture and body pose estimation as well [8]. Deep learning approaches have also been applied to pointing gesture recognition in [5]. However, they require a large training dataset and only work with the specific data type on which they were trained.

Our goal is to solve pointing gesture recognition for BVIP more robustly. Using deep learning approaches would have required a large training dataset to make them applicable to different setups, such as a variety of meeting room layouts with a different number of people interacting at the same time. Thus, we chose a traditional approach using mathematical geometry and feature localisation. First, a Kinect sensor along with OpenPTrack is used to locate the body joints. Next, a geometric transformation is applied to obtain the spatial position of the pointing gesture's target. This position is classified into 6 fields (for the 2 x 3 grid) and into 12 fields (for the 3 x 4 grid).

3 Methodology

The implementation of our pointing gesture recognition and localization system is based on OpenPTrack [2]. Using a Kinect v2 as sensor, this software allows person tracking in real time over large areas. However, since this framework is not capable of directly detecting pointing gestures or other behavioral features of a user, we also forward the data to ROS. By doing so, we can obtain the joints' coordinates in space for human gestures such as pointing. The main idea is that different ROS packages can be implemented that contain so-called nodes, units that perform logic and computation for different parts of a robot, e.g. controlling actuators or transforming and changing the resolution of images provided by a sensor. The different ROS nodes can communicate with each other in order to share information needed for the functioning of the whole system, which is done via topics. Every node, such as the implementation of a pointing gesture recognizer for blind users, can subscribe to a topic to receive information or publish on a topic to share its content. OpenPTrack thus uses ROS to allow the information provided by the Kinect to be further processed. The joints' x, y and z coordinates are published on a topic which also contains different IDs for the different joints. The coordinate transformations from the sensor's reference frame to the world reference frame are performed using the ROS package tf, which rotates and translates the reference frames to the desired positions.

A deictic (or pointing) gesture consists of the movement of an arm to point at a target in space and to highlight it by this gesture for other people, without necessarily having to verbally describe its position exactly. The joint coordinates to be obtained are thus those of elbow and hand, since these represent the human forearm and hence the major components for pointing. To define a pointing action, the link connecting the two aforementioned joints is measured and named the pointing vector, as shown in Figure 1.


Figure 1: Left: Pointing gesture with pointing vector. Right: Stabilization time of a pointing gesture, where dr/dt is the change in the circle's diameter.

Equation 1 is used to locate the position of the target on a vertical plane (e.g. a whiteboard) the user is pointing at:

t = x_h + [ (p_0 - x_h) · n / ((x_h - x_e) · n) ] (x_h - x_e)    (1)

where p_0 is a predefined point on the plane, n is the normal vector to the plane, and x_h and x_e are the positions of the hand and elbow joints, respectively.
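The computation behind Equation 1 is a ray-plane intersection: a ray starts at the hand joint, extends along the elbow-to-hand direction, and is intersected with the board plane. A minimal sketch in plain Python (the joint coordinates and plane parameters below are illustrative values, not measurements from the paper):

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))

def point_at(elbow, hand, p0, n):
    """Intersect the ray from the hand along the forearm direction
    (hand - elbow) with the plane through p0 with normal n.
    Returns the 3D target point, or None if no forward intersection exists."""
    d = tuple(h - e for h, e in zip(hand, elbow))  # pointing vector
    denom = dot(d, n)
    if abs(denom) < 1e-9:
        return None  # forearm is parallel to the board plane
    s = dot(tuple(p - h for p, h in zip(p0, hand)), n) / denom
    if s < 0:
        return None  # user points away from the board
    return tuple(h + s * di for h, di in zip(hand, d))

# Example: board plane at x = 2 m, forearm pointing along +x
target = point_at(elbow=(0.0, 1.0, 0.0), hand=(0.5, 1.0, 0.0),
                  p0=(2.0, 0.0, 0.0), n=(1.0, 0.0, 0.0))
print(target)  # (2.0, 1.0, 0.0)
```

In practice the hand and elbow coordinates would come from the OpenPTrack joint stream after the tf transformation into the world frame.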

OpenPTrack defines all measurements in a world reference frame. To handle the definition of the world reference frame in OpenPTrack, the tf package of ROS is used, which is a predefined package for coordinate transformations using rotation matrices and quaternions. The next step is to define a whiteboard/matrix plane coordinate frame in order to obtain the measured target point on it. This plane coordinate frame is obtained by applying a rotation matrix relative to the world coordinate frame. The output values from OpenPTrack are converted into the whiteboard/matrix plane coordinate frame. These converted values are then analysed by applying hard limits for each box in the matrix along both the horizontal and the vertical direction. All of these values are evaluated at run time.
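Once the target point is expressed in the board-plane coordinate frame, the hard limits reduce to bounding checks along both axes. A hedged sketch of this classification step (the board dimensions are taken from Section 4, assuming the 1910 mm side is horizontal; the row-major cell numbering is an assumption, as the paper does not specify the numbering scheme):

```python
def target_to_box(u, v, cols, rows, width=1.91, height=1.29):
    """Map board-plane coordinates (u, v) in metres, origin at the
    top-left corner, to a box number 1..cols*rows (row-major),
    or None if the point lies outside the board."""
    if not (0 <= u < width and 0 <= v < height):
        return None  # target missed the board entirely
    col = int(u / (width / cols))
    row = int(v / (height / rows))
    return row * cols + col + 1

# 2 x 3 grid: a point near the board centre, upper half
print(target_to_box(0.96, 0.3, cols=3, rows=2))  # 2 (top row, centre box)
```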

Before the system can report which target a user is pointing at, it is necessary to wait about 3-3.5 s for the pointing gesture to stabilize. This waiting time allows a user to reach a stable pointing gesture without moving or vibrating his/her arm. The stable output of the system is achieved after the settling time shown in Figure 1. It also has to be noted that the pointing gesture becomes unstable again after a certain time period.
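This settling behaviour can be detected programmatically by watching the spread of recent target estimates: the gesture is considered stable once the diameter of the circle enclosing the last few samples stops shrinking (dr/dt ≈ 0 in Figure 1). A simplified one-dimensional sketch of the idea (window size and threshold are illustrative choices, not the paper's parameters):

```python
from collections import deque

class StabilityDetector:
    """Declare a pointing gesture stable once the spread of the most
    recent target samples stays below a threshold."""

    def __init__(self, window=10, max_spread=0.05):
        self.samples = deque(maxlen=window)
        self.max_spread = max_spread

    def update(self, value):
        """Add a new target sample; return True once the gesture is stable."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        return (max(self.samples) - min(self.samples)) < self.max_spread

det = StabilityDetector()
readings = [0.9, 0.5, 0.3, 0.2, 0.15, 0.12, 0.11, 0.11, 0.10, 0.10,
            0.10, 0.11, 0.10, 0.10, 0.11, 0.10]
stable = [det.update(v) for v in readings]
print(stable[-1])  # True once the arm has settled
```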

4 Experimental Setup

The setup resembles an environment in which sighted users have to perform pointing gestures, which are automatically recognized by our system. The pointing gesture's target is then determined and can be provided to a suitable output device for BVIP. The experiments consist of four parts: two studies using the left arm and two using the right arm for pointing. The pointing gestures had to be performed on two different grid sizes in order to evaluate the accuracy of our system, i.e., on a 2 x 3 and a 3 x 4 grid printed on the board.

The setup is shown in Figure 2. The board has the dimensions 1290 mm x 1910 mm and was mounted 1000 mm above the ground. The Kinect sensor was placed centered at a height of 1300 mm above the top edge of the board. Each box in the grid was numbered. The user had to stand at a constant distance of 1.5 m, centered in front of the board. The user was then asked to point towards the numbers following a given sequence told by the experimenter, and to point for a few seconds to achieve a stable gesture before moving to the next number in the sequence. The stability time procedure is illustrated in Figure 1. After the user was prompted to point at a certain box and the wait time from Figure 1 had passed, the measured target number was recorded.

Figure 2: Left: Measurement setup; Right: Experimental setup of the system. The Kinect is placed above the board having the matrix of numbers for the user to point at.

5 User Study and Results

The system was evaluated in a user study with 26 participants. Different parameters such as handedness, user's height, and arm length were measured. Since a user's pointing is significantly influenced by the pointing stability, this also impacts the accuracy of our system, resulting in noticeable differences between the 2 x 3 and the 3 x 4 grids. The error increases with decreasing box size, i.e. it is larger for the 3 x 4 grid. The confusion matrix in Figure 3 (left) gives an overview of the percentage of correct pointing at a target number in the matrix using the left arm. Similarly, Figure 3 (right) shows the quantitative values for the right arm on the 2 x 3 matrix, Figure 4 for the left arm pointing at the 3 x 4 matrix, and Figure 5 for the right arm pointing at the 3 x 4 matrix.

Figure 3: Pointing accuracy for the left/right arm using the 2 x 3 grid.
Figure 4: Pointing accuracy for the left arm using the 3 x 4 grid.
Figure 5: Pointing accuracy for the right arm using the 3 x 4 grid.

Table 1 shows the accuracy values achieved in the four experiments, i.e., (1) left arm using the 2 x 3 matrix, (2) right arm using the 2 x 3 matrix, (3) left arm using the 3 x 4 matrix, and (4) right arm using the 3 x 4 matrix. The accuracy is calculated by converting the output of the system to a binary value, i.e., 1 if the output of the system was correct, and 0 otherwise. The total number of correct results is then divided by the total number of trials in the experiment and multiplied by 100 to obtain the accuracy percentage.
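This accuracy computation amounts to the mean of a binary correctness indicator, expressed as a percentage. For illustration (the trial data below are made up, not the study's recordings):

```python
def pointing_accuracy(predicted, expected):
    """Percentage of trials in which the system reported the correct box."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * correct / len(expected)

# 8 hypothetical trials, one of which the system got wrong
print(pointing_accuracy([1, 2, 3, 4, 5, 6, 1, 2],
                        [1, 2, 3, 4, 5, 6, 2, 2]))  # 87.5
```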

Grid    Left Arm    Right Arm
2 x 3   89.59 %     79.92 %
3 x 4   73.57 %     68.99 %

Table 1: Accuracy for the experiments performed in the user study, using the 2 x 3 and 3 x 4 matrix and using the left and the right arm.

For both grid sizes, pointing with the left arm resulted in a higher accuracy. This could be caused by an inherent asymmetry within the Kinect v2: the IR emitter is centered in the Kinect housing, while the IR receiver is off-center. This leads to a camera perspective that sees the left arm slightly better than the right one, i.e. the left arm is measured as slightly longer than the right one.

6 Conclusion

We worked on automatic pointing gesture detection and pointing target localization in a meeting environment. A prototype of the automatic system was built and tested by conducting a user study. The output of this system will be converted to a suitable modality that helps BVIP obtain this additional information. Although our application only requires good performance on the 2 x 3 grid, the system shows high precision for small areas of the localizer function and performs well for both the 2 x 3 and the 3 x 4 grid in all four experiments. We also found that the localizer output stabilizes after around 3 seconds, and that the user's arm starts to vibrate again after a certain interval. Our user study further showed that the user's height had little effect on the performance; very short or very long arms led to a small decrease in accuracy.

In the future, the output of our system will be converted by a suitable haptic interface helping BVIP to access these pointing gestures. Moreover, we will expand our system with multiple cameras, and we will have several users pointing simultaneously. Also, we will improve the system to have a more symmetrical output, i.e., the same performance for pointing with the left and the right arm.


This work has been supported by the Swiss National Science Foundation (SNF) under the grant no. 200021E 177542 / 1. It is part of a joint project between TU Darmstadt, ETH Zurich, and JKU Linz with the respective funding organizations DFG (German Research Foundation), SNF (Swiss National Science Foundation) and FWF (Austrian Science Fund).


  • [1] C. R. Brannigan and D. A. Humphries (1972) Human non-verbal behavior, a means of communication. Ethological studies of child behavior, pp. 37–64. Cited by: §1.
  • [2] M. Carraro, M. Munaro, J. Burke, and E. Menegatti (2018) Real-time marker-less multi-person 3d pose estimation in rgb-depth camera networks. In International Conference on Intelligent Autonomous Systems, pp. 534–545. Cited by: §3.
  • [3] N. Dhingra and A. Kunz (2019) Res3ATN: deep 3D residual attention network for hand gesture recognition in videos. In International Conference on 3D Vision (3DV 2019), pp. 491–501. Cited by: §1, §2.2.
  • [4] S. Günther, R. Koutny, N. Dhingra, M. Funk, C. Hirt, K. Miesenberger, M. Mühlhäuser, and A. Kunz (2019) MAPVI: meeting accessibility for persons with visual impairments. In Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments, pp. 343–352. Cited by: §1.
  • [5] Y. Huang, X. Liu, X. Zhang, and L. Jin (2016) A pointing gesture based egocentric interaction system: dataset, approach and application. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–23. Cited by: §2.2.
  • [6] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §2.2.
  • [7] M. Munaro, F. Basso, and E. Menegatti (2016) OpenPTrack: open source multi-camera calibration and people tracking for rgb-d camera networks. Robotics and Autonomous Systems 75, pp. 525–538. Cited by: §2.1.
  • [8] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout (2014) Multi-scale deep learning for gesture detection and localization. In European Conference on Computer Vision, pp. 474–490. Cited by: §2.2.
  • [9] C. Park and S. Lee (2011) Real-time 3d pointing gesture recognition for mobile robots with cascade hmm and particle filter. Image and Vision Computing 29 (1), pp. 51–63. Cited by: §2.2.
  • [10] D. L. Quam (1990) Gesture recognition with a dataglove. In IEEE Conference on Aerospace and Electronics, pp. 755–760. Cited by: §2.2.
  • [11] S. S. Rautaray and A. Agrawal (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial intelligence review 43 (1), pp. 1–54. Cited by: §2.2.
  • [12] A. D. Wilson and A. F. Bobick (1999) Parametric hidden markov models for gesture recognition. IEEE transactions on pattern analysis and machine intelligence 21 (9), pp. 884–900. Cited by: §2.2.