Acoustic Sensing-based Hand Gesture Detection for Wearable Device Interaction

by Bing Zhou, et al.

Hand gesture recognition attracts great attention for interaction since it is intuitive and natural to perform. In this paper, we explore a novel method for interaction by using bone-conducted sound generated by finger movements while performing gestures. We design a set of gestures that generate unique sound features, and capture the resulting sound from the wrist using a commodity microphone. Next, we design a sound event detector and a recognition model to classify the gestures. Our system achieves an overall accuracy of 90.13% in quiet environments and 85.79% under noisy conditions. The system can be deployed on existing smartwatches as a low power service at no additional cost, and can be used for interaction in augmented and virtual reality applications.





With recent advances in technology and significant reduction in cost factor, wearable devices like smartwatches and fitness wristbands have become increasingly popular as they provide quick access to key functionality such as making phone calls, replying to messages, and monitoring our health. However, these devices are still difficult to interact with due to the small size of their screens. Users have to navigate menus through the relatively small screen or via the limited physical buttons, which makes the interaction prone to error and inefficient, particularly because of the wide range of differences in users’ finger sizes. Additionally, when one hand is occupied (e.g., while holding a shopping bag), it is hard or sometimes not even possible to interact with wearable devices. Thus, a single-hand-based interaction approach is preferred.

A series of efforts have been undertaken to address this problem. Researchers have explored various specialized hardware and sensors, such as force sensitive resistors Dementyev and Paradiso (2014), EMG sensors Saponas et al. (2010), and electric field sensors Wilhelm et al. (2015), which have shown very promising results. However, they are not widely adopted in consumer products as they require additional hardware, which is not easy to integrate into wearable devices due to their small form factor. To eliminate these issues, several researchers propose interaction methods that leverage existing hardware of smartwatches, such as inertial sensors (e.g., accelerometer and gyroscope). Shen et al. (2016) also use inertial sensors in smartwatches to support more gestures and continuous arm tracking. However, these approaches only support arm-level movement detection, and are not fine-grained enough to detect finger movements. Wen et al. (2016) propose a solution that recognizes a set of fine-grained hand gestures; however, it only works when the user and their arms are stationary, which highly constrains its practical use.

Another area of focus for gesture detection has been vision-based approaches for Augmented Reality (AR) and Virtual Reality (VR). LeapMotion Weichert et al. (2013), Microsoft Hololens 8, as well as several research works Mueller et al. (2018); Sridhar et al. (2013), all rely on computer vision techniques to detect hand gestures. However, they suffer from the fundamental limitation that the user’s hands have to be in the line of sight for the gestures to be recognized. Furthermore, in most cases, such vision-based techniques rely on depth cameras Weichert et al. (2013); 8, which are subject to poor performance in outdoor environments due to saturation of infrared light in sunny conditions.

In this paper, we propose a novel user interaction system, which leverages existing microphones on wearable devices and can be readily deployed on most existing smartwatches or wristbands (as shown in Figure 1). The key insight is that performing a certain hand gesture using fingers creates unique sound signals. Such signals are then conducted by the bones from finger tips to the wrist and can be finally captured by microphones on wrist-worn devices. We propose a series of machine learning pipelines, which recognize the hand gestures from the captured sound signals. The advantage of our design is that it does not require costly special sensors, such as an EMG sensor, or a force sensitive sensor, that take up space and potentially introduce discomfort when worn for a long duration. As existing wearable devices already use microphones for voice interaction (e.g., “Siri” on the Apple watches 19), our approach can be easily integrated without incurring additional energy cost.

We develop a simple prototype based on a personal computer and an external microphone setup. The results show that we can reliably capture the hand gestures with just an external microphone attached to the wrist. This also demonstrates the feasibility of interacting with smartwatches using their embedded microphones (footnote 1: we only use smartwatches for sound capturing from some participants in our studies due to lack of smartwatches for all participants and current challenges with co-location when running the experiments). Below we summarize the contributions of this paper:

  • We propose a novel approach for smartwatch interaction leveraging the bone-conducted sound sensing, which requires no additional hardware and can readily be deployed on existing devices.

  • We design a series of sound signal pre-processing techniques to suppress the background noise to improve the signal-to-noise ratio (SNR), and extract reliable features for further detection.

  • To balance the trade-offs between performance and power consumption, we develop a two-stage machine learning pipeline to minimize the impact on battery life.

  • We design an acoustic data augmentation scheme for generating “synthesized” training samples, which reduces false negatives significantly for noisy environments.

  • We build a prototype and show that our system can be used to perform both 2D and 3D interaction, and achieves an overall accuracy of 90.13% in quiet environments and 85.79% in noisy conditions.

Related Work

Wearable Device Interaction. WristFlex uses an array of force sensitive resistors (FSRs) worn around the wrist, and the interface can distinguish subtle finger pinch gestures Dementyev and Paradiso (2014). SkinWatch Ogata and Imai (2015) provides gesture input by sensing deformation of the skin under a wearable wrist device, such as a smartwatch or wrist band. It takes the two-dimensional deformation signal from a small and thin sensor attached to the user’s skin as input, and matches it to some pre-defined input gestures. eRing Wilhelm et al. (2015) detects multiple hand gestures with a single ring through electric field sensing. HandSense Nguyen et al. (2019) explores recognizing dynamic, micro finger movements using capacitive coupling for interacting with a head-mounted device. Electrodes are attached to fingertips of users’ gloves and the capacitive coupling among all pairs of electrodes is measured quickly to infer the real-time spatial relationship between fingers.

Acoustic Sensing on Smart Devices. Acoustic signals on smart devices have been explored for many applications in recent years, such as distance measurement Peng et al. (2007) and tracking Zhou et al. (2017). Acoustic sensing is also explored for active finger movement tracking Nandakumar et al. (2016); Wang et al. (2016). FingerIO Nandakumar et al. (2016) and LLAP Wang et al. (2016) actively emit high frequency sound signals from the built-in speakers on mobile devices, and leverage phase shift in received signals for finger movement tracking. However, such approaches usually only support tracking a single finger movement close to the device. They do not recognize hand gestures involving multiple fingers, which limits their applicability in interaction. Compared to such previous work, our approach leverages bone-conducted sound signals from finger movements for wearable device interaction.


Gesture Design

There are several considerations when designing gestures from finger movements. First, the gestures should generate sound signals that are distinguishable from one another, so that each gesture can be reliably detected. Second, the gestures should be subtle enough that they can be performed without drawing too much attention to the user. Third, the gestures should be intuitive and natural to perform.

Figure 2: Hand gesture set designed for interaction, which includes pinching, rubbing up, rubbing down, flicking and opening the palm, with respective sample sound signals.

In this work, we propose five gestures: pinching, flicking, rubbing up, rubbing down and opening up, as shown in Figure 2. They can be mapped to common interactions on a smartphone or smartwatch, such as click, go back to previous level, scroll up/down, and go back to home, respectively. In Figure 2, we also show a typical sample of the sound signals generated by each gesture’s associated finger movements. The complete set of these gestures provides the basic operations necessary to interact with smart devices, such as smartwatches or wristbands.

Gesture Detection Architecture

To detect the hand gestures in real-time, we propose a two-stage detection architecture, as shown in Figure 3, which can be divided into three modules: sound signal capturing, low power always-on event detection, and a triggered module for gesture detection.

Figure 3: Our design consists of three major components: a signal capturing module, a low power always-on module for signal filtering and event detection, and a triggered module for more computationally heavy feature extraction and gesture model inference.

Sound Signal Capturing

There are usually one or more microphones on smartwatches for sound recording, speech recognition and making phone calls. For example, the Apple Watch has a microphone enabled as “always on” for its Siri function. Since smartwatches are attached to the wrist, the sound propagated from the fingers to the wrist can be captured via bone conduction, which provides a better signal-to-noise ratio than air-conducted sound. By accessing the microphone, we get a stream of real-time sound signals, which are fed to our pipeline for gesture detection.

Low Power Event Detector

To avoid continuous computation of heavy feature extraction and deep model inferences, we design a low power event detector to detect possible finger gestures, and trigger the gesture classification only when an event is detected. This step avoids unnecessary computations when there is no hand gesture happening. We apply a few lightweight steps to do the first-stage event detection, which consists of two steps: signal filtering and signal detection. This module keeps running in the background so that every intended finger gesture can be captured. We have intentionally designed this “always on” approach compared to, say “Siri” like voice command-based activation of gesture recognition, as we believe that interaction should be subtle, natural and applicable in any environment including circumstances where voice-based commands may not be appropriate (e.g., while watching a movie at a movie theater).

Signal Filtering. Before we further process the raw sound signal, we need to filter out potential background noise to improve the SNR. We perform such filtering with a bandpass filter, which passes the frequency range of the gesture sounds and rejects background noise outside it. To find the frequency range of hand gesture sounds, we collect a series of gesture sound signals in a relatively quiet environment and analyze their frequency distribution. The result is shown in Figure 4. We can see that the frequencies mainly fall in the range of 10 - 500 Hz. Thus we design a Butterworth band-pass filter Hussin et al. (2016) with a pass band of 10 - 500 Hz to remove the higher frequency background noise, so that weak gesture sounds from the fingers will not be buried in it.

Figure 4: Frequency analysis of sound signals of hand gestures.
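The band-pass step can be illustrated with a minimal sketch. It is not the authors' implementation: it cascades a second-order Butterworth high-pass and a second-order Butterworth low-pass section (standard bilinear-transform biquads with Q = 1/sqrt(2)) using only the Python standard library. The 16 kHz sampling rate, the filter order, the helper names, and the reading of the 10 - 500 cutoffs as Hz are all assumptions made for illustration.

```python
import math


def butter2_biquad(kind, cutoff_hz, fs_hz):
    """Coefficients (b, a) of a 2nd-order Butterworth low- or high-pass biquad."""
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # Q = 1/sqrt(2) gives a Butterworth response
    a0 = 1.0 + alpha
    if kind == "low":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    elif kind == "high":
        b = [(1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    # Normalise so that the leading denominator coefficient is 1.
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return [x / a0 for x in b], a


def apply_biquad(samples, b, a):
    """Filter a list of samples with one biquad section (direct form I)."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def bandpass(samples, low_hz=10.0, high_hz=500.0, fs_hz=16000.0):
    """High-pass at low_hz followed by low-pass at high_hz (assumed cutoffs in Hz)."""
    hp_b, hp_a = butter2_biquad("high", low_hz, fs_hz)
    lp_b, lp_a = butter2_biquad("low", high_hz, fs_hz)
    return apply_biquad(apply_biquad(samples, hp_b, hp_a), lp_b, lp_a)


def tone(freq_hz, fs_hz, seconds):
    """Unit-amplitude sine wave, used here only to exercise the filter."""
    n = int(fs_hz * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

Feeding `tone(100.0, 16000.0, 1.0)` through `bandpass` should leave its amplitude almost unchanged, while a 5 kHz tone is strongly attenuated. A production pipeline would more likely use a signal processing library and a higher filter order.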

Event Detection.

The signal detection component detects whether there is a possible hand gesture event. This avoids most of the unnecessary computation when the environment is relatively quiet, or contains only higher frequency sound, which is filtered out by the bandpass filter. Note that the main goal of this first-stage detection is not to guarantee high accuracy, but to rule out obvious non-event periods. To accomplish this, we designed a simple heuristic: we create a buffer with a time window of 2 seconds and a moving step of 0.1 seconds. We detect the peaks above a certain threshold; if a peak is detected in the central portion of the segment (i.e., between 0.5 – 1 seconds), the signal is locked. We then crop the center 1 second window of the signal and feed it into the next stage, which triggers the subsequent steps for gesture detection. In our implementation, we chose the threshold empirically, based on our experiments, to achieve the best balance between false alarms and missed gestures.
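The buffering heuristic above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name and parameters are hypothetical, the threshold must be chosen by the caller, and the choice to skip a full buffer after an event is locked (so the same peak is not reported twice) is an assumption.

```python
def detect_events(signal, fs, threshold,
                  buffer_s=2.0, step_s=0.1,
                  center_lo_s=0.5, center_hi_s=1.0, crop_s=1.0):
    """Return (start_index, cropped_samples) for each locked event."""
    buf_n = int(round(buffer_s * fs))
    step_n = max(1, int(round(step_s * fs)))
    lo_n = int(round(center_lo_s * fs))
    hi_n = int(round(center_hi_s * fs))
    crop_n = int(round(crop_s * fs))
    crop_start = (buf_n - crop_n) // 2  # center the crop inside the buffer

    events = []
    start = 0
    while start + buf_n <= len(signal):
        window = signal[start:start + buf_n]
        # Index of the largest absolute amplitude in this buffer.
        peak_idx = max(range(buf_n), key=lambda i: abs(window[i]))
        above = abs(window[peak_idx]) > threshold
        if above and lo_n <= peak_idx <= hi_n:
            # Lock the signal: keep the center crop for the second stage.
            crop = window[crop_start:crop_start + crop_n]
            events.append((start + crop_start, crop))
            start += buf_n  # assumption: jump past the locked buffer
        else:
            start += step_n
    return events
```

Only the cropped windows returned here would be passed on to feature extraction and model inference, which is what keeps the always-on stage cheap.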

Gesture Detection

Once an event is captured, it triggers the gesture detection module, which consists of three steps: MFCC feature extraction, gesture detection model and non-maximum suppression.

MFCC Feature Extraction.

After we get the filtered signal, we apply spectrum analysis to transform it into the frequency domain, which provides richer information than the time domain. Specifically, we extract the MFCC features Logan and others (2000), as shown in Figure 5. This gives us a stream of 2D heat maps (i.e., spectrograms) describing the sound features in the frequency domain. Then, we segment the signal with a sliding window with a step of 0.05 seconds and a window length of 0.5 seconds, which is roughly the duration of a single gesture. The prepared data is then fed into our machine learning pipeline for prediction. For each 0.5 second segmented MFCC feature window, our model predicts the probability of each hand gesture.

Figure 5: We extract MFCC features from the filtered sound signal and apply a moving window to segment the MFCC features, which are fed into machine learning models for recognition.
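The sliding-window segmentation of the MFCC stream can be sketched as below. The sketch assumes the features are already available as a time-major list of per-frame coefficient vectors. The frame rate of 100 frames per second in the usage note is a hypothetical value chosen so that the 0.5 second window and 0.05 second step map to whole frames; it is not taken from the paper and does not reproduce the paper's 40x44 input shape.

```python
def window_starts(total_frames, window_frames, step_frames):
    """Start indices of all full windows that fit in the feature stream."""
    if total_frames < window_frames:
        return []
    return list(range(0, total_frames - window_frames + 1, step_frames))


def segment(features, window_frames, step_frames):
    """Cut a time-major list of feature frames into overlapping windows."""
    starts = window_starts(len(features), window_frames, step_frames)
    return [features[s:s + window_frames] for s in starts]
```

Under the assumed frame rate, a 1 second cropped event has 100 frames, a window is 50 frames and the step is 5 frames, so `segment(features, 50, 5)` yields (100 - 50) / 5 + 1 = 11 overlapping windows per event, each of which is scored by the model.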

Gesture Detection Model.

Recently, deep learning approaches (especially CNNs) have shown great success in a variety of challenging tasks, such as image classification, due to their powerful automatic feature extraction He et al. (2016); Simonyan and Zisserman (2015). We design a CNN-based neural network, which takes as input the MFCC of the segmented signals, and train it on a large data set collected from our test users. Considering the targeted deployment on wearable devices with limited computation resources, we choose to design a light-weight CNN model for detection. The customized CNN architecture designed for gesture recognition is shown in Figure 6. The input layer takes the 40x44x1 MFCC features, and the network outputs the probability of each gesture. After the input layer, we add four consecutive CNN layers for feature extraction. We use the rectified linear unit (ReLU) as the activation function for the convolutional layers, and after each activation we add a max pooling layer and a dropout layer as regularization, to avoid over-fitting. Finally, we flatten the CNN feature maps and feed them into two linear layers. The output is a dense layer with a Softmax activation function, which predicts the probability of each gesture class. The CNN is trained on a data set that contains acoustic samples from 5 classes (5 gestures), and categorical cross-entropy is used as the loss function.

We carefully choose the number of hidden layers and hidden units of each layer of the network to fit the limited computational resources available in wearable devices. The output of the model provides a distribution of scores over each finger gesture for each frame. Due to the sliding window we used, there will be overlaps between frames, so we need to further process the results to generate the final output.

Figure 6: Gesture detection model architecture.
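For reference, the output activation and the loss named above have the following standard forms. The symbols are notation introduced here for illustration and are not taken from the paper: $z_i$ is the pre-activation score of gesture class $i$, $p_i$ is the predicted probability, and $y_i$ is the one-hot ground-truth label over the 5 gesture classes.

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{5} e^{z_j}}, \qquad i = 1, \dots, 5
\qquad\qquad
\mathcal{L} = -\sum_{i=1}^{5} y_i \log p_i
```

Because $y$ is one-hot, the loss for a single sample reduces to the negative log-probability that the model assigns to the correct gesture.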

Non-maximum Suppression.

The softmax layer outputs a probability for each gesture class. We output the detected gesture if its probability is larger than a pre-defined threshold. Since we are using a moving window for detection, there could be multiple results predicted for one actual event. To generate one final recognition result for each actual event, we apply the non-maximum suppression algorithm Neubeck and Van Gool (2006) to consolidate the results.
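A minimal sketch of thresholding followed by one-dimensional non-maximum suppression over the per-window predictions is given below. It is a simplified greedy variant written for illustration, not the algorithm of the cited work: a detection is suppressed whenever its window overlaps a higher-scoring detection that was already kept. The tuple layout and function name are hypothetical, and the default threshold of 0.7 mirrors the primary threshold reported in the False Alarm Rate Evaluation.

```python
def non_max_suppression(detections, prob_threshold=0.7, window_s=0.5):
    """Keep one detection per event.

    detections: list of (window_start_s, gesture, probability) tuples,
    one per sliding window (the best-scoring class of that window).
    """
    # Step 1: discard low-confidence windows.
    candidates = [d for d in detections if d[2] > prob_threshold]
    # Step 2: visit the remaining windows from most to least confident.
    candidates.sort(key=lambda d: d[2], reverse=True)
    kept = []
    for start, gesture, prob in candidates:
        # Two windows of length window_s overlap if their starts are
        # closer than window_s; suppress overlaps with kept detections.
        if all(abs(start - k[0]) >= window_s for k in kept):
            kept.append((start, gesture, prob))
    # Report the surviving detections in time order.
    return sorted(kept, key=lambda d: d[0])
```

With a 0.05 second step, up to ten consecutive windows can cover the same gesture, so without this consolidation a single flick could trigger the same action several times.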

Data Collection and Augmentation

Training Data Collection. Training a neural network usually requires a large amount of labelled training data, which necessitates a lot of human effort. Additionally, in order for the model to generalize well, the training data should represent different environments. Labelling the subtle sound signals for our application is very challenging. We therefore propose to collect training data in a quiet environment, so that it can be segmented and labelled automatically. During the labelled data generation step, the user repeats the same gesture for each round, and switches to another gesture in the next round. The signals are then automatically segmented using simple peak detection under the constraint of a minimum interval of 0.5 seconds between peaks. Through this step, we can easily get labelled clean data (i.e., with minimal to no background noise). However, a model trained on such a data set may not work well in a more typical environment, which is usually noisy. To solve this problem, we propose an acoustic data augmentation technique to generate more synthesized training data with background noise.
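The automatic segmentation step can be sketched as follows, assuming one recording per round in which a single known gesture is repeated. The greedy strongest-first peak picking, the amplitude threshold, and the choice to cut a 0.5 second clip centered on each peak are illustrative assumptions; the function names are hypothetical.

```python
def find_gesture_peaks(signal, fs, threshold, min_interval_s=0.5):
    """Indices of peaks above threshold that are at least min_interval_s apart."""
    min_gap = int(round(min_interval_s * fs))
    # Candidate peaks: local maxima of |x| that exceed the threshold.
    cands = [
        i for i in range(1, len(signal) - 1)
        if abs(signal[i]) > threshold
        and abs(signal[i]) >= abs(signal[i - 1])
        and abs(signal[i]) > abs(signal[i + 1])
    ]
    # Greedily keep the strongest peaks first, enforcing the minimum spacing.
    cands.sort(key=lambda i: abs(signal[i]), reverse=True)
    peaks = []
    for i in cands:
        if all(abs(i - p) >= min_gap for p in peaks):
            peaks.append(i)
    return sorted(peaks)


def label_round(signal, fs, gesture, threshold, clip_s=0.5):
    """Cut one labelled clip around every detected peak of a single-gesture round."""
    half = int(round(clip_s * fs)) // 2
    clips = []
    for p in find_gesture_peaks(signal, fs, threshold):
        lo, hi = p - half, p + half
        if lo >= 0 and hi <= len(signal):  # skip peaks too close to the edges
            clips.append((gesture, signal[lo:hi]))
    return clips
```

Because every round contains only one gesture type, the label comes from the round itself, and no manual annotation of individual clips is needed.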

Figure 7: Acoustic data augmentation. Multiple augmented samples are created by combining each clean sample with several randomly sampled background noise samples.

Acoustic Data Augmentation. Data augmentation is commonly used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data Shorten and Khoshgoftaar (2019). It is an effective way to prevent over-fitting when the amount of data is relatively small. In our case, we have a highly imbalanced data set: background noise is available in virtually unlimited quantity, while our clean training samples are quite limited. To enlarge the training data set and increase the robustness against background noise, we design an acoustic data augmentation pipeline, shown in Figure 7. We download a large number of audio tracks of daily environment conditions from online sources (2021) and use them as our background noise data set. Then, for each clean sample of a certain gesture, we generate multiple augmented samples by mixing it with randomly sampled clips from the background noise data set. To better simulate the actual data in different environments, we also vary the background noise level by randomly adjusting its volume within a range. With this acoustic data augmentation, we obtain a large training data set, which has more variety for robust model training and avoids over-fitting. Our evaluation results show a significant performance increase with augmentation.
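The mixing step can be sketched in a few lines. The gain range, the uniform sampling of the gain, and the function name are assumptions for illustration, since the paper does not state how the volume range was chosen. Noise tracks are assumed to be at least as long as the clean sample.

```python
import random


def augment(clean, noise_tracks, ratio=10, gain_range=(0.1, 1.0), rng=None):
    """Create `ratio` noisy copies of one clean sample.

    clean:        list of samples of one clean gesture clip
    noise_tracks: list of background-noise recordings (lists of samples)
    gain_range:   hypothetical (min, max) scale applied to the noise
    """
    rng = rng or random.Random()
    n = len(clean)
    usable = [t for t in noise_tracks if len(t) >= n]
    if not usable:
        raise ValueError("no noise track is long enough for this sample")
    augmented = []
    for _ in range(ratio):
        track = rng.choice(usable)                  # random background recording
        offset = rng.randrange(len(track) - n + 1)  # random clip inside it
        gain = rng.uniform(*gain_range)             # random noise volume
        mixed = [c + gain * track[offset + i] for i, c in enumerate(clean)]
        augmented.append(mixed)
    return augmented
```

Each augmented sample keeps the label of the clean sample it was derived from, so the labelled data set grows by the chosen ratio without extra annotation work.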


Experiment Setup. To validate the design, we develop a pipeline to capture audio data from an Android smartwatch or an external microphone attached to the wrist, and run the gesture detection pipeline on a personal computer (shown in Figure 8). For the convenience of evaluating the accuracy performance, the recognition pipeline is developed using Python and tested on a personal computer. We also develop two different applications (a web browser and a 3D object manipulation app) to demonstrate how to interact with different user experiences via hand gestures.

Figure 8: Experiment setup for sound signal capturing and gesture detection. We record signals from both a smartwatch and an external microphone, and run the detection pipeline on a personal computer for evaluation purposes.

Machine Learning Pipeline.

The machine learning pipeline requires offline training and online recognition. We record sound signals from a smartwatch using the built-in audio recorder and export the recordings for analysis. For the external microphone, we use the pyaudio library both to record sound for training and to stream sound signals for real-time gesture detection. We train the CNN model offline on a computing server equipped with GPUs. Keras Chollet and others (2015) with a TensorFlow Abadi et al. (2016) backend is used for CNN construction and training. The Adam optimizer Kingma and Ba (2014) with a learning rate of […] is used to speed up the training. For real-time recognition, we run the model inference on a regular personal computer as part of two different applications to test the accuracy of our gesture recognition.


Data Collection

To generate the training data, we invited 5 participants, of different ages and genders, to contribute audio clips using either an Android smartwatch or an external microphone attached to the wrist (shown in Figure 8). The differences in the sound capture equipment of various participants help create a sufficiently diverse and rich data set to create a robust model (footnote 2: due to the pandemic, we had to limit the scale of our experiments). In addition, the data captured from smartwatches demonstrate the feasibility of our system leveraging sound data from actual wearable devices, which can be used to interact with not only the wearable device itself but also perform 2D and 3D interaction in augmented and virtual reality applications.

During data collection, we instruct the participants to perform each gesture repeatedly. For convenient segmentation, the data is collected in relatively quiet environments. In total, the data set contains 2683 valid samples from 5 gesture classes after the automatic data segmentation and labelling. We randomly shuffle the data and divide it into three parts: model training, model validation and testing. After the division, we augment each sample with random background noise, which ensures that each original sample belongs to just one group (train, validation or test). We test different numbers of synthetic samples per gesture sample, as explained in Data Augmentation Evaluation, and end up choosing a ratio of 1:10, which yields the best result. Thus we generate 26830 valid samples for 5 gestures in total.

Data set     Train   Validation   Test   Total
Clean         1878          268    537    2683
Augmented    18780         2680   5370   26830

Table 1: Evaluation data set summary.
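The split-before-augment order matters: if augmentation came first, noisy copies of the same clean sample could land in both the training and the test set. A minimal sketch of the split is shown below. The 70% / 10% / 20% fractions are inferred from the counts in Table 1 and are not stated explicitly in the text; the seed and function names are hypothetical.

```python
import random


def split_indices(n_samples, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle sample indices and split them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything that is left
    return train, val, test


def augmented_counts(split_sizes, ratio=10):
    """Size of each split after generating `ratio` noisy copies per sample."""
    return [size * ratio for size in split_sizes]
```

For 2683 clean samples this gives splits of 1878, 268 and 537 samples, and a 1:10 augmentation ratio turns them into 18780, 2680 and 5370 samples, matching both rows of Table 1.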

System Evaluation

We train two models based on the clean training set and the augmented training set, and compare the performance on both clean testing set and augmented testing set (which have not been used during training, neither as training samples nor to tune the hyper-parameters of the model). Note that the augmented set is more challenging as it contains background noise, and thus, is a better metric of performance in a real environment deployment.

Recognition Accuracy Evaluation

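Overall Accuracy. We introduce precision, recall, F1-score and accuracy as metrics, where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. Precision is the fraction of true positives among all samples classified as positive, defined as $\mathrm{Precision} = \frac{TP}{TP + FP}$; recall is the fraction of true positives among all positive samples, defined as $\mathrm{Recall} = \frac{TP}{TP + FN}$. High precision means that detected gestures are mostly correct; high recall means that actual gestures are rarely missed. We also introduce the F1-score, which is the harmonic mean of precision and recall, defined as $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. Accuracy is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined, defined as $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.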

Train   Test    Precision   Recall   F1       Accuracy
Clean   Clean   0.9155      0.8883   0.9017   0.9013
Clean   Aug.    0.4783      0.4453   0.4612   0.4648
Aug.    Clean   0.9293      0.8808   0.9044   0.9088
Aug.    Aug.    0.8874      0.8320   0.8588   0.8579

Table 2: Results summary over test set.
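As a small consistency check, the metrics can be written as plain functions and the F1 column of Table 2 recomputed from its precision and recall columns. The function names are illustrative; the formulas are the standard definitions of these metrics.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# (precision, recall, reported F1) for the four rows of Table 2.
TABLE_2_ROWS = [
    (0.9155, 0.8883, 0.9017),
    (0.4783, 0.4453, 0.4612),
    (0.9293, 0.8808, 0.9044),
    (0.8874, 0.8320, 0.8588),
]

for p, r, reported in TABLE_2_ROWS:
    # Each recomputed F1 agrees with the reported value to four decimals.
    print(round(f1_score(p, r), 4), reported)
```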

Table 2 shows the results with different combinations, averaging over all classes. In the ideal case, when we train and evaluate the model on the clean data set, the model achieves very promising results of 91.55% precision and 88.83% recall, with an F1-score of over 90%. However, the performance drops significantly when testing on the augmented testing data with background noise. The F1-score drops to 46%, which is not usable in a real environment. This is mainly caused by the model being trained on samples with minimum background noise, thus not generalizing well on real samples in noisy environments. In contrast, the model trained with augmented training data shows comparable results with the model trained on clean data, when tested against clean test data, while the results of the F1 metric measured on the augmented test data remain above 85%, a huge improvement compared to the results without augmentation.

Per Gesture Accuracy. Different hand gestures involve different finger motions; some are easier to distinguish than others. Thus, we also evaluate the performance for each type of gesture. We show the receiver operating characteristic (ROC) curve and the area under the curve (AUC) for each gesture in our evaluation in Figure 9. A ROC curve is a plot that illustrates the diagnostic ability of a classifier system as its threshold is varied. A larger AUC indicates better classification performance. From Figure 9, we can see that the pinching, flicking, and rubbing up gestures have relatively larger AUC, which means they are easier to recognize, while the rubbing down and open palm gestures have relatively lower performance. The open palm gesture has the worst accuracy among all the gestures, which could be caused by the complex nature of the gesture. It involves two sequential sub-gestures: making a fist and then opening the palm. Training a recurrent model could better take advantage of such sequential data for improvement.

Figure 9: ROC curves for each gesture.

Data Augmentation Evaluation

We evaluate how effectively data augmentation improves the performance by generating “synthesized” training samples when training data is limited. We evaluate the performance with respect to different augmentation ratios, i.e., the ratio between the synthesized samples and the clean samples. Varying this ratio gives different amounts of training samples. Table 3 shows the results of the model trained with different augmentation ratios, ranging from 1 to 100, and tested against the clean and augmented testing data sets. As the ratio increases, we observe that the performance increases as well. The model trained with a ratio of 10 performs best on the clean test data, with an F1-score above 90%. Although its accuracy on the augmented test data is slightly lower than that of the model trained with a ratio of 50, the difference is negligible. Moreover, augmenting the training data with a ratio of 50 significantly increases the data set size, which takes more resources for training. With an even larger ratio of 100, the performance starts to drop, as the model could be over-fitted to the background noise rather than the limited gesture sound samples. In our final design, we choose a ratio of 10 as the trade-off among these factors.

Ratio   Test    Precision   Recall   F1       Accuracy
1       Clean   0.8167      0.7300   0.7709   0.7728
2       Clean   0.8435      0.7728   0.8066   0.8063
5       Clean   0.9291      0.8547   0.8903   0.8920
10      Clean   0.9293      0.8803   0.9044   0.9088
50      Clean   0.8953      0.8436   0.8687   0.8622
100     Clean   0.9213      0.8715   0.8957   0.8827
1       Aug.    0.8308      0.7553   0.7913   0.7935
2       Aug.    0.8467      0.7952   0.8201   0.8194
5       Aug.    0.8601      0.8233   0.8413   0.8389
10      Aug.    0.8874      0.8320   0.8588   0.8579
50      Aug.    0.8909      0.8456   0.8677   0.8654
100     Aug.    0.8795      0.8292   0.8536   0.851

Table 3: Results with different data augmentation ratios.

False Alarm Rate Evaluation

Choosing the threshold is critical to balance true positives against false alarms. It is important to keep the false alarm rate as low as possible to avoid unintended input signals. A false alarm is any of the 5 gestures being detected while the user does not intend to make it. Figure 10 shows the probability distribution of the events triggered by actual finger gestures and by background noise. As we can see, the probabilities triggered by actual gestures are usually much higher than those triggered by background noise. Choosing a suitable threshold allows us to capture the majority of actual gestures and ignore the background noise events. Thus, we set our primary threshold as 0.7. We choose 0.6 as our secondary threshold to increase the sensitivity in a more challenging environment, as proposed in our dynamic thresholding detection design.

Figure 10: Distributions of probabilities of events triggered by finger gestures and background noises.
Figure 11: Using finger gestures to manipulate 3D models.

Qualitative User Experience Study

We built a web browser navigation application and a 3D object manipulation application to test our gesture detection system. The mapping of gestures to controls is listed in Table 4. We conduct a survey with 5 users to collect their feedback, mainly on the usability of the system.

Both applications run in real-time on a personal computer, so that the users can interact with the applications via the hand gestures. Out of the 5 users, 4 reported being able to control the web browser easily and agreed that our gesture-based interaction is useful for such browsing tasks. All of them rated it equally easy to scroll the web pages as when using a mouse. As we are able to successfully capture data from wearable devices to classify hand gestures in real time, all participants agree this system would be very useful for 3D object manipulation in AR/VR settings to eliminate the need for additional controllers. No obvious latency is observed for any gesture input. Figure 11 shows the application in action, controlling 3D objects on the screen via gestures. By attaching an external microphone to their wrist, the users can easily manipulate a 3D object, including zooming in/out, rotating it and returning to the default view. Occasional misrecognized gestures and false positives are observed in noisy environments, which needs improvement. Nevertheless, the participants agree that our system provides a new way of interaction with great potential for broader applications.

Gesture        Web Browser    3D Object Viewer
Pinching       Reserved       Zoom In
Rubbing Up     Scroll Up      Rotate Right
Rubbing Down   Scroll Down    Rotate Left
Flicking       Go Back        Zoom Out
Open Palm      Exit Browser   Default View

Table 4: Mapping of gestures to controls.


Limitations. This is only a research prototype and far from a well engineered product. It has several limitations: i) Implementation on wearable devices. Although we validated the feasibility and performance on a personal computer, and also proposed solutions to meet the computation power and battery constraints on wearable devices, we do not have the implementation of the system on wearable devices yet, other than the sound capturing component for data generation. ii) False alarms. Although the acoustic data augmentation greatly improved the performance in noisy environments, there is still room for further improvement, especially in very noisy environments. iii) User gesture changes. The current CNN model is trained on limited data, far from exhaustive to be robust against various hand gesture variations. An online model updating mechanism is needed to address such changes dynamically.

Future Work. i) Collecting a larger data set. We will collect more training data from more users with a larger variety of devices. Together with a more sophisticated neural network design, this should further improve performance. ii) Wearable device implementation. We will implement the applications on wearable devices and evaluate their computation resources and power consumption in more detail. iii) Large-scale experiments. Our current data collection and experiments are limited to a small number of users. Large-scale experiments are needed to improve the maturity of the solution.


In this paper, we propose a system that leverages bone-conducted sound signals for 2D and 3D interaction. We designed a complete pipeline for robust hand gesture detection even in noisy environments, and validated its usability through different applications. We showed that we can capture data from wearable devices and classify hand gestures in real time, which can be used not only to interact with the wearable device itself, but also potentially to perform 2D and 3D interaction in augmented and virtual reality applications without the need for controllers. Experiments show that our system achieves an overall accuracy of 90.13% in a quiet environment and 85.79% under noisy conditions.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: Implementation.
  • F. Chollet et al. (2015) Keras. GitHub. Cited by: Implementation.
  • A. Dementyev and J. A. Paradiso (2014) WristFlex: low-power gesture input with wrist-worn pressure sensors. In Proceedings of the 27th annual ACM symposium on User interface software and technology, pp. 161–166. Cited by: Introduction, Related Work.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Gesture Detection.
  • S. F. Hussin, G. Birasamy, and Z. Hamid (2016) Design of butterworth band-pass filter. Politeknik & Kolej Komuniti Journal of Engineering and Technology 1 (1). Cited by: Low Power Event Detector.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Implementation.
  • B. Logan et al. (2000) Mel frequency cepstral coefficients for music modeling. In ISMIR, Vol. 270, pp. 1–11. Cited by: Gesture Detection.
  • [8] (5/28/2021) Microsoft HoloLens: mixed reality technology for business. Cited by: Introduction.
  • F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt (2018) Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–59. Cited by: Introduction.
  • R. Nandakumar, V. Iyer, D. Tan, and S. Gollakota (2016) FingerIO: using active sonar for fine-grained finger tracking. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 1515–1525. Cited by: Related Work.
  • A. Neubeck and L. Van Gool (2006) Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 3, pp. 850–855. Cited by: Gesture Detection.
  • V. Nguyen, S. Rupavatharam, L. Liu, R. Howard, and M. Gruteser (2019) HandSense: capacitive coupling-based dynamic, micro finger gesture recognition. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems, pp. 285–297. Cited by: Related Work.
  • M. Ogata and M. Imai (2015) SkinWatch: skin gesture interaction for smart watch. In Proceedings of the 6th Augmented Human International Conference, pp. 21–24. Cited by: Related Work.
  • C. Peng, G. Shen, Y. Zhang, Y. Li, and K. Tan (2007) BeepBeep: a high accuracy acoustic ranging system using cots mobile devices. In ACM SenSys. Cited by: Related Work.
  • T. S. Saponas, D. S. Tan, D. Morris, J. Turner, and J. A. Landay (2010) Making muscle-computer interfaces more practical. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 851–854. Cited by: Introduction.
  • S. Shen, H. Wang, and R. Roy Choudhury (2016) I am a smartwatch and i can track my user’s arm. In Proceedings of the 14th annual international conference on Mobile systems, applications, and services, pp. 85–96. Cited by: Introduction.
  • C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 1–48. Cited by: Data Collection and Augmentation.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations. Cited by: Gesture Detection.
  • [19] (5/28/2021) Siri does more than ever. Even before you ask. Cited by: Introduction.
  • S. Sridhar, A. Oulasvirta, and C. Theobalt (2013) Interactive markerless articulated hand motion tracking using rgb and depth data. In Proceedings of the IEEE international conference on computer vision, pp. 2456–2463. Cited by: Introduction.
  • W. Wang, A. X. Liu, and K. Sun (2016) Device-free gesture tracking using acoustic signals. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pp. 82–94. Cited by: Related Work.
  • F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler (2013) Analysis of the accuracy and robustness of the leap motion controller. Sensors 13 (5), pp. 6380–6393. Cited by: Introduction.
  • H. Wen, J. Ramos Rojas, and A. K. Dey (2016) Serendipity: finger gesture recognition using an off-the-shelf smartwatch. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3847–3851. Cited by: Introduction.
  • M. Wilhelm, D. Krakowczyk, F. Trollmann, and S. Albayrak (2015) ERing: multiple finger gesture recognition with one ring using an electric field. In Proceedings of the 2nd international Workshop on Sensor-based Activity Recognition and Interaction, pp. 1–6. Cited by: Introduction, Related Work.
  • (2021) Royalty free sound effects. Note: https://www.zapsplat.com. Accessed: 2021-05-20. Cited by: Data Collection and Augmentation.
  • B. Zhou, M. Elbadry, R. Gao, and F. Ye (2017) BatMapper: acoustic sensing based indoor floor plan construction using smartphones. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pp. 42–55. Cited by: Related Work.