
Efficient Convolutional Neural Network for FMCW Radar Based Hand Gesture Recognition

by   Xiaodong Cai, et al.

FMCW radar can detect an object's range, speed, and Angle-of-Arrival; its advantages are robustness to bad weather, good range resolution, and good speed resolution. In this paper, we consider FMCW radar as a novel interaction interface for laptops. We merge sequences of an object's range, speed, and azimuth information into a single input and feed it to a convolutional neural network to learn spatial and temporal patterns. Our model achieved 96% accuracy in real-time tests.





1. Introduction

For camera-based gesture recognition there are many commercial solutions, such as Kinect, Leap Motion, and RealSense. These solutions suffer from privacy issues, while FMCW radar has no such limitation: since FMCW radar can only estimate an object's range, speed, and angle, it can capture human actions, but that information is not enough to identify the user. A detailed comparison among different sensors is shown in Figure 1. Radar is also friendly to industrial design, as it needs no hole-punch in the casing, while microphones and cameras do.

Figure 1. Sensors comparison

1.1. FMCW Radar Basics

FMCW radar is short for Frequency-Modulated Continuous-Wave radar. The radar transmits a continuous carrier modulated by a periodic function to provide range data. In each period (also called a chirp), the radar transmitter emits a sinusoidal wave whose frequency sweeps from f_min to f_max; the bandwidth B = f_max − f_min determines the radar's spatial resolution (range resolution ΔR = c/2B), and the modulation slope is S = B/T_c for chirp duration T_c. In this paper, we use a 2Tx, 4Rx FMCW radar with a sweeping frequency of 57–64 GHz.
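As a quick numerical sketch of the chirp parameters above: with the paper's 57–64 GHz sweep, the bandwidth is 7 GHz. The chirp duration below is an illustrative assumption, not a value from the paper.

```python
# Basic FMCW chirp parameters for a 57-64 GHz sweep.
c = 3e8                      # speed of light (m/s)
f_low, f_high = 57e9, 64e9   # sweep band (Hz)
B = f_high - f_low           # bandwidth: 7 GHz
T_c = 60e-6                  # assumed chirp duration (s), illustrative only
S = B / T_c                  # modulation slope (Hz/s)

range_resolution = c / (2 * B)   # c/(2B), roughly 2 cm at 7 GHz bandwidth
print(f"slope S = {S:.3e} Hz/s, range resolution = {range_resolution*100:.2f} cm")
```

The wide 7 GHz bandwidth is what makes centimeter-scale hand movements resolvable at all.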

Consider an object, e.g. the palm of a hand, at initial range R_0 with radial velocity v, so that R(t) = R_0 + vt and the signal travel time is τ = 2R(t)/c. The FMCW radar transmitter emits the sinusoidal chirp s_T(t) = exp(j2π(f_c·t + S·t²/2)); the received signal is s_R(t) = s_T(t − τ), and mixing the two yields a beat signal whose frequency f_b = 2SR/c encodes the range of the object. The speed of the object can be estimated from the Doppler frequency shift f_d = 2v/λ observed across chirps. Angle-of-Arrival is calculated via the phase difference between 2 receivers: for the same range bin and speed bin in the Range-Doppler Map (RDM), receiver 1 and receiver 2 observe a phase difference Δφ = 2πd·sin(θ)/λ, so the Angle-of-Arrival is θ = arcsin(λΔφ/(2πd)), where d is the antenna spacing and λ is the wavelength, as illustrated in Figure 3. How FMCW radar estimates hand movement is illustrated in Figure 2, and readers can refer to (Patole et al., 2017) for technical details.
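The three estimation formulas above can be exercised numerically. All parameter values below (slope, beat frequency, Doppler shift, phase difference, antenna spacing) are illustrative assumptions, not measurements from the paper:

```python
import numpy as np

c = 3e8
f_c = 60.5e9                 # center of the 57-64 GHz sweep
lam = c / f_c                # wavelength, about 5 mm
S = 7e9 / 60e-6              # assumed slope: 7 GHz over a 60 us chirp
d = lam / 2                  # assumed half-wavelength receiver spacing

# Range from beat frequency: f_b = 2*S*R/c  =>  R = f_b*c/(2*S)
f_b = 1.0e6                  # assumed measured beat frequency (Hz)
R = f_b * c / (2 * S)

# Speed from Doppler shift across chirps: f_d = 2*v/lam  =>  v = f_d*lam/2
f_d = 400.0                  # assumed measured Doppler shift (Hz)
v = f_d * lam / 2

# Angle of arrival from the inter-receiver phase difference:
# delta_phi = 2*pi*d*sin(theta)/lam  =>  theta = arcsin(lam*delta_phi/(2*pi*d))
delta_phi = np.pi / 4
theta = np.arcsin(lam * delta_phi / (2 * np.pi * d))
print(R, v, np.degrees(theta))
```

With half-wavelength spacing the arcsin argument simplifies to Δφ/π, which is why d = λ/2 gives an unambiguous ±90° field of view.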

All in all, for a Multiple-Input Multiple-Output (MIMO) radar system, objects' range, speed, and angle can be estimated via a 3D FFT of the sampled mix signal x(l, n, m) = Σ_k A_k · exp(j2π(2SR_k·nT_s/c + f_{d,k}·mT_c + f_c·d·l·sin(θ_k)/c)), where l is the antenna index, n is the sampling index, m is the chirp index, k indexes the K objects, A_k is the amplitude factor, S is the modulation slope, R_k is the range of the k-th object, f_{d,k} is the Doppler frequency shift of the k-th object, f_c is the radar's center frequency, d is the distance between antennas, θ_k is the azimuth of the k-th object, T_s is the sampling interval, T_c is the chirp period, and c is the speed of light.
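The 3D FFT estimation can be sketched on a simulated single-target data cube. All scene and radar parameters below are illustrative assumptions:

```python
import numpy as np

# Simulate a MIMO FMCW data cube x[l, m, n]: l = antenna, m = chirp, n = sample.
c, f_c = 3e8, 60.5e9
lam = c / f_c
L, M, N = 4, 64, 128                     # 4 Rx antennas, 64 chirps, 128 samples
S, T_c, f_s = 7e9 / 60e-6, 60e-6, 2.5e6  # assumed slope, chirp period, sample rate
d = lam / 2

R, v, theta = 0.5, 1.0, np.deg2rad(20)   # synthetic target: range, speed, azimuth
f_b = 2 * S * R / c                      # beat frequency
f_d = 2 * v / lam                        # Doppler frequency

l = np.arange(L)[:, None, None]
m = np.arange(M)[None, :, None]
n = np.arange(N)[None, None, :]
x = np.exp(1j * 2 * np.pi * (f_b * n / f_s
                             + f_d * m * T_c
                             + d * np.sin(theta) / lam * l))

# 3D FFT: range along samples, Doppler along chirps, angle along antennas.
cube = np.fft.fftn(x, axes=(2, 1, 0))
l_hat, m_hat, n_hat = np.unravel_index(np.argmax(np.abs(cube)), cube.shape)
print("angle bin:", l_hat, "doppler bin:", m_hat, "range bin:", n_hat)
```

The peak's (range, Doppler, angle) bin indices map back to R, v, and θ via the same three formulas, up to FFT bin quantization.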

Figure 2. Radar basics
Figure 3. Angle of Arrival estimation

1.2. Gesture Definitions

We define four gestures: left wave, right wave, click, and wrist, listed in Table 1. We also plot a theoretical analysis of the range, speed, and azimuth trajectories of the predefined gesture set, shown in Figure 4. As Figure 4 shows, these gesture trajectories are quite distinctive. However, a template-matching algorithm built on them only reaches 70–80% accuracy, because it is hard to extract clean trajectories from noisy radar signals.

Gesture Hand Movement Meaning of gesture
LEFT Move from right to left Browse previous item
RIGHT Move from left to right Browse next item
CLICK Finger pointing to radar Select an item
WRIST Hand making fist Return to main menu
Table 1. Gesture Definitions
Figure 4. Theoretical analysis of hand movement
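The 70–80% template-matching baseline mentioned above can be sketched as nearest-template classification over z-normalized trajectories. The templates and the test trajectory below are made-up illustrations, not the actual trajectories of Figure 4:

```python
import numpy as np

# Hypothetical azimuth-over-time templates for the four gestures.
t = np.linspace(0, 1, 32)
templates = {
    "LEFT":  np.cos(np.pi * t),          # azimuth sweeps right -> left
    "RIGHT": -np.cos(np.pi * t),         # azimuth sweeps left -> right
    "CLICK": np.zeros_like(t),           # azimuth roughly constant
    "WRIST": 0.2 * np.sin(4 * np.pi * t) # small oscillation
}

def classify(trajectory):
    """Nearest template by Euclidean distance on z-normalized trajectories."""
    z = (trajectory - trajectory.mean()) / (trajectory.std() + 1e-8)
    best, best_d = None, np.inf
    for name, tpl in templates.items():
        zt = (tpl - tpl.mean()) / (tpl.std() + 1e-8)
        dist = np.linalg.norm(z - zt)
        if dist < best_d:
            best, best_d = name, dist
    return best

noisy_left = np.cos(np.pi * t) + 0.3 * np.random.default_rng(0).normal(size=32)
print(classify(noisy_left))
```

Such a matcher degrades quickly as noise corrupts the extracted trajectory, which is exactly the failure mode that motivates the learned approach in Section 3.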


2. Related Work

Previous work (Hazra and Santra, 2018; Wang et al., 2016; Zhang et al., 2018) is based on CNN+LSTM architectures. (Wang et al., 2016) first introduced the CNN+LSTM architecture for radar-based gesture recognition: a CNN learns spatial patterns inside each RDM, and its features are fed into an LSTM to learn temporal patterns across RDMs; this achieved 87% accuracy on 11 gestures. (Hazra and Santra, 2018) replaced the CNN with an all-convolutional network to reduce parameters and inference time. (Zhang et al., 2018) introduced a 3D CNN to learn spatial and temporal patterns at the same time.

The solutions above (Hazra and Santra, 2018; Wang et al., 2016; Zhang et al., 2018) use real-valued Range-Doppler maps as input and ignore Angle-of-Arrival information. Taking Angle-of-Arrival into consideration requires an extra signal-processing procedure.

3. Proposed System

To explicitly extract range, speed, and azimuth trajectories, we merge 128 RDM frames into a single 3-channel input whose channels represent range-time, speed-time, and azimuth-time respectively (RSA for short), and feed it into a CNN. The gesture recognition pipeline is shown in Table 2 and Figure 5.
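The RSA merge amounts to stacking three per-range-bin trajectory maps as channels. A minimal sketch, assuming 32 range bins (the paper's exact shapes were lost in extraction, so these dimensions are illustrative):

```python
import numpy as np

T, R_BINS = 128, 32                      # 128 time steps x assumed 32 range bins
rng = np.random.default_rng(0)

# Stand-ins for the per-frame features of Table 2, one column per time step.
range_time   = rng.random((R_BINS, T))   # energy per range bin over time
speed_time   = rng.random((R_BINS, T))   # max speed per range bin over time
azimuth_time = rng.random((R_BINS, T))   # mean azimuth per range bin over time

# Stack the three trajectory maps as channels of a single CNN input.
rsa = np.stack([range_time, speed_time, azimuth_time], axis=-1)
print(rsa.shape)   # (range bins, time steps, 3 channels: R, S, A)
```

Packing the whole sequence into one image-like tensor is what lets a plain 2D CNN learn temporal patterns without a recurrent module.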

Step 1: Apply a 2D FFT to convert the raw signal into an RDM.
Step 2: Run Constant False Alarm Rate (CFAR) detection (Blake, 1988) on the RDM to detect hand and body.
Step 3: Crop the RDM to keep body and hand, generating a subset of the RDM.
Step 4: For each range bin, calculate the maximum speed and average azimuth, generating a single frame.
Step 5: Merge 128 such frames into one RSA frame.
Table 2. Radar signal pre-processing
Figure 5. Radar data processing pipeline
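Steps 2 and 4 of the pipeline can be sketched as follows. The paper cites OS-CFAR (Blake, 1988); the simpler cell-averaging variant below is a stand-in, and all parameter values are illustrative assumptions:

```python
import numpy as np

def ca_cfar_1d(power, guard=2, train=8, scale=3.0):
    """Minimal 1-D cell-averaging CFAR: flag cells exceeding the scaled mean
    of their training cells (a stand-in for the OS-CFAR the paper cites)."""
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    for i in range(n):
        cells = [j for j in range(max(0, i - guard - train),
                                  min(n, i + guard + train + 1))
                 if abs(j - i) > guard]
        if power[i] > scale * np.mean(power[cells]):
            detections[i] = True
    return detections

def rdm_to_frame(rdm, azimuth):
    """Step 4 sketch: per range bin, take the strongest Doppler bin's index as
    the max speed and the mean azimuth over all cells (the real pipeline would
    average only over CFAR-detected cells)."""
    max_speed = rdm.argmax(axis=1)       # strongest Doppler bin per range bin
    mean_azimuth = azimuth.mean(axis=1)  # average azimuth per range bin
    return max_speed, mean_azimuth

rng = np.random.default_rng(1)
profile = rng.random(64)
profile[20] = 10.0                       # synthetic target in the noise floor
print(np.flatnonzero(ca_cfar_1d(profile)))
```

CFAR adapts its threshold to the local noise floor, which is what lets the same detector separate a small hand return from a much stronger body return.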


3.1. Neural Network Architecture Design

Firstly, we design a VGG-like neural network, called VGG-10, shown in Figure 6(a). It follows a {Conv3x3, Conv3x3, MaxPooling} building block pattern, followed by 2 fully connected layers. The ADAM optimizer, early stopping, and learning-rate reduction are applied. VGG-10 converged at the 10th epoch with 92% validation accuracy.

To improve performance, we add residual blocks between convolution layers, with batch normalization between each residual block to make back-propagation more robust (He et al., 2016); we call the resulting network ResNet-20, shown in Figure 6(b). ResNet-20 outperforms VGG-10, achieving 98% validation accuracy.
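The residual connection at the heart of ResNet-20 can be sketched framework-agnostically in NumPy (single-channel maps for brevity; batch norm omitted):

```python
import numpy as np

def conv3x3(x, w):
    """Naive 'same'-padded 3x3 convolution over a single-channel 2-D map."""
    h, ww = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(ww):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """y = ReLU(x + F(x)): two 3x3 convolutions with an inner ReLU, plus the
    identity shortcut, as in (He et al., 2016)."""
    f = np.maximum(conv3x3(x, w1), 0)    # conv -> ReLU
    f = conv3x3(f, w2)                   # conv
    return np.maximum(x + f, 0)          # shortcut add -> ReLU

x = np.random.default_rng(0).random((8, 8))
# With zero weights F(x) = 0, so the block reduces to the identity mapping:
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
print(np.allclose(y, x))
```

Because the block can trivially represent the identity, adding more such blocks never has to hurt training, which is why the deeper ResNet-20 trains more reliably than VGG-10 here.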

We also build a CNN+LSTM model for comparison, shown in Figure 6(c). CNN+LSTM needs an RDM sequence as input: we first downsize each original RDM and feed 64 resized RDM frames into the CNN. The CNN module follows a {Conv5x5, Conv5x5, MaxPooling} building block with 1 fully connected layer to encode features, and the feature encodings are then fed into the LSTM.

Figure 6. Network architecture, (a) VGG-10, (b) ResNet-20, (c) CNN+LSTM


4. Dataset and Experiment results

We collected gesture data from 50 subjects; each subject performed the 4 gestures 10 times with both the left and the right hand, yielding 3652 valid records in total. The validation-train split ratio is 0.3.

We also apply data augmentation to enrich the dataset. We first draw a block containing the gesture movement, then crop within that area randomly to generate training data, as shown in Figure 7. In the end, we obtained more than 400k training samples.
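The random-crop augmentation can be sketched as jittering a crop window inside a padded frame around the gesture block. All sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(rsa, out_h=32, out_w=128):
    """Randomly crop an (H, W, 3) RSA frame down to (out_h, out_w, 3)."""
    h, w, _ = rsa.shape
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return rsa[top:top + out_h, left:left + out_w, :]

# A padded frame around the gesture block; each crop is a new training sample.
frame = rng.random((40, 160, 3))
crops = [random_crop(frame) for _ in range(8)]
print(crops[0].shape)
```

Shifting the crop window perturbs the gesture's position in range and time without changing its label, which is how 3652 records can be expanded to 400k samples.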

We compare VGG-10, ResNet-20, and CNN+LSTM on the same dataset and report average accuracy in Table 3. CNN+LSTM has the lowest accuracy on LEFT/RIGHT, due to its lack of Angle-of-Arrival information; the deeper CNN outperforms the shallow one, which is in line with the experimental results in (Wang et al., 2016).

Figure 7. Data augmentation


Network Architecture Avg. Acc. LEFT RIGHT CLICK WRIST
VGG-10 91.0% 94.9% 80.7% 95.5% 97.0%
ResNet-20 98.7% 99.1% 99.0% 97.9% 98.9%
CNN+LSTM 78.0% 69.0% 49.5% 84.6% 90.1%
Table 3. Accuracy comparison among models
Figure 8. Validation loss of VGG-10 and ResNet-20


Figure 9. Confusion matrix on test set


As the confusion matrix in Figure 9 shows, our model achieves 98% average accuracy on the test set.
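Per-class and average accuracies of the kind reported in Table 3 can be derived from a confusion matrix like Figure 9's. The matrix below is made up for illustration, not the paper's actual results:

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, cols = predicted class)
# over LEFT, RIGHT, CLICK, WRIST, 100 test samples per gesture.
cm = np.array([
    [97,  2,  1,  0],
    [ 3, 96,  0,  1],
    [ 1,  0, 98,  1],
    [ 0,  1,  1, 98],
])

per_class_acc = cm.diagonal() / cm.sum(axis=1)   # recall per gesture
avg_acc = per_class_acc.mean()                   # macro average, as in Table 3
overall_acc = cm.diagonal().sum() / cm.sum()     # micro average
print(per_class_acc, avg_acc, overall_acc)
```

Macro averaging weights each gesture equally regardless of sample count, which matters when some gestures are harder to collect than others.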

4.1. Error Analysis

  • LEFT is misclassified as RIGHT, and vice versa. A gesture movement consists of three temporally overlapping phases: preparation, nucleus, and retraction. The retraction phase of a LEFT looks like a RIGHT, and the retraction phase of a RIGHT looks like a LEFT. Better gesture segmentation during preprocessing may help reduce this kind of error.

  • CLICK is misclassified as LEFT, and WRIST is misclassified as RIGHT. Recalling Figure 4, the biggest difference between LEFT/RIGHT and CLICK/WRIST is the angle trajectory: when a CLICK/WRIST gesture involves large angle changes, it is easily misclassified as LEFT/RIGHT. More accurate angle estimation may help reduce this error.

5. Conclusion

FMCW radar is a low-cost sensor with high range and speed resolution; it detects anonymous object movements, which makes it suitable for privacy-sensitive interaction applications. We designed a 20-layer residual network to recognize gestures, and the model achieves 96% accuracy in real-time tests. In the future, we plan to support user-defined gestures.


  • S. Blake (1988) OS-CFAR theory for multiple targets and nonuniform clutter. IEEE Transactions on Aerospace and Electronic Systems 24 (6), pp. 785–790. Cited by: Table 2.
  • S. Hazra and A. Santra (2018) Robust gesture recognition using millimetric-wave radar system. IEEE Sensors Letters 2 (4), pp. 1–4. Cited by: §2, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • S. M. Patole, M. Torlak, D. Wang, and M. Ali (2017) Automotive radars: a review of signal processing techniques. IEEE Signal Processing Magazine 34 (2), pp. 22–35. Cited by: §1.1.
  • S. Wang, J. Song, J. Lien, I. Poupyrev, and O. Hilliges (2016) Interacting with soli: exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 851–860. Cited by: §2, §2, §4.
  • Z. Zhang, Z. Tian, and M. Zhou (2018) Latern: dynamic continuous hand gesture recognition using fmcw radar sensor. IEEE Sensors Journal 18 (8), pp. 3278–3289. Cited by: §2, §2.