Attention-based Walking Gait and Direction Recognition in Wi-Fi Networks

11/17/2018 ∙ by Yang Xu, et al.

Human gait recognition has become an active research field. In this paper, we propose to adopt the attention-based Recurrent Neural Network (RNN) encoder-decoder framework to implement a cycle-independent human gait and walking direction recognition system in Wi-Fi networks. To capture more human walking dynamics, two receivers together with one transmitter are deployed in different spatial layouts. In the proposed system, the Channel State Information (CSI) measurements from different receivers are first gathered together and refined to form an integrated walking profile. Then, the RNN encoder reads and encodes the walking profile into primary feature vectors. Given a specific recognition task, the decoder computes a corresponding attention vector, which is a weighted sum of the primary features assigned with different attentions and is finally used to predict the target. The attention scheme motivates our system to learn to adaptively align with different critical clips of the CSI data sequence for human walking gait and direction recognition. We implement our system on commodity Wi-Fi devices in an indoor environment, and the experimental results demonstrate that the proposed model can achieve promising average F1 scores of 95.35% for gait recognition from a group of 11 subjects and 96.92% for direction recognition from 8 directions.




I Introduction

Gait can be regarded as a unique feature of a person, and it is usually determined by an individual’s physical characteristics, e.g., height, weight and limb length, and walking habits, e.g., walking speed and posture combined with characteristic motions. Human gait recognition has become an active research field, especially in the era of the Internet of Things (IoT). It is appealing that an intelligent house can automatically recognize its owner’s gait and offer customized services, such as turning on the light and running a bath in advance when the owner comes back from outside and walks towards the door, or that a special nursing ward can sense an unknown subject by its gait and alert the medical staff or family members in time. Compared with human identification systems based on other biometrics, like fingerprints, foot pressure, face, iris and voice, which need to be captured by physical contact or at a close distance from the devices, gait-based systems have the potential for unobtrusive and passive sensing [1].

Coincidentally, the emerging Wi-Fi-based human sensing techniques have shown their potential for Device-free Passive (DfP) sensing, and have inspired researchers to design many DfP human sensing applications, such as fall detection [2], human daily activity classification [3], keystroke recognition [4] and sign language recognition [5, 6]. The theoretical underpinning of Wi-Fi-based DfP sensing systems is the Doppler shift and multipath effect of radio signals. In Wi-Fi networks, different human activities induce different Doppler shifts and multipath distortions in Wi-Fi signals, which are depicted and quantized by Channel State Information (CSI). During walking, the torso and limbs of the walker move at different speeds, so the signals they reflect travel along propagation paths whose lengths change at different rates, which introduces different frequency components into the CSI measurements. By extracting very fine-grained and idiosyncratic features from the CSI measurements as the human gait representation, some Wi-Fi-based gait recognition systems [7, 8] have been proposed. Unlike traditional gait recognition systems, which usually rely on video cameras [9, 10], floor force sensors [11] or wearable devices [12, 13] to capture human walking dynamics, Wi-Fi-based systems are unconstrained by light and Line-of-Sight (LoS) conditions, and there is no need to deploy dense specialized sensors or to require people to carry or wear devices. Besides, the concern about leaking private data, e.g., image data, is naturally eliminated in Wi-Fi-based gait recognition systems.

The two critical processes of the existing Wi-Fi-based gait recognition systems are gait cycle detection and gait feature extraction [14]. However, CSI measurements obtained from commercial Wi-Fi devices contain much noise, which makes it difficult to detect gait cycles. Some sophisticated signal processing techniques, like spectrogram enhancement and autocorrelation analysis, are employed to denoise the data and emphasize the cycle patterns [7]. After obtaining the data of each cycle, previous work generates empirically hand-crafted features from the time domain and frequency domain of the cycle-wise data for gait recognition. These methods are basically driven by traditional signal processing and machine learning techniques, which have limitations in extracting high-quality representations of the data.

In this paper, we propose to adopt a more advanced deep learning framework, namely the attention-based Recurrent Neural Network (RNN) encoder-decoder, to implement a cycle-independent human gait and walking direction recognition system with Wi-Fi signals. The attention-based RNN encoder-decoder neural networks were initially proposed for machine translation [15]. Owing to the attention scheme, the trained networks can adaptively focus their attention on important words in the source sentence when generating the target word. This distinguishing characteristic motivates us to create our cycle-independent gait and direction recognition system given arbitrarily segmented CSI data. Identifying human gait combined with walking direction can enable much more practical and interesting applications, while the existing Wi-Fi-based gait recognition methods cannot cope with these two tasks simultaneously. With the attention-based RNN encoder-decoder architecture, the proposed model can jointly train the networks for the two objectives. To capture more human walking dynamics, two receivers and one transmitter are deployed in different spatial layouts. In the proposed system, the CSI measurements from the two receivers are first gathered together and refined to form an integrated walking profile. Then, the RNN encoder reads and encodes the walking profile into a hidden state sequence, which can be regarded as a primary feature representation of the profile. Subsequently, given a specific recognition task (gait or direction), the decoder computes a corresponding attention vector, which is a weighted sum of the hidden states assigned with different attentions and is finally used to predict the recognition target. The attention scheme gives the proposed method the ability to align with some critical clips of the input data sequence for the two different tasks. The main contributions of this work are summarized as follows:

  • By adopting the attention-based RNN encoder-decoder framework, we propose a cycle-independent human gait and walking direction recognition system while the existing Wi-Fi based gait recognition approaches are not reported to cope with these two tasks simultaneously. Given a specific recognition task, the proposed system can adaptively align with the clips, which are critical for that task, of the CSI data sequence. To the best of our knowledge, we are among the first to introduce the attention-based RNN encoder-decoder framework to the Wi-Fi based human gait recognition application scenario.

  • In order to capture more human walking dynamics, we deploy two receivers together with one transmitter in different spatial layouts and splice the spectrograms of CSI measurements received by different receivers to construct integrated walking profiles for recognition purpose. A profile reversion method is proposed to augment the training data and our system is trained to cope with multi-direction gait recognition task, thus individuals aren’t required to walk along a predefined path in a predefined direction as some existing Wi-Fi-based gait recognition systems do.

  • We implement our system on commodity Wi-Fi devices and evaluate its performance by conducting a walking experiment, where 11 subjects are required to walk on 12 different paths in 8 different directions. The experimental results show that our system can achieve average F1 scores of 89.69% for gait recognition from a group of 8 randomly selected subjects and 95.06% for direction recognition from 8 directions, and the average accuracies of these two recognition tasks both exceed 97%.

II Related Work

In the 1970s, Johansson et al. [16] and Cutting et al. [17] conducted similar research and found that viewers could determine the gender of a walker, or even recognize a walker with whom they were familiar, just by watching video of the walker’s prominent joints. Building on these foundations, early human walking activity and gait recognition applications were mainly based on video or image sequences [18, 19, 20, 9, 1, 10]. Polana et al. [18] proposed to recognize human walking activity by analyzing the periodicity of different activities in optical flow. Besides periodicity, Little et al. [19] also extracted moment features of foreground silhouettes and optical flow for walking identification. Moreover, Lee et al. [9] divided the silhouettes into 7 regions that roughly correspond to different body parts and computed statistics on these regions to construct a gait representation and realize human identification and gender classification. Other useful methods, like Gait Energy Image (GEI) [21] and Gait Flow Image (GFI) [10], were proposed to further enhance the quality of the gait representation created from silhouettes and improve the gait recognition rate. However, the video-based methods always require subjects to walk in the direction perpendicular to the optical axis of the cameras so as to get more gait information [1], and they also suffer from other non-negligible drawbacks, such as dependence on LoS and light conditions and personal privacy concerns. Besides, some other data sensors were employed for gait recognition. Orr et al. [11] adopted floor force sensors to record individuals’ footstep force profiles, based on which the footstep models were built and achieved an overall gait recognition rate of 93% for identifying a single subject. Sprager et al. [12] and Primo et al. [13] proposed to use the built-in accelerometers of mobile phones to collect walking dynamics and extract gait features for human recognition. These sensor-based approaches either rely on specially deployed sensors (e.g., floor sensors) or require people to carry or wear devices, which limits their applicability.

Recently, the emerging Wi-Fi-based (mainly CSI-based) sensing techniques have been widely applied to produce many DfP applications, such as human activity recognition [2, 3, 4, 5, 6], indoor localization [22] and gait recognition [8, 7]. Since we are mainly concerned with using Wi-Fi CSI to implement gait recognition, here we introduce some previous work on CSI-based gait recognition. By utilizing the time-domain and frequency-domain gait features extracted from CSI, Zeng et al. [8], Zhang et al. [23] and Wang et al. [7] separately proposed three gait recognition systems called WiWho, WiFi-ID and WifiU. WiWho mainly focused on the low frequency band of 0.3 to 2 Hz of CSI, which contains much interference induced by slight body movements and environment changes [24]. This hinders WiWho from working when the subject is more than 1 meter away from the LoS path between its sender and receiver. In contrast, WiFi-ID and WifiU concentrated on the frequency components of 20 to 80 Hz in CSI measurements. In WiFi-ID, Continuous Wavelet Transformation (CWT) and the RelieF feature selection algorithm were applied to extract gait features in different frequency bands, and the Sparse Approximation based Classification (SAC) was chosen as the classifier. In WifiU, Principal Component Analysis (PCA) and spectrogram enhancement techniques were utilized to generate a synthetic CSI spectrogram from which a set of 170 features were derived, and the SVM classifier was finally employed for human identification. Based on WiWho, Chen et al. [14] introduced an acoustic sensor (i.e., a condenser microphone) as a complementary sensing module to implement a multimodal human identification system, named Rapid, which could guarantee a more robust classification result in comparison to WiWho. Most of these systems could achieve an average human identification accuracy of around 92% to 80% from a group of 2 to 6 subjects. One important function of the microphone used in Rapid is to detect the start and end points of each walking step, i.e., gait cycle detection, which is an indispensable part of the vast majority of video-, sensor- and CSI-based gait recognition systems.

However, segmenting gait cycles from CSI measurements is difficult since the variation patterns induced by walking are sometimes buried in the noise [7], and some sophisticated signal processing techniques are needed to carefully refine the data. After gait cycle detection, most existing methods (like Rapid, WifiU, etc.) generate empirically hand-crafted features from the cycle-wise gait data and then train the gait classifiers. Different from all the aforementioned work, our system introduces the attention-based RNN encoder-decoder architecture to (i) adaptively align with some important time slices of the CSI data sequence, which means no cycle partitioning is needed, and (ii) automatically learn to extract effective feature representations for gait recognition. Another major difference between our system and the above work is that our system is trained to cope with multi-direction gait recognition (12 walking paths and 8 walking directions relative to the system equipment), where individuals don’t have to walk along a predefined path in a predefined direction as WiWho or WifiU requires. In what follows, we will specify the design of the proposed system.

III Background & Motivation

To understand how human walking activity exerts impacts on Wi-Fi signals, we first give a brief overview of CSI and the multipath effect of wireless signals. And then, we explain the network architecture of the attention-based RNN encoder-decoder, which powers our system to run the core functions.

III-A Channel State Information

In wireless communication, the Channel Frequency Response (CFR) characterizes the small-scale multipath effect and the frequency-selective fading of the wireless channel. Let $X(f,t)$ and $Y(f,t)$ denote the frequency-domain transmitted and received signal vectors, respectively. The relation between them can be modeled as $Y(f,t) = H(f,t) \times X(f,t) + N$, where $f$ is the carrier frequency, $t$ indicates that the channel is time-varying, $H(f,t)$ is the complex-valued CFR and $N$ represents the noise. In Wi-Fi networks, Channel State Information (CSI) is used to monitor the channel properties; it is a simplified version of the CFR sampled at discrete subcarrier frequencies $f_i$ ($i = 1, 2, \ldots, N_s$), where $N_s$ denotes the total number of subcarriers across a Wi-Fi channel. With the CSI tool [26], a CSI measurement, which contains 30 matrices with dimensions $N_{tx} \times N_{rx}$, can be obtained from one physical frame, where $N_{tx}$ and $N_{rx}$ represent the number of antennas of the transmitter (Tx) and the receiver (Rx), respectively. We regard the CSI measurements collected from each Tx-Rx antenna pair as a CSI stream, thus there are $N_{tx} \times N_{rx}$ time-series CSI streams.

Fig. 1: Multipath effect of Wi-Fi signals in indoor environment.

III-B Multipath Effect of Wi-Fi Signals

In Fig. 1, one transmitted signal can travel directly through the Line-of-Sight (LoS) path, or be reflected off the wall and the walking subject and propagate through multiple paths before arriving at the receiver. This phenomenon is called the multipath effect [27]. The multiple paths can be divided into two categories: static paths (solid lines in Fig. 1) and dynamic paths (caused by the walking subject, shown as dotted lines in Fig. 1). If there exist $N$ propagation paths in total, among which $N_d$ paths are dynamically varying, then $H(f_i, t)$ of the $i$-th subcarrier at time $t$ can roughly be expressed as

$$H(f_i, t) = H_s(f_i) + \sum_{k=1}^{N_d} a_k(f_i, t)\, e^{-j 2\pi d_k(t)/\lambda_i} \tag{1}$$

where $H_s(f_i)$ is the sum of the responses of the static paths and can be regarded as a constant [3], since signals traveling through the static paths have relatively invariable path length and propagation attenuation, $a_k(f_i, t)$ represents the attenuation of the $k$-th dynamic path, $\tau_k(t)$ and $e^{-j 2\pi f_i \tau_k(t)} = e^{-j 2\pi d_k(t)/\lambda_i}$ separately denote the propagation delay and the phase shift when the path length is $d_k(t)$, and $\lambda_i$ is the wavelength of subcarrier $i$. In terms of Fig. 1, at time $t$, the length of path $k$ can be expressed as $d_k(t) = d_k(0) + v_k t$, where $v_k$ represents the rate of change of the length of path $k$. Therefore, $e^{-j 2\pi d_k(t)/\lambda_i} = e^{-j 2\pi (d_k(0) + v_k t)/\lambda_i}$. Usually, a displacement of $\lambda$ of the subject roughly causes a length change of $2\lambda$ of the dynamic path (round trip), which introduces a $4\pi$ phase change. According to the principle of superposition of waves, this phase change finally induces 2 cycles of amplitude change in the CSI values [3], which reveals an approximate relation between the human moving speed ($v$) and the frequency ($f$) of the amplitude variation of the CSI values, i.e., $f = 2v/\lambda$.
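As a rough numerical check of the approximate relation between moving speed and CSI fluctuation frequency (f ≈ 2v/λ), the sketch below converts a few speeds into the frequencies they would induce. The 5.32 GHz carrier and the example speeds are assumptions chosen for illustration, not values reported in the experiments.

```python
# Speed of light in m/s.
C = 299_792_458.0

def csi_fluctuation_hz(speed_mps: float, carrier_hz: float = 5.32e9) -> float:
    """Approximate CSI amplitude fluctuation frequency, f = 2 * v / wavelength."""
    wavelength = C / carrier_hz          # about 5.6 cm at the assumed carrier
    return 2.0 * speed_mps / wavelength

if __name__ == "__main__":
    # Illustrative speeds only (hypothetical values for different body parts).
    for v in (0.5, 1.0, 1.5):
        print(f"v = {v:.1f} m/s -> f = {csi_fluctuation_hz(v):.1f} Hz")
```

Under these assumptions, a speed of about 1 m/s maps to a fluctuation of roughly 35 Hz, which sits inside the 20 to 60 Hz band that the later processing steps focus on.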

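In addition, according to the Friis free space propagation model, the transmitting power ($P_t$) and the receiving power ($P_r$) of Wi-Fi signals have the following relation [27]:

$$P_r = \frac{P_t G_t G_r \lambda^2}{(4\pi)^2 d^2} \tag{2}$$

where $G_r$ and $G_t$ are the gains of the Rx and Tx antennas, respectively, $\lambda$ is the signal wavelength and $d$ is the length of the LoS path. Besides, the receiving power $P_r(f_i)$ of the $i$-th subcarrier is proven to be basically proportional to the CSI power $|H(f_i)|^2$ of that subcarrier [28, 29], i.e., $P_r(f_i) \propto |H(f_i)|^2$. Thus, combined with equation (2), we can get the relation $|H(f_i)|^2 \propto 1/d^2$.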

Based on the above explanation, we can get some useful information:

  1. Every individual has a unique gait, which means that different people cause Wi-Fi signals to propagate through different paths and produce quite different changing patterns of those paths. These patterns are imprinted on the CSI measurements and reflected in the changing patterns of the CSI amplitudes, so the gait recognition task can probably be realized by digging into the CSI measurements.

  2. As shown in Fig. 1, assume that Tx and Rx are fixed, i.e., the distance of the LoS path is constant. If there is no moving object in the environment, the power of the received CSI measurements will be relatively steady. However, when a person walks towards the Tx and the Rx, the distances of the static paths remain constant while the distances of the dynamic paths induced by the person get shorter and the receiving power of the dynamic paths gets higher, so the variance of the CSI power or CSI amplitude becomes larger and larger, and vice versa. Fig. 2 displays the variation of the CSI amplitude when a person walks in different directions relative to the Tx and the Rx. Therefore, we can roughly regard the variance of the CSI power as increasing while the dynamic path lengths shrink and decreasing while they grow, and [14] also drew a similar conclusion. Based on this, we can deduce the walking direction of a person by analyzing the variation trend of the CSI power or CSI amplitude.

By now, we have explained the basis and feasibility of walking gait and direction recognition using Wi-Fi signals. Next, we will introduce the core model employed in the proposed method, namely the attention-based RNN encoder-decoder.

(a) The variation of CSI amplitude when a person walks away from the Tx and the Rx.
(b) The variation of CSI amplitude when a person walks towards the Tx and the Rx.
Fig. 2: The variation of CSI amplitude when a person walks in different directions relative to the Tx and the Rx (in Fig. 1).

III-C Attention-based RNN encoder-decoder

Fig. 3: The standard architecture of RNN encoder-decoder.

III-C1 Standard RNN encoder-decoder

The time-series CSI measurements are a kind of sequence data whose lengths can be arbitrary. Given a sequence of CSI inputs $X = (x_1, \ldots, x_T)$, the target of our system is to output another predicted sequence $(y_1, \ldots, y_{T'})$, such as walking direction and walker identity. Therefore, this process can be formulated as a sequence to sequence learning problem [30], and the RNN encoder-decoder is naturally adopted in this work. Fig. 3 illustrates the architecture of a standard RNN encoder-decoder [31], which learns to encode an input sequence into a fixed-length summary vector and to decode the vector into an output sequence. At each time step $t$, the hidden state $h_t$ of the RNN encoder is updated by

$$h_t = f(x_t, h_{t-1}) \tag{3}$$

and the summary vector $c$ is generated by

$$c = q(\{h_1, \ldots, h_T\}) \tag{4}$$

where $f$ is a non-linear activation function, which can be a Long Short-Term Memory (LSTM) unit or a Gated Recurrent Unit (GRU) [31] and can automatically extract high-quality features of the input data, and $q$ can be a simple function that picks the last hidden state, i.e., $q(\{h_1, \ldots, h_T\}) = h_T$.
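To make the encoder recurrence concrete, the following is a minimal NumPy sketch of a GRU encoder that returns every hidden state and uses the last one as the summary vector. It is an illustration only: biases are omitted, the weights are random rather than trained, and the dimensions are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(X, params):
    """Run a GRU over X (T x input_dim); return all hidden states (T x hidden_dim)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])                        # initial hidden state
    states = []
    for x_t in X:
        z = sigmoid(Wz @ x_t + Uz @ h)               # update gate
        r = sigmoid(Wr @ x_t + Ur @ h)               # reset gate
        h_cand = np.tanh(Wh @ x_t + Uh @ (r * h))    # candidate state
        h = (1.0 - z) * h + z * h_cand               # new hidden state from x_t and previous h
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 8, 16, 20
params = [rng.normal(scale=0.1, size=shape)
          for shape in [(hidden_dim, input_dim), (hidden_dim, hidden_dim)] * 3]
H = gru_encode(rng.normal(size=(T, input_dim)), params)
c = H[-1]   # summary vector: the last hidden state
```

Keeping all hidden states, rather than only the last one, is what the attention-based variant described later relies on.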

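The decoder is another RNN, and it is trained to learn the following conditional distribution:

$$p(y_i \mid y_1, \ldots, y_{i-1}, c) = g(y_{i-1}, s_i, c), \quad s_i = f(s_{i-1}, y_{i-1}, c) \tag{5}$$

Here, $y_i$ is the predicted result, $s_i$ is the hidden state of the $i$-th prediction step, $f$ is again a non-linear function and $g$ is usually a softmax function.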

With the help of the RNN encoder-decoder model, the proposed system can be jointly trained to maximize the probability of a series of recognition tasks (e.g., direction estimation and human identification) given a CSI sequence $X$:

$$p(y_1, \ldots, y_{T'} \mid X) = \prod_{i=1}^{T'} p(y_i \mid y_1, \ldots, y_{i-1}, c) \tag{6}$$

where $T'$ is the number of outputs and $c$ is the summary vector. However, a major concern is that the standard RNN encoder-decoder model tries to compress all input data into a single fixed-length vector, which is used to predict all the output labels. Different outputs probably have different connections to the inputs: for example, prediction of walking direction may need to sense the variation trend of the entire CSI power sequence, while human identification may focus more on some critical clips of the CSI sequence. To address this concern, the attention-based RNN encoder-decoder is adopted in our method.

Fig. 4: The architecture of attention-based RNN encoder-decoder.

III-C2 RNN encoder-decoder with attention scheme

As Fig. 4 shows, the key difference between the attention-based and the standard RNN encoder-decoder is that the attention-based model adaptively encodes the input sequence into different summary vectors, which we call attention vectors, for different predictions. The attention-based RNN encoder-decoder model was first proposed for machine translation [15], and it can learn to align and translate simultaneously. In this model, the conditional probability of equation (5) is rewritten as

$$p(y_i \mid y_1, \ldots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i) \tag{7}$$

and $s_i = f(s_{i-1}, y_{i-1}, c_i)$. The distinct attention vector $c_i$ corresponding to the target $y_i$ is expressed as the weighted sum of all the hidden states of the RNN encoder ($h_1, \ldots, h_T$):

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j \tag{8}$$

where $\alpha_{ij}$ is the attention weight, and it is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \quad e_{ij} = a(s_{i-1}, h_j) \tag{9}$$

The function $a$ scores how well the input data at each time step and the current output match [15], which enables the model to adaptively align with some important parts of the inputs when predicting a certain target. Note that the initial hidden state $s_0$ of the RNN decoder is computed from the encoder hidden states in the specific implementation of [15].
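The attention computation can be sketched in a few lines of NumPy. The example below uses the additive scoring form from [15], i.e., a score of v^T tanh(W s + U h) for each encoder state; the parameter names `W`, `U`, `v` and all sizes are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def attention_vector(s_prev, H, W, U, v):
    """Additive attention over encoder states H (T x d_h) given decoder state s_prev (d_s,)."""
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v    # one score per time step, shape (T,)
    scores = scores - scores.max()                   # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax: attention weights sum to 1
    context = alpha @ H                              # weighted sum of hidden states, shape (d_h,)
    return context, alpha

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(1)
T, d_h, d_s, d_a = 20, 16, 16, 12
H = rng.normal(size=(T, d_h))
context, alpha = attention_vector(rng.normal(size=d_s), H,
                                  rng.normal(size=(d_a, d_s)),
                                  rng.normal(size=(d_a, d_h)),
                                  rng.normal(size=d_a))
```

Inspecting `alpha` for a trained model would show which time steps of the walking profile a given prediction relies on.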

IV System Design

The proposed system consists of four essential modules, which are CSI Collection, Raw CSI Processing, Walking Profile Generation and Gait and Direction Recognition. In what follows, the detailed processing procedures of each module will be explained.

IV-A CSI Collection

Since the 2.4 GHz Wi-Fi band is narrower and more crowded than the 5 GHz band, the latter is a much better choice for less inter-channel interference and more reliable communication. Therefore, our system is set to run on the 5 GHz Wi-Fi band. A laptop equipped with an Intel 5300 wireless card and 2 omni-directional antennas serves as the transmitter. In order to capture more walking dynamics, two laptops equipped with Intel 5300 wireless cards (each with 3 omni-directional antennas) are deployed as the receivers. To concentrate on different body parts of a walker, the receivers are placed at different heights: one is 0.5 m and the other is 1.0 m above the ground level, and the transmitter is placed at a height of 0.75 m. The transmitter continuously sends 802.11n data packets to the receivers, which report the CSI measurements of correctly received packets with the tool released in [26]. Fig. 5 illustrates the amplitude variance of the CSI received by the 2 receivers when a subject performs two different movements (in-place walking without swinging arms, and swinging arms without walking); the lower receiver is more sensitive to leg movements while the higher one is more sensitive to arm movements. Considering that human activities in a traditional indoor environment introduce frequencies of no more than 300 Hz in CSI measurements [3], and in line with the Nyquist sampling theorem, our system is configured with a sampling rate ($F_s$) of 1000 Hz. For each receiver, the received CSI data form a matrix with dimensions $N_{st} \times N_{sc} \times L$, where $N_{st}$ and $N_{sc}$ denote the number of streams and the number of subcarriers in each stream, respectively, and $L$ is the data length. To eliminate the impact of Carrier Frequency Offset (CFO), we only keep the CSI amplitude and ignore the CSI phase in our system, as [3] suggested.

IV-B Raw CSI Processing

IV-B1 Long Delay Removal

The Channel Impulse Response (CIR), which is the inverse Fourier transform of the CFR, characterizes the propagation delays of the received signals. Signals with long propagation delays are probably reflected by static or dynamic objects far away from the transceivers; these signals are useless and can distort the CSI amplitudes. Theoretically, every signal with a certain propagation delay can be separated from the CIR, but limited by the bandwidth of the Wi-Fi channel (i.e., 20 MHz), the time resolution of the CIR is approximately 1/20 MHz = 50 ns [33]. Therefore, we can only distinguish a series of signal clusters with discrete time delays. Besides, a previous study shows that the maximum delay in a general indoor environment is less than 500 ns [34]. Thus, we transform each CSI measurement into the time-domain CIR by the Inverse Fast Fourier Transform (IFFT), remove the components whose propagation delays are longer than 500 ns, and then convert the processed CIR back to CSI by the Fast Fourier Transform (FFT).
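The long delay removal step can be sketched as an IFFT, a truncation of late taps, and an FFT. The sketch below assumes complex CSI values and treats the reported subcarriers as evenly spaced across the 20 MHz channel, so that tap k corresponds to a delay of k × 50 ns; both points are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

def remove_long_delays(csi, bandwidth_hz=20e6, max_delay_s=500e-9):
    """Zero the CIR taps whose delay exceeds max_delay_s.

    csi: complex array of shape (..., n_subcarriers), one CSI measurement per row.
    """
    cir = np.fft.ifft(csi, axis=-1)                   # CSI -> time-domain CIR
    max_tap = int(round(max_delay_s * bandwidth_hz))  # 500 ns / 50 ns = tap index 10
    cir[..., max_tap + 1:] = 0                        # drop components beyond 500 ns
    return np.fft.fft(cir, axis=-1)                   # CIR -> cleaned CSI

# Synthetic check: one short-delay tap (100 ns) and one long-delay tap (1000 ns).
cir_true = np.zeros(30, dtype=complex)
cir_true[2], cir_true[20] = 1.0, 0.5
cleaned = remove_long_delays(np.fft.fft(cir_true)[None, :])
cir_out = np.fft.ifft(cleaned, axis=-1)[0]
```

In the synthetic check, the tap at 100 ns survives while the tap at 1000 ns is removed.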

IV-B2 CSI Denoising

After removing the paths with long delays, the CSI values still contain significant high-frequency noise and low-frequency interference [24]. Since the frequency components induced by walking, i.e., the frequencies of the CSI amplitude variation, are approximately within the range of 20 to 60 Hz in the 5 GHz Wi-Fi band [7, 24], the proposed system adopts a Butterworth bandpass filter, which guarantees high fidelity of the retained signals in the passband, to eliminate the high-frequency and low-frequency noise. The upper and lower cutoff frequencies of the Butterworth filter are empirically set to 90 Hz and 5 Hz, respectively. The direct current component (0 Hz) of each subcarrier is also removed by the bandpass filter. Subsequently, the proposed system applies a weighted moving average to further denoise and smooth the CSI amplitudes.
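A minimal sketch of this denoising stage with SciPy is given below. The 5 Hz and 90 Hz cutoffs and the 1000 Hz sampling rate come from the text; the filter order, the zero-phase filtering, the window length and the triangular weights of the moving average are assumptions, since they are not specified.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_and_smooth(x, fs=1000.0, low_hz=5.0, high_hz=90.0, order=4, wma_len=5):
    """Butterworth band-pass followed by a weighted moving average along axis 0 (time)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, x, axis=0)                   # zero-phase filtering (assumed choice)
    w = np.bartlett(wma_len + 2)[1:-1]                # symmetric triangular weights
    w = w / w.sum()                                   # normalize so the weights sum to 1
    smooth = lambda col: np.convolve(col, w, mode="same")
    return np.apply_along_axis(smooth, 0, y)

# Synthetic check: DC offset + 30 Hz "walking" tone + 300 Hz noise tone.
t = np.arange(2000) / 1000.0
x = 3.0 + np.sin(2 * np.pi * 30 * t) + np.sin(2 * np.pi * 300 * t)
y = bandpass_and_smooth(x)
```

On the synthetic signal, the DC offset and the 300 Hz tone are suppressed while the 30 Hz tone passes nearly unchanged.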

IV-B3 CSI Refining

As mentioned above, the time-series CSI amplitudes of the 30 subcarriers within one CSI stream come from one Tx-Rx antenna pair, which means they reflect quite similar multipath propagation of Wi-Fi signals and have correlated changing patterns. However, different CSI subcarriers have slightly different carrier frequencies, which results in some phase shifts and slightly different attenuations of the CSI amplitudes in each CSI stream. Directly using all the correlated data may bias our system towards some deeply faded and unreliable subcarriers. To ensure better recognition results, the system utilizes Principal Component Analysis (PCA) to automatically discover the correlation between the CSI amplitudes in each CSI stream and produce synthesized combinations (principal components) [3]. Fig. 6 compares the original CSI amplitudes of the 7th subcarrier with the first PCA components of all 30 subcarriers for the two walking instances mentioned above. We can see that the first PCA component better depicts the changing pattern and trend of the CSI variation induced by walking. Here the first three principal components, which capture the most details of the walking movement, are selected as the refined CSI data. For each receiver, we obtain 6 groups of refined CSI data sequences in total since there are 6 CSI streams.
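The PCA refinement of one CSI stream can be sketched with a singular value decomposition. The function below is a hypothetical helper that takes a (T × 30) amplitude matrix and returns its first three principal components; the synthetic data only serve to show that a pattern shared by all subcarriers is recovered.

```python
import numpy as np

def first_principal_components(stream, n_components=3):
    """stream: (T, n_subcarriers) CSI amplitudes of one Tx-Rx stream -> (T, n_components)."""
    centered = stream - stream.mean(axis=0)           # remove the per-subcarrier mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T             # project onto the leading directions

# Synthetic stream: a common 30 Hz pattern with subcarrier-specific gains plus noise.
rng = np.random.default_rng(0)
t = np.arange(1000) / 1000.0
common = np.sin(2 * np.pi * 30 * t)
stream = np.outer(common, rng.uniform(0.5, 1.5, 30)) + 0.05 * rng.normal(size=(1000, 30))
pcs = first_principal_components(stream)
```

The sign of each principal component is arbitrary, so downstream steps should not depend on it.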

(a) Amplitude variance of in-place walking w/o swing arms.
(b) Amplitude variance of swing arms w/o walking.
Fig. 5: The amplitude variance of CSI received by different receivers.
Fig. 6: Comparisons between the original CSI amplitudes of the 7# subcarrier and the first PCA components of total 30 subcarriers for the two walking instances: (i) walking away from and (ii) walking towards the devices. (a) and (b) are separately the original CSI amplitude and the PCA component of instance (i); (c) and (d) are separately the original CSI amplitude and the PCA component of instance (ii).

IV-C Walking Profile Generation

IV-C1 Walking Detection

Compared with other daily activities, such as sitting down and cooking, walking has some special characteristics: (i) it involves the motions of many different body parts, (ii) the moving speeds of different body parts are relatively high, and (iii) it can last for a fairly long time. Since the frequencies induced by walking are mainly between 20 Hz and 60 Hz, the Energy of Interest (EI), which equals the sum of the squared (normalized) magnitudes of the FFT coefficients in the frequency range of 20 to 60 Hz, is calculated and compared with an appropriate threshold to detect walking movement, as in [24, 8]. The width of the FFT window is set to 256 samples (0.256 s at the 1000 Hz sampling rate) based on the trade-off between detection accuracy and system response rate. Whenever a walking movement is detected, the system starts its core functions immediately, namely generating the walking profile and then recognizing walking gait and direction.
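The EI-based walking detector can be sketched as follows. The 20 to 60 Hz band and the 256-sample window follow the text; normalizing the FFT magnitudes by the window length and the threshold value used in the example are assumptions.

```python
import numpy as np

def energy_of_interest(window, fs=1000.0, band=(20.0, 60.0)):
    """Sum of squared, length-normalized FFT magnitudes inside the walking band."""
    window = np.asarray(window, dtype=float)
    n = len(window)
    mags = np.abs(np.fft.rfft(window - window.mean())) / n   # normalized magnitudes
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(np.sum(mags[in_band] ** 2))

def is_walking(window, threshold):
    """Flag a window as walking when its EI exceeds a (hypothetical) threshold."""
    return energy_of_interest(window) > threshold

# Synthetic 256-sample windows: a 40 Hz "walking" tone and a weak 2 Hz drift.
t = np.arange(256) / 1000.0
walking = np.sin(2 * np.pi * 40 * t)
idle = 0.01 * np.sin(2 * np.pi * 2 * t)
```

In practice the threshold would be calibrated on recordings of the target environment with and without walking.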

IV-C2 Spectrogram Generation

The refined CSI data can only depict the signal changing patterns in the time domain, where the signal reflections of different body parts are mixed together. When a subject is walking, its body parts (such as legs, arms and torso) have different moving speeds, and the signals reflected by different body parts have quite different energy because the parts have different reflection areas. Specifically, the swinging leg (especially the lower leg) has the highest moving speed, while the supporting leg and the torso have low moving speeds; the signals reflected from the torso have the strongest energy while the arm reflections have much weaker energy. By utilizing the Short-Time Fourier Transform (STFT), the system converts the time-domain refined CSI data sequence into a spectrogram in the time-frequency domain. In practice, the CSI data sequences are first segmented into fixed-length chunks, which usually overlap each other so as to reduce artifacts at the boundaries. Then each chunk is transformed by FFT, and the logarithm of the squared FFT magnitudes yields the final spectrogram. The spectrogram (of the PCA component) of the “walking away” instance is illustrated in Fig. 7, where the relatively “hot” colored areas (with strong energy) in each chunk mainly represent the torso reflections and some orange or yellow areas indicate the reflections of legs or arms. The time-varying trend of energy is still maintained in the spectrogram, which is good evidence for judging the walking direction. Some advanced signal processing techniques used in [7], like spectrogram enhancement, are not applied in our system to further refine the spectrogram; on the contrary, we delegate more power to the RNN encoder-decoder network and let it learn to find the important information in the noisy data.
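The spectrogram generation described above (overlapping chunks, FFT per chunk, logarithm of the squared magnitudes) can be sketched with `scipy.signal.stft`. The chunk length, the overlap and the default Hann window are assumptions, as the text does not state them.

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram(x, fs=1000.0, nperseg=256, noverlap=192):
    """Log power spectrogram of a 1-D refined CSI sequence."""
    freqs, times, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(Z) ** 2                            # squared magnitude of each chunk's FFT
    return freqs, times, np.log10(power + 1e-12)      # small offset avoids log(0)

# Synthetic 2-second sequence with a 40 Hz component.
t = np.arange(2000) / 1000.0
freqs, times, S = log_spectrogram(np.sin(2 * np.pi * 40 * t))
```

Each column of `S` is one chunk, so the number of columns is the sequence length seen by the encoder.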

Fig. 7: The spectrograms of the refined CSI data of a “walking away” instance.

IV-C3 Profile Splicing

Since the original CSI amplitudes are denoised with a bandpass filter, the frequencies within [5, 90] Hz are retained, and the frequency components induced by walking are mainly in the range of 20 to 60 Hz. The proposed method keeps the data of 64 out of the 85 FFT points in the spectrogram, which corresponds to a frequency range of around 10 to 70 Hz, giving a $64 \times T$ spectrogram for each PCA component, where $T$ is the length of the spectrogram (i.e., the number of chunks). Stacking the data of the three PCA components forms the primary walking profile with dimensions $192 \times T$. To enhance the primary profile, the proposed system generates dynamical features (also known as delta features) by applying discrete time derivatives to the primary profile. Dynamical features have proven to perform very well in many Automatic Speech Recognition (ASR) systems [35, 36], since they can reveal underlying connections and characteristics of adjacent speech frames. By concatenating the first-order derivative, namely the first-order difference, of the primary profile, the proposed method gets a $384 \times T$ walking profile for each CSI stream of a given receiver.

In addition, the proposed system splices the walking profiles corresponding to the same Tx-Rx antenna pair of the lowly-placed and the highly-placed receivers to add more walking dynamics to the profile. Finally, a “rich” integrated walking profile with dimensions $T \times 256$ is constructed, and a total of 6 integrated walking profiles can be produced in each processing cycle.
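As a rough illustration of the delta-feature and splicing steps, the sketch below appends the first-order time difference to a profile and then concatenates the profiles of the two receivers along the feature axis. The helper names `add_delta` and `splice` are hypothetical, and padding the first delta row with zeros and trimming both profiles to a common length are assumptions made for the example.

```python
import numpy as np

def add_delta(profile):
    """Append first-order time differences: shape (T, F) -> (T, 2F)."""
    profile = np.asarray(profile, dtype=float)
    # Prepending the first row makes the first difference zero (an assumption)
    # and keeps the number of time steps unchanged.
    delta = np.diff(profile, axis=0, prepend=profile[:1])
    return np.concatenate([profile, delta], axis=1)

def splice(low_rx, high_rx):
    """Concatenate two receivers' profiles along the feature axis."""
    t = min(len(low_rx), len(high_rx))   # trim to a common length (assumption)
    return np.concatenate([low_rx[:t], high_rx[:t]], axis=1)

# 64 retained FFT points per time step, as stated in the text.
low = add_delta(np.random.rand(30, 64))     # (30, 128)
high = add_delta(np.random.rand(30, 64))    # (30, 128)
integrated = splice(low, high)              # (30, 256)
```

Under these assumptions the feature width of the integrated profile is 256, which matches the 256 input units of the encoder reported in the model training subsection.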

IV-C4 Profile Reversion and Standardization

From Fig. 6, we can see that the CSI amplitudes have relatively symmetrical time-varying patterns when the subject walks in opposite directions. This inspires us to reverse the data sequences of the walking profiles along the time dimension in order to augment the instance data of the opposite walking movements. The performance of neural networks such as RNNs can usually be improved with more training data [37]. In the proposed system, profile reversion doubles the data used for training or recognition, and it is expected to improve the performance of our system.

Before feeding data to the neural networks, it is necessary to standardize the data, i.e., to put all the variables on roughly the same scale (with zero mean and unit variance), which can help to speed up convergence and improve the performance of the networks [38]. The statistical standardization method is applied in the system. To be specific, the proposed system calculates the global mean $\mu$ and standard deviation $\sigma$ of an integrated walking profile, then subtracts $\mu$ from each variable of the profile and divides the difference by $\sigma$.
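The two operations of this subsection can be sketched in a few lines of NumPy. The function names are hypothetical, and the code only illustrates the idea: a profile is a 2-D array with time along the first axis, reversion flips that axis, and standardization uses the global mean and standard deviation of the whole profile.

```python
import numpy as np

def reverse_profile(profile):
    """Reverse a (T, F) walking profile along the time axis."""
    return np.asarray(profile)[::-1].copy()

def standardize(profile):
    """Scale a profile to zero mean and unit variance using global statistics."""
    profile = np.asarray(profile, dtype=float)
    mu = profile.mean()      # global mean over all time steps and features
    sigma = profile.std()    # global standard deviation (assumed non-zero)
    return (profile - mu) / sigma

profile = np.array([[1.0, 2.0], [3.0, 4.0]])
augmented = reverse_profile(profile)   # rows appear in reverse time order
z = standardize(profile)               # mean 0, standard deviation 1
```

When a reversed profile is used for training, its direction label has to be changed to the opposite walking direction, while the subject label stays the same.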

So far, all the preparation work has been done, and it’s time to build our attention-based RNN encoder-decoder networks.

IV-D Model Customization for Gait and Direction Recognition

The existing attention-based RNN encoder-decoder neural networks are specialized for Natural Language Processing (NLP) applications, especially for the machine translation task [15], and the trained networks can adaptively concentrate on important words in the source sentence when generating the target word. This distinguishing characteristic motivates us to create a cycle-independent gait and direction recognition system that works on arbitrarily segmented CSI data. However, there are two main differences between machine translation and our tasks. Firstly, machine translation is a single-task learning problem, while our system aims to deal with two different tasks (direction and gait recognition), namely multitask learning [39]. Secondly, apart from the dependence on the source sentence, there exist statistical relations among the target words, which are basically described by a statistical language model. Therefore, at the decoder side, the predicted target word is subsequently used as the input for predicting the next word, as illustrated in Fig. 4. In our system, however, there is no explicit relation between human gait and walking direction, so in order to jointly train the networks for our objectives, we need to customize our own model.

  • Encoder: In the system, the input of the RNN encoder is the CSI data sequence (the integrated 256-dimensional walking profile), which is denoted as $\mathbf{x} = (x_1, \dots, x_T)$, where each $x_t$ is a vector with dimension 256. The output of the RNN decoder is denoted as $\mathbf{y} = (y_1, y_2)$, where $y_1 \in \{1, \dots, K_d\}$ and $y_2 \in \{1, \dots, K_g\}$, and $K_d$ and $K_g$ are the number of walking directions and the number of subjects, respectively. Moreover, more variables can be defined for other tasks such as gender classification. In this work, we are mainly concerned with gait and direction recognition. Considering that human gait and walking direction have no explicit relation with each other, the conditional probability of equation (7) needs to be rewritten as

    $$p(y_i \mid \mathbf{x}) = g(s_i, c_i),$$

    and the computation of $s_i$ is also isolated from the former predicted target, namely $s_i = f(s_{i-1}, c_i)$. As suggested in [15], a bidirectional RNN (BiRNN) framework is utilized to create our RNN encoder. A BiRNN presents each input sequence forwards and backwards to two separate recurrent hidden layers, which are connected to the same output layer, and it is reported to perform better than the unidirectional RNN [40, 41]. In a BiRNN, the hidden state $h_t$ is expressed as the concatenation of the forward hidden state $\overrightarrow{h}_t$ and the backward one $\overleftarrow{h}_t$, i.e., $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

  • Decoder: As for the decoder, the computation of the attention weights and attention vectors does not depend on the predicted targets, which implies that the proposed method can directly and faithfully implement the computation of attention weights and attention vectors as described in [15] without any major modification. The proposed method is expected to learn and compute valuable attention weights which enable the system to automatically align with some critical clips of the input data sequence for the two different tasks. Due to the lack of future context, a unidirectional RNN with the same number of hidden layers as the encoder is employed in the proposed system. Thus, at the end of the encoding stage, the bidirectional hidden state of the last input of the encoder needs to be transformed to match the size of the unidirectional hidden state of the decoder. In this work, the transformation simply adds the forward hidden state $\overrightarrow{h}_T$ and the backward hidden state $\overleftarrow{h}_T$. For networks with multiple hidden layers, the transformation can be executed on each layer accordingly. Following the hidden layer(s), two fully connected layers are adopted for the two different tasks.
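To make the decoder-side computation more concrete, the following NumPy sketch shows an additive attention step in the style of [15] together with the additive merge of the two encoder directions. It is only an illustration under stated assumptions: the parameter names `W_a`, `U_a` and `v_a`, the function names, and the toy sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_vector(s_prev, H, W_a, U_a, v_a):
    """Additive attention over encoder states.

    s_prev: (n,) decoder hidden state; H: (T, 2n) BiRNN encoder states.
    Returns the attention weights (T,) and their weighted sum of H, shape (2n,).
    """
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # one score per time step
    alpha = softmax(scores)                          # attention weights sum to 1
    return alpha, alpha @ H                          # weighted sum of encoder states

def merge_bidirectional(h_fwd, h_bwd):
    """Map the last forward/backward encoder states to the decoder state size by addition."""
    return h_fwd + h_bwd

# Toy sizes (assumptions): n hidden units per direction, d attention units, T time steps.
rng = np.random.default_rng(0)
n, d, T = 4, 5, 7
H = rng.normal(size=(T, 2 * n))
s0 = merge_bidirectional(H[-1, :n], H[-1, n:])       # initial decoder state, shape (n,)
alpha, c = attention_vector(s0, H, rng.normal(size=(n, d)),
                            rng.normal(size=(2 * n, d)), rng.normal(size=d))
```

Because the scores depend only on the decoder state and the encoder states, the same routine can be called once per task (direction, then gait) without feeding the previous prediction back into the decoder.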

V Implementation & Evaluation

V-A Experiment Setup

In order to conveniently collect more CSI data of walking in different directions, we use a spacious laboratory of size 9.5 m × 7.8 m as the CSI data collection environment, which is shown in Fig. 8. The transmitter and the receivers are placed abreast at the marked positions, and the distance between each receiver and the transmitter is 1.0 m. As mentioned in subsection IV-A, to concentrate on different body parts of a walker, the two receivers are placed at heights of 0.5 m and 1.0 m above ground level, respectively, and the transmitter is placed at a height of 0.75 m. All devices operate on Wi-Fi channel 149 (center frequency 5.745 GHz), which is not subject to the dynamic frequency selection (DFS) and transmit power control (TPC) constraints of the Federal Communications Commission (FCC). Using the CSI tool [26], the transmitter and the receivers are configured to work in monitor mode, where the data packets injected by the transmitter can be captured simultaneously by the two receivers. In addition, the clocks of the two receivers are synchronized by the Network Time Protocol (NTP) so as to obtain CSI measurements with synchronized timestamps. A deep learning server with two high-performance NVIDIA GeForce GTX 1080Ti Graphics Processing Units (GPUs) is connected to the receivers by cables. The server is dedicated to data processing and model training.

A 6.0 m × 6.0 m grid-layout area, 1.5 m away from the transmitter, is planned for the walking experiment, and the size of each grid cell is 1.0 m × 1.0 m. As shown in Fig. 8, 12 specific paths (marked with solid lines) are assigned for subjects to walk on, and the labels of the 8 walking directions are annotated at the bottom left of the layout plan. The detailed experiment process is described in the next subsection.

Fig. 8: Experiment environment.
Fig. 9: Time distribution of subjects walking on different paths.

V-B Experiment Process

V-B1 Dataset Creation

In this work, 11 subjects (8 male and 3 female graduate students) are invited to carry out the walking experiment. Out of privacy concerns, we do not record private information such as the age, height and weight of each subject. Each time, a subject is asked to walk in a natural way between the two endpoints of one straight path, for example between D and D’ of path DD’ in Fig. 8. Meanwhile, each receiver sends its CSI measurements, accompanied by timestamps, to the server, and the server stores the CSI data and the timestamps for further processing. When the subject arrives at an endpoint, the subject turns around and walks back to the other endpoint. For each path, the subject is required to walk for 5 minutes without a break (an alarm clock is provided to remind the subject) and then moves on to another path. Therefore, we obtain a total of 60 minutes of data covering 12 walking paths and 8 different directions from each subject. During the experiment, two video cameras are used to record the whole experimental process, as in [42], and the CSI data corresponding to walking (excluding turning and breaks) are manually labeled by our recorders. We promise that all the video resources are used only for the experiment and will be deleted as soon as we finish the paper, to protect the privacy of the subjects.

A CSI data sequence that covers a single-trip walk along a path is regarded as a CSI walking instance, and we collect a total of 10626 walking instances in the experiment. The shortest and the longest of the 12 paths are about 5.66 m and 8.49 m long, respectively. The time distribution of all subjects walking on different paths is illustrated in Fig. 9, where the minimum and the maximum single-trip walking times are 4.135 s and 10.626 s, respectively. For each subject, we randomly select 20% of the walking instances corresponding to a specific path and direction for validation and testing, and the remaining instances are used for training. Given the 1000 Hz sampling rate of the system, for simplicity, a 4000-point sliding window segmentation with a step size of 500 points is performed on each labeled CSI walking instance to obtain multiple slices used for training, validation and testing. Note that, thanks to the recurrent structure of the RNN, data of arbitrary length can be handled by our model. After profile splicing and reversion, we obtain 172088 integrated walking profiles of all 11 subjects for the training set. Moreover, we bisect the slices reserved for validation and testing to generate our validation and testing sets, each of which has 14858 profiles of all subjects. We randomly select the data of 8 subjects to train and evaluate our model.
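The sliding-window segmentation can be illustrated with a short sketch. The function name `sliding_windows` is hypothetical, and dropping the trailing samples that do not fill a complete window is an assumption, since the handling of the remainder is not described.

```python
def sliding_windows(n_samples, window=4000, step=500):
    """Return (start, end) index pairs of all complete windows in an instance."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]

# At 1000 Hz, the shortest (4.135 s) and longest (10.626 s) single-trip walks
# contain 4135 and 10626 samples, respectively.
shortest = sliding_windows(4135)     # one slice: (0, 4000)
longest = sliding_windows(10626)     # 14 slices, the last one is (6500, 10500)
```

Under this assumption every labeled instance yields at least one 4-second slice, and longer walks contribute proportionally more slices.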

V-B2 Model Training

The training of our attention-based RNN encoder-decoder model is performed in PyTorch using one GTX 1080Ti GPU. The 11 GB memory and the 11.3 TFLOPs (single precision) processing power of a GTX 1080Ti GPU enable us to train a complex and relatively deep RNN encoder-decoder model. As a reference, a PyTorch implementation of the attention-based RNN encoder-decoder for machine translation [15] is available on GitHub.

In our specific implementation, the RNN architectures of our encoder and decoder are both GRU, which is reported to have a simpler structure but better performance than LSTM [43], and the number of hidden layers is set to 3 in both the encoder and the decoder. The encoder and decoder each have 1024 hidden units, and the encoder has 256 input units while the decoder has no input layer. We use a minibatch stochastic gradient descent (SGD) algorithm to train the encoder and decoder, where each SGD update direction is computed using a minibatch of 64 instances. We decrease the learning rate from 1e-4 to 1e-6 as the training epochs progress, and the total number of training epochs is set to 32. The validation set is employed to check whether the error is within a reasonable range. To better prevent our model from overfitting, some effective techniques, such as dropout [44] and training with noise [45], are introduced in the system.
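The text states only the start and end learning rates and the number of epochs, not the shape of the decay. As one possible illustration, and purely as an assumption, a geometric (exponential) decay from 1e-4 to 1e-6 over 32 epochs could be generated as follows; the function name is hypothetical.

```python
def geometric_lr_schedule(lr_start=1e-4, lr_end=1e-6, epochs=32):
    """Per-epoch learning rates decaying geometrically from lr_start to lr_end.

    The geometric shape is an assumption; the paper does not specify the schedule.
    """
    ratio = (lr_end / lr_start) ** (1.0 / (epochs - 1))   # constant per-epoch factor
    return [lr_start * ratio ** e for e in range(epochs)]

lrs = geometric_lr_schedule()   # 32 values, starting at 1e-4 and ending near 1e-6
```

A step-wise schedule that lowers the rate at fixed epochs would fit the description equally well.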

V-B3 Evaluation Metrics

For evaluating the performance of the proposed system, two major metrics are employed, namely accuracy and F1 score. Accuracy is used for a primary evaluation of the system’s gait and direction recognition ability, since it only takes the true positives of each class into consideration, and the F1 score, which is the harmonic mean of precision and recall, is used to evaluate the comprehensive performance of the system. In addition, the confusion matrices are presented to illustrate the detailed recognition results.
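Both metrics can be derived from a confusion matrix, as the following sketch shows. The function name is hypothetical, the matrix is toy data rather than a result of the paper, and computing accuracy as the sum of the true positives divided by the number of instances is an assumption about the exact definition used.

```python
def scores_from_confusion(cm):
    """Overall accuracy and per-class F1 from a square confusion matrix.

    cm[i][j] counts instances of true class i that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total    # true positives only
    f1 = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                               # missed instances of class k
        fp = sum(cm[i][k] for i in range(n)) - tp          # wrongly assigned to class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1.append(2 * precision * recall / denom if denom else 0.0)
    return accuracy, f1

# Toy two-class example (not data from the experiment).
acc, f1 = scores_from_confusion([[8, 2], [1, 9]])   # acc = 0.85, f1 = [16/19, 18/21]
```

Averaging the per-class F1 values gives a single score per task, which is how an average F1 over subjects or directions can be reported.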

(a) Human gait recognition.
(b) Walking direction recognition.
Fig. 10: Confusion matrices of gait and direction recognition results.
(a) Human gait recognition.
(b) Walking direction recognition.
Fig. 11: The detailed recognition accuracies and F1 scores of human gait and walking direction.

V-C Experiment Results

The performance of the proposed system is evaluated on the testing dataset, and each instance in the test set has two labels, i.e., a subject label (from “S1” to “S8”) and a direction label (from “D0” to “D7”). The confusion matrices of the human gait and walking direction recognition results are shown in Fig. 10, where the diagonal entries represent the numbers of true positives of the different classes. Intuitively, we can see that most of the instances are correctly predicted by the system. Fig. 11, which is derived from the confusion matrices, illustrates the gait and direction recognition accuracies and F1 scores for specific classes. The proposed system achieves relatively high accuracies (all above 95%) and high F1 scores for all the classes in both tasks, especially for the direction recognition task. Concretely, the average gait recognition accuracy and F1 score of the 8 selected subjects are 97.68% and 89.69%, respectively, while the average accuracy and F1 score for direction recognition are 98.75% and 95.06%, respectively. These results demonstrate that adopting the attention-based RNN encoder-decoder architecture can achieve promising recognition results on the human gait and walking direction recognition tasks.

(a) Instance of “S8” and “D1”.
(b) Instance of “S4” and “D4”.
Fig. 12: Visualization of attention weights for 2 test instances.

V-D Attention Visualization

Given a certain test instance, the proposed system first encodes the instance into a particular attention vector, which is the weighted sum of all the hidden states of the system’s encoder. The system then decodes the attention vector and outputs the prediction for a specific task. The attention weights computed by equation (9) score how well the walking profile at each time step matches the current prediction. Fig. 12 visualizes the attention weights of 2 test instances in grayscale (where 0 is black and 1 is white). The weights are shown in the middle of each subfigure, where the upper row and the lower row correspond to direction and gait, respectively, and the top and bottom parts of each subfigure are the spectrograms of the highly-placed Rx (HighRx) and the lowly-placed Rx (LowRx), respectively. We observe that (i) for different recognition tasks, the computed weights are different, which means the proposed system can automatically focus its attention on different time steps of the spectrograms when coping with different tasks; (ii) the large (brighter) weights are basically aligned with some high-energy clips in the spectrograms, and these clips usually contain more critical and apparent features of human walking dynamics. Based on these observations, we can conclude that, even without partitioning the CSI data sequence into cycle-wise time slices, the proposed system can still learn to adaptively align with some important parts of the sequence and realize cycle-independent gait and walking direction recognition.

VI Limitations

Although we have demonstrated the feasibility and shown the promising results of jointly recognizing human walking gait and direction with the attention-based RNN encoder-decoder framework in Wi-Fi networks, there are still some limitations. We only evaluate the performance of our system on a group of 8 subjects, and some external factors, such as footwear, floor surface and room layout, and internal factors, such as the length of the walking profile, have not been taken into consideration. We plan to explore the impact of subject group size and other specific factors in our future work. Limited by the bandwidth and synchronization problems of Wi-Fi networks, if multiple people walk at the same time, the signals reflected off different people are mixed together at the receiver side. Like many existing WiFi-based systems, we have not yet found an effective method to separate the mixed Wi-Fi signals from multiple moving individuals, so the proposed system currently cannot be applied in scenarios where multiple individuals walk simultaneously.

VII Conclusions

By adopting the attention-based RNN encoder-decoder framework, we proposed a new cycle-independent human gait and walking direction recognition system which was jointly trained for walking gait and direction recognition. For capturing more human walking dynamics, two receivers and one transmitter were deployed in different spatial layouts. The CSI measurements from the two receivers were first gathered together and refined to form an integrated walking profile. Then, the RNN encoder read and encoded the walking profile into primary feature vectors, based on which the decoder computed different attention vectors for different recognition tasks. With the attention scheme, the proposed system could learn to adaptively align with different critical clips of the CSI data sequence for walking gait and direction recognition. We implemented our system on commodity Wi-Fi devices in an indoor environment, and the experimental results demonstrated that the proposed system could achieve average F1 scores of 89.69% for gait recognition from a group of 8 subjects and 95.06% for direction recognition from 8 directions. In addition, the average accuracies of these two recognition tasks both exceeded 97%. Our system is expected to enable more practical and interesting applications.


  • [1] N. V. Boulgouris, D. Hatzinakos, and K. N. Plataniotis, “Gait recognition: A challenging signal processing technology for biometric identification,” IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 78–90, Nov 2005.
  • [2] C. Han, K. Wu, Y. Wang, and L. M. Ni, “Wifall: Device-free fall detection by wireless networks,” in Proceedings of IEEE INFOCOM, 2014, pp. 271–279.
  • [3] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding and modeling of wifi signal based human activity recognition,” in Proceedings of ACM MobiCom, 2015, pp. 65–76.
  • [4] K. Ali, A. X. Liu, W. Wang, and M. Shahzad, “Keystroke recognition using wifi signals,” in Proceedings of ACM MobiCom, 2015, pp. 90–102.
  • [5] H. Li, W. Yang, J. Wang, Y. Xu, and L. Huang, “Wifinger: Talk to your smart devices with finger-grained gesture,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing.   ACM, 2016, pp. 250–261.
  • [6] Y. Ma, G. Zhou, S. Wang, H. Zhao, and W. Jung, “Signfi: Sign language recognition using wifi,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, p. 23, 2018.
  • [7] W. Wang, A. X. Liu, and M. Shahzad, “Gait recognition using wifi signals,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing.   ACM, 2016, pp. 363–373.
  • [8] Y. Zeng, P. H. Pathak, and P. Mohapatra, “Wiwho: Wifi-based person identification in smart spaces,” in Proceedings of the 15th International Conference on Information Processing in Sensor Networks.   IEEE Press, 2016, p. 4.
  • [9] L. Lee and W. E. L. Grimson, “Gait analysis for recognition and classification,” in Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, May 2002, pp. 155–162.
  • [10] T. H. Lam, K. Cheung, and J. N. Liu, “Gait flow image: A silhouette-based gait representation for human identification,” Pattern recognition, vol. 44, no. 4, pp. 973–987, 2011.
  • [11] R. J. Orr and G. D. Abowd, “The smart floor: A mechanism for natural user identification and tracking,” in CHI’00 Extended Abstracts on Human Factors in Computing Systems.   ACM, 2000, pp. 275–276.
  • [12] S. Sprager and D. Zazula, “A cumulant-based method for gait identification using accelerometer data with principal component analysis and support vector machine,” WSEAS Transactions on Signal Processing, vol. 5, no. 11, pp. 369–378, 2009.
  • [13] A. Primo, V. V. Phoha, R. Kumar, and A. Serwadda, “Context-aware active authentication using smartphone accelerometer measurements,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 98–105.
  • [14] Y. Chen, W. Dong, Y. Gao, X. Liu, and T. Gu, “Rapid: A multimodal and device-free approach using noise estimation for robust person identification,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, p. 41, 2017.
  • [15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [16] G. Johansson, “Visual perception of biological motion and a model for its analysis,” Perception & Psychophysics, vol. 14, no. 2, pp. 201–211, Jun 1973.
  • [17] J. E. Cutting, D. R. Proffitt, and L. T. Kozlowski, “A biomechanical invariant for gait perception,” Journal of Experimental Psychology: Human Perception and Performance, vol. 4, no. 3, p. 357, 1978.
  • [18] R. Polana and R. Nelson, “Detecting activities,” Journal of Visual Communication and Image Representation, vol. 5, no. 2, pp. 172–180, 1994.
  • [19] J. Little and J. Boyd, “Recognizing people by their gait: The shape of motion,” Videre: Journal of computer vision research, vol. 1, no. 2, pp. 1–32, 1998.
  • [20] M. S. Nixon, J. N. Carter, J. M. Nash, P. S. Huang, D. Cunado, and S. V. Stevenage, “Automatic gait recognition,” pp. 3–3(1), January 1999.
  • [21] J. Han and B. Bhanu, “Individual recognition using gait energy image,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 28, no. 02, pp. 316–322, 2006.
  • [22] S. Sen, B. Radunovic, R. R. Choudhury, and T. Minka, “You are facing the mona lisa: Spot localization using phy layer information,” in Proceedings of the 10th international conference on Mobile systems, applications, and services.   ACM, 2012, pp. 183–196.
  • [23] J. Zhang, B. Wei, W. Hu, and S. S. Kanhere, “Wifi-id: Human identification using wifi signal,” in 2016 International Conference on Distributed Computing in Sensor Systems (DCOSS).   IEEE, May 2016, pp. 75–82.
  • [24] Y. Xu, W. Yang, J. Wang, X. Zhou, H. Li, and L. Huang, “Wistep: Device-free step counting with wifi signals,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, p. 172, 2018.
  • [25] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
  • [26] D. Halperin, W. Hu, A. Sheth, and D. Wetherall, “Tool release: Gathering 802.11n traces with channel state information,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 1, pp. 53–53, 2011.
  • [27] T. S. Rappaport, Wireless Communications: Principles and Practice.   Prentice Hall PTR New Jersey, 1996, vol. 2.
  • [28] Z. Yang, Z. Zhou, and Y. Liu, “From rssi to csi: Indoor localization via channel response,” ACM Computing Surveys (CSUR), vol. 46, no. 2, p. 25, 2013.
  • [29] K. Wu, J. Xiao, Y. Yi, M. Gao, and L. M. Ni, “Fila: Fine-grained indoor localization,” in INFOCOM, 2012 Proceedings IEEE.   IEEE, 2012, pp. 2210–2218.
  • [30] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [31] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [33] Y. Xie, Z. Li, and M. Li, “Precise power delay profiling with commodity wifi,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking.   ACM, 2015, pp. 53–64.
  • [34] Y. Jin, W.-S. Soh, and W.-C. Wong, “Indoor localization with channel impulse response based fingerprint and nonparametric regression,” IEEE Transactions on Wireless Communications, vol. 9, no. 3, 2010.
  • [35] S. Furui, “Speaker-independent isolated word recognition based on emphasized spectral dynamics,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86., vol. 11.   IEEE, 1986, pp. 1991–1994.
  • [36] F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of mfcc,” Journal of Computer science and Technology, vol. 16, no. 6, pp. 582–589, 2001.
  • [37] T. Mikolov, “Statistical language models based on neural networks,” Presentation at Google, Mountain View, 2nd April, 2012.
  • [38] M. Shanker, M. Y. Hu, and M. S. Hung, “Effect of data standardization on neural network training,” Omega, vol. 24, no. 4, pp. 385–397, 1996.
  • [39] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
  • [40] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [41] A. Graves, “Supervised sequence labeling with recurrent neural networks,” ISBN 9783642212703, 2012.
  • [42] A. Brajdic and R. Harle, “Walk detection and step counting on unconstrained smartphones,” in Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing.   ACM, 2013, pp. 225–234.
  • [43] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in International Conference on Machine Learning, 2015, pp. 2342–2350.
  • [44] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on.   IEEE, 2014, pp. 285–290.
  • [45] G. An, “The effects of adding noise during backpropagation training on a generalization performance,” Neural computation, vol. 8, no. 3, pp. 643–674, 1996.