Temporal Unet: Sample Level Human Action Recognition using WiFi

04/19/2019 ∙ by Fei Wang, et al. ∙ Zhejiang University, Carnegie Mellon University

Human actions distort WiFi signals, and this distortion has been widely explored for action recognition, such as elderly fall detection, hand sign language recognition, and keystroke estimation. To the best of our knowledge, past work recognizes human actions by categorizing one complete distortion series into one action, which we term series-level action recognition. In this paper, we introduce a much more fine-grained and challenging task into the WiFi sensing domain, i.e., sample-level action recognition. In this task, every WiFi distortion sample in the whole series should be categorized into one action, which is a critical technique for precise action localization, continuous action segmentation, and real-time action recognition. To achieve WiFi-based sample-level action recognition, we analyze approaches in image-based semantic segmentation as well as in video-based frame-level action recognition, then propose a simple yet efficient deep convolutional neural network, i.e., Temporal Unet. Experimental results show that Temporal Unet handles this novel task well. Code has been made publicly available at https://github.com/geekfeiw/WiSLAR.




1 Introduction

WiFi devices have been widely studied for human action recognition, such as elderly fall detection wang2017wifall ; wang2017rt , hand sign language recognition li2016wifinger ; wang2018csi , keystroke estimation ali2015keystroke ; li2016csi , etc. Compared with cameras, WiFi devices as ubiquitous sensors for human action recognition are more resilient to illumination and occlusion, and raise fewer privacy concerns. To the best of our knowledge, past work performs action recognition by categorizing one complete WiFi distortion series into one action. To do so, past work usually (1) applies action detection algorithms to detect the start point and the end point of actions wang2017rt ; ali2015keystroke , (2) segments the WiFi distortion series by the estimated start/end points, and (3) recognizes the action from the segmented series with series-matching algorithms such as dynamic time warping berndt1994using . From this perspective, past work can be called series-level action recognition, which assigns the whole WiFi distortion series one single action label.

In this paper, we introduce a much more fine-grained action recognition task into the WiFi sensing domain, i.e., sample-level action recognition, to enlarge the sensing ability of WiFi sensors. In this task, every sample in the whole series should be categorized into one action class, which is critical for precise temporal action localization, continuous action segmentation, and real-time action recognition. More precisely, sample-level action recognition roughly falls into two tasks: (1) sample-level action detection, and (2) sample-level action classification. As shown in Fig. 1 (1st, 2nd), when a user performs actions, the WiFi sample series is distorted. Sample-level action detection estimates whether the user is performing an action at every sampling moment, shown in Fig. 1 (3rd). Sample-level action classification further categorizes the action for every WiFi sample, shown in Fig. 1 (4th).

However, there are two severe challenges in achieving WiFi-based sample-level action recognition. (1) Since it is hard to identify an action from a single sample, to categorize one sample we must learn relations from its neighbouring samples; yet it is challenging to figure out how many neighbours should be taken into account. (2) The proposed approach should ideally be unified and applicable to both of the aforesaid tasks, i.e., sample-level action detection and action classification. Nonetheless, these two tasks may require distinct features, which increases the difficulty of designing such an approach.
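To make the first challenge concrete, the number of neighbouring samples a stack of temporal layers can "see" follows standard receptive-field arithmetic. The sketch below is illustrative only; the layer configuration is a hypothetical U-Net-style 'down' path, not the exact architecture of this paper.

```python
def receptive_field(layers):
    """Receptive field (in input samples) of a stack of temporal layers.

    Each layer is (kernel_size, stride); the standard recursion grows the
    field by (k - 1) * jump, where jump is the cumulative stride so far.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical 'down' path: [conv(k=3,s=1), conv(k=3,s=1), pool(k=2,s=2)] x 3
down_path = [(3, 1), (3, 1), (2, 2)] * 3
print(receptive_field(down_path))  # -> 36: each repeat widens the time of view
```

Each additional down block multiplies the stride, so the covered neighbourhood grows rapidly with depth, which is exactly what the stacked temporal layers exploit.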

To solve these challenges, we propose the novel Temporal Unet, which equips Unet ronneberger2015u with the ability to learn action features from the time-serial WiFi distortion induced by human actions. Precisely, we apply temporal convolutional layers, temporal max pooling layers, and temporal deconvolutional layers along the time axis of the sampled CSI series to learn features from the low-level variance and the high-level profile. Besides, the stacked temporal layers cover the neighbours of a target sample at various times of view. Further, in Temporal Unet, shortcut links between low layers and high layers combine features at various levels to boost action recognition. With these advancements in deep learning design, experimental results show that Temporal Unet achieves good results in WiFi-based sample-level action recognition. The contributions of this paper can be summarized as follows.

(1) We introduce the sample-level action recognition task into the WiFi sensing domain to enlarge WiFi abilities in human sensing. In addition, we fully discuss potential techniques and insights that can support this task.

(2) We propose the novel Temporal Unet for this task, which proves simple yet efficient in both WiFi-based sample-level action detection and WiFi-based sample-level action classification.

Figure 1: WiFi-based action recognition. When one conducts actions (1st), the channel state information (CSI) of WiFi varies (2nd). Sample-level action detection identifies whether an action is conducted for every sample of the series (3rd), and sample-level action classification further identifies which action is conducted (4th).

2 Related Work

2.1 WiFi-based Action Recognition

WiFi devices and signals have been explored for human action recognition, such as daily activity recognition pu2013whole ; wang2014eyes ; xi2014electronic ; fang2016bodyscan ; abdelnasser2015wigest , health-care usages wang2017wifall ; wang2017rt ; palipana2018falldefi ; wang2018csi , and hand gesture recognition in privacy, security, and human-computer interaction applications li2016wifinger ; wang2017csi ; ali2015keystroke ; wang2019joint . Past WiFi-based action recognition work follows three main technical roadmaps: (1) utilizing statistical values such as the average, variance, and entropy as features of the WiFi time series to train action classifiers, then applying the trained classifiers to categorize an action by inputting the whole WiFi series that corresponds to one action wang2017wifall ; wang2017rt ; wang2018continuous ; zeng2016wiwho ; pu2013whole ; (2) using Dynamic Time Warping berndt1994using to measure distances between the test WiFi series and all training series, then applying the K Nearest Neighbors algorithm to predict one action for the test series xi2014electronic ; fang2016bodyscan ; abdelnasser2015wigest ; li2016wifinger ; wang2017csi ; ali2015keystroke ; and, to date, (3) designing deep learning approaches such as the deep Boltzmann Machine salakhutdinov2009deep and Convolutional Neural Networks lecun1998gradient ; krizhevsky2012imagenet as WiFi-based action classifiers wang2017csi ; wang2017cifi ; ma2018signfi ; zhou2018signal ; zhang2018enhancement ; wang2019joint .

Since past work predicts one action label for a whole WiFi distortion series, we call it series-level action recognition. In this paper, we perform action recognition on every sampled WiFi data point, which is much more fine-grained and challenging. To the best of our knowledge, this is the first work on sample-level WiFi-based action recognition.

2.2 WiFi-based Action Detection

Our approach can also be used to detect human action, shown in Fig. 1, where ‘1’ and ‘0’ represent ‘doing an action’ and ‘no action’, respectively. To this end, our work is related to previous WiFi-based action detection work that detects the start time and the end time of an action. In the above studies, action detection usually serves as one part of the pipeline that segments WiFi series for further series-level action recognition ali2015keystroke ; wang2017rt ; wang2017wifall ; wenyuan2018lens ; lin2019concurrent . Past work usually applies a threshold-based sliding window over the amplitude, variance, entropy, etc., to detect the start time and end time of an action. For example, Wang et al. wang2017rt first compute the mean and the normalized standard deviation of the non-action state, then use them as thresholds to distinguish the non-action and doing-action states for further (fine-grained) start-time and end-time detection. Similarly, in wang2017wifall , the local density breunig2000lof of the WiFi series in the non-action state is used as a threshold to detect the human falling state. In wenyuan2018lens ; lin2019concurrent , the first-order difference of the WiFi series is computed and applied for starting-point detection. And in ali2015keystroke , the mean absolute deviation of the WiFi series is utilized as a threshold for action detection.
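As an illustration of this family of methods, the following toy sketch flags 'doing-action' samples when a sliding window deviates from calibration statistics by more than a threshold; the multiplier alpha and window length win are hypothetical hyper-parameters, not values from the cited papers.

```python
def threshold_detect(series, calib, alpha=3.0, win=5):
    """Toy threshold-based detector in the spirit of prior work.

    Estimate mean/std from a calibration (non-action) segment, then flag
    each position as 'doing-action' (1) when the mean absolute deviation
    of a trailing window exceeds alpha standard deviations.
    alpha and win are hypothetical hyper-parameters.
    """
    mu = sum(calib) / len(calib)
    var = sum((v - mu) ** 2 for v in calib) / len(calib)
    std = var ** 0.5 or 1e-9  # guard against a perfectly flat calibration
    flags = []
    for i in range(len(series)):
        w = series[max(0, i - win + 1): i + 1]
        dev = sum(abs(v - mu) for v in w) / len(w)
        flags.append(1 if dev > alpha * std else 0)
    return flags
```

Note how the result hinges on alpha and win, which is precisely the hyper-parameter sensitivity criticized below.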

There are two major shortcomings in the above approaches. First, the thresholds require hand-tuned hyper-parameters, as in wang2017rt , which decreases the effectiveness and generalization ability of these methods. Second, the above approaches cannot segment a series of continuous actions that have little transition state between two consecutive actions. Our approach requires no hyper-parameters and no thresholds; it learns the boundaries of actions directly from the WiFi distortion series, thus overcoming both shortcomings.

2.3 Dense Task in Computer Vision

Unlike generating one action label for the whole WiFi series (sparse labeling), in our proposed sample-level WiFi sensing task every sampled WiFi distortion should be categorized into one action. We name it the dense WiFi sensing task, which is related to dense tasks in the computer vision domain such as image-based semantic segmentation long2015fully ; ronneberger2015u ; badrinarayanan2017segnet ; chen2018deeplab and video-based frame-level action recognition tran2015learning ; hara2018can ; shou2017cdc ; yang2018exploring . In image-based semantic segmentation, every pixel of the image is labeled as one thing/stuff class such as the sky, giraffe, person, etc., as in Fig. 2. That is, the output has the same size, i.e., height and width, as the input image. Similarly, in WiFi-based sample-level action recognition, the output should have the same size as the input WiFi series. This similarity inspires us to explore the feasibility of applying deep frameworks that meet the same-size requirement, such as FCN long2015fully , U-Net ronneberger2015u , and FPN lin2017feature , in our proposed task, though careful framework design targeting WiFi time series is still required. Besides, our task is related to video-based frame-level action recognition, which generates an action label for every frame, as shown in Fig. 3. One main purpose of this category of work is to learn action representations from multiple continuous frames tran2015learning ; hara2018can ; shou2017cdc ; yang2018exploring , as well as relations between objects, including persons, within one frame gkioxari2018detecting . This suggests that learning action representations in the temporal dimension would facilitate our sample-level action recognition task.

In Section 3, we describe in more detail how previous dense tasks inspire our approach.

3 Background and Analysis

For simplicity, in this section we refer to the WiFi-based sample-level action recognition task as WiSLAR. We next analyze the relations between WiSLAR and image-based semantic segmentation as well as video-based action recognition, present insights from the analysis, and propose our approach.

Figure 2: Image-based semantic segmentation samples from the COCO-Stuff dataset caesar2018coco . Every pixel in the image is labeled as one thing.
Figure 3: One video-based frame-level action recognition example from the Kinetics dataset kay2017kinetics . Each frame of the video is labeled as one action or the background (the background means no action-of-interest in this frame).
Figure 4: Comparison between the spatial convolutional operation (Left), and the spatio-temporal convolutional operation (Right).

3.1 Insights from Image-based Semantic Segmentation

Analysis. As the COCO-Stuff dataset caesar2018coco examples in Fig. 2 show, the task of image-based semantic segmentation is to generate an object or thing label, such as giraffe, person, boat, sea, and sand, for every pixel of an image. Intuitively, as shown in Fig. 4 (left), it makes little sense to label a blue pixel (BP) as sky without taking the neighboring pixels into account; but when the BP is surrounded by a batch of blue pixels, or even white pixels labeled as cloud, it becomes much more reasonable to regard the BP as part of the sky. This example illustrates that pursuing a larger field of view (FOV) is critical for semantic segmentation. In computer vision, spatial convolutional kernels generally sweep along the height and width of an image to generate semantic features, as shown in Fig. 4. To gain larger FOVs, global pooling layers are employed for a whole-image view, which advances high-level understanding of images long2015fully ; zhao2017pyramid ; chen2018deeplab . Besides, dilated convolutions yu2015multi enable deep networks to gain multi-scale FOVs and are widely used in semantic segmentation lin2017feature ; chen2018deeplab ; hamaguchi2018effective ; lin2017refinenet . In addition, frameworks that combine low-level features (small FOV) and high-level features (large FOV) of images are also popular in semantic segmentation ronneberger2015u ; lin2017feature ; chen2018encoder .

Insight. Similar to image-based semantic segmentation, in WiSLAR an action usually lasts for a while, so we should give our deep network large and varied FOVs over the WiFi distortion for both local and global understanding of the distortion.

3.2 WiSLAR and Video-based Action Recognition

Analysis. As the Kinetics kay2017kinetics example in Fig. 3 shows (one person is awarded an Oscar and the audience applauds), video-based action recognition generates one action label or background (no action-of-interest) for every frame of a video. As a useful practice, stacking multiple continuous frames to form a tensor with multiple channels can promote the performance of action recognition simonyan2014two ; karpathy2014large ; lan2015beyond . In the early years, this tensor was treated as one frame with thick channels and swept by 2D spatial convolutional kernels resembling Fig. 4 (left); the features learned across all stacked frames then served video-based action recognition. Recently, 3D convolutional kernels sweeping along the height, width, and channel of the stacked tensor have been widely applied to improve the performance of action recognition ji20133d ; hara2018can ; carreira2017quo ; kay2017kinetics , as illustrated in Fig. 4 (right). Compared to 2D spatial convolutional kernels, 3D convolutional kernels can learn extra relations between neighboring frames at multiple scales, which we call the time of view (TOV). Beyond 3D convolutional kernels, Convolutional-De-Convolutional networks shou2017cdc and Temporal Preservation Convolution yang2018exploring have also been proposed to enhance understanding in TOV for video-based action recognition.

Insight. In WiSLAR, rather than applying convolutional layers over all samples at once, like 2D spatial convolutional kernels over all stacked frames in video-based action recognition, we should apply temporal convolutional kernels to the WiFi distortion for multi-scale TOVs, which have proven useful for action recognition.
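A minimal sketch of the temporal convolution this insight calls for: a 1-D kernel sweeping along the time axis only, so each output sample aggregates a local time of view. This is a pure-Python illustration, not the paper's implementation.

```python
def temporal_conv1d(x, kernel, stride=1, padding=0):
    """1-D temporal convolution (cross-correlation) over a sample series.

    The kernel sweeps along the time axis only; each output sample
    summarizes len(kernel) neighbouring input samples (its time of view).
    """
    x = [0.0] * padding + list(x) + [0.0] * padding  # zero-pad both ends
    k = len(kernel)
    out = []
    for t in range(0, len(x) - k + 1, stride):
        out.append(sum(x[t + j] * kernel[j] for j in range(k)))
    return out

# kernel size 3 with padding 1 and stride 1 preserves the series length
print(temporal_conv1d([1, 2, 3, 4], [1, 1, 1], padding=1))  # -> [3, 6, 9, 7]
```

Stacking such layers, with pooling between them, widens the time of view at each level, which is the multi-scale TOV the insight asks for.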

3.3 Deep Network Design Consideration

Based on the above two insights, we list three keywords, i.e., multi-scale, FOV, and TOV, to highlight the guidelines for network design. More precisely, for WiSLAR the network (1) should take features in small FOVs and large FOVs into account for action recognition, and (2) should apply convolutions that sweep in the temporal dimension for multi-scale TOVs. To achieve WiSLAR, we endow Unet ronneberger2015u (which meets the first requirement) with the ability to learn temporal features, which meets the second requirement. The network is extremely simple yet efficient and targets WiFi-based sample-level action recognition. We call it Temporal Unet and describe it in detail in the following section.

4 Temporal Unet

Figure 5: Temporal Unet framework.

4.1 Overview

Channel state information (CSI) halperin2011tool is used as the WiFi distortion for sample-level action recognition; it comprises the information of all Orthogonal Frequency Division Multiplexing nee2000ofdm carriers between the WiFi sender and the WiFi receiver. We denote one CSI distortion series as X = (x_1, x_2, ..., x_T), where x_i is the i-th CSI sample and T is the length of the sampled CSI distortion series. Meanwhile, we denote the corresponding action label series as Y = (y_1, y_2, ..., y_T), where y_i is the action label of x_i. As shown in Fig. 5, Temporal Unet (T-Unet) learns a mapping function F from the CSI sample series to the action label series, i.e., Y = F(X).

4.2 Details

T-Unet contains three main components, i.e., the ‘down’ path, the ‘up’ path, and the ‘shortcut links’. The ‘down’ path reduces the size of the input CSI series with convolutional layers and max pooling layers that work along the time axis. As the layers go deeper, the ‘down’ path learns action features at larger and larger times of view, which has proven efficient for action recognition as described in Section 3.2. The ‘up’ path increases the feature map size to meet the requirement that the input series and the output label series have the same length. Besides, with more convolutional layers and deconvolutional layers, T-Unet gains deeper features for WiFi-based sample-level action recognition. Last but not least, the ‘shortcut links’ (grey arrows in Fig. 5) enrich the feature maps of deeper layers with information taken directly from lower layers, endowing T-Unet with multi-scale fields of view over the CSI series.

Figure 6: Block examples of T-Unet, i.e., ‘down-96-48’ (left) and ‘up-48-96’ (right).

Next we go into more detail on the main blocks by specifying the network parameter settings. As shown in Fig. 5, every two convolutional layers plus one max pooling layer constitute one ‘down’ block, one of which is shown in Fig. 6 (left). We name it ‘down-96-48’, where 96 and 48 are the input and output sizes along the time axis, respectively. In ‘down-96-48’, an input of temporal length 96 is converted to feature maps of the same length by two consecutive convolutional layers (kernel size=3, stride=1, padding=1). Then a max pooling layer (kernel size=2, stride=2, padding=0) halves the feature maps to length 48. The output of ‘down-96-48’ is the input of ‘down-48-24’ for further operations. In addition, an ‘up-48-96’ block is shown in Fig. 6 (right), in which the input (of temporal length 48) is up-sampled by a deconvolutional layer (kernel size=2, stride=2, padding=0) to length 96. Meanwhile, the ‘shortcut link’ concatenates the up-sampled feature maps with those from the low layer along the channel axis, doubling the channels. The doubled feature maps are then fed to two convolutional layers to produce the output, and the output of ‘up-48-96’ is the input of the next ‘up’ block. With these three components, T-Unet achieves WiFi-based sample-level action recognition.
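The 96 -> 48 -> 96 sizes can be checked with the standard length formulas for convolution/pooling and transposed convolution. The kernel sizes below (3 for the convolutions, 2 for the pooling and deconvolution) are the values consistent with the stated strides, paddings, and block sizes, since the exact numbers were lost in extraction.

```python
def conv_len(L, k, s, p):
    """Output temporal length of a convolution or pooling layer."""
    return (L + 2 * p - k) // s + 1

def deconv_len(L, k, s, p):
    """Output temporal length of a transposed (de)convolution layer."""
    return (L - 1) * s - 2 * p + k

# 'down-96-48': two convs (k=3, s=1, p=1) preserve length, one pool halves it
L = 96
L = conv_len(L, 3, 1, 1)        # stays 96
L = conv_len(L, 3, 1, 1)        # stays 96
L = conv_len(L, 2, 2, 0)        # halved to 48
print(L)

# 'up-48-96': one deconv (k=2, s=2, p=0) doubles the length back
print(deconv_len(48, 2, 2, 0))  # 96
```

The same arithmetic applies to every block, so the final output length matches the input length T, as the dense labeling task requires.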

4.3 Loss Function

We apply the Cross-Entropy loss function krizhevsky2012imagenet to optimize T-Unet.

Sample-level action detection. As shown in Fig. 1, sample-level action detection is a binary classification task: the label of every sample predicts whether one performs an action-of-interest at the moment the sample is recorded, taking values in {0, 1}. We compute the Cross-Entropy loss between the action label annotations and the predictions.

Sample-level action classification. Also shown in Fig. 1, sample-level action classification is a multi-class classification task: the label of every sample predicts which action-of-interest one performs at the moment the sample is recorded, taking values in {0, 1, ..., K}, where K is the number of action-of-interest categories and 0 denotes the non-action state. As above, we compute the Cross-Entropy loss between the action label annotations and the predictions.
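A minimal sketch of the per-sample Cross-Entropy loss (with the usual log-sum-exp stabilization); this is the standard formula applied densely, not the authors' exact training code.

```python
import math

def sample_level_ce(logits, labels):
    """Mean cross-entropy over every sample of a series.

    logits: one score vector per sample (one entry per action class);
    labels: the annotated class index of each sample. The loss sums one
    term per WiFi sample, i.e., densely, rather than one per series.
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[y]
    return total / len(labels)

# Uniform scores over 2 classes give the chance-level loss log(2)
print(sample_level_ce([[0.0, 0.0]], [0]))  # -> 0.693...
```

Detection is simply the two-class case (labels in {0, 1}); classification uses K+1 classes including the non-action class 0.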

4.4 Implementation

We implement T-Unet in PyTorch 1.0.0 with a batch size of 128 and an initial learning rate of 0.005. We train T-Unet with the Adam optimizer kingma2014adam for 200 epochs, decaying the learning rate by a factor of 0.5 every 10 epochs. T-Unet is trained on a desktop with one Titan XP GPU, and the training dataset is shuffled before each epoch.
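The stated schedule (initial rate 0.005, decayed by a factor of 0.5 every 10 epochs) amounts to a step decay; a small sketch:

```python
def lr_at_epoch(epoch, base_lr=0.005, decay=0.5, step=10):
    """Step-decay schedule: multiply the learning rate by `decay`
    every `step` epochs, starting from `base_lr`."""
    return base_lr * decay ** (epoch // step)

print(lr_at_epoch(0), lr_at_epoch(10), lr_at_epoch(20))
```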

5 Experiments

5.1 Dataset

We use the CSI distortion dataset released in wang2019joint . The dataset contains CSI series that correspond to six gesture actions, i.e., ‘hand up’, ‘hand down’, ‘hand left’, ‘hand right’, ‘hand circle’, and ‘hand cross’. The action start point and end point of every CSI series are manually annotated. The training set comprises 1116 WiFi series, with the 6 categories of gestures conducted by one volunteer at 16 indoor locations. The test set contains 278 series. The size of one CSI series is 192 x 52, where 192 is the number of samples in one series and 52 is the number of OFDM data carriers.

5.2 Metrics

We use the prediction accuracy over all samples as one metric to evaluate T-Unet on this dataset. Meanwhile, we propose the action recognition average precision (AP) as another metric for more fine-grained evaluation:

AP@delta = (1/N) * sum_{i=1}^{N} 1(acc_i >= delta),

where i is the index of a test series; N is the size of the test set (in this test set, N=278); acc_i is the sample-level action recognition accuracy on the i-th CSI series; and 1(.) is an indicator that outputs 1 if its input is true and 0 otherwise. The higher AP@delta, the better T-Unet performs on the dataset.
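The AP metric can be sketched in code: AP@delta is the fraction of test series whose sample-level accuracy reaches the success threshold delta, and the mean AP averages over delta in {0.5, ..., 0.9}. The per-series accuracies below are hypothetical.

```python
def ap_at(accs, delta):
    """AP@delta: fraction of test series whose sample-level accuracy
    is at least the success threshold delta."""
    return sum(1 for a in accs if a >= delta) / len(accs)

def mean_ap(accs, deltas=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Mean AP: average of AP@delta over the listed thresholds."""
    return sum(ap_at(accs, d) for d in deltas) / len(deltas)

# Hypothetical per-series accuracies of a 3-series test set
accs = [0.95, 0.85, 0.55]
print(ap_at(accs, 0.9), mean_ap(accs))
```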

5.3 WiFi-based Sample-level Action Detection

The overall accuracy of action detection is 95.09%, and the confusion matrix (Fig. 7, left) gives more details about the results. According to the confusion matrix, T-Unet distinguishes the ‘non-action’ and ‘doing-action’ states well. In addition, the AP curves in Fig. 8 show the performance in a more fine-grained view. Taking delta=0.9 as an example: if we count a series as a success when its sample-level accuracy is greater than or equal to 0.9, then the success rate of T-Unet on the test set is 0.92, indicating that 92% of the CSI series are detected successfully. Thus, as delta increases, the success rate AP@delta decreases. As shown in Fig. 8, AP@delta for sample-level action detection stays high up to delta=0.9, which means T-Unet works well even when we take 0.9 as the success threshold. Further, we average AP@0.5, AP@0.6, AP@0.7, AP@0.8, and AP@0.9 to obtain the mean AP, i.e., 0.98, as one comprehensive metric to assess the performance of T-Unet, which is quite high.

Figure 7: Confusion matrix of T-Unet on wang2019joint .
Figure 8: AP curves of T-Unet wang2019joint .
                        AP@0.5  AP@0.6  AP@0.7  AP@0.8  AP@0.9  mean AP
Action Detection          1       1       1       1      0.92    0.98
Action Classification    0.99    0.94    0.87    0.81    0.69    0.86
Table 1: APs and mean APs of T-Unet on wang2019joint .

5.4 WiFi-based Sample-level Action Classification

Similarly, the confusion matrix for action classification is also shown in Fig. 7. The major error is wrongly predicting 20% of the ‘hand right’ CSI series as ‘hand left’. Apart from the misclassification between ‘hand right’ and ‘hand left’, classifying other actions as the ‘non-action’ state contributes a large part of the misclassification. Looking deeper into these wrongly predicted samples, we find that this type of error occurs during the transition between the ‘non-action’ and ‘doing-action’ states. In all, T-Unet achieves an average accuracy of 88.60% for identifying the different categories of actions. In addition, the AP curve plotted in Fig. 8 decreases gradually as delta grows and declines sharply towards delta=0.9, which indicates that T-Unet performs satisfactorily when the success threshold is not set too severely. The APs and the mean AP are listed in Table 1. Comparing these quantitative results with those of WiFi-based action detection, we find that WiFi-based action classification is more challenging.

5.5 Result Visualization

We present a result example of WiFi-based sample-level action detection and WiFi-based sample-level action classification on ‘hand cross’ in Fig. 9. The first sub-figure illustrates the time-serial CSI distortion of 3 OFDM carriers (the 8th, 27th, and 40th carriers), where the blue duration and the red duration are manually labeled as ‘non-action’ and ‘doing-action’, respectively. The blue and red curves in the middle sub-figure represent the sample-level action detection confidence for the ‘non-action’ and ‘doing-action’ states, respectively. T-Unet classifies each sample into the state with the highest confidence. From this, we can infer that the person first stays in the ‘non-action’ state until around the 70th sample (blue higher than red), then performs one action for around 100 samples, completes the action around the 170th sample, and stays in the ‘non-action’ state until the end. Moreover, the last sub-figure illustrates the confidence curves of action classification for all actions, from which we can infer not only the moments of action start and end, but also the specific action at every sample. Five more examples, on ‘hand up’, ‘hand down’, ‘hand left’, ‘hand right’, and ‘hand circle’, are shown in the Appendix.
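Reading the confidence curves this way amounts to a per-sample argmax followed by run-length grouping into segments; a small sketch with hypothetical confidence values:

```python
def decode_segments(confidences):
    """Per-sample argmax over class confidences, then group equal
    consecutive predictions into (class, start, end) segments."""
    preds = [max(range(len(c)), key=c.__getitem__) for c in confidences]
    segments, start = [], 0
    for t in range(1, len(preds) + 1):
        # close the current run at the series end or at a class change
        if t == len(preds) or preds[t] != preds[start]:
            segments.append((preds[start], start, t - 1))
            start = t
    return segments

# Hypothetical 2-class (non-action=0 / doing-action=1) confidence series
conf = [[0.9, 0.1]] * 3 + [[0.2, 0.8]] * 4 + [[0.7, 0.3]] * 2
print(decode_segments(conf))  # -> [(0, 0, 2), (1, 3, 6), (0, 7, 8)]
```

The segment boundaries directly give the inferred action start and end moments, with no thresholds involved.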

Figure 9: One ‘hand cross’ result example. The CSI series varies while one performs the ‘hand cross’ action (1st). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.

6 Conclusion

In this paper, we introduce sample-level action recognition to the WiFi sensing community. We fully analyze similar tasks in computer vision, then propose a simple but efficient network, i.e., Temporal Unet, which targets sample-level action recognition by taking as input the time-serial WiFi distortion induced by human actions. Experimental results demonstrate the effectiveness of Temporal Unet.


F.W. and YP.S. are supported by China Scholarship Council.


  • (1) Y. Wang, K. Wu, and L. M. Ni, “Wifall: Device-free fall detection by wireless networks,” Transactions on Mobile Computing (TMC), vol. 16, no. 2, pp. 581–594, 2017.
  • (2) H. Wang, D. Zhang, Y. Wang, J. Ma, Y. Wang, and S. Li, “Rt-fall: a real-time and contactless fall detection system with commodity wifi devices,” Transactions on Mobile Computing (TMC), vol. 16, no. 2, pp. 511–526, 2017.
  • (3) H. Li, W. Yang, J. Wang, Y. Xu, and L. Huang, “Wifinger: Talk to your smart devices with finger-grained gesture,” in Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp).    ACM, 2016, pp. 250–261.
  • (4) F. Wang, J. Han, S. Zhang, X. He, and D. Huang, “Csi-net: Unified human body characterization and action recognition,” arXiv preprint arXiv:1810.03064, 2018.
  • (5) K. Ali, A. X. Liu, W. Wang, and M. Shahzad, “Keystroke recognition using wifi signals,” in Proceedings of the Annual International Conference on Mobile Computing and Networking (MobiCom).    ACM, 2015, pp. 90–102.
  • (6) M. Li, Y. Meng, J. Liu, H. Zhu, X. Liang, Y. Liu, and N. Ruan, “When csi meets public wifi: Inferring your mobile phone password via wifi signals,” in Proceedings of the Conference on Computer and Communications Security (SIGSAC).    ACM, 2016, pp. 1068–1079.
  • (7) D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proceedings of Data Mining and Knowledge Discovery (KDD), Workshop, vol. 10, no. 16, 1994, pp. 359–370.
  • (8) O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.    Springer, 2015, pp. 234–241.
  • (9) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • (10) Q. Pu, S. Gupta, S. Gollakota, and S. Patel, “Whole-home gesture recognition using wireless signals,” in Proceedings of the 19th annual international conference on Mobile computing & networking.    ACM, 2013, pp. 27–38.
  • (11) Y. Wang, J. Liu, Y. Chen, M. Gruteser, J. Yang, and H. Liu, “E-eyes: Device-free location-oriented activity identification using fine-grained wifi signatures,” in Proceedings of the Annual International Conference on Mobile Computing and Networking (MobiCom).    ACM, 2014, pp. 617–628.
  • (12) W. Xi, J. Zhao, X.-Y. Li, K. Zhao, S. Tang, X. Liu, and Z. Jiang, “Electronic frog eye: Counting crowd using wifi,” in Proceedings of the International Conference on Computer Communications (INFOCOM).    IEEE, 2014, pp. 361–369.
  • (13) B. Fang, N. D. Lane, M. Zhang, A. Boran, and F. Kawsar, “Bodyscan: Enabling radio-based sensing on wearable devices for contactless activity and vital sign monitoring,” in Proceedings of the Annual International Conference on Mobile Systems, Applications and Services (MobiSys).    ACM, 2016, pp. 97–110.
  • (14) H. Abdelnasser, M. Youssef, and K. A. Harras, “Wigest: A ubiquitous wifi-based gesture recognition system,” in Proceedings of the Conference on Computer Communications (INFOCOM).    IEEE, 2015, pp. 1472–1480.
  • (15) S. Palipana, D. Rojas, P. Agrawal, and D. Pesch, “Falldefi: Ubiquitous fall detection using commodity wi-fi devices,” Proceedings of the Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 1, no. 4, p. 155, 2018.
  • (16) X. Wang, L. Gao, S. Mao, and S. Pandey, “Csi-based fingerprinting for indoor localization: A deep learning approach,” IEEE Transactions on Vehicular Technology, vol. 66, no. 1, pp. 763–776, 2017.
  • (17) F. Wang, J. Feng, Y. Zhao, X. Zhang, S. Zhang, and J. Han, “Joint activity recognition and indoor localization,” arXiv preprint arXiv:1904.04964, 2019.
  • (18) F. Wang, Z. Li, and J. Han, “Continuous user authentication by contactless wireless sensing,” arXiv preprint arXiv:1812.01503, 2018.
  • (19) Y. Zeng, P. H. Pathak, and P. Mohapatra, “Wiwho: Wifi-based person identification in smart spaces,” in Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN).    IEEE, 2016, p. 4.
  • (20) R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Artificial intelligence and statistics, 2009, pp. 448–455.
  • (21) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • (22) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
  • (23) X. Wang, X. Wang, and S. Mao, “Cifi: Deep convolutional neural networks for indoor localization with 5 ghz wi-fi,” in 2017 IEEE International Conference on Communications (ICC).    IEEE, 2017, pp. 1–6.
  • (24) Y. Ma, G. Zhou, S. Wang, H. Zhao, and W. Jung, “Signfi: Sign language recognition using wifi,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, p. 23, 2018.
  • (25) Q. Zhou, J. Xing, W. Chen, X. Zhang, and Q. Yang, “From signal to image: Enabling fine-grained gesture recognition with commercial wi-fi devices,” Sensors, vol. 18, no. 9, p. 3142, 2018.
  • (26) T. Zhang and M. Yi, “The enhancement of wifi fingerprint positioning using convolutional neural network,” DEStech Transactions on Computer Science and Engineering, no. CCNT, 2018.
  • (27) L. Wenyuan, W. Siyang, W. Lin, L. Binbin, S. Xing, and J. Nan, “From lens to prism: Device-free modeling and recognition of multi-part activities,” IEEE Access, vol. 6, pp. 36 271–36 282, 2018.
  • (28) W. Lin, S. Xing, J. Nan, L. Wenyuan, and L. Binbin, “Concurrent recognition of cross-scale activities via sensorless sensing,” IEEE Sensors Journal, vol. 19, no. 2, pp. 658–669, 2019.
  • (29) M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in ACM SIGMOD Record, vol. 29, no. 2.    ACM, 2000, pp. 93–104.
  • (30) J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • (31) V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • (32) L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • (33) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of International Conference on Computer Vision (ICCV).    IEEE, 2015, pp. 4489–4497.
  • (34) K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
  • (35) Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, “Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5734–5743.
  • (36) K. Yang, P. Qiao, D. Li, S. Lv, and Y. Dou, “Exploring temporal preservation networks for precise temporal action localization,” in Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2018.
  • (37) T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
  • (38) G. Gkioxari, R. Girshick, P. Dollár, and K. He, “Detecting and recognizing human-object interactions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8359–8367.
  • (39) H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1209–1218.
  • (40) W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • (41) H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • (42) F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
  • (43) R. Hamaguchi, A. Fujita, K. Nemoto, T. Imaizumi, and S. Hikosaka, “Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).    IEEE, 2018, pp. 1442–1450.
  • (44) G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • (45) L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
  • (46) K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • (47) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • (48) Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, “Beyond gaussian pyramid: Multi-skip feature stacking for action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 204–212.
  • (49) S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
  • (50) J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
  • (51) D. Halperin, W. Hu, A. Sheth, and D. Wetherall, “Tool release: Gathering 802.11 n traces with channel state information,” SIGCOMM Computer Communication Review, vol. 41, no. 1, pp. 53–53, 2011.
  • (52) R. v. Nee and R. Prasad, OFDM for wireless multimedia communications.    Artech House, Inc., 2000.
  • (53) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.


Figure 10: One ‘hand up’ result example. The CSI series vary as one performs the ‘hand up’ action (1st sub-figure). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.
Figure 11: One ‘hand down’ result example. The CSI series vary as one performs the ‘hand down’ action (1st sub-figure). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.
Figure 12: One ‘hand left’ result example. The CSI series vary as one performs the ‘hand left’ action (1st sub-figure). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.
Figure 13: One ‘hand right’ result example. The CSI series vary as one performs the ‘hand right’ action (1st sub-figure). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.
Figure 14: One ‘hand circle’ result example. The CSI series vary as one performs the ‘hand circle’ action (1st sub-figure). The confidence curves of sample-level action detection are shown in the 2nd sub-figure, and the confidence curves of sample-level action classification are shown in the 3rd sub-figure.