WiFi devices, the key components of ubiquitous WiFi networks, have been widely explored in human sensing tasks such as indoor localization vasisht2016decimeter ; kotaru2015spotfi ; qian2018widar2 ; li2016dynamic and activity classification wang2015understanding ; virmani2017position ; kotaru2017position . Retracing the development of this work, a natural question arises: can WiFi devices work like cameras for fine-grained human sensing tasks such as person pose estimation? If the answer is yes, WiFi could be an alternative or a supplement to cameras in situations such as sensing through walls, under occlusion, and in the dark. Beyond these physical advantages over cameras, WiFi devices are prevalent, cost less to deploy, and raise fewer privacy concerns for the public.
Although estimating person pose with WiFi has high practical impact for the reasons above, it is full of challenges. First, WiFi is designed for wireless communication and carries no direct information on person keypoint coordinates. We cannot benefit from the most popular person pose estimation scheme in computer vision, which infers the person location from an image and then regresses the keypoint heatmaps fang2017rmpe ; chen2018cascaded . Thus, in order to learn the mapping from WiFi signals to person pose, pose supervision must be prepared, and it must correspond to the WiFi signals. To deal with this problem, we combine a camera with WiFi antennas to capture person videos. The camera and WiFi are synchronized by Unix time to guarantee the correspondence. The pose supervision is derived from the videos with AlphaPose fang2017rmpe , an accurate yet fast open-source person pose estimation repository.
Second, estimating person pose from WiFi signals is a pioneering task, so there is little prior work to refer to. Even in the computer vision community, it took decades to reach acceptable performance for image or video inputs. In general, deep networks that estimate pose from WiFi should differ substantially from those designed for images. After an extensive survey, we propose WiSPPN (an abbreviation of WiFi single person pose networks), which is a selective combination of CSI-Net wang2018csi , ResNet he2016deep and FCN long2015fully . To be exact, we utilize the up-sampling stage of CSI-Net to encode WiFi signals, and then use ResNet to extract features. Moreover, we propose a pose embedding approach inspired by the adjacency matrix in graph theory. This approach takes the length constraints between pose coordinates into account and allows pose estimation to be done with an FCN.
By solving these two challenges, we achieve single person pose estimation with WiFi. Evaluation over 80k images shows that our approach estimates single person pose well.
The contributions of this paper can be summarized as follows.
1. We raise the question of whether WiFi can be used like cameras for vision problems. We answer this question positively by demonstrating that WiFi signals can be used for single person pose estimation.
2. To answer this question, we built a multi-modality system, collected a dataset, and proposed a novel deep network to learn the mapping from WiFi signals to person keypoint coordinates.
2 Related Work
Camera. Estimating multi-person pose from RGB images is a widely studied problem wei2016cpm ; cao2017realtime ; newell2016stacked . The leading solutions of the COCO Challenge, such as AlphaPose fang2017rmpe and CPN chen2018cascaded , typically apply a person detector to crop every person from the image, then perform single person pose estimation on the cropped feature maps by regressing a heatmap for each body keypoint. The coordinates with the highest confidence are taken as the single person pose estimate.
Other sensors. Given the potential uses of pose estimation, researchers have applied many other sensors to estimate person body pose or sketch. Wall++ zhang2018wall++ turns a common wall into large electrodes by painting it with water-based nickel paint water_based . Wall++ can then sense airborne electromagnetic variance caused by the human body and estimate person pose. LiSense li2015human and StarLight li2016practical use ceiling LEDs and a photo-resistor blanket to capture the human body shadow on the blanket, then reconstruct the body sketch/pose. With improved technology, a person's hand can also be reconstructed by similar systems li2017reconstructing . RF-Capture adib2015capturing and RF-Pose zhao2018through use radars with frequency-modulated continuous wave (FMCW) equipment to estimate the person body sketch/pose. Even single-photon sensors can be used to reconstruct the person body shin2016photon ; o2017reconstructing . Compared to these sensors, devices with WiFi chips, such as routers, cell phones and the booming Internet-of-Things, may be the most pervasive.
3.1 WiFi Signals and Channel State Information
Under the IEEE 802.11 n/g/ac protocols, WiFi works around 2.4/5 GHz (central frequency) with multiple channels. In each channel, the bandwidth is 20/40/80/160 MHz. Within the band, carriers with different frequencies are modulated to carry information for wireless communication in parallel, which is called orthogonal frequency division multiplexing (OFDM) and is illustrated on the left of Fig. 2. During propagation, WiFi carriers decay in power and shift in phase. Moreover, their frequencies may also change when encountering a moving object due to the Doppler effect. Channel State Information (CSI), a physical layer indicator, can be used to represent these variations of the carriers.
Take 16-quadrature amplitude modulation (16-QAM) as an example (more modulation methods are listed at http://mcsindex.com/). As shown on the right of Fig. 2, one modulated carrier carries 4 bits of information at a time. When the sender sends '1111' to the receiver, the carrier is modulated to the corresponding constellation symbol, say $x$. If the receiver receives a symbol $y$, the variation that happened during propagation is $y / x$, which is called the CSI of this carrier. For human sensing applications, the human body, as an object, changes the carriers. In this paper, we aim to learn the mapping from this change to single person pose coordinates. We set WiFi to work within a 20 MHz band, where the CSI of 30 carriers can be obtained through an open-source tool halperin2011tool . In the rest of this paper, WiFi signals and CSI refer to the same thing unless stated otherwise.
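As a minimal sketch of the per-carrier CSI idea described above, the snippet below treats the sent and received symbols as complex numbers and takes their ratio. The constellation point chosen for '1111' and the channel effect (half amplitude, 90-degree phase shift) are illustrative assumptions, not values from this paper, and the function name is hypothetical.

```python
import cmath


def csi_of_carrier(sent: complex, received: complex) -> complex:
    """Ratio of received to sent symbol: the per-carrier channel response."""
    return received / sent


# Hypothetical 16-QAM constellation point for the bit pattern '1111'.
sent = complex(3, 3)
# Hypothetical channel: halves the amplitude and rotates the phase by 90 degrees.
received = sent * 0.5 * cmath.exp(1j * cmath.pi / 2)

h = csi_of_carrier(sent, received)
amplitude = abs(h)        # power decay of the carrier
phase = cmath.phase(h)    # phase shift of the carrier, in radians
```

The amplitude and phase of the ratio correspond to the power decay and phase shift mentioned in the text.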
AlphaPose is an open-source multi-person pose estimation repository (https://github.com/MVIG-SJTU/AlphaPose/tree/pytorch), which is also applicable to single person pose estimation. AlphaPose is a two-step framework: it first detects person bounding boxes with a person detector (YOLOv3 redmon2018yolov3 ) and then estimates the pose for each detected box with a pose regressor. With the regional multi-person pose estimation (RMPE) framework fang2017rmpe , AlphaPose is resilient to inaccurate person detection, which largely improves pose estimation performance. Please refer to fang2017rmpe for more details on AlphaPose and RMPE.
When applied to single person pose estimation, AlphaPose generates three-element predictions in the format of $(x_i, y_i, c_i)$, $i = 1, \dots, N$, where $N$ is the number of keypoints to be estimated, $x_i$ and $y_i$ are the coordinates of the $i$-th keypoint, and $c_i$ is the confidence of these coordinates. In this paper, we use the COCO person keypoint setting and $N$ is 18. Four estimation examples of AlphaPose are shown at the top of Figure 1.
4.1 System Build
To learn pose estimation from WiFi, we must have pose annotations. However, we cannot mark person pose coordinates in the WiFi signals, so we use a camera aligned with the WiFi antennas to capture person videos. The videos are then processed by AlphaPose fang2017rmpe to obtain pose annotations as coordinates and confidences. The camera and the WiFi antennas are synchronized by their recorded time-stamps. The WiFi CSI recording system comprises two ends, one 3-antenna sender and one 3-antenna receiver. The sender broadcasts WiFi signals, while the receiver parses CSI through halperin2011tool when receiving the broadcast WiFi. In our setting, the parsed CSI is a tensor of size $n \times 30 \times 3 \times 3$, where $n$ is the number of received WiFi packets, 30 is the number of subcarriers, and the last two 3s are the numbers of antennas at the sender and the receiver, respectively. The WiFi pose system is shown in Fig. 3. In our data acquisition, we set the sampling rates of the WiFi devices and the camera to 100 Hz and 20 Hz, respectively. Thus we obtain a paired dataset in which every 5 CSI samples and one image frame are synchronized by their time-stamps.
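The time-stamp pairing can be sketched as follows. This is only an illustration under stated assumptions: time-stamps are integer milliseconds, both lists are sorted, and each frame is paired with the 5 CSI samples that start at or after its time-stamp. The text does not specify the exact alignment rule, and the function name is hypothetical.

```python
from bisect import bisect_left


def pair_frames_with_csi(frame_times, csi_times, per_frame=5):
    """Pair each frame with `per_frame` consecutive CSI samples.

    Both inputs are ascending lists of time-stamps in milliseconds.
    Returns (frame_index, [csi_indices]) for every frame that has a
    full window of CSI samples available.
    """
    pairs = []
    for f_idx, t in enumerate(frame_times):
        start = bisect_left(csi_times, t)  # first CSI sample at or after the frame
        window = list(range(start, start + per_frame))
        if window[-1] < len(csi_times):
            pairs.append((f_idx, window))
    return pairs


# Toy clocks: camera at 20 Hz (every 50 ms), WiFi at 100 Hz (every 10 ms).
frame_times = [0, 50, 100]
csi_times = list(range(0, 150, 10))
pairs = pair_frames_with_csi(frame_times, csi_times)
```

With these rates, each frame receives exactly 5 consecutive CSI samples, matching the 100 Hz to 20 Hz ratio.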
4.2 Pose Adjacent Matrix
As described in Section 3.2, we have 18 person keypoint entries, $(x_i, y_i, c_i)$, $i = 1, \dots, 18$, for each video frame that contains a single person. Note that AlphaPose may predict multiple persons for a single-person frame (false positives); in this situation, we only keep the one with the highest confidence. Many works have demonstrated that directly regressing keypoint coordinates harms the generalization ability in person pose estimation. Thus, in this paper, we learn to regress a pose adjacent matrix (PAM) instead of directly regressing person keypoint coordinates. The PAM is a $3 \times 18 \times 18$ matrix that also annotates the pose coordinates and confidences of the 18 keypoints. The PAM is composed of 3 submatrices, $x'$, $y'$ and $c'$. The $x'$ and $c'$ are generated by Equ. (1) from the 18 three-element entries $(x_i, y_i, c_i)$. The $y'$ is generated in the same way as $x'$.
To be specific, from a graph theory view, we take the person skeleton as a directed complete graph (DCG) west1996introduction , with each keypoint as a node of the graph. For the $x$- and $y$-coordinate submatrices of the PAM, the diagonal items are the coordinate values of the 18 nodes on the $x$ and $y$ axes, respectively, while the off-diagonal elements are the displacements between two adjacent nodes, hence the name pose adjacent matrix. For the confidence submatrix of the PAM, the diagonal items are the confidence values of the corresponding nodes. We assume that the estimates of two nodes are independent, so we compute the product of the two nodes' confidences for the off-diagonal elements. In this way, we embed the person keypoint coordinates as well as the displacements between keypoints into the PAM.
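The construction described above can be sketched in a few lines. This is a sketch under assumptions rather than the authors' implementation: the off-diagonal displacement is taken as entry $i$ minus entry $j$ (a sign convention chosen so that, in image coordinates where $y$ grows downward, nose-to-neck comes out negative), and the off-diagonal confidence is the product of the two confidences. The function name and the three toy keypoints are hypothetical.

```python
def build_pam(keypoints):
    """Build the x, y and confidence submatrices from (x, y, c) keypoints."""
    xs, ys, cs = zip(*keypoints)
    n = len(keypoints)

    def displacement_matrix(v):
        # Diagonal: the coordinate itself. Off-diagonal: displacement v[i] - v[j].
        return [[v[i] if i == j else v[i] - v[j] for j in range(n)] for i in range(n)]

    pam_x = displacement_matrix(xs)
    pam_y = displacement_matrix(ys)
    # Diagonal: the confidence itself. Off-diagonal: product of both confidences.
    pam_c = [[cs[i] if i == j else cs[i] * cs[j] for j in range(n)] for i in range(n)]
    return pam_x, pam_y, pam_c


# Toy skeleton with three keypoints (nose, neck, hip) instead of 18.
keypoints = [(100, 50, 0.9), (100, 80, 0.8), (110, 150, 0.5)]
pam_x, pam_y, pam_c = build_pam(keypoints)
```

With 18 keypoints the same function yields three $18 \times 18$ submatrices.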
The main advantage of the PAM is that it provides an additional constraint on the human skeleton shape for person pose estimation. Take the displacement on the $y$ axis from the nose to the neck as an example: the displacement is negative in the majority of situations because the neck is below the nose in the skeleton of a standing person. The sign encodes the direction from nose to neck as a constraint, while the absolute value of the displacement encodes the length from nose to neck as a constraint. When we regress the PAM, the additional regression on its displacements works as a regularization term, which takes the person skeleton shape into consideration and improves the generalization ability of the approach compared to regressing keypoint coordinates directly.
4.3 Network Framework
We denote the training dataset as $D = \{(I_t, C_t),\ t \in [1, n]\}$, where $I_t$ and $C_t$ are a pair of synchronized image frame and CSI series, respectively; $t$ is the sampling moment; and $n$ is the dataset size. We propose a novel deep network to learn a mapping from CSI series to person body keypoints. The network framework is composed of AlphaPose fang2017rmpe as the teacher network and WiSPPN as the student network, as shown in Fig. 5. The teacher and student networks are termed $T(\cdot)$ and $S(\cdot)$, respectively. For each pair $(I_t, C_t)$, $T(\cdot)$ takes $I_t$ as input and outputs the corresponding body keypoint coordinates and confidences, $(x_i, y_i, c_i)$, with its person detector and pose regressor. We then convert the outputs to a body pose adjacent matrix, $\mathrm{PAM}_t$, with the aforesaid Equation (1). We formalize the operation of the teacher network as $T(I_t) \rightarrow \mathrm{PAM}_t$, where $\mathrm{PAM}_t$ is the cross-modality supervision to teach $S(\cdot)$.
We now go into detail on the student network $S(\cdot)$, i.e., WiSPPN. In the training stage, $S(\cdot)$ takes the CSI series $C_t$ as input and outputs a corresponding prediction of the pose adjacent matrix, $\mathrm{pPAM}_t$. Then $S(\cdot)$ is optimized under the supervision of the teacher's output $\mathrm{PAM}_t$. As shown in Fig. 5, WiSPPN consists of three key modules: the encoder, the feature extractor and the decoder. A $C_t$ is converted to the prediction $\mathrm{pPAM}_t$ by passing through these three modules successively. Next we explain our design intentions and the parameter details of these three modules.
1. Encoder. The encoder is designed to upsample the CSI series $C_t$ to a width and height suitable for mainstream convolutional backbone networks such as VGG simonyan2014very and ResNet he2016deep . Recall that our WiFi system is composed of a sender and a receiver, both with 3 antennas, which outputs CSI samples of size $30 \times 3 \times 3$ through an open-source tool halperin2011tool , where 30 is the number of OFDM carriers described in Section 3.1. As stated in Section 4.1, one image matches 5 consecutive CSI samples due to the difference in sampling rates, leading to $C_t \in \mathbb{R}^{5 \times 30 \times 3 \times 3}$, and we reshape it along the time axis, which makes $C_t \in \mathbb{R}^{150 \times 3 \times 3}$. However, a typical RGB image has a size like $3 \times 224 \times 224$, where 3 stands for the three color channels (red, green and blue) and the 224s are the height and width of the image. To enlarge the width and height of CSI samples, CSI-Net wang2018csi uses 8 stacked transposed convolutional layers to upsample its input gradually, which is computationally expensive. In WiSPPN, we instead apply a single bilinear interpolation operation to upsample $C_t$ directly for further feature extraction.
2. Feature extractor. With the upsampled CSI, the feature extractor is used to learn effective features for person pose estimation. Because CSI lacks the spatial information of person body keypoints that images have, we need a powerful feature extractor to recover this spatial information. Conventionally, a deeper network has a more powerful feature learning ability, so we tend to use a deeper network as the feature extractor of WiSPPN. However, deeper networks are prone to vanishing or exploding gradients, because the chain rule in backpropagation can make gradients shrink or grow exponentially across very deep stacks of convolutional layers. ResNets he2016deep are a family of the most widely used backbone networks in deep learning, especially in computer vision research, and alleviate this problem with shortcut connections and residual blocks. Considering this advantage, we stack 4 basic blocks of ResNet he2016deep (16 convolutional layers) as the feature extractor of WiSPPN (shown in Fig. 6), which learns the features, termed $F_t$, that are passed to the decoder. The detailed parameters of the feature extractor are listed in Table 1. Note that batch normalization ioffe2015batch and a rectified linear unit activation krizhevsky2012imagenet follow every convolutional layer, successively.
Table 1: Parameters of the feature extractor (columns: block name, output size, parameters).
3. Decoder. The decoder is designed to adapt the shape of the learned features, $F_t$, to that of the supervision output by the teacher network, $\mathrm{PAM}_t$. As described in Section 4.2, the pose adjacent matrix is a novel form for embedding the 18 body keypoint coordinates and the corresponding confidences, and has a size of $3 \times 18 \times 18$. In the pose estimation task, a body keypoint is localized by two coordinates, i.e., on the $x$ axis and the $y$ axis. Thus the decoder is designed to take $F_t$ as input and predict the pose adjacent matrix in the $x$ and $y$ dimensions, leading to a predicted pose adjacent matrix $\mathrm{pPAM}_t \in \mathbb{R}^{2 \times 18 \times 18}$. To achieve this, we stack two convolutional layers, illustrated in Fig. 7, where the first layer mainly compresses channel-wise information (from 300 to 36) and the second mainly further reorganizes the spatial information of the features with its convolutional kernels.
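The encoder step of the module list above (reshape along the time axis, then one bilinear interpolation) can be illustrated with plain Python lists. This is a sketch under assumptions: the $5 \times 5$ output size is a placeholder chosen for readability rather than the size used in WiSPPN, the interpolation maps corner pixels to corner pixels (one of several common bilinear conventions), and both function names are hypothetical.

```python
def reshape_csi(csi):
    """Merge the time axis into the carrier axis: 5x30x3x3 -> 150x3x3."""
    return [plane for sample in csi for plane in sample]


def bilinear_upsample(plane, out_h, out_w):
    """Bilinearly resize one 2-D plane; corners map to corners.

    Requires at least 2 rows and 2 columns in both input and output.
    """
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1)      # source row position
        y0 = min(int(y), in_h - 2)
        dy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1)  # source column position
            x0 = min(int(x), in_w - 2)
            dx = x - x0
            top = plane[y0][x0] * (1 - dx) + plane[y0][x0 + 1] * dx
            bottom = plane[y0 + 1][x0] * (1 - dx) + plane[y0 + 1][x0 + 1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


# Toy CSI series: 5 samples x 30 carriers x 3 x 3 antennas, value = row + column.
csi = [[[[float(a + b) for b in range(3)] for a in range(3)]
        for _ in range(30)] for _ in range(5)]
planes = reshape_csi(csi)
upsampled = [bilinear_upsample(p, 5, 5) for p in planes]
```

A single interpolation like this has no learnable parameters, which is why it is cheaper than a stack of transposed convolutions.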
In summary, with the encoder, feature extractor and decoder, the student network, WiSPPN, predicts a pose adjacent matrix $\mathrm{pPAM}_t$ for each CSI input $C_t$. We formalize this process as $S(C_t) \rightarrow \mathrm{pPAM}_t$. During the training stage, every predicted $\mathrm{pPAM}_t$ is supervised by the corresponding result of the teacher network, i.e., $\mathrm{PAM}_t$. Once the student network is well trained, it gains the ability to do single person pose estimation with CSI input alone. Next we describe the training stage, including the loss computation and implementation details.
4.4 Pose Adjacent Matrix Similarity Loss
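As described above, the teacher network outputs $\mathrm{PAM}_t$ as supervision, and the student network outputs $\mathrm{pPAM}_t$ as predictions. With the supervision and the predictions, the L2 loss is a basic option for optimizing WiSPPN:

$$L = \| \mathrm{pPAM}_x - \mathrm{PAM}_x \|_2^2 + \| \mathrm{pPAM}_y - \mathrm{PAM}_y \|_2^2$$

where $\|\cdot\|_2^2$ is the operator that computes the (squared) L2 distance; $\mathrm{pPAM}_x$ and $\mathrm{PAM}_x$ are the prediction and supervision of the pose adjacent matrix for body keypoint coordinates on the $x$ axis, respectively; and $\mathrm{pPAM}_y$ and $\mathrm{PAM}_y$ are defined similarly on the $y$ axis.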
In this paper, we additionally incorporate the prediction confidence of the keypoints into the loss computation.
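One plausible way to bring confidence into the loss is to weight each squared error by the matching entry of the supervision's confidence submatrix. The sketch below assumes exactly that element-wise weighting, which is not confirmed by the surviving text; the function and argument names are hypothetical.

```python
def pam_similarity_loss(pred_x, pred_y, sup_x, sup_y, sup_c=None):
    """Sum of squared PAM errors over x and y.

    If `sup_c` is given, each element's error is weighted by the
    corresponding confidence entry (assumed weighting scheme).
    """
    total = 0.0
    n = len(sup_x)
    for i in range(n):
        for j in range(n):
            weight = 1.0 if sup_c is None else sup_c[i][j]
            err = (pred_x[i][j] - sup_x[i][j]) ** 2 + (pred_y[i][j] - sup_y[i][j]) ** 2
            total += weight * err
    return total


# Toy 2x2 matrices standing in for 18x18 submatrices.
sup_x = [[1.0, -2.0], [2.0, 3.0]]
pred_x = [[2.0, -2.0], [2.0, 3.0]]   # off by 1 at (0, 0)
sup_y = [[0.0, 0.0], [0.0, 0.0]]
pred_y = [[0.0, 0.0], [0.0, 2.0]]    # off by 2 at (1, 1)
sup_c = [[0.5, 0.25], [0.25, 0.5]]

plain = pam_similarity_loss(pred_x, pred_y, sup_x, sup_y)
weighted = pam_similarity_loss(pred_x, pred_y, sup_x, sup_y, sup_c)
```

Under this scheme, errors on keypoints that the teacher detected with low confidence contribute less to the gradient.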
4.5 Training Details and Pose Association
We implemented WiSPPN with PyTorch 1.0. The network is trained for 20 epochs with an initial learning rate of 0.001, a batch size of 32 and the Adam optimizer. The learning rate decays by a factor of 0.5 at the 5th, 10th and 15th epochs.
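The step schedule stated above can be written as a small function. This sketch assumes zero-indexed epochs and that the decay takes effect from the milestone epoch onward, similar in spirit to PyTorch's `torch.optim.lr_scheduler.MultiStepLR`; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.5, milestones=(5, 10, 15)):
    """Step schedule: multiply the rate by `gamma` once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate for each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]
```

Under these assumptions the rate is halved three times over the 20 epochs.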
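Once WiSPPN is trained, we use it to estimate person pose from testing CSI samples. Taking one sample as an example, we obtain a predicted PAM ($\mathrm{pPAM}$). We take the diagonal elements of its $x$ and $y$ submatrices as the body keypoint predictions, i.e., $x_i = \mathrm{pPAM}_x[i, i]$ and $y_i = \mathrm{pPAM}_y[i, i]$ for $i = 1, \dots, 18$.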
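Reading the keypoints back from a predicted PAM amounts to taking the two diagonals, as the short sketch below shows. The function name is hypothetical and the toy matrices stand in for the $18 \times 18$ submatrices.

```python
def keypoints_from_pam(pam_x, pam_y):
    """Return (x, y) per keypoint from the diagonals of the two submatrices."""
    return [(pam_x[i][i], pam_y[i][i]) for i in range(len(pam_x))]


# Toy predicted submatrices for three keypoints; off-diagonals are ignored here.
pred_x = [[100, 0, -10], [0, 100, -10], [10, 10, 110]]
pred_y = [[50, -30, -100], [30, 80, -70], [100, 70, 150]]
predicted_keypoints = keypoints_from_pam(pred_x, pred_y)
```

The off-diagonal displacements are used during training as a constraint and are not needed at this step.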
5.1 Data Collection
We collected data under the approval of the Carnegie Mellon University IRB (No. STUDY2018_00000352). We recruited 8 volunteers and asked them to perform casual daily actions in two rooms on campus, one laboratory and one classroom. Floor plans and data collection positions are illustrated in Fig. 8. During the actions, we ran the system in Fig. 3 to record CSI samples and videos simultaneously. For each volunteer, the first 80% of the recording is used to train the networks and the last 20% is used to test them. The training and testing sets contain 79496 and 19931 samples, respectively.
5.2 Experimental Results
In this metric (percentage of correct keypoints, PCK), the binary indicator outputs 1 when its condition is true and 0 otherwise, the norm is the same as in the loss equation, the count is averaged over the number of test frames, and the joint index runs over the 18 body keypoints. The positions of the right shoulder and the left hip define a reference length that can be regarded as the length of the torso, which is used to normalize the prediction error between the predicted coordinates and the ground truth.
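A percentage-of-correct-keypoints style metric matching this description can be sketched as follows. The exact formula did not survive in the text, so this version makes assumptions: Euclidean distances are used for both the error and the shoulder-to-hip reference length, the threshold is a fraction (for example 0.2 for "@20"), and the joint indices in the toy data are placeholders. The function name is hypothetical.

```python
import math


def pck(pred, gt, joint, threshold, rs_index, lh_index):
    """Fraction of frames whose normalized error for `joint` is within `threshold`.

    `pred` and `gt` are lists of frames; each frame is a list of (x, y) points.
    The error is normalized by the ground-truth distance between the right
    shoulder (`rs_index`) and the left hip (`lh_index`).
    """
    correct = 0
    for p, g in zip(pred, gt):
        reference = math.dist(g[rs_index], g[lh_index])
        error = math.dist(p[joint], g[joint])
        if error / reference <= threshold:
            correct += 1
    return correct / len(gt)


# Toy data: joint 0 is evaluated, joint 1 is the right shoulder, joint 2 the left hip.
gt = [[(10, 10), (0, 0), (30, 40)], [(10, 10), (0, 0), (30, 40)]]
pred = [[(13, 14), (0, 0), (30, 40)], [(10, 40), (0, 0), (30, 40)]]

pck_at_20 = pck(pred, gt, joint=0, threshold=0.2, rs_index=1, lh_index=2)
pck_at_70 = pck(pred, gt, joint=0, threshold=0.7, rs_index=1, lh_index=2)
```

Normalizing by a body-relative length makes the score less sensitive to how large the person appears in the frame.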
Table 2 shows the estimation performance for the 18 body keypoints at PCK@5, PCK@10, PCK@20, PCK@30, PCK@40 and PCK@50. The table shows that WiSPPN performs pose estimation well. Figure 9 illustrates some estimation comparisons between AlphaPose and WiSPPN. The results show that WiSPPN can perform single person pose estimation with results comparable to those from cameras.
In this paper, we build a system and propose a novel network, termed WiSPPN, for fine-grained WiFi sensing, i.e., single person pose estimation. The experimental results show that WiFi sensors can achieve performance comparable to cameras in fine-grained human sensing.
We thank Jianwei Feng, Zeyi Huang and Sanping Zhou for valuable discussions. Fei Wang is supported by China Scholarship Council.
- (1) D. Vasisht, S. Kumar, and D. Katabi, “Decimeter-level localization with a single wifi access point,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016, pp. 165–178.
- (2) M. Kotaru, K. Joshi, D. Bharadia, and S. Katti, “Spotfi: Decimeter level localization using wifi,” in ACM SIGCOMM Computer Communication Review, vol. 45, no. 4. ACM, 2015, pp. 269–282.
- (3) K. Qian, C. Wu, Y. Zhang, G. Zhang, Z. Yang, and Y. Liu, “Widar2.0: Passive human tracking with a single wi-fi link,” in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2018, pp. 350–361.
- (4) X. Li, S. Li, D. Zhang, J. Xiong, Y. Wang, and H. Mei, “Dynamic-music: accurate device-free indoor localization,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016, pp. 196–207.
- (5) W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding and modeling of wifi signal based human activity recognition,” in Proceedings of the 21st annual international conference on mobile computing and networking. ACM, 2015, pp. 65–76.
- (6) A. Virmani and M. Shahzad, “Position and orientation agnostic gesture recognition using wifi,” in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2017, pp. 252–264.
- (7) M. Kotaru and S. Katti, “Position tracking for virtual reality using commodity wifi,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 68–78.
- (8) H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “Rmpe: Regional multi-person pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2334–2343.
- (9) Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.
- (10) F. Wang, J. Han, S. Zhang, X. He, and D. Huang, “Csi-net: Unified human body characterization and action recognition,” arXiv preprint arXiv:1810.03064, 2018.
- (11) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- (12) J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- (13) S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
- (14) Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.
- (15) A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
- (16) Y. Zhang, C. J. Yang, S. E. Hudson, C. Harrison, and A. Sample, “Wall++: Room-scale interactive and context-aware sensing,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018, p. 273.
- (17) MG Chemicals. Product Information at https://www.mgchemicals.com/products/emi-and-rfi-shielding/water-based-conductive-coatings-wb-series/841wb-super-shield-water-based-nickel-conductive-coating.
- (18) T. Li, C. An, Z. Tian, A. T. Campbell, and X. Zhou, “Human sensing using visible light communication,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 2015, pp. 331–344.
- (19) T. Li, Q. Liu, and X. Zhou, “Practical human sensing in the light,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 71–84.
- (20) T. Li, X. Xiong, Y. Xie, G. Hito, X.-D. Yang, and X. Zhou, “Reconstructing hand poses using visible light,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, p. 71, 2017.
- (21) F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, “Capturing the human figure through a wall,” ACM Transactions on Graphics (TOG), vol. 34, no. 6, p. 219, 2015.
- (22) M. Zhao, T. Li, M. A. Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi, “Through-wall human pose estimation using radio signals,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018, pp. 7356–7365.
- (23) D. Shin, F. Xu, D. Venkatraman, R. Lussana, F. Villa, F. Zappa, V. K. Goyal, F. N. Wong, and J. H. Shapiro, “Photon-efficient imaging with a single-photon camera,” Nature communications, vol. 7, p. 12046, 2016.
- (24) M. O’Toole, F. Heide, D. B. Lindell, K. Zang, S. Diamond, and G. Wetzstein, “Reconstructing transient images from single-photon sensors,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 2289–2297.
- (25) D. Halperin, W. Hu, A. Sheth, and D. Wetherall, “Tool release: Gathering 802.11n traces with channel state information,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 1, pp. 53–53, 2011.
- (26) J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- (27) D. B. West et al., Introduction to graph theory. Prentice hall Upper Saddle River, NJ, 1996, vol. 2.
- (28) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- (29) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
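- (30) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.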
- (31) M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014, pp. 3686–3693.
- (32) Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” TPAMI, vol. 35, no. 12, pp. 2878–2890, 2013.