CrossCount: A Deep Learning System for Device-free Human Counting using WiFi

07/07/2020 ∙ by Osama T. Ibrahim, et al. ∙ 0

Counting humans is an essential part of many people-centric applications. In this paper, we propose CrossCount: an accurate deep-learning-based human count estimator that uses a single WiFi link to estimate the human count in an area of interest. The main idea is to rely on the temporal link-blockage pattern as a discriminant feature that is more robust to wireless channel noise than the signal strength, hence delivering a ubiquitous and accurate human counting system. As part of its design, CrossCount addresses a number of deep learning challenges such as class imbalance and training data augmentation for enhancing the model generalizability. Implementation and evaluation of CrossCount in multiple testbeds show that it can achieve a human counting accuracy to within a maximum of 2 persons 100% of the time. This highlights the promise of CrossCount as a ubiquitous crowd estimator with non-labour-intensive data collection from off-the-shelf devices.




I Introduction

Monitoring the human count in a given area of interest is a crucial part of many pervasive applications such as smart guiding in museums, energy management in smart buildings, indoor analytics, and evacuation in emergency situations. For example, in a retail store, lighting and air conditioning can be automatically adjusted based on the customer density in each section. Furthermore, occupancy statistics can identify the store sections that attract more visitors, which helps in planning future business [retail].

Due to its importance, human counting has attracted the attention of the research community. For example, computer vision researchers, aided by recent advances in deep learning, have presented high-accuracy human counting systems [CrowdNet, CNN, CrossScene, RCNN]. Images/videos captured by cameras from the area of interest are processed to estimate the human count. Most of these systems are built using Convolutional Neural Networks (CNN) with various architectures and feature optimization techniques. After the network is trained with a large set of labeled images/videos, the human density is estimated by processing images/videos of the crowd using the trained network. However, vision-based systems incur high installation cost, suffer from blind spots and occlusion, require high computational power, are limited in poor lighting conditions, and raise privacy concerns. In addition, they cannot work through walls. This through-wall capability is highly desirable in a number of applications such as law enforcement [hunt2001].

To address these issues, algorithms based on analyzing the RF signals have been introduced. In particular, they analyze the signal received from the already-installed wireless infrastructure. RF-based systems can be either device-based or device-free. In device-based systems, each human target must be equipped with a device, such as a cell phone [WiCounter, deviceBased], which limits the system ubiquity. On the other hand, in device-free systems, the number of human targets inside an area is estimated by analyzing their impact on the wireless links covering this area, without requiring them to carry any device. The concept of RF device-free localization was introduced in 2007 [challenges], presenting human counting as one of the challenges that faces the newborn technique. Since then, a number of device-free counting techniques have been introduced based on different features and machine learning algorithms [nakatsuka2008, Nuzzer, yoshida2015, Depatla2015].

In an early study, Nakatsuka et al. [nakatsuka2008] demonstrated the feasibility of using the Received Signal Strength (RSS) of radio links to estimate the human count. The authors showed empirically that increasing the number of people leads to higher variance in the RSS signal of a single RF link. Based on that, they derived a linear formula that relates the human count to the RSS average and variance. The counting functionality in the Nuzzer system [Nuzzer] extended this model to work on a large scale. Based on observations, the authors showed that the variance of a single link is not enough to differentiate clearly between the human count classes. Therefore, they proposed to use the average relative variance of WiFi links to count only up to two persons with accuracy and up to three persons with accuracy. Adding more complexity to the model, Yoshida et al. [yoshida2015] examined non-linear regression to capture the relation between the RSS and the number of people. Specifically, they used a Gaussian kernel to perform regression on the absolute RSS values of 10 wireless links. Depatla et al. [Depatla2015] proposed a probabilistic approach that calculates the RSS probability mass function (PMF) of one link as a profile for each human count. When testing using an RSS vector, they compare its PMF with the pre-calculated profiles and report the nearest one, achieving exact counting accuracy with estimation error of or less of time.

The main limitation of the above approaches is that the area of interest must be covered by a large number of WiFi links in order to achieve an acceptable counting accuracy, which is not the case in many wireless environments and applications. One possible way to resolve this trade-off between link density and counting accuracy is to use channel state information (CSI) instead of RSS [ElectronicFrogEye, TrainedOnce, Cheng2017], where the data of all RF sub-carriers of every WiFi link is available for processing. Unfortunately, unlike RSS, reading CSI data is not supported by all wireless cards. In addition, both RSS- and CSI-based techniques cannot work well in through-the-wall scenarios, due to the attenuation of the RF signal.

Recently, Depatla et al. [Depatla2018] proposed to utilize the WiFi link blockage events as a discriminative feature instead of features that depend on the exact RSS values. In particular, the system embeds the inter-arrival times between the link line of sight (LoS) blockages into a renewable stochastic process that models the human motion mathematically. In addition to providing through-wall counting, the blockage pattern performs well when counting moving targets, which incur more RSS variance that affects the estimation model. However, the accuracy of this system degrades significantly in a number of real-world scenarios, as we quantify in Section IV, due to the oversimplified assumptions in the mathematical model used. These simplifications include discarding the order of inter-arrival times, which is an important part of the context information, and simulating the human motion to generate the model training data. Besides, the proposed mathematical model is tailored for special cases of testbeds where the WiFi link is in the middle and aligned with the area of interest. For any different setup, the mathematical model in [Depatla2018] does not fit, leading to deteriorated performance.

In this paper, we present CrossCount, a through-wall human counting system that leverages a Recurrent Neural Network (RNN) to map the temporal pattern of link inter-blockage times to the human count using a single WiFi link. The idea is that the more people there are in an area of interest, the shorter the time between successive blockages of a single WiFi link, and vice versa. As part of the CrossCount design, we introduce different modules to address practical issues such as reducing the labour-intensive calibration required for training a deep learning model as well as handling the imbalance in the number of training cases between the different counting classes.

Implementation and evaluation of CrossCount in different testbeds show that it can provide the exact human count of the time. This increases to to within two persons difference in count. This is achieved using the information of only a single WiFi link, highlighting CrossCount promise as a ubiquitous through-the-wall human counting system. To sum up, the main contributions of this work are threefold:

  • We present the architecture and details of CrossCount: a deep learning system that leverages the temporal blockage information of a single WiFi link to provide accurate device-free human counting.

  • Within CrossCount, we propose a novel technique for training data augmentation and class balancing to significantly decrease the data collection overhead.

  • We implement CrossCount and thoroughly evaluate its performance in clear and cluttered WiFi testbeds by counting up to persons.

The rest of the paper is organized as follows. Section II provides an overview of how CrossCount works. Section III gives the details of the CrossCount components and how the system deals with different practical challenges. We evaluate the system performance in Section IV. Section V discusses the limitations of the system. Finally, Section VI concludes the paper and outlines future directions.

II System Overview

Fig. 1: CrossCount system architecture.

In this section, we present a typical scenario of how CrossCount works and give an overview about the information flow through its main modules. The details of these modules are described in subsequent sections. We assume an indoor area covered by a single WiFi link whose transmitter and receiver are behind the walls. There is an unknown number of people that are moving casually inside this room. Given only the RSS readings over time at the receiver, our goal is to infer the human count inside this area of interest.

The CrossCount system architecture is depicted in Fig. 1. The basic idea of CrossCount is to map a sequence of link-blockage times to the estimated count, based on the observation that the more people there are in the area of interest, the shorter the time between link blockages. To do this mapping, CrossCount leverages a recurrent neural network. In particular, for discretized time, the input sequence to the RNN is a stream of binary values that are 1 at the time instances when an LoS blockage event is encountered and 0 otherwise. The output of the RNN is the estimated count.


CrossCount operates in three stages: preparing the training data, training the deep network, and finally classifying the input sequence by leveraging the trained network. During the first stage, the training sequences are collected manually in a light-weight process by recording the timestamps of the WiFi link virtual LoS blockages caused by a single walking human, without reading the link RSS. However, during the online counting stage, the link-blockage events are calculated automatically by processing the captured WiFi link RSS stream at the receiver.

The typical way of collecting the training data in the literature is to try all the combinations of the number of people in the area of interest [nakatsuka2008, yoshida2015, ElectronicFrogEye, TrainedOnce, Cheng2017]. This makes the training task labor-intensive and limits the system's ability to detect a large number of people [RadioGrapher]. To address the scale and overhead of collecting the training data, CrossCount introduces a new technique that depends on collecting the training samples using only a single person. This single-person training data is then processed to automatically generate the training sets for all the other count classes. Specifically, during the Training Data Preparation stage, the Training Data Collector module records the LoS blockage events caused by a single person moving in the environment to generate the training data for the single-human count class. Thereafter, the Training Data Synthesizer module processes the collected data to generate the training set for the multiple-user classes.

The generated data also suffers from the imbalanced data distribution for the different count classes. In addition, deep learning models require large amounts of training data. The Training Classes Balancer module tackles both of these issues by augmenting the training set with synthesized data. This also enhances the system generalizability and increases its ability to deal with the noisy characteristics of wireless channels.


CrossCount then trains a Long Short-Term Memory (LSTM) RNN to characterize the blockage pattern of each human count class. The data preparation and training phases are done offline only once while deploying the system.

During the online system operation in the Device-free Counting stage, CrossCount estimates the unknown crowd count by extracting their LoS blockages sequence pattern and classifying it using the trained LSTM model. Specifically, the RSS Collector reads the signal strength received over the WiFi link for a specific time window. The LoS Blockage Detector module processes the RSS values and estimates the LoS blockage events timing encountered along this window in quantized time units. This generates a binary link blockage sequence that has a 1 in the time slot that the link was blocked inside the current window and zero otherwise. The input binary sequence is then passed to the Count Estimator module which uses the learned LSTM network to estimate the human count.

III The CrossCount System

In this section, we provide the details of the main CrossCount modules, including diminishing the training data collection overhead, class balancing, and enhancing the system generalizability. We start with an overview of the CrossCount training process.

III-A CrossCount Training Process Overview

As a supervised machine learning system, each sample in the CrossCount training set is a sequence of link blockage times labeled with the number of persons that generated this sequence. The link blockage times are mapped to a bit map, where ones represent the time instances at which the link was blocked in a discrete, time-slotted system (Fig. 2).
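The mapping from blockage timestamps to a bit map can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blockage_bitmap`, the one-second slot width, and the example timestamps are all assumptions made here.

```python
def blockage_bitmap(blockage_times, window_seconds, slot_seconds=1.0):
    """Map LoS blockage timestamps (seconds from window start) to a 0/1 list.

    The slot width is an assumed parameter; the source only states that
    time is discretized into slots.
    """
    n_slots = int(window_seconds / slot_seconds)
    bits = [0] * n_slots
    for t in blockage_times:
        slot = int(t // slot_seconds)
        if 0 <= slot < n_slots:  # ignore events outside the window
            bits[slot] = 1
    return bits


# Hypothetical example: three crossings inside an 8-second window.
bitmap = blockage_bitmap([1.2, 4.7, 4.9], window_seconds=8)
print(bitmap)  # [0, 1, 0, 0, 1, 0, 0, 0]
```

Two crossings that fall in the same slot collapse into a single 1, so the slot width bounds the temporal resolution of the input sequence.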

The direct way to collect these samples is for the specified number of persons to move inside the area of interest for a certain time window, , record the time instances they cross the LoS, and label the generated sequence with the human count. This should be repeated for each and every crowd count class, i.e. number of humans inside the area of interest. Moreover, each class should have a large number of training samples to generate a well-trained deep network.

Accordingly, collecting the training data is a labor-intensive and time-consuming task. Some previous work, e.g. [Depatla2015, Depatla2018], proposed a human motion model to simulate this task. However, the mathematical model assumes simple motion that does not capture complex real-life scenarios, affecting the system accuracy as we quantify in Section IV-D. CrossCount resolves this challenge by collecting the training samples for only a single user, from which the whole training data for any arbitrary number of humans inside the area of interest is extrapolated.

III-A1 Training Data for Single Target

The Training Data Collector module allows a single person to move randomly inside the area of interest while recording the timestamp each time she crosses the WiFi link virtual LoS. Note that the training process does not require processing the RSS of the link as it is based on visually recording when the user crosses the virtual line between the transmitter and receiver. This is because CrossCount depends only on the time of link blockage and not on the specific RSS.

Assuming a discrete time space, a training sample takes time steps, to be collected. This is repeated times to collect different training sequences, each of length . We call these manually collected samples “the original training samples”.

III-A2 Reducing Training Data Collection Overhead for Multiple Persons

Fig. 2: An example describing the superposition of single-person sequences to generate a -persons sequence . The circles, diamonds, and squares represent the LoS blockages of the first, second, and third sequence respectively.

CrossCount generates the training samples for the multi-person classes using the collected data from the single-person case as described in the previous section. To do that, we assume that each person is moving independently of the others. As a result, the LoS blockage sequence of a user is independent from the others and the blockage pattern for multiple persons can be calculated as the superposition of the individual persons’ blockages. In particular, CrossCount randomly selects sequences out of the collected samples of a single person and considers each of them as a link blockage sequence for a different user. The -person count class training sample is synthesized from the superposition of these sequences. Fig. 2 shows an example of how to generate a training sequence for persons from single-person training sequences. For each time instance, the superposed sequence reports an LoS blockage if any of the single-person sequences encounters a blockage at this instance. For each multi-person count class, the Training Data Synthesizer module applies the superposition technique over all the available combinations of the original training samples to generate the training set of the higher count classes.
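The superposition step can be illustrated with a short sketch. The function names `superpose` and `synthesize_class` and the three toy sequences are hypothetical; only the rule itself (a slot is blocked if any single-person sequence is blocked in that slot, applied over combinations of the original samples) comes from the text.

```python
from itertools import combinations


def superpose(sequences):
    """Bitwise OR of equal-length 0/1 sequences, one per simulated person."""
    return [int(any(bits)) for bits in zip(*sequences)]


def synthesize_class(single_person, k):
    """One k-person sample for every k-combination of single-person samples."""
    return [superpose(combo) for combo in combinations(single_person, k)]


# Toy single-person sequences (illustrative only, not collected data).
originals = [
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
]

print(superpose(originals))                 # [1, 1, 1, 0, 1, 0]
print(len(synthesize_class(originals, 2)))  # 3
```

Because the number of k-combinations varies strongly with k, the per-class sample counts produced this way are uneven, which is the imbalance the next subsection addresses.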

III-A3 Handling the Class Imbalance Problem

Count Class
No. of Samples
TABLE I: An example of the imbalanced training data for the 0-5 count classes. Note that the zero-count class has just one example represented by the bit sequence of all zeros, i.e., no link blockage events in an empty area of interest.

The Training Data Synthesizer uses the superposition technique to generate training samples for each -count class. For instance, Table I lists the size of the training set for the 0- to 5-count classes as generated from original training samples. There is only one training sample for the 0-count class, which is a sequence of zeros, as it does not encounter any LoS blockage during the counting window. The table shows that the class distribution is not uniform and is severely skewed. This imbalanced class data problem is well-known in machine learning [Imbalanced, hamada] and leads to a bias in the classifier towards the classes with large training data.

Traditionally, random oversampling and sub-sampling are simple and well-established solutions for this class imbalance problem [Imbalanced]. Oversampling increases the minority classes by replicating the training samples, while sub-sampling decreases the majority classes by randomly removing some of their samples. Oversampling leads to overfitting the training data [Overfitting]. Therefore, in CrossCount, the Training Data Balancer module sub-samples the training sets of the majority classes. In addition, CrossCount augments the data of the minority classes by injecting noise into the training samples in order to improve the algorithm generalization capability and system robustness [BookMIT]. Noise is injected by randomly flipping some of the bits in the link-blockage bit stream input. This simulates injecting false link-blockage events or missing an actual link-blockage event, which may occur in reality due to the noisy wireless channel. This noisy training data increases CrossCount robustness and avoids overfitting.

The Training Data Synthesizer and Class Balancer modules are implemented as follows. The main input from the Training Data Collector is the manually collected original single-person sequences, from which the multiple-persons training data is synthesized. Each training sequence is seconds in length. For the zero-entity class, there is only one original training sample which is a sequence of zeros with length , expressing that no LoS blockage happened along the time window. This sequence is noised by flipping any single bit along the sequence length. This generates a total of training sample for the -count case, limiting all other classes to this size of training samples, to ensure a balanced training set. The training data for any other -count class is generated as follows. A random combination of original sequences is selected from the inputs and a superposed training sample is synthesized from the bitwise logical OR of all the randomly-selected sequences. Another superposed sample can be generated by selecting another random sequence combination. We keep repeating this till meeting the class balancing condition where the number of generated sequences is equal to . If all the available sequence combinations are processed before meeting the balancing condition, the training set for the current -class is augmented by data noising. A new training sequence is generated by selecting a random superposed sequence, flipping any random bit inside, and appending it back to the training set.
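The balancing procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `flip_one_bit`, `zero_class`, and `balanced_class` are invented here, the target class size is passed in as a parameter because the exact value is not recoverable from the text, and `zero_class` keeps the clean all-zero sequence alongside its single-flip variants, which is also an assumption.

```python
import random
from itertools import combinations


def flip_one_bit(seq, rng):
    """Return a copy of seq with one randomly chosen bit inverted (data noising)."""
    noisy = list(seq)
    noisy[rng.randrange(len(noisy))] ^= 1
    return noisy


def zero_class(length):
    """All-zero sequence plus every single-bit-flipped variant (assumed layout)."""
    samples = [[0] * length]
    for i in range(length):
        s = [0] * length
        s[i] = 1
        samples.append(s)
    return samples


def balanced_class(originals, k, target, rng):
    """Build exactly `target` samples for the k-count class (k >= 1)."""
    combos = list(combinations(range(len(originals)), k))
    if not combos:
        raise ValueError("need at least k original sequences")
    rng.shuffle(combos)
    samples = []
    # Sub-sample: use at most `target` random combinations (bitwise OR each).
    for combo in combos[:target]:
        seqs = [originals[i] for i in combo]
        samples.append([int(any(bits)) for bits in zip(*seqs)])
    # Augment: if combinations ran out, add noisy copies until balanced.
    while len(samples) < target:
        samples.append(flip_one_bit(rng.choice(samples), rng))
    return samples


rng = random.Random(0)
originals = [[1, 0, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0],
             [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
print(len(balanced_class(originals, 2, target=10, rng=rng)))  # 10
```

Enumerating every combination is only practical for small inputs; a larger deployment would more likely draw random combinations directly instead of materializing the full list.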

III-B Model Training

Fig. 3: CrossCount network diagram.

Recurrent Neural Networks (RNN) model the contextual information in the input sequences. RNNs have been employed in many applications whose input data are sequences of features, such as speech recognition [speechRec] and handwriting alignment [handWriting]. However, when dealing with long sequences, traditional RNNs suffer from the vanishing gradient problem during training [BookPhD]. The Long Short-Term Memory (LSTM) [LSTM] architecture is one of the most commonly used solutions to this problem. Therefore, CrossCount leverages it to capture the contextual information in the link-blockage binary input sequence. Figure 3 shows the CrossCount LSTM architecture. The input sequence is one temporal bit stream generated from the deployed WiFi link, representing the link-blockage events as described in Section III. Therefore, we employ a one-dimensional input layer. Each crowd count is an output class of the network. Accordingly, the number of count classes determines the size of the output layer, which is preceded by a fully connected layer with softmax for classifying among more than two output classes. The LSTM layer uses the tanh and sigmoid functions, the default configuration, for the cell state and gate activations respectively. The output layer implements the cross-entropy loss function.
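To make the data flow through such a network concrete, the following is a minimal pure-Python sketch of a forward pass: one-dimensional input, a single LSTM layer with sigmoid gates and tanh cell activation, then a fully connected layer with softmax and a cross-entropy loss. It is not the authors' implementation. The hidden size of 8, the six output classes, the random weights, and all function names are placeholders chosen here, and training (for example with SGDM) is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(hidden, n_classes, rng):
    """Random placeholder weights; each gate sees [x_t, h_{t-1}, bias]."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    gates = {name: mat(hidden, hidden + 2) for name in "ifog"}
    fc = mat(n_classes, hidden + 1)  # fully connected layer, last column is bias
    return gates, fc


def forward(bits, gates, fc):
    """Run the bit stream through the LSTM and return class probabilities."""
    hidden = len(gates["i"])
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in bits:
        v = [float(x)] + h + [1.0]
        i = [sigmoid(a) for a in matvec(gates["i"], v)]    # input gate
        f = [sigmoid(a) for a in matvec(gates["f"], v)]    # forget gate
        o = [sigmoid(a) for a in matvec(gates["o"], v)]    # output gate
        g = [math.tanh(a) for a in matvec(gates["g"], v)]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return softmax(matvec(fc, h + [1.0]))


def cross_entropy(probs, label):
    return -math.log(probs[label])


rng = random.Random(0)
gates, fc = init_params(hidden=8, n_classes=6, rng=rng)
probs = forward([0, 1, 0, 0, 1, 0, 0, 0], gates, fc)
estimated_count = probs.index(max(probs))  # arg-max class is the count estimate
```

With untrained weights the probabilities are close to uniform; the point of the sketch is only the shape of the computation, where the final hidden state summarizes the whole blockage sequence, including the order of the blockage events.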

During the Model Training phase, the LSTM network is trained using the original and synthesized training samples. The synthesized data allows the model to generalize better to noisy unseen data.

III-C Online Phase: Human Counting

In this stage, the trained network is used to estimate the unknown crowd count. The WiFi link RSS stream is analyzed to extract the LoS blockage pattern of the current crowd. Feeding this blockage sequence to the trained LSTM activates a forward path to an output class which will be reported as the count estimate.

III-C1 The RSS Collector

The RSS Collector is a lightweight agent running on the receiver that records the temporal change of the Received Signal Strength (RSS) from the transmitter along with its timestamp. A sequence of timestamped readings, , is collected and sent to the subsequent modules, where each reading is streamed as a pair of the RSS value and time .

III-C2 LoS Blocking Detection

(a) Single Person.
(b) Three Persons. For visual clarity, not all the blockages are indicated.
Fig. 4: The RSS measurements of a WiFi link while some persons are moving around. The time instances when a person crosses the LoS are highlighted.

Fig. 4 shows a typical example of the RSS measurements of a deployed WiFi link while a human is moving around. The RSS measurements are significantly attenuated when someone crosses the LoS: there is a down pulse whenever the target person crosses the link LoS, while the other fluctuations are due to the multipath effect [Depatla2018]. The RSS fluctuations due to multipath stay within a certain level around the mean value, while an LoS blockage causes a deeper down pulse. Accordingly, the LoS Blockage Detector of CrossCount captures an LoS blockage by a simple thresholding process on the RSS changes. Specifically, at any time instance, if the RSS drops below its mean value by more than a certain threshold, an LoS blocking event is declared at this instance. Finally, the module converts the detected LoS blockage events into a binary stream whose value at each time step is 1 if an LoS blockage was declared at that step and 0 otherwise.
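The thresholding step can be sketched as below. This is a hedged illustration: the function name `detect_blockages`, the use of the window mean as the reference level, the assumption of one RSS reading per time slot, and the sample values are all choices made here, and a blockage is taken to be a drop below the mean, in line with the down pulses visible in Fig. 4.

```python
def detect_blockages(rss_dbm, threshold_db):
    """Return a 0/1 stream: 1 where RSS falls more than threshold_db below the mean.

    Assumes one RSS reading per time slot and uses the mean over the whole
    window as the reference level (both are assumptions of this sketch).
    """
    mean = sum(rss_dbm) / len(rss_dbm)
    return [1 if (mean - r) > threshold_db else 0 for r in rss_dbm]


# Hypothetical RSS trace in dBm with two deep dips.
rss = [-50, -51, -49, -62, -50, -50, -63, -51]
print(detect_blockages(rss, threshold_db=5))  # [0, 0, 0, 1, 0, 0, 1, 0]
```

A threshold that is too small turns ordinary multipath fluctuation into false blockages, while one that is too large misses real crossings, which is the trade-off studied in Section IV-C1.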

III-C3 Count Estimation

The crowd count in the current counting window is estimated by classifying its blockage pattern. The sequence is fed forward to the trained LSTM, and eventually, CrossCount reports the output class as the human count estimate.

IV Evaluation

In this section, we evaluate the CrossCount performance. We start by describing the experimental testbeds and training process, followed by testing the effect of system parameters on counting accuracy. Finally, we compare CrossCount performance to the state-of-the-art RF human counting systems. Table II contains the default system parameters values used throughout the evaluation section.

Parameter Range
Default value
(Testbed )
Blockage Detection Threshold () (dBm) - ,
Estimation Window () (min.) -
LSTM layer size (units) -

Number of training epochs

Training mini-batch size (samples)
TABLE II: Default system parameters.

IV-A Experimental Testbeds

(a) Testbed 1.
(b) Testbed 2
Fig. 5: Evaluation Testbeds.

We evaluated CrossCount in two testbeds. The first one is a m room in our lab as shown in Fig. 5(a); it is a controlled environment with low multipath where the WiFi link is deployed right in the middle of the room. This is similar to the testbed used in [Depatla2018]. The second testbed is more complex, as shown in Fig. 5(b): it is a larger hall of m, rich in multipath due to its furniture, and the WiFi link is neither aligned with the hall nor in the middle. In both testbeds, the transmitter is mounted to the wall at a height of cm, the receiver is placed cm from the floor, and the walls are made of bricks.

IV-B Training Data

The original training set is collected using a single person who walks in each testbed for minutes generating a -minutes-long sequence. This sequence is divided into sub-sequences of -length each. Following the data synthesis process described in Section III-A3 with the default counting window, training sequences are generated per count class resulting in a total training set of samples when testing using persons. The details and types of the generated training sequences are listed in Table III.
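Dividing the long single-person recording into fixed-length training sequences can be sketched as follows. The function name `split_windows` is hypothetical, and the sketch assumes non-overlapping windows with any incomplete tail discarded; the source does not say whether the sub-sequences overlap.

```python
def split_windows(bits, window_len):
    """Cut a long 0/1 recording into consecutive, non-overlapping windows."""
    n_windows = len(bits) // window_len  # drop the incomplete tail, if any
    return [bits[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


# Toy recording of 10 slots cut into windows of 3 slots.
recording = [0, 1, 0, 0, 0, 1, 1, 0, 0, 1]
print(split_windows(recording, 3))  # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```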

The CrossCount model was trained using the stochastic gradient descent with momentum (SGDM) optimizer, with momentum and initial learning rate. On average, the model training process takes hours to finish on a Dell XPS 8500 desktop with a Core i7 processor and GB memory.

Count Class 0 1 2
Collected Sequences
Synthesized Sequences
Generated Noisy Sequences
TABLE III: Collected Raw data for Training Sequences

IV-C Effect of System Parameters

In this section, we evaluate the effect of changing the system parameters on the counting performance as reported by the Absolute Counting error , where is the summation of the absolute difference between real and estimated counts in all cases.


Where are the real and estimated counts respectively.
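As a worked illustration of this metric, the sketch below sums the absolute differences between real and estimated counts over all test cases, as described in the text. The function name and the sample counts are hypothetical and are not results from the paper.

```python
def absolute_counting_error(real_counts, estimated_counts):
    """Sum of |real - estimated| over all test cases."""
    return sum(abs(r - e) for r, e in zip(real_counts, estimated_counts))


# Hypothetical test cases: errors are 0, 1, 0 and 2.
print(absolute_counting_error([1, 2, 3, 4], [1, 3, 3, 2]))  # 3
```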

IV-C1 LoS Blockage Detection Threshold

Fig. 6: Evaluation of blockage detection threshold.

Fig. 6 shows the absolute counting error at different values of the LoS blockage detection threshold () in both testbeds. For small thresholds, any minor fluctuation in RSS values is falsely detected as an LoS blockage; this increases the LoS blockage rate and reports extra persons in the testbeds. In contrast, high thresholds lead to missing real LoS blockage events, underestimating the actual crowd number and increasing the absolute error. Throughout the rest of the evaluation section, we set and as the default blockage detection thresholds for testbeds and respectively as they lead to the best performance.

IV-C2 Counting window length

Fig. 7: Evaluation of the counting window length.

Figure 7 shows the effect of various time windows on the counting performance. For short windows, the information reflected by the blockages sequence encountered within the window is not enough for providing accurate count estimates. Increasing the window size to minutes, as used in this paper, gives the best counting accuracy. Note that this is the same estimation delay used in literature [Depatla2015, Depatla2018, yoshida2015]. The system designer needs to tune this parameter to trade-off latency and accuracy of estimation based on her specific application need.

IV-D Comparisons with other systems

Real Count CrossCount Depatla et al. [Depatla2018]
Estimation Error Estimation Error
TABLE IV: Counting Results of Testbed . Lighter colors refer to better performance.
Real Count CrossCount Depatla et al. [Depatla2018]
Estimation Error Estimation Error
TABLE V: Counting Results of Testbed . Lighter colors refer to better performance.
(a) Testbed 1
(b) Testbed 2
Fig. 8: CDF of absolute counting error

In this section, we compare the CrossCount performance with Depatla et al. [Depatla2018] as the most recent related work. Moreover, both systems rely on the LoS blockage of a single WiFi link using RSS. For this comparison, volunteers walked through the testbeds. Due to the limited area of Testbed 1, only persons participated. The distribution of the absolute counting error for the two testbeds is reported in Fig. 8 and detailed in Tables IV and V. In Testbed 1, CrossCount achieves exact count accuracy with only a maximum of -count-difference all the time, while Depatla et al. [Depatla2018] delivers only exact count accuracy. Depatla et al. [Depatla2018] do not take advantage of the arrival order of blockage events when processing the given counting window, unlike CrossCount, which considers the whole context information including the arrival order, leading to improved performance. The counting estimates are less accurate in Testbed 2 for both systems due to the environment complexity. CrossCount achieves a lower exact accuracy of while maintaining a maximum error of count difference, while Depatla et al. [Depatla2018] degrades to counting accuracy and in some cases reports count difference out of persons. This can be explained by noting that the mathematical model in Depatla et al. [Depatla2018] is tailored for special cases of testbeds where the WiFi link is in the middle and aligned with the area of interest. These assumptions hold in Testbed 1 but not in Testbed 2, which widens the accuracy advantage of CrossCount over Depatla et al. [Depatla2018] in the second testbed.
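The quantities compared here, exact-count accuracy and the fraction of estimates within a given count difference, are both points on the empirical CDF of the absolute counting error. The sketch below shows one way to compute them; the function name `error_cdf` and the sample counts are hypothetical and do not reproduce the paper's results.

```python
def error_cdf(real_counts, estimated_counts):
    """Empirical CDF of |real - estimated|: maps k to the fraction of cases with error <= k."""
    errors = [abs(r - e) for r, e in zip(real_counts, estimated_counts)]
    n = len(errors)
    return {k: sum(1 for err in errors if err <= k) / n
            for k in range(max(errors) + 1)}


# Hypothetical estimates: per-case errors are 0, 0, 1, 0 and 2.
cdf = error_cdf([1, 2, 3, 4, 5], [1, 2, 4, 4, 3])
print(cdf)  # {0: 0.6, 1: 0.8, 2: 1.0}
```

In this toy example `cdf[0]` is the exact-count accuracy, and the smallest key whose value reaches 1.0 is the maximum count difference.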

V Discussion

In this paper, we proposed a novel idea for counting people by classifying their blockage pattern on WiFi links using deep learning. We introduced a data synthesis technique that augments the training set for robust training and provides a lightweight calibration phase. We conducted many experiments that validated the idea and showed that the proposed data generation technique can capture reality to a reasonable extent. However, the presented system has some limitations that should be addressed in future extensions.

First, CrossCount generates the multiple-person training data by superposing the blockage patterns of a single person. However, further improvement in system performance is expected when simulating the multi-person data as the convolution of the single-person RSS signals. Moreover, the superposition technique could be enhanced by applying left/right translation to the blockage sequences before they are superposed. Second, CrossCount implements a higher-level data noising approach where the blockage events are noised to simulate the RSS changes due to wireless characteristics. However, a lower-level noising approach could be considered by simulating the signal amplitude changes. Finally, CrossCount assumes casual human motion with non-zero speed while collecting the training and testing sequences. Nonetheless, more sophisticated walking patterns could be investigated. In addition, some special scenarios of human movement might be handled, such as when a human stands on the LoS, generating a contiguous blocking sequence.

VI Conclusion

In this paper, we presented the design, implementation, and evaluation of CrossCount: an accurate human counting system based on Recurrent Neural Networks. The system provides different techniques for handling a number of challenges found in the literature such as through-wall signal weakness, labor-intensive data collection, imbalanced training data, high training overhead, high number of data links, and unavailability of CSI data in commodity devices. The main idea is to process the WiFi link blockage inter-arrivals rather than depending on the statistical features extracted from RSS values. By classifying the blockage pattern using an LSTM network, CrossCount achieved higher counting accuracy than the current state-of-the-art single-link RF-based counting systems.

Currently, we are extending CrossCount in different directions including extending the system to work with multiple links and leveraging CSI information when available.