A new approach for pedestrian density estimation using moving sensors and computer vision

An understanding of pedestrian dynamics is indispensable for numerous urban applications, including the design of transportation networks and planning for business development. Pedestrian counting often requires manual or technical means to count individual pedestrians at each location of interest. However, such methods do not scale to the size of a city, and we propose a new approach to fill this gap. In this project, we used a large, dense dataset of images of New York City along with deep learning and computer vision techniques to construct a spatio-temporal map of relative pedestrian density. Due to the limitations of state-of-the-art computer vision methods, such automatic detection of pedestrians is inherently subject to errors. We model these errors as a probabilistic process, which we study through theoretical analysis and numerical simulations. We demonstrate that, within our assumptions, our methodology can supply a reasonable estimate of pedestrian densities, and we provide theoretical bounds for the resulting error.



1. Introduction

Figure 1. Examples of the images utilized for pedestrian detection. Each of these pictures was automatically captured by a vehicle. More than 10 million such images were used for pedestrian detection in Manhattan. Faces and other details have been concealed to protect privacy.

Pedestrians are an integral and pervasive aspect of the urban environment. Real estate, consumer patterns, public safety, and other aspects of city life are deeply intertwined with the variations of pedestrian densities across a city. However, current methods for estimating the distribution of people within a city tend to be expensive and mostly produce a sparse sampling of a few locations.

In this paper, we examine a new method to obtain a dense estimate of pedestrian density. We utilize recent advances in computer vision to find people within previously intractably large collections of images and compile a relative density map.

To take into account the errors inherent to visual object detection, we model detection as a probabilistic process. Using our model, we provide a closed form and bounds for the asymptotic error of the sampling process. We compare these formulas to numerical simulations of the sensing process. Our results suggest that computer vision produces usable data, despite the inherent noise.

To test our method, we utilized over 40 million street-level images provided by Carmera. These images are a portion of those obtained via Carmera's partnerships with high-coverage fleets that operate daily on city streets, and they were captured by vehicles traveling through Manhattan Island in New York City over the course of a year. A sample of this data was used to benchmark several state-of-the-art computer vision algorithms. We then utilized the top-performing algorithm in a case study to map pedestrian densities in Manhattan.

The contributions of this paper can be summarized as follows:

  1. A new method for the analysis of the spatial variation of urban pedestrian densities utilizing state-of-the-art, but imperfect, computer vision algorithms.

  2. A closed form function and bounds for the asymptotic error of the resulting pedestrian densities.

  3. The results of simulations validating the sampling process and the derived asymptotic error.

  4. A benchmark of several detection algorithms, along with the variation in their parameters, for the purpose of pedestrian detection.

  5. A case study demonstrating the resulting densities for a collection of images from the City of New York.

2. Related Work

There are many ongoing efforts to use urban data to achieve citizen-centered improvements (Zheng et al., 2014). Governments and organizations in urban environments collect a vast amount of data daily (USEEPAAD, 2017), encompassing a large assortment of information including mobility, crime and pollution. The collection and use of this information has been attracting attention from academics, governments and corporations (Vanegas et al., 2012). The work of (Arietta et al., 2014) explores the correlation between the visual appearance of pictures and the attributes of the region they depict. The authors collected images from (Google Inc., 2017) along with indicators from multiple regions and trained a model (Burges, 1998) to predict the indicators from the images. The city attributes include violent crime rates, theft rates, housing prices, population density and tree presence. Their results show that visual data can be efficiently used to predict region attributes. Additionally, a regressor trained in one region showed reasonable results when tested in a different city.

A pedestrian map of the city has numerous applications for urban planners, including the design of public transport networks and of public spaces (Whyte, 2012). One approach to obtaining a citywide count of pedestrians is to have people scattered around the city manually counting the pedestrians nearby. This approach is laborious because it requires dedicated people to perform the measurements. Another possibility, explored in (Reades et al., 2007), is to use cellphone usage data to perform the pedestrian count. One clear limitation of this approach is that these data are not public and their coverage is restricted to places where the carrier signal is present. Additionally, it is hard to know whether a cell signal comes from a pedestrian, from someone in a building, or from someone in a car.

Alternatively, we can consider the visual task of finding the pedestrians in city images. A remarkable work on this task uses histograms of oriented gradients as the feature vector and a support vector machine for classification (Dalal and Triggs, 2005). In the context of deep neural networks (Krizhevsky et al., 2012; Szegedy et al., 2015), the work of (Ren et al., 2015) introduced an approach that tries to solve this task using a unified network that performs both region proposal and classification. In this way, the method accepts annotations of objects of multiple sizes during training and, during testing, classifies those objects in images of arbitrary sizes. In (Dai et al., 2016) the authors follow the two-stage region proposal and classification framework of (Ren et al., 2015) and propose Region-based Fully Convolutional Networks (R-FCN), which incorporate the idea of position-sensitive score maps to reduce the computational burden by sharing the per-RoI computation. Such speed improvements allow the incorporation of classification backbones such as (He et al., 2016).

There are several city image repositories that include pedestrians, some of them obtained using static cameras (Vezzani and Cucchiara, 2010; Tokuda et al., 2018; Oh et al., 2011) and others obtained using moving ones (Geiger et al., 2013; Cordts et al., 2016; Maddern et al., 2017). Such sensor arrangements have long been studied in the sensor network field (Akyildiz et al., 2007; Akyildiz et al., 2002; Othman and Shazali, 2012), and an important aspect of these networks is whether the sensors are static or mobile. In (Wang et al., 2003) the authors explore a network composed of both static and mobile sensors. The holes in the coverage of the static sensor network are identified and the mobile sensors are used to cover them. A common problem in sensor networks is the k-coverage problem defined in (Huang and Tseng, 2005), which aims to find the optimal arrangement of sensors such that any region is covered by at least k sensors. In (Yang et al., 2003) the authors perform the task of counting people based on images obtained through a wireless network of static sensors.

Apart from controllable mobile sensor networks, many works explore data collected from collaborative uncontrolled sensors (Basagni et al., 2007), such as vehicle GPS (Shi et al., 2009; Karagiorgou et al., 2017), mobile phone sensors (Sheng et al., 2012; Lane et al., 2010; Rana et al., 2010) and even on-body sensors (Consolvo et al., 2008).

The work of (Li et al., 2017) considers the problem of using GPS data from a network of uncontrolled sensors to reconstruct the traffic in a city. They do so in two steps: initial traffic reconstruction and dynamic data completion. This approach allowed the authors to obtain a complete traffic map and a 2D visualization of the traffic.

There are many ways to model the movement of mobile nodes in a sensor network, the so-called mobility models (Camp et al., 2002). A simple one is the random walk mobility model (Davies et al., 2000), where at each instant in time each particle is given a direction and a speed. In the random waypoint mobility model (Johnson and Maltz, 1996), in turn, particles are given destinations and speeds. They travel toward their goal and, once they reach the destination, a new goal and speed are given. The Gauss-Markov mobility model (Liang and Haas, 1999) attempts to eliminate the abrupt stops and sharp turns present in the random waypoint mobility model. It does so by computing the current position based on the previous position, speed and direction.

Simulation of wireless sensor networks has long been studied (Vinyals et al., 2011; Lesser et al., 2012; Niazi and Hussain, 2011) because it allows a complete analysis of system architectures by providing a controlled environment for the system (Titzer et al., 2005). The non-determinism of real-life systems is simulated using pseudorandom number generators (Knuth, 1997). Among the large number of pseudorandom number generators (Park and Miller, 1988), a popular algorithm is the Mersenne Twister (Matsumoto and Nishimura, 1998), due to its efficiency and robustness.

3. Pedestrians and sensors flow model

Figure 2. A hypothetical illustration of the type of detection errors considered in this paper. The person on the left was not identified by the detector and is a false-negative. The rightmost detection is a false-positive. The two correct detections in the center are true-positives. Notably missing are true-negatives, which are not a useful concept in this situation due to their overwhelming number. Faces and other details have been blurred to protect privacy.

As current pedestrian detection algorithms are far from perfect, it is natural to wonder about the accuracy of any pedestrian count resulting from their use. In this section we provide a theoretical analysis of the effect of algorithmic errors on the final count.

In our model, we assume that the world is divided into a number of small regions, or buckets, for each of which we intend to measure a density. Sensors and people move around the world in some random fashion. At regular intervals, each sensor takes an independent measurement of the nearby pedestrian count and updates the recorded density at its current location. More formally, each time a sensor takes a sample, it obtains a measurement represented by a random variable. While we don't specify the distribution of this variable, we assume that its expected value is linear in the actual number of people at the location and time being sensed: that number is scaled by the success rate of the vision algorithm, and a constant term given by the algorithm's false positive rate is added (Equation 1).
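As a compact way to write the linear assumption described above, the following sketch uses symbol names chosen here purely for illustration: $X$ for a single measurement, $n$ for the actual number of people being sensed, $\alpha$ for the success rate of the vision algorithm, and $\beta$ for its false positive term. These names are assumptions and need not match the notation of the original equations.

```latex
% Illustrative notation only (assumed symbols, see the lead-in).
\mathbb{E}[X] = \alpha\, n + \beta
```

For instance, under this reading, a detector with $\alpha = 0.8$ and $\beta = 0.3$ observing $n = 5$ people would report $0.8 \cdot 5 + 0.3 = 4.3$ pedestrians on average.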

The result of this process is the density of people at each location, .


For comparison, the ground truth density , defined respectively by (where is the number of steps and samples),


We show in Appendix B, Equation 16 that the expected value of is


In other words, the sensed density is a biased estimator of the ground truth density. Unless our sensing algorithm precisely follows Equation 4, we are unable to transform this biased estimator into an unbiased one. Furthermore, even in the ideal case, the success rate and the false positive rate may not be known. Instead, we directly utilize the sensed density and attempt to find a relative histogram. That is, we expect to get a number proportional to the density of people at a location and not the actual density. As such, for any constant, our density is equivalent to one scaled by that constant. Treating the distribution as a vector, we measure the direction but not the magnitude. In the terminology of group theory, our measurement suggests a density within the equivalence class:


To validate our measurement we need a metric that indicates how well the equivalence class compares to the ground truth distribution. To do that, we compare the ground truth to the unique closest element within the equivalence class. As a vector projection, this minimum element is (see Appendix A for a proof):


which we can then compare using the usual Euclidean metric. However, this metric depends on the number of locations in the map, as well as the number of people. As such, we normalize the metric to lie between 0 and 1, obtaining a final metric:


In Appendix B we show that, over long periods of time, we expect the asymptotic error to approach Equation 24:


Here is the average density of people and describes the distribution of . However, can best be thought of as parameters that describe the asymptotic error. Both of these parameters depend on the resolution of the heat map in addition to pedestrian distribution. In many cases can not be determined, as such we can use the inequality in Equation 26 of Appendix B:


It is important to note that needs to be the ground truth density of people, in the same units of . If only the sampled average density,

, is known, the unbiased estimator of

, can be used. This leads to the bounds


This final formula depends only on the false positive rate of the sensing algorithm and the average density of sensed objects measured by the process, making it suitable for practical sensing applications. We wish to emphasize that this inequality is true whenever Equation 4 holds, regardless of the underlying probability distribution. This function is only useful when

. In that domain, it is a monotonically increasing function of . Thus, if is not precisely known, it is best to err on the side of larger values.

3.1. Simulation

Figure 3. Left: An illustration of our simulation containing a sensor (center) moving through an environment with numerous pedestrians. Right: Each sensor moves with a uniform speed and is able to sense people within a fixed radius. Each sensing operation has a fixed probability of correctly detecting each person and, in expectation, finds a fixed number of false positives. People also move with a uniform speed.

The real-life acquisition process lacks some of the simplifications we used in our model. For example, samples taken in spatial and temporal proximity are correlated. To examine the performance of the sensing systems in the face of these non-ideal circumstances, we created a discrete event simulation (Law et al., 2007) to compare sensed distributions to known ground truths.

As illustrated in Figure 3, we simulated a number of mobile sensors that detect nearby particles. Each sensor has a circular coverage area of fixed radius. Collisions among particles and sensors are ignored for simplicity. Sensors and particles each move with their own uniform speed. The simulation world is mapped as a graph, as in (Tian et al., 2002). Each node in the graph is a point traversable by both sensors and particles, and each edge represents a path between its end nodes.

We assume that, in each time step, each sensor has an independent chance of detecting each person within range (the true positive rate), along with an independent chance per location of obtaining a false positive. These assumptions lead to each measurement being sampled from the sum of a binomial distribution, whose mean is the true positive rate times the number of people within range, and a Poisson process with a given expected number of false positives. A calculation of the expected value indicates that Equation 1 is satisfied and that our theoretical error calculations and bounds should be valid.
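The sampling assumption above can be checked numerically with a minimal sketch. The function names and the parameter names `alpha` (true positive rate) and `beta` (expected number of false positives) are hypothetical, chosen for this illustration, and the Poisson sampler uses Knuth's multiplication method rather than whatever the original implementation used.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson variate with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sense(rng, n_people, alpha, beta):
    """One noisy measurement: binomial true positives plus Poisson false positives."""
    # Each person in range is detected independently with probability alpha.
    true_positives = sum(rng.random() < alpha for _ in range(n_people))
    return true_positives + sample_poisson(rng, beta)


def mean_measurement(n_people, alpha, beta, n_samples=20000, seed=0):
    """Empirical mean of repeated measurements (assumed names, illustration only)."""
    rng = random.Random(seed)  # Python's random.Random is a Mersenne Twister
    total = sum(sense(rng, n_people, alpha, beta) for _ in range(n_samples))
    return total / n_samples
```

With 5 people in range, `alpha = 0.8` and `beta = 0.3`, the empirical mean should lie close to 0.8 · 5 + 0.3 = 4.3, which is the linear relationship the text relies on.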

Initialization(map, currentposition, destination);
while indefinitely do
       if currentposition = destination then
             destination ← random(map)
             path ← A*(currentposition, destination, map)
       end if
       currentposition ← pop(path);
end while
Algorithm 1. Mobility model of the sensors and particles in the simulation.
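The loop in Algorithm 1 can be sketched in runnable form as follows. This is only an illustration under stated assumptions: the map is replaced by a 4-connected rectangular grid with unit edge costs and more than one node, the agent advances one node per step, and the names `neighbors`, `a_star` and `walk` are hypothetical.

```python
import heapq
import random


def neighbors(node, width, height):
    """4-connected neighbours of a grid node (an assumed stand-in for the map graph)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def a_star(start, goal, width, height):
    """Shortest path from start to goal (start excluded), Manhattan heuristic."""
    def h(node):
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    frontier = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            break
        if g > cost[node]:
            continue  # stale queue entry
        for nxt in neighbors(node, width, height):
            new_cost = g + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = node
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    path, node = [], goal
    while node != start:  # walk back from the goal to rebuild the path
        path.append(node)
        node = came_from[node]
    path.reverse()
    return path


def walk(start, steps, width, height, seed=0):
    """Move one agent for `steps` steps, picking a new random goal on arrival."""
    rng = random.Random(seed)
    position, path, visited = start, [], [start]
    for _ in range(steps):
        while not path:  # arrived (or just started): choose a new destination
            destination = (rng.randrange(width), rng.randrange(height))
            path = a_star(position, destination, width, height)
        position = path.pop(0)
        visited.append(position)
    return visited
```

A faster agent, such as a sensor moving at three times the speed of a person, could be approximated in this sketch by popping several path nodes per time step.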

The system state can be described by several state variables: sensor and particle positions, sensor and particle waypoints, the real density of particles and the sensed density of particles. Sensors and particles move according to a variation of the random waypoint model (Johnson and Maltz, 1996), differing from it in that sensors and particles are not allowed to change speed; each has a fixed speed given by the system parameters. When a new destination is randomly picked, the trajectory on the map graph is computed using the A* algorithm (Hart et al., 1968) and the points of the trajectory are pushed to a heap (please refer to Algorithm 1).

As time progresses we obtain a 2D histogram for the sensed density as well as the ground truth density of particles. We are primarily interested in the difference between them, as given by the metric in Equation 7.
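The comparison between the two histograms can be sketched as follows. The projection step follows the description in Section 3 (the closest rescaling of the sensed histogram to the ground truth). The normalization is an assumption made for this illustration: dividing by the norm of the ground truth vector yields a value between 0 and 1, but the exact normalization of Equation 7 may differ. The function name `relative_error` is hypothetical.

```python
import math


def relative_error(sensed, truth):
    """Distance from `truth` to the closest rescaling of `sensed`, divided by |truth|."""
    dot = sum(s * t for s, t in zip(sensed, truth))
    sensed_sq = sum(s * s for s in sensed)
    truth_norm = math.sqrt(sum(t * t for t in truth))
    if sensed_sq == 0 or truth_norm == 0:
        raise ValueError("both histograms must be non-zero")
    # Projection coefficient: the closest element of the class is scale * sensed.
    scale = dot / sensed_sq
    residual = math.sqrt(sum((t - scale * s) ** 2 for s, t in zip(sensed, truth)))
    return residual / truth_norm  # assumed normalization, see the lead-in
```

Under this choice, a sensed histogram that is an exact multiple of the ground truth scores 0, and one that is orthogonal to it scores 1, which matches the idea of measuring direction but not magnitude.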

The source code of this implementation is publicly available.

Parameter | Symbol | Values
Number of people | | 50000
Person speed | | 1
Sensor speed | | 3
Number of sensors | | 10000
Sensor true positive rate | | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
Sensor exp. number of false positives | | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2
Sensor range | | 1

Table 1. Parameters of the simulation along with the values we used in the experiments.

3.2. Simulation results

We evaluated different true positive rates and expected numbers of false positives of the sensors, and we used a Mersenne Twister pseudorandom number generator (Matsumoto and Nishimura, 1998).

The values of the parameters used in our experiments are listed in Table 1. For each of the 143 possible combinations of values, we ran the simulation multiple independent times, each for a fixed number of time steps.

The code is primarily implemented in Python, with performance-sensitive sections implemented in Cython (Behnel et al., 2011). Because running every experiment sequentially on a single processor of a single machine would have taken far longer, we ran the experiments in parallel.

For each experiment we examine the decay of the metric given by Equation 7 as a function of the cumulative number of samples captured by all the sensors. We assume that the error decays until it reaches an asymptotic minimum within the simulated time steps. Afterwards, we take the average decay curve over all runs for each parameter configuration and average its last values to find the asymptotic value.
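The averaging procedure just described can be sketched in a few lines. The function name `asymptotic_error` and the parameter `tail` (how many final points are averaged) are hypothetical; the number of points actually used is not restated here.

```python
def asymptotic_error(runs, tail):
    """Average the decay curves pointwise, then average the last `tail` points.

    runs: list of error curves, one list of floats per independent run.
    tail: number of final points to average (assumed parameter).
    """
    n_steps = min(len(run) for run in runs)
    if not 0 < tail <= n_steps:
        raise ValueError("tail must be between 1 and the curve length")
    mean_curve = [sum(run[i] for run in runs) / len(runs) for i in range(n_steps)]
    return sum(mean_curve[-tail:]) / tail
```

For example, two runs `[1.0, 0.5, 0.2, 0.2]` and `[1.0, 0.7, 0.4, 0.2]` have the mean curve `[1.0, 0.6, 0.3, 0.2]`, so averaging the last two points gives roughly 0.25.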

Figure 4. Left: Asymptotic error metric between the sensed and ground truth histograms in our simulation, plotted as a function of the sensors' true positive rate and expected number of false positives. Right: Comparison of the asymptotic histogram error measured from the simulation experiments (blue dots) to the theoretical closed-form function (Equation 8, green mesh) and the approximate bound (Equation 9, red mesh). Note that is not shown due to the denominator of Equation 9.

Figure 4 visualizes the results of the simulations, showing how variations in the true positive rate and the expected number of false positives affect our histogram error. If we take a horizontal profile at a fixed true positive rate, we can see that the error is strongly affected by the expected number of false positives, varying from very low to high error values (represented by the variation in color saturation). We compare these values to our theoretical formula (see Equation 8) and show that they are approximately equal. Finally, we show that they are within the bound given by Equation 9.

4. Computer Vision Sensing

Carmera uses a fleet of camera-equipped cars (such as in (Lee et al., 2009)) traveling through Manhattan to acquire a temporally and spatially dense collection of pictures. The orientation of the cameras varies and the nature of the images is similar to the street-level collections provided by many mapping services. However, the images are not stitched into a panorama. Every image is accompanied by metadata including the acquisition time, location, and camera orientation. The images are captured as the vehicle travels, with no control over the content, the illumination, the weather, the traffic conditions, or the vehicle's speed. The typical image depicts an urban scene as background, together with city dynamics including pedestrians, vehicles and bicycles, as in Figure 1. Our dataset differs from several existing datasets (Vezzani and Cucchiara, 2010; Tokuda et al., 2018; Oh et al., 2011; Geiger et al., 2013; Cordts et al., 2016; Maddern et al., 2017) by providing dense temporal coverage in addition to dense spatial coverage.

All images included in the sample have the same resolution. We used a sample of 10,708,953 images captured from March 2016 to February 2017. This sample presents a dense spatial sampling of the whole region over a year, but an irregular spatio-temporal sampling on a daily basis (see Figure 5). All resulting heatmaps are weighted according to this sampling distribution.

Figure 5. Distribution of pictures by day of the week and by hour of the day. Our resulting pedestrian density is approximately a sum of the time varying densities weighted by this distribution.

We evaluated how three computer vision algorithms for pedestrian detection perform on our dataset. The first one is based on histogram of oriented gradients features (Dalal and Triggs, 2005). The second one is based on the extraction of features by means of convolutional neural networks (Ren et al., 2015). The third utilizes fully convolutional networks for accuracy and speed improvements (Dai et al., 2016).

We manually tagged images to use as ground truth. We adopt the same metric as Everingham et al. (2010) when comparing the detected objects in an image to the ground truth. A detected object is considered to correspond to a particular ground truth object if the ratio between the overlap of the detected and ground-truth bounding boxes and the union of their two areas reaches a minimum value (see Equation 11).
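The overlap criterion described above is the intersection-over-union ratio, which can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and the default threshold of 0.5 (the usual Pascal VOC convention, assumed here rather than taken from this text) are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(detected, ground_truth, min_iou=0.5):
    """True if the detection overlaps the annotation enough (threshold is assumed)."""
    return iou(detected, ground_truth) >= min_iou
```

For two 2 by 2 boxes offset by one unit in each direction, the intersection has area 1 and the union has area 7, so the ratio is 1/7 and the pair would not count as a match at a 0.5 threshold.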


The recognition of distant objects in an image is difficult for humans and is even more difficult for computers. We assume that, on average, the size of a person within an image is an indicator of the distance from that person to the sensor, and we try to improve accuracy by imposing a minimum size on the people detected. Thus, bounding boxes smaller than a new hyperparameter threshold are ignored, as shown in Figure 6.


Figure 6. Variation of the ground truth annotations for different minimal person size thresholds. When the threshold is small (left) all people in the images are annotated. As the threshold increases (middle and right) the number of annotated people decreases. Those remaining tend to be closer to the camera. Faces and other details have been blurred to protect privacy.

As discussed below, we decided to utilize R-FCN, which we ran over our entire data set in parallel and created a database with the number of pedestrians detected in each image. This database is then aggregated in space and time to create a visualization of the pedestrian counts by finding the average number of pedestrians per image in each region.
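The counting and aggregation steps can be sketched as follows, under several assumptions: each detection is reduced to a `(score, box_height_in_pixels)` pair, each image record is reduced to planar coordinates plus its pedestrian count, and regions are square grid cells. The function names, the record layouts and the `cell_size` parameter are hypothetical; the 120-pixel default is the height threshold quoted in the case study.

```python
from collections import defaultdict


def count_pedestrians(detections, min_score, min_height=120):
    """Count detections that pass the score and bounding-box height thresholds.

    detections: iterable of (score, box_height_in_pixels) pairs (assumed layout).
    """
    return sum(1 for score, height in detections
               if score >= min_score and height >= min_height)


def average_count_per_cell(records, cell_size):
    """Mean number of pedestrians per image in each square cell.

    records: iterable of (x, y, pedestrian_count), one entry per image.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of counts, number of images]
    for x, y, count in records:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell][0] += count
        totals[cell][1] += 1
    return {cell: total / n_images for cell, (total, n_images) in totals.items()}
```

Averaging per image, rather than summing, keeps cells that were photographed more often from appearing denser merely because they were sampled more.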

4.1. Survey of Algorithms

Figure 7. Comparison of HoG (Dalal and Triggs, 2005), Faster R-CNN (Ren et al., 2015) and R-FCN (Dai et al., 2016) detection on our dataset. Left: The precision and recall for each configuration of method parameters and minimum size threshold; points on the same line represent the results for the same ground-truth height threshold. In this graph the upper left corner represents an ideal algorithm. Right: The true positive rate versus the average number of false positives for the same set of parameters and ground-truth thresholds. Here, the upper left corner represents an ideal algorithm.

We used a total of images, covering the region of Manhattan, Monday to Friday from 7am to 6pm. We evaluated three methods for the task of people detection (Dalal and Triggs, 2005; Ren et al., 2015; Dai et al., 2016) over a sample of our dataset. We used the Matlab (The MathWorks, Inc., 2017) implementation of (Dalal and Triggs, 2005), with an stride of the detection window, for the pyramid scaling factor and model trained on the resolution images from the INRIA pedestrian dataset (Dalal and Triggs, 2005). The detection thresholds ranged from to , spaced by . The implementation of (Ren et al., 2015) is published by the authors and the model we used is a VGG16 network (Simonyan and Zisserman, 2014) trained with Pascal VOC 2007 dataset (Everingham et al., 2010) with a non-maximum suppression (Kitchen and Rosenfeld, 1982) threshold of . We evaluated the method with scores ranging from to score, spaced by . The R-FCN algorithm (Dai et al., 2016) was also trained on the Pascal VOC 2007 dataset but with the 101-layers neural network architecture proposed by (He et al., 2016). Here again, we evaluated the method with detection scores ranging from to , spaced by .

Figure 7 shows the results of the evaluation of the three methods over a random sample of 600 images of our dataset. The images were manually annotated and precision and recall values were computed. Ground-truth pedestrians in this comparison included tiny pedestrians, which explains such low values for recall. We can see that the overall accuracy of R-FCN was the best in our experiments. The detection times for each image are on average 5.7s for (Dalal and Triggs, 2005), 3.9s for (Ren et al., 2015) and 4.1s for (Dai et al., 2016).
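The quantities plotted in Figure 7 can be computed from per-image tallies, as in the sketch below. It assumes that detections have already been matched to the annotations, so that each image contributes a `(true_positives, false_positives, false_negatives)` triple; the function name `summarize` and this record layout are hypothetical.

```python
def summarize(per_image):
    """Precision, recall and average false positives per image.

    per_image: list of (true_positives, false_positives, false_negatives) tuples.
    """
    tp = sum(t for t, _, _ in per_image)
    fp = sum(f for _, f, _ in per_image)
    fn = sum(m for _, _, m in per_image)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall equals the true positive rate
    false_positives_per_image = fp / len(per_image)
    return precision, recall, false_positives_per_image
```

For two images with tallies `(2, 1, 1)` and `(1, 0, 2)`, this gives a precision of 0.75, a recall of 0.5, and 0.5 false positives per image.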

Figure 8. Evaluation of R-FCN (Dai et al., 2016) for different ground-truth height thresholds. The utilized model has a ResNet-101 backbone (He et al., 2016) trained on the Pascal VOC 2007 dataset (Everingham et al., 2010).

None of the methods in Figure 7 achieves high recall, which reflects the inherent difficulty that object detectors have with small objects, as discussed in Section 4. To mitigate this issue, the detection model we propose assumes a finite radius of coverage (see Figure 3), and thus we establish a limit on the size of the objects detected in the image. Figure 8 shows the results of the adopted detector over our sample as we vary the minimum acceptable height. As we can see, the higher the ground-truth height threshold, the higher the precision and especially the recall of the method.

4.2. Case Study

Figure 9. A comparison of the ground truth pedestrian count and the measured pedestrian count from the tagged test images. While the actual true positive and false positive counts do not match their expected statistics (left), the total measured pedestrian count can be closely approximated as linear (right). It should be noted that this is only an approximation as, even taking sampling errors into account, the mean measured counts do not fit a linear model. Error bars are the confidence interval of the mean, calculated by assuming the sampling process described in Section 3.

Based on the results of Section 4.1 we adopted an R-FCN using a residual network of 101 layers (He et al., 2016) trained on Pascal VOC 2007 (Everingham et al., 2010), as proposed by (Dai et al., 2016). The model was trained using a weight decay of and a momentum of . Assuming a method minimum score of and height threshold of 120 pixels, overall pedestrians were detected.

We compared the measured pedestrian count against the number of ground truth pedestrians in each of the 600 manually labeled images to test the linear assumption used in Equation 4. Error bars for the mean were computed using the to values of the median of the appropriate sample process given in Section 3. We measured the true positive rate () to be and the average number of false positives () to be of .
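One simple way to examine such a linear relationship is an ordinary least-squares line through the mean measured count at each ground truth count. This is a hypothetical check written for illustration, not necessarily the procedure used for the measurements above; under the linear model, the slope would play the role of the true positive rate and the intercept the role of the average number of false positives. The function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x
```

With synthetic data generated by a slope of 0.8 and an intercept of 0.3, for instance `xs = [0, 1, 2, 3]` and `ys = [0.3, 1.1, 1.9, 2.7]`, the fit recovers approximately those two values; systematic residuals at small counts would indicate the kind of deviation from linearity discussed in this section.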

Figure 10. Visualization of the pedestrian density in Manhattan. The color scale represents the relative density of pedestrians. Left: The heatmap over the island of Manhattan. Right: The same heatmap enlarged to show the details of midtown and surrounding areas. Underlying map data taken from OpenStreetMap (OpenStreetMap, 2017). Not drawn to scale.

Figure 11. Examples of locations with increased pedestrian densities. Future studies may be able to use these correlations to better understand how cities interact with pedestrians. Underlying map data taken from OpenStreetMap (OpenStreetMap, 2017). Not drawn to scale.

As shown in Figure 9, the actual numbers of true positives and false positives do not individually fit the linear and constant assumptions that we proposed in Section 3. However, the total number of pedestrians detected is closer to being linear, despite statistically significant deviations. These stem from the vision algorithm's better than expected performance for images without any pedestrians and worse than expected performance for images with a single person. While we do not know how these deviations would affect the error bounds given in Equations 9 and 10, we hypothesize that the two deviations would cancel each other out and that the bounds may still approximately hold with a slightly larger equivalent false positive rate.

A visualization of the density of pedestrians across all of Manhattan can be seen in Figure 10. For these maps, we measured an average pedestrian density of 0.587 which, following Equation 9, determines our estimate of the error. The actual error may be larger due to the deviations from linearity discussed above.

Pedestrian distributions like ours can be useful for city planning, commercial, and other purposes. Depending on the task at hand, a large pedestrian density can be beneficial or detrimental. Taxis seeking riders, food trucks seeking customers, and businesses seeking storefronts all benefit from large crowds. However, traffic and self-driving cars do not. Knowledge of pedestrian densities can allow city planners, civil engineers, and traffic engineers to make better decisions.

Our pedestrian map can also show the effect that features of the city have on its people. As shown in Figure 11, populated neighborhoods, subway stations, and attractions like the Metropolitan Museum of Art are all associated with spikes in pedestrian density. These spikes might be too localized to be detected using traditional methods. Further studies of vision-based pedestrian counts may lead to a better understanding of the interplay between a city's environment and its occupants' walking habits.

5. Conclusion

In this project we used a large set of images from a region of Manhattan and automatically detected the number of pedestrians in each image. As a result we obtained a map of pedestrians in the region given by the spatio-temporal sampling. Additionally, we modelled the errors in this process by simulating a sensor network with probabilistic detections. Our results give evidence that, even with a faulty detection model, such a process can still be used to obtain a reliable map of pedestrians in the region.

Beyond the results presented, several avenues for future study remain, as discussed next. First, we should caution that any application of our methodology should include statistical tests to ensure that its results are statistically significant. While we set bounds on the asymptotic error after the sampling process converges, we have only provided case studies and heuristics for the time to convergence. It would be interesting to find a formal bound on the time to convergence, as well as to provide guidelines for the appropriate statistical tests to validate the data after collection.

Our experiments could be extended to consider alternative mobility models (Camp et al., 2002), dynamics models including macroscopic ones (Helbing, 1998, 2001; Iwata et al., 2017), more recent detection methods (Lin et al., 2018; Hajic jr et al., 2018), and combinations of them (Tokuda et al., 2013). We can also use data completion algorithms (Gandy et al., 2011; Li et al., 2013; Li et al., 2017) to reconstruct a city-wide pedestrian map.

The generated pedestrian map can then be combined with other urban datasets, such as those from Socrata (NYC open data, 2017), and with data on weather, crime rate, census, public transportation, bicycles, and shadows (Miranda et al., 2018). We additionally aim to explore apparently disparate datasets, such as those on wind and garbage collection.

Another direction for future work is to incorporate advances such as those from Photosynth (Photosynth, 2017) to visualize our images in the context of the city and to use this visualization to gain additional insights into the other datasets analyzed in Urbane. As a first pass, we are working to render the photographs in the locations where they were captured. We hope to use Structure from Motion (Koenderink and Van Doorn, 1991) to improve the accuracy of image location as well as to find the orientation in which the images were captured.

Additionally, we hope to use 3D popups and/or photo-based rendering to fully enhance the images in the three-dimensional environments. It is our hope that the context of the images will allow users to better understand the different datasets that are analyzed in Urbane.


We thank Carmera for their collaboration. We also thank Harish Doraiswamy, Fabio Miranda, and Alexandru Telea for providing insights, comments, and suggestions that greatly contributed to this work. This work was supported in part by: NSF awards CNS-1229185, CCF-1533564, CNS-1544753, CNS-1730396, CNS-1828576; FAPESP (grants #14/24918-0 and #2015/22308-2); the Moore-Sloan Data Science Environment at NYU; and C2SMART. C. T. Silva is partially supported by the DARPA D3M program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.


  • Akyildiz et al. (2007) Ian F Akyildiz, Tommaso Melodia, and Kaushik R Chowdhury. 2007. A survey on wireless multimedia sensor networks. Computer networks 51, 4 (2007), 921–960.
  • Akyildiz et al. (2002) Ian F Akyildiz, Weilian Su, Yogesh Sankarasubramaniam, and Erdal Cayirci. 2002. A survey on sensor networks. IEEE Communications magazine 40, 8 (2002), 102–114.
  • Arietta et al. (2014) Sean M Arietta, Alexei A Efros, Ravi Ramamoorthi, and Maneesh Agrawala. 2014. City forensics: Using visual elements to predict non-visual city attributes. IEEE transactions on visualization and computer graphics 20, 12 (2014), 2624–2633.
  • Basagni et al. (2007) Stefano Basagni, Alessio Carosi, and Chiara Petrioli. 2007. Controlled vs. uncontrolled mobility in wireless sensor networks: Some performance insights. In Vehicular Technology Conference, 2007. VTC-2007 Fall. 2007 IEEE 66th. IEEE, Maryland, USA, 269–273.
  • Behnel et al. (2011) S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D.S. Seljebotn, and K. Smith. 2011. Cython: The Best of Both Worlds. Computing in Science Engineering 13, 2 (2011), 31 –39. https://doi.org/10.1109/MCSE.2010.118
  • Burges (1998) Christopher JC Burges. 1998. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2, 2 (1998), 121–167.
  • Camp et al. (2002) Tracy Camp, Jeff Boleng, and Vanessa Davies. 2002. A survey of mobility models for ad hoc network research. Wireless communications and mobile computing 2, 5 (2002), 483–502.
  • Consolvo et al. (2008) Sunny Consolvo, David W McDonald, Tammy Toscos, Mike Y Chen, Jon Froehlich, Beverly Harrison, Predrag Klasnja, Anthony LaMarca, Louis LeGrand, Ryan Libby, et al. 2008. Activity sensing in the wild: a field trial of ubifit garden. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Florence, Italy, 1797–1806.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, Nevada, USA, 3213–3223.
  • Dai et al. (2016) Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv preprint arXiv:1605.06409 (2016).
  • Dalal and Triggs (2005) Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1. IEEE, California, USA, 886–893.
  • Davies et al. (2000) Vanessa Ann Davies et al. 2000. Evaluating mobility models within an ad hoc network. Master’s thesis. Citeseer.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
  • Gandy et al. (2011) Silvia Gandy, Benjamin Recht, and Isao Yamada. 2011. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems 27, 2 (2011), 025010.
  • Geiger et al. (2013) Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR) (2013).
  • Google Inc. (2017) Google Inc. https://maps.google.com. (Last accessed March 2017).
  • Hajic jr et al. (2018) Jan Hajic jr, Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. 2018. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France. 23–27.
  • Hart et al. (1968) Peter E Hart, Nils J Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4, 2 (1968), 100–107.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE.
  • Helbing (1998) Dirk Helbing. 1998. A fluid dynamic model for the movement of pedestrians. arXiv preprint cond-mat/9805213 (1998).
  • Helbing (2001) Dirk Helbing. 2001. Traffic and related self-driven many-particle systems. Reviews of modern physics 73, 4 (2001), 1067.
  • Huang and Tseng (2005) Chi-Fu Huang and Yu-Chee Tseng. 2005. The coverage problem in a wireless sensor network. Mobile Networks and Applications 10, 4 (2005), 519–528.
  • Iwata et al. (2017) Tomoharu Iwata, Hitoshi Shimizu, Futoshi Naya, and Naonori Ueda. 2017. Estimating People Flow from Spatiotemporal Population Data via Collective Graphical Mixture Models. ACM Transactions on Spatial Algorithms and Systems (TSAS) 3, 1 (2017), 2.
  • Johnson and Maltz (1996) David B Johnson and David A Maltz. 1996. Dynamic source routing in ad hoc wireless networks. Mobile computing 353, 1 (1996), 153–181.
  • Karagiorgou et al. (2017) Sophia Karagiorgou, Dieter Pfoser, and Dimitrios Skoutas. 2017. A layered approach for more robust generation of road network maps from vehicle tracking data. ACM Transactions on Spatial Algorithms and Systems (TSAS) 3, 1 (2017), 3.
  • Kitchen and Rosenfeld (1982) Les Kitchen and Azriel Rosenfeld. 1982. Gray-level corner detection. Pattern recognition letters 1, 2 (1982), 95–102.
  • Kleinrock (1976) Leonard Kleinrock. 1976. Queueing systems, volume 2: Computer applications. Vol. 66. wiley New York.
  • Knuth (1997) Donald Ervin Knuth. 1997. The art of computer programming. Vol. 3. Pearson Education.
  • Koenderink and Van Doorn (1991) Jan J Koenderink and Andrea J Van Doorn. 1991. Affine structure from motion. JOSA A 8, 2 (1991), 377–385.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. Nevada, USA, 1097–1105.
  • Lane et al. (2010) Nicholas D Lane, Emiliano Miluzzo, Hong Lu, Daniel Peebles, Tanzeem Choudhury, and Andrew T Campbell. 2010. A survey of mobile phone sensing. IEEE Communications magazine 48, 9 (2010), 140–150.
  • Law et al. (2007) Averill M Law and W David Kelton. 2007. Simulation modeling and analysis. Vol. 3. McGraw-Hill New York, Arizona, USA.
  • Lee et al. (2009) Uichin Lee, Eugenio Magistretti, Mario Gerla, Paolo Bellavista, and Antonio Corradi. 2009. Dissemination and harvesting of urban data using vehicular sensing platforms. IEEE Transactions on Vehicular Technology 58, 2 (2009), 882–901.
  • Lesser et al. (2012) Victor Lesser, Charles L Ortiz Jr, and Milind Tambe. 2012. Distributed sensor networks: A multiagent perspective. Vol. 9. Springer Science & Business Media.
  • Li et al. (2013) Li Li, Yuebiao Li, and Zhiheng Li. 2013. Efficient missing data imputing for traffic flow by considering temporal and spatial dependence. Transportation research part C: emerging technologies 34 (2013), 108–120.
  • Li et al. (2017) Weizi Li, David Wolinski, and Ming C Lin. 2017. City-scale traffic animation using statistical learning and metamodel-based optimization. ACM Transactions on Graphics (TOG) 36, 6 (2017), 200.
  • Liang and Haas (1999) Ben Liang and Zygmunt J Haas. 1999. Predictive distance-based mobility management for PCS networks. In INFOCOM’99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, Vol. 3. IEEE, New York, USA, 1377–1384.
  • Lin et al. (2018) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence (2018).
  • Maddern et al. (2017) Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 2017. 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research 36, 1 (2017), 3–15.
  • Matsumoto and Nishimura (1998) Makoto Matsumoto and Takuji Nishimura. 1998. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS) 8, 1 (1998), 3–30.
  • Miranda et al. (2018) Fabio Miranda, Harish Doraiswamy, Marcos Lage, Luc Wilson, Mondrian Hsieh, and Claudio T Silva. 2018. Shadow Accrual Maps: Efficient Accumulation of City-Scale Shadows over Time. IEEE Transactions on Visualization and Computer Graphics (2018).
  • Niazi and Hussain (2011) Muaz A Niazi and Amir Hussain. 2011. A novel agent-based simulation framework for sensing in complex adaptive environments. IEEE Sensors Journal 11, 2 (2011), 404–412.
  • NYC open data (2017) NYC open data. https://opendata.cityofnewyork.us/. (Last accessed March 2017).
  • Oh et al. (2011) Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. 2011. A large-scale benchmark dataset for event recognition in surveillance video. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, Colorado, USA, 3153–3160.
  • OpenStreetMap (2017) OpenStreetMap. 2017. Planet dump retrieved from https://planet.osm.org . https://www.openstreetmap.org. (2017).
  • Othman and Shazali (2012) Mohd Fauzi Othman and Khairunnisa Shazali. 2012. Wireless sensor network applications: A study in environment monitoring system. Procedia Engineering 41 (2012), 1204–1210.
  • Park and Miller (1988) Stephen K. Park and Keith W. Miller. 1988. Random number generators: good ones are hard to find. Commun. ACM 31, 10 (1988), 1192–1201.
  • Photosynth (2017) Photosynth. https://blogs.msdn.microsoft.com/photosynth/2017/02/06/microsoft-photosynth-has-been-shut-down/. (Last accessed March 2017).
  • Rana et al. (2010) Rajib Kumar Rana, Chun Tung Chou, Salil S Kanhere, Nirupama Bulusu, and Wen Hu. 2010. Ear-phone: an end-to-end participatory urban noise mapping system. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks. ACM, 105–116.
  • Reades et al. (2007) Jonathan Reades, Francesco Calabrese, Andres Sevtsuk, and Carlo Ratti. 2007. Cellular census: Explorations in urban data collection. IEEE Pervasive computing 6, 3 (2007), 30–38.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
  • Sheng et al. (2012) Xiang Sheng, Jian Tang, and Weiyi Zhang. 2012. Energy-efficient collaborative sensing with mobile phones. In INFOCOM, 2012 Proceedings IEEE. IEEE, Florida, USA, 1916–1924.
  • Shi et al. (2009) Wenhuan Shi, Shuhan Shen, and Yuncai Liu. 2009. Automatic generation of road network map from massive GPS, vehicle trajectories. In Intelligent Transportation Systems, 2009. ITSC’09. 12th International IEEE Conference on. IEEE, Missouri, USA, 1–6.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. 2015. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015. CVPR 2015. IEEE Conference on. Massachusetts, USA.
  • The MathWorks, Inc. (2017) The MathWorks, Inc. 2017. Matlab version 2017b. (2017).
  • Tian et al. (2002) Jing Tian, Jorg Hahner, Christian Becker, Illya Stepanov, and Kurt Rothermel. 2002. Graph-based mobility model for mobile ad hoc network simulation. In Simulation Symposium, 2002. Proceedings. 35th Annual. IEEE, California, USA, 337–344.
  • Titzer et al. (2005) Ben L Titzer, Daniel K Lee, and Jens Palsberg. 2005. Avrora: Scalable sensor network simulation with precise timing. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on. IEEE, Tennessee, USA, 477–482.
  • Tokuda et al. (2018) Eric K. Tokuda, Gabriel B. A. Ferreira, Claudio Silva, and Roberto M. Cesar-Jr. 2018. A novel semi-supervised detection approach with weak annotation. In Image Analysis and Interpretation (SSIAI), 2018 IEEE Southwest Symposium on. IEEE, Nevada, USA.
  • Tokuda et al. (2013) Eric K. Tokuda, Helio Pedrini, and Anderson Rocha. 2013. Computer generated images vs. digital photographs: A synergetic feature and classifier combination approach. Journal of Visual Communication and Image Representation 24, 8 (2013), 1276–1292.
  • USEEPAAD (2017) United States Environmental Protection Agency Air Data USEEPAAD. https://www3.epa.gov/airdata/ad_data_daily.html. (Last accessed March 2017).
  • Vanegas et al. (2012) Carlos A Vanegas, Daniel G Aliaga, and Bedrich Benes. 2012. Automatic extraction of Manhattan-world building masses from 3D laser range scans. Visualization and Computer Graphics, IEEE Transactions on 18, 10 (2012), 1627–1637.
  • Vezzani and Cucchiara (2010) Roberto Vezzani and Rita Cucchiara. 2010. Video surveillance online repository (visor): an integrated framework. Multimedia Tools and Applications 50, 2 (2010), 359–380.
  • Vinyals et al. (2011) Meritxell Vinyals, Juan A Rodriguez-Aguilar, and Jesus Cerquides. 2011. A survey on sensor networks from a multiagent perspective. Comput. J. 54, 3 (2011), 455–470.
  • Wang et al. (2003) Guiling Wang, Guohong Cao, and Tom LaPorta. 2003. A bidding protocol for deploying mobile sensors. In Network Protocols, 2003. Proceedings. 11th IEEE International Conference on. IEEE, Georgia, USA, 315–324.
  • Whyte (2012) William H Whyte. 2012. City: Rediscovering the center. University of Pennsylvania Press, Pennsylvania, USA.
  • Yang et al. (2003) Danny B Yang, Leonidas J Guibas, et al. 2003. Counting people in crowds with a real-time network of simple image sensors. In Proceedings of the IEEE International Conference on Computer Vision Workshops. IEEE, Nice, France, 122.
  • Zheng et al. (2014) Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. 2014. Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 3 (2014), 38.

Appendix A Proof of metric formula

Here we derive the formula for the closest point in our class and the ground truth vector. This is equivalent to solving:


First we expand the distance metric using the Euclidean inner product:


which is minimized by

when . For , then and


regardless of the value of a.

Substituting this value into , we obtain our equation above:

When then and we get the same value. Note that we are assuming that is never zero.
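For illustration, a minimization of this kind can be checked numerically. The sketch below assumes that the class consists of all scalar multiples of a sampled vector; this assumption, the function name `closest_scaled_point`, and the variable names are hypothetical and are not taken from the equations of this appendix. It relies only on the standard expansion of the squared Euclidean distance between a scaled vector and a target, which is a quadratic in the scale factor.

```python
import numpy as np


def closest_scaled_point(sampled, truth):
    """Return the scale a minimizing ||a * sampled - truth|| and the resulting distance.

    Assumes the candidate class is {a * sampled}; this is an illustrative assumption.
    """
    sampled = np.asarray(sampled, dtype=float)
    truth = np.asarray(truth, dtype=float)
    ss = sampled @ sampled
    if ss == 0.0:
        raise ValueError("the sampled vector must be non-zero")
    # ||a x - y||^2 = a^2 <x, x> - 2 a <x, y> + <y, y>, minimized at a = <x, y> / <x, x>
    a = (sampled @ truth) / ss
    distance = np.linalg.norm(a * sampled - truth)
    return a, distance


a, d = closest_scaled_point([1.0, 0.0], [1.0, 1.0])
print(a, d)
```

The same distance also follows in closed form from substituting the optimal scale back into the expansion, which gives the squared norm of the target minus the squared inner product divided by the squared norm of the sampled vector.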

Appendix B Theoretical Asymptotic Error

In this appendix we derive the equation for sensor error that we give in Equation 8. The error bounds only assume that for some values of and . First, we will show that, for the simulation, the values of and agree with the parameters of the same name.

As noted in Equation 2, the sampling process for the simulation results in the following sampled values for each location:

where is sampled from a binomial distribution with mean and is a Poisson process (Kleinrock, 1976) with a mean of .

At the same time, the ground truth distribution of people at each location is given by Equation 3

In this appendix we will make the simplifying assumption that

The expected value of the sampled can then be found by (noting that the random variables are all independent)


From this point on, all results will only depend on the equations and not the underlying sampling process.
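The sampling model described above can be sketched numerically. In the sketch below, each person at a location is detected independently (a binomial count of true positives) and false positives follow a Poisson distribution; the parameter names `p_detect` and `fp_rate`, their values, and the per-location counts are hypothetical placeholders rather than values from our experiments. By linearity of expectation, the mean detected count at a location is the detection probability times the number of people plus the false-positive rate, and the empirical mean approaches this value as the number of samples grows.

```python
import numpy as np


def simulate_mean_detections(people, p_detect, fp_rate, n_samples, seed=0):
    """Average detected count per location over repeated independent samples."""
    rng = np.random.default_rng(seed)
    people = np.asarray(people)                      # integer number of people per location
    shape = (n_samples, people.size)
    true_pos = rng.binomial(people, p_detect, size=shape)   # detected real pedestrians
    false_pos = rng.poisson(fp_rate, size=shape)             # spurious detections
    return (true_pos + false_pos).mean(axis=0)


people = np.array([0, 1, 5, 10])                     # hypothetical ground-truth counts
empirical = simulate_mean_detections(people, p_detect=0.7, fp_rate=0.2, n_samples=20000)
expected = 0.7 * people + 0.2                        # linearity of expectation
print(empirical)
print(expected)
```

With more samples per location the empirical means move closer to the expected values, which is the behaviour the law of large numbers argument below relies on.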

Let be the total number of people and be the sampling location. In a real world scenario, may not be well defined. As such, we will work in terms of , the density of people.

By the law of large numbers, in the limit of

, approaches

Using this limit, the asymptotic value of , as defined by Equation 6, can be found:


Here, is the Euclidean inner product and is the vector with all ones. Note that and

The magnitude of can then be found by


Similarly, the difference between and the ground truth sampling can be found by


The magnitude of which can be found by


Equations 18 and 20 can be used to find our metric as defined by Equation 7


However, this equation depends on , , and , which would not be known for real applications. To account for these variables, we will introduce a new parameter, :


While may also be unknown, we will be able to take a maximum of the resulting error function to get a bound. Using the bounds on the L2-norm in terms of the L1-norm and noting that the L1-norm is equal to , we obtain the identity


By substituting, and with a bit of algebra, we can transform Equation 21 into a final form:


Note that this formula is a function of , , and . By expanding this function as a Taylor series, we find


Noting that is always positive and applying Taylor’s theorem, we end up with the inequality


where the last step comes from the inequality