Mobile network operators collect a profusion of metadata from the digital communication activity of their subscribers. They are in a position to extract significant new knowledge on human behaviors at heterogeneous scales, ranging from single individuals to large populations. Examples abound, and are comprehensively reviewed in : they include original insights on mobility laws , patterns of daily commuters , dynamics of infective disease epidemics , or people reaction to disaster situations . Mobile network operators can then leverage such information to develop metadata-driven value-added services, for, e.g., transport analytics  or location-based marketing .
Our work focuses on the use of mobile network metadata for the estimation of population density in urban regions, which is a paramount information for informed planning in metropolitan areas by local authorities. Traditional censuses are carried out at regular time intervals in all developed countries, and allow determining static population densities, i.e., the spatial distributions of citizens’ places of habitation or dwelling units. However, these censuses are complex to organize and expensive to run, which limits their periodicity to a few years in best cases . Instead, metadata collected by mobile network operators is fairly inexpensive to obtain, and covers large portions of the population . A reliable estimation of the static population density based on such metadata would eliminate the limitations of conventional survey-based approaches [9, 10].
In addition, mobile network metadata can be retrieved and analyzed with minimum latency. This paves the road to the characterization of dynamic population densities, i.e., of the instantaneous distributions of individuals that evolve as people move over time in the region of interest. The automated, near-real-time estimation of population density dynamics has inestimable value in supporting innovative services for instance to transport planning, large event organization, public safety, or law enforcement.
The utility of a reliable estimation of population distributions based on mobile network metadata has generated a flurry of recent studies on the subject [11, 12, 13, 14, 15], which we review in Section 8. Our work has the following advantages over such previous proposals.
(i) Our design is based on a number of original metadata filters that significantly improve the accuracy of the population density estimate. The filters operate on subscriber presence, which we show to be a better proxy of the population density than other classes of mobile network metadata.
We exploit a novel multivariate relationship of population density, subscriber presence and subscriber activity level for the estimation of dynamic population densities.
(iii) When confronted with ground-truth data in multiple urban scenarios, our solution achieves good accuracy in estimating static and dynamic population densities. Specifically, it consistently outperforms current state-of-the-art approaches in both cases, and generates reasonable dynamic representations of the population distribution that allow, e.g., appraising attendance at sports and social events.
(iv) Estimates of the dynamic population distribution obtained with our proposed model are openly accessible . The datasets describe one month of population density fluctuations in the cities of Milan, Rome and Turin.
The paper is organized as follows. Section 2 describes our reference datasets. Section 3 presents the model for static population estimation, which is assessed in Section 4. Model refinements for the dynamic case are explained in Section 5 and evaluated in Section 6. Comparisons with the state-of-the-art are in Section 7, and related works are reviewed in Section 8. Section 9 concludes the document.
We leverage several datasets made available by Telecom Italia Mobile (TIM) within their 2015 Big Data Challenge . We focus on three major conurbations in Italy, i.e., Milan, Turin and Rome. For each city, we retrieve metadata about the mobile traffic activity (presented in Section 2.1), as well as ground-truth census data on the local population distribution (Section 2.2). We then infer information on land use from the mobile network metadata itself (Section 2.3).
2.1 Mobile network metadata
The mobile network metadata provided by TIM covers the months of March and April 2015, and describes the volume of traffic divided by type (incoming and outgoing voice calls, incoming and outgoing text messages, and Internet sessions), as well as the approximate presence of subscribers. These features are commonly available to mobile network operators from Call Detail Records compiled for billing purposes, hence they represent a sensible choice for developing an estimator of population density that is reusable. All metrics are aggregated in time and space. In time, the metadata is totaled over 15-minute time intervals. In space, metrics are computed over an irregular grid tessellation, whose geographical cells do not overlap and have sizes ranging from 255325 m to 22.5 km. The number of cells is 1419 for Milan, 571 for Turin and 927 for Rome.
While voice, text, and Internet traffic volumes are directly computed from the recorded demand, the presence information is the result of a simple preprocessing performed by the mobile network operator. Basically, each subscriber is associated to the geographical cell where he last interacted with the network for any purpose, which includes issuing or receiving a voice call, sending or receiving a text message, or establishing a new Internet data session. As each type of activity provides additional localization information, including them all in the calculation makes the presence information more accurate. If a user is first detected within cell , then he performs some activity in cell at time , and his latest action is recorded in cell at time , his location will be as follows: for , the user is positioned in ; for , he is in ; for , and until he performs an action in a different cell, he is in . Figure 1 shows an example of this user location estimation process. The presence metadata is then inferred by counting subscribers in each cell, at every 15 minutes.
2.2 Population distribution
Our ground-truth data for the static population distribution comes from the 2011 housing census run by ISTAT, the national organization for statistics in Italy. It includes population counts, measured in terms of families, cohabitants, persons temporarily present, domiciles, and other types of lodging and buildings, for each administrative area.
To ensure spatial consistency between the mobile network metadata and the census data, we proceed as follows. Let us denote as the total number of inhabitants in the administrative area , and as its surface. The population density in a geographical cell , defined in Section 2.1, is then computed by assuming a uniform spatial distribution of inhabitants in each administrative area, as
where is the surface of cell , denotes the total number of administrative areas, and stands for the intersection surface of cell and administrative area .
2.3 Land use
. We leverage the operator-collected data itself to classify the geographical cells based on their primary land use. To that end, we employ MWS, which is the current state-of-the-art technique for land use detection from mobile network metadata.
MWS computes, for each spatial cell of the target region, a mobile traffic signature, i.e., a compact representation of the typical dynamics of mobile communications in the considered cell. Specifically, MWS signatures are computed as the median voice call and text activity in a cell recorded at every hour of the week. Signatures are clustered based on their shape, using a classical hierarchical algorithm on top of a correlation-based signature similarity measure. The output clusters group cells with similar types of human activities, i.e., belonging to a same land use. Averaging the signatures of all cells in a specific cluster allows defining characteristic signatures associated to each land use, which are shown to be city-invariant: hence, once the characteristic signatures are identified, MWS can detect land use regions by solely using mobile network metadata. When applied to our reference datasets, MWS classifies urban areas into five land uses: residential, office, touristic, university and shopping.
We stress that the land uses detected through mobile traffic metadata are more accurate and meaningful than those returned by other methods. As an example, the Milan conurbation territory is classified into buildings, vegetation, water, road, and railroads in : these categories, based on pure geographical features, have little relation with the activity of individuals. Instead, the classes identified by MWS in the same region, listed above, map to actual human endeavors, and, as such, have a stronger tie to the population density we aim at estimating.
3 Static population estimation
Power laws regulate many relationships among social, spatial, and infrastructural properties of cities . In particular, previous works demonstrated the existence of a power relationship between the mobile network activity density (i.e., activity per km2) and the population density (i.e., inhabitants per km2) in a region [14, 13]. In other words
We find that such a power relationship holds in our reference scenarios as well. Figure 2 portrays heat-maps of the correspondence between the ground-truth population density and the network activity density, as recorded in each cell over the whole data collection period. The plots refer to different cities and different classes of mobile network metadata, i.e., incoming and outgoing calls and text messages, as well as presence. All trends are clearly linear on a log-log scale: this implies that , which is a simple transformation of (2).
Our baseline population estimation model is thus described by the expression in (2
). By transforming the formula to a logarithmic scale as done above, we can use a linear regression model to estimate the parametersand . Unfortunately, all linear relationships in Figure 2
are characterized by a significant amount of noise, caused by the heterogeneity and heteroscedasticity of voice calls, text messages and subscriber presence density with respect to the actual population density. Classical linear regression models assume absence of heteroscedasticity: running a regression directly on the raw metadata would yield poor results. In order to de-noise the metadata, we take a number of actions, which are discussed in the rest of the section.
3.1 Metadata class filtering
As a first step, we determine which type of mobile network metadata is the most suitable to population estimation. To this end, we investigate the correlation between the population census data and the different classes of metadata presented in Section 2.1, as measured in each geographical cell over time and using the full two-month data. Results are summarized in Table I, which reports the Pearson correlation coefficient computed in the case of incoming and outgoing voice calls, incoming and outgoing text messages, Internet sessions, and presence. It is apparent that the presence is steadily better correlated than all other metadata, with coefficients of 0.79–0.91 that improve by 3.5%–5% the runner-up value associated with Internet session volumes.
We further examine the higher suitability of presence density as a proxy of population density in Figure 3. The top plots detail the typical daily fluctuation of the Pearson correlation coefficient, computed on 24 hourly aggregates of the two-month data. In addition to presence, the plots illustrate the correlation dynamics for incoming voice calls and text messages, as a benchmark111Outgoing calls and messages, and Internet sessions produced results equivalent to or worse than those obtained with incoming calls and messages. They are not shown here for the sake of clarity.. The results highlight that subscriber presence is sensibly better correlated with the ground-truth data, at all times and across all scenarios.
Interestingly, our results are aligned with those in , which, however, only considered calls and messages, and not the subscriber presence. In fact, their conclusion that calls between 10 am and 11 am yield the strongest correlation with population density, is superseded by our observation that subscriber presence is a much more relevant metric. In the light of these considerations, we select the subscriber presence as the mobile network metadata on top of which we develop our methodology.
3.2 Time filtering
A second dimension for filtering is time. As already observed in previous works, the correlation between mobile network metadata and population varies over time . This effect is also present in our case studies, as shown by the bottom plots in Figure 3: they detail the variability of the correlation coefficient for subscriber presence, over daytime and in the three reference urban scenarios.
The correlation coefficient has similar dynamics in all cities, and always peaks at night. The intuitive explanation is that the ISTAT census data refers to the static population density, and residents are more frequently at home overnight. This result allows selecting a single time filter that maximizes the correlation with the ground truth for all scenarios. Hereafter, and for the purpose of training our model, we will consider in (2) as the subscriber presence density in cell during the interval between 4 am and 5 am.
3.3 Day filtering
As mentioned in Section 2.1, the mobile network metadata we employ covers 60 days in March and April 2015. An interesting question is if all these days should be considered in the regression model, or if there exists a more meaningful subset of days. Indeed, mobile network metadata is clearly affected by the diverse activity patterns and social phenomena that may characterize different days.
The three top plots in Figure 4 show the Pearson correlation coefficient in the reference cities over April222Similar results for March are omitted here, for the sake of brevity.. They refer to incoming voice calls (A), incoming text messages (B), and subscriber presence (C); according to the previous discussion, they are computed over the 4 am to 5 am interval. We still include calls and messages in order to check if they show correlation peaks on a daily basis that we could not observe in previous plots. This is not the case: the presence correlation coefficient ranges between 0.9 and 0.94, i.e., steadily higher values than those of calls and texts, lying between 0.6 and 0.8. We confirm that subscriber presence is a most convenient proxy of population distribution. When it comes to filtering based on days, however, no clear trend is observed in the top three plots, for all metadata types.
A more insightful result in this sense is obtained by considering the percentage of cells for which no subscriber presence metadata is available between 4 am and 5 am on each day (E). Here, a remarkable weekly pattern emerges: high peaks of missing information appear on weekends (denoted by red bars) and holidays (denoted by black bars, and corresponding to Good Friday and Easter), and reach up to 90% of cells. In order to understand why this happens, let us consider the average call density (D), i.e., the mean number of (incoming and outgoing) calls333Similar results were obtained with equivalent text message density and Internet session density, and are omitted for the sake of brevity. normalized by the cell surface and computed over all cells during each 15-minute slot between 4 am and 5 am: here, weekdays and weekends (highlighted in gray) yield remarkably different and lower density. Such a reduced communication activity leads to seldom updated presence metadata that easily misses user occupancy in less populated cells; therefore, it increases the number of low-presence cells, which are then removed from the dataset by the operator during sanitization to mitigate privacy risks444The rationale for the operator’s choice is that if too few users are present in a cell, they may be tracked and re-identified in the metadata. . We conclude that a substantially lower activity of users during non-working days is the cause for the notable absence of presence metadata on those days.
In the light of these considerations, the high correlation between presence metadata and population census during weekends and holidays (C) is ostensible, as it only concerns cells for which metadata is available. For the (many) other cells, the correlation cannot be computed due to the lack of presence metadata. Since working with inconsistent temporal supports for different cells may introduce biases in the analysis, we ultimately opt for excluding weekends and holidays from the data used to derive our model.
3.4 Outlier cell filtering
In order to estimate the parameters and in (2), we employ the RANSAC regressor  on the filtered subscriber presence metadata. In addition to estimating the model parameters in an iterative manner, RANSAC also automatically detects outlying points and excludes them from the regression. Figure 5 depicts heatmaps of the filtered subscriber presence metadata versus the census population density data, for the three reference scenarios of Milan, Rome and Turin separately, as well as when the data for all these cities is considered at once. A first important observation is the effect of the proposed filters. The noise is sensibly reduced with respect to the raw subscriber presence metadata, in Figure 2: most points are tagged as inliers (colored dots) by RANSAC, and follow a clear linear trend.
Still, there exists a minority of outliers (gray dots) detected by RANSAC. A closer look at these outliers reveals that they refer to a same subset of cells, consistently over time. Thus, those cells yield some features that make the subscriber presence recorded there less related to the local amount of population. Maps of such cells are in Figure6.
We do not have strong evidence of why a limited number of specific cells show outlying behaviors with respect to the model. However, we speculate that two factors may contribute to this phenomenon. The first is a border effect: in all three cities, many outlying cells are located at the boundaries of the considered geographical region. We argue that the mobile network antennas in such cells probably cover areas (and populations) beyond the limits of the scenario, and the metadata reflects that. As a result, the subscriber presence associated to frontier cells also refers to users located outside the cells, which leads to an overestimation of the population by the model.
The second factor is the temporal mismatch between our datasets. The ISTAT census dates back to 2011, whereas the mobile network metadata refers to 2015. It is not unlikely that the distribution of the population density in several areas of the cities changed during the four years that separate the datasets. Figure 6 seems to corroborate this hypothesis, since many outlying cells are located in suburban areas where resident population dynamics are more subject to evolve. If this were the case, mobile network metadata may thus help updating population distribution maps at a much higher frequency than traditional survey-based methods.
In summary, however, there is a risk that outlying cells may be affected by artifacts of the spatial tessellation or by potential issues in the associated ground-truth data; we thus deem safer to filter them out when training our model.
Milan, Rome, Turin. Geographical positions of the cells that determine frequent outliers detected by RANSAC.
4 Model evaluation
The regression returns fitted parameters and of (2). Our model estimates the static population density in spatial cell from the filtered subscriber presence density as
Figure 5 illustrates the curves obtained with the model (solid red lines), as well as with a pure linear fitting where =1 (dashed black lines). The two lines are close in all plots, which underscores the quasi-linearity of the relation between subscriber presence and population in all urban scenarios: e.g., we find =1.02 when all cities are used jointly.
In order to assess the accuracy of the model in (3), we compute the determination coefficient (), and the Normalized Root Mean Square Error () of the estimates, when compared to the ground-truth data. The coefficient is
where denotes the number of cells in the spatial tessellation, and is the average population density in all cells from ground-truth data. We compute two versions of the , which facilitates the comparison of the model results in different contexts. They are defined as
where and are the maximum and minimum population densities in cells within the target urban scenario, as indicated by the ground truth. In the following, we will use to refer to both expressions (5) and (6) at once.
We stress that our performance metrics are computed on all cells, including those excluded from model training.
4.2 Milan case study
We first focus on the Milan scenario. We adopt a three-fold cross-validation procedure, by separating the subscriber presence data into three equally-sized subsets in time. Two subsets are used as a training set to learn the model parameters, and the third is employed as a test set to evaluate the model quality. The process is repeated three times, by changing the test subset at each iteration.
The baseline result for the Milan case study is shown in Figure 7. The top plot reports the values of the and metrics computed on the test set, as well as on the training set for completeness. The results are separated per land use: i.e., the model is independently trained and tested on mobile network cells characterized by different land uses, extracted from the mobile network metadata as per Section 2.3. We remark that both and : (i) yield comparable values for all pairs of training and test data; (ii) undergo significant variability across land uses. The first result proves that the model can correctly estimate unknown population densities from subscriber presence metadata. The second highlights how land use affects the activities of individuals, including their mobile communication habits; in turn, this diversity influences the accuracy of population estimates from mobile network metadata.
The estimation is better in residential areas, where =0.85, =0.075 and =0.122. This is a reasonable result, given that the distribution of dwelling units in the ISTAT census population is best captured in neighborhoods where households prevail. The quality of the estimation is still good ( above 0.6) for touristic and shopping zones, fair ( close to 0.5) for office areas, and bad ( below 0.4) for university areas. Our speculation is that different percentages of individuals present in these areas overnight are not resident inhabitants in the census data (but, e.g. tourists, students, or locals hanging out), hence do not appear in the population ground truth.
In the light of these considerations, the model parameters estimated in residential areas have the highest chance to be those actually linking population and presence metadata. We verify to what extent a model trained in residential zones can estimate populations in areas characterized by different land uses in the bottom plot of Figure 7. In the Milan case study, this approach does not degrade performance significantly, with the sole exception of university areas that however represent a negligible minority of spatial cells. This lets us consider a unifying model in the remainder of our study, by training (3) on presence metadata from cells where residential land use is predominant.
4.3 Generalization to different city scenarios
Table LABEL:tab:cities summarizes the performance of our model in all urban scenarios. The residential portion of Table LABEL:tab:cities refers to results obtained with data relative to cells with residential land use. It shows: (i) the model parameter values and
returned by fittings on the training set, with 95% confidence intervals;(ii) the fitting quality of the model computed over the training set, as and ; (iii) the accuracy of the estimation in the test set, as and .
We remark that the values of are consistently close to one across all of the urban scenarios we consider. Instead, tends to be city-specific555We explored if a number of features (e.g., total population, average population density, conurbation size, number of cells in the geographical surface, mobile network operator market share, etc.) could explain the difference, without finding significant correlations.. The accuracy of the estimation is in all cases very good, attaining determination coefficients between 0.8 and 0.87, and a normalized error below 0.1.
The right portion of Table LABEL:tab:cities, under the mixed tag, outlines the performance of the model trained on residential areas, and then used to estimate the population density in the complete urban region, including zones that are not residential in nature. The accuracy remains good666We remark that these values are aligned with or better than those of current state-of-the-art solutions for static population density estimation. For instance, of 0.66 and 0.8 reported in  and in , respectively; also, the is measured at 1.0 in . A complete comparative evaluation is provided in Section 7., with in the range between 0.76 and 0.82, and around 0.1.
|Mixed Land Use|
As a final test, we explore the possibility of estimating the population density in an urban area using a model trained on data collected in another city. The rationale is that, in Table LABEL:tab:cities, the values are almost identical for all cities, and values are not dramatically different. Table III summarizes the estimation accuracy on all possible combinations of our three reference city scenarios, for both and . The -th element in the table maps to the accuracy of a model trained on metadata and ground truth from city , and used to estimate the population in city .
The metric values stay fairly high (in the range 0.64-0.84) for and low (between 0.1 and 0.2) for , in all combinations of cities, which suggests that a cross-city estimation of population density is in fact possible. This result has important practical implications, since it paves the road to the estimation of populations in cities using mobile network metadata, and without any need for training on ground-truth data on a specific urban region.
As a concluding remark, we underscore that all results above hold for our three reference cities, and we cannot claim generality beyond these. Yet, the strong consistence of model performance and the possibility of cross-city estimation are especially encouraging when considering that the target cities have fairly diverse topological features and population sizes, at approximately 2,850,000 (Milan), 1,350,000 (Rome) and 800,000 (Turin) inhabitants.
5 Dynamic population estimation
An interesting possibility offered by mobile network metadata is the estimation of dynamic populations, i.e., the instantaneous distribution of city inhabitants over time, as determined by their daily activities. In this case, we do not target the evaluation of the static density of dwelling units in cell , but that of the time-varying real-time density in cell , at any time . Estimating dynamic population densities is more challenging than approximating the static distribution of dwelling units, due to the much shorter timescale of people movements (minutes) compared to that of domicile variation (years).
The main problem in estimating dynamic populations is the lack of ground-truth data, which makes training supervised models such as that in (3) impossible. Simply reusing the parameters and computed for the static population is inaccurate, because those values describe the relationship between a specific subscriber presence and the static density of residents: there is no certainty that the correspondence remains valid when inhabitants are not at home. For instance, the reduced correlation between census population information and mobile subscriber presence during working hours, in Figure 2, already advises against using parameters trained over night hours to estimate whole-day population densities at any time .
Our approach roots instead in a novel multivariate relationship between the population density, the subscriber presence and the level of mobile communication activity of subscribers. This allows taking the constants and out of the equation, and finding a unifying expression that can be used to estimate dynamic urban populations in absence of ground-truth information.
5.1 Subscriber presence and activity levels
We start by discussing the interplay between the time-varying subscriber presence and the subscriber activity level. The latter is the frequency at which a subscriber interacts with the mobile network. Formally, we define the activity levels for voice calls and text messages at network cell and time as:
where stands for the number of mobile communication events of type (i.e., incoming or outgoing voice calls, incoming or outgoing text messages) recorded in mobile network cell at time . The subscriber presence is a proxy of the number of “unique users”, as it provides an approximate tally of the mobile devices based on their interactions with the mobile network: the fractions in (7) and (8) respectively denote the mean number of calling and texting events per user in cell at time .
We can now introduce the average activity levels as
where denotes the number of mobile network cells in the target geographical region. Then, and express the average number of calls and messages sent or received at time by a user located in the whole urban areas.
These activity levels are not uniform over time. Figure 8 depicts the fluctuation of and in the residential areas of Milan, Rome and Turin, over a day. The error bars indicate the average and 5th, 25th, 75th and 95th percentiles of the activity levels, on a hourly basis. For all cities, and undergo major variations, with minima at night and higher activity during work hours.
This behavior is consistent across land uses, as shown in Figure 9 for the specific case of Milan. Minor variations exist, in accordance to the prevalent activities in each land land use: university and shopping areas show higher mobile communication during rush hours, and almost no network usage at night; touristic and office areas have activity peaks in the morning that then degrade until midnight. Still, the overall heterogeneity of and is the same for all land uses: Figure 10 highlights such uniformity, by aggregating the daily behavior into a single error bar for each land use. In all cases, the 75-th and 95-th percentiles of both voice calls and text messages are approximately 50% and 300% higher than the mean activity level.
Our key point here is that the heterogeneity of and in time has an impact on the correctness of the subscriber presence information. As an illustrative example, let us consider again Figure 1: the more often a mobile device issues and receives voice calls or text messages, the more accurate is its localization in the presence data. A legitimate question is then whether the different activity levels we observe in all our urban scenarios can be linked to the model parameterization, and explain the diversity of and noted in Section 4. We explore this possibility next.
5.2 Population estimation with activity levels
We do not have access to the real values of the parameters in (2), but to their estimations and . We thus collect data in all cities that refer to the overnight period, i.e., from midnight to 8 am: in this period the ISTAT census information can be still considered a reliable ground truth, as most people are at home. We then draw a scatterplot of the activity levels and , with the corresponding and obtained with the RANSAC regression model.
The results for the three cities are depicted in Figure 11, and outline a striking linear relationship between the activity levels and both model parameters, for all cities. Specifically, grows linearly with the activity levels, while drops linearly with the same measures. We can then re-write the parameters and as
where denotes the type of event (i.e., calls or texts).
The exact values of , , , and are listed in Table 11. In both (11) and (12), we observe some heterogeneity across cities. Specifically, the derivatives of the curves are quite close to each other, whereas a constant offset tells apart the linear behavior observed in different urban areas. Such a diversity in the parameter settings evidences the need for a per-city calibration of the model. Instead, there is no significant difference between the values obtained when considering voice call or text message activity: hereinafter, we will consider the former only.
We leverage the results above to draw a unifying multivariate model that links the population density to both the subscriber presence and the subscriber activity level. Specifically, we refine our estimation model as
where is a simplified notation for .
The following important considerations are in order. First, the expression in (13) employs variables and that can be computed from mobile network metadata at any time . Second, unlike the original and in (2), the new parameters , , , and can be regarded as time-invariant, assuming that the linear relationships in (11) and (12) hold for any activity level. When considered jointly, these observations imply that the model in (13) is suitable for the dynamic estimation of population densities in practical cases where ground-truth information on the instantaneous distribution of inhabitants is unavailable.
5.3 Properties of the multivariate model
We explore the basic properties of the multivariate model in Figure 12. The plot illustrates how expression (13) reshapes the relationship between the presence density and the estimated population density depending on the activity level , in the exemplary Milan scenario. The dashed line in the figure is the conceptual curve that would link the estimated population to the presence if the latter metadata reported the exact number of customers in region . In that case, the population could be obtained by simply scaling the presence by the market share of the mobile operator777TIM, our mobile network metadata provider, has a market share of approximately 35% in Milan, hence in our case., i.e., .
We observe that the model always lies below such a conceptual curve, compensating for the effect that the presence metadata computed with current activity levels tends to overestimate the population density. However, as the value of grows from 0.01 to 0.20 (the two extreme values observed in our datasets), the model approaches the ideal relationship. This is consistent with the intuition that higher subscriber activity results in presence metadata that approximates more accurately the actual number of people present in a cell. Interestingly, such a correspondence occurs faster for low-density cells (e.g., presence is an excellent proxy for population densities below 50 individuals/km2 already at ); instead, high-density cells require subscriber presence values that are never attained in our data. At the maximum activity level recorded in our metadata, i.e., , the model is nearly equivalent to the perfectly proportional representation for presence densities up to 400 devices/km2.
Finally, we comment on the suitability of the model for real-time estimation of dynamic population densities. Our multivariate model has a close form, presented in (13). Therefore, the computational complexity of the model corresponds to that of computing the equation – an operation performed in nanoseconds by any low-end CPU. As a result, the complexity of the model easily meets the requirements of any real-time application. Instead, the actual system latency would depend from the time needed by the mobile network operator to collect and process the data required to compute the presence metadata and the activity level: however, these aspects concern the overall mobile network architecture, and are well beyond the scope of our contribution.
6 Case studies
In this section, we provide proof-of-concept exploitations of the model in (13). Specifically, we first leverage the model to provide a glance of the population dynamics during typical days in Milan and Turin. Then, we assess the model capacity to approximate the population density during special events, such as sport matches and public rallies.
6.1 A day in the life of Milan
Figure 13 illustrates the dynamic distribution of the population in Milan and its suburbs, as inferred from our multivariate model, during key times of one normal day, i.e.
, Monday April 13, 2015. Left plots refer to the whole conurbation, whereas right plots focus on the actual city. In each plot, colors denote the z-score, which measures the deviation from the mean of the population distribution at each location. Formally, the z-score in cellat time is
are the mean and standard deviation of the population density at cell, computed over the values estimated by our model during the full two months, i.e., over . As a result, the plots show how the population density fluctuates at specific times: variations range from high in-flow of individuals moving into a cell (red) to high out-flow of individuals leaving a cell (blue), passing by neutral cells where the population density does not vary during the observation period (white).
Reasonable dynamics emerge from the plots. At 5 am, the only point of interest showing relevant activity is the mercato ortofrutticolo, i.e., the wholesale produce market of Milan. The market attracts farmers and merchants very early in the morning, as highlighted by the red spot in the top right plot. At 7 pm, the population density is especially high around main transportation hubs, such as train stations and intermodal exchange nodes. The population density grows significantly in the city center throughout the morning, see e.g., the plots at 9 am and 12 pm. At the same times, the suburban and rural areas show low z-scores, indicating a clear in-flow of inhabitants from around the city to downtown, where office areas are located. The trend is then reversed in the afternoon, starting at 4 pm and more clearly at 6.30 pm, when people leave from the office. Here, the city center tends to become less populated, with an out-flow of inhabitants towards the city outskirts. Finally, it is interesting to observe that late at night, at 11.30 pm, the population is mostly at home, with high z-scores in suburban and rural areas, or in nightlife areas.
Overall, the results in Figure 13 show that our multivariate model can capture sensible dynamics of the population density at an intra-urban scale.
6.2 Crowds at large-scale events
Our proposed model does not only capture typical dynamics of the urban population density, but can also estimate crowds at major social events. Interestingly, official figures about attendance at such events offer an original means to validate our assumptions on the time-invariance of the parameters in Table 11, as well as the overall quality of the estimates returned by our multivariate model.
6.2.1 Football matches
Football matches are traditionally very popular events throughout Italy. Figure 14 highlights the population density in Milan and Turin during games of major local football teams. For the Milan case, the plot refers to April 19, a Sunday, at 10 pm. The model conveys well the crowd attracted by an important football match between AC Milan and Inter Milan, the two city teams, that took place on that day at San Siro, the main city stadium. The stadium position is even more evident in the case of Turin, on April 14 at 9 pm. At that time, the local team, Juventus FC, was playing a quarter-final Champions League game against AS Monaco.
In fact, the multivariate model allows going beyond a simple visualization of population density peaks in and around the stadiums during matches. By leveraging the expression in (13), we can produce actual estimates of the attendance at matches through the following steps.
We identify the mobile network cells that provide coverage to the stadium, and denote their set as . Since exact maps of the signal propagation and antenna coverage areas are not available, we tend to be inclusive, considering all cells that intersect with the stadium surface, as well as the adjacent ones.
For each cell , we determine the presence density in a normal situation, i.e., when no match is played at the stadium. To that end, we record all presence density values at the same weekday and time of the match, excluding those days where another match was played; we then compute the median of such presence densities, and denote the result as .
We establish the time of the match at which the presence density across all cells in reaches its peak, i.e.,
where is the match timespan, from 15 minutes before kickoff to 15 minutes after the final whistle.
We compute the average presence density in normal conditions, , and during the match, , as
where indicates the surface of mobile network cell . We opt for an average weighted on the relative cell sizes, so as to account for the fact that the cells in may have quite diverse surfaces.
We calculate the mean activity level in all concerned cells during the course of match as
where operator designates the cardinality of a set. Here, a simple arithmetic mean suffices, since the values are averaged over users and unrelated to cell surfaces.
The attendance is finally obtained via our multivariate model as
The last operation above applies the model in (13) to the difference between and , i.e., to the increased presence density during the football match. The result is an estimate of the population density inflation (in attendees/km) caused by the crowd in the stadium. Multiplying by the total geographical surface allows inferring the actual attendance at the event.
In order to assess the quality of the estimation, we consider all matches played in March and April 2015 by the first-division football teams of Milan, Turin and Rome, and compute their attendance from (19). We then compare such estimated attendance to the official figures reported by local authorities, which are very precise and represent a reliable ground-truth. The overall attendance estimation error, in terms of the metrics introduced in Section 4.1 is and : these values are aligned with those for the static population estimation, and demonstrate that our multivariate model can measure population densities in a time-varying environment with similar accuracy.
Further details are provided in Table V, which reports the date, estimated and ground-truth attendance for each match. Overall, there is a good agreement between the values, with relative errors that range from 0.2% to 28.4%, and an average error of 11.9%. These results are especially encouraging when considering that our approach is general, and not specifically designed for large-scale events.
6.2.2 Public march
A second example of social happening detected from the dynamic population estimated by our multivariate model is a public manifestation that occurred in Milan, on Saturday April 25. On that day, Italy celebrated the 70th anniversary of liberation after World War II, and a large crowd marched along the streets of the city to commemorate the event. Figure 15 illustrates the significant increase of the z-score of the dynamic population density inferred by our model at different times during the manifestation. Namely, the plots allow appreciating the initial gathering of people in the Porta Venezia neighborhood, the two-hour procession in the city center, and the final arrival at the Duomo cathedral. This maps very well to the actual route of the parade, in the left plot of Figure 16.
Also in this case, our model can be leveraged to approximate the number of participants to the march. The steps are the same detailed in Section 6.2.1. Notably, we consider presence metadata in all cells that provide coverage to the path of the public march in the left plot of Figure 16, as well as their immediately adjacent cells, during the whole duration of the rally. The right plot of Figure 16 shows the results returned by the multivariate model. There are approximately 5,000 people strolling in the march area during a typical early afternoon on spring Saturdays; however, the population increases dramatically on the demonstration day. We estimate the peak attendance at 37,000 persons888This value represents the increased presence over the normal population, as in the case of football matches., in the central phases of the march. Official and unofficial figures evaluate the total number of participants at 50,000 and 60,000, respectively [21, 22]. However, our model captures the instantaneous number of participants, and not the total one: as this is a four-hour demonstration, we speculate that the discrepancy between the official figures and our estimate is explained by a natural turnover of attendees, some of which conclude the march and leaving the manifestation before others join it at its start location.
7 Comparative evaluation
We carry out a comparative performance evaluation in order to clearly position our approach with respect to previously proposed competitor solutions. Specifically, the current state-of-the-art techniques for the estimation of population densities from mobile network metadata are those presented in  and . The former targets static population distribution estimation, while the second is designed for dynamic population density estimation. Therefore, we compare our multivariate model with the solution in  in the static case, and with that in  in the dynamic case.
7.1 Static population
The approach in 
performs a random forest regression on a large number of per-cell features that include the hourly volumes of incoming and outgoing calls and text messages, the hourly volume of Internet sessions, and the surface fraction belonging to each land use. Land uses are classified into buildings, vegetation, water, road, and railroads, and obtained from the OpenStreetMap database (41% of the cells) or inferred from satellite imagery (59% of the cells). Each feature is computed at three different scales, by considering cells in isolation as well as by aggregating neighboring cells into 33 and 55 grid squares.
In fact, the random forest regression is tested in one of the scenarios we also consider, i.e., the conurbation of Milan, using an equivalent dataset provided again by TIM in the context of the 2014 Big Data Challenge. Therefore, we can directly compare the accuracy in our results with that attained in . In the best configuration, where only the 16 most important features are retained for training, the random forest technique presented in  achieves an of 0.66. The equivalent result obtained with our model is 0.80, as shown in Table LABEL:tab:cities (Milan, mixed land use): this amounts to an improvement that exceeds 21%.
The better performance of our model is due to a combination of factors. First, we leverage subscriber presence, which is a more reliable proxy of population density than the mobile network metadata used in . Second, we employ land use information that tells apart human activities (e.g., residential versus office areas) rather than simple urbanization (e.g.
, buildings versus vegetation): therefore, our notion of land use has a more direct relationship with population distributions. Third, we filter the metadata based on daytime, land use and outlying human dynamics: in this way, we account for important phenomena, such as the heterogeneity of subscribers’ behaviors over the day and during the night, the diversity of mobile service usage in residential and non-residential area, or the variations of mobile traffic activity during weekdays and weekends or holidays. Given the rather intuitive nature of these phenomena, designing filters based on reasoning is more effective than having a machine learning technique guess them.
7.2 Dynamic population
The approach in  mimics several of our solutions, as originally presented in . It leverages regressions on the power-law model of population density in (2), where it uses the number of subscribers at 7 am as the variable: this is semantically similar, but not identical, to the subscriber presence we employ as . Also, the solution in  performs the regression on different functional regions separately, which is equivalent to telling apart land uses. Functional regions are urban areas characterized by different densities of points of interest, and are classified into residence, entertainment, business, industry, education, scenery spot and suburb.
However, the model in  fundamentally differs from ours in two aspects.
It implements the estimation of the dynamic population density via a time-varying rescaling factor , which is applied uniformly to all cells999The study in  considers a spatial tessellation based on functional regions and not a Voronoi tessellation based on the mobile network deployment. In our case, we derive land use from mobile network metadata, hence the two tessellations match. at time and ensures that the total population stays constant over time.
Formally, the design choices above lead to a model
for each spatial cell of functional region at time . In (20), and are the power-law parameters regressed from the static population for functional region , is the ground-truth static population in cell , and is cell surface.
In the original work, the model in  is cross-validated with transport data in the region of Shanghai, PRC. The urban environment and validation methodology are very different from those we consider, hence a direct comparison with our results is impossible. For the sake of a fair comparison, we thus implement101010We had to make two approximations in our implementation. First, we use subscriber presence instead of the number of subscribers, as we do not have access to the latter in our scenarios. Second, we employ MWS-based land uses instead of functional regions, as we do not have access to high-detail points of interest databases in our scenarios. We argue that these are minor changes, as the replacement metadata is semantically close to that employed in , and the core differences between the two models lie elsewhere. the solution in , and evaluate its performance in our reference scenarios. The validation strategy is the same we adopted in Section 6.2.1, i.e., leveraging ground-truth attendance at football matches.
Figure 17 displays the relative error of the estimation made by the solution in  and our proposed multivariate model. The left plot summarizes the results over all matches listed in Table V, as the 5th, 25th, 50th, 75th and 95th percentiles. The relative error incurred by the approach in  is approximately twice that caused by our model, for all these statistics; in the median case, it exceeds a factor 2.2. The statistical validity of the comparison is confirmed by a two-sided Mann-Whitney U-test, which returns a p-value of 0.0005. When disaggregating results for all matches, in the right plot, it is evident that our estimation technique performs consistently better. A couple of outliers apart, our model attains peaks of improvement at 0.4.
Different matches can attract a very diverse number of spectators, hence we also investigate how the two models compare in terms of absolute error, i.e., the discrepancy in the number of attendees between estimates and ground truth. Figure 18 shows that the performance are aligned with those observed with relative errors: also in this case our model grants a 50% error reduction, consistently across matches. Thus, our conclusions hold also in this case.
We believe that the better accuracy achieved by our model roots in (i) its capability to capture the dependence of the regression parameters on the user activity levels , which is instead neglected by the time-invariant in , and (ii) its exclusive reliance on regression in residential land uses. Concerning this second aspect, we argue that the early morning population in non-residential regions may not map well to people living in those areas; depending on the nature of the area, it rather corresponds to commuters, early workers, tourists, etc. Therefore, a regression trained on such land uses will lead to unreliable parameters and . As a supporting evidence from Figure 18–17, the city with more heterogeneous land uses, i.e., Milan, is the one where the solution in  performs the worst.
8 Related work
There is wide agreement on the suitability of mobile network metadata as a source of information for human mobility analysis. Specifically to population distribution estimation, mobile network metadata was first proposed as a proxy for the density of inhabitants in . Early evidences of the existence of an actual correlation between the mobile communication activity and the underlying population density were presented in , by comparing city population sizes and amounts of mobile network customers.
Subsequent works carried out more comprehensive evaluations. In , the home location of each subscriber was localized as the most frequently visited cell with a home profile, i.e., where the activity peaks at evening. The density of home locations was then found to match very well –with a 0.92 correlation– census data on nationwide population distribution. Similarly, an excellent agreement between the overnight spatial density of mobile subscribers and that of nationwide static populations was found in [26, 27]. However, these results refer to populations at the scale of a whole country, with a spatial granularity at the level of counties or tracts, i.e., large regions that comprise multiple cities each. Our focus is instead on urban population distribution estimation within individual urban areas: down-scaling the investigation to a citywide level requires orders-of-magnitude higher accuracy, and is a harder challenge.
Citywide population estimation from mobile network metadata has been addressed by a limited number of works in the literature. In , voice calls, text message, and Internet activity metadata, combined with Twitter records, were found to be highly correlated with the number of people at specific city locations (i.e., sports arena and airport) during a target time interval. In , LandScan™, a tool for ambient population estimation, was employed to explore the relationship between the voice call activity and the underlying inhabitant density, at a 1-km resolution. The authors found a weak correlation of 0.24, later improved to 0.45 by limiting the analysis to selected time intervals rather than considering the daily communication volume. In , telecommunication data was mixed with a number of other sources, including information on land use, road networks, satellite nightlights, and slope. By feeding such data to a asymmetric modeling approach, the authors obtained a high 0.92 correlation with census information. This solution is employed by the Worldpop initiative, which aims at building fine-grained maps of static population densities in underdeveloped countries . However, the high correlation mentioned before is a nationwide average, and the authors indicate that the accuracy is lower for the most densely populated areas, i.e., large cities. Indeed, in such areas, a normalized error of around 0.6 is measured in , whereas we obtain values below 0.1.
The current state-of-the-art in the estimation of citywide population distribution from mobile network metadata is represented by the approaches in  and in . Section 7 provides a thorough comparative evaluation. In addition to granting improved performance, our solution is the first that is (i) evaluated in multiple cities so as to demonstrate its general viability, and (ii) validated using attendances at sports and public events as ground-truth dynamic populations.
Finally, an earlier version of this paper focuses on static population estimation, and only briefly sketches the multivariate model for the dynamic population estimation . In addition to general presentation improvements, the present document improves the conference version by including: (i) a much sounder discussion of the multivariate model design; (ii) an original validation methodology for dynamic population estimates; (iii) a comprehensive evaluation of the model performance with respect to ground truth information and in comparison with state-of-the-art benchmarks.
We introduced a novel approach to the estimation of population density based on mobile network metadata, for both static and dynamic populations.
Building on the well-known power relationship between the mobile network activity density and the population density, our baseline model of static population density yields several unique properties: (i) it builds exclusively on metadata collected by mobile network operators avoiding complex and cumbersome data mixing; (ii) it leverages for the first time subscriber presence data inferred from the mobile communications of each user, which is found to be a much better proxy of the population distribution than previously adopted metrics; (iii) it introduces a number of original data filters based on time and land use that allow refining the population distribution estimates. Thanks to these features, when tested with substantial real-world metadata, our model outperforms previous proposals in the literature, allowing for a reliable representation of static populations across different cities.
In addition, we extended the baseline model to a multivariate version that exploits a new time-invariant linear relationship between the power law parameters and the subscriber activity levels. The resulting multivariate model can be used to determine population densities in a dynamic fashion, as proven by evaluations with real-world mobile network metadata. Specifically, our multivariate model can reproduce typical daily activity trends in urban areas, and shows good accuracy in estimating attendance at large-scale sports and social events.
The authors would like to thank Angelo Furno for his help with extracting land use information. This research was partly supported by the European Union Horizon 2020 Framework Programme under REA grant agreement no.778305 DAWN4IoE, and by BPI-France through the FUI FluidTracks project.
-  D. Naboulsi, M. Fiore, R. Stanica, and S. Ribot, “Large-scale mobile traffic analysis: a survey,” IEEE Communications Surveys and Tutorials, vol. 18, no. 1, 2016.
-  C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of predictability in human mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
-  Y. Yang, C. Herrera, N. Eagle, and M. C. González, “Limits of predictability in commuting flows in the absence of data for calibration,” Nature Scientific Reports, vol. 4, no. 5662, 2014.
-  P. Bajardi, C. Poletto, J. J. Ramasco, M. Tizzoni, V. Colizza, and A. Vespignani, “Human mobility networks, travel restrictions, and the global spread of 2009 h1n1 pandemic,” PloS one, vol. 6, no. 1, p. e16591, 2011.
-  J. P. Bagrow, D. Wang, and A.-L. Barabasi, “Collective response of human populations to large-scale emergencies,” PloS one, vol. 6, no. 3, p. e17680, 2011.
-  Telefonica smart steps. [Online]. Available: http://dynamicinsights.telefonica.com/smart-steps/
-  Orange flux vision. [Online]. Available: http://www.orange-business.com/fr/produits/flux-vision
-  D. Coleman, “The twilight of the census,” Population and Development Review, vol. 38, pp. 334–351, 2013.
-  C. Ratti, R. Pulselli, S. Williams, and D. Frenchman, “Mobile land- scapes: Using location data from cell-phones for urban analysis,” Environment and Planning B Planning and Design, vol. 33, no. 5, 2006.
-  G. Krings, F. Calabrese, C. Ratti, and V. D. Blondel, “Urban gravity: a model for inter-city telecommunication flows,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2009, no. 07, p. L07003, 2009.
-  F. Botta, H. S. Moat, and T. Preis, “Quantifying crowd size with mobile phone and twitter data,” Royal Society open science, vol. 2, no. 5, p. 150162, 2015.
-  C. Kang, Y. Liu, X. Ma, and L. Wu, “Towards estimating urban population distributions from mobile call data,” Journal of Urban Technology, vol. 19, no. 4, pp. 3–21, 2012.
-  P. Deville, C. Linard, S. Martin, M. Gilbert, F. R. Stevens, A. E. Gaughan, V. D. Blondel, and A. J. Tatem, “Dynamic population mapping using mobile phone data,” Proceedings of the National Academy of Sciences, vol. 111, no. 45, pp. 15 888–15 893, 2014.
R. W. Douglass, D. A. Meyer, M. Ram, D. Rideout, and D. Song, “High resolution
population estimates from telecommunications data,”
EPJ Data Science, vol. 4, no. 1, pp. 1–13, 2015.
-  F. Xu, P. Zhang, and Y. Li, “Context-aware real-time population estimation for metropolis,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’16, 2016, pp. 1064–1075.
-  Estimated dynamic population density datasets. [Online]. Available: https://zenodo.org/record/1012194.
-  Telecom italia big data challenge. [Online]. Available: http://www.telecomitalia.com/tit/en/innovazione/big-data-challenge-2015.html.
-  A. Furno, M. Fiore, R. Stanica, C. Ziemlicki, and Z. Smoreda, “A tale of ten cities: Characterizing signatures of mobile traffic in urban areas,” IEEE Transactions on Mobile Computing, vol. 16, no. 10, pp. 2682–2696, Oct 2017.
-  L. M. A. Bettencourt, “The origins of scaling in cities,” Science, vol. 340, no. 6139, pp. 1438–1441, 2013. [Online]. Available: http://science.sciencemag.org/content/340/6139/1438
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
-  Rassegna.it. [Online]. Available: http://www.rassegna.it/articoli/25-aprile-in-60mila-alla-manifestazione-di-milano
-  Repubblica.it. [Online]. Available: http://milano.repubblica.it/cronaca/2015/04/25/news/milano_corteo_per_il_25_aprile_contestata_la_brigata_ebraica-112820135/
-  M. Haklay and P. Weber, “Openstreetmap: User-generated street maps,” IEEE Pervasive Computing, vol. 7, no. 4, pp. 12–18, Oct 2008.
-  G. Khodabandelou, V. Gauthier, M. El-Yacoubi, and M. Fiore, “Population estimation from mobile network traffic metadata,” in IEEE WoWMoM, 2016, pp. 317–326.
-  B. C. Csáji, A. Browet, V. A. Traag, J.-C. Delvenne, E. Huens, P. Van Dooren, Z. Smoreda, and V. D. Blondel, “Exploring the mobility of mobile phone users,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 6, pp. 1459–1473, 2013.
-  S. Bekhor, Y. Cohen, and C. Solomon, “Evaluating long-distance travel patterns in israel by tracking cellular phone positions,” Journal of Advanced Transportation, vol. 47, no. 4, pp. 435–446, 2013.
-  F. Calabrese, G. Di Lorenzo, L. Liu, and C. Ratti, “Estimating origin-destination flows using mobile phone location data,” IEEE Pervasive Computing, vol. 10, no. 4, pp. 0036–44, 2011.
-  Worldpop. [Online]. Available: http://worldpop.org.uk/