A strategy for the matching of mobile phone signals with census data

by   Rodolfo Metulini, et al.
University of Brescia

Administrative data allows us to count for the number of residents. The geo-localization of people by mobile phone, by quantifying the number of people at a given moment in time, enriches the amount of useful information for "smart" (cities) evaluations. However, using Telecom Italia Mobile (TIM) data, we are able to characterize the spatio-temporal dynamic of the presences in the city of just TIM users. A strategy to estimate total presences is needed. In this paper we propose a strategy to extrapolate the number of total people by using TIM data only. To do so, we apply a spatial record linkage of mobile phone data with administrative archives using the number of residents at the level of sezione di censimento.


page 5

page 6

page 7


The HOG-FDA Approach with Mobile Phone Data to Modeling the Dynamic of People's Presences in the City

In the context of Smart City, the dynamic of the presence of people can ...

Inferring land use from mobile phone activity

Understanding the spatiotemporal distribution of people within a city is...

Geographical veracity of indicators derived from mobile phone data

In this contribution we summarize insights on the geographical veracity ...

Iranian cashes recognition using mobile

In economical societies of today, using cash is an inseparable aspect of...

Phone Sharing and Cash Transfers in Togo: Quantitative Evidence from Mobile Phone Data

Phone sharing is pervasive in many low- and middle-income countries, aff...

How to Measure Sessions of Mobile Device Use? Quantification, Evaluation, and Applications

Research on mobile phone use often starts with a question of "How much t...

A Generalisable Data Fusion Framework to Infer Mode of Transport Using Mobile Phone Data

Cities often lack up-to-date data analytics to evaluate and implement tr...

1 Introduction

A crucial aspect for the well-being of an urban area in the context of the much discussed “Smart-City” argument refers to the monitoring of the dynamic of people’ presences. To do so, thanks to a research collaboration of DMS StatLab (https://sites.google.com/a/unibs.it/dms-statlab/) with the Statistical Office of the Municipality of Brescia, we use mobile phone data provided by Telecom Italia Mobile (TIM), quantifying the number of TIM users connected to the smartphone over a spatial grid characterized by its latitude and longitude, and over the time.

We estimate the presence of TIM users in the city of Brescia by classifying similar days in terms of both the spatial and the temporal dimension. In doing so, we use a larger (

2 years) dataset compared to that used in similar works (Zanini et al. zanini16understanding

for example). To manage with high dimensional data, we employ a multi-stage procedure that, in the first step, converts the data matrix containing the values of the grid (2-D) at each point in time to a vector (1-D) of features using a method borrowed from image processing (the Histogram of Oriented Grandients - HOG). The procedure, by classifying similar days using a mix of traditional (k-means) and functional data clustering techniques, is able to identify the dynamic related to the presences of TIM users in the city.

However, for a proper evaluation, all the people should be counted in. In absence of the data about other companies’ users, the coefficient related to TIM market share could be applied. Unfortunately, this percentage is just available at country level, while we have reason to think that the market share varies along different cities due to socio-economic and demographic reasons. So, the aim of this work focuses on the estimation of the number of people at a city level. In doing so, we employ a record linkage based on comparing residents from administrative archives with the number of TIM users on specific areas of the grid during a sample of proper time periods.

All in all, this work regards the match of administrative archives with data from technological devices, adopting a method that reminds spatial record linkages (Blum and Calvo blum01geo and Xu et al. xu12geo ), for the purpose of quantifying the presences in the city when some data are missings.

Section 2 describes the mobile phone data, section 3 summarizes the procedure adopted to classify similar days and to quantify the number of TIM users in those days. Section 4 outlines the record linkage strategy adopted and section 5 discusses and concludes.

2 Data

This work focuses on mobile phone data provided by Telecom Italia Mobile (TIM), which is currently the largest operator in Italy in this sector. These data sometimes goes with the name of “Erlang” measures. In detail, the data refer to the mobile phone activity recorded in the period April 1st 2014 to August 11th 2016, in a rectangular region defined by latitude 45.21 N - 46.36 N and longitude 9.83 N - 10.85 N. Data were aggregated into 923 x 607 cells of 150 size each. Data are available at intervals of 15 minutes, for a total of more than 40,000 millions of records collected. For each cell and for each time interval, the corresponding record refers to the average number of mobile phones simultaneously connected to the network in that area in that time interval. The mobility feature of these data is hidden, in the sense it is not possible to trace the single person over the time. 111Works following the single person over the time exist. Unfortunately, those kind of data are available subject to the administration of a survey, that is usually limited in terms of time and usually cover a small portion of the population (Zhaedi and Shafahi zahedi18estimating ). Similar data has been used by Carpita and Simonetto carpita14big , who analyzed the presence of people during big events in the city of Brescia, by Zanini et al. zanini16understanding

, who find, by mean of a Independent Component Analysis (ICA), a number of spatial components that separate main areas of the city of Milano, and by others (Metulini and Carpita

metulini18on Manfredini et al. manfredini15treelet and Secchi et al. secchi17analysis ).

3 Estimating TIM Density Profiles

To characterize the dynamic of the presences in the city by defining daily density profiles (DDPs) is an object of interest. With a DDP we consider the curve representing the number of people in a defined rectangular region at different time periods during the day. We develop a procedure to group similar days in terms of the dynamic of the presences of TIM users considering both the spatial structure and the evolution over the time. By letting be the matrix containing the values of the grid at the quarter of the day :

  • In the first step, using Histogram of Oriented Gradients (Dalal et al. dalal2005histogram , Tomasi tomasi12histogram ), for each and for each we extract the vector of features from (a standardized version of the matrix such that min/max is in the interval [0,100] for all and for all );

  • in the second step, we stack together the vectors of features at a daily basis and we perform a traditional k-means cluster analysis where the objects to be clustered are the days and the variables are all the stacked vector of features. With this step we aim at group days with similarities in the spatial structure of the grid;

  • in the third step, for each cluster, we perform a functional data analysis (FDA) clustering technique (Bouveyron and Comebouveyron15discriminative ) to further group days. In this step we group days that are similar in terms of the shape of the curve;

  • in the fourth step we define, for each final group, confidence intervals for the DDP, by using functional box plots (Sun and Genton

    sun11functional ; sun12adjusted ).

Much details can be found in Metulini and Carpita metulini19tobe . The output of the procedure is a functional box plot on the DDPs of a group of similar days. For example, by applying the procedure to the rectangular grid defined by latitude 45.516 N - 46.564 N and longitude 10.18 N - 10.245 N (roughly corresponding to the municipality of Brescia) and composed by 39 x 39 cells, we find that most of the week days of the Summer 2016 belongs to the same group. Functional box plots of that days (Figure 1) highlight the amount of TIM users along different quarters, that varies, by month and by quarter, from a minimum of 30 to a maximum of about 55 thousands of TIM users.

Figure 1:

Functional box plots of the Daily Density Profiles of Summer 2016 (y-axis, in thousands of people), divided by month. Quarters are in the x-axis. Median (black curve), envelope (purple area), 1.5 * envelope (blue curves), outliers (red curves). Our Elaborations.

These values give us a series of information, for example that the number of people in the city increases during the morning and the afternoon hours and decreases during the night. However, the same values give us no information on the number of total people.

4 The linkage strategy

With the procedure in section 3 we are able to quantify the number of TIM users in the city in selected days and in particular hours of the day. However, to have an idea of the total number of people we have to consider users of other mobile phone companies as well. This data is often unavailable, unless to an onerous cost. An alternative approach is to apply the TIM mobile phone market share coefficient to the number of TIM users to retrieve the total number of people. A country-level estimate being available through “Il Sole 24 Ore” newspaper. This value stands to 30.2 % (2016, December). However, we have reasons to think that TIM market share varies along cities since socio-economic and demographic variables (Table 1) highlight significant differences.

Quantity Municipality of Brescia Italy
Per-capita revenues (Euro / year) 23,418 19,514
% foreigners 18.5 8.5
Average number of people per family 2.11 2.33
Average age 45.8 44.7
1 MEF -Dipartimento delle Finanze (2016)
2 ISTAT (2017)
Table 1: Socio-economic and demographic comparison between Municipality of Brescia and Italy.

To estimate the market share coefficient for the municipality of Brescia we employ the following strategy. We compare the data on the number of residents from administrative archives with the number of TIM users on a selected region during a specific hour of the day.

To the sake of comparison, we assume that, during late evening hours, residential areas are populated just by residents. ISTAT (https://www.istat.it/it/archivio/104317) published geographical data called “Basi territoriali e variabili censuarie” in the form of a shape file with data (the so-called SpatialPolygonDataFrame in R language). For the municipalities with more than 20,000 residents, ISTAT aggregates the region at a “Sezioni di censimento” (SC) level. To have an idea, the municipality of Brescia has 1,836 SCs. The shape file contains the information on the number of residents by SC. We compute the number of TIM users in each SC by putting together the grid data of the TIM users with the shape files containing the value of residents. The grid cells have regular size while SCs are irregular size polygons. So, to count the number of TIM users in each polygon we apply a weighted scheme based on the portion of the polygon contained in the cell222In doing so, we had to convert the coordinate reference system of the spatial polygon object from ”UTM” to “longlat”. ISTAT provides shape files in WGS84 international coordinate reference system.. For example, SC 110, located at latitude 45.544 N and longitude 10.217 N (Figure 2) overlaps with 4 cells. Cell 1, at 9pm - October 28th 2015, counts for 682 TIM users, cell 2 counts for 555 users, cells 3 for 677 and cell 4 751. Moreover, 8.3% of the polygon lies in cell 1, 27.0% in cell 2, 26.4% in cell 3 and 38.2% in cell 4.


Figure 2: Weighting scheme to assign the number of TIM users to SC 110, located at latitude 45.544 N and longitude 10.217 N.

The number of TIM users in SC 110 is estimated as:

After having computed this number for every SC, we then compute the ratio among TIM users and residents333In the count of “Residents” we exclude elderly people (80 years) and children ( 11 years), assuming that we want to consider just those with a smartphone.:

Some descriptive statistics on the ratio index by SC, evaluated on TIM users at 9pm - October 28th 2015, are reported in Table

2. The median value is consistent with the TIM market share at a country level. However, the values in the right tail of the distribution are extremely large. For example, the 95th percentile stands to 5.567, meaning that the number of TIM users retrieved in that polygon are five times as large as the number of residents.

min 5th percentile 25th percentile median 75th percentile 95th percentile max
0.006 0.070 0.139 0.245 0.547 5.567 347.024
Table 2: Quartiles distribution of the index in the municipality of Brescia, by SC. 9pm - October 28th 2015.


Figure 3: Map of the by SC in the municipality of Brescia, 9pm - October 28th 2015. Colors go from blue (small ratios) to yellow (large ratio).

The map in Figure 3 confirms this evidence444A dynamic map can be found at authors personal websites.. We find that the ratio value in residential areas is quite homogeneous. Unfortunately, we also notice how TIM assignment of the users in space is affected by the localization of Antennas. So, in many areas (particularly those larger ones where the number of residents is small) the index is overestimated. This fact suggests that that the Antenna is often localized in large SCs where few people reside. So, people on residential areas are likely tracked where the antenna is (i.e. up to few hundred meters far from where the person is). For this reason, we decide to visually detect in the map residential areas that are located close to anomalies (i.e. where the antenna is). The strategy is to consider the index for a selected residential area plus a portion of region that surrounds the area. Residential areas are chosen according to DUSAF (Destinazione d’Uso dei Suoli Agricoli e Forestali) maps ( https://www.dati.lombardia.it/Territorio/Dusaf-5-0-Uso-del-suolo-2015/iq6r-u7y2). A zoom of the map is reported in Figure 5 (Villaggio Sereno residential areas plus a shopping center) and Figure 5 (San Polo residential area plus a factory). The index in the first area stands to 0.265 and the same value stands to 0.310 in the area of San Polo.

Figure 4: Zoom of the map for “Villaggio Sereno” area. Latitude 45.505 N - 45.523 N and longitude 10.179 N - 10.194 N.
Figure 5: Zoom of the map for “San Polo” area. Latitude 45.511 N - 45.519 N and longitude 10.235 N - 10.250 N.

5 Conclusions

Mobile phone data can be used to count the number of people in the city considering the dynamics over the time, differently to administrative archives. However, generally, just data on a portion of phone users is available (i.e. just the users of a company). For this reason we have developed a method to estimate the market share coefficient of the company at a municipality level - whereas the market share is available only at country level - by using administrative data on the number of residents by “Sezione di censimento”. In doing so we take into account for the bias caused by the location of the antennas. Results are consistent with the market share at a national level.

Authors are grateful with the Statistical Office of the Municipality of Brescia, with a special mention to Dr. Marco Palamenghi, who kindly supported us with providing the data.


  • (1) Zanini, P., Shen, H., & Truong, Y. Understanding resident mobility in Milan through independent component analysis of Telecom Italia mobile usage data. The Annals of Applied Statistics, vol. 10(2), pp. 812-833 (2016).
  • (2) Blum, O., Calvo, R.,Geospatial data collection and analysis as crucial processes in an integrated census. Israel Central Bureau of Statistics (2001)
  • (3) Xu, S., Flexner, S., Carvalho, V. Geocoding Billions of Addresses: Toward a Spatial Record Linkage System with Big Data (2012).
  • (4) Zahedi, S., & Shafahi, Y. Estimating activity patterns using spatio-temporal data of cell phone networks. International Journal of Urban Sciences, 22(2), 162-179 (2018).
  • (5) Carpita, M., & Simonetto, A. Big Data to Monitor Big Social Events: Analysing the mobile phone signals in the Brescia Smart City. Electronic Journal of Applied Statistical Analysis: Decision Support Systems and Services Evaluation, vol. 5(1), pp. 31-41. (2014).
  • (6) Metulini, R., Carpita, M., On Clustering Daily Mobile Phone Density Profiles, Workshop ”High Dimensional Small Data” (Ca Foscari - Venice) (2018).
  • (7) Manfredini, F., Pucci, P., Secchi, P., Tagliolato, P., Vantini, S., & Vitelli, V. Treelet decomposition of mobile phone data for deriving city usage and mobility pattern in the Milan urban region. In Advances in complex data modeling and computational methods in statistics (pp. 133-147). Springer, Cham. (2015).
  • (8) Secchi, P., Vantini, S., & Zanini, P. Analysis of Mobile Phone Data for Deriving City Mobility Patterns. In Electric Vehicle Sharing Services for Smarter Cities (pp. 37-58). Springer, Cham (2017).
  • (9)

    Dalal, N., & Triggs, B. Histograms of oriented gradients for human detection. In international Conference on computer vision & Pattern Recognition (CVPR’05) (Vol. 1, pp. 886-893). IEEE Computer Society. (2005).

  • (10) Tomasi, C. Histograms of oriented gradients. Computer Vision Sampler, pp. 1-6. (2012).
  • (11) Bouveyron, C., Côme, E., & Jacques, J. The discriminative functional mixture model for a comparative analysis of bike sharing systems. The Annals of Applied Statistics, vol. 9(4), pp. 1726-1760. (2015).
  • (12) Sun, Y., & Genton, M. G. Functional boxplots. Journal of Computational and Graphical Statistics, vol. 20(2), pp. 316-334. (2011).
  • (13)

    Sun, Y., & Genton, M. G. Adjusted functional boxplots for spatio‐temporal data visualization and outlier detection. Environmetrics, vol. 23(1), pp. 54-64. (2012).

  • (14) Metulini, R., & Carpita, M., The HOG-FDA Approach with Mobile Phone Data to Modeling the Dynamic of People’s Presences in the City, IES 2019 - Statistical evaluation systems at 360: techniques, technologies and new frontiers (2019).