1 Introduction
A crucial aspect for the wellbeing of an urban area in the context of the much discussed “SmartCity” argument refers to the monitoring of the dynamic of people’ presences. To do so, thanks to a research collaboration of DMS StatLab (https://sites.google.com/a/unibs.it/dmsstatlab/) with the Statistical Office of the Municipality of Brescia, we use mobile phone data provided by Telecom Italia Mobile (TIM), quantifying the number of TIM users connected to the smartphone over a spatial grid characterized by its latitude and longitude, and over the time.
We estimate the presence of TIM users in the city of Brescia by classifying similar days in terms of both the spatial and the temporal dimension. In doing so, we use a larger (
2 years) dataset compared to that used in similar works (Zanini et al. zanini16understandingfor example). To manage with high dimensional data, we employ a multistage procedure that, in the first step, converts the data matrix containing the values of the grid (2D) at each point in time to a vector (1D) of features using a method borrowed from image processing (the Histogram of Oriented Grandients  HOG). The procedure, by classifying similar days using a mix of traditional (kmeans) and functional data clustering techniques, is able to identify the dynamic related to the presences of TIM users in the city.
However, for a proper evaluation, all the people should be counted in. In absence of the data about other companies’ users, the coefficient related to TIM market share could be applied. Unfortunately, this percentage is just available at country level, while we have reason to think that the market share varies along different cities due to socioeconomic and demographic reasons. So, the aim of this work focuses on the estimation of the number of people at a city level. In doing so, we employ a record linkage based on comparing residents from administrative archives with the number of TIM users on specific areas of the grid during a sample of proper time periods.
2 Data
This work focuses on mobile phone data provided by Telecom Italia Mobile (TIM), which is currently the largest operator in Italy in this sector. These data sometimes goes with the name of “Erlang” measures. In detail, the data refer to the mobile phone activity recorded in the period April 1st 2014 to August 11th 2016, in a rectangular region defined by latitude 45.21 N  46.36 N and longitude 9.83 N  10.85 N. Data were aggregated into 923 x 607 cells of 150 size each. Data are available at intervals of 15 minutes, for a total of more than 40,000 millions of records collected. For each cell and for each time interval, the corresponding record refers to the average number of mobile phones simultaneously connected to the network in that area in that time interval. The mobility feature of these data is hidden, in the sense it is not possible to trace the single person over the time. ^{1}^{1}1Works following the single person over the time exist. Unfortunately, those kind of data are available subject to the administration of a survey, that is usually limited in terms of time and usually cover a small portion of the population (Zhaedi and Shafahi zahedi18estimating ). Similar data has been used by Carpita and Simonetto carpita14big , who analyzed the presence of people during big events in the city of Brescia, by Zanini et al. zanini16understanding
, who find, by mean of a Independent Component Analysis (ICA), a number of spatial components that separate main areas of the city of Milano, and by others (Metulini and Carpita
metulini18on Manfredini et al. manfredini15treelet and Secchi et al. secchi17analysis ).3 Estimating TIM Density Profiles
To characterize the dynamic of the presences in the city by defining daily density profiles (DDPs) is an object of interest. With a DDP we consider the curve representing the number of people in a defined rectangular region at different time periods during the day. We develop a procedure to group similar days in terms of the dynamic of the presences of TIM users considering both the spatial structure and the evolution over the time. By letting be the matrix containing the values of the grid at the quarter of the day :

In the first step, using Histogram of Oriented Gradients (Dalal et al. dalal2005histogram , Tomasi tomasi12histogram ), for each and for each we extract the vector of features from (a standardized version of the matrix such that min/max is in the interval [0,100] for all and for all );

in the second step, we stack together the vectors of features at a daily basis and we perform a traditional kmeans cluster analysis where the objects to be clustered are the days and the variables are all the stacked vector of features. With this step we aim at group days with similarities in the spatial structure of the grid;

in the third step, for each cluster, we perform a functional data analysis (FDA) clustering technique (Bouveyron and Comebouveyron15discriminative ) to further group days. In this step we group days that are similar in terms of the shape of the curve;

in the fourth step we define, for each final group, confidence intervals for the DDP, by using functional box plots (Sun and Genton
sun11functional ; sun12adjusted ).
Much details can be found in Metulini and Carpita metulini19tobe . The output of the procedure is a functional box plot on the DDPs of a group of similar days. For example, by applying the procedure to the rectangular grid defined by latitude 45.516 N  46.564 N and longitude 10.18 N  10.245 N (roughly corresponding to the municipality of Brescia) and composed by 39 x 39 cells, we find that most of the week days of the Summer 2016 belongs to the same group. Functional box plots of that days (Figure 1) highlight the amount of TIM users along different quarters, that varies, by month and by quarter, from a minimum of 30 to a maximum of about 55 thousands of TIM users.
These values give us a series of information, for example that the number of people in the city increases during the morning and the afternoon hours and decreases during the night. However, the same values give us no information on the number of total people.
4 The linkage strategy
With the procedure in section 3 we are able to quantify the number of TIM users in the city in selected days and in particular hours of the day. However, to have an idea of the total number of people we have to consider users of other mobile phone companies as well. This data is often unavailable, unless to an onerous cost. An alternative approach is to apply the TIM mobile phone market share coefficient to the number of TIM users to retrieve the total number of people. A countrylevel estimate being available through “Il Sole 24 Ore” newspaper. This value stands to 30.2 % (2016, December). However, we have reasons to think that TIM market share varies along cities since socioeconomic and demographic variables (Table 1) highlight significant differences.
Quantity  Municipality of Brescia  Italy 

Percapita revenues (Euro / year)  23,418  19,514 
% foreigners  18.5  8.5 
Average number of people per family  2.11  2.33 
Average age  45.8  44.7 
1 MEF Dipartimento delle Finanze (2016)  
2 ISTAT (2017) 
To estimate the market share coefficient for the municipality of Brescia we employ the following strategy. We compare the data on the number of residents from administrative archives with the number of TIM users on a selected region during a specific hour of the day.
To the sake of comparison, we assume that, during late evening hours, residential areas are populated just by residents. ISTAT (https://www.istat.it/it/archivio/104317) published geographical data called “Basi territoriali e variabili censuarie” in the form of a shape file with data (the socalled SpatialPolygonDataFrame in R language). For the municipalities with more than 20,000 residents, ISTAT aggregates the region at a “Sezioni di censimento” (SC) level. To have an idea, the municipality of Brescia has 1,836 SCs. The shape file contains the information on the number of residents by SC. We compute the number of TIM users in each SC by putting together the grid data of the TIM users with the shape files containing the value of residents. The grid cells have regular size while SCs are irregular size polygons. So, to count the number of TIM users in each polygon we apply a weighted scheme based on the portion of the polygon contained in the cell^{2}^{2}2In doing so, we had to convert the coordinate reference system of the spatial polygon object from ”UTM” to “longlat”. ISTAT provides shape files in WGS84 international coordinate reference system.. For example, SC 110, located at latitude 45.544 N and longitude 10.217 N (Figure 2) overlaps with 4 cells. Cell 1, at 9pm  October 28th 2015, counts for 682 TIM users, cell 2 counts for 555 users, cells 3 for 677 and cell 4 751. Moreover, 8.3% of the polygon lies in cell 1, 27.0% in cell 2, 26.4% in cell 3 and 38.2% in cell 4.
The number of TIM users in SC 110 is estimated as:
After having computed this number for every SC, we then compute the ratio among TIM users and residents^{3}^{3}3In the count of “Residents” we exclude elderly people (80 years) and children ( 11 years), assuming that we want to consider just those with a smartphone.:
Some descriptive statistics on the ratio index by SC, evaluated on TIM users at 9pm  October 28th 2015, are reported in Table
2. The median value is consistent with the TIM market share at a country level. However, the values in the right tail of the distribution are extremely large. For example, the 95th percentile stands to 5.567, meaning that the number of TIM users retrieved in that polygon are five times as large as the number of residents.min  5th percentile  25th percentile  median  75th percentile  95th percentile  max 
0.006  0.070  0.139  0.245  0.547  5.567  347.024 
The map in Figure 3 confirms this evidence^{4}^{4}4A dynamic map can be found at authors personal websites.. We find that the ratio value in residential areas is quite homogeneous. Unfortunately, we also notice how TIM assignment of the users in space is affected by the localization of Antennas. So, in many areas (particularly those larger ones where the number of residents is small) the index is overestimated. This fact suggests that that the Antenna is often localized in large SCs where few people reside. So, people on residential areas are likely tracked where the antenna is (i.e. up to few hundred meters far from where the person is). For this reason, we decide to visually detect in the map residential areas that are located close to anomalies (i.e. where the antenna is). The strategy is to consider the index for a selected residential area plus a portion of region that surrounds the area. Residential areas are chosen according to DUSAF (Destinazione d’Uso dei Suoli Agricoli e Forestali) maps ( https://www.dati.lombardia.it/Territorio/Dusaf50Usodelsuolo2015/iq6ru7y2). A zoom of the map is reported in Figure 5 (Villaggio Sereno residential areas plus a shopping center) and Figure 5 (San Polo residential area plus a factory). The index in the first area stands to 0.265 and the same value stands to 0.310 in the area of San Polo.
5 Conclusions
Mobile phone data can be used to count the number of people in the city considering the dynamics over the time, differently to administrative archives. However, generally, just data on a portion of phone users is available (i.e. just the users of a company). For this reason we have developed a method to estimate the market share coefficient of the company at a municipality level  whereas the market share is available only at country level  by using administrative data on the number of residents by “Sezione di censimento”. In doing so we take into account for the bias caused by the location of the antennas. Results are consistent with the market share at a national level.
Acknowledgements.
Authors are grateful with the Statistical Office of the Municipality of Brescia, with a special mention to Dr. Marco Palamenghi, who kindly supported us with providing the data.Bibliography
 (1) Zanini, P., Shen, H., & Truong, Y. Understanding resident mobility in Milan through independent component analysis of Telecom Italia mobile usage data. The Annals of Applied Statistics, vol. 10(2), pp. 812833 (2016).
 (2) Blum, O., Calvo, R.,Geospatial data collection and analysis as crucial processes in an integrated census. Israel Central Bureau of Statistics (2001)
 (3) Xu, S., Flexner, S., Carvalho, V. Geocoding Billions of Addresses: Toward a Spatial Record Linkage System with Big Data (2012).
 (4) Zahedi, S., & Shafahi, Y. Estimating activity patterns using spatiotemporal data of cell phone networks. International Journal of Urban Sciences, 22(2), 162179 (2018).
 (5) Carpita, M., & Simonetto, A. Big Data to Monitor Big Social Events: Analysing the mobile phone signals in the Brescia Smart City. Electronic Journal of Applied Statistical Analysis: Decision Support Systems and Services Evaluation, vol. 5(1), pp. 3141. (2014).
 (6) Metulini, R., Carpita, M., On Clustering Daily Mobile Phone Density Profiles, Workshop ”High Dimensional Small Data” (Ca Foscari  Venice) (2018).
 (7) Manfredini, F., Pucci, P., Secchi, P., Tagliolato, P., Vantini, S., & Vitelli, V. Treelet decomposition of mobile phone data for deriving city usage and mobility pattern in the Milan urban region. In Advances in complex data modeling and computational methods in statistics (pp. 133147). Springer, Cham. (2015).
 (8) Secchi, P., Vantini, S., & Zanini, P. Analysis of Mobile Phone Data for Deriving City Mobility Patterns. In Electric Vehicle Sharing Services for Smarter Cities (pp. 3758). Springer, Cham (2017).

(9)
Dalal, N., & Triggs, B. Histograms of oriented gradients for human detection. In international Conference on computer vision & Pattern Recognition (CVPR’05) (Vol. 1, pp. 886893). IEEE Computer Society. (2005).
 (10) Tomasi, C. Histograms of oriented gradients. Computer Vision Sampler, pp. 16. (2012).
 (11) Bouveyron, C., Côme, E., & Jacques, J. The discriminative functional mixture model for a comparative analysis of bike sharing systems. The Annals of Applied Statistics, vol. 9(4), pp. 17261760. (2015).
 (12) Sun, Y., & Genton, M. G. Functional boxplots. Journal of Computational and Graphical Statistics, vol. 20(2), pp. 316334. (2011).

(13)
Sun, Y., & Genton, M. G. Adjusted functional boxplots for spatio‐temporal data visualization and outlier detection. Environmetrics, vol. 23(1), pp. 5464. (2012).
 (14) Metulini, R., & Carpita, M., The HOGFDA Approach with Mobile Phone Data to Modeling the Dynamic of People’s Presences in the City, IES 2019  Statistical evaluation systems at 360: techniques, technologies and new frontiers (2019).