Country-scale Exploratory Analysis of Call Detail Records through the Lens of Data Grid Models

03/20/2015 ∙ by Romain Guigourès, et al.

Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information related to several dimensions of the calls made through the network: the source, destination, date and time of calls. CDRs data analysis has received much attention in recent years since it might reveal valuable information about human behavior. It has shown high added value in many application domains, e.g., community analysis or network planning. In this paper, we suggest a generic methodology for summarizing the information contained in CDRs data. The method is based on a parameter-free estimation of the joint distribution of the variables that describe the calls. We also suggest several well-founded criteria that allow one to browse the summary at various granularities and to explore it by means of insightful visualizations. The method handles network graph data, temporal sequence data as well as user mobility data stemming from the original CDRs data. We show the relevance of our methodology on various case studies on real-world CDRs data from Ivory Coast.


1 Introduction

Telco operators’ activities generate massive volumes of data, mainly from three sources: networks, service platforms and customer databases. In particular, the use of mobile phones generates the so-called Call Detail Records (CDRs), containing information about end-point antenna stations, date, time and duration of the calls (the content of the calls is excluded). While this data is initially stored for billing purposes, useful information and knowledge (related to human mobility [20, 1], social interactions and economic activities) might be derived from the large sets of CDRs collected by the operators.

Recent studies have shown the potential added value of analyzing such data for several application domains: United Nations Global Pulse [19] sums up some recent research works on how the analysis of CDRs can provide valuable information for humanitarian and development purposes, e.g., for disaster response in Haiti, combating H1N1 flu in Mexico, etc. Also, leveraging country-scale sets of CDRs in Ivory Coast, the recent Orange D4D challenge (Data For Development [5]) has given rise to many investigations in several application domains [4] such as health improvement, analysis of economic indicators and population statistics, community understanding, city and transport planning, tourism and event analysis, emergency, alerting and prevention management, and mobile network infrastructure monitoring. Thus, the added value of CDRs data analysis no longer needs to be demonstrated.

Various classical data mining techniques have been applied to CDRs data depending on the features and the task considered: e.g., considering network graphs from (source antenna, destination antenna) data or temporal sequences from (source antenna, date) data calls for different clustering techniques for summarizing the information in the data.

Contribution: in this paper, we suggest an efficient and generic methodology for summarizing CDRs data, whichever features are retained in the analysis. The method is based on data grid models [6], a parameter-free joint distribution estimation technique that simultaneously partitions the sets of values taken by each variable describing the data (numerical variables are discretized into intervals while the categories of categorical variables are grouped into clusters). The resulting data grid, which can be seen as a coclustering, constitutes the summary of the data. The method is thus able to summarize various types of data stemming from CDRs: network graph data, temporal sequence data as well as user mobility data. We also suggest several criteria (i) to exploit the resulting data grid at various granularities depending on the needs of the analysis and (ii) to interpret the results through meaningful visualizations.

Outline: in the next section, we give a brief description of the CDRs data and of the various case studies we conducted on the data. Section 3 recalls the main principles of data grid models and introduces the criteria for exploiting the resulting data grid. In Section 4, we report the experimental results on the various case studies. We discuss further related work in Section 5 before concluding.

2 Data description & studies

The CDRs data under study come from the Orange D4D challenge (http://d4d.orange.com/en/home; Data For Development [5]). We consider several case studies on two anonymized data sets, namely communication data and mobility data.

2.1 Case studies on communication data

Communication data consist of 471 million mobile calls and cover a 5-month period (from December 1st, 2011 to April 28th, 2012). The records are described by the following four variables:

  1. emitting antenna (1214 categorical values);

  2. receiving antenna (1216 categorical values);

  3. time of call (with hour precision);

  4. date of call (from 2011/12/01 to 2012/04/28);

From this data set, we consider three subsets for:

  1. Analysis of call network between antennas. Considering emitting antennas, receiving antennas and the calls made between antennas, the data set can be seen as a directed multigraph where nodes are antennas and links are the calls between antennas.

  2. Analysis of output traffic w.r.t. date of call. We consider emitting antennas and, for each call, the number of days elapsed since the first day of recording. This data set can be considered as a temporal event sequence spanning the whole observation period, where the time is the number of days elapsed and the events are the emitting antenna IDs.

  3. Analysis of output traffic w.r.t. week day and hour of call. We consider emitting antennas, the day of the week (stemming from the date and considered as a numerical variable) and the hour of the day for each call. Here the time dimension is represented by two variables and the data of the whole period are folded up to week day and hour.
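
The three subsets above are simple projections of each call record. The following minimal Python sketch illustrates them; it is not the authors' preprocessing. The function name `derive_views`, the record layout and the antenna IDs are hypothetical, and the weekday is encoded with Python's convention (Monday = 0), which the text does not specify.

```python
from datetime import datetime

# First day of recording, as stated in the data description.
FIRST_DAY = datetime(2011, 12, 1)

def derive_views(emitting, receiving, timestamp):
    """Return the three views of one call record used in the case studies."""
    # 1. Call network: one directed edge per call.
    network = (emitting, receiving)
    # 2. Temporal sequence: days elapsed since the first day of recording.
    sequence = (emitting, (timestamp - FIRST_DAY).days)
    # 3. Folded view: weekday as a number (Monday = 0, an assumption) and hour.
    folded = (emitting, timestamp.weekday(), timestamp.hour)
    return network, sequence, folded

# Hypothetical record: a call emitted on the last day of the period at 2 pm.
network, sequence, folded = derive_views("A17", "A942", datetime(2012, 4, 28, 14))
print(network)   # ('A17', 'A942')
print(sequence)  # ('A17', 149)
print(folded)    # ('A17', 5, 14), i.e. a Saturday
```

Folding the whole period onto (weekday, hour) discards the calendar date, which is why the second and third case studies use different time variables.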

2.2 Case studies on mobility data

Mobility data consist of mobility traces of 50000 users over a 2-week period (from December 12th, 2012 to December 24th, 2012), i.e. approximately 55 million records. The records are described by the following four variables:

  1. anonymized user ID (50000 categorical values);

  2. connection antenna (1214 categorical values);

  3. time of call (minute precision);

  4. date of call (from 2012/12/12 to 2012/12/24);

From this data, we consider the user trajectories (identified by user ID) inside the network for the following analysis:

  1. Analysis of user mobility w.r.t. week day and hour. We consider the user ID, antennas, week day and hour. This data set can be considered as a set of spatio-temporal footprints, where each user ID is associated with a sequence of antenna usage over the time dimension. Here again, the time dimension is represented by two variables and the data of the whole period is folded up to week day and hour.

3 Exploitation of data grid models

Data grid models aim at estimating the joint distribution between several variables of mixed types (categorical as well as numerical). The main principle is to simultaneously partition the values taken by the variables, into groups/clusters of categories for categorical variables and into intervals for numerical variables. The result is a multidimensional ($k$-D) data grid whose cells are defined by a part of each partitioned variable value set. Notice that, strictly speaking, we are working only with partitions of variable value sets. However, to simplify the discussion we will sometimes use a slightly incorrect formulation by mentioning a “partition of a variable” and a “partitioned variable”.

In order to choose the “best” data grid model $M^*$ (given the data $D$) from the model space $\mathcal{M}$, we use a Bayesian Maximum A Posteriori (MAP) approach. We explore the model space while minimizing a Bayesian criterion, called cost. The cost criterion implements a trade-off between the accuracy and the robustness of the model and is defined as follows:

$$cost(M) = -\log\big(p(M) \times p(D \mid M)\big),$$

which equals $-\log p(M \mid D)$ up to an additive constant.

Thus, the optimal grid $M^*$ is the most probable one (maximum a posteriori) given the data. Due to space limitations, the details of the cost criterion and of the optimization algorithm (called khc) are reported in the appendix. Hereafter, we focus on the tools for exploiting the grid and their applications on large-scale CDRs data. The key features to keep in mind are: (i) khc is parameter-free, i.e., there is no need for setting the number of clusters/intervals per dimension; (ii) khc efficiently provides an effective locally-optimal solution to the data grid model construction, with a time complexity that is sub-quadratic in the number of data points.

3.1 Data grid exploitation and visualization

Because of the very large number of observations in CDRs data, the optimal grid computed by khc can be made of hundreds of parts per dimension, i.e., millions of cells, which is difficult to exploit and interpret. To alleviate this issue, we suggest a grid simplification method together with several criteria that allow us to choose the granularity of the grid for further analysis, to rank values in clusters and to gain insights into the data through meaningful visualizations.

Dissimilarity index and grid structure simplification. We suggest a simplification method of the grid structure that iteratively merges clusters or adjacent intervals, choosing at each step the merge that degrades the grid quality the least. To this end, we introduce a dissimilarity index between clusters or intervals which characterizes the impact of the merge on the cost criterion.

Definition 1 (Dissimilarity index)

Let $c_1$ and $c_2$ be two parts of a variable partition of a grid model $M$. Let $M_{c_1 \cup c_2}$ be the grid after merging $c_1$ and $c_2$. The dissimilarity $\Delta(c_1, c_2)$ between the two parts $c_1$ and $c_2$ is defined as the difference of cost before and after the merge:

$$\Delta(c_1, c_2) = cost(M_{c_1 \cup c_2}) - cost(M) \qquad (1)$$

When merging the clusters that minimize $\Delta$, we obtain the sub-optimal grid $M'$ (with a coarser grain, i.e. simplified) with minimal cost degradation, thus with minimal information loss w.r.t. the grid $M$ before merging. By performing the best merges w.r.t. $\Delta$ iteratively over the variables without distinction, starting from the optimal grid $M^*$ until the null model $M_\emptyset$, agglomerative hierarchies are built. The end-user can stop at the granularity that is necessary for the analysis while controlling either the number of clusters/cells or the information ratio kept in the model. The information ratio of the grid $M'$ is defined as follows:

$$IR(M') = \frac{cost(M') - cost(M_\emptyset)}{cost(M^*) - cost(M_\emptyset)} \qquad (2)$$

where $M_\emptyset$ is the null model (the grid with a single cell).
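
The simplification procedure can be illustrated by a minimal Python sketch, under strong assumptions. The cost criterion is detailed in the appendix and is not reproduced here, so `toy_cost` below is a stand-in (a within-cluster sum of squares over made-up one-dimensional profiles), chosen only because it is minimal for the finest partition and maximal for the single-cluster one. The names `greedy_merges`, `information_ratio` and `profile` are hypothetical, and the sketch handles a single categorical variable (for a numerical variable, only adjacent intervals would be candidates for merging).

```python
from itertools import combinations

def information_ratio(cost_m, cost_opt, cost_null):
    """Information ratio: 1 for the optimal grid, 0 for the null model."""
    return (cost_m - cost_null) / (cost_opt - cost_null)

def greedy_merges(parts, cost):
    """Merge the pair of parts with the smallest cost increase until one
    part is left; return the list of (partition, cost) along the way."""
    parts = list(parts)
    history = [(list(parts), cost(parts))]
    while len(parts) > 1:
        current = cost(parts)
        best = None
        for a, b in combinations(range(len(parts)), 2):
            rest = [p for k, p in enumerate(parts) if k not in (a, b)]
            merged = rest + [parts[a] | parts[b]]
            delta = cost(merged) - current  # dissimilarity of the two parts
            if best is None or delta < best[0]:
                best = (delta, merged)
        parts = best[1]
        history.append((list(parts), cost(parts)))
    return history

# Stand-in data and cost (assumptions, not the criterion of the paper).
profile = {"a": 1.0, "b": 1.1, "c": 5.0, "d": 5.2}

def toy_cost(parts):
    total = 0.0
    for part in parts:
        xs = [profile[v] for v in part]
        mean = sum(xs) / len(xs)
        total += sum((x - mean) ** 2 for x in xs)
    return total

history = greedy_merges([frozenset(v) for v in profile], toy_cost)
ratios = [information_ratio(c, history[0][1], history[-1][1]) for _, c in history]
for (partition, _), ratio in zip(history, ratios):
    print(len(partition), "clusters, information ratio", round(ratio, 4))
```

Plotting `ratios` against the number of clusters gives the kind of curve an end-user can read to pick a granularity.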

Typicality for ranking categorical values in a cluster. When the grid is coarsened during the hierarchical agglomerative process, the number of clusters per categorical dimension decreases and the number of values per cluster increases. It can then be useful to focus on the most representative values among the thousands of values of a cluster. In order to rank values in a cluster, we define the typicality of a value as follows.

Definition 2 (Typical values in a cluster)

For a value $v$ in a cluster $c$ of the partition $X^M$ of dimension $X$ given the grid model $M$, the typicality of $v$ is defined as:

$$\tau(v, c) = \frac{1}{1 - P_M(c)} \sum_{c_j \in X^M,\ c_j \neq c} P_M(c_j) \Big( cost(M \mid c \setminus v,\ c_j \cup v) - cost(M) \Big)$$

where $P_M(c)$ is the probability of having a point with a value in cluster $c$, $c \setminus v$ is the cluster $c$ from which we have removed value $v$, $c_j \cup v$ is the cluster $c_j$ to which we add value $v$, and $M \mid c \setminus v,\ c_j \cup v$ is the grid model $M$ after the aforementioned modifications.

Intuitively, the typicality evaluates the average impact, in terms of cost, on the grid model quality of removing a value $v$ from its cluster $c$ and reassigning it to another cluster $c_j \neq c$. Thus, a value $v$ is representative (say typical) of a cluster $c$ if $v$ is “close” to $c$ and “different on average” from the other clusters $c_j \neq c$. Notice that this measure does not introduce any numerical encoding of the categories of the categorical variable under study.
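
As a rough illustration of this definition, the Python sketch below computes the probability-weighted average cost increase obtained by moving a value to each of the other clusters. It is a sketch under assumptions: `toy_cost` is a stand-in for the cost criterion (which is only described in the appendix), and the names `typicality`, `profile`, `parts` and `prob` as well as the toy data are hypothetical.

```python
def typicality(value, k, parts, prob, cost):
    """Weighted average cost increase when `value` leaves cluster parts[k].

    parts: list of sets of values; prob[j]: probability that a data point
    falls in cluster j; cost: callable evaluating a partition.
    """
    base = cost(parts)
    weighted = 0.0
    for j in range(len(parts)):
        if j == k:
            continue
        moved = [set(p) for p in parts]
        moved[k].remove(value)
        moved[j].add(value)
        # Drop the source cluster if the move emptied it.
        weighted += prob[j] * (cost([p for p in moved if p]) - base)
    return weighted / (1.0 - prob[k])

# Stand-in data and cost (assumptions, not the criterion of the paper).
profile = {"a": 1.0, "b": 1.1, "x": 2.5, "c": 5.0, "d": 5.2}

def toy_cost(parts):
    total = 0.0
    for part in parts:
        xs = [profile[v] for v in part]
        mean = sum(xs) / len(xs)
        total += sum((x - mean) ** 2 for x in xs)
    return total

parts = [{"a", "b", "x"}, {"c", "d"}]
prob = [0.6, 0.4]  # assuming one data point per value
typ_a = typicality("a", 0, parts, prob, toy_cost)
typ_x = typicality("x", 0, parts, prob, toy_cost)
print(typ_a > typ_x)  # "a" is more typical of its cluster than "x"
```

In this toy setting, "x" sits between the two clusters, so reassigning it is cheaper and its typicality is lower.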

Insightful visualizations with Mutual Information. It is common to visualize 2D coclustering results using a 2D frequency matrix or heat map. For $k$-D coclustering, it is useful to visualize the frequency matrix of two variables while selecting a part of interest for each of the other variables. We also suggest an insightful measure for the co-clusters to be visualized, namely the Contribution to Mutual Information (CMI), which provides additional valuable visual information that is inaccessible with a frequency representation alone. Notice that such visualizations are valid whatever the variables of interest.

Definition 3 (Contribution to mutual information)

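Given the selected parts $c_{i_3}, \ldots, c_{i_k}$ of the other variables, the mutual information between two partitioned variables $X_1^M$ and $X_2^M$ (from the partition of the $X_1$ and $X_2$ variables induced by the grid model $M$) is defined as:

$$MI(X_1^M; X_2^M) = \sum_{i_1=1}^{J_1} \sum_{i_2=1}^{J_2} MI_{i_1 i_2}, \qquad MI_{i_1 i_2} = p(c_{i_1 i_2}) \log \frac{p(c_{i_1 i_2})}{p(c_{i_1 \cdot})\, p(c_{\cdot i_2})} \qquad (3)$$

where $J_1$ and $J_2$ are the numbers of parts of the two partitioned variables and $MI_{i_1 i_2}$ represents the contribution of cell $c_{i_1 i_2}$ to the mutual information.

Thus, if $p(c_{i_1 i_2}) > p(c_{i_1 \cdot})\, p(c_{\cdot i_2})$ then $MI_{i_1 i_2} > 0$ and we observe an excess of interactions between $X_1^M$ and $X_2^M$ located in the cell $c_{i_1 i_2}$ defined by parts $i_1$ of $X_1^M$ and $i_2$ of $X_2^M$. Conversely, if $p(c_{i_1 i_2}) < p(c_{i_1 \cdot})\, p(c_{\cdot i_2})$, then $MI_{i_1 i_2} < 0$, and we observe a deficit of interactions in cell $c_{i_1 i_2}$. Finally, if $MI_{i_1 i_2} = 0$, then either $p(c_{i_1 i_2}) = 0$, in which case the contribution to MI is null and there is no interaction, or $p(c_{i_1 i_2}) = p(c_{i_1 \cdot})\, p(c_{\cdot i_2})$ and the quantity of interactions in $c_{i_1 i_2}$ is that expected in case of independence between the partitioned variables.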

The visualization of the cells’ CMI highlights valuable information that is local to the selected parts and brings complementary insights to exploit the summary provided by the grid. In our experiments, we show the added value of those visualizations on the CDRs data from Ivory Coast described in Section 2.
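
To make the sign of the contributions concrete, the following Python sketch computes the contribution of each cell of a two-way table of counts to the mutual information, using the usual cell-wise term p log(p / (row marginal × column marginal)). The function name `contribution_to_mi` and the two toy tables are hypothetical; natural logarithms are used, which is an arbitrary choice that does not affect the signs.

```python
import math

def contribution_to_mi(counts):
    """Cell-wise contributions to mutual information for a 2D count table."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    cmi = []
    for i, row in enumerate(counts):
        out = []
        for j, n in enumerate(row):
            p = n / total
            # An empty cell contributes nothing (limit of p * log p as p -> 0).
            out.append(p * math.log(p / (row_p[i] * col_p[j])) if p > 0 else 0.0)
        cmi.append(out)
    return cmi

# Toy table with excess traffic on the diagonal and a deficit elsewhere.
dependent = contribution_to_mi([[40, 10], [10, 40]])
# Toy table whose cells match the product of the marginals.
independent = contribution_to_mi([[20, 20], [30, 30]])

print(dependent[0][0] > 0, dependent[0][1] < 0)
print(sum(sum(row) for row in dependent))  # mutual information of the table
```

Coloring cells by these values (for instance red for positive, blue for negative) gives the kind of map or calendar used in the case studies.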

4 Exploration results

This section describes the application of the previously introduced exploratory analysis framework to mobile data. Each application of khc (available at http://www.khiops.com) to the case study data is achieved within a day of computation, which confirms the efficiency of the method. First, we apply the co-clustering with two categorical variables to build clusters of antennas based on the mobile traffic. Then, we study the time evolution of the call distribution using data grid models with one categorical and one continuous variable. Next, we extend the previous analysis by applying our triclustering technique to the antennas, the weekday and the time of day in order to track active and inactive areas as a function of the day and the hour. Finally, we investigate the users' behavior according to the antennas they use, the weekday and the time. This last study is an application of data grid models in four dimensions, i.e. a tetra-clustering.

4.1 The clusters

Applying data grid models to the call detail records yields a very fine segmentation, with nearly one antenna per cluster. This is due to the large amount of data (471 million CDRs): the number of calls is so high for each antenna that the distribution of calls originating from (resp. terminating at) each antenna can be distinguished from that of every other antenna.

Figure 1: Evolution of the information kept in the data grid model w.r.t. the number of clusters using the ascending hierarchical post-processing – from optimal data grid (100%) to the null model (0%).

In order to obtain a more interpretable segmentation, we apply the post-processing introduced in Section 3.1. Figure 1 plots the information ratio (see Equation 2) versus the number of clusters for all intermediate models obtained during the ascending hierarchical post-processing. Interestingly, the resulting Pareto curve shows that very informative models are obtained with few clusters. In our study, we reduce the number of clusters down to twenty (see Figure 2), a number that suits interpretation while keeping an informative model.

Throughout the simplification process, the partitions of source and target antennas stay identical. Thus we consider only the partition of source antennas for the rest of the study. We have plotted the clusters on a map in Figure 2. Antennas are shown as dots whose color matches the cluster they belong to.
The first observation is the strong correlation between the clusters and the geography of the country. Indeed, antennas from the same cluster are close to each other. The clusters cover areas of almost the same size and match the administrative zones of the country. Abidjan is an exception: it is split into four clusters, because of the high concentration of antennas in the city and its dense phone traffic.

We use the typicality (see Definition 2) to rank the antennas of each cluster. Each cluster is labeled with the place where its most typical antenna is located. On the map in Figure 2, the size of the dots is proportional to the antenna typicality. The most typical antennas are located in the main cities of Ivory Coast. This phenomenon has already been observed in [3] and [11]: the clusters match the areas of influence of the main cities of a country. We note some exceptions. Among them, the cluster of the city of Sassandra contains the antennas of the city of Divo, although Divo is larger than Sassandra (population-wise) and is the sixth largest Ivorian city. Antennas in Divo are less typical than the ones in Sassandra, meaning that allocating them to another cluster would be less costly for the criterion. Indeed, a significant share of the calls emitted from Divo goes to other regions of Ivory Coast, whereas calls from Sassandra stay mostly within its region. In more formal terms, the call distributions of the antennas in Divo are closer to the marginal distribution than to their cluster’s distribution. This observation is not really surprising because Divo has experienced a recent growth of its population, due to migrations within the country [8]. Divo is also located in an area specialized in intensive farming, which attracts seasonal workers from other parts of Ivory Coast.

Figure 2: Twenty clusters displayed on Ivory Coast map. There is one color per cluster.

Finally, let us focus on the segmentation of Abidjan. The city is divided into four parts with a strong socioeconomic correlation. The first cluster (in red in Figure 2) covers central Abidjan, including the Central Business District (le Plateau), the transport hub (Adjamé) and the embassies and upper class area (Cocody). The second cluster (in light green in Figure 2) is located in the South of the city. The covered neighborhoods are mainly residential areas and ports. Note that this cluster and the previous one are separated by a strip of sea, except for the northern part of this area, which is included in the previous cluster. This very localized neighborhood matches the party area of Abidjan. Finally, the last two clusters group antennas located in two areas with a similar profile: these are lower class neighborhoods. These clusters are separated not only because they are located in different parts of the city but above all because their call distributions differ: Abobo in dark blue and Yopougon in grey in Figure 2.

4.2 Traffic between clusters

In Section 3, we introduced the contribution to mutual information. We propose to use it to visualize the deficits and excesses of calls between the clusters, compared to the traffic expected in case of independence. Whatever the granularity level of the clustering, we observe a strong excess of calls from the clusters to themselves and much weaker excesses and deficits between clusters. Since studying the traffic within the clusters is of limited interest, we focus only on the inter-cluster traffic. To visualize it, we use a finer clustering than in Section 4.1, with 355 clusters (see Figure 1 for the corresponding informativity).

Figure 3: Excess of calls between clusters of antennas

In Figure 3, the red segments on the map are the excesses of traffic between clusters. The end points of the segments are drawn at the positions of the most representative antennas of the associated clusters (i.e. those with the highest typicalities). The opacity of a segment is proportional to the value of the contribution to mutual information and its width is proportional to the number of calls between clusters. The biggest cities, like Bouaké, San Pedro and Man, are clearly marked on the map: they are regional capitals, a fact that is highlighted by the call traffic visualization. The case of Bouaké is particularly interesting. Although it is not the country's capital, its national influence seems bigger than that of Yamoussoukro, the actual capital. Yamoussoukro is half the size of Bouaké (population-wise) and is a quite recent city with no major economic activity, contrary to Bouaké. This fact can explain our observation.

Excess of traffic between major cities is a rare phenomenon. Cities are more like phone hubs, except in the West of the country around Soubré. This area is not a densely populated area but corresponds to a region with important migration flows. Finally, in Abidjan we can note important excesses of traffic within neighborhoods, but not between neighborhoods.

4.3 Temporal Analysis of the Calls Distribution

In this analysis, we track the evolution of the traffic over time. The studied time period runs from December 1st, 2011 to April 28th, 2012. The model introduced in Section 3 is designed to deal with several variables, either continuous or categorical. Thus we could study the calls from emitting antennas to destination antennas according to the time, using three variables. However, in Section 4.1, we have shown that the correlation between source and destination antennas is very high, so the evolution of the call distribution over time is likely to be the same for both sets of antennas. Therefore, we only study the time evolution of the originating calls: one call is described by the emitting antenna and a day count (stemming from the date).

Figure 4: Antennas activity clusters projected on Ivory Coast map. Colored clusters show inactivity periods while grey clusters indicate antennas whose traffic is complete over the period.

In this study, the clustering of antennas is again too fine for easy interpretation (1051 clusters of antennas and 140 intervals for the day count). We apply the same post-processing as in Section 3.1, which reduces the model to ten clusters of antennas and twenty time segments. The main problem of this analysis is missing data. Indeed, some antennas emitted no call during some time periods. Consequently, we obtain time segments that are strongly correlated with missing data. For the same reason, antennas are grouped together because they experienced an absence of calls during one or several common periods.

In Figure 4, the colored antennas belong to clusters having experienced simultaneous absences of calls. We note that the green, orange, light blue and purple clusters are each confined to a localized area. For these clusters, data are missing during short periods. This grouping might be due to localized technical issues on the network. The antennas of the yellow cluster are spread over the country. These antennas are grouped because they were activated on the same date. This use case provides a better understanding of the dysfunctions in the network over the period.

4.4 Output communications w.r.t. week day and hour

In this analysis we use three variables to describe the calls: the emitting antenna, the week day and the hour. Our objective is to build simultaneously a partition of the antennas, a partition of the week days and a discretization of the hour. This approach is a triclustering and, for the same reasons as previously, we only keep the emitting antennas.

At the finest level, the triclustering partitions the emitting antennas, the days and the hours of the day. These results must be simplified to ease the interpretation. We keep the numbers of clusters of days and of time segments unchanged, since they are acceptable for the analysis, and only reduce the number of clusters of antennas, down to four.

Figure 5: Clusters on the map of Ivory Coast. Dots are antennas. There is one color per cluster.

Antennas are displayed on the map of Figure 5. We also build a calendar (see Figure 6) for each cluster, with days in columns and time segments in rows. The color of the cells indicates the excesses (red) or the deficits (blue) of traffic emitted from the corresponding cluster. The deficits and excesses are measured using the contribution to mutual information (see Definition 3) between the partition of the antennas and the cross product of the partition of the weekdays and the discretization of the time.

Let us make an analysis of each cluster of antennas:

Abidjan - Le Plateau (yellow cluster) This cluster covers exactly the Central Business District of Abidjan. In the calendar of Figure 6, we observe an excess of calls from Monday to Friday, between 8-9 am and 4-5 pm. The rest of the time, there is a slight deficit of traffic emitted from this area. This means that during office hours the phone traffic is higher than expected, and lower the rest of the time. This is representative of this type of area: a non-residential business district.

Economic zones (red cluster) The antennas of this cluster are located either in the commercial areas of the cities or in areas with a strong economic activity, like plantations or mines. In Abidjan, these antennas are located in industrial zones (South and North-West), the shopping districts (North of the business district) and the universities and embassies neighborhood (East). The traffic in these areas is mainly in excess from Monday to Saturday between 9 am and 5 pm. The correlation between the working hours and the call traffic in these areas is very strong.

Urban residential areas (blue cluster) The antennas belonging to this cluster are mainly located in cities like Abidjan, Bouaké and Yamoussoukro. If we focus on Abidjan, we see that the cluster covers the residential neighborhoods located in the West and in the North-East of the city. At a finer level of partition of the antennas, this cluster would be split according to the socioeconomic class of the neighborhood: the upper class neighborhood in the East of the city is separated from the lower class neighborhoods, located in the North and the West. The calendar shows deficits of calls during office hours and excesses during the weekend, and at night and in the early morning during the week. This is correlated with the presence of people in residential areas. Note that the excesses of calls start around 8 pm, while they stop around 5 pm in the Central Business District or in economic areas. This time lag is due to the cheaper price of calls after 8 pm.

The countryside (green cluster) The antennas of this cluster are spread over the country, except in Abidjan and other cities in general. The calendar for this cluster is quite similar to the one of the urban residential areas, except that the excess periods are limited to the early evening and the whole Sunday.

Figure 6: Calendars of excesses (red) and deficits (blue) of calls emitted from each of the four clusters of antennas, as a function of the weekday and the time of day. Panels: (a) Le Plateau (Abidjan), (b) Activity areas, (c) Urban residential areas, (d) Countryside.

4.5 User mobility analysis w.r.t. week day and hour

The data of this study are customer trajectories. For a set of anonymized users, we have the antennas and the timestamps of their uses of the network. The data are the connections to the network, each identified by a user identifier, an antenna, a week day and an hour. In this study, we apply a tetra-clustering in order to build simultaneously clusters of users, of antennas and of week days, and a discretization of the time of day. Here two users have the same profile if they connected to the same groups of antennas, on the same days of the week and during the same time periods. The data are filtered so that only the most mobile users are kept, a mobile user being characterized by a frequent use of a large set of distinct antennas.
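
The filtering step can be sketched as follows in Python. This is only an illustration of the stated characterization (frequent use of many distinct antennas): the text does not give the thresholds actually used, so `min_antennas` and `min_connections` are hypothetical parameters, as are the function name `select_mobile_users` and the toy records.

```python
from collections import defaultdict

def select_mobile_users(records, min_antennas, min_connections):
    """Keep users seen on many distinct antennas with many connections.

    records: iterable of (user, antenna, weekday, hour) tuples.
    Both thresholds are assumptions, not values from the study.
    """
    antennas = defaultdict(set)
    connections = defaultdict(int)
    for user, antenna, _weekday, _hour in records:
        antennas[user].add(antenna)
        connections[user] += 1
    return {
        user
        for user in antennas
        if len(antennas[user]) >= min_antennas
        and connections[user] >= min_connections
    }

# Toy records: u1 moves between three antennas, u2 stays on one.
records = [
    ("u1", "A1", 0, 9), ("u1", "A2", 0, 13), ("u1", "A3", 0, 19),
    ("u1", "A1", 1, 9), ("u2", "A7", 0, 9), ("u2", "A7", 1, 9),
]
mobile = select_mobile_users(records, min_antennas=3, min_connections=4)
print(mobile)  # {'u1'}
```

Only the records of the selected users would then be passed on to the tetra-clustering.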

At the finest level, the model contains many clusters of users and of antennas, and three time segments. Week days are not grouped: each of them is in its own cluster. This clustering is too fine for easy interpretation, so we use the post-processing introduced in Section 3.1 to simplify the model. The simplified model has fewer clusters of users and antennas, and only two groups of week days and two hour segments. The week is divided into two parts: the working days and the weekend. As for the hour, the split occurs around 6 pm. The intervals are 0 am - 6 pm and 6 pm - 12 am. Note that the bound at midnight is artificial, because the day starts at this time. The cut at 6 pm is the last one in the hierarchy of the time segmentation, so it would have been more relevant to consider a day running from 6 pm to 6 pm the next day. Nevertheless, interpretation is easier on a “usual” day running from 0 am to 12 am. Therefore we keep the following segmentation: 0 am - 6 pm, 6 pm - 12 am.

Figure 7: For a group of users, excesses and deficits of antenna use according to the day of the week and the time of the day. Focus on Abidjan. Panels: (a) Working days before 6 pm, (b) Working days after 6 pm.

We aim at characterizing the users' behavior in terms of mobility. We focus on one group of users to illustrate our results. The maps of Figure 7 show the excesses and deficits of traffic in Abidjan during the week, for both periods of the day and for the selected group of users. The colors correspond to the contributions to mutual information computed from the partition of the antennas, the partition of the weekdays, the discretization of the time of day and the partition of the users.

Users of the studied cluster mainly connect to the antennas located in the East of Abidjan after 6 pm during the working days, while they rarely connect to the same antennas before 6 pm on the same days. It can therefore be assumed that the selected cluster of users groups people living in the same area. This hypothesis is reinforced by the socioeconomic nature of this part of Abidjan: it is a residential area. The contributions to mutual information of the other clusters of antennas are smaller. Three areas experience excesses of traffic before 6 pm and deficits after 6 pm. They correspond to the business district (Le Plateau), the embassies and universities neighborhood and the industrial zone located in the West of the city. The common point of all these areas is their economic activity during the day. To sum up, we can assume that the users of the selected cluster are similar in that they live in the same area and work during the week in three localized areas of Abidjan.

4.6 Synthesis

This section aimed at illustrating how a co-clustering approach can be used to extract different kinds of information from a single telephone data set. We have also shown that some mathematically defined exploratory analysis concepts helped us to interpret our results. In the first part of the analysis, we have shown how people from the same zone tend to call the same areas of the country. Using mutual information between clusters of antennas, we are able to plot the network of calls in the country. We then made a temporal analysis of the emitted calls, highlighting the differences of behavior of mobile users according to the area where they live in the country. In a last study of anonymized user mobility data, we proposed a tetra-clustering (or co-clustering in four dimensions) and focused on one cluster of users to show how we can build user profiles and investigate their mobile usage. These results have been discussed and the interpretations have been validated by a sociologist from the University of Bouaké in Ivory Coast.
Impacts on economic strategy. Besides the high-level knowledge extracted from country-scale data and confirmed by local sociologists, these studies also have a strong impact on future economic development strategy, mainly in two identified branches:

  • Network planning strategy: In 2014, there are around 20M inhabitants in Ivory Coast and the mobile service penetration rate is – with a still growing mobile phone market. Maps resulting from the first case study (which can be seen as the network of calls available at various granularities, see Section 4.1) are considered as an additional input for network planning and investment, for instance to help network designers decide how many antennas to install next, and where, while preserving the quality of service at a reasonable cost.

  • Yield management pricing strategy: a part of the pricing policy established in Ivory Coast, called Bonus Zone, offers discounted prices (from 10% to 90%) to callers depending on the location and hour of the emitted call. Maps and calendars resulting from the case studies of Sections 4.4 and 4.5, which are available at various granularities, provide valuable information to economic analysts for designing an optimized spatio-temporal pricing policy in the Bonus Zone context.

5 Related Work

CDRs data have received much attention in recent years. Well-known applications of CDRs data analysis serve the social good: e.g., in the transportation domain, [2] suggest a system for public transport optimization. Mobile phones may also provide other types of data (e.g., the Nokia Mobile Data Challenge [14]), such as application events, WLAN connection data, etc. For instance, [13] pre-processed the phone activities of one million users to obtain information about their approximate temporal location, then mined daily motifs from the spatio-temporal data to infer human activities. Finally, smartphones are or will be equipped with accelerometers and/or gyroscopes providing data about the physical activities of users: [15] suggest a complete system of activity recognition based on smartphone accelerometers, with potential application to health monitoring.

Research work related to data grid models: Dhillon et al. [7] have proposed an information-theoretic co-clustering approach for two discrete random variables: the loss in mutual information is minimized to obtain a locally optimal grid with a user-defined number of clusters for each dimension. This approach is limited to two variables and requires choosing the number of clusters per variable. Going beyond 2D matrices, significant progress has recently been made in multi-way tensor analysis [18]. For instance, [16] suggest a method for mining time-stamped event sequences and effective forecasting of future events.

To the best of our knowledge, our summarization approach is the only one to combine the following advantages: it is parameter-free, scalable and can be applied to mixed-type attributes (categorical, numerical, thus multiple types of time dimensions). Therefore, the same generic method can be used to analyze network graph data, temporal sequence data and mobility data.

6 Conclusion

We have suggested a generic methodology for exploratory analysis of CDRs data. Our method is based on a joint distribution estimation technique providing the user with a summary of the data in a parameter-free way. We have also suggested several criteria for exploring and exploiting the summary at various granularities and highlighting its relevant components. We have demonstrated the applicability of the method on graph data, temporal sequence data as well as user mobility data stemming from CDRs data.

References

  • [1] Becker, R.A., Cáceres, R., Hanson, K., Isaacman, S., Loh, J.M., Martonosi, M., Rowland, J., Urbanek, S., Varshavsky, A., Volinsky, C.: Human mobility characterization from cellular network data. Commun. ACM 56(1), 74–82 (2013)
  • [2] Berlingerio, M., Calabrese, F., Lorenzo, G.D., Nair, R., Pinelli, F., Sbodio, M.L.: Allaboard: A system for exploring urban mobility and optimizing public transport using cellphone data. In: ECML/PKDD. pp. 663–666 (2013)
  • [3] Blondel, V., Krings, G., Thomas, I.: Regions and borders of mobile telephony in Belgium and in the Brussels metropolitan zone. Brussels Studies 42 (2010)
  • [4] Blondel, V., de Cordes, N., Decuyper, A., Deville, P., Raguenez, J., Smoreda, Z.: Mobile phone data for development - analysis of mobile phone datasets for the development of ivory coast (2013), perso.uclouvain.be/vincent.blondel/netmob/2013/D4D-book.pdf
  • [5] Blondel, V.D., Esch, M., Chan, C., Clérot, F., Deville, P., Huens, E., Morlot, F., Smoreda, Z., Ziemlicki, C.: Data for development: the D4D challenge on mobile phone data. CoRR abs/1210.0137 (2012)
  • [6] Boullé, M.: Data grid models for preparation and modeling in supervised learning. In: Guyon, I., Cawley, G., Dror, G., Saffari, A. (eds.) Hands-On Pattern Recognition: Challenges in Machine Learning, volume 1, pp. 99–130. Microtome Publishing (2011)
  • [7] Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In: KDD. pp. 89–98 (2003)
  • [8] Gnabéli, R.: La production d’une identité autochtone en Côte d’Ivoire. Journal des anthropologues. Association française des anthropologues 114-115, 247–275 (2008)
  • [9] Grünwald, P.: The minimum description length principle. MIT Press (2007)
  • [10] Guerraz, B., Boullé, M., Gay, D., Clérot, F.: Khiops coviz: A tool for visual exploratory analysis of k-coclustering results. In: ECML/PKDD. pp. 444–447 (2014)
  • [11] Guigourès, R., Boullé, M.: Segmentation of towns using call detail records. In: NetMob Workshop at IEEE SocialCom (2011)
  • [12] Hansen, P., Mladenovic, N.: Variable neighborhood search: Principles and applications. European Journal of Operational Research 130(3), 449–467 (2001)
  • [13] Jiang, S., Fiore, G.A., Yang, Y., Ferreira Jr., J., Frazzoli, E., Gonzàlez, M.C.: A review of urban computing for mobile phone traces: current methods, challenges and opportunities. In: UrbComp@KDD (2013)
  • [14] Laurila, J.K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T.M.T., Dousse, O., Eberle, J., Miettinen, M.: From big smartphone data to worldwide research: The mobile data challenge. Pervasive and Mobile Computing 9(6), 752–771 (2013)
  • [15] Lockhart, J.W., Weiss, G.M.: The benefits of personalized smartphone-based activity recognition models. In: SDM. pp. 614–622 (2014)
  • [16] Matsubara, Y., Sakurai, Y., Faloutsos, C., Iwata, T., Yoshikawa, M.: Fast mining and forecasting of complex time-stamped events. In: KDD. pp. 271–279 (2012)
  • [17] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal (1948)
  • [18] Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In: KDD’06. pp. 374–383 (2006)
  • [19] United Nations Global Pulse: Mobile phone network data for development (2013), www.unglobalpulse.org/Mobile_Phone_Network_Dat_for_Dev
  • [20] Wang, D., Pedreschi, D., Song, C., Giannotti, F., Barabási, A.L.: Human mobility, social ties, and link prediction. In: KDD. pp. 1100–1108 (2011)

Appendix A Data grid models in a nutshell

Data grid models aim at estimating the joint distribution between several variables of mixed types (categorical as well as numerical). The main principle is to simultaneously partition the values taken by the variables, into groups/clusters of categories for categorical variables and into intervals for numerical variables. The result is a multidimensional (-D) data grid whose cells are defined by one part of each partitioned variable value set. Strictly speaking, we only work with partitions of variable value sets; however, to simplify the discussion, we will sometimes use the slightly loose expressions “partition of a variable” and “partitioned variable”.

In order to choose the “best” data grid model (given the data) from the model space, we use a Bayesian Maximum A Posteriori (MAP) approach. We explore the model space while minimizing a Bayesian criterion, called cost. The cost criterion implements a trade-off between the accuracy and the robustness of the model: it is the negative logarithm of the product of the prior probability of the model and the likelihood of the data given the model, so that minimizing the cost maximizes the posterior probability of the model.

Boullé [6] has shown that an exact analytic expression of the cost criterion can be obtained by considering a data-dependent hierarchical prior (on the parameters of a data grid model) that is uniform, from a combinatorial point of view, at each stage of the hierarchy. Note that this does not mean that the overall prior is uniform; thus, in our case, the MAP approach differs from a simple likelihood maximization. The cost criterion is then defined as follows.

Definition 4

A data grid model is optimal if it minimizes the cost criterion defined as follows:

(4)
(5)
(6)
(7)
(8)

where is the number of points in the data, the number of variables, the set of variables, the subset of numerical variables, the subset of categorical variables, the number of values of categorical variable , the size of the univariate partition of variable , the number of cells of the grid, the number of values of the group/cluster of categorical variable , the number of points with value of categorical variable , the number of points in the interval (or value group) of variable , the number of points in the cell of the grid.

The terms of the first three lines stand for the a priori probability of the grid model and constitute the regularization term of the model: complex models (with many clusters for categorical variables and/or many intervals for numerical variables) are penalized. The last two lines stand for the likelihood of the data given the parameters of the model: models that are closest to the data are preferred. The extreme case where we have at most one point per cell maximizes the likelihood but yields a very low a priori probability of the grid model, thus a high cost value. The other extreme case, i.e., the null model, is when we have only one cell: the a priori probability is high but the likelihood is very low, thus the cost value is again high. Grids with a low cost value have a high a posteriori probability and are those of interest because they achieve a balanced trade-off between accuracy and generality/simplicity. In terms of information theory, negative logarithms of probabilities can also be interpreted as code lengths [17]: here, according to the Minimum Description Length (MDL) principle [9], the cost criterion can be interpreted as the code length of the grid model plus the code length of the data given the grid model. A low cost value then also means a high compression of the data using the grid model.
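For orientation, the structure described above can be written out as follows. This is a hedged reconstruction, not a verbatim statement of Definition 4: it follows the line-by-line description (three prior lines, two likelihood lines) and the general form of data grid criteria in [6], and it should be checked against that reference. All symbols are notation assumptions: $M$ is a grid model, $D$ the data, $c(M)$ the cost, $N$ the number of points, $\mathcal{K}$, $\mathcal{K}_n$ and $\mathcal{K}_c$ the sets of all, numerical and categorical variables, $V_k$ and $J_k$ the numbers of values and of parts of variable $k$, $G = \prod_k J_k$ the number of cells, $m^{(k)}_{j_k}$ the number of values in group $j_k$, $n_{v_k}$ the number of points with value $v_k$, $N^{(k)}_{j_k}$ the number of points in part $j_k$, $N_{j_1 \ldots j_K}$ the number of points in a cell, and $B(V_k, J_k)$ the number of partitions of $V_k$ values into at most $J_k$ groups.

```latex
% Assumed notation throughout; see the lead-in for the meaning of each symbol.
\begin{align*}
c(M) &= -\log p(M) - \log p(D \mid M) \\
     &= \sum_{k \in \mathcal{K}_n} \log N
        + \sum_{k \in \mathcal{K}_c} \log V_k
        + \sum_{k \in \mathcal{K}_c} \log B(V_k, J_k) \\
     &\quad + \log \binom{N + G - 1}{G - 1} \\
     &\quad + \sum_{k \in \mathcal{K}_c} \sum_{j_k = 1}^{J_k}
        \log \binom{N^{(k)}_{j_k} + m^{(k)}_{j_k} - 1}{m^{(k)}_{j_k} - 1} \\
     &\quad + \log N! - \sum_{j_1 = 1}^{J_1} \cdots \sum_{j_K = 1}^{J_K} \log N_{j_1 \ldots j_K}! \\
     &\quad + \sum_{k \in \mathcal{K}} \sum_{j_k = 1}^{J_k} \log N^{(k)}_{j_k}!
        - \sum_{k \in \mathcal{K}_c} \sum_{v_k = 1}^{V_k} \log n_{v_k}!
\end{align*}
```

Under this reading, the null model (one part per variable, hence $G = 1$) makes the cell-level prior term vanish, while the likelihood terms stay large, which matches the "high prior, low likelihood" extreme described above.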

Optimization algorithm. The optimization of a data grid is a combinatorial problem: the number of possible partitions of the values of a categorical variable is given by a Bell number, and the number of possible discretizations of a numerical variable grows exponentially with its number of values. Obviously, an exhaustive search is infeasible and, as far as we know, there is no tractable optimal algorithm. Therefore, the criterion is optimized using a greedy bottom-up strategy whose main principle is described in the pseudo-code of Algorithm 1. We start with the finest-grained data grid, which is made of the finest possible univariate partitions (for all variables), i.e., based on single-value intervals or clusters. Then, we evaluate all merges between clusters and between adjacent intervals, and perform the best merge if the criterion decreases after the merge. We iterate until there is no more improvement of the criterion.
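To give a feel for how quickly the search space grows, the short sketch below computes Bell numbers with the classical Bell triangle. The function name is ours, and the snippet only illustrates the combinatorial argument; it is not part of the optimization method.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], where B(n) counts the partitions of n values."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row of the Bell triangle starts with the last entry of the previous row
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

bells = bell_numbers(10)
```

Already with 10 categorical values there are 115975 possible partitions, and a real data set has far more than 10 antennas or users, which is why an exhaustive search is out of reach.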

Input: initial data grid solution M
Output: final data grid solution M* with cost(M*) ≤ cost(M)
M* ← M;
while improved data grid solution do
      M' ← M*;
      forall merge m between two clusters or two intervals do
            M+ ← M* + m;   // consider merge m for grid M*
            if cost(M+) < cost(M') then
                  M' ← M+;
      if cost(M') < cost(M*) then
            M* ← M';   // improved grid solution
return M*
Algorithm 1 khc: Data grid optimization
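The greedy loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it works on a single partition rather than a full multidimensional grid, it considers merges between any two parts (as for clusters of a categorical variable), and `toy_cost` is a stand-in criterion (within-part squared deviation plus a penalty per part), not the cost criterion of Definition 4. All names are hypothetical.

```python
from itertools import combinations

def greedy_merge(parts, cost):
    """Greedy bottom-up merging: apply the best merge while it lowers the cost."""
    best = list(parts)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        candidate, candidate_cost = None, best_cost
        # Evaluate every merge of two parts of the current solution
        for a, b in combinations(range(len(best)), 2):
            merged = [p for i, p in enumerate(best) if i not in (a, b)]
            merged.append(best[a] | best[b])
            c = cost(merged)
            if c < candidate_cost:
                candidate, candidate_cost = merged, c
        if candidate is not None:  # the best merge strictly improves the cost
            best, best_cost = candidate, candidate_cost
            improved = True
    return best, best_cost

def toy_cost(partition, penalty=2.0):
    """Stand-in criterion (NOT the paper's cost): fit term plus a penalty per part."""
    total = penalty * len(partition)
    for part in partition:
        mean = sum(part) / len(part)
        total += sum((x - mean) ** 2 for x in part)
    return total

values = [1, 2, 3, 10, 11, 20]
start = [frozenset([v]) for v in values]  # finest partition: one value per part
final, final_cost = greedy_merge(start, toy_cost)
```

On this toy input the loop stops at the three parts {1, 2, 3}, {10, 11} and {20}. Note that the sketch recomputes the whole cost for every candidate merge, which is exactly the straightforward implementation whose complexity is discussed next.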

In the following, to lighten the notation and without loss of generality, we consider the 3D case with, e.g., two categorical variables with $V_1$ and $V_2$ values respectively, and one numerical variable with potentially $N$ values (i.e., the total number of data points).

A straightforward implementation of the greedy heuristic remains a hard problem, since each evaluation of the criterion for a grid requires $O(V_1 V_2 N)$ time, given that the initial finest grid is made of up to $V_1 V_2 N$ cells. Furthermore, each step of Algorithm 1 requires $O(V_1^2)$ (resp. $O(V_2^2)$, $O(N)$) evaluations of merges of clusters and intervals, and there are at most $O(V_1 + V_2 + N)$ steps from the finest-grained model to the null model. The overall time complexity is thus bounded by $O(V_1 V_2 N \, (V_1^2 + V_2^2 + N) \, (V_1 + V_2 + N))$. In [6], it has been shown that further optimizations reduce this time complexity substantially. These advanced optimizations, combined with sophisticated algorithmic data structures, mainly (i) exploit the sparseness of the grid, (ii) exploit the additivity property of the criterion, and (iii) start from non-maximal grained grid models, using pre- and post-optimization heuristics:

  • In practice, data sets represented by 3D points are sparse: with $N$ data points, at most $N$ cells of the grid are non-empty. The contribution of empty cells to the criterion in Definition 4 is null, thus each evaluation of a data grid may be performed in $O(N)$ time through advanced algorithmic data structures.

  • The additivity of the criterion stems from its data-dependent hierarchical prior. It means that the criterion can be split along a hierarchy of components of the grid model: the variables, then the parts (clusters or intervals) and finally the cells. The additivity property makes it possible to evaluate a merge between intervals or clusters by updating only the components it affects, instead of recomputing the whole criterion. Moreover, the sparseness of the data set ensures that the number of re-evaluations (after the best merge is performed) is small on average.

  • Instead of starting from the finest-grained grid, for the sake of tractability, the algorithm starts from coarser grids with a bounded number of clusters or intervals per variable. Dedicated preprocessing and postprocessing heuristics are employed to locally improve the initial and final solutions produced by Algorithm 1. In these heuristics, the criterion is post-optimized alternately for each variable while the partitions of the others are fixed, by moving values across clusters and moving interval boundaries for the numerical variables.
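The sparseness argument in the first item above can be illustrated with a small sketch that stores only the non-empty cells of a 3D grid. The data, the cluster assignments and the interval bound are all hypothetical, and the term computed at the end (a sum of log n! over cells) is only an example of a cell-level sum for which empty cells contribute zero, since log 0! = 0.

```python
import math
from bisect import bisect_right
from collections import Counter

# Hypothetical calls: (source antenna, destination antenna, time of call)
points = [("a", "x", 1.0), ("a", "y", 2.5), ("b", "x", 7.0), ("b", "x", 7.5), ("c", "y", 9.0)]

# Hypothetical partitions: two clusters per categorical variable, two time intervals
src_cluster = {"a": 0, "b": 1, "c": 1}
dst_cluster = {"x": 0, "y": 1}
time_bounds = [5.0]  # interval 0 is t < 5.0, interval 1 is t >= 5.0

# Sparse representation: only cells that contain at least one point are stored
cells = Counter(
    (src_cluster[s], dst_cluster[d], bisect_right(time_bounds, t))
    for s, d, t in points
)

# A cell-level sum visits the non-empty cells only (lgamma(n + 1) equals log n!)
log_factorial_term = sum(math.lgamma(n + 1) for n in cells.values())
```

Here the grid has 2 × 2 × 2 = 8 cells but only 4 are stored, and the number of stored cells can never exceed the number of points, whatever the granularity of the grid.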

The optimized version of Algorithm 1 is time-efficient but may lead to a local optimum. To alleviate this concern, we use the Variable Neighborhood Search (VNS) meta-heuristic [12]. The main principle consists of multiple runs of the algorithm using various random initial solutions (we consider 10 rounds of initialization): it allows anytime optimization (the more you optimize, the better the solution) while not increasing the overall time complexity of Algorithm 1. Full details of the optimization techniques are available in [6].
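The anytime behavior of repeated runs can be sketched as follows. This is a plain multi-start wrapper, which is a simplification of VNS [12] and not the implementation used by the authors; the landscape, the local search and all names are hypothetical and serve only to show that more rounds can never worsen the best solution found.

```python
import random

def multi_start(local_search, random_start, cost, rounds=10, seed=0):
    """Run a local search from several random starts and keep the best result."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(rounds):
        solution = local_search(random_start(rng))
        c = cost(solution)
        if c < best_cost:
            best, best_cost = solution, c
    return best, best_cost

# Toy one-dimensional landscape with several local minima (values 3, 2, 1 and 0)
LANDSCAPE = [5, 3, 4, 6, 2, 7, 8, 1, 9, 0, 6]

def cost(i):
    return LANDSCAPE[i]

def descend(i):
    """Move to the better neighbor until no neighbor improves the cost."""
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]
        j = min(neighbors, key=cost)
        if cost(j) >= cost(i):
            return i
        i = j

def random_start(rng):
    return rng.randrange(len(LANDSCAPE))

one_round = multi_start(descend, random_start, cost, rounds=1)
ten_rounds = multi_start(descend, random_start, cost, rounds=10)
```

With the same seed, the ten-round run starts with the same first round as the one-round run, so its result is at least as good. Each extra round costs one more run of the local search, so the rounds add a constant factor rather than changing the complexity of a single run.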

Appendix B Conclusion

Data grid models [6] are an effective method for estimating the joint distribution between several variables of mixed types; an implementation is available under the name Khiops Coclustering at http://www.khiops.com. The method has also been presented as a demo on several real-world case studies, see e.g. [10].