The Use of Presence Data in Modelling Demand for Transportation

by   Jonathan Epperlein, et al.

We consider the applicability of the data from operators of cellular systems to modelling demand for transportation. While individual-level data may contain precise paths of movement, stringent privacy rules prohibit their use without consent. Presence data aggregate the individual-level data to information on the numbers of transactions at each base transceiver station (BTS) per each time period. Our work is aimed at demonstrating value of such aggregate data for mobility management while maintaining privacy of users. In particular, given mobile subscriber activity aggregated to short time intervals for a zone, a convex optimisation problem estimates most likely transitions between zones. We demonstrate the method on presence data from Warsaw, Poland, and compare with official demand estimates obtained with classical econometric methods.



There are no comments yet.


page 1

page 2

page 3

page 4


A Survey Mobility Management in 5G Networks

For the next wireless communication systems, the major challenges are to...

Assignment of a Synthetic Population for Activity-based Modelling employing Publicly Available Data

Agent based modelling has acquired the spotlight in the transportation d...

The Economics of Social Data

A data intermediary pays consumers for information about their preferenc...

Modelling COVID-19 – I A dynamic SIR(D) with application to Indian data

We propose an epidemiological model using an adaptive dynamic three comp...

Trust but Verify: Cryptographic Data Privacy for Mobility Management

The era of Big Data has brought with it a richer understanding of user b...

Big Data Information and Nowcasting: Consumption and Investment from Bank Transactions in Turkey

We use the aggregate information from individual-to-firm and firm-to-fir...

Private, Fair, and Verifiable Aggregate Statistics for Mobile Crowdsensing in Blockchain Era

In this paper, we propose FairCrowd, a private, fair, and verifiable fra...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Estimating travel demand is a key step of the transportation planning process. Cities typically build a modeling framework and use surveys along with observations to characterize how, where and by what means citizens move. These estimates are then used for management of existing infrastructure or for design of new services. Such an approach is resource-intensive, time consuming, and therefore can only be performed infrequently providing estimates that may be out-of-date.

Alternatively, current movement patterns can be estimated based on data from mobile devices. Several works have studied digital trajectories, or “breadcrumbs”, which provide a very rich demand profile. Nevertheless, legislation and privacy regulation often restrict the use of such methods. In Europe, in particular, requirements on processing telecommunications transmission data are set by European directives:

  • Personal data should be protected, as per directive on e-Privacy(2002/58/EC) and directive on personal data protection (95/46/EC).

  • The consent of the data subject is required for processing any subscriber personal data.

  • The anonymized data can be processed without consent only if they are aggregated in the first step of processing and de-anonymization of the data is not possible. The anonymized record cannot not be connected with any subscriber in any case.

The use and processing of CDR records without the permission of the subscriber is hence not possible in the Europe, and one may need to rely on aggregate data.

In this paper, we present methods to estimate trip generation and trip distribution rates from aggregate presence data. In particular, we consider mobile subscriber’s location statistics aggregated in accordance with the above-mentioned regulations. We the present a linear program to be solved for each period of the discretisation of time to obtain the most likely estimate of movements since the previous period. We also present algorithms for obtaining the estimates of movements of likely movements over a number of periods, or consider a distribution of trip durations. Crucially, we show that the linear program can be solved in linear time, which makes the approach applicable in practice.

2 The Problem

2.1 Data

The data comes in the form of time-stamped averages of event counts for each of a number of geographical zones: (Zone, Timestamp, Event Count), where the zones are defined by the local public transport operator.

More specifically: Say there are zones , and we have a total of time intervals for which the event counts are reported; here, the intervals are all of equal length , i.e. , but this is not necessary for the applicability of what follows. Thus, the data consists of tuples of the form

which encode that “In zone during the time interval , events were observed.” Note that the timestamp of the end of the interval is reported in the dataset, and that there are such tuples.

For the purposes of this paper, we consider the number of events as a measure of how many unique terminals (phones, pagers, etc.) were present in the give zone over the given time interval.

2.2 Modelling user movement

The given data suggests how many terminals were present at any zone, but, due to privacy restrictions, not which terminals were present where at what time. Hence, the information as to how users move from one zone to another is lost, and the problem we address in this paper is how to estimate this information from the aggregate data that we are given. The problems treated are as follows:

Problem 1 (One-Step Transitions).

Given aggregate presence data for disjoint geographical zones at two points in time, and a cost function mapping each pair of geographical zones to a real number, estimate for each zone the proportion of users travelling to each other zone, within the time interval between the two points in time.

Problem 2 (-Step Transitions).

Given aggregate presence data for disjoint geographical zones at two points in time and a cost function mapping each pair of geographical zones to a real number, estimate for each zone the proportion of users travelling to each other zone, within time .

Problem 3 (Duration- Transitions).

Given aggregate presence data for disjoint geographical zones, for two or more points in time, and a cost function mapping each pair of geographical zones to a real number, and a given trip duration , estimate for each zone the proportion of users travelling to each other zone, with trip duration being .

Problem 4 (Realistic Transitions).

Given aggregate presence data for disjoint geographical zones, for two or more points

in time, and a cost function mapping each pair of geographical zones to a real number, and a distribution of trip durations, as a histogram approximation of a probability density function

, estimate for each zone the proportion of users travelling to each other zone.

Absent further information, we have to make assumptions about user behaviour, and the main assumption is that the users redistribute themselves in an optimal way – optimal with respect to a cost function, the design of which requires insights into the underlying transportation structure. This is best explained via a trivial-seeming example:
Assume first, that there are only two zones, and , and that we have the following data:

Intuitively, it seems obvious to expect that one user went from Zone 1 to Zone 2, whereas all other stayed put. Of course, it is also possible that two users went from Zone 1 to Zone 2 and the sole user in Zone 2 went to Zone 1. However, if we associate a cost to users moving, then the first, intuitive solution, is clearly the optimal one.

Assume now, that there is a third zone, , and that we additionally have the following data:

A natural assumption is now that the user in Zone 3 simply stayed there. But what if Zone 3 is connected to Zones 1 and 2 through rapid transit, whereas the only convenient way of getting from Zone 1 to Zone 2 is through Zone 3, and this journey takes longer than ? In this case, the cost of travel from to would be quite high, and the most reasonable and optimal solution would be to expect one user to move from to and one user to move from to .

This example illustrates the idea of optimal user movement and highlights, how knowledge of the transportation system needs to be incorporated in the design of the cost function. One can, for instance, benefit from the estimates of the travel-times provided by Google Maps [36], use the Euclidean distance, or a discrete metric, in which adjacent pairs of zones have cost 0 and all other pairs have cost 1. Either way, we will solve that problem by casting it as a linear program, which will be formally defined in the next section.

Finally, in Problem 4

, note that we treat the travel time of all users of the road network as a random variable

. For this random variable , one assumes there exists probability density function , i.e., a non-negative function:


On the input, we assume a histogram approximation thereof, with the bins of the histogram centered at multiples of the interval at which we sample the input data, i.e., . For example for the sampling at 1 minute, we would be given a list:


We note that such distributional data are regularly obtained by most operators of public transport, either by surveys or by studying traces of individuals using personal seasonal tickets, although the alignment of the bins may not readily match the sampling interval of the presence data.

3 The Linear Program

First, we show that Problem 1 can be solved by a linear program.

3.1 Notation and preliminaries

Throughout, denotes the nonnegative integers, and are positive integers, denotes an

-dimensional column vector of all

s, and is the

-dimensional identity matrix; the dimension

is omitted if it is clear from the context. We denote matrices by capital letters, and their elements by the corresponding lower-case letters, e.g. for a matrix , denotes the element in the row, column; denotes the transpose of ; denotes the trace of the square matrix . Note that with this notation, we obtain the vector of row-sums of , i.e.  with , as and analogously we have for the vector of column-sums.

To keep the notation compact, we consider user movement between two consecutive time intervals only; else, we’d have to include an index corresponding to the time stamp in everything, which is unnecessary for the developments. In other words, we consider a data set of the form


Let denote the (column) vector of users present in all zones at time , i.e.  and so on. The number of users moving from to is denoted by and the matrix of flows is ; this is the quantity we are interested in finding. We denote the costs of moving from to by and collect them in the matrix .

3.2 Cost functions

The costs of moving from to is, in some sense, a parameter to the model. We have conducted our experiments with three natural choices of the cost matrix , but we do not claim these are original.

Let us have each zone associated with a polygon, which is defined by a sequence of corner-points in the Euclidean plane. Let us denote corner-points associated with by . In the adjacency metric, we consider:


Considering the centroid of :


The centroidal distance is the Euclidean distance between centroids and of the two zones. Finally, considering the closest corner-point associated with the other zone:


Distance is then the Euclidean distance between and . One can clearly define many other cost matrices, e.g., considering free-flow travel times, and introducing randomisation.

3.3 Constraints assuming a constant number of users over time

As a first step, assume that , i.e. that the number of users in the entire network is the same in both time intervals. In order for a flow to explain the observations, it must “remove” users from, and “place” users back in each zone (note that the number of users staying in is ). The number of users moving from is then given by the row-sum , whereas the number of users moving to is given by the column-sum . Of course, flows have to be nonnegative, too. Using the above notations, this translates to


We note here that this set of constraints defines what is known as the -transportation polytope with marginals and . A further constraint can be added: in reality, only integer numbers of users can move, hence we could also require . Since we are only approximating anyway, this might or might not be a good idea. Note also Lemma 5, which suggests that for integer marginals, the minimizer is also integral.

The total cost associated with a flow is , and we find the flow minimizing the overall cost – and hence an estimate of the true movement of users between zones – as the solution to the following optimization problem:

minimize (LP-C)
subject to

3.4 Constraints allowing for time-varying numbers of users

We cannot assume that the number of users stays constant: Users are arriving in the covered area and leaving it, and of course devices are being turned off and on. To address that, we introduce a sink/source zone and index it by . For this zone we have

The number of users in the sink/source zone can only be computed after the data at is available. Then, we let


where selecting a parameter allows for consideration of users disappearing from and spawning in instead of travelling from to . One might also consider adding another sink/source for this specific purpose to gain more control over the cost: for instance, if and are too small compared to , the optimization problem will always prefer disappearing/spawning over travel.

4 The Matrix Manipulations

Second, one should like go beyond a snapshot of transitions within a single interval between two points in time, at which the presence data are available. We show that one can manipulate the solution of the linear program (LP) of the previous section to derive first the -step transitions of Problem 2 and subsequently the solutions to Problems 3 and 4 by straightforward matrix manipulation.

4.1 Extrapolating the most recent one-step transitions

One possible solution to Problem 2 involves considering the output of the LP as a matrix and raising it to the appropriate power. Let us return to our example, where we had

We expect the flow of users to be , in other words two users remain in , one goes from to , and the user already in remains there. Another interpretation is in terms of proportions: One third of the users in move to , whereas two thirds stay put, and so on. Mathematically, that corresponds to normalizing each column of to sum up to one (i.e. making

into a row-stochastic matrix), say


If this trend were to continue, then we’d have for the number of users in and at time :


This can be written compactly, and more generally, as the matrix-vector product and with given, we have and and eventually we get

In the context of probability distributions (where the elements

denote probability mass instead of users), this is a well-known and thoroughly researched Markov chain, and all results apply here. The most important one is that, should the trend continue forever, the system will (under mild conditions) settle into a steady state. More specifically, there is a distribution of users

such that and as , we have .

4.2 Extrapolating recent one-step transitions

If instead of just event counts at two time steps, we have event counts at several consecutive time steps, another solution to Problem 2 would be to multiply by the estimated one-step flow matrices in turn. To return to our example, let us extend it to:

The flows between times and would be as before, and from to we’d have , or after normalization:


Now the estimate for would be and in general

Because the product of row-stochastic matrices is again row-stochastic, this new Markov chain may have a stationary distribution appearing every time steps.

The apparent difficulty with a naive implementation of these approaches is runtime, in particular when is large. As we will show in the next section, there are surprisingly efficient algorithms, though.

4.3 Approximating duration- transitions

In order to solve Problem 3, it suffices to consider a convex combination of 2 solutions to Problem 2 for suitable . Consider the computation of duration- transitions, where there exist presence data at times with . Consider the -step transition matrix and -step transition matrix computed as a solution to Problem 2. Clearly,


is the solution to Problem 3.

4.4 Approximating realistic transitions

Finally, in order to solve Problem 4, one has to notice that the solution is a convex combination of solutions to Problem 2 for bins of the histogram approximation. Consider the bins of the histogram with values:

where is the largest possible realisation of . It is clear that can be seen as a probability mass function on a discretisation of L, with . One can use as weights in a convex combination of the transition matrices, which is a sum. For the first bin, we consider a solution of Problem 1 with weight as the first summand. For the second bin, we consider a 2-step transition matrix computed as a solution to Problem 2 with weight as the second summand. Subsequently, we consider -step transition matrix , computed as a solution to Problem 2 with weight to obtain:


which is the solution to Problem 4.

5 Run-Time Analysis

Method Run-time Ref.
General-purpose interior-point method [10]
Hungarian method [17]
Augmenting-path algorithm [15]
The proposed algorithm Theorem 6
Table 1: The complexity of solving the linear program (LP-C) in dimension . We note that it makes it possible to obtain the objective function value at cost , while retrieving the matrix, at which this value is attained requires time .

5.1 Linear Programming

Next, let us consider whether a faster algorithm for solving the class of linear programming problems (13) are possible. We make use of:

Lemma 5 (E.g. [7, Lemma 2.2 and Corollary 2.11]).

For the transportation polytope defined by and we have:

  1. it is nonempty if and only if , in other words if the sums of the marginals are equal;

  2. all its vertices are integral, if and , i.e. if the marginals are integral.

but we stress that the polytope has been studied very extensively over the past 70 years and refer to [29, 30, 33] as standard references.

Now we state the main result:

Theorem 6.

Given , with , consider


Then, . This is computable in time linear in .

For the proof, please see the Appendix.

5.2 Sparse Matrix Multiplication

Method Run-time Ref.
Standard dense matrix multiplication Standard
Present-best dense matrix multiplication [18]
Fourier-transform-based methods [26]
Table 2: The complexity of multiplying two sparse matrices, each in dimension , each with non-zero coefficients, where the product has non-zero coefficients.

In addition to the complexity of the solving of the instance of the linear programming problem, we may need to solve a number of matrix-matrix multiplication problems, where the two matrices are sparse, each in dimension , each with non-zero coefficients, and where the product has non-zero coefficients. Using traditional dense linear algebra, one can obtain the result in time , using more sophisticated methods for dense matrices, one can improve this to .

Considering the sparsity of matrices , however, Pagh [26] has shown methods that there are based on conversion to polynomial:

and utilisation of fast Fourier transform for polynomial multiplication, which can compute such that in time , with . Eventually, Pagh [26] shows that the algorithm computes exactly in time with high probability. Although these algorithms may not be easy to implement, related algorithms with simpler hashing functions are readily implementable.

6 Computational Experiments

  • A cell identifier change reported in the access request procedure of any mobile originated (MO) or mobile terminating (MT) transaction

  • A cell identifier change reported when a cell identifier different from the one stored in the visitor location register (VLR) is used in response to any mobile terminating (MT) transaction

  • Location update performed by a mobile station

  • International mobile subscriber identity (IMSI) attached to a mobile station changes

  • A subscriber is deleted from the visitor location register

  • The subscriber switches off the mobile station

  • Mobile station is considered as detached by the network after a long period of inactivity

  • Cell identifier change is detected during GPRS activity

  • Cell identifier change is detected during the provision of subscriber information

  • During long calls, the visitor location register (VLR) may receive request messages to refresh subscriber data. In this way, the VLR notices that the cell identifier arriving in the message differs from the one stored in a database.

Table 3: Types of registered events: First two types are generated by a user (“active”), while the following eight are generated by the network (“passive”).

For our preliminary experiments, we have used data collected from the Public Land Mobile Network (PLMN) of Orange Polska within the municipal area of Warsaw. For each base transceiver station (BTS) of each separate cellular system (2G/3G), we have received the total numbers of connections from unique terminals per a unit of time. No data other than the aggregate statistics were collected.

In particular, the data were collected by an internal transmission system, in which some types of network events are associated with a location. Such events can be subdivided into events generated by the user (“active”) or generated by the network (“passive”). In Table 3, the first two bullet points capture events generated by the user. These events relate to mobile terminating (MT) transactions, i.e., events registered during an active use of a mobile terminal by the user. The following bullet points list event types generated by the network. These events are generated without an active participation of the user and include operations of the so-called visitor location register (VLR), which tracks subscriber roaming within a mobile switching centre (MSC) location area, cell identifier changes, International Mobile Subscriber Identity (IMSI) changes, location updates performed by a mobile station, and the switching of mobile stations.

Figure 1: Total number of events vs time stamps, split by passive (magenta) and active (black) events. The spike in passive events at 3:15am is an artefact of the data acquisition system’s operation.
Figure 2: The evolution of some statistics of the active (left) and passive (right) events observed. The statistics are computed over all BTSs and all 5 available days (Monday – Friday), e.g. the mean at 9:00 is the mean of the number of events at all BTSs on all 5 days at 9am. The inner (dark blue) band is the interquartile range, whereas the outer (light blue) band extends between the 5th percentile and the maximum event count at that time.

Within the one week of data collected, the most frequent events were location updates, while the cell identifier change was very rare. Figure 1 presents the evolution of the total number of events during the period. The volumes of both active and passive events are correlated with the time of day; the cyclical nature of the evolution of the volumes of events over time is clear. The evolution of the numbers of passive events is more uneven than that of the active events. The local maxima of the numbers of passive events at 3:15 a. m. are related with the operations of the data acquisition and processing system. Further, note that the observed events are not equally distributed in the city. Figure 2 presents statistics of the number of events observed across all BTSs at a given time. The large difference between mean and median, the wide interquartile range, and the very large difference between maximum and mean number of events at all times suggest a large variability in the distribution of events over the BTSs. This disproportion is the result of differing population density, differing ranges of base transceiver stations, and differing telecommunication technologies.

For comparison, we have used the estimates presently employed by the public transport operator in Warsaw (ZTM). One should like to point out that the estimates of ZTM imply a rather different distribution of the population across the zones than the density of events within the Orange network. See Figure 6 for the density of events within the Orange network for three representative time periods. In Figure 7, we compare it to the implied presence data within the ZTM estimate: with each data point represents one zone and the two axes correspond to the presence implied by ZTM for the morning peak and the numbers of events within Orange network during the same time period, averaged over the 5 work days. As can be seen, our model starts with rather different data, and any direct comparison of our estimates of trip distributions with the estimates of the public transport operator (ZTM) will hence be imperfect.

(a) 6 a. m. 
(b) 9:30 a. m. 
(c) 4:30 p. m. 
Figure 6: Density of events across zones for different time periods. Color scale uses a power law normalization.
Figure 7: Presence by zone implied by the estimates of ZTM and the density of events in the Orange network.

For further comparison, we have implemented the doubly-constrained gravity model [9]. There, the presence vectors across two consecutive time steps are seen as production and attraction vectors within the framework of a travel-demand model, which aims to generate trip distribution rates. An appropriate normalization can give yield matrix similar:


where and are zone-specific parameters for the origin and destination, respectively, which are learned from the data, and is a parameter, which we assume to be . An iterative algorithm [9] can then be used to determine the flow, subject to flow conservation constraints.

To test the approaches, data were aggregated spatially to 820 zones defined by the public transport operator (ZTM), and temporally to 15-minute units, prior to the application of any methods described in the present paper. Independently, for each cost function, an

matrix has been calculated, based on a description of each zone as a polygon with points defined by their latitude and longitude. Additionally, we have obtained randomised variants of the objective function by adding uniformly-distributed noise

to each element, e.g., for all . (Note that the introduction of a small amount of noise makes it possible to obtain multiple optimizers with very similar objective function value.) The data have been normalised to the number of users, i.e., such that the marginals, and hence the scalar variables sum up to . Subsequently, the linear program (LP-C) of Section 3.3 with scalar variables and dense constraints has been formulated and solved for each pair of consecutive time steps between 8 a. m. and 9.30 a. m. on each of the five work days and each of 4 randomisation of the objective function. The sparse, integral solutions of all linear programs have been averaged to reduce the sparsity in our estimate of the 1-step transition matrix . Subsequently, we have obtained -step transition matrices for , i.e., corresponding to trips of 30, 45, 60, 75, and 90 minutes of duration, by the matrix-multiplication procedure of Section 4.1. Finally, we have computed a convex combination of the -step transition matrices as suggested in Section 4.4, using weights detailed in the caption of Figure 12.

(d) Gravity model
Figure 12: Solutions to Problem 4 obtained using (LP-C) and three variants of the vector , compared against the estimates of the public transport operator (ZTM).

Notice that we did not have an accurate estimate of the trip-duration distribution, which Figure 12 suggests has a considerable impact. Still, the match seems encouraging, considering the simplicity of the methods applied: we have considered the simplest cost function , we have assumed the constant number of users throughout in (LP-C), and the crudest method of arriving at the multi-step transition matrices.

7 Related Work

7.1 The Demand Modelling

There is a long history of the use of individual mobile phone subscriber data in transportation modelling. Outside of Europe, the most common study is based on CDR (Call Data Record). For example in [4], the CDR are used to analyse flows between the city and the suburban area of New York. In [19], researchers analyse the CDR from the main mobile operator in Mexico, and one of the papers focuses on users movements analysis. In [37], CDR data including phone calls, SMS messages, and web browsing of customers of one US mobile operator in the San Francisco Bay area are used to determine patterns of urban road use. The Orange company organized two editions of the D4D Challenge (, during which the anonymous CDR of Abidjan in Cote d’Ivoire data was made available to developers and researchers. Some of the present authors [23, 28] used the data set to optimize public transport. In a related paper [11], the data from the competition is used to estimate the travel demand and network allocation. The access to the CDR data is particularly limited in Europe, though, where privacy-protection regulation make access to the CDR data impossible without their anonymization and user’s consent, in most cases.

Within Europe, [2] discusses the use of GPS data, CDR, and cellular data (details of base stations) to determine the trajectory of subscriber mobility. The data for this study was collected by a group of volunteer users in cooperation with one of the French operators, and the study covers the metropolitan area of Paris (Ile-de-France). The use of anonymous mobile data and their use for locating subscribers anonymously is described in [13]. [35] discusses the use of data from Italy’s Vodafone users to compute mobility patterns in Italian cities. In this case, the real-time events (due to voice, data, and SMS communication) observed between A Interface and IU-CS interface in 2G and 3G mobile networks have been used. In [31], the authors describe the usage of aggregated mobile phone data from Rome, including the information about the base station telephone traffic, expressed in Erlang units. These records are combined with the location and trajectory data of callers and data from the two city transportation companies (taxi and public bus company). [24] describe the aggregation procedure used to derive the dataset we use.

7.2 The Algorithms

Notice that the linear program (LP-C) has a well-known structure, in that the feasible set is the transportation polytope:

Definition 7 (Transportation Polytope).

Given , , the set of matrices satisfying


is called the -transportation polytope defined by the marginals and .

Related algorithms for solving special-cases of linear programming are employed throughout pattern recognition and content-based content retrieval. The commonly used statistical distances include the Earth mover’s distance (EMD)

[32], also known as the transportation cost metric [16], and Mallows distance [25], which are equivalent [20], in the sense of being dual of each other. These are discretisations of the metrics of Wasserstein and Monge-Kantorovich [8], which are also dual of each other. Computing the distances amounts to solving a certain problem over the so-called transportation polytope. Perhaps the best overview is provided by the two-volume work of Rachev and Rüschendorf [29, 30].

In Computer Vision

[27, 34, 12], a variety of exact and approximate methods have been considered. Perhaps the best known exact method [27] uses the min-cost flow algorithm, yielding run-time of for

bins. Approximations based on gradient flows of certain partial differential equations

[12] can be computed faster than earth mover’s distance, empirically, but do not come with any guarantees, i.e., the approximation ratio. Approximations based on the sum of absolute values of weighted wavelet coefficients of the difference of the histograms [34] have been proposed. Such an approximation can be computed in time linear in the number of bins, but does not come with a bound on its quality, i.e., the approximation ratio. Papers on approximations based on space-filling curves [14] have been withdrawn, recently.

In Machine Learning

[6], Sinkhorn distances, which are a regularised variant of the Earth mover’s distance, have been explored. Although the algorithms based on the Sinkhorn distance are substantially faster than the network simplex for Earth mover’s distance, It has also been realised [22] that one can employ the following representation with variables for bins: leading to run-time cubic in the number of bins.

In Theoretical Computer Science, [16] have studied the limits of approximability of earth mover’s distance, [1] have proposed logarithmic approximation in sub-linear time, while [3, 5] proposed algorithms approximation algorithms, whose run-time depends on the dimension of the domain space and the quality of the result required, possibly sub-linear in the number of bins. A similar in spirit is a recent method [21], which considers dithering of the domain space, followed by max-flow computations [27]. We stress that our algorithms are much easier to implement and provide the exact solution.

Overall, the present-best algorithms [27] had run-time of for bins, compared to we present in Theorem 6. Across all the references listed, only the value of the objective function, rather than the value of the argument at which it is achieved, is sought.

8 Conclusions

Across many transport-engineering applications, from planning of public transport to balancing of a fleet of shared vehicles, one needs an up-to-date model of demand for transportation. Further, one may be interested in higher moments of the demand, rather than just the commonly-used expectation. The use of mobile phone data makes it possible to address both issues.

For the first time, we have formalised the problem of extraction of a model of demand for transport from the aggregate data on the use of the base stations. We have shown that the associated optimisation problems have the form of a trace maximisation over the transportation polytope, which allows for much faster algorithms than the usual simplex or interior point methods. In particular, we show that there exist closed-form expressions for the values of the objective functions of trace maximisation over the transportation polytope. This allows for the computation of the objective function in time linear in the dimension of the input, in theory, and many orders of magnitude faster than solvers in the published literature, in practice. Numerous extensions are possible, aiming to infer mode choice, to study the variability of the demand over time, as well as to improve the calibration techniques. We envision this may open up a new direction of research on the interface of statistics and transportation engineering.


We would like to thank Marco Cuturi for having kindly provided us with his code for the comparison reported in Section 6. We would also like to thank Yossi Rubner et al. [32] and Ofir Pele and Michael Werman [27] for releasing their code on-line (, This work received funding from the European Union Horizon 2020 Programme (Horizon2020/2014-2020), under grant agreement no. 688380.


  • [1] Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over high-dimensional spaces. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, pages 343–352, 2008.
  • [2] Fereshteh Asgari, Alexis Sultan, Haoyi Xiong, Vincent Gauthier, and Mounîm A. El-Yacoubi. Ct-mapper: Mapping sparse multimodal cellular trajectories using a multilayer transportation network. Computer Communications, 95:69 – 81, 2016. Mobile Traffic Analytics.
  • [3] Khanh Do Ba, Huy L. Nguyen, Huy N. Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48(2):428–442, 2010.
  • [4] R. A. Becker, R. Caceres, K. Hanson, J. M. Loh, S. Urbanek, A. Varshavsky, and C. Volinsky. A tale of one city: Using cellular network data for urban planning. IEEE Pervasive Computing, 10(4):18–26, April 2011.
  • [5] Siu-On Chan, Ilias Diakonikolas, Gregory Valiant, and Paul Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’14, pages 1193–1203. SIAM, 2014.
  • [6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
  • [7] Jesús A De Loera and Edward D Kim. Combinatorics and geometry of transportation polytopes: an update. Discrete Geometry and Algebraic Combinatorics, 625:37–76, 2014.
  • [8] Yuxin Deng and Wenjie Du. The Kantorovich metric in computer science: A brief survey. Electronic Notes in Theoretical Computer Science, 253(3):73–82, 2009. Proceedings of Seventh Workshop on Quantitative Aspects of Programming Languages (QAPL 2009).
  • [9] Sven Erlander and Neil F Stewart. The gravity model in transportation analysis: theory and extensions, volume 3. Vsp, 1990.
  • [10] Jacek Gondzio. Interior point methods 25 years later. European Journal of Operational Research, 218(3):587–601, 2012.
  • [11] David Gundlegård, Clas Rydergren, Nils Breyer, and Botond Rajna. Travel demand estimation and network assignment based on cellular network data. Computer Communications, 95:29 – 42, 2016. Mobile Traffic Analytics.
  • [12] Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225–240.
  • [13] S. Hoteit, S. Secci, S. Sobolevsky, G. Pujolle, and C. Ratti. Estimating real human trajectories through mobile phone data. In 2013 IEEE 14th International Conference on Mobile Data Management, volume 2, pages 148–153, June 2013.
  • [14] Min-Hee Jang, Sang-Wook Kim, Christos Faloutsos, and Sunju Park. A linear-time approximation of the earth mover’s distance. CoRR, abs/1106.1521, 2011.
  • [15] Roy Jonker and Anton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
  • [16] Subhash Khot and Assaf Naor. Nonembeddability theorems via Fourier analysis. Mathematische Annalen, 334(4):821–852, 4 2006.
  • [17] Harold W Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • [18] François Le Gall.

    Powers of tensors and fast matrix multiplication.

    In Proceedings of the 39th international symposium on symbolic and algebraic computation, pages 296–303. ACM, 2014.
  • [19] Yannick Leo, Anthony Busson, Carlos Sarraute, and Eric Fleury. Call detail records to characterize usages and mobility events of phone users. Computer Communications, 95:43 – 53, 2016. Mobile Traffic Analytics.
  • [20] E. Levina and P. Bickel. The earth mover’s distance is the mallows distance: some insights from statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 251–256 vol.2, 2001.
  • [21] Longjie Li, Min Ma, Peng Lei, Xiaoping Wang, and Xiaoyun Chen. A linear approximate algorithm for earth mover’s distance with thresholded ground distance. Mathematical Problems in Engineering, 2014, 2014.
  • [22] H. Ling and K. Okada. An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):840–853, May 2007.
  • [23] G. Di Lorenzo, M. Sbodio, F. Calabrese, M. Berlingerio, F. Pinelli, and R. Nair. Allaboard: Visual exploration of cellphone mobility data to optimise public transport. IEEE Transactions on Visualization and Computer Graphics, 22(2):1036–1050, Feb 2016.
  • [24] Marcin Luckner, Aneta Rosłan, Izabela Krzemińska, Jarosław Legierski, and Robert Kunicki. Clustering of Mobile Subscriber’s Location Statistics for Travel Demand Zones Diversity, pages 315–326. Springer International Publishing, Cham, 2017.
  • [25] C. L. Mallows. A note on asymptotic joint normality. Ann. Math. Statist., 43(2):508–515, 04 1972.
  • [26] Rasmus Pagh. Compressed matrix multiplication. ACM Trans. Comput. Theory, 5(3):9:1–9:17, August 2013.
  • [27] O. Pele and M. Werman. Fast and robust earth mover’s distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467, Sept 2009.
  • [28] F. Pinelli, R. Nair, F. Calabrese, M. Berlingerio, G. Di Lorenzo, and M. L. Sbodio. Data-driven transit network design from mobile phone trajectories. IEEE Transactions on Intelligent Transportation Systems, 17(6):1724–1733, June 2016.
  • [29] S.T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume I: Theory. Probability and its Applications. Springer, 1998.
  • [30] S.T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume II: Applications. Probability and its Applications. Springer, 1998.
  • [31] J. Reades, F. Calabrese, A. Sevtsuk, and C. Ratti. Cellular census: Explorations in urban data collection. IEEE Pervasive Computing, 6(3):30–38, July 2007.
  • [32] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas.

    The earth mover’s distance as a metric for image retrieval.

    International journal of computer vision, 40(2):99–121, 2000.
  • [33] A. Schrijver. Combinatorial Optimization: Polyhedra and Efficiency. Algorithms and Combinatorics. Springer, 2002.
  • [34] S. Shirdhonkar and D. W. Jacobs. Approximate earth mover’s distance in linear time. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
  • [35] D. Tosi and S. Marzorati. Big data from cellular networks: Real mobility scenarios for future smart cities. In 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), pages 131–141, March 2016.
  • [36] Fahui Wang and Yanqing Xu. Estimating O–D travel time matrix by Google Maps API: implementation, advantages, and implications. Annals of GIS, 17(4):199–209, 2011.
  • [37] Pu Wang, Timothy Hunter, Alexandre M Bayen, Katja Schechtner, and Marta C González. Understanding Road Usage Patterns in Urban Areas. Scientific Reports, 2(arXiv:1212.5327):1001. 47 p, Dec 2012.

Appendix A Proof of Theorem (6)

Let us present the proof of


Given , with , consider


Then, . This is computable in time linear in .


The proof proceeds in two steps. First, we prove that for any feasible . Second, we construct a feasible achieving this upper bound.

Step 1: Since , we certainly have and . Hence in particular and , and so by choosing the tightest bound at each , follows.

Step 2: Before we proceed with Step 2 proper, we make two assumptions and explain that these are without any loss of generality.

Assumption 1: . Notice that if , one lets for and for all off-diagonal elements Such an trivially satisfies all constraints and , and hence the result holds for this case. We can hence assume without the loss of generality.

Assumption 2:The marginals are ordered such that:

If that is not the case, then let denote the permutation matrix so that and are ordered.111 can be constructed easily: let . Then and we obtain by rearranging the identity matrix so that the columns with indices in are the first columns, in any order. Then, if is feasible for (IP2) with the modified marginals and , then is feasible for (IP2) and achieves the identical objective, since .

Note that we certainly have , because implies for all , which in turn implies , which contradicts , hence ; together with Assumption 1 () leads to an analogous contradiction.

The construction of Step 2 fixes all values on the diagonal of , the first columns and the last rows:

Denote the remainders of the marginals by , , where the is taken element-wise.

Figure 13: Illustration of the proof of Theorem 6: After possibly reordering the marginals, the diagonals are filled in, and the hatched regions are filled with zeros. The elements in the rectangular block are then used to satisfy the remaining constraints, which are collected in and .

Let us now consider several properties of this construction; see also Figure 13 for an illustration:

  1. , which matches the bound of Step 1.

  2. for and for , i.e., all constraints but the first row sums and last column sums are already satisfied.

  3. Only for and are unassigned. Denote this matrix by .

  4. , , and for and . Denote the remaining elements of and by and respectively, i.e.  and and

    We still have , since we only removed zero elements.

By property P1 and P3, if we are able to find a matrix satisfying , and , we can recover a feasible that achieves the bound of Step 1. By P3 and P4, this amounts to finding a point in the ( transportation polytope defined by the marginals and . Since by P4 we have that , it follows from Lemma 5 that such a does exist, which concludes Step 2. ∎