Biases in human mobility data impact epidemic modeling

12/23/2021
by   Frank Schlosser, et al.

Large-scale human mobility data is a key resource in data-driven policy making and across many scientific fields. Most recently, mobility data was extensively used during the COVID-19 pandemic to study the effects of governmental policies and to inform epidemic models. Large-scale mobility is often measured using digital tools such as mobile phones. However, it remains an open question how truthfully these digital proxies represent the actual travel behavior of the general population. Here, we examine mobility datasets from multiple countries and identify two fundamentally different types of bias caused by unequal access to, and unequal usage of, mobile phones. We introduce the concept of data generation bias, a previously overlooked type of bias, which is present when the amount of data that an individual produces influences their representation in the dataset. We find evidence for data generation bias in all examined datasets in that high-wealth individuals are overrepresented, with the richest 20% skewing the datasets. This inequality is consequential, as we find mobility patterns of different wealth groups to be structurally different, where the mobility networks of high-wealth users are denser and contain more long-range connections. To mitigate the skew, we present a framework to debias data and show how simple techniques can be used to increase representativeness. Using our approach we show how biases can severely impact outcomes of dynamic processes such as epidemic simulations, where biased data incorrectly estimates the severity and speed of disease transmission. Overall, we show that a failure to account for biases can have detrimental effects on the results of studies and urge researchers and practitioners to account for data fairness in all future studies of human mobility.


Introduction

A large range of applications, from urban planning [wang2012understanding], dynamic population mapping [deville2014dynamic], estimation of migrations [palotti2020monitoring], to epidemic modeling [wesolowski2014commentary] rely on large-scale human mobility datasets. Mobility data has even been proposed as a cost-effective way of estimating population level statistics in low- and middle-income countries [blumenstock2015predicting, pokhriyal2017combining]. More recently, human mobility data has gained massive attention in the wake of the COVID-19 pandemic [oliver2020mobile], where it has been used to monitor the impact of mobility restrictions world-wide [kraemer2020effect, Flaxman2020, galeazzi2020human], to understand the complex effects of governmental policies [bonaccorsi2020economic, schlosser2020covid, chang2020mobility, Dehning2020], and as a key ingredient for epidemiological modelling [chinazzi2020effect, Gatto2020, arenas2020mathematical]. For all these applications it is crucial to have unbiased estimates of human travels that accurately represent the behavior of the underlying populations.

In our increasingly digital world, large scale mobility data is often passively collected as a by-product of digital technologies used for billing, service, or marketing purposes. This includes call detail records (CDRs) collected through regular mobile phone usage [gonzalez2008understanding], GPS traces collected via smartphone apps [stopczynski2014measuring], check-ins from online social media services [jurdak2015understanding], and smart travel card data [batty2013big] to mention a few. However, it is an open question how well these proxy datasets capture the true movements of people. It is widely recognized that access to, and usage of, big-data technologies is heterogeneous across populations [james2007mobile, lazer2014parable, sapiezynski2020fallibility]. Differences in technology usage can lead to disparities in how, or whether, individuals are captured in digitally collected datasets [sekara2019mobile]. For instance, it has been established that in certain countries mobile phone ownership is biased towards predominantly wealthier, better educated, and for the most part male populations [blumenstock2010mobile, wesolowski2012heterogeneous], meaning that mobility datasets captured in this context will mainly contain the travels of these demographics. Unequal access to digital technologies is especially troubling as it is often deeply intertwined with socioeconomic, geographic, ethnic, gender, ability, and class disparities [wesolowski2013impact].

The question of representation, i.e. whose behavior digitally collected datasets represent, is a pressing matter, as uncorrected biases in mobility data can lead to incorrect conclusions with potentially harmful impacts. For example, in the context of the current COVID-19 pandemic, government policies have in many countries been based on epidemiological models which use human mobility as a key ingredient. Mis- and underrepresentation of specific population demographics, however, could lead to discriminatory policies which disadvantage populations that are not properly captured in the data. This is especially problematic as researchers often use pre-computed datasets shared by third party data providers or other entities, and thus have only limited insight into the methodology of data collection, leaving potential sources of bias unnoticed. Despite the potential harmful impacts of biases, few studies have systematically investigated the presence of biases in human mobility datasets and what effects they can have on dynamic processes simulated on these datasets, such as epidemic models.

In this study we examine mobility datasets estimated from call detail records (CDRs) from 3 countries (Sierra Leone, the Democratic Republic of the Congo (DRC), and Iraq) for different forms of bias and we report a new type of bias which has not previously been reported in human mobility data. In addition to the previously well-documented technology access bias (not everybody has access to digital technologies such as mobile phones), we find evidence of a fundamentally different type of bias — data generation bias — where different demographics produce, or generate, unequal amounts of data.

Access to technology bias concerns the question of how access to the recording technology is distributed across different demographics. This type of bias is related to ownership of mobile devices and is well documented in the literature, where gender, age, disability, education, ethnicity, and wealth characteristics have been identified as factors that influence ownership [blumenstock2010mobile, wesolowski2012heterogeneous, wesolowski2013impact, aranda2019understanding]. As such, there exist methodologies to both quantify and partially reduce this type of bias in mobility data [wesolowski2013impact, tizzoni2014use, coston2020leveraging, pestre2020abcde].

Data generation bias is fundamentally different: it deals not with if, but with how a user is captured in a dataset. Even with homogeneous societal access to the recording technology (in this case mobile phones), equal representation is not guaranteed if technology usage depends on the socioeconomic properties of individuals. If an activity has an associated cost, for instance the cost of making a phone call, poorer individuals might limit their activity to save money, in turn lowering their representation in the collected data. For human mobility derived from mobile phones, data generation bias will manifest in individuals with low mobile phone activity (few calls and text messages) having fewer trips captured in the data (see Fig. 1a). In fact, previous studies have demonstrated that low mobile phone activity (low number of calls and texts) is reflected in the resolution with which individual mobility patterns can be reconstructed [ranjan2012call, zhao2016understanding], even if an individual has undertaken the same number of trips as a person with high phone activity.

We demonstrate the presence of both technology access and data generation bias in the examined datasets and present a framework to correct for, or debias, mobility datasets for these biases in order to improve the representativeness in terms of socioeconomic characteristics. Using epidemic simulations as an example, we compare the outcome of epidemic models running on biased and debiased data and demonstrate that biases can severely impact the result of dynamic processes. As such, if biases are not taken into account they can greatly influence insights derived from epidemic simulations.

Results

We analyze CDR-estimated mobility data from Sierra Leone, the Democratic Republic of the Congo (DRC), and Iraq (see Supplementary Materials for a full description of the datasets). Data is generated in the form of mobile phone activity (texts or calls), which is registered at the closest cell tower. Trips are then recorded as movements between cell towers. The data processing differs slightly among the datasets used here: for the Sierra Leone and DRC datasets, trips take place between subsequent locations of data activity; for Iraq, a trip is counted from the user's home location to the location of data activity (see SI for details). Trips are estimated on an individual level and, to preserve privacy, aggregated spatially and temporally by the mobile network operators (MNOs) to create the mobility network $F$ for each country. The flow $F_{ij}$ quantifies the total number of trips from region $i$ to region $j$ recorded within the time frame of the dataset, including flows that start and end in the same region. Here, the regions are the districts in each country (or their corresponding administrative level 3 division, see SI for details).

To understand the impact of socioeconomic status of users on data generation and on mobility patterns, we study subsets of the mobility networks which are distinguished by wealth. Before the datasets are aggregated, the MNOs split up the users in each dataset into 5 equally sized groups (quintiles Q1 to Q5) according to their airtime expenditure, with Q1 having the lowest and Q5 the highest expenditure (see Fig. 1b). Airtime expenditure has previously been established to correlate well with underlying wealth [wesolowski2013impact] and food security [decuyper2014estimating]. We thus classify the quintile Q1 (Q5) of users with the lowest (highest) airtime expenditure to have the lowest (highest) socioeconomic status, and refer to them as poorer (richer) users (see SI section 1.1 for more details). The MNOs provided us with separate mobility networks $F^{(q)}$, each containing only trips of users in quintile $q$. The original, aggregate network $F$ is the union of these quintile networks, where the flows are added up to the total flow $F_{ij} = \sum_q F^{(q)}_{ij}$.

Imbalances in data generation across wealth quintiles

We find that the wealth of users has an impact on the amount of mobility data they generate, as measured by the number of trips captured in the dataset. We calculate the fraction of trips generated by the users in each quintile $q$ as $T^{(q)}/T$, where $T^{(q)} = \sum_{i,j} F^{(q)}_{ij}$ is the total number of trips in each quintile network and $T = \sum_q T^{(q)}$ is the number of trips in the full network. Fig. 1c shows there are large inequalities in data representativeness: high-wealth users are overrepresented, with the wealthiest 20% of users (Q5) contributing approximately 50% of all recorded trips, while the poorest 20% (Q1) produce less than 5% of all trips. Taken together, the bottom 80% of users produce approximately the same amount of data as the wealthiest 20%, a finding which is consistent across countries. (Similar distributions have been identified across a multitude of systems and are often called Pareto distributions [newman2005power].) This is an important observation as the travel behavior of wealthy individuals dominates the captured mobility data.
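The trip fraction per quintile described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code of the study: the function name `trip_fractions`, the dictionary layout, and all toy numbers are assumptions chosen so that the resulting shares resemble the pattern reported in the text.

```python
def trip_fractions(quintile_networks):
    """Share of all recorded trips contributed by each quintile network.

    quintile_networks maps a quintile label to a dict
    {(origin, destination): number_of_trips}.
    """
    totals = {q: sum(flows.values()) for q, flows in quintile_networks.items()}
    grand_total = sum(totals.values())
    return {q: total / grand_total for q, total in totals.items()}


# Toy flows between two districts "A" and "B" (invented numbers, not from the paper).
toy_networks = {
    "Q1": {("A", "A"): 3, ("A", "B"): 1},
    "Q2": {("A", "A"): 6, ("A", "B"): 4},
    "Q3": {("A", "A"): 8, ("A", "B"): 6},
    "Q4": {("A", "A"): 12, ("A", "B"): 10},
    "Q5": {("A", "A"): 20, ("A", "B"): 18, ("B", "A"): 12},
}
fractions = trip_fractions(toy_networks)
```

In this toy example Q5 contributes half of all trips and Q1 contributes 4%, so the bottom four quintiles together produce as much data as the top quintile alone.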

Figure 1: Illustration of data generation bias and evidence from data. a. Users 1 and 2 have an identical real path (black arrow), but user 1 has a higher data generation rate leading to more recorded phone activities (white circles). As a result, for user 1 more trips between cell tower regions are captured in the CDR mobility dataset (colored arrows). b. Mobility networks are divided into 5 sub-networks $F^{(q)}$, each representing a quintile or 20% of the total user base. The users are sorted by their socioeconomic status as measured by airtime expenditure, so that the quintiles range from the poorest group of users Q1 to the wealthiest group Q5. c. Empirical evidence from three countries shows large inequalities in how users are represented in mobility data. Shown is the fraction of trips in each quintile among all trips. Richer groups contribute more trips and are thus over-represented in all countries. In effect, the top quintile of users (Q5) accounts for roughly half of all trips, while the poorest (Q1) accounts for less than 5%.

The effects of bias on the structure of mobility networks

Figure 2: The effects of data generation bias on the structure of mobility networks. a. The mobility networks for Sierra Leone for each socioeconomic quintile. The network for wealthier quintiles (Q5) is denser and contains more unique, more long-distance, and more high-flow connections. Edge color and width are scaled according to the number of trips. At the time of data collection there were 78.9 mobile phone subscribers per 100 people in Sierra Leone [international2020itu]. b. The cumulative distribution of the Shannon entropy of outgoing trips from districts for different quintiles. For Sierra Leone and Iraq, richer quintiles have a higher entropy, indicating more diverse mobility connections. The relationship is weaker and reversed for DRC. c. The cumulative distribution of the weighted clustering coefficient of districts in each quintile network. The clustering coefficients are generally higher for low-wealth quintiles (Q1), meaning that they are more locally dense with fewer long-distance connections.

Imbalances in data generation cause a profound over-representation of wealthy individuals, which matters, because as we show in the following, mobility patterns of the quintiles differ. As a consequence, the aggregate travel network is skewed towards the mobility patterns of rich people, and as we demonstrate in the last section, this distorts the outcomes of dynamic simulations, such as the conclusions we can draw from epidemic models.

In Fig. 2a we compare the mobility networks of low- and high-wealth users and find distinct differences between them, large enough to be visually observed. Comparing Q1 and Q5 for Sierra Leone, the mobility networks of wealthier users contain more trips overall and have more unique links but the same number of regions, and thus have a higher density. In addition, wealthier quintiles have relatively more connections over long distances, and have a higher proportion of edges with a high number of trips (see results in SI section 2).

In addition, we find significant differences between the quintiles in more advanced topological measures, namely the district-wise Shannon entropy (Fig. 2b) and weighted clustering coefficient (Fig. 2c, see definitions in methods). Shannon entropy is a measure of the diversity of the mobility connections starting in a district $i$. It has previously been shown to be connected to the socioeconomic status of a region: locations with a higher socioeconomic status in general have a more diverse set of connections, corresponding to a higher entropy [Eagle2010, wesolowski2012heterogeneous]. The relationship is not universal though [xu2018human] and can be reversed depending on the spatial organization of cities; for example, in certain places low-income households can be located in the outskirts of cities, and in other places they can be located downtown [barbosa2021uncovering]. In our data, we find significant differences in the distributions of entropies between quintiles, see Fig. 2b. In Sierra Leone and Iraq, richer quintiles have a higher entropy, indicating more diverse connections, while the relationship is reversed for DRC. In all countries, the distributions for Q1 and Q5 are significantly different (two-sample Kolmogorov-Smirnov test, see detailed statistics in SI).

Similarly, we find that the distributions of the weighted clustering coefficient differ between quintiles, see Fig. 2c. The clustering coefficient measures the average flow between triplets of neighboring districts, where a large value indicates that two neighbors of a district are likely to have large flows between them, too. We find that poorer quintiles generally have higher clustering coefficients, indicating that these mobility networks are locally more dense, whereas higher-wealth networks are less locally dense, coinciding with them having a higher proportion of long-distance trips as stated above. Again, differences between Q1 and Q5 are significant in a KS test (see SI). Importantly, the differences in network structure across poor and wealthy groups persist even when we account for imbalances in the total number of trips between the quintiles. We show this by calculating the metrics for resampled networks, which all contain the same total number of trips, and find that the observed metrics are qualitatively unaffected (SI, Fig SI2).

In addition to the observed data generation bias, we also find evidence of technology access bias in the mobility networks, meaning that only a fraction of the population is present in the dataset at all. This is frequently observed in mobile phone data as no single MNO has a monopoly, and commercial factors play a large role in which regions and populations are covered. We quantify this bias using the technology coverage rate $c_i$, which is the number of users of the MNO among the population in district $i$. Similar to previous studies [tizzoni2014use, coston2020leveraging], we find that $c_i$ is on average smaller than 1, that it varies considerably across districts, and that some districts are not present in the network at all, i.e. they have $c_i = 0$ and no in- or outgoing flows (see results in SI sec. 4). Taken together, data generation and technology access biases hamper our ability to truthfully reconstruct the true mobility network, and choosing not to correct for these biases will create datasets that over-represent the behavior of wealthier groups.

Debiasing mobility data

We formulate a methodology to construct debiased mobility networks that account for the data generation and technology access bias present in the original network (see Fig. 3). The debiased network is our estimation of what “unbiased mobility data” would look like in the optimal situation, where all users in the population had equal access to and equal means of data production. Our methodology is based on a general mathematical debiasing framework which can be adapted to other specific contexts (the full framework is described in SI sec. 3).

The quintile networks $F^{(q)}$ contain very different numbers of trips (see Fig. 1), leading to an unequal representation in the aggregate network $F$. To mitigate the skew in data generation we construct resampled quintile networks $\tilde{F}^{(q)}$ by sampling trips from the original quintile networks so that each network contains the same number of trips, i.e. we over- or undersample the previously under- and overrepresented networks (see Materials and Methods). The resampled networks are then combined to form a resampled aggregate network $\tilde{F}$, where each quintile of users contributes the same number of trips.

In addition to the resampling, we also use established techniques [tizzoni2014use] to account for technology access bias, i.e. for users and regions which are not represented in the dataset at all. First, in districts where mobility data is present, but where only a fraction of the total population are users of the MNO, we rescale all flows starting in district $i$ by the coverage $c_i$ to estimate the mobility of the total population, $\tilde{F}_{ij} / c_i$, yielding the resampled and rescaled network. Second, for districts where there is no coverage at all ($c_i = 0$), and where there consequently is no recorded mobility data, we estimate the missing flows starting and ending in this district from a gravity-like human mobility model (see details in Materials and Methods).
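The two corrections for technology access bias can be illustrated with a short Python sketch. It rests on assumptions that are not confirmed by the text: rescaling is taken to mean dividing each outgoing flow by the coverage rate of its origin district, and the gravity model is written in its textbook form $k N_i N_j / d_{ij}^{\gamma}$. The function names and the parameters `k` and `gamma` are hypothetical placeholders that would need to be fitted to observed flows; the exact model used in the study is described in its Materials and Methods.

```python
def rescale_by_coverage(flows, coverage):
    """Divide each flow starting in district i by its coverage rate c_i.

    Districts with zero (or unknown) coverage are skipped here; their flows
    have to be imputed separately.
    """
    rescaled = {}
    for (origin, destination), trips in flows.items():
        c = coverage.get(origin, 0.0)
        if c > 0:
            rescaled[(origin, destination)] = trips / c
    return rescaled


def gravity_flow(pop_origin, pop_destination, distance_km, k=1.0, gamma=2.0):
    """Assumed gravity form k * N_i * N_j / d^gamma; k and gamma are placeholders."""
    return k * pop_origin * pop_destination / distance_km ** gamma


# Toy inputs (invented): district "C" has no coverage and therefore no recorded flows.
flows = {("A", "B"): 10.0, ("B", "A"): 30.0}
coverage = {"A": 0.25, "B": 0.5, "C": 0.0}

rescaled = rescale_by_coverage(flows, coverage)
imputed_a_to_c = gravity_flow(pop_origin=100, pop_destination=200, distance_km=10.0)
```

With a coverage of 0.25 in district A, the 10 recorded trips from A to B are scaled up to an estimated 40 trips for the full population.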

The end result is a realization of the debiased network, which corrects for data generation and technology access biases. The average debiased network differs from the original network in many ways, most notably in that it covers the full area of the country and overall includes more trips (Fig. 3). Further, trips are more evenly distributed among the edges and nodes in the debiased networks, indicating a more equal representation of districts (see results in SI section 4).

Figure 3: Illustration of the debiasing process. a. The original mobility network exhibits biases regarding technology access (apparent in that some regions entirely lack data) and unequal data generation (wealth quintiles contribute an unequal fraction of trips, illustrated by bars below maps). b. We account for data generation bias by resampling trips from the original mobility network, where we over- and under-sample previously under- and over-represented quintiles, such that a resampled network has an equal fraction of trips from all wealth groups. In addition, we partially correct for technology access bias by rescaling existing flows, accounting for the fact that MNOs have only a fraction of the population as customers, resulting in the resampled and rescaled network. c. Finally, for regions where no travel data is captured at all, we impute flows from a gravity model, resulting in a realization of the debiased mobility network.

How biased mobility data affects epidemiological predictions

Unaddressed biases in mobility datasets are dangerous because they can affect the outcome of dynamic processes based on mobility, which we demonstrate using the example of epidemic spreading. We simulate epidemics on the original mobility network as well as on the debiased network and compare the differences. We use an SIR model to simulate a contagion process in metapopulations that are connected by commuter-type mobility [schlosser2020covid, tizzoni2014use]. For each simulation, we start the epidemic with a seed of infected individuals in a randomly chosen district present in the original data (for details see Materials and Methods).

Figure 4: Effects of biases on epidemic spreading. We simulate and compare epidemics on the original mobility network and realizations of the debiased network. Epidemics are started in random districts present in both datasets and results are averaged over multiple realizations (see Methods and Materials). a. The fraction of infected individuals in three sample districts of Sierra Leone over time. Depending on the district, the epidemic curve can be shifted to earlier or later times, have a higher or lower peak, or have an entirely different shape, which shows the diverse impact of biases on a sub-national level. Lines show the median across simulations, and shaded areas encompass the most central curves (see Materials and Methods). b. We find a substantial variation in the impact of biases on districts, as shown by the wide distribution of the arrival time change per district, which is the change in arrival time when switching to the debiased dataset. c. Changes in arrival times are spatially heterogeneous, with the epidemic often arriving earlier in remote, sparsely populated areas. Black circles depict national capitals.

On a sub-national level, we find stark differences between the spreading pattern of the epidemic on the debiased mobility networks compared to the original network, see Fig. 4. Depending on the district, the disease can arrive substantially earlier or later than predicted, and the epidemic curve can have an entirely different shape (Fig. 4a). To quantify the spatially heterogeneous impact of debiasing, we calculate the arrival time change for each district as the difference between the average arrival time in the original mobility network (averaged across simulations) and the average arrival time in the debiased mobility networks. We find a strong variation in the arrival time change across districts (Fig. 4b). In Sierra Leone, the epidemic arrives on average 6 days earlier across all districts, but for some districts the virus can arrive up to 16 days earlier than predicted based on the original data. Results for DRC and Iraq show similar discrepancies in arrival time changes, although less extreme in magnitude: on average 2.7 days earlier for DRC, and 0.3 days later for Iraq. A spatial investigation of the arrival time changes reveals that the districts where the epidemic arrives earlier are often remote and sparsely populated areas (Fig. 4c). On a national level, the epidemic curves show a smaller change when debiasing the mobility data (see additional results in SI), which highlights that the heterogeneous impact of biases on a regional level can be missed when focusing on national averages.
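To make the notion of an arrival time change concrete, the following Python sketch runs a deterministic, discrete-time SIR model on two coupled districts and compares the arrival time under two mixing matrices. It is a deliberately simplified stand-in, not the commuter-type metapopulation model used in the study: the mixing matrices, the rates `beta` and `gamma`, the prevalence threshold that defines "arrival", and the sign convention (debiased minus original, so negative means earlier) are all assumptions made for illustration.

```python
def simulate_sir(mixing, population, seed_district, beta=0.5, gamma=0.2,
                 seed_cases=10.0, days=200):
    """Deterministic SIR on coupled districts with daily time steps.

    mixing[i][j] is the assumed share of contacts that residents of district i
    have with residents of district j (each row sums to 1).
    Returns one prevalence vector (I_i / N_i for every district) per day.
    """
    n = len(population)
    susceptible = [float(p) for p in population]
    infected = [0.0] * n
    susceptible[seed_district] -= seed_cases
    infected[seed_district] = seed_cases
    history = []
    for _ in range(days):
        # Prevalence is fixed at the start of the day, so all districts update in sync.
        prevalence = [infected[i] / population[i] for i in range(n)]
        history.append(prevalence)
        for i in range(n):
            force = beta * sum(mixing[i][j] * prevalence[j] for j in range(n))
            new_infections = min(susceptible[i], force * susceptible[i])
            recoveries = gamma * infected[i]
            susceptible[i] -= new_infections
            infected[i] += new_infections - recoveries
    return history


def arrival_time(history, district, threshold=1e-3):
    """First day on which prevalence in the district reaches the threshold."""
    for day, prevalence in enumerate(history):
        if prevalence[district] >= threshold:
            return day
    return None


# Toy setup (invented): the "debiased" network couples the two districts more strongly.
population = [10000, 10000]
original_mixing = [[0.99, 0.01], [0.01, 0.99]]
debiased_mixing = [[0.80, 0.20], [0.20, 0.80]]

t_original = arrival_time(simulate_sir(original_mixing, population, 0), district=1)
t_debiased = arrival_time(simulate_sir(debiased_mixing, population, 0), district=1)
arrival_change = t_debiased - t_original  # negative: earlier arrival after debiasing
```

In this toy setting the more strongly connected network brings the epidemic to the second district earlier, which mirrors the qualitative effect reported for remote districts. The study itself averages over many stochastic simulations and seed districts rather than using a single deterministic run.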

Discussion

The mobile phone datasets used in many fields as a proxy for large-scale human mobility are prone to biases which under- and mis-represent the travels of different groups of individuals and communities. As mobility data is gaining prominence among research communities and practitioners, including its extensive global use during the COVID-19 pandemic, where it is actively used to inform governmental measures and evaluate their effects, the identification and mitigation of these biases is a pressing matter.

In our study of mobile phone mobility datasets from 3 countries we find large imbalances in how much data each wealth group generates: the wealthiest 20% of users contribute more than half of all trips in the dataset, while the poorest 20% account for less than 5% of all trips, a finding which is consistent across countries and contexts. One explanation for this observation could be that wealthier individuals simply travel more, and that mobile phone data captures more of these trips. Travel survey data for cities in Colombia, however, show that this is not the case; poor individuals travel the same distance as wealthier people [lotero2016rich]. Similarly, a big-data analysis of mobility patterns in France has shown that there is no connection between travel distance and socioeconomic status [pappalardo2015using]. As such, it is unlikely that differences in travel distances can fully explain the vast imbalances observed in the data, which we believe can mainly be understood when taking data generation into account.

The over-representation of data from wealthy individuals is vital to address because there can be differences in where people travel to and which places they visit, especially between social classes. As Fig. 2 shows, and as the literature documents [barbosa2021uncovering], these differences can be pronounced. Further, as these structural differences persist when evening out the total number of trips across classes (SI Fig 2), they are not merely caused by imbalances in trip volume but signify different mobility characteristics. As such, by neglecting to account for imbalances in data generation, we run the risk of institutionalizing discriminatory practices by using mobility datasets which are skewed towards the travel behavior of predominantly wealthy individuals. This can have a detrimental effect on our ability to accurately predict the evolution of dynamical systems.

To illustrate the potentially hazardous effects biases can have on insights drawn from mobility data, we use epidemic simulations, which have been extensively used to understand and predict the spread of COVID-19 [oliver2020mobile, kraemer2020effect, Flaxman2020, galeazzi2020human, bonaccorsi2020economic, schlosser2020covid, chang2020mobility, chinazzi2020effect, Gatto2020, arenas2020mathematical]. In our study we show that biases embedded in data can have a substantial impact on the pattern of epidemic spread. Biases can cause a severe over- or underestimation of key epidemic properties such as the severity of the peak or the arrival time of the epidemic; both are characteristics which governments use for planning and responding to pandemics. These effects are especially deceiving because of their heterogeneity, as they can heavily impact certain regions but leave others unaffected, and can even go unnoticed when focusing on national averages, an effect that was previously referred to as the fallacy of the mean [vandemoortele2010taking]. Our research indicates that remote regions are especially impacted by biases, but more research is required to make clear whether there is, in fact, a connection to the demographic properties of the region. Disease dynamics are one type of dynamical process, but we expect other types of dynamical processes (traffic prediction, population dynamics, migration studies, etc.) to be equally affected by biased data.

To fully answer the question of to which degree the debiasing procedure improves representativeness, one would have to compare the data to the true mobility of the population. Unfortunately, ground truth data with the necessary scale, breadth, and resolution is currently unavailable; we are not aware of any dataset which can be used for a truthful comparison. One might argue that other digital data sources can be used, but they will inevitably not capture all trips in the population, and will be affected by the same biases. As such, our study is limited by the same validation factors as other papers focusing on human mobility. Nonetheless, the debiasing methods we present here are reasonable mitigation steps as they are derived from a mechanistic framework of how these biases arise, backed by observational data. It is important to keep in mind that many dynamic processes which are informed by mobility data are inherently probabilistic, where no single scenario, set of parameters, or input data should be seen as the sole truth [castro2020turning, juul2020fixed]. In these situations it is vital to model dynamic processes according to multiple alternative scenarios. Mobility datasets transformed with our debiasing method would provide one such scenario, in which travels of underrepresented individuals are more fairly included.

The debiasing framework we present aims to alleviate the issue and to increase data representativeness. Our study can be extended in many ways to further explore the topic and improve current practices. Here, we focus only on a single socio-demographic variable, wealth, to define population sub-groups. Other factors are equally important in understanding how bias affects human mobility data. As such, our analysis would benefit from an intersectional approach across multiple additional characteristics including age, gender, ethnicity, and ability [crenshaw2017intersectionality]. An important step to facilitate this would be to include more meta-data about the users in mobility datasets, which we urge researchers and MNOs to consider in privacy-preserving ways. We also only focus on mobility data in this study, but we expect the same biases to be present in almost all “big data” datasets of human behavior which are passively captured through digital means. Overall, we show that a failure to account for data generation biases can have distinct real-world effects and urge researchers and policy makers to take these issues into account.

Methods & Materials

Shannon entropy

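We examine the Shannon entropy $H_i$ of each district $i$ for a given mobility network $F$, given by [Morzy2017]

$$H_i = -\sum_{j \in \mathcal{N}_i} p_{ij} \log p_{ij}, \tag{1}$$

where $\mathcal{N}_i$ is the set of districts that district $i$ is connected to, and $p_{ij}$ is the proportion of trips that go from $i$ to $j$,

$$p_{ij} = \frac{F_{ij}}{\sum_{k \in \mathcal{N}_i} F_{ik}}. \tag{2}$$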
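As an illustration, the per-district entropy can be computed directly from a table of trip counts. The following is a minimal Python sketch, not the authors' code: the dictionary layout of `flows` and the function name `node_entropy` are assumptions made for this example, and the natural logarithm is assumed because the base is not stated here.

```python
import math


def node_entropy(flows, district):
    """Shannon entropy of the outgoing trip distribution of one district.

    flows: dict mapping (origin, destination) -> number of trips
           (hypothetical layout chosen for this sketch).
    """
    # Keep only outgoing links of the district that carry trips.
    out = [f for (origin, _dest), f in flows.items() if origin == district and f > 0]
    total = sum(out)
    if total == 0:
        return 0.0  # no outgoing trips: entropy defined as zero here
    # p = share of the district's trips on each link; H = -sum p log p
    return -sum((f / total) * math.log(f / total) for f in out)


# Toy network: A splits its trips evenly, B sends everything to A.
example_flows = {("A", "B"): 50, ("A", "C"): 50, ("B", "A"): 100}
entropy_a = node_entropy(example_flows, "A")
entropy_b = node_entropy(example_flows, "B")
```

A district whose trips are spread evenly over many destinations gets a high entropy, while a district with a single destination gets zero.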

Clustering coefficient

We calculate the node-wise clustering coefficient $C_i$, which measures the fraction of possible triangles through node $i$ that exist, and thus the cliquishness or transitivity of the network. We use an extension of the clustering coefficient to weighted networks as proposed in [Saramaki2007, Onnela2005, Fagiolo2007], which is defined as

$$C_i = \frac{1}{k_i (k_i - 1)} \sum_{j,k} \left( \hat{w}_{ij}\, \hat{w}_{ik}\, \hat{w}_{jk} \right)^{1/3}, \tag{3}$$

where $k_i$ is the degree of node $i$ and the edge weights are normalized by the maximum weight in the network, $\hat{w}_{ij} = w_{ij} / \max(w)$.
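To make the weighted clustering coefficient concrete, here is a small Python sketch of the geometric-mean variant for an undirected network, with weights normalized by the maximum weight. It is an illustration under stated assumptions rather than the authors' implementation: the edge layout (`frozenset` keys) and the function name are hypothetical, and the directed extension cited above is not covered.

```python
def weighted_clustering(weights, node):
    """Geometric-mean weighted clustering coefficient of one node.

    weights: dict mapping frozenset({u, v}) -> edge weight of an
             undirected network (hypothetical layout for this sketch).
    """
    w_max = max(weights.values())
    neighbours = sorted({v for edge in weights if node in edge
                         for v in edge if v != node})
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no triangle is possible
    total = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            j, l = neighbours[a], neighbours[b]
            closing = frozenset((j, l))
            if closing in weights:  # the triangle node-j-l exists
                product = (weights[frozenset((node, j))]
                           * weights[frozenset((node, l))]
                           * weights[closing]) / w_max ** 3
                total += product ** (1.0 / 3.0)
    # Each triangle is visited once, so divide by the k(k-1)/2 possible pairs.
    return 2.0 * total / (k * (k - 1))


uniform = {frozenset(("A", "B")): 1.0, frozenset(("A", "C")): 1.0,
           frozenset(("B", "C")): 1.0}
skewed = {frozenset(("A", "B")): 8.0, frozenset(("A", "C")): 8.0,
          frozenset(("B", "C")): 1.0}
c_uniform = weighted_clustering(uniform, "A")
c_skewed = weighted_clustering(skewed, "A")
```

In the skewed triangle the weak closing edge lowers the coefficient of node A even though the triangle is topologically complete, which is the behaviour that distinguishes the weighted from the unweighted measure.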

Resampling to account for data generation bias

We debias the mobility network with respect to data generation bias by over- or undersampling trips from the quintile networks that are under- or overrepresented, respectively. Empirically, we find a stark difference in the number of trips $N_q$ recorded in the quintile mobility networks $F^{(q)}$. If the data generation rate of users were equal across quintiles, and assuming that users undertake the same number of trips on average irrespective of wealth (see discussion), we would expect the number of trips per quintile network to be equal, $N_q = N/5$, where $N = \sum_q N_q$ is the total number of trips in the network. Thus, we create a realization of the resampled quintile networks $\tilde{F}^{(q)}$ by sampling $N/5$ trips from the original quintile networks according to

$$\tilde{F}^{(q)} \sim \mathrm{Multinomial}\left(N/5,\; P^{(q)}\right). \tag{4}$$

Here, $P^{(q)}$ is a probability matrix with $p^{(q)}_{ij}$ being the probability to sample a trip for the connection $(i, j)$. We estimate the probabilities from the frequencies of trips recorded in the mobility dataset,

$$p^{(q)}_{ij} = \frac{F^{(q)}_{ij}}{N_q}. \tag{5}$$

The full resampled mobility network is then given by adding all quintile networks,

$$\tilde{F} = \sum_{q=1}^{5} \tilde{F}^{(q)}, \tag{6}$$

and contains an equal number of trips from each quintile. A detailed motivation for this procedure is given in SI section 3.3. One might wonder why we resample the flows as in Eq. 4 instead of rescaling the flows by a constant factor to achieve the desired number of trips. We argue that the sampling method described here better estimates the true structure of the network; see SI section 3.3.5 for details.
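The resampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each wealth group is stored as a dictionary of trip counts per link, that every group contains at least one trip, and that drawing trips with replacement in proportion to the observed counts is an adequate stand-in for the sampling procedure. The names `resample_groups` and `group_flows` are hypothetical.

```python
import random
from collections import Counter


def resample_groups(group_flows, seed=None):
    """Draw the same number of trips from every wealth group and sum them.

    group_flows: list of dicts, one per wealth group, each mapping
                 (origin, destination) -> recorded number of trips.
    """
    rng = random.Random(seed)
    total_trips = sum(sum(flows.values()) for flows in group_flows)
    target = total_trips // len(group_flows)  # equal share per group
    combined = Counter()
    for flows in group_flows:
        links = list(flows)
        counts = [flows[link] for link in links]
        # Sample `target` trips; link probabilities are proportional to
        # the observed trip frequencies within this group.
        combined.update(rng.choices(links, weights=counts, k=target))
    return combined


# Two toy groups: the first produced nine times more data than the second.
rich = {("C", "D"): 90}
poor = {("A", "B"): 10}
resampled = resample_groups([rich, poor], seed=1)
```

After resampling, both toy groups contribute 50 trips, so the underrepresented group's link carries the same weight as the overrepresented one. Because the draw is random, repeated calls with different seeds give different realizations of the resampled network.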

Correction for technology access bias

For districts where mobility data is present, but where only a fraction of the total population are users of the MNO, we follow the methodology in [tizzoni2014use] and rescale the resampled flows to estimate the total flow in the population. In districts with no users and no recorded flow at all, we estimate the flows starting and ending in this district from a gravity-like human mobility model [de2011modelling]. We fit the gravity model to the mobility network of the poorest quintile for each country, as we deem those mobility patterns to be the most representative for the population in regions with no network coverage. We then add links from the gravity model to the mobility network for all districts with no in- or outgoing flow, yielding the debiased network. To avoid drastic changes to the structure of the mobility network, we do not add all possible flows but only a fraction of links such that the density of the network is preserved (see details in SI sec. 3).
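The two correction steps can be illustrated with a short Python sketch. Both functions are assumptions made for illustration and are not taken from the cited methodology: `rescale_flows` assumes the simplest possible rescaling, multiplying each flow by the ratio of population to users in the origin district, and `gravity_flow` assumes a generic power-law gravity form whose parameters (`c`, `alpha`, `beta`, `gamma`) are hypothetical placeholders that would have to be fitted to data.

```python
def rescale_flows(flows, population, users):
    """Scale recorded flows up to the full population of the origin district.

    Assumes flow_total = flow_recorded * population / users (an assumed,
    simplest-case form). Districts without users are skipped.
    """
    return {
        (i, j): f * population[i] / users[i]
        for (i, j), f in flows.items()
        if users.get(i, 0) > 0
    }


def gravity_flow(pop_i, pop_j, distance, c=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Generic gravity-model flow estimate between two districts.

    All parameter defaults are illustrative, not fitted values.
    """
    return c * pop_i ** alpha * pop_j ** beta / distance ** gamma


# One in ten residents of A is a user, so recorded flows are scaled by 10.
scaled = rescale_flows({("A", "B"): 10}, {"A": 1000}, {"A": 100})
# Estimated flow for a district pair without any recorded data.
estimated = gravity_flow(100, 200, 10.0)
```

In practice the gravity parameters would be fitted to an observed network first, and only a subset of the estimated links would be added so that the network density stays unchanged.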

Epidemic Simulation

We implement a SIR metapopulation model used in [schlosser2020covid, tizzoni2014use]. A full definition of the model is given in the SI. In our simulations we use a reproductive number of and recovery rate days. We have chosen these epidemic parameters loosely modelled after the wild type of the SARS-CoV-2 virus, for which meta-reviews estimate the basic reproduction number in the range of 2-3 and the infectious period as 4-8 days [alimohamadi2020estimate, park2020systematic]. The epidemic is started with a seed of infecteds in a district . We run simulations on the original mobility network and on 10 realizations of the debiased network . For each country and network, we run 10 simulations per district, starting the epidemic in all districts present in the original dataset in turn. We use this semi-random seeding (instead of choosing a random district each run) to minimize the variation between the simulations due to a different sampling of the districts, and thus improve the comparability of the results. The arrival time is defined as the first time when the fraction of infecteds passes the threshold of in district . The epidemic curves in Fig. 4a show the median of and the area that contains the most central of curves over time, where the centrality of curves is determined according to the method described in [juul2020fixed] using the Python software package curvestat.
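To show how arrival times follow from such a model, the sketch below implements a simple deterministic SIR metapopulation model in Python. It is not the model defined in the SI. The Euler time stepping, the movement of all compartments along the flows, and the function name `simulate_sir` are assumptions, and the default parameter values (`r0`, `infectious_days`, `seed_infected`, `threshold`) are illustrative placeholders chosen within the ranges quoted above, not the values used in the study.

```python
def simulate_sir(pop, flows, seed_district, r0=2.5, infectious_days=6.0,
                 seed_infected=20.0, threshold=1e-3, days=300.0, dt=0.1):
    """Return {district: arrival time in days} for a deterministic SIR run.

    pop:   dict district -> population size
    flows: dict (origin, destination) -> travellers per day
    """
    mu = 1.0 / infectious_days   # recovery rate per day
    beta = r0 * mu               # transmission rate per day
    S = {i: float(n) for i, n in pop.items()}
    I = {i: 0.0 for i in pop}
    R = {i: 0.0 for i in pop}
    S[seed_district] -= seed_infected
    I[seed_district] += seed_infected
    # Per-capita daily travel rate along each link (kept fixed over time).
    rate = {(i, j): f / pop[i] for (i, j), f in flows.items()}
    arrival = {}
    for step in range(int(days / dt) + 1):
        t = step * dt
        dS = {i: 0.0 for i in pop}
        dI = {i: 0.0 for i in pop}
        dR = {i: 0.0 for i in pop}
        for i in pop:
            n_i = S[i] + I[i] + R[i]
            # Arrival time: first time the infected fraction passes the threshold.
            if i not in arrival and I[i] / n_i > threshold:
                arrival[i] = t
            infections = beta * S[i] * I[i] / n_i
            recoveries = mu * I[i]
            dS[i] -= infections
            dI[i] += infections - recoveries
            dR[i] += recoveries
        # Move individuals of every compartment along the mobility links.
        for (i, j), r in rate.items():
            for comp, delta in ((S, dS), (I, dI), (R, dR)):
                moved = r * comp[i]
                delta[i] -= moved
                delta[j] += moved
        for i in pop:
            S[i] += dt * dS[i]
            I[i] += dt * dI[i]
            R[i] += dt * dR[i]
    return arrival


toy_pop = {"A": 10000, "B": 10000}
toy_flows = {("A", "B"): 100, ("B", "A"): 100}
coupled = simulate_sir(toy_pop, toy_flows, "A")
isolated = simulate_sir(toy_pop, {}, "A")
```

Running the same seed on an original and a debiased network and comparing the resulting arrival times per district is the kind of comparison described above. With no mobility links, the epidemic never reaches the second district.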

Acknowledgments

We thank the UNICEF offices in Sierra Leone, Iraq and DRC for their field support, their interest in Big Data for social good, and the questions they raised on the representativeness of Big Data. In particular, we thank Shane O’Connor, Khulood Malik, Bilal Al-Kiswani, Atif Khurshid, Rhawaa Khalid, Gibson Riungu and Tajudeen Oyewale for their technical support. We thank Elisa Omodei, Alex Rutherford and Michael Szell for helpful comments on the manuscript. UNICEF Innovation wants to thank Takeda for their Investment in Innovation and Frontier Technology for better health through UNICEF Venture Fund and MagicBox initiatives.

References