1. Introduction
Recent developments in global positioning systems (GPS) for wearable technology such as smartphones have drawn a great amount of interest from scientists studying the effects of environmental influences on different population groups [34, 26, 33, 21, 3, 44, 22, 41, 14]. A recent article [27] documents more than 100 studies from 20 disciplines that collect and analyze human timestamped GPS location data. This type of data is key for learning about the places where people routinely spend their time during activities of daily living in order to establish their relationship with socioeconomic outcomes, crime victimization, and physical and mental wellbeing. There have been extensive studies on the social stratification of mobility, such as health disparities of different neighborhoods, mental health, and substance abuse intervention [13, 38, 41], on the assessment of human spatial behavior and spatiotemporal contextual exposures [26, 33, 21], on the characterization of the relationship between geographic and contextual attributes of the environment (e.g., the built environment) and human energy balance (e.g., diet, weight, physical activity) [3, 44], on the study of segregation, environmental exposure, and accessibility in social science research [22], and on the understanding of the relationship between health-risk behavior in adolescents (e.g., substance abuse) and community disorder [41, 1, 42].
Notwithstanding a general consensus across disciplines about the tremendous potential of GPS location data for studying human mobility, very little is currently known about how long a GPS study should last. There is an inherent tradeoff between collecting location data from people for longer vs. shorter periods of time. Recording more GPS locations yields more information about the locations where an individual spends their time, as well as about the frequency, duration and timing of their visits to these places. However, an individual’s participation in a GPS study comes with burdens that often become significant if accumulated over longer periods of time: the individual needs to carry the device recording the data (a GPS tracker) everywhere they go, and needs to make sure the device is charged at all times and functions properly. Until recently, most GPS study designs stipulated mandatory regular visits to project coordination sites to download data from the location trackers, to replace batteries, and to replace GPS tracking devices that were lost or malfunctioning. While some of these issues have been addressed by using specialized apps on smartphones to collect GPS data and wirelessly transmit them into secure cloud databases, the costs of distributing smartphones to study participants, data plans, software development, and cloud computing are quite significant. In addition, there are important privacy considerations related to recording, for long periods of time, locations that might be sensitive for study participants. For these reasons, it is desirable to design GPS studies that are as short as possible to reduce the costs of the projects and the burden on study participants, while at the same time still providing guarantees that sufficient location data have been collected to properly address the research aims.
Despite the constant growth in the number of human mobility studies that collect GPS location data in the last 20 years, the question of how long GPS monitoring should last was not asked until recently [43]. In that paper, the authors argue that an effective GPS study should last until a minimum of 14 to 15 days of valid GPS data have been collected. While this finding is relevant for numerous research groups that, in the past, have designed GPS studies with a duration of 7 days (see [43] and the references therein), two weeks seems to severely underestimate the duration of other, more recent, GPS studies whose duration is significantly longer. For example, [6] and [29] represent studies that tracked adolescents in the San Francisco Bay area for one month. Another study [11] employs a more complex three-site design that comprises five assessments that take place every six months over two years of follow-up for participants enrolled in Chicago, and three assessments that take place every six months over one year of follow-up for participants enrolled in Jackson and New Orleans. During each assessment, participants wear a GPS tracker for two weeks. Thus this study [11] records GPS locations for a total of 10 weeks and 6 weeks, respectively, but splits the period of observation into several contiguous two-week periods of GPS monitoring. Longer periods of observation time were also suggested in [25], which found 17 weeks to be an adequate period of time to monitor human mobility based on geotagged social media data.
In this paper we lay out a theoretical framework for assessing the temporal stability of human mobility based on GPS location data. Such a framework is missing from the current statistical literature. Previous work [43, 25] on the assessment of the duration of GPS observation periods is based on empirical findings, and lacks any theoretical underpinnings. We address this gap by introducing several measures of the temporal dynamics of spatiotemporal trajectories of individuals. We illustrate the use of these measures with publicly available data from a study that recorded GPS locations of 185 individuals who live in a city in Switzerland over the course of 18 months.
2. Methods
The spatiotemporal trajectory of an individual in a reference time frame $[0,T]$ and spatial observation window $\mathcal{W} \subset \mathbb{R}^2$ is a curve
(1) $\gamma : [0,T] \to \mathcal{W}, \quad \gamma(t) = (x(t), y(t)),$
where $x(t)$ and $y(t)$ represent the longitude and latitude coordinates, respectively, and $\gamma(t)$ is the location visited by this individual at time $t$. We assume that this curve is smooth: $x(t)$ and $y(t)$ have continuous derivatives. The length of the curve in Eq. (1) is defined as [9]:
(2) $L(\gamma) = \int_0^T \sqrt{x'(t)^2 + y'(t)^2}\, dt.$
The complete trajectory $\gamma$ is never observed in the real world. Instead, observation times $t_1, \ldots, t_n$ are sampled from a distribution on $[0,T]$ with density $f$, and the corresponding locations on the curve
$\gamma(t_1), \ldots, \gamma(t_n)$
are recorded. These locations are realizations of a random variable
$X = \gamma(U),$
where $U \sim f$. Ideally we would like $U$ to follow a uniform distribution to have the same chance of recording a visited location anywhere in the reference time frame $[0,T]$. Due to technological limitations (e.g., GPS devices running out of power), heterogeneous built environments that prevent GPS devices from obtaining a location (e.g., skyscrapers in downtown areas or buildings without windows and Wi-Fi coverage), or human behavioral factors (e.g., individuals turning off their GPS devices around certain locations sensitive to them), the distribution of $U$ can be far from the uniform distribution.
We assume that GPS positional data from $N$ study participants were recorded. We denote by $\gamma_i$ the unobserved spatiotemporal trajectory of the $i$th study participant. The observation times in the reference time frame can vary between study participants. The GPS data for the $i$th study participant are the time stamped longitude and latitude locations:
(3) $\{ (t_{ij}, x_{ij}, y_{ij}) : j = 1, \ldots, n_i \},$
where $(x_{ij}, y_{ij}) = \gamma_i(t_{ij})$, the time $t_{ij}$ was sampled from a distribution with density $f_i$ independently of the rest of the observation times, and $t_{i1} < t_{i2} < \cdots < t_{i n_i}$. Here $t_{ij}$ represents the time when the $j$th location of study participant $i$ was recorded. Our framework allows for the possibility of having different reference time frames for various groups of study participants.
2.1. Measuring the temporal stability of human mobility patterns
One possible measure of the dynamics of the spatiotemporal trajectory $\gamma$ is the average velocity $v(t)$ at time $t$, which is a function of the length of the subcurve $\gamma_{[0,t]}$ of $\gamma$ from Eq. (1):
(4) $v(t) = \frac{L(\gamma_{[0,t]})}{t}$
for $t \in (0,T]$ and $\gamma_{[0,t]} = \{\gamma(s) : s \in [0,t]\}$. A sample estimator of the average velocity for the $i$th study participant is
(5) $\hat{v}_i(t) = \frac{1}{t} \sum_{j \ge 2 \,:\, t_{ij} \le t} \hat{d}_i(t_{i,j-1}, t_{ij}),$
where $\hat{d}_i(t_{i,j-1}, t_{ij})$ represents an estimate of the distance traveled between times $t_{i,j-1}$ and $t_{ij}$. In what follows we will assume that study participants traveled in a straight line or “as the crow flies” between two consecutive observed GPS locations. This is the simplest assumption one can make, and it leads to an easy way of calculating Great Circle (WGS84 ellipsoid) distances between two spatial locations [4]. However, this assumption underestimates actual distances traveled, and consequently underestimates the average velocity. More accurate approximations of distances traveled can be defined based on the shortest distances between two locations on a road network that spans the spatial observation window. Calculating distances based on a road network is more complex than calculating straight line distances, and involves significant GIS work since the maximum speed of travel on different segments of road needs to be taken into account [10]. Nevertheless, as the span of time between two consecutive observed locations becomes shorter, the difference between the road network and straight line distances decreases.
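The straight-line velocity estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a spherical Earth (haversine formula) rather than the WGS84 ellipsoid mentioned in the text, it assumes time is measured in hours from the start of the reference time frame, and the names `haversine_km` and `average_velocity` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical simplification (assumption)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def average_velocity(records, t):
    """Straight-line distance travelled up to time t, divided by t.

    records: list of (time_hours, lon, lat) tuples sorted by time,
             with time measured from the start of the reference frame.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    distance = 0.0
    for (t0, lon0, lat0), (t1, lon1, lat1) in zip(records, records[1:]):
        if t1 > t:
            break  # only segments that end by time t are counted
        distance += haversine_km(lon0, lat0, lon1, lat1)
    return distance / t
```

Because consecutive fixes are joined by straight segments, this sketch inherits the downward bias discussed above: any detour between two fixes is invisible to it.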
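More generally, consider a stochastic process $\{Z(t) : t \in [0,T]\}$, where $Z(t)$ is a mapping of the subcurve $\gamma_{[0,t]}$ into $\mathbb{R}$. The mapping is chosen such that $Z(T) \neq 0$. We define the absolute percentage error (APE, henceforth) which measures the error made when approximating $Z(T)$ with $Z(t)$ for $t \in [0,T]$:
$\mathrm{APE}_Z(t) = \frac{|Z(t) - Z(T)|}{|Z(T)|}.$
We quantify the temporal stability of the process $Z$ by introducing a related process called the last crossing time process $\{\mathrm{LCT}_Z(\epsilon) : \epsilon > 0\}$, where
(6) $\mathrm{LCT}_Z(\epsilon) = \sup\{ t \in [0,T] : \mathrm{APE}_Z(t) > \epsilon \}.$
In Eq. (6), $\mathrm{LCT}_Z(\epsilon)$ is the last time when the APE made when $Z(T)$ is approximated with $Z(t)$ is above a threshold $\epsilon$. The last crossing time is well defined since $\mathrm{APE}_Z(T) = 0$.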
Consider the process associated with the th study participant, , and let be its sample estimator based on the positional data in Eq. (3). The average velocity in Eq. (4) and its sample estimator in Eq. (5) are examples of processes and . A sample estimator of the last crossing time is
(7) 
We note that in the APE is determined based on the locations recorded for the th study participant before time : . As an illustration, Figure 1 shows estimates of the average velocity of an individual in the MDC data, together with the last crossing time estimate at .
The last crossing time of the APE associated with a process that is a function of the spatiotemporal trajectory of a study participant represents a measure of this individual’s mobility. Study participants who have more irregular mobility patterns (e.g., regular travel to locations at various distances from the individual’s residence that change after a few days or weeks) are expected to have larger last crossing times compared to study participants who travel to the same locations each week. For example, an individual with a very regular mobility pattern who travels every day from home to the office and back along the same route, and goes nowhere else, will record an APE equal to $0$ after one day, which leads to last crossing times of less than one day in Eq. (7).
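As a rough illustration of how a last crossing time could be computed from a process evaluated on a time grid, consider the sketch below. It assumes the APE is measured relative to the value of the process at the last evaluation time, which is our reading of the description above; the function names and the convention of returning the first evaluation time when the threshold is never exceeded are hypothetical choices, not taken from the paper.

```python
def absolute_percentage_errors(values):
    """APE of each value relative to the final value of the sequence."""
    final = values[-1]
    if final == 0:
        raise ValueError("the final value must be non-zero")
    return [abs(v - final) / abs(final) for v in values]


def last_crossing_time(times, values, eps):
    """Last evaluation time at which the APE exceeds the threshold eps.

    times, values: equally long sequences; values[k] is the process at times[k].
    If the APE never exceeds eps, the first evaluation time is returned
    (a convention chosen for this sketch).
    """
    errors = absolute_percentage_errors(values)
    crossings = [t for t, e in zip(times, errors) if e > eps]
    return max(crossings) if crossings else times[0]
```

For instance, with weekly evaluation times `[1, 2, 3, 4, 5]` and values `[10, 14, 11, 10.5, 10]`, the APE sequence is 0, 0.4, 0.1, 0.05, 0, so a threshold of 0.08 is last exceeded in week 3.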
Previous work [43] on the temporal stability of spatiotemporal trajectories has used the mean absolute percentage error (MAPE) which is the average of the APE across study participants:
(8) 
We define two measures of the overall temporal stability of the spatiotemporal trajectories of multiple study participants. The first overall measure is the last crossing time process of the MAPE process . We refer to this measure as . The second overall measure is defined as the average of the last crossing times of the APE of for , i.e. where
We denote this second measure by . These two measures are the same only if they are calculated for a single study participant (). They are useful for comparing the temporal regularity of mobility patterns of groups of study participants (e.g., younger vs. older individuals, men vs. women, high SES vs. low SES).
2.2. The activity distribution of human mobility patterns
The average velocity associated with the spatiotemporal trajectory of an individual does not provide any information about the spatial configuration of locations visited. Consider two example individuals who drive without stopping at the same speed for a long period of time. The first example individual drives back and forth between two places $A$ and $B$. The second example individual drives in a cycle from a place $A$ to another place $B$, then to places $C$ and $D$, then back to place $A$. Since the spatiotemporal trajectory of the second individual involves two additional places, more sample locations will be needed to understand the mobility pattern of the second individual compared to the mobility pattern of the first individual. However, the mobility patterns of these two example individuals will be indistinguishable based on the last crossing time process associated with their average velocity processes. We address this issue by introducing a distribution of the locations visited by an individual.
We assume that the observation window is partitioned into a set of grid cells . Each location on the curve representing the spatiotemporal trajectory of an individual is mapped into a grid cell . The observed locations for this individual mapped into are the sequence of grid cells that are realizations of a random variable where is a random variable on with a distribution with density .
We define the activity distribution over the grid cells . Here represents the proportion of time in spent by an individual in cell . We assume that follows a uniform distribution on , and define:
(9) 
The activity distributions associated with the two example individuals we introduced earlier can differentiate between their mobility patterns if the grid cells that contain the two additional places $C$ and $D$ visited by the second individual do not coincide with the grid cells of $A$ and $B$: they will show that the first example individual did not spend any time in the grid cells associated with $C$ and $D$. To employ activity distributions we need to have a method for recovering them from the available data.
The simplest estimator of the activity distribution is based on the relative frequency of visitation of the grid cells :
However, this estimator of is reasonable only if follows a uniform distribution as in Eq. (9). When follows an arbitrary distribution with density , a better approach is to use a weighted average estimator where:
(10) 
Although this estimator can be shown to be statistically consistent, it requires knowledge of the density . There are many methods for estimating from the data such as histograms or kernel density estimators [40]. We suggest using an estimation method that assumes that the distribution of is approximated by a piecewise uniform distribution. We take and . If is approximately uniform in for , then . This is a reasonable assumption if the times when locations are collected are roughly equally spaced in time (e.g., a location is collected every 10 minutes) since the mean of is . Thus an estimator of is
The weighted average estimator from Eq. (10) becomes
(11) 
We call the ordinary proportional time estimator of the activity distribution . This estimator relies on the assumption that the length of the time intervals in which an individual transitions between two grid cells is added to the time spent in both the grid cell they leave from, and the grid cell they arrive in. More specifically, assume that the consecutive observation times and are such that . Then allocates to the total time spent in both and .
We introduce a second estimator of the activity distribution :
(12) 
We call the conservative proportional time estimator. This estimator is more conservative than the ordinary proportional time estimator from Eq. (11) in the sense that any time interval defined by consecutive observation times and such that is ignored. That is, the time spent in a grid cell is calculated only based on time intervals in which an individual is known to have remained in that cell.
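The difference between the two estimators can be made concrete with a small sketch. It is written under explicit assumptions, since it follows the verbal description rather than Eqs. (11) and (12) themselves: in the ordinary version every interval between consecutive fixes contributes half of its length to the cell at each endpoint (so a transition interval is shared by the cell left and the cell entered), while the conservative version counts only intervals whose two endpoints fall in the same cell. Both are normalized to sum to one. The function names are hypothetical.

```python
from collections import defaultdict


def _normalize(totals):
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("no usable time intervals")
    return {cell: weight / norm for cell, weight in totals.items()}


def ordinary_proportional_time(times, cells):
    """Each interval contributes half its length to the cell at each endpoint."""
    totals = defaultdict(float)
    for j in range(len(times) - 1):
        gap = times[j + 1] - times[j]
        totals[cells[j]] += gap / 2.0
        totals[cells[j + 1]] += gap / 2.0
    return _normalize(totals)


def conservative_proportional_time(times, cells):
    """Only intervals that start and end in the same cell are counted."""
    totals = defaultdict(float)
    for j in range(len(times) - 1):
        if cells[j] == cells[j + 1]:
            totals[cells[j]] += times[j + 1] - times[j]
    return _normalize(totals)
```

With fixes at times `[0, 1, 2, 3, 4]` in cells `["A", "A", "A", "B", "B"]`, the ordinary sketch assigns 0.625 to A and 0.375 to B, while the conservative sketch drops the transition interval and assigns 2/3 and 1/3. As the fixes become denser the dropped transition time shrinks, which is the intuition behind the asymptotic equivalence discussed in the text.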
We show two important properties of the ordinary and the conservative proportional time estimators. First, we prove that both estimators are asymptotically equivalent. Second, we prove that both estimators are statistically consistent, that is, they will eventually recover the true activity distribution if sufficient location data are available. These properties rely on the assumptions (S1), (S2) and (S3) below:

(S1) The length of the time intervals between consecutive observation times as the sampling rate .

(S2) The sampling period is such that and when .

(S3) The number of transitions between grid cells is finite, i.e., there exists such that , where and are the left and right limits of at .
Assumptions (S1) and (S2) describe the meaning of asymptotics in our context. They imply that the observation times will eventually be dense in the reference time frame, i.e., there will not exist a fixed region of without any observation times when . Assumption (S3) requires that the spatiotemporal trajectory is sufficiently smooth such that it will not jump between grid cells infinitely often.
Theorem 2.1 (Asymptotic Equivalence Rule with Large Sampling Rate).
The proof of this result is given in Appendix A.1. We can also show that the same assumptions imply that the two estimators are statistically consistent.
Theorem 2.2 (Convergence Rule with Large Sampling Rate).
The proof of this result is given in Appendix A.2.
2.3. Measuring the temporal stability of human activity distributions
We are interested in determining the temporal stability of the activity distribution of an individual. We assume that the reference time frame is divided into time periods of equal lengths (e.g., days or weeks). We denote by the activity distribution from Eq. (12) associated with time period , . Then can be viewed as an dimensional random vector whose distribution reflects the variability from time period to time period of the individual’s mobility patterns. With this understanding, we are interested in determining the expectation . We call the time period activity distribution (e.g., daily or weekly activity distribution). The th component of is interpreted as the average proportion of time spent by the individual in grid cell in a given time period (a day or a week).
A simple estimator of is
(13) 
where is the ordinary proportional time estimator from Eq. (11) or the conservative proportional time estimator from Eq. (12).
Because is a consistent estimator of , the error we make when approximating with decreases as we observe the spatiotemporal trajectory of the individual for a larger number of time periods . We define the last crossing time of the sequence of estimators as follows:
(14) 
where is the usual norm for a vector , i.e., . Note in Eq. (14) we used the fact that for any .
The last crossing time in Eq. (14) is a measure of the temporal stability of the entire time period activity distribution . Individuals that spend approximately the same amount of time in the same places in every time period need to be observed for a smaller number of time periods to calculate estimator with the same APE compared to individuals with heterogeneous mobility patterns that spend different amounts of times at locations that change substantially across time periods. Therefore will be smaller for individuals whose time period to time period mobility changes less, and larger for individuals with irregular mobility patterns.
The disadvantage of using the last crossing time in Eq. (14) as a measure of temporal stability comes from the fact that it gives the same weight to the error made when estimating the proportion of time spent in grid cells in which an individual spends a lot of their time, and in grid cells that the individual rarely visits. The number of grid cells with a large proportion of time spent in them is likely significantly smaller than the total number of grid cells because most people tend to spend time at their residence, at their work place and perhaps in a few other select locations. For this reason, the error made when estimating the proportion of time spent in grid cells with sparse presence could dominate the overall APE of , and lead to larger values of . To remedy this issue, we define a new measure of temporal stability that focuses on the grid cells in which an individual spends larger proportions of time.
We define the ranking time period activity distribution associated with by replacing each component of with the sum of those components of that are no larger than that component, as follows [7]:
(15) 
The level set () of is defined to consist of all the grid cells whose corresponding components in exceed :
(16) 
It turns out that the level set covers grid cells whose total sum of components of is larger than :
Level sets have an easy to understand interpretation: for a given level , say , all the grid cells with a ranking time period activity distribution above will jointly cover at least % of the time in the time period. Values of closer to lead to level sets with a smaller coverage that comprise only the grid cells in which the individual spends the largest amounts of time. Values of close to lead to level sets with a larger coverage that comprise the majority of grid cells the individual spent time in.
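The ranking construction and the resulting level sets can be illustrated with a short sketch. It follows the verbal definitions above (each component is replaced by the sum of all components that are no larger, and a level set keeps the cells whose ranking value exceeds the level); the function names and the toy cell labels are hypothetical.

```python
def ranking_distribution(p):
    """Replace each component by the sum of all components that are no larger."""
    return {cell: sum(q for q in p.values() if q <= pc) for cell, pc in p.items()}


def level_set(p, level):
    """Cells whose ranking value strictly exceeds the given level."""
    ranks = ranking_distribution(p)
    return {cell for cell, r in ranks.items() if r > level}


# Toy activity distribution over four cells (illustrative numbers only).
activity = {"home": 0.60, "work": 0.30, "gym": 0.08, "cafe": 0.02}
```

In this toy example the ranking values are roughly 1.0, 0.4, 0.1 and 0.02, so `level_set(activity, 0.2)` keeps `home` and `work`, which together account for 90% of the time, that is, at least 1 minus the level. Raising the level to 0.5 leaves only `home`.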
Let be the ranking distribution of the estimator of in Eq. (13), and be the level set associated with as in Eq. (16). Given a level and a stability threshold , we define the last crossing time of the sequence of level sets as follows:
(17) 
where denotes the symmetric difference of two sets, and denotes the number of elements in a set.
The LCT of the level sets from Eq. (17) is a measure of temporal stability of the time period activity distribution that takes into account only the error made when estimating the time spent in the grid cells in which an individual spent most of their time. For the same value of , is decreasing as the level is increasing.
3. Application
The data we analyze come from Nokia’s Mobile Data Challenge (MDC) [18, 23, 24]. This was a mobile computing research initiative focusing on generating a deeper scientific understanding of social and behavioral patterns related to mobile technologies. The study took place in Switzerland, and collected various types of longitudinal information including time stamped GPS data from the cell phones of 185 study participants over the course of 18 months. Demographic data such as age and sex are also available. There are approximately 57.5 million GPS location records. The average length of observation for study participants was about 55 weeks. These data are publicly available upon request from the Idiap Research Institute.
Most activities of daily living of the study participants took place in a rectangular area that we partitioned into square grid cells with sides of length 28 meters. The locations that do not belong to this spatial observation window were dropped. These locations typically correspond to longer trips taken by study participants away from their places of residence. Figure 2 displays summaries of the GPS locations that fall in our chosen spatial observation window.
For each study participant, we calculated three measures of temporal stability of their mobility patterns: the last crossing time of the average velocity (LCTvelocity) as defined in Eq. (5) and Eq. (7), the last crossing time of the activity distribution (LCTdistribution) as defined in Eq. (14), and the last crossing time of the level sets of the weekly activity distribution as defined in Eq. (17). In the calculation of LCTdistribution and LCTlevel set, we use the ordinary proportional time estimator defined in Eq. (11). We used in the determination of level sets, and as the stability threshold for all three measures. The results are summarized in Table 1.
Table 1. Means, medians and sample standard deviations of three measures of temporal stability of mobility patterns. The unit of time is weeks.

Mobility Measure    Mean     Median   St. Dev.
LCTvelocity         30.04    26       17.29
LCTdistribution     37.18    37       16.06
LCTlevel set ()     17.69    17        9.50
About 30 weeks of observation are needed until the mobility patterns stabilize according to the LCTvelocity measure. A longer period of time, 37 weeks, is needed until the weekly activity distribution stabilizes. The increased length of the period of observation for this measure is not surprising since it is based on an estimate of the full weekly activity distribution in grid cells. About half of this observation time (18 weeks) is needed to obtain estimates of the level set of the weekly activity distribution which comprise the grid cells in which the study participants spend % of their weekly time.
We illustrate how the level set from Eq. (16) and its corresponding LCTlevel set from Eq. (17) change for different values of . To this end, we define an adjacency graph whose vertices are the grid cells in the spatial observation window. Two grid cells are connected by an edge in if they share an edge or a corner in their arrangement in the spatial observation window [39, 4]. We denote by the subgraph of defined by the grid cells in . We chose a study participant, and determined the level set , the last crossing time and the number of connected components of for (see Figure 3). For smaller values of , contains grid cells in which the study participant spends the largest proportion of time. When , has one connected component which implies that the grid cells that belong to are spatially adjacent, and define a single area in which the study participant spends larger amounts of time. The corresponding values of are less than 20 weeks which represents the length of observation time needed for reliably detecting this spatial area. For , has two connected components, and for , , has three connected components. Thus this study participant spends their time in grid cells that define two or three spatially contiguous areas. Since these areas include grid cells in which the study participant spends smaller proportions of their weekly time, the length of the observation time needed to identify these areas doubles to about 40 weeks. For , has 72 connected components because includes grid cells in which the study participant spends very little time. Figure 3 shows that approximately 70 weeks of observation time are needed to detect these grid cells. The same type of plots constructed for other study participants show similar relationships between , , and .
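Counting the connected components of a level set under the adjacency rule described above (cells are neighbours if they share an edge or a corner) amounts to a breadth-first search over an 8-neighbourhood. The sketch below is one possible way to do this; it assumes grid cells are identified by integer `(row, column)` pairs, and the function name is hypothetical.

```python
from collections import deque


def count_connected_components(cells):
    """Number of connected components among grid cells given as (row, col) pairs.

    Two cells are adjacent if they share an edge or a corner (8-neighbourhood).
    """
    remaining = set(cells)
    components = 0
    while remaining:
        components += 1
        queue = deque([remaining.pop()])
        while queue:
            row, col = queue.popleft()
            for drow in (-1, 0, 1):
                for dcol in (-1, 0, 1):
                    neighbour = (row + drow, col + dcol)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        queue.append(neighbour)
    return components
```

For example, the cells `{(0, 0), (0, 1), (1, 2)}` form a single component because `(0, 1)` and `(1, 2)` touch at a corner, whereas `{(0, 0), (0, 2)}` form two.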
Next we want to determine whether the temporal stability of activity distributions varies by the demographic characteristics of the population. We group the study participants by sex (male, female) and age group (young age 15–34 years old, middle age 35–54 years old, and old age 55 years old or older). For each of these five demographic groups, we calculated the average of the last crossing times of the activity distribution for every . The resulting curves are presented in Figure 4. The last crossing times at all levels are similar for men and women (see the top left panel). As such, there do not seem to be any sex-based differences in the temporal stability of the mobility patterns of men and women who live in Switzerland. However, since Switzerland is known to be a country with very high equality between the two sexes, this finding might not extend to other countries with profound sex inequality.
In the top right and bottom panels of Figure 4, we find evidence that the average last crossing times decrease with age especially for levels below . This means that mobility patterns are more regular, and consequently are more temporally stable for older study participants compared to younger study participants. The average last crossing times are larger and become very similar across demographic groups for levels above compared to smaller levels below . Thus study participants that belong to any of the five demographic groups tend to visit locations they do not typically visit. Longer observation periods are needed to successfully determine these locations. Nevertheless, in order to identify the areas in which study participants spend most of their time, Figure 4 suggests that 10 weeks of observation of GPS locations should suffice for individuals older than . Middle age individuals require about 15 weeks of observation time, while young individuals require about 20 weeks.
4. Discussion
The contribution we made in this paper is twofold. On the theoretical side, we proposed the use of last crossing time processes associated with spatiotemporal trajectories of individuals to assess the temporal stability of their mobility patterns. We defined several measures of the temporal dynamics of spatiotemporal trajectories based on the average velocity process, and on human activity distributions in a spatial observation window. We defined the ordinary and the conservative proportional time estimators of human activity distributions, and proved that they are consistent and asymptotically equivalent. We introduced the time period and the ranking time period activity distributions that capture the change in human activity distributions across time periods. We presented related estimators based on GPS location data.
On the empirical side, we analyzed GPS location data collected over a period of 18 months. The previous empirical study [43] that focused on assessing the duration of GPS studies is based on data collected over 30 days. By using our new statistical methods and GPS data collected over a much longer period of time, we determined that GPS monitoring needs to be done for at least 15 weeks, which represents a minimum study duration about 7 times longer than the 14-day minimum duration recommended in [43]. We also put forward the idea that the duration of GPS studies should be assessed by demographic groups. We determined that younger population groups should be monitored for longer periods of time compared to middle age population groups because of their more irregular patterns of mobility. On the other hand, shorter monitoring periods might be needed for older population groups that exhibit mobility patterns that are temporally more stable. We also suggest using our methods to assess the need for different time spans of GPS monitoring for men and women in countries with a known history of inequality between the two sexes. To the best of our knowledge, differential periods of GPS data collection based on demographic groups have not been discussed before. Our work suggests that GPS study designs should take demographic groups into account.
Funding
The work of Z.D. and A.D. was partially supported by the National Science Foundation Grant DMS/MPS1737746 to University of Washington. Y.C. received partial support from the National Science Foundation Grant DMS1810960 and National Institutes of Health Grant U01AG016976. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Acknowledgment
Portions of the research in this paper used the MDC Database made available by Idiap Research Institute, Switzerland and owned by Nokia.
Appendix A Proofs of theoretical results
A.1. Proof of Theorem 2.1
Proof.
We note that the ordinary proportional time estimator in Eq. (11) can be written as
(18) 
where . We will first show that the denominators of and are asymptotically the same. Assumption (S2) implies that , which shows the asymptotic behavior of the denominator of . For , we have
where is the constant from assumption (S3). The limit in the above equation is due to assumption (S1). Thus, the denominators of and are asymptotically the same. Next we focus on the numerators of the two estimators.
The numerator of can be written as
where . Let . Using Eq. (18), the numerator of can be written as
When , we have . By assumption (S3), there are at most number of time points such that the equality does not hold. Thus
which implies that
(19) 
Again, using the fact that there are at most number of time points such that the equality does not hold, we obtain
It follows that
which is the same limit in Eq. (19) we obtained for . Therefore the numerators of and are asymptotically the same, which proves that and are asymptotically equal.
∎
A.2. Proof of Theorem 2.2
Proof.
Theorem 2.1 proves that the two estimators are asymptotically equivalent. Thus, we only need to derive the convergence of one of the two estimators to the true activity distribution from Eq. (9). In what follows we focus on the conservative proportional time estimator.
Without loss of generality, we assume that there exist disjoint time intervals in which the individual is inside grid cell , i.e., there are such that for , , and
Since, in the definition of the true activity distribution , follows a uniform distribution on the reference time frame , we can express as
As before, .
For the interval , we let be the first observation time after , and be the last observation time before :
Because for all , we have for all . The conservative proportional time estimator estimates the length of the interval based on the length of the interval . The corresponding error is
due to assumption (S1).
By applying the above argument to each interval , , we conclude that
Because
we further conclude that
This proves the convergence of the conservative proportional estimator to the true activity distribution: