Whether for short-term travel or longer-term migration, the movements of human populations impact culture, language and economics in fundamental and lasting ways. As a result, human migration patterns are a topic of intense interest to scholars, governments, human rights groups, and other organizations.
Migration has always been challenging to study due to a lack of high-resolution and up-to-date data. The data that do exist, in the form of census statistics or survey results, are static, are biased towards legally-protected populations, lack valuable descriptive population demographics, and lag the current population by months, often years. New solutions are needed.
The widespread adoption of online social platforms such as Twitter, Facebook, WhatsApp, and Skype presents exciting new ways of measuring and characterizing migration patterns. Such platforms obtain information about a user’s historical locations through two major means. A user may self-report their location (e.g., “Having fun with @jsmith in NYC!”) or even moves (e.g., “Just arrived in London. #sotired”). A user’s location is also implied by the IP address their computer or phone is using; public IP addresses are easily resolvable to latitude-longitude coordinates that, while rarely exact, can bound the user’s location to within several miles. Over time, either source of data can provide a history of the locations the user has stayed at.
A number of existing research efforts have used such digitally-constructed location histories to infer and then study country- or region-level residence histories: the dates at which a user changed their home and where they moved to [zagheni2014inferring, state2014migration]. Intuitively, residence may seem like a straightforward concept. One simply “lives” wherever they spend the night and mobility events translate neatly into residence changes. And yet we do not consider a short trip as changing where the traveler lives. Thus, even though location and residence are deeply intertwined notions, they are not the same. Indeed, in “the lay meaning of residence, [it] certainly arises after one has lived in a place for a reasonably long time; it may also arise after a comparatively short stay, or even immediately upon arrival, provided that one intends to remain there for a considerable period in the future.” [reese1952elusive].
Thus, the residence history inference problem that these existing studies have had to solve is, by its nature, ill-defined. As we will see, once properly defined, the problem is also a non-trivial one. To our knowledge, every prior study of residence has employed a heuristic in order to infer residence histories. Heuristics are naturally employed when the exact solution is intractable or impossible to obtain. So we might take the exclusive use of heuristics to be an indication that, at some point, the residence history inference problem has been proven NP-complete or worse. Surprisingly, no such study has been done: the formal properties of the residence history inference problem are entirely unknown. The practical implication of this is that an entire field of research may be resorting to heuristics, satisfying themselves with approximate residence histories, when the exact histories are, in fact, perfectly obtainable. This is the subject of the present work.
In this paper we focus on specifying and analyzing this residence history inference problem: determining an individual’s historical residence locations and time-intervals from a list of time-stamped locations at which that individual has been. In this work, we make two key contributions.
First, we formalize the problem and show that it is, in fact, quite tractable to solve.
Second, we provide an exact, polynomial-time algorithm.
Given continued and growing interest in migration studies, particularly using online and cell phone traces, the formalization of this problem and the exact solution we provide will put future quantitative work on migration and mobility on stronger, more theoretically sound footing.
In order to make our findings useful, we have also released a software tool that implements our algorithm (https://github.com/networkdynamics/resin).
In the remaining sections we first review past approaches to quantifying migration from trace data. We then motivate and provide a formal definition of the residence inference problem. Finally, we offer an exact solution to the problem and prove its correctness and its efficiency.
Human mobility covers a range of phenomena, from daily commuting routines to urbanization trends that span decades. In this paper, we focus on migration, a particular kind of mobility that captures when individuals change their durable home or base of operation.
During the past two decades, a substantial and growing body of research has taken a scientific lens to human migration patterns. This work has been fueled, in part, by the emergence of very large human trace datasets created by new technologies such as the Internet and cellphone networks. These datasets uniquely capture the distinct activities of individuals (as opposed to groups or communities). Scientists have long recognized the promise these datasets hold for the advancement of our understanding of basic human social processes [lazer2009computational].
In particular, computational social science has recently begun showing promise for the advancement of research into human migration patterns using large human trace datasets. Cellphone call-record data has been particularly helpful in studying dynamics up to the national level. In a study of an entire country’s call records, [phithakkitnukoon2012socio] examined the relationship between patterns of internal migration in Portugal and the evolution of social networks. Similarly, [blumenstock2012inferring] used call-record data to create estimates of internal migrations for Rwanda. [zagheni2012you] were the first to show that IP geolocation can be used to create country-dyad-level estimates of migration. [weber2013studying] advanced this method further by producing full country-to-country migration and tourism matrices from IP geolocation data. [zagheni2014inferring] showed how Twitter data could be used to generate estimates of both international and internal migration patterns. [state2014migration] used data from the professional network LinkedIn to produce estimates of highly-skilled migrant stocks across the world.
In order to conduct such analysis, every one of these studies (and all studies like them) must identify migration events in individual activity records. Since a migration event is, in effect, a change in the individual’s “home”, detecting migration requires identifying the individual’s residence at each point of time in the past. Thus, in essence, every migration study must solve the residence history inference problem. Existing work has, without exception, used heuristics that appeal to common-sense or legalistic notions of migration, thereby avoiding the task of formalizing the inference problem.
One such heuristic that has been used widely is the modal location approach (e.g., [Fiorio2017]). This method consists of simply dividing the residence history into intervals of fixed length, and assigning the modal location during each interval as the residence for that interval. This approach has the advantage of low computational complexity, offering a linear-time solution. Nonetheless, there are conceivable situations for which this approach would disagree with the exact solution: for instance, a user could spend 16 days in location A, move to location B and spend 14 days there, return to location A for 16 days, and then spend 44 days in location B. Using 30-day intervals would assign the user’s residence to location A for the first two periods, and to location B only for the last period. A move to location B would thus be detected only two months after the user actually changed residence. Despite these shortcomings, it’s worth noting that even in this contrived example the heuristic catches up with the user’s real location history, as the correct residence is eventually assigned.
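The modal location heuristic and the example above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited studies; the function name `modal_residences` and the list-of-labels input format are assumptions made here.

```python
from collections import Counter


def modal_residences(locations, window):
    """Assign each fixed-length window the most frequently observed location.

    `locations` holds one observed location label per day (assumed format).
    """
    residences = []
    for start in range(0, len(locations), window):
        chunk = locations[start:start + window]
        # most_common(1) returns [(label, count)] for the modal label
        residences.append(Counter(chunk).most_common(1)[0][0])
    return residences


# The example from the text: 16 days in A, 14 in B, 16 in A, then 44 in B.
history = ["A"] * 16 + ["B"] * 14 + ["A"] * 16 + ["B"] * 44
print(modal_residences(history, 30))  # ['A', 'A', 'B']
```

With 30-day windows, the first two windows each contain 16 days in A and 14 in B, so location B is assigned only in the third window.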
The Residence History Inference Problem
In this section, we formalize, for the first time, the problem which prior work has implicitly approached using heuristic methods.
The residence assignment problem seeks the most likely set of residence locations and intervals (hereafter, the residence history) that explains a series of time-stamped observations of a user at different locations, which we call an individual’s location history.
To formalize this problem, we begin by breaking the time period of interest into time unit intervals within which we can assign the person to one specific location. Here, for the purpose of clarity and concreteness, we will assume intervals to be days, but the temporal scale is a parameter of the model which does not impact the algorithmic properties.
Our observation data (hereafter, location history), $L = (l_1, l_2, \ldots, l_n)$, provides a location for each day in our time period. So, for example, if our time period of interest is a year, then $n = 365$.
We seek to infer from this location history the locations and intervals during which the user resided at different places. We represent this, like the observational data, as a sequence of locations, $R = (r_1, r_2, \ldots, r_n)$, with one location per day. Notice that when $r_{i-1} \neq r_i$, the user has moved residences on day $i$. Similarly, when $r_i \neq l_i$, the user is traveling away from home.
This problem is effectively a latent attribute inference task where the observational data is giving signal about where the user lives. Thus, we are interested in the residence histories that do the best job of explaining the locations observed in the location history.
We submit that a strong location-based signature of residence is time spent in that location: a person who intends to live in a place will eventually end up spending significant time there. This is the intuition that informs tax and immigration law as well as numerous studies of migration — and we employ it here.
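In this way of thinking, the best residence history will be the one in which the individual spends the most days at their residence locations. Generalizing this idea, the best residence history will minimize the number of days on which the observed location $l_i$ differs from the residence $r_i$:

$$\sum_{i=1}^{n} \mathbb{1}[\, l_i \neq r_i \,] \qquad (1)$$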
Selecting a residence history based on this single criterion admits a trivial solution: always assert that the user resides wherever we observe them (that is, set the residence $r_i$ equal to the observed location $l_i$ on every day $i$). Residence histories are typically more complex as people take trips which do not correspond to changes in residence.
We need to introduce a second criterion which penalizes solutions that create too many residence changes, effectively over-fitting the location history. Myriad approaches to modeling and legislating residence suggest the use of a minimum residence interval length (e.g., 90 days for the UN, 183 days for international tax law). This acknowledges the practical reality that, while one may intend to reside in a place, considerable time spent in that location, to the exclusion of other places, constitutes evidence of that initial (and continuing) intention. Thus, in addition to the inequality in Equation 1, we also require that each residence period last at least $k$ contiguous days, where $k$ denotes the minimum residence interval length.
This yields the problem definition given in Figure 1.
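The two ingredients of the problem, the count of days away from residence and the minimum residence interval, can be made concrete with a short Python sketch. The function names, the list-of-labels representation, and the parameter name `k` for the minimum residence interval are assumptions made here for illustration, not the notation of Figure 1.

```python
from itertools import groupby


def away_days(locations, residences):
    """Count the days on which the observed location differs from the residence."""
    if len(locations) != len(residences):
        raise ValueError("both histories must cover the same days")
    return sum(1 for seen, home in zip(locations, residences) if seen != home)


def satisfies_minimum_interval(residences, k):
    """Check that every contiguous residence period lasts at least k days."""
    return all(sum(1 for _ in run) >= k for _, run in groupby(residences))


history = ["A"] * 16 + ["B"] * 14 + ["A"] * 16 + ["B"] * 44

# The trivial solution has zero away days but violates a 30-day minimum.
trivial = list(history)
print(away_days(history, trivial), satisfies_minimum_interval(trivial, 30))

# A candidate with one move after day 46 is valid and has 14 away days.
candidate = ["A"] * 46 + ["B"] * 44
print(away_days(history, candidate), satisfies_minimum_interval(candidate, 30))
```

The trivial history illustrates why the first criterion alone over-fits: it scores perfectly on away days only by treating every trip as a move.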
Note that other alternatives to the minimum residence interval might be taken. For example, we could directly penalize longer residence intervals by using an objective function that combines the number of residence intervals with the total number of days the individual spent away from their residence locations. A likelihood-based approach might also be used, with an objective function built from the probability of a residence period of a given interval length and the number of days the individual spent at their residence locations.
Both of these alternative formulations present the serious challenge of learning weighting and other parameters from labeled data. As labeled migration data is very scarce, here we focus on the original formalization (which uses the minimum residence interval length, $k$), and identify these other criteria as promising directions for future work.
An Exact Solution
A generalized solution to the residence inference problem proceeds as follows: every observed location change, $l_{i-1} \neq l_i$, is a possible change in residence (i.e., a possible $r_{i-1} \neq r_i$ in the final solution), where $l_i$ and $r_i$ denote the observed location and the residence on day $i$. Thus, if we have a residence history $R$ constructed up to day $i-1$, then the change in location requires us to consider two derivative residence histories, $R'$ and $R''$: in the first, the location change was not a residence change; in the second, it was a residence change. Every $i$ at which $l_{i-1} \neq l_i$ induces such a branching on all existing solutions up to $i-1$. Once the final time interval has been processed, the total days-away-from-residence is computed for each candidate solution and the residence history with the lowest score is returned.
Crucially, this approach produces a branching exploration of the solution space which yields, in the worst case, a number of solutions that is exponential in $m$, the number of observed moves (i.e., $O(2^m)$). As $m$ will be somewhat correlated to $n$, the number of days observed, this suggests that this approach becomes computationally intractable with longer observation periods. Naturally, we must see if we can do better.
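The branching procedure can be sketched as follows. This is an illustrative reading of the description above, with several assumptions made here: the history is non-empty, the initial residence is the first observed location, a residence change always goes to the newly observed location, and candidates that violate the minimum residence interval `k` are discarded at the end. All function names are hypothetical.

```python
from itertools import groupby


def away_days(locations, residences):
    """Count the days on which the observed location differs from the residence."""
    return sum(1 for seen, home in zip(locations, residences) if seen != home)


def satisfies_minimum_interval(residences, k):
    """Check that every contiguous residence period lasts at least k days."""
    return all(sum(1 for _ in run) >= k for _, run in groupby(residences))


def branching_search(locations, k):
    """Branch at every observed move; return (away_days, residences) or None."""
    candidates = [[locations[0]]]
    for prev, cur in zip(locations, locations[1:]):
        extended = []
        for partial in candidates:
            # Branch 1: the residence stays where it was.
            extended.append(partial + [partial[-1]])
            # Branch 2: an observed move is treated as a residence change.
            if cur != prev and cur != partial[-1]:
                extended.append(partial + [cur])
        candidates = extended
    valid = [r for r in candidates if satisfies_minimum_interval(r, k)]
    if not valid:
        return None
    best = min(valid, key=lambda r: away_days(locations, r))
    return away_days(locations, best), best


history = ["A"] * 16 + ["B"] * 14 + ["A"] * 16 + ["B"] * 44
cost, residences = branching_search(history, 30)
print(cost)  # 14
```

Because the candidate list can double at each observed move, this sketch is only practical for histories with few moves.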
Happily, the additive cost involved in the objective function (see Equation 1), admits a computationally tractable dynamic programming approach. This is because an observation on day cannot affect the solution up through day . One way to think about this is that the minimum residence period defines the number of subsequent days that inform the residence location at time .
Because the addition of another observation at time does not change the optimal solution to the sub-problem to days through , we can formulate this using the following dynamic programming function:
Here is the minimum number of days the user must have been away from her residence for the time interval with a residence history ending with location . The optimal residence history can be constructed from the subproblems embedded in the final solution: the sequence of ’s in the solution indicate the time at which and destination to which the user moved.
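For concreteness, a recurrence of this kind can be written as below. This is a sketch under notation assumed here, not recovered from the source: $l_i$ is the observed location on day $i$, $n$ is the number of days, $k$ is the minimum residence interval length, $\mathcal{P}$ is the set of locations appearing in the history, and $C(t, \ell)$ is the fewest away days over days $1, \ldots, t$ among residence histories whose last residence period is at location $\ell$ and ends on day $t$.

```latex
\begin{aligned}
C^{*}(0) &= 0, \qquad
C^{*}(t) = \min_{\ell \in \mathcal{P}} C(t, \ell) \quad (t \ge 1), \\
C(t, \ell) &= \min_{1 \le s \le t-k+1}
  \Bigl[\, C^{*}(s-1) + \sum_{i=s}^{t} \mathbb{1}[\, l_i \neq \ell \,] \Bigr]
\end{aligned}
```

A minimum over an empty range is taken to be $\infty$, so $C(t, \ell) = \infty$ whenever $t < k$, and the optimal score is $C^{*}(n)$. In this formulation the preceding period is allowed to share the location $\ell$; that is harmless, since two adjacent periods at the same location merge into one longer valid period with the same number of away days.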
The time complexity of this approach is polynomial in the number of intervals in the location history and in the number of distinct locations that appear in the history.
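A dynamic program of this kind can be sketched in Python as follows. This is a minimal illustration written for this discussion, not the released `resin` tool or its API; the function name `infer_residence_history`, the parameter `k` (minimum residence interval in days), and the list-of-labels input are assumptions. As written, the sketch takes time proportional to the square of the number of days times the number of distinct locations.

```python
from itertools import accumulate


def infer_residence_history(locations, k):
    """Return (away_days, residences) minimizing days away from residence,
    subject to every residence period lasting at least k days."""
    n = len(locations)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k observed days")
    places = list(dict.fromkeys(locations))  # distinct places, in first-seen order
    # prefix[p][t]: how many of the first t days were observed at place p
    prefix = {
        p: [0] + list(accumulate(int(seen == p) for seen in locations))
        for p in places
    }
    inf = float("inf")
    best = [inf] * (n + 1)   # best[t]: fewest away days over the first t days
    back = [None] * (n + 1)  # back[t]: (start, place) of the last period
    best[0] = 0
    for t in range(k, n + 1):
        for s in range(0, t - k + 1):  # last period covers days s .. t-1
            if best[s] == inf:
                continue
            for p in places:
                away = (t - s) - (prefix[p][t] - prefix[p][s])
                if best[s] + away < best[t]:
                    best[t] = best[s] + away
                    back[t] = (s, p)
    # Walk the back-pointers to rebuild one residence label per day.
    residences = [None] * n
    t = n
    while t > 0:
        s, p = back[t]
        residences[s:t] = [p] * (t - s)
        t = s
    return best[n], residences


history = ["A"] * 16 + ["B"] * 14 + ["A"] * 16 + ["B"] * 44
cost, residences = infer_residence_history(history, 30)
print(cost)  # 14
```

On the 90-day example used earlier, the sketch places a single move after day 46, leaving the 14 days of the first stay in B as the only days away from residence.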
Optimizing by time-warping the location history.
Notice that there is never a reason to infer a change in residence when the user has not changed location. Such a change would either (1) be to the current location, in which case it would have been better to have changed residence when the user first arrived at that location, or (2) be to a different location, in which case a change of residence at this time does not decrease the number of away days being accumulated. As a result, one simple optimization is to construct time intervals so that they are of variable duration and end when the user is observed in a different location. This would give the time-warped history, a sequence of (location, duration) pairs in which each pair records the location of the user and the number of days the user is in that location. We can revise the dynamic program to:
where is the set of indices into the time-warped history that start at least days before time interval . The key difference here is that we are no longer looping over all time intervals (e.g., days), but rather over the time-warped intervals, skipping over periods when the user did not change location.
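This approach improves the time complexity of the algorithm, which now depends on the number of time-warped intervals rather than the number of days. Since the number of time-warped intervals is at most the number of days, this avoids a necessary increase in computational cost with longer location histories (only an increase in the number of moves will require additional effort).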
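Constructing the time-warped history amounts to run-length encoding the daily location history. A minimal Python sketch, assuming one location label per day and using the hypothetical name `time_warp`:

```python
from itertools import groupby


def time_warp(locations):
    """Collapse consecutive days at the same location into (location, days) pairs."""
    return [(place, sum(1 for _ in run)) for place, run in groupby(locations)]


history = ["A"] * 16 + ["B"] * 14 + ["A"] * 16 + ["B"] * 44
print(time_warp(history))  # [('A', 16), ('B', 14), ('A', 16), ('B', 44)]
```

Here 90 daily observations collapse to 4 intervals, so any loop over intervals scales with the number of moves rather than the length of the observation period.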
Different cost functions.
Notice that this dynamic programming solution is possible because of the independence of earlier subsolutions from later observations. In particular, the dynamic program separates previous residences from the addition of a new residence (location ). This same approach will work for other cost functions (e.g., besides ) that update the score of the solution based only on the duration/attributes of the current residence interval. For example, a likelihood function which penalizes the residence history based on its length would only involve replacing the summation in the dynamic program, which does not affect the complexity of the problem.
In this paper, we have made two important contributions to the active field of migration studies using online trace data. First, we have formalized the residence history inference problem, the core computational task involved in deriving migration events from social trace data. Up until now, prior work has employed heuristics which have informally engaged with this problem without clearly stating the singular problem they all have been seeking to solve. Our second contribution is an efficient algorithm that will always infer the optimal residence history from a user’s online location trace.
Our hope is that this exact solution will provide researchers an effective tool for conducting future studies of migration using online location trace data. Furthermore, as we have pointed out, a number of alternative formulations of the residence history inference problem exist which are likely both more complex and more expressive. Our hope is that these and related formulations will yield promising ground for further methodological work on this important topic.