Markov Chain Models of Refugee Migration Data

by   Vincent Huang, et al.
University of Illinois at Chicago

The application of Markov chains to modelling refugee crises is explored, focusing on local migration of individuals at the level of cities and days. As an explicit example we apply the Markov chains migration model developed here to UNHCR data on the Burundi refugee crisis. We compare our method to a state-of-the-art `agent-based' model of Burundi refugee movements, and highlight that Markov chain approaches presented here can improve the match to data while simultaneously being more algorithmically efficient.



page 1

page 2

page 3

page 4


Information-Theoretic Reduction of Markov Chains

We survey information-theoretic approaches to the reduction of Markov ch...

Markov chain aggregation and its application to rule-based modelling

Rule-based modelling allows to represent molecular interactions in a com...

Markov Chain Neural Networks

In this work we present a modified neural network model which is capable...

Markov Chain-Based Stochastic Strategies for Robotic Surveillance

This article surveys recent advancements of strategy designs for persist...

Understanding the Dynamics of Information Flow During Disaster Response Using Absorbing Markov Chains

This paper aims to derive a quantitative model to evaluate the impact of...

Mix and Match: Markov Chains Mixing Times for Matching in Rideshare

Rideshare platforms such as Uber and Lyft dynamically dispatch drivers t...

Simulating acculturation dynamics between migrants and locals in relation to network formation

International migration implies the coexistence of different ethnic and ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A record 65.6 million refugees were forcibly displaced in 2017 [1]. The economic disparities between different parts of the world, and spontaneous violent ethnic or political upheavals, have made migration one of the most significant issues of the modern age. Notably, the civil war in Syria triggered large numbers of refugees destined for Europe or neighboring Arab nations; the ethnic cleansing of Rohingya Muslims in Myanmar led to many Rohingyas fleeing to adjacent nations such as Bangladesh; and armed conflict in Somalia, coupled with widespread drought and famine, scattered refugees throughout other nations in the Horn of Africa.

Accurate models of refugee movements would in principle allow us to predict the number of refugees who will arrive at a particular area or city, the date at which they will arrive, and the distribution of these refugees across multiple regions, a few days or even weeks in advance. This would allow refugee and governmental organizations to determine where to best distribute aid resources to maximize impact and efficiency.

Migration has been studied since the 19th Century along the lines of general heuristics and simple “gravity” models

[2, 3, 4], inspired by Newton’s theory of gravity. More recently, a variety of algorithmic approaches have been proposed, e.g. [9, 5, 7, 6, 8, 10, 11], many based on agent-based Model approaches [13] which bring significant improvements in their predictive capabilities. In this work we outline an alternative approach to simulating modern refugee crises by modeling the migration of refugees via stochastic matrices or Markov processes (see e.g. [22, 23]). The models developed here more accurately predicts the local movements of refugees in near-real time and improves on existing models in the literature. To this end we have developed for modern refugee crises on a regional level in near-real time, applied it to real world data and demonstrated that it runs efficiently from an algorithmic standpoint.

This type of Markov chain modelling has been previously successfully applied to a variety of topics, a small selection include viral epidemics [24], the internet [25] (notably, Google’s PageRank algorithm [26] which determines the relevance of webpages), financial systems [27], and evolutionary biology [28]. Markov chain models have also been applied to study large-scale problems of long-distance and more global migration patterns [14, 15, 16, 17, 18, 19, 20, 21], but to our knowledge this presents the first application to local movements of people and in particular to the study of refugees. We shall argue that Markov chains are an efficient and powerful manner to model refugee movements and present an alternative to popular agent-based models.

This work is structured as follows: In Section 2, we outline the geographical component of our approach to modelling refugee movements. In Section 3, we outline the algorithmic component of the Markov chain migration model. In Sections 4 & 5, we apply several versions of the MC Migration Model to the recent refugee crisis in Burundi and present results. In Section 6, we compare our simulations with results from an existing model of the Burundi refugee crisis. In Section 7 we provide a summary and some concluding remarks.

2 Graphs from Geography

The first key component in modelling refugee movements is accurately encapsulating geographical information from the region of interest. The most common approach is to construct a weighted graph , where the vertices

represent cities in the region and the edges in the graph represent roads between cities, weighted by the lengths of the roads. However, this leaves some degree of freedom in choosing which roads and cities to include, and how to assign weights; moreover refugee crises often occur in underdeveloped nations where population centers are small and the geographic data incomplete. For instance, one approach to determining which urban centers are significant enough to be included as vertices in the graph when reliable population data is unavailable is to utilizes nighttime images from satellites along with population estimates


A geographic map is a projection of some physical region to and cities can be represented as points in . Subsequently, we construct a planar graph by identifying the vertices as and introducing edges which correspond to the most major roads in the region in such a manner that any two edges intersect at their endpoints only. As a result, edges typically only link cities with other cities in their immediate vicinity.

An edge connecting two cities and is assigned a weight corresponding to the physical distance in kilometers between and , as determined via Google Maps. The graph construction thus far follows that of a “Local Interaction Model” detailed in [6], however, since many refugees will move along minor roads not included in this most basic planar graph, we will refine this initial graph construction with certain heuristics detailed below.

For any two cities , let be the minimal distance between using edges from , meaning is the length of the shortest path from to . We propose that there is some characteristic maximum distance that a refugee can travel in one day. The value of must be assigned with intuitive reasoning, rather than some prescribed method, and should incorporate information such as the geography of the region and the quality of roads. The value of is significant because it allows one to distinguish between different refugee movements based on their distance. We construct a refined graph of a given geographical region with cities by adding additional edges to . Specifically, for each non-adjacent pair with , an additional edge with weight is introduced in the planar graph . The resulting construction is typically non-planar.

The Floyd-Warshall algorithm [30, 31, 32] can be utilized to efficiently construct this new graph. The Floyd-Warshall algorithm takes a weighted graph with vertices such that each pair of vertices is connected by an edge of weight ; if are not adjacent, we adopt the convention . The algorithm then computes the shortest paths between every pair of vertices . It accomplishes this by letting be the minimum length of a path between vertices such that any intermediate vertices satisfy . Then is because there can be no intermediate vertices and one can recursively calculate


because the first term is the shortest valid path from to which does not pass through , while the second term is the shortest valid path from from to and then from to . The recursive step takes time to compute each new value for in terms of previously stored values , so the entire algorithm runs in time. Then is simply , so all minimal distances are computed. To our knowledge, this approach of distinguishing between different pairs of vertices using their minimal distance in a planar graph has not been applied to migration before.

3 Markov Processes and Refugee Movements

In what follows we will combine the graph constructions outlined in the previous section with novel modelling approaches which we discuss next in order to generate a new class of migration models, whose application we explore in Section 4. Consider a graph with vertices such that some pairs of vertices are connected by edges of weight . We demonstrated our approach to constructing such a graph which represents a geographical region in Section 2 and we next explain how migration can be simulated on such a graph. To do this we adapt some ideas and conventions developed in previous approaches to modelling refugee movements using agent-based models, due to Groen [5].

The vertex in the graph indicate specific geographical locations. To model refugee migration it is useful to differentiate these nodes between three types of location, specifically

  • Camps: Refugees in a refugee camp have a high probability of remaining in the camp, and refugees near a refugee camp have a high probability of migrating into that camp.

  • Conflict: Refugees in a conflict site have a high probability of leaving, and refugees near a conflict site have a low probability of entering the conflict site.

  • Neutral: A neutral vertex is a location which is neither a refugee camp or conflict site. Refugees in a neutral city have a moderate probability of entering or leaving.

A refugee camp is a location set up to house refugees and a conflict site is a city which has been the location of an armed conflict or other disaster. In principle a location can change type during the course of the simulation, for instance, as new refugee camps are established. Each vertex in our model carries this ‘type’ information as a label, along with three other labels:

  • Location name;

  • location type (Camp/Conflict/Neutral);

  • Time stamp ;

  • Refugee population at a given time stamp .

To simulate the movement of refugees the population value of each vertex evolves between each time stamp. Our model iterates through a series of discrete timestamps with each timestamp representing a single day and corresponding to the initial configuration of the system. The premise of our model is that on each day, refugees make a series of moves between locations in a region; once the total distance a refugee has travelled in a single step reaches or exceeds the maximum threshold , the refugee cannot make any more movements on that day. Between timestamps the population of each vertex changes due to the movements of the refugees.

The path that a refugee takes through the system is determined by the maximum threshold and also the properties of the vertices. Specifically, in our model a refugee at vertex has a probability of of remaining in that state. We assume that depends only on status of as either a camp, conflict, or neutral site. In addition to this we next define the probabilities that a refugee at a given location migrates to a different location between timestamps. Consider two consecutive timestamps and and let and represent two adjacent vertices in . For a refugee at vertex at timestamp which is able to migrate we denote by the probability that a refugee at for moves to at . We refer to as the intermediate probability at timestamp .

Note that represents the probability that the refugee remains in city , so . Furthermore, we assume that a negligible number of refugees leave the system of urban centers, so the laws of probability imply that for each where is the number of vertices in . It follows that


In our model the probability that a refugee will move to a particular location is a function of the edge weight (which encodes the geographical distances) and a scaling factor that takes different constant values depending on whether the location is a camp, conflict or neutral site (we specify the values of in Section 4 for a specific refugee crisis). For each vertex adjacent to we define the intermediate probability to be


where is a constant scaling factor fixed such that the probabilities add to unity after the type rescaling, thus eq. (2) is satisfied.

This setup has the following desired properties:

  • The total sum of the intermediate probabilities from any city is .

  • Refugees are more likely to migrate to refugee camps than to neutral cities.

  • Refugees are more likely to migrate to neutral cities than to conflict sites.

  • Refugees cannot directly migrate between non-adjacent cities.

  • The probability of moving to an adjacent city is inversely proportional to the distance separating the cities.

Note that this last point is in accordance with Ravenstein’s Laws of Migration [2], which state that migrants prefer to move shorter distances.

Thus far we have discussed how to encode the local preferences for refugee movements which is characterized by the intermediate probability defined above. We next introduce the notion of a transition probability , being the probability that a refugee in city at the start of timestamp , will move to city at the start of the next timestamp .

Our approach maintains that a refugee may make multiple moves between cities in a single day until the refugee’s total distance travelled exceeds . Each individual move is made according to the intermediate probabilities , while the transition probabilities measure the net effect of these movements taken in series. Thus the transition probabilities are completely determined by the choice of intermediate probabilities.

Suppose that at timestamp that the refugee population of city is for , we define the

dimensional ‘population’ vector which is a function of the timestamp


For a system with transition probabilities for

, the stochastic matrix of the system at time

can be denoted as


The entries of the matrix are each probabilities with the rows summing to one, thus the matrix describes transitions in a Markov chain.

The stochastic matrix dictates the probability of moving from to between timestamp and . Thus the expected population vector at timestamp can be described compactly as . Moreover, the th entry of gives the expected number of refugees at vertex at timestamp given by


More generally for a given timestamp the vector can be defined from the initial population vector recursively


Thus at the start of each sequential timestamp the vertex populations are inherited from the final populations at the end of all operations of the previous timestamp. This is traditional Markov process of an iteratively evolving system. More specifically, since the form of is not fixed but intermittently evolves over the timestamps if a neutral city becomes a camp or conflict site, this is a non-homogeneous Markov chain [23] .

This Markov system outlined above is advantageous for modelling migration because its allows for a relatively compact model solely in terms of the transition probabilities and concerns about intermediate probabilities and about refugees moving across multiple cities in one day are irrelevant because they are already encapsulated into the transition probabilities. As a result, the main difficulty in implementing a stochastic matrix model is to compute the values of . Our algorithm for evolving the vertex populations between two timestamps performs a set of order operations and thus one can break the problem of computing transition probabilities into smaller pieces. This permits the application of dynamic programming techniques [29] which are essentially an efficient form of recursion. To implement this we introduce a new location specific -tuple associated to vertex . The elements of represent the probability that the refugee will be at vertex at the end of timestamp , which depends on the distance a refugee has previously travelled during the current timestamp such that they currently reside in city .

Specifically, if , then the refugee cannot make any further movements to different cities, so will remain in city for the remainder of the timestamp. Therefore


where the single occurs in the th entry of the vector. If , then for each , the refugee moves from to with intermediate probability ; at the end of this move, the refugee is now in city and has moved a total distance of

. At this point, the probability distribution for the final destination of the refugee is

. It follows that


Note that when are not adjacent, our sum contains a term ; we adopt the convention for all to ensure the expressions are well-defined, though the multiplication by zero implies the exact value of is irrelevant.

Thus eq. (9) yields a recursive formula for in terms of expressions of the form , where . As a result, we can efficiently compute all vectors by first computing all vectors of the form and storing the results, then computing all vectors of the form , and so on; eq. (8) handles terms with that occur in the recursion. Computing each in terms of terms of the form takes time: It is essentially summing vectors of the form , each of which has already been computed and is dimensional. Therefore, this algorithm computes the vectors of the form for in time.

The transition probability is equivalent to the probability that a refugee currently located at city during timestamp who has travelled distance ends up at city by the end of timestamp . As a result, is the th entry in for all , so the computation of all terms allows one to compute all the transition probabilities (t).

The transition probabilities are constant over time unless the intermediate probabilities change which only happens when a city becomes a refugee camp or conflict site and whilst this occurs in our models, it is not a common occurrence. As a result, the above procedure should only run a few times during each refugee crisis. The computation of transition probabilities from intermediate probabilities completes the algorithmic aspect of our model.

4 Application: Modelling the Burundi Crisis

We utilize data on the Burundi crisis from a previous model of the crisis called Flee due to Suleimenova-Bell-Groen (SBG) [8], who in turn obtained their data from the United Nations High Commissioner for Refugees (UNHCR). The UNHCR data extracted by SBG [8] runs from May of 2015 to June of 2016, for a total of days, so the timestamps in the model range from . There were five active refugee camps in the Burundi crisis, each established by the UN at the cities of Nyarugusu, Nduta, Nakivale, Mahama, and Lusenda. Figure 1a shows the 30 major cities in and around Burundi which were modelled in Flee and Figure 1b illustrates the graph used to model the system constructed and used in SBG [8]. Figure 1c shows a non-planar graph representing additional routes connecting relevant locations, which we constructed by adding further edges to the graph in panel (b), as detailed in Section 3.

In SBG [8] the authors assumed that the probabilities of remaining in a city are


These estimates seem reasonable so we maintain them in our model too. We also take a list of major armed conflicts in Burundi from SBG’s Flee model [8] and turn the corresponding locations into conflict sites at the appropriate timestamp. The first such conflict took place on May 1st, 2015, which we set as the beginning date of the model. SBG [8] estimate that refugees travel a maximum distance of km in one day.222The supplementary materials of [8] presents sensitivity testing for Flee and shows that the results are robust under modest variations of the maximum distance and the staying probabilities . Arguably, this overly optimistic as most refugees travel on foot [8], so we instead set a daily distance threshold of (in units of km), corresponding to at 8 km/h.

Next, we initialize the population vectors to match the number of refugees in the Burundi region; again we obtained this data from [8], who retrieve it from UNHCR refugee camp registration data. SBG [8] notes that it is incredibly difficult to predict the number of refugees created by armed conflicts, but rough estimates can be obtained through various means such as satellite images [6]. As a result, rather than predicting the number of refugees resulting from a crisis, SBG [8] aim to predict the distribution of these refugees given the total number of them; our model also focuses on this objective. We will return in Section 5 to directly compare our results with those from the Flee model of SBG [8].

Figure 1: (a). Location labels overlaid on a map of Burundi and surrounding regions (from Google Maps [33]). Note, Nakivale is located further north than illustrated. (b). A planar graph representing major roads connecting locations relevant to the Burundi refugee crisis as used in [8]. (c). Non-planar graph representing additional routes connecting relevant locations, constructed by adding additional edges to the graph in panel (b), as detailed in Section 3.

Since represents the expected distribution at time of the refugees from time , the total number of refugees in is the same as the total number of refugees in . This implies that any increases in the number of refugees are not reflected in the matrix multiplication and therefore must be performed through an additional process. Specifically, we denote by the number of refugees in Burundi refugee camps at time according to the UNHCR. After each timestamp , the population discrepancy , between timestamp and timestamp , is compute and then an additional refugees is randomly added into the population vector by distributing them into the conflict sites in the model with probability proportional to the population of the conflict site. This ensures the model always contains an accurate number of refugees.

Due to this mechanism of adding refugees, after timestamp , the total number of refugees in the model is the same as the total number of refugees in UNHCR camps after day . As a result the model continually under populates the graph and therefore under predicts the number of refugees in each camp, because many refugees in the model are still making their way to camps. To account for this delay between the creation of refugees and their arrival in camps in the model, when the refugee populations in each refugee camp are extracted, following [8] we rescale each such population upwards by a constant factor so that the total number of refugees in the camps matches the UNHCR data for each day. This rescaling occurs in three of the four Markov chain models we outline below and in Flee [8]. Because our stated goal is to predict the distribution of refugees in camps given the total number of refugees, this rescaling is unimportant. However, since this scaling is somewhat ad hoc one of the Markov chain variants we outline below is designed to remove the need for this rescaling.

We now apply the modelling algorithms outlined in Sections 2 & 3 to the recent Burundi refugee crisis, due to a civil war, focusing on a set data on camp registrations during a period from 2015-2016 . We run four different models using Markov chains, each of which is a variation on the principles outlined in Sections 2 & 3, as we outline below:

Markov Chain: Initial graph. We apply the Markov Chain approach to the planar graph constructed in [8] from an analysis of major roads in Burundi on Bing Maps, see Figure 1b.

Markov Chain: Graph-adjusted. We additionally apply our Markov chain approach to the modified graph which we derive from using the procedure from Section 2, show in Figure 1c, in what we call the graph-adjusted model.

Markov Chain: Camp-adjusted. Long movements to refugee camps are systemically assigned smaller intermediate probabilities in our model. This heuristic makes sense for movements to neutral cities because refugees prefer making a series of shorter movements rather than a single long movement [2], but this reasoning breaks for refugee camps because they are refugees’ final destinations rather than intermediate locations. In addition, there are a variety of motivations for refugees to prefer faraway refugee camps, such as perceived safety and distance from danger. To realise this observation, we create an additional graph by taking edges in that represent long movements to refugee camps and reassigning these edges with weight so that long movements to refugee camps have the same chance of occurring as medium movements to camps. We call this variation the camp-adjusted model.

Markov Chain: Time-adjusted. We now present a fourth Markov chain model, which we call the time-adjusted model, which removes the need for the overall rescaling to match the total refugee population at the end of each timestamp. Denote by the average time it takes for a refugee in the model to travel from a conflict site to a refugee camp. Then if there are refugees in the model at timestamp , most of these refugees should arrive in refugee camps within days, so the number of refugees residing in camps within the model will be approximately at timestamp . We can use this observation to remove the need for the rescaling at the end of each timestamp by introducing an appropriate number of refugees in the model days before they are observed. Specifically, at timestamp we introduce refugees instead of refugees, and at the end of each timestamp we add an additional refugees rather than refugees.

It remains to compute , the average travel time for a refugee from a conflict site to a refugee camp, which can be found by running a mock simulation where the entire population of the conflict sites become refugees at timestamp , and then let denote the total camp population after days for for some suitably large constant . Then refugees take exactly days to arrive, so the average travel time is

For we find and thus this is the value we use in the time-adjusted model.

5 Results: Modelling the Burundi Crisis

We will now present the results of the Burundi models outlined above. Figure 2 shows five different graphs which are model results for number of refugees in the major regional camp as a function of time. Each graph represents one of the five different camps relevant to the Burundi crisis, and in each graph we plot the refugee population in the given camp as a function of the timestamp (days). Within each graph we show five curves: the four solid curves correspond to the four Markov chain models outlined in Section 4, the dashed curve is the UNHCR camp population data, and the shaded region indicates a 10% error on the data.

It is difficult to determine from these population plots alone how effective each model is at predicting refugee populations in the five camps. To quantify the goodness of fit we now introduce metrics to measure the success of a given refugee model. Denote by the set of relevant refugee camps. For each camp , let be the population of that camp at timestamp in a given model, and let be the population of that camp at day according to UNHCR data. Note that is precisely the sum of over each of the camps: . The first measure we consider is the Averaged Relative Difference (ARD) for a given camp on day , given by


The total ARD on a given timestamp is then the sum over the ARD values of all of the camps


Moreover, it is also useful to consider time averaged ARDs for a given model, by averaging over all the daily ARD values in a certain period. In particular, we consider the Average ARD over the entire period covered by the data (up to )

Figure 2: Plots of the population growth in each of the five major refugee camps relevant to Burundi refugee crisis predicted by each of our models as a function of timestamp (days). The four solid curves in each plot correspond to the four Markov chain models outlined in Section 4 and the dashed curve is the UNHCR camp data from [8], the shading indicates a 10% error in the data.
Figure 3: Total Averaged Relative Difference (ARD), as defined in eqns. (11) & (12), on each timestamp (day) for each of the Markov chain models. It can be seen that the refined graphs offer an improved fit of the data. Coloured curves match those in Figure 2.

Figure 3 shows the total ARD for each model as a function of time. This provides a good comparison tool to assess how well the models match the data. Observe that the Markov chain model using the ‘Initial graph’ (Figure 1b) consistently has relatively high error margins, while the graph-adjusted and camp-adjusted models (Figure 1c) are substantially improved.

Notably, there is potentially a lot of noise in the first few days of every model and in actual refugee crises, thus we expect the quality of the data to improve over time, as registration practices improve and the situation in the camps becomes more stable. To better analyze the long-term accuracy of the models we also consider the averaged ARD omitting the earliest period of the data, since this likely provides a more reliable measure of ‘goodness of fit’. Specifically, we call the Average ARD omitted the first days (counting from day ) the -day Clean Averaged ARD and denote this by with


We will consider the 30-day and 100-day Clean Averaged ARD, denoted and , the values of which are shown in Table 1 for the four Markov chain models of Section 4.

Initial Graph Camp-Adjusted Graph-Adjusted Time-Adjusted
0.41 0.22 0.32 0.37
0.38 0.21 0.28 0.34
0.37 0.21 0.24 0.28
Table 1: All time Average ARD , and Clean Average ARDs and for each model.

From the statistics from Table 1 we can draw some general conclusions about our models. Firstly, the Camp-Adjusted Model has a lower Average ARD than every other model in each of the three numerical measures , , and . Furthermore, the results clearly demonstrate the merit of the geographical adjustments introduced in Section 2, with the Camp-Adjusted and Graph-Adjusted Models yielding the lowest long-term average ARDs. That the Camp-Adjusted and Graph-Adjusted models give an improved match to data compared to graphs which represent the major road system may be indicative of refugees travelling off-road, or on minor roads not well documented on Bing or Google Maps. It would be interesting to further validate this conclusion over additional conflicts and longer term data sets.

The Time-Adjusted Model begins with relatively inaccurate results, but the average error quickly drops, as expected because this is a time-sensitive model. However, omitting the first 100 days and considering , the Time-Adjusted Model gives a comparably low average ARD to the other models, indicating that we were successfully in removing the rescaling of refugee camp populations at the end of each timestamp while still producing a relatively accurate model. This is significant because removing the rescaling process results in a more transparent model and less reliance on UNHCR data.

Figure 4: The population of each camp at the final entry in the data set predicted by the Markov Chain: Camp Adjusted model as the maximum distance threshold is varied. The prediction for each camp is indicated by a different solid curve, the dashed lines indicate the population reported in the UNHCR data as given in [8]. Notably, as is varied the number of edges in the graph varies as described in Section 2.

A key parameter in the Markov chain models is the distance threshold which determines the maximum distance a refugee can travel in one day. This parameter is also a source of some uncertainty since, whilst our assumed choice (in units of km) seems quite reasonable, it could conceivably vary by moderate margins depending on road quality, climate, and transportation options. Moreover, for the Camp Adjusted model as is varied the number of edges in the graph varies as described in Section 2. Thus it is prudent to explore how varying impacts the late time distribution of refugees. We show the impact of varying on the camp populations at (the final entry in the data set) in Figure 4, the solid curves give the prediction for the Markov Chain: Camp Adjusted model, and the dashed lines indicate the value reported in the UNHCR data. Observe that for then the fit to the late time data becomes marginally worse, however, for the results are relatively insensitive to changes in which provides some confidence that we need not be concerned by the exact value or indeed the possibility local variations in the maximum travel distance.

6 Discussion: Comparison to Agent-Based Models

We next discuss the differences between our Markov chain model and previous approaches using agent-based models, with specific reference to one state-of-the-art example due to Suleimenova, Bell, & Groen [8] called Flee. We will briefly summarize the main components of Flee and then compare our results to those obtained from running Flee (which is publicly available).

Flee [8] implements a drastically different approach known as agent-based modelling [13]. An agent-based model consists of a series of fixed states and a group of agents. Individual agents transition through these states according to the rules of the system, updating their current state at each integer timestamp. In modelling migration, a common approach is to define a weighted graph to contain the geographic information of a region of interest, and introducing one agent to represent each individual the region. These agents then migrate between the vertices of the graph according to some predefined algorithm. This approach has been well-studied and utilised in areas such as disaster-driven migration [34], the ongoing conflict in Syria [35], and in [8] applied to conflicts in Burundi, Mali, and the Central African Republic.

As in our Markov chain model, Flee [8] consists of a graph storing geographic information and a prespecified algorithm used by refugees to move between cities of the graph. We noted earlier that we derive the planar graph (Figure 1b) in our Markov chain model from Flee. Meanwhile, their agent-based model differs significantly from our Markov chain model, because Flee treats each refugee in the simulation as an entirely separate entity making independent choices. Whereas we model refugee movements in terms of general probabilities of transitioning between states, Flee iterates over all agents in the system during every timestamp and simulates for each individual agent its expected movement through various cities.

We noted earlier that the heuristic we use for properly initializing the population vectors at any given timestamp is derived from Flee’s heuristic for generating new agents at each day of the simulation, the number of new agents added to the system is computed using real data about refugee numbers. This is acceptable because Flee aims to predict the end distribution of refugees over a network rather than the actual number of refugees displaced, as such, a rescaling of the distribution is performed at the end of every timestamp to ensure the total number of refugees in the camps in Flee matches the corresponding total according to UNHCR data, the same rescaling we use in three of the Markov chain models.

Despite some similarities between Flee and our Markov chain models, differences in algorithms lead to significantly different results. We now compare the evolution of the camp populations and the ARD measures defined in eqns. (11)-(13) to quantitively compare our Markov chain models to Flee. Figure 5 is analogous to Figure 2, but illustrates instead the change in the populations of each camp as predicted by Flee (which match results presented in [8]) to our Markov chain models implemented on the Initial Graph (using same graph as [8]) and the camp adjusted graph (Figure 1c).

Additionally, in Figure 6 we show the total ARD as a function of timestamp for the three models displayed in Figure 5, which gives provides a quantitive assessment of the relative goodness of fit of each model. It is noteworthy that by the end of the simulation, each of the five models has more or less converged to a constant value, and Flee has the highest total ARD. The trends are captured concisely in Table 2 which shows the all time Average ARD , and the Clean Average ARDs and for each model.

Initial Graph Camp-Adjusted Flee
0.41 0.22 0.29
0.38 0.21 0.27
0.37 0.21 0.28
Table 2: All time Average ARD , and Clean Average ARDs and for each model.

The Camp-Adjusted Model and Flee display relatively stable average ARDs of approximately and over time, while the other models take longer to converge to lower average ARD values. In particular, note that in terms of long-term predictive power, which is the main focus of refugee models, the Markov chain: Camp Adjusted model outperforms both Flee and the Markov Chain: Initial Graph. The Camp-Adjusted model has a value of for the is roughly that of Flee, implying a reduction in the long-term error.

Figure 5: Analogous to Figure 2 but here we compare our best model Markov chain: Camp-adjusted to the Flee model of SBG [8] and also to Markov chain: Initial Graph. Note Flee and Markov chain: Initial Graph use the same graph, Figure 1b, whereas the Camp-adjusted model uses the graph of Figure 1c.
Figure 6: Comparison of the total Averaged Relative Difference (ARD) on each timestamp (day) for the agent-based model ‘Flee’ [8] and our Markov chain model applied to the graph of Figure 2b, alongside the Markov chain model applied to the camp-adjusted graph as in Figure 2c.

In [8] to quantify their results the authors use instead the camp-wise ARD for a given day as the main metric of success in modelling the Burundi crisis. Here we have focused instead on the total ARD which we believe is a clearer indicator of overall modelling accuracy, however the camp-wise ARDs for each model are provided in the Appendix to provide a more direct comparison and one can draw similar conclusions on the goodness of fit of each model from looking at the ensemble of the individual camp ARDs.

Thus we conclude that Markov chain models can give a modest improvement on the goodness of fit compared to Flee, moreover, Markov chains yield a significantly more efficient simulation tool than an agent-based approach in terms of runtime and code length. In terms of code length, the components necessary to make complex agent-based models such as Flee operate are spread over multiple folders, dozens of files, and thousands of lines in code [8]. In contradistinction, each of our Markov chain models is contained within a single file and less than 500 lines of code. While program length is no measure of efficiency or suitability, the maintenance costs of code and the likelihood of bugs are directly correlated with program length.

To quantify the complexity of the agent-based Flee, consider a total of refugees and cities in a region of interest, for a model which runs time steps. At each timestamp, Flee iterates through all agents in the system to update their status, because agent-based models treat each agent separately. Furthermore, each agent runs through the list of cities before deciding which city to migrate to, thus each of agents takes time to update at each timestamp, leading to a total time complexity of at least .

Analyzing the overall time complexity of our Markov chain models is slightly harder. If there are cities in the region of interest, then we noted in Section 2 that our graph modification algorithms utilizing Floyd-Warshall run in time, however, these algorithms only run once, at the very beginning of the simulation, to establish the graph. Meanwhile, we noted in Section 3 that we compute the transition probabilities in time. Once the transition probabilities are computed once, they do not need to be recomputed, as we reuse the same stochastic matrix over multiple timestamps, unless a fundamental change is made to the cities in the graph, for instance, a neutral city turning into a conflict site changes the intermediate probabilities and therefore the transition probabilities of the system. The number of such updates to vertices is on the order of the number of vertices , because neutral cities can become conflict sites, but the opposite is much rarer. As a result, computing transition probabilities takes around time. Finally, at each timestamp, the Markov chain only performs a simple multiplication of an matrix with an -dimensional vector, which only takes operations, thus updates to the system comprise a total of time. Thus we estimate the time complexity of the Markov chain models is .

In the case of Burundi, refugees, days, and there are vertices, while , thus Flee runs in operations, while the Markov chain models are operations, because the term dominates the asymptotic. Indeed, this order of magnitude difference in runtimes is apparent when executing these different simulations on a standard desktop or laptop. In general, migration models will typically only involve significant urban centers, so the value of is reasonably small, while can be incredibly large, making agent-based model less feasible than Markov chains implementations.

7 Concluding Remarks

In this work we have implemented Markov chain models of refugee migration in refugee crises based on new heuristics about the connections between refugee movements and local geography. By applying our model to the Burundi refugee crisis and comparing with an existing agent-based model of that same crisis, we concluded that our approach using Markov chains was more efficient, reduced unnecessary complications in agent-based modelling, and exhibited a reduction in long-term prediction errors. The agent-based model Flee was explicitly validated against three separate conflicts (Burundi, Mali, and the Central African Republic) in [8], whereas for brevity we have restricted our analysis here to just the Burundi conflict. However, we highlight in passing that the Markov Chain models developed here similarly provide good fits (comparable to Flee) to these other two conflicts when applied to the data and graphs of Mali and the Central African Republic presented in [8].

It should be noted that collecting data on refugee registrations at camps is a challenging task and it is highly feasible that there could be sizeable discrepancies between the UNHCR data and real camp populations. Moreover, refugees that leave a camp for a new destination, say another camp or perhaps a relatives house in a neutral city, are likely not to see deregistering as a priority and thus the data my not accurately reflect departures from a camp. We have endeavored to account for these potential discrepancies in the data by including a 10% error margin in Figures 2 & 5, however this is a rather blunt heuristic and a better understanding of the intrinsic errors in the data would be highly desirable. Applying our model to higher quality and larger data sets would allow us to better refine the Markov chain models presented here.

As a result of the recent increases in refugee crises, there are many reasons as to why improved refugee models would be useful. Accurate refugee models would allow us to predict the number of refugees who will arrive at a particular area or city, the date at which they will arrive, and the distribution of these refugees across multiple regions, a few days or even weeks in advance, given good information about the causes of the migration. This would be extremely beneficial as accurate modelling could allow refugee and governmental organizations to determine where to best distribute aid resources to maximize impact and efficiency. Predictive models would furthermore allow cities and regions to take the appropriate measures to accommodate a large influx of people seeking shelter, which can typically lead to highly disruptive situations. Indeed, with the onset of global climate change, increasing inequality, and decreasing global stability it is highly likely that the number of displaced peoples will increase significantly in the future [36]. Thus there is a growing need for better simulations and in this work we have advocated for adopting Markov chain based models in order to more accurately and efficiently model these crises.


We thank William Jones, Laura Schaposnik, and especially Derek Groen for invaluable discussions, as well as Tanya Khovanova and Claude Eicher for comments on the draft. JU is grateful to the Simons Center for Geometry and Physics (Program: Geometry and Physics of Hitchin Systems) and New College, Oxford for their hospitality and support. This work was completed as part of the MIT PRIMES Program.

Appendix: ARDs for Individual Refugee Camps

In this appendix we present the ARD for each camp as a function of timestamp. The ARD for a camp gives a good measure for the amount of error present in modelling a single camp as a fraction of total refugee population. Note that Figures 3 & 6, which show the total ARD for each model on each day essentially “sum” of the five graphs in Figure 7.

Figure 7: Averaged Relative Difference (ARD) for each model for each Camp.

Figure 7 shows the ARDs in each of the five refugee camps of each of the five models, the four Markov chain models developed here, and SBG’s Flee model [8]. Observe that in two of the camps, Lusenda and Nakivale, all five models perform quite well, leading to an almost indistinguishable long-term ARD of approximately zero. However, this trend does not hold in the other three camps.

In Nduta, all three models utilizing our graph modification algorithm from Section 2 perform significantly better than than Flee and the Markov Chain Model: Initial graph, which both use the unmodified graph. In Nyarugusu, Flee initially produces relatively accurate results, but the four Markov chain models eventually converge to a more accurate prediction. Finally, in Mahama, it is not immediately clear which models yield more accurate results, although the Graph-Adjusted and Time-Adjusted Models more consistently produce smaller ARDs.

This camp specific ARD is the main tool used in [8] for quantifying the success of their model of the Burundi crisis and thus 7 allows for a direct comparison with the results of [8]. Moreover, an informal inspection of the ensemble of camp ARD graphs indicates similar conclusions as found in the main text, specifically that the Markov chain models provide a modest improvement in the fit to the data compared to Flee and that the Markov Chain: Camp Adjusted model seems the best fit out of all of the models studied. In the main text these conclusions were quantitively supported by an analysis of the total ARD and Averaged ARGs, see Figures 3 & 6 and Tables 1 & 2.


  • [1] United Nations High Commissioner for Refugees [n.d.] Figures At a Glance. Retrieved January 24, 2018, from
  • [2] E. G. Ravenstein, The Laws of Migration, Journal of the Statistical Society of London. Vol. 48, No. 2 (Jun., 1885), pp. 167-235.
  • [3] E. Lee, A theory of migration, Demography 3, 471-486 (1966).
  • [4] F. Willekens, Migration flows: Measurement, analysis and modeling, International Handbook of Migration and Population Distribution, Springer Netherlands (2016) pp. 225-241.
  • [5] D. Groen, Simulating Refugee Movements: Where Would You Go? Procedia Computer Science, Volume 80, 2016, Pages 2251-2255.
  • [6] T. Gulden, J. Harrison, A. Crooks, Modeling Cities and Displacement through an Agent-based Spatial Interaction Model, Computational Social Sciences (2011).
  • [7] A. Klabunde, and F. Willekens, Decision-Making in Agent-Based Models of Migration: State of the Art and Challenges , Eur. J. Population (2016) 32: 73.
  • [8] D. Suleimenova, D. Bell, and D. Groen. A generalized simulation development approach for predicting refugee destinations, Scientific reports 7, no. 1 (2017): 13377.
  • [9] C. Christou, Simulation of Human Migration Based on Swarm Theory, 2010 13th International Conference on Information Fusion, Edinburgh, 2010, pp. 1-8.
  • [10] D. Kniveton, C. Smith and S. Wood, Agent-based model simulations of future changes in migration flows for Burkina Faso, Global Environmental Change 21, 34-40 (2011).
  • [11] R. Johnson, T. Lampe, and S, Seichter, Calibration of an agent-based simulation model depicting a refugee camp scenario, Proceedings of the 2009 Winter Simulation Conference (WSC), 1778-1786 (2009).
  • [12] J. Sokolowski, C. Banks and R. Hayes, Modeling population displacement in the Syrian city of Aleppo, Proceedings of the 2014 Winter Simulation Conference, 252-263 (2014).
  • [13] See for instance, C. Macal and M. North, Tutorial on agent-based modeling and simulation, Simulation conference: 2005 proceedings of the winter IEEE (2005).
  • [14] I. Blumen, M. Kogan, and P. McCarthy, The Industrial Mobility of Labor as a Probability Process, Ithaca: Cornell University Press (1955).
  • [15] M. Salkin, T. Lianos and  Q. Paris, Population Predictions for the Western United States: A Markov Chain Approach, Journal of Regional Science, 15: 53-60 (1975).
  • [16] L. Goodman, Statistical Methods for the Mover-Stayer Model, Journal of the American Statistical Association, vol. 56, no. 296, 1961, pp. 841-868.
  • [17] L. Brown, On the Use of Markov Chains in Movement Research, Economic Geography, vol. 46, 1970, pp. 393-403.
  • [18] M. Hirst, A Markovian Analysis of Inter-Regional Migration in Uganda, Geografiska Annaler. Series B, Human Geography, vol. 58, no. 2, 1976, pp. 79-94.
  • [19] A. Nagurney, Migration equilibrium and variational inequalities, Economics Letters, vol. 31, no. 1, (1989) 109-112.
  • [20] J. Pan, and A. Nagurney Using Markov chains to model human migration in a network equilibrium framework, Mathematical and Computer Modelling, vol. 19, no. 11 (1994) 31.
  • [21] H. Kim and H. Y. Song, Formulating Human Mobility Model in a Form of Continuous Time Markov Chain Procedia Computer Science, vol. 10 (2012) 389-396.
  • [22] P. Gagniuc, Markov Chains: From Theory to Implementation and Experimentation, John Wiley & Sons. (2017) pp. 9-11. ISBN 978-1-119-38755-8.
  • [23] See for instance, D. Isaacson and R. Madsen, Markov Chains Theory and Applications, Wiley, New York (1976).
  • [24] H. Baumann, and W. Sandmann, Structured Modeling and Analysis of Stochastic Epidemics with Immigration and Demographic Effects, PLoS ONE 11(3): e0152144 (2016).
  • [25] L. Muscariello, M. Mellia, M. Meo, M. Ajmone Marsan, and R. Lo Cigno Markov models of internet traffic and a new hierarchical MMPP model, Computer Communications, vol, 28, no. 16, (2005), 1835-1851.
  • [26] L. Page, Method for node ranking in a linked database, U.S. Patent 6285999.
  • [27] See for example, M. Kijima, Stochastic Processes with Applications to Finance, New York: Chapman and Hall/CRC (2002) ISBN:9781420057607.
  • [28] I. Djordjevic Markov Chain-Like Quantum Biological Modeling of Mutations, Aging, and Evolution, Life (2015) 5(3):1518-38.
  • [29] R. Bellman, The theory of dynamic programming, Bull. Amer. Math. Soc., 60 (6): 503-516 (1954).
  • [30] R. Floyd, Algorithm 97: Shortest Path, Commun. ACM. 5 (6): 345 (1962).
  • [31] B. Roy, Transitivite et connexite, C. R. Acad. Sci. Paris. 249: 216-218 (1959).
  • [32] S. Warshall, A theorem on Boolean matrices, Journal of the ACM. 9 (1): 11-12 (1962).
  • [33] Google (n.d.). [Burundi]. Retrieved January 24, 2018, from
  • [34] B. Entwisle, et al, Climate shocks and migration: An agent-based modeling approach, Population and Environment 1- 25 (2016).
  • [35] M. Aatek, S. Rizi, and A. Geller, Verification through calibration: An approach and a case study of a model of conflict in Syria, Proceedings of the 2013 Winter Simulation Conference: Simulation: Making Decisions in a Complex World, 1649-1660 (2013).
  • [36] N. Myers, Environmental refugees: a growing phenomenon of the 21st century, Philos. Trans. Royal Soc. B 357.1420 (2002): 609-613.