Reinforced Epidemic Control: Saving Both Lives and Economy

08/04/2020 ∙ by Sirui Song, et al. ∙ Tsinghua University

Saving lives or saving the economy is a dilemma for epidemic control in most cities, while smart-tracing technology raises privacy concerns. In this paper, we propose a solution to the life-or-economy dilemma that does not require private data. We bypass the private-data requirement by suppressing epidemic transmission through dynamic control of inter-regional mobility that relies only on Origin-Destination (OD) data. We develop the DUal-objective Reinforcement-Learning Epidemic Control Agent (DURLECA) to search for mobility-control policies that simultaneously minimize infection spread and maximally retain mobility. DURLECA employs a novel graph neural network, namely Flow-GNN, to estimate the virus-transmission risk induced by urban mobility. The estimated risk supports a reinforcement learning agent in generating mobility-control actions. The training of DURLECA is guided by a carefully constructed reward function, which captures the natural trade-off between epidemic control and mobility retention. In addition, we design two exploration strategies to improve the agent's search efficiency and help it escape local optima. Extensive experiments on a real-world OD dataset show that DURLECA is able to suppress infections to an extremely low level while retaining 76% of the mobility in the city. Our implementation is available at




1. Introduction

Figure 1. The overview of reinforced epidemic control system.

Epidemics have always threatened human society by confronting us with a dilemma between saving lives and saving the economy. A virus infects people who gather and spreads through daily commutes (Poletto et al., 2012; Balcan et al., 2010; Wesolowski et al., 2012). Controlling the spread of the virus therefore requires cutting daily mobility, which is a pillar of the modern economy. For instance, the recent outbreak of COVID-19 has caused millions of infections and hundreds of thousands of deaths. The epidemic has forced many municipal governments to issue stay-at-home orders, i.e., a Full LockDown (FLD) policy. FLD in most cities lasts for weeks and thus deeply hurts the economy (Barua and others, 2020). Some municipalities try to quarantine only symptomatic people and their close contacts at the early stage of the epidemic. However, this infected-individual-quarantine policy is implementable only when governments can accurately and comprehensively trace at-risk people. It is also unreliable when there are many asymptomatic infected people. Current computer-science work explores using smartphone data to infer and trace high-risk people (Ferretti et al., 2020; Oliver et al., 2015). However, fully tracing individual mobility and contacts requires full smartphone coverage and raises privacy concerns (Cho et al., 2020). According to a poll by the University of Maryland and The Washington Post, around 60% of respondents either prefer not to share their private information or do not own a smartphone (Timberg et al., 2020). In summary, the vast amount of complex individual mobility and the presence of asymptomatic infected people prevent current epidemic-control policies from cutting off virus spread without hurting the economy when private information cannot be fully captured.

In this research, we demonstrate that a smart epidemic-control policy is still available even if private mobility information is unavailable. We develop a dynamic control framework that avoids an epidemic outbreak by limiting the probability that risky mobility occurs. Instead of targeting and limiting the mobility of risky individuals according to private data, our framework estimates each urban region's risk of having a large infected population and uses the estimate to control inter-regional mobility. High-risk inter-regional mobility is limited to reduce the probability that infected people move. Because infected people are a small proportion of the population even in a seriously infected city, only a small share of mobility must be restricted. It is therefore possible to avoid an epidemic outbreak by heterogeneously limiting a small amount of inter-regional mobility. Furthermore, the estimation is based on the regional aggregate demand for mobility and regional epidemic statistics. Thus, private data is not needed.

However, three specific challenges complicate estimating and controlling inter-regional mobility to suppress infection and protect the economy. First, urban mobility is vast and varies over time, making it hard to target the truly risky mobility. Second, practicality requirements complicate the design of the epidemic-control policy: an implementable control policy cannot continuously quarantine the same urban region for too long. Last but not least, the search for a policy is difficult. Due to the exponential growth of epidemics, the number of future infections is a highly non-convex function of each previous decision, making it hard to explore the policy space. Furthermore, the dual objectives often cause policy exploration to get stuck in local optima, which is exacerbated by the non-convexity of infections.

In view of the above challenges, we develop the DUal-objective Reinforcement-Learning Epidemic Control Agent (DURLECA) framework, which combines a Graph Neural Network (GNN) with Reinforcement Learning (RL) to search for an effective mobility-control policy. DURLECA employs a GNN to estimate the virus-transmission risk induced by urban mobility, which is a dynamic flow on a graph. Based on the estimated risk, the RL agent periodically determines the extent of the restriction on each inter-regional mobility flow. The GNN of DURLECA has a novel architecture, namely Flow-GNN, designed to fit the virus-spread process on mobility flows, which existing GNN architectures cannot characterize. We also carefully construct a reward function for the RL agent that captures the natural trade-off between epidemic control and the retention of urban mobility. The reward function also accounts for the difference between continuous and intermittent restrictions on the same region. Furthermore, we develop two RL exploration strategies that incorporate epidemic expert knowledge to guide and stabilize policy exploration.

Supported by a Susceptible-Infected-Hospitalized-Recovered (SIHR) epidemic simulation environment developed from the traditional SIR model (Kermack and McKendrick, 1927), DURLECA successfully finds a mobility-control policy that suppresses the epidemic and retains most of the mobility. Our experiments on a real-world mobility dataset collected in Beijing demonstrate the effectiveness of DURLECA. Even if the city starts to suppress an epidemic whose basic reproduction number $R_0$ is 2.1 only 20 days after discovering the first patient, DURLECA still finds a policy where (Footnote 1: The $R_0$ is … for SARS, … for Influenza, … for COVID-19.):

  • The peak demand for hospitalization is under 1.3‰ of the whole population, and the average demand for hospitalization is kept under 0.4‰. (Footnote 2: The hospital bed density is 2.9‰ in the U.S., 4.2‰ in China, and 13.4‰ in Japan.)

  • 76% of the total mobility is retained. In more than intervened days, two-thirds regions retain over mobility. No region ever experiences a stringent day, i.e., daily retained mobility lower than 20%.

In summary, the contribution of this paper is threefold:

  • We bypass the privacy concern for smart epidemic control. Instead of directly tracing and quarantining risky individuals, we suppress the risk of an epidemic outbreak by estimating and restricting risky inter-regional aggregate mobility.

  • We develop DURLECA to dynamically generate customized control actions for inter-regional mobility, which allows a smart solution for the life-or-economy dilemma of epidemic control.

  • We propose innovative approaches to ensure DURLECA's capability. We design a novel GNN architecture that fits the epidemic transmission dynamics. Our RL reward function captures the trade-off between suppressing the epidemic and retaining mobility, and reflects practical requirements. We also develop two RL exploration strategies that incorporate epidemic expert knowledge to guide and stabilize policy exploration.

2. Preliminary and Problem Formulation

During the stay-at-home orders of COVID-19, governments distribute daily mobility quotas to each household to retain basic economic activities, such as buying food. Building on this quota regulation, we develop a new policy environment. We assume that the government periodically predicts or collects the aggregate demand for inter-regional mobility of every Origin-Destination (OD) pair. The government also collects the number and location of currently discovered patients. This information is used to determine the quota for each inter-regional mobility flow. The quota distribution aims to minimize the risk of an epidemic outbreak in the foreseeable future while satisfying as much mobility demand as possible. In this section, we present the mobility and epidemic models that support quota allocation.

2.1. Mobility Modeling

We model a city's urban-mobility demand at time step $t$ as a mobility matrix $M_t$, whose element $m_{ij}^t$ represents the inter-regional mobility demand, i.e., the number of people who want to move, from region $i$ to region $j$. According to $M_t$ and the epidemic information, the city government determines a mobility quota matrix $A_t$ at $t$, whose element $a_{ij}^t \in [0, 1]$ is the quota rate assigned to the mobility demand from $i$ to $j$. Therefore, the allowed inter-regional mobility, denoted by $\tilde{M}_t$, is calculated as follows:

$$\tilde{M}_t = f(M_t, A_t) = M_t \odot A_t,$$

where $f$ refers to the mobility control function and $\odot$ denotes element-wise multiplication. Note that $M_t$, $A_t$, and $\tilde{M}_t$ are $N \times N$ matrices, where $N$ is the number of regions in the studied city. We summarize the mobility-related notations in the Appendix.
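The element-wise quota rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_quota` and the toy 3-region matrices are assumptions made for the example, not taken from the authors' implementation.

```python
def apply_quota(demand, quota):
    """Element-wise product of a demand matrix and a quota-rate matrix.

    demand[i][j]: number of people who want to move from region i to j.
    quota[i][j]:  quota rate in [0, 1] granted to that flow.
    """
    n = len(demand)
    allowed = []
    for i in range(n):
        row = []
        for j in range(n):
            rate = quota[i][j]
            if not 0.0 <= rate <= 1.0:
                raise ValueError("quota rates must lie in [0, 1]")
            row.append(demand[i][j] * rate)
        allowed.append(row)
    return allowed


# Toy example with three regions (illustrative numbers only).
demand = [[0, 120, 30], [80, 0, 50], [10, 40, 0]]
quota = [[1.0, 0.5, 1.0], [0.25, 1.0, 0.0], [1.0, 1.0, 1.0]]
allowed = apply_quota(demand, quota)

# Share of the demanded mobility that is still allowed.
retained = sum(map(sum, allowed)) / sum(map(sum, demand))
```

Summing the allowed flows and dividing by the total demand gives the share of mobility retained, which is the quantity the policy tries to keep high.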

2.2. Epidemic Modeling

The main challenge for urban epidemic control comes from infected people who are infectious but asymptomatic. Therefore, we develop a new epidemic model to capture the difference between asymptomatic and symptomatic people. Our model is based on the traditional Susceptible-Infected-Recovered (SIR) model in the public-health literature (Kermack and McKendrick, 1927). We introduce a new state beyond SIR and denote it by Hospitalized (H). People in state H are infected with symptoms and thus will be quarantined or hospitalized. They do not participate in urban mobility and do not contribute to new infections. We refer to our model as the SIHR model. (Footnote 3: This modeling is different from the SEIR model (Li and Muldowney, 1995), which assumes that asymptomatic people are not infectious.)

We use our SIHR model to capture the dynamic process of infection spread over urban mobility. We denote the epidemic state of region $i$ at time step $t$ by $E_i^t = [S_i^t, I_i^t, H_i^t, R_i^t]$, whose elements respectively denote the susceptible, infected, hospitalized, and recovered population of $i$ at $t$. We also use a visible state of $i$ at $t$, in which healthy people cannot be differentiated from infectious asymptomatic people, and we track the total population of $i$ at $t$.

The epidemic state is updated in each time step. We separate each time step $t$ into two sub-steps: mobility happens and infection occurs. At the mobility-happening sub-step, people complete their moves between regions. We track the epidemic state of the people who stay in a region separately from that of the new arrivals. The overall epidemic state at the mobility-happening sub-step is calculated as follows:


At the infection-occurring sub-step, people who stay in region $i$ infect each other. Simultaneously, new arrivals at $i$ infect each other. Therefore, the epidemic state is updated as follows:


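where $S$, $I$, $H$, and $R$ are the elements of the epidemic state. $\beta_s$ and $\beta_m$ are the epidemic's transmission rates for the staying people and the moving people, respectively. $\gamma$ is the hospitalization rate and $\theta$ is the recovery rate. We use one set of $(\beta_s, \beta_m, \gamma, \theta)$ for all regions at all time steps for simplicity. We describe how we estimate these parameters in the Appendix.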
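To make the two-group update concrete, here is a minimal Python sketch of one infection sub-step. The update equations themselves are not reproduced in this text, so the sketch assumes a standard frequency-dependent SIR-style infection term, with infected people moving to the hospitalized state and hospitalized people moving to the recovered state. The function names, the parameter names `beta_s`, `beta_m`, `gamma`, `theta`, and all numeric values are illustrative assumptions.

```python
def sihr_step(s, i, h, r, beta, gamma, theta):
    """One discrete update for a well-mixed group (assumed form).

    beta:  transmission rate within the group
    gamma: rate at which infected people become hospitalized
    theta: rate at which hospitalized people recover
    """
    mixing = s + i + r  # hospitalized people are isolated and do not mix
    new_infected = beta * s * i / mixing if mixing > 0 else 0.0
    new_infected = min(new_infected, s)  # cannot infect more than S
    new_hospitalized = gamma * i
    new_recovered = theta * h
    return (
        s - new_infected,
        i + new_infected - new_hospitalized,
        h + new_hospitalized - new_recovered,
        r + new_recovered,
    )


def region_step(staying, arriving, beta_s, beta_m, gamma, theta):
    """Update stayers and new arrivals separately, then merge them."""
    stay_next = sihr_step(*staying, beta_s, gamma, theta)
    arrive_next = sihr_step(*arriving, beta_m, gamma, theta)
    return tuple(a + b for a, b in zip(stay_next, arrive_next))


# Illustrative numbers only: (S, I, H, R) for stayers and for arrivals.
state = region_step(
    staying=(900.0, 10.0, 2.0, 0.0),
    arriving=(80.0, 5.0, 0.0, 3.0),
    beta_s=0.1, beta_m=0.3, gamma=0.05, theta=0.02,
)
```

Under these assumptions the update only moves people between compartments, so the region's total population is conserved within the sub-step.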

2.3. Multi-Objective Sequential Control Problem Formulation

The above mobility and epidemic models allow us to formulate the dynamic inter-regional mobility control problem of minimizing infections while maximizing retained mobility, shown in Equations (10) and (11):


Here the objective function reflects the trade-off between epidemic control and mobility retention, and it should also meet some practical requirements. In the next section, we detail the design of the objective function and use it as the reward function of the RL module of DURLECA. In addition, we account for the fact that government interventions are less frequent than mobility changes. Therefore, mobility and infection are updated every hour, while the government determines mobility quotas every four hours.
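The two time scales (hourly simulation, four-hourly decisions) can be sketched as a simple control loop. The names `run_episode`, `policy`, and `step_env` are hypothetical placeholders for illustration; only the four-hour interval comes from the text.

```python
CONTROL_INTERVAL = 4  # hours between quota decisions, as stated in the text


def run_episode(num_hours, policy, step_env):
    """Advance the environment hourly; refresh the quota every 4 hours."""
    quota = None
    decisions = 0
    for hour in range(num_hours):
        if hour % CONTROL_INTERVAL == 0:
            quota = policy(hour)  # new quota decision
            decisions += 1
        step_env(hour, quota)  # mobility and infection update for this hour
    return decisions


# Dummy policy and environment that only record which quota was in force.
log = []
decisions = run_episode(
    24,
    policy=lambda hour: hour // CONTROL_INTERVAL,
    step_env=lambda hour, quota: log.append((hour, quota)),
)
```

In this toy run, one simulated day contains six decisions, and each decision stays in force for four consecutive hourly updates.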

3. DURLECA

DUal-objective Reinforcement-Learning Epidemic Control Agent (DURLECA) is a GNN-enhanced RL agent that estimates regional infection risk and determines mobility quotas. An overview of DURLECA is shown in Figure 2. At each time step $t$, DURLECA acquires an observation from the environment. According to this observation and the demanded mobility, our RL agent gives a control action for the optimization problem in Equation (11). In the rest of this section, we provide the details of DURLECA.

Figure 2. The details of the proposed DURLECA.

3.1. Reinforcement Learning

We now re-formulate the multi-objective sequential control problem using the basic factors of RL, i.e., state, action, reward, and learning algorithm.

State: We take the visible epidemic state, its first-order temporal derivative, the mobility demand, and the historical mobility loss (defined later) as the state for RL.

Action: The action of RL is the mobility restriction that determines the quota rate for each inter-regional mobility flow at time step $t$. Each element of the action is a real number between 0 and 1.

Reward: The reward function is designed to reflect the objective of the optimization problem in Equation (11). It includes two terms: an infection-spread-cost term and a mobility-restriction-cost term. To guide the RL agent toward an effective and practical mobility-control policy, we design the reward function to satisfy the following three requirements:

  • Reflecting the trade-off relation between infection control and mobility retaining.

  • Capturing the exponential growth of the social cost caused by infection spread. The social cost is low when the infected population is small. However, the social cost will skyrocket once the infection population exceeds the capacity of the city’s healthcare system.

  • Penalizing continuous mobility restrictions in the same region. People’s tolerance for mobility restrictions is limited. Thus, the reward function has to include a growing penalty for continuously restricting the same region.

According to the above three requirements, we separately design the infection-spread-cost term and the mobility-restriction-cost term. We model the infection-spread cost as follows:


It has two hyper-parameters: one determines the start-up social cost of a city having its first patient, and the other determines how the social cost increases with the number of patients. The mobility-restriction-cost term is defined below:


Here, the historical mobility loss is the amount of mobility restricted in the past; it induces an exponentially growing penalty on the current restriction. One hyper-parameter determines the discount rate of the impact of historical restrictions. Another determines how large the penalty is for continuously limiting the same region. Finally, we develop the RL reward function as follows:


Note that our design enables the reward function to reflect that the infection-spread-cost booms once the whole city’s hospitalized population exceeds the city’s healthcare system capacity while the mobility-restriction-cost skyrockets if any single region is continuously restricted for multiple periods.
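The qualitative behavior of the two cost terms can be illustrated with a small Python sketch. The exact functional forms and hyper-parameter names are not recoverable from this text, so everything below is an assumption chosen only to match the description: an infection cost that grows exponentially with the hospitalized population, and a restriction cost that is amplified exponentially by a discounted history of past restrictions. All names (`start_cost`, `growth`, `penalty`, `discount`) and values are hypothetical.

```python
import math


def infection_cost(hospitalized_total, start_cost, growth):
    """Assumed form: cost grows exponentially with hospitalized people."""
    return start_cost * math.exp(growth * hospitalized_total)


def restriction_cost(restricted, hist_loss, penalty):
    """Assumed form: each region's current restriction is amplified
    exponentially by that region's historical mobility loss."""
    return sum(x * math.exp(penalty * l) for x, l in zip(restricted, hist_loss))


def update_hist_loss(hist_loss, restricted, discount):
    """Discount old restrictions and add the current ones."""
    return [discount * l + x for l, x in zip(hist_loss, restricted)]


def reward(hospitalized_total, restricted, hist_loss,
           start_cost=1.0, growth=0.5, penalty=1.0):
    """Negative sum of the two costs."""
    return -(infection_cost(hospitalized_total, start_cost, growth)
             + restriction_cost(restricted, hist_loss, penalty))


def total_restriction_cost(pattern, discount=0.9, penalty=1.0):
    """Accumulated restriction cost of one region over a restriction pattern."""
    hist, total = [0.0], 0.0
    for x in pattern:
        total += restriction_cost([x], hist, penalty)
        hist = update_hist_loss(hist, [x], discount)
    return total


# Same total amount of restriction, applied back-to-back or with a break.
continuous = total_restriction_cost([1, 1, 0, 0])
intermittent = total_restriction_cost([1, 0, 1, 0])
```

With these assumed forms, two consecutive restricted periods cost more than the same two periods separated by a break, which is the behavior the third requirement asks for.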

Learning Algorithm: DURLECA employs a Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) agent to search for a mobility-control policy because the action space is continuous. The DDPG agent is composed of a critic network and an actor network. The critic network estimates the expected reward gained by a control action. The actor network searches for the best action, which gives quota rates for all inter-regional mobility flows, by maximizing the critic network's output. We use Parameter Noise (Plappert et al., 2017) to improve exploration during RL training.

3.2. Flow-GNN

Both the critic and the actor network have to capture the graph nature of urban mobility, where regions are nodes connected by OD flows. Therefore, we build both of them on a GNN. We design a novel GNN architecture that can characterize the epidemic transmission process driven by regional infection aggregation over inter-regional mobility. We refer to our proposed GNN as Flow-GNN, which is developed on the basis of GraphSAGE (Hamilton et al., 2017).

In particular, we design Flow-GNN to fit low-frequency mobility control combined with high-frequency mobility dynamics. Because we determine mobility quotas every four hours, we include 4 Flow-GNN layers in our network and feed them edge information in chronological order, one layer per hour. Each layer outputs a feature for every region, calculated according to the following equations:


Here $\Vert$ denotes concatenation, $\sigma$ is a non-linear activation function, and the weight matrices are trainable parameters. Specifically, we input the first layer with .

The above equations correspond to our modeling of epidemic transmission in Section 2.2, where we separate each time step into a mobility-happening sub-step and an infection-occurring sub-step. Equation (16) describes the epidemic feature of the staying population at the mobility-happening sub-step, while Equation (17) represents the epidemic feature of the new-arrival population in the same sub-step. Equation (18) characterizes the epidemic transmission in the staying population and the new-arrival population.
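The structure described above (a staying feature, an arrival feature aggregated along flows, then concatenation, a linear map, and a non-linearity, repeated for four hourly flows) can be sketched in plain Python. Equations (16) to (18) are not reproduced in this text, so the specific weighting by population shares, the ReLU activation, and every name below are assumptions for illustration rather than the authors' layer definition.

```python
def flow_layer(features, flow, population, weight, bias):
    """One assumed flow-based layer.

    features[i]: feature vector of region i
    flow[i][j]:  people moving from i to j in this hour (flow[i][i] == 0)
    population[i]: people in region i before moving
    weight: rows of length 2 * d (acts on the concatenated vector)
    """
    n, d = len(features), len(features[0])
    out = []
    for j in range(n):
        stay_share = 1.0 - sum(flow[j]) / population[j]
        staying = [stay_share * x for x in features[j]]
        arriving = [
            sum(flow[i][j] / population[i] * features[i][c] for i in range(n))
            for c in range(d)
        ]
        combined = staying + arriving  # concatenation of the two groups
        out.append([
            max(0.0, sum(w * x for w, x in zip(row, combined)) + b)  # ReLU
            for row, b in zip(weight, bias)
        ])
    return out


def move_population(population, flow):
    """Population of each region after the hour's moves."""
    n = len(population)
    return [
        population[j] - sum(flow[j]) + sum(flow[i][j] for i in range(n))
        for j in range(n)
    ]


def flow_gnn(features, hourly_flows, population, weights, biases):
    """Stack one layer per hourly flow, in chronological order."""
    for flow, weight, bias in zip(hourly_flows, weights, biases):
        features = flow_layer(features, flow, population, weight, bias)
        population = move_population(population, flow)
    return features


# Toy run: one feature per region (e.g. infected count), four identical hours,
# and weights that simply add the staying and arriving parts.
flow = [[0, 10, 5], [20, 0, 0], [5, 5, 0]]
out = flow_gnn(
    features=[[10.0], [0.0], [5.0]],
    hourly_flows=[flow] * 4,
    population=[100.0, 200.0, 50.0],
    weights=[[[1.0, 1.0]]] * 4,
    biases=[[0.0]] * 4,
)
```

With the hand-picked weights above, each layer just redistributes the feature along the flows, so the total is preserved across regions; trained weights would instead learn how staying and arriving groups contribute to risk.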

3.3. Exploration Strategies

The exponentially-increasing nature of epidemics and our dual objectives cause difficulties for RL exploration and increase the agent’s risk of falling into local optimums. We design two RL exploration strategies to address this problem. The first strategy is to incorporate pseudo-expert knowledge to improve RL searching efficiency. The second is to protect the agent from falling into local optimums by stopping it from exploring apparently unreasonable policies.

Generating-and-Incorporating Pseudo Expert: We can generate simple but dynamic policies from current epidemic-management experience, which can be a good starting point for RL exploration. For instance, most cities currently restrict the mobility of regions with a large symptomatic population, while a region urgently needs to reopen if it has been continuously locked down for a long time. Thus, we design a pseudo expert, which controls mobility as follows:


The expert locks a region down when two conditions hold: 1) the number of hospitalized (symptomatic) patients in this region exceeds a hospitalization threshold; 2) this region has not been restricted very much in the past, i.e., its historical mobility loss does not exceed a lockdown threshold. During testing, this expert is also used as a baseline for comparison.

We let the agent first explore with the expert's guidance and then gradually learn to explore by itself to outperform the expert. The idea is inspired by the approach adopted to develop AlphaGo (Silver et al., 2016). Specifically, we set an adaptive probability for the agent to directly choose the expert action instead of taking an action by itself during training. This design enables the agent to compare the pseudo-expert strategy with its own, which keeps the agent from moving in inefficient directions at the initial stage of training. The adaptive probability decreases with the number of training steps, which enables the agent to explore broadly and outperform the expert at the later stage of training.
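The expert rule and the decaying expert probability can be sketched as follows. Equation (19) is not reproduced in this text, so the per-region 0/1 quota and the linear decay schedule are assumptions; the text only states that the probability decreases with training steps. All names and numbers are hypothetical.

```python
import random


def expert_quota(hospitalized, hist_loss, h_threshold, l_threshold):
    """Lock a region (quota 0) if it has many hospitalized patients and
    has not yet been restricted much; otherwise leave it open (quota 1)."""
    return [
        0.0 if h > h_threshold and l <= l_threshold else 1.0
        for h, l in zip(hospitalized, hist_loss)
    ]


def expert_probability(step, start=1.0, decay_steps=10_000):
    """Assumed linear decay of the chance to follow the expert."""
    return max(0.0, start * (1.0 - step / decay_steps))


def choose_action(step, agent_action, expert_action, rng):
    """Follow the expert with a probability that shrinks over training."""
    if rng.random() < expert_probability(step):
        return expert_action
    return agent_action


rng = random.Random(0)
quotas = expert_quota([5, 0, 5], [0, 0, 200], h_threshold=3, l_threshold=168)
early = choose_action(0, "agent", "expert", rng)
late = choose_action(20_000, "agent", "expert", rng)
```

In this sketch the third region stays open even though it has patients, because its historical mobility loss already exceeds the lockdown threshold.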

Avoiding Extreme Points: Wide exploration might lead the agent into extreme points. Training might become unstable due to a sudden large loss caused by a poor control action. Meanwhile, the strong incentive to avoid the large loss can push the agent into locally optimal control policies, such as a permanent full lockdown. To avoid such extreme points, we set two rules:

  • The infection threshold: If the agent explores into a state where the regional mean number of infected people exceeds this threshold, the episode ends and the agent receives a large penalty.

  • The lockdown threshold: If the agent explores into a state where there exists a region that exceeds this threshold, the episode ends and the agent receives a large penalty.

The two rules are straightforward but effective to help the RL agent avoid potential local optimums.
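A minimal sketch of the two termination rules is given below. It assumes that the lockdown rule compares each region's historical mobility loss with the lockdown threshold; the function name, the penalty value, and the example numbers are hypothetical.

```python
def check_termination(infected, hist_loss, infection_threshold,
                      lockdown_threshold, penalty):
    """Return (done, extra_reward) for the current state.

    infected[i]:  infected people in region i
    hist_loss[i]: historical mobility loss of region i (assumed quantity)
    """
    mean_infected = sum(infected) / len(infected)
    if mean_infected > infection_threshold:
        return True, -penalty  # outbreak: stop exploring this trajectory
    if max(hist_loss) > lockdown_threshold:
        return True, -penalty  # some region has been restricted too long
    return False, 0.0


ok = check_termination([1.0, 2.0], [10.0, 20.0], 5.0, 72.0, penalty=100.0)
outbreak = check_termination([9.0, 8.0], [10.0, 20.0], 5.0, 72.0, penalty=100.0)
overlocked = check_termination([1.0, 2.0], [10.0, 90.0], 5.0, 72.0, penalty=100.0)
```

Ending the episode early keeps very large losses out of the replay data, which is one plausible reason the rules stabilize training.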

4. Experimental Evaluations

In this section, we conduct extensive experiments to answer the following research questions:

RQ1: Can DURLECA resolve the life-or-economy dilemma?

RQ2: Can DURLECA adapt to both early intervention and late intervention?

RQ3: Can DURLECA be generalized to different cities and different diseases?

Besides, we conduct ablation studies in Appendix to evaluate the effectiveness of our proposed Flow-GNN and RL exploration strategies.

4.1. Dataset

We use a real-world OD dataset collected by a mobile operator in Beijing to evaluate DURLECA. The dataset divides Beijing into 17×19 regions and covers 544,623 residents. On average, each region has about 1,686 observed residents. The dataset covers 24-hour OD flows for the whole month of January 2019. We repeat the one-month data 24 times to obtain a prolonged dataset of 24 months, so that we have a sufficiently long period for studying epidemic control. We list other details in the Appendix.
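Repeating one month of flows 24 times does not require copying the data; indexing modulo the month length gives the same result. The sketch below assumes hourly snapshots and uses placeholder strings instead of real OD matrices; the names `flow_at` and `month_flows` are hypothetical.

```python
HOURS_PER_MONTH = 31 * 24  # January has 31 days; hourly snapshots assumed


def flow_at(month_flows, hour):
    """Return the snapshot for any hour of the prolonged dataset."""
    return month_flows[hour % len(month_flows)]


# Placeholders standing in for the hourly OD matrices of one month.
month_flows = [f"od-matrix-{h}" for h in range(HOURS_PER_MONTH)]
total_hours = 24 * HOURS_PER_MONTH  # 24 repetitions of the month
```

The first hour of the second repetition maps back to the first snapshot of January, so the simulated period can be as long as needed without storing duplicates.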

4.2. Metrics and Settings

We design six metrics to evaluate the performance of DURLECA on resolving the life-or-economy dilemma. We introduce the metrics in the following and summarize them in Table 1.

Metric               Value     Physical Meaning
Mean/Max H           0-1686    Temporal mean/max of the hospitalized population H
Total infected       0-1686    Total number of infected people after the epidemic
Retained mobility    0-1       Total quota rate
City duration        0-744     The city 20%-mobility duration
Region duration      0-744     The region 20%-mobility duration
Table 1. The summary of metrics and related value ranges.

We select three metrics to assess the epidemic-suppressing performance of an epidemic control policy: the total number of infected people at the end of the epidemic period, the mean number of hospitalized people (the mean of $H$ over time), and the peak demand for hospitalization capacity (the maximum of $H$ over time). The total number of infected people determines the total social medical cost, while the mean and maximum of $H$ reflect the sustained and peak pressure on the healthcare system. We also select three metrics to assess the mobility-retaining performance of an epidemic control policy: the total ratio of retained mobility, the duration of stringent mobility restrictions on the whole city, and the duration of stringent mobility restrictions on the most restricted region.
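The metrics above are simple aggregates over a simulated trajectory, as the following sketch shows. It assumes that "stringent" means a daily retained-mobility ratio below 20%, as stated in the introduction; the function name, argument layout, and toy numbers are assumptions for illustration.

```python
def summarize(hospitalized, allowed, demand,
              city_daily_ratio, region_daily_ratio, threshold=0.2):
    """Aggregate one simulated trajectory into the evaluation metrics.

    hospitalized:       hospitalized population at each time step
    allowed, demand:    allowed and demanded mobility per time step
    city_daily_ratio:   daily retained-mobility ratio of the whole city
    region_daily_ratio: one list of daily ratios per region
    """
    return {
        "mean_h": sum(hospitalized) / len(hospitalized),
        "max_h": max(hospitalized),
        "quota_rate": sum(allowed) / sum(demand),
        "city_days": sum(r < threshold for r in city_daily_ratio),
        # Duration for the most restricted region.
        "region_days": max(
            sum(r < threshold for r in days) for days in region_daily_ratio
        ),
    }


stats = summarize(
    hospitalized=[0.0, 1.0, 2.0, 1.0],
    allowed=[80.0, 60.0],
    demand=[100.0, 100.0],
    city_daily_ratio=[0.8, 0.6],
    region_daily_ratio=[[0.9, 0.1], [0.7, 0.5]],
)
```

Note that the region duration can exceed the city duration, because a single region may be heavily restricted on a day when the city as a whole is not.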

Epidemic Settings: Without the loss of generality, we set in most of our experiments. The estimated basic reproduction number is 2.1.

Intervention time: We define the intervention time as the time when policymakers discover the epidemic and start to intervene. In our experiments, we compare results with intervention times of 0, 10, and 20 days.

For more details about our experiment settings, please refer to Appendix. Our implementation is available online at

4.3. Performance Comparison

Baselines: We set four different expert baselines to simulate different real-world expert policies and compare them with DURLECA on resolving the life-or-economy dilemma.

  • EP-Fixed: In the real world, a simple but inflexible control is to restrict all mobility in the city. For simulation, we design EP-Fixed to give a fixed quota rate to all inter-regional mobility during the whole epidemic period. We set the quota rate to 15% and 20% in our experiments, as we find these values to be at the boundary of successfully controlling the epidemic.

  • EP-Soft: We design an expert baseline following Equation (19), which softly depends on the historical mobility loss and the current hospitalized population to determine whether to lock down a region. We set in our experiments. guarantees the expert receive equivalent information compared with DURLECA. corresponds to the real-world control policy in some countries: a continuous 7-day (168-hour) lockdown.

  • EP-Hard: Without softly depending on the historical mobility loss, an expert can reopen a region if it has been locked down for successive days. This expert, namely EP-Hard, gives daily quota as follows:


    We set for a similar reason of EP-Soft.

  • EP-Lockdown: The most robust and conservative policy is to lock down the whole city until the epidemic ends. To simulate it, we design an expert following Equation (19) but without a limit on the historical mobility loss. It can lock down a region for an arbitrarily long time, until the hospitalized population becomes zero.

Figure 3. The simulation visualization of DURLECA and selected baselines when . Main Figure: The number of hospitalized people along with time. Upper Right: The histogram for . Best viewed in color.
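Policy            Mean/Max H      Total infected   Total quota rate   City 20%-mobility duration   Region 20%-mobility duration
No Intervention   27.21/157.55    1069.29          1                  0                            0
EP-Fixed (20%)    4.03/17.96      877.32           0.20               724                          724
EP-Fixed (15%)    0.44/1.55       10.18            0.15               724                          724
EP-Soft           4.66/53.80      1040.68          0.57               9                            19
EP-Hard           0.45/1.78       8.31             0.13               36                           36
EP-Lockdown       0.41/1.42       5.75             0                  27                           27
DURLECA           0.60/2.28       19.07            0.76               0                            0
Table 2. The simulation results of DURLECA and all baselines when the intervention starts 20 days after the first patient is discovered.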
Figure 4. The spatial distribution and histogram of the quota rates given by DURLECA. We select four periods from a "high-risk day" and a "low-risk day". Each grid cell in one of the four 17×19 maps represents a region in Beijing. Each histogram summarizes the distribution of quota rates in the respective period.

Results and Analysis: We compare DURLECA with all baselines for an intervention that starts on day 20 in Table 2. We also visualize three selected results in Figure 3. The expert baselines can achieve only one goal in the life-or-economy dilemma, while DURLECA can achieve both.

EP-Soft can retain 57% of the total mobility. However, it leads to an epidemic outbreak, reflected by the very high mean/max hospitalized and total infected values; the healthcare system would break down. EP-Fixed (15%), EP-Hard, and EP-Lockdown keep the mean/max hospitalized population at a low level so that the healthcare system is not overwhelmed. However, their low total quota rates indicate that all of them fail to retain mobility. Their large city and region 20%-mobility durations show that some regions and the city as a whole have to experience long-term lockdown, which is unacceptable damage to the economy. Besides, the differences between EP-Fixed (15%) and EP-Fixed (20%) in mean/max hospitalized and total infected indicate that expert control is very vulnerable to mobility perturbation. These results show that all of the expert policies fail to resolve the life-or-economy dilemma of epidemic control.

Compared with those baselines, DURLECA simultaneously suppresses the epidemic and retains a large amount of mobility. DURLECA achieves low mean/max hospitalized values, which ensures that the demand for hospitalization will not exceed the capacity of most countries. DURLECA also keeps the total infected population at a low level, about 1% of the total population. The red curve in Figure 3 presents the performance of DURLECA in epidemic suppression.

DURLECA also retains the most mobility. 76% of the total mobility during the intervened period is retained. Furthermore, no regions will be fully locked down. DURLECA retains 70-100% mobility for most regions in the city. The economic loss due to epidemic control can be significantly reduced. In all, DURLECA successfully resolves the life-or-economy dilemma.

DURLECA's control is highly customized and dynamic, and would be hard for human experts to mimic. In Figure 4, we visualize the spatial distribution of quota rates and the associated histogram in four selected periods. Figure 4 shows DURLECA's smartness in distributing quotas according to both epidemic risks and mobility patterns. The agent tends to give more quotas in a low-risk and low-mobility period and fewer quotas in either a high-risk or a high-mobility period.

4.4. Comparison of the Scenarios of Early/Late Intervention

To examine whether DURLECA is still effective if the government's intervention is later than the discovery of the first patient, we compare DURLECA's performance in three scenarios. We have discussed the very-late-intervention scenario, where the government starts to act 20 days after discovering the first patient, in Section 4.3. Here, we compare the early-intervention scenario (day 0) and the late-intervention scenario (day 10). The results are shown in Table 3.

We find that: 1) EP-Soft can control the epidemic in the early-intervention scenario. Because the virus has not spread widely, restricting the few infected areas is enough to suppress the epidemic. However, it fails to avoid an epidemic outbreak in the late-intervention scenario. 2) EP-Hard and EP-Lockdown can control the epidemic in both scenarios. However, they lock all risky regions down and cut off most mobility. 3) DURLECA successfully suppresses the epidemic while retaining the majority of urban mobility in both scenarios.

Policy         Intervention day   Mean/Max H    Total infected   Total quota rate
EP-Soft        0                  0.02/0.03     0.08             0.18
EP-Hard        0                  0.03/0.04     0.10             0.13
EP-Lockdown    0                  0.01/0.03     0.09             0
DURLECA        0                  0.03/0.04     0.27             0.73
EP-Soft        10                 4.67/62.10    1041.64          0.57
EP-Hard        10                 0.08/0.14     0.64             0.11
EP-Lockdown    10                 0.08/0.14     0.55             0
DURLECA        10                 0.07/0.16     1.31             0.74
Table 3. The simulation results of DURLECA and three baselines when the intervention starts on day 0 and on day 10.

4.5. Generalization Ability

We examine the generalization ability of DURLECA under different urban settings and diseases. Cities have different capacities for hospitalization treatment. Heterogeneous economic structures also give cities different tolerance for mobility restrictions. We vary two hyper-parameters that represent these urban features and examine DURLECA's performance. The results are shown in Table 4. They demonstrate that DURLECA can find different policies in response to changes in urban settings. For instance, we find that a higher value of the first varied hyper-parameter leads to more mobility and more hospitalizations, which suggests that cities with higher hospitalization capacities can take more patients and retain more mobility. We also examine DURLECA's adaptiveness to various diseases. We vary the epidemic parameters to simulate different diseases with different $R_0$. We find that DURLECA is also able to adjust its epidemic-control policy to different diseases (Table 5). For instance, DURLECA applies loose mobility restrictions to low-$R_0$ diseases but stringent mobility restrictions to high-$R_0$ ones. DURLECA's adaptiveness to changes in urban and disease settings demonstrates both its generalization ability and its smartness.

Hyper-param. 1   Hyper-param. 2   Mean/Max H   Total infected   Total quota rate
1                72               0.54/1.74    10.02            0.38
3                72               0.60/2.28    19.07            0.76
10               72               2.79/6.34    223.29           0.90
3                48               1.69/4.60    153.54           0.88
3                72               0.60/2.28    19.07            0.76
3                168              0.45/1.58    16.69            0.71
Table 4. The simulation results of DURLECA with different urban hyper-parameter settings.
R_0    Mean/Max H   Total infected   Total quota rate
1.4    0.35/1.20    10.96            0.86
2.1    0.60/2.28    19.07            0.76
3.5    0.81/4.08    23.18            0.46
Table 5. The simulation results of DURLECA under epidemics with different R_0.

5. Related Work

Epidemic Modeling: The SIR model is a widely used mathematical model in epidemiology, which divides the population into three states: susceptible, infected, and recovered (Kermack and McKendrick, 1927). Based on the SIR model, Ogren and Martin used an embedded Newton algorithm to help find an optimal control strategy (Ogren and Martin, 2000). The distributed delay and discrete delay of SIR have also been studied (McCluskey, 2010). To model a more practical epidemic scenario, the SEIR model added an Exposed state to deal with the incubation period (Li and Muldowney, 1995). Others extended the differential equations to consider the consequences of vaccination for a measles epidemic (Allen, ). Later works also tried to incorporate human spatial patterns into the epidemic model. Sattenspiel et al. presented how contacts occur between individuals from different regions and how they influence epidemic spread (Sattenspiel et al., 1995). Balcan et al. presented the GLobal Epidemic and Mobility model, which integrates sociodemographic and population mobility data in a spatially structured stochastic approach (Balcan et al., 2010). Different from previous works, we distinguish visible and invisible infections and model epidemic transmission on traffic flows, in order to support the exploration of mobility-control policies for epidemic control.

Graph Neural Network for OD-flows: The problem of estimating, predicting and controlling human flows between regions has been addressed using neural networks since (Lorenzo and Matteo, 2013). Because most OD flows are modeled as graphs, the Graph Neural Network (GNN), first proposed in (Scarselli et al., 2009), is of great importance for this problem. Later, GNNs were used to predict future mobility flows (Chai et al., 2018; Geng et al., 2019; Wang et al., 2019). Besides, (Derr et al., 2020) borrowed knowledge from epidemic models to design a GNN for node prediction in documents. However, existing GNN architectures lack the ability to model the virus-spreading flow. Our Flow-GNN allows our model to characterize the virus-spreading flow and thus underpins DURLECA’s capability.

Deep Reinforcement Learning: Deep Reinforcement Learning (DRL) has proven effective for control problems that have a large action space (Mnih et al., 2015; Lillicrap et al., 2015; Van Hasselt et al., 2016). DQN (Mnih et al., 2015) and DDPG (Lillicrap et al., 2015) are two representative DRL algorithms, proposed for discrete control problems and continuous control problems, respectively. To enable the agent to find an optimal solution, later works proposed to enhance exploration (Plappert et al., 2017; Fortunato et al., 2017; Henderson et al., 2018). Imitation learning is another area of RL, where the goal is to enable the agent to behave like a human expert (Hussein et al., 2017). AlphaGo proposed to start from imitation but explore further to outperform experts (Silver et al., 2016). In (Wijayanto and Murata, 2019), DQN was also used for node protection against epidemics under a single objective. Compared with that work, both our control action and our objectives are more complex and practical, and thus our RL training is more challenging. We design two strategies to address the exploration challenge.

6. Future research and Conclusion

Several problems call for future study on smart and privacy-protecting epidemic control, for which DURLECA can serve as the framework. In this research, we do not consider the uncertainty of mobility and epidemic information when DURLECA explores epidemic-control policies. Future work should explore algorithms for finding robust policies under uncertain information, since a practical policy has to remain robust even when the input data contain errors.

To conclude, this research demonstrates several important facts that broaden society’s view of epidemic control:

  • Private data is dispensable because restricting the aggregated inter-regional mobility sufficiently lowers the probability of infectious people’s movement and thus suppresses the risk of epidemic transmission.

  • Resolving the life-or-economy dilemma of epidemic control must allow dynamic and customized regional policies.

  • The effectiveness of our GNN-enhanced RL in epidemic control shows that field knowledge is critical for AI-system architecture and valuable for neural network training.

In all, smart governance empowered by AI will protect future society from the loss of lives due to epidemics and the economic risk caused by epidemic control.


This work was supported in part by The National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, 61861136003, Beijing Natural Science Foundation under L182038, Beijing National Research Center for Information Science and Technology under 20031887521, research fund of Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology, and The Information Core Technology Center at Institute for Interdisciplinary.


  • L. J. Allen (2007) An introduction to mathematical biology. Cited by: §5.
  • D. Balcan, B. Gonçalves, H. Hu, J. J. Ramasco, V. Colizza, and A. Vespignani (2010) Modeling the spatial spread of infectious diseases: the global epidemic and mobility computational model. Journal of computational science 1 (3), pp. 132–145. Cited by: §1, §5.
  • S. Barua et al. (2020) Understanding coronanomics: the economic implications of the coronavirus (covid-19) pandemic. SSRN Electronic Journal https://doi org/10/ggq92n. Cited by: §1.
  • D. Chai, L. Wang, and Q. Yang (2018) Bike flow prediction with multi-graph convolutional networks. pp. 397–400. Cited by: §5.
  • H. Cho, D. Ippolito, and Y. W. Yu (2020) Contact tracing mobile apps for covid-19: privacy considerations and related trade-offs. arXiv preprint arXiv:2003.11511. Cited by: §1.
  • T. Derr, Y. Ma, W. Fan, X. Liu, C. Aggarwal, and J. Tang (2020) Epidemic graph convolutional network. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 160–168. Cited by: §5.
  • L. Ferretti, C. Wymant, M. Kendall, L. Zhao, A. Nurtay, L. Abeler-Dörner, M. Parker, D. Bonsall, and C. Fraser (2020) Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing. Science 368 (6491). Cited by: §1.
  • M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, et al. (2017) Noisy networks for exploration. arXiv preprint arXiv:1706.10295. Cited by: §5.
  • X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu (2019) Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3656–3663. Cited by: §5.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: Appendix C, §3.2.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
  • A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne (2017) Imitation learning: a survey of learning methods. ACM Comput. Surv. 50 (2). ISSN 0360-0300. Cited by: §5.
  • J. H. Jones Notes on R0. Cited by: §B.3.
  • W. O. Kermack and A. G. McKendrick (1927) A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london. Series A, Containing papers of a mathematical and physical character 115 (772), pp. 700–721. Cited by: §1, §2.2, §5.
  • M. Y. Li and J. S. Muldowney (1995) Global stability for the seir model in epidemiology. Mathematical biosciences 125 (2), pp. 155–164. Cited by: §5, footnote 3.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §3.1, §5.
  • M. Lorenzo and M. Matteo (2013) OD matrices network estimation from link counts by neural networks. Journal of Transportation Systems Engineering and Information Technology 13 (4), pp. 84–92. Cited by: §5.
  • C. C. McCluskey (2010) Complete global stability for an sir epidemic model with delay—distributed or discrete. Nonlinear Analysis: Real World Applications 11 (1), pp. 55–59. Cited by: §5.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §5.
  • P. Ogren and C. Martin (2000) Optimal vaccination strategies for the control of epidemics in highly mobile populations. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), Vol. 2, pp. 1782–1787. Cited by: §5.
  • N. Oliver, A. Matic, and E. Frias-Martinez (2015) Mobile network data for public health: opportunities and challenges. Frontiers in public health 3, pp. 189. Cited by: §1.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2017) Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: §3.1, §5.
  • M. Plappert (2016) Keras-rl. GitHub. Cited by: §B.2.
  • C. Poletto, M. Tizzoni, and V. Colizza (2012) Heterogeneous length of stay of hosts’ movements and spatial epidemic spread. Scientific reports 2, pp. 476. Cited by: §1.
  • L. Sattenspiel, K. Dietz, et al. (1995) A structured epidemic model incorporating geographic mobility among regions. Mathematical biosciences 128 (1), pp. 71–92. Cited by: §5.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §5.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–489. Cited by: §3.3, §5.
  • C. Timberg, D. Harwell, and A. Safarpour (2020) Most americans are not willing or able to use an app tracking coronavirus infections. that’s a problem for big tech’s plan to slow the pandemic. Washington Post. Retrieved from https://www.washingtonpost.com/technology/2020/04/29/most-americans-are-not-willing-or-able-use-an-app-tracking-coronavirus-infections-thats-problem-big-techs-plan-slow-pandemic. Cited by: §1.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §5.
  • Y. Wang, H. Yin, H. Chen, T. Wo, J. Xu, and K. Zheng (2019) Origin-destination matrix prediction via graph convolution: a new perspective of passenger demand modeling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1227–1235. Cited by: §5.
  • A. Wesolowski, N. Eagle, A. J. Tatem, D. L. Smith, A. M. Noor, R. W. Snow, and C. O. Buckee (2012) Quantifying the impact of human mobility on malaria. Science 338 (6104), pp. 267–270. Cited by: §1.
  • A. W. Wijayanto and T. Murata (2019) Effective and scalable methods for graph protection strategies against epidemics on dynamic networks. Applied Network Science 4 (1), pp. 18. Cited by: §5.

Appendix A Notation Summary

Term/Notation Definition
superscript At time step .
subscript The original demand without restrictions.
subscript With restriction .
subscript , Region index
The mobility. A matrix.
The OD flow from to . A scalar.
. The out-flow from . A scalar.
. The mean out-flow from . A scalar.
. The region quota rate. A scalar.
. The city quota rate. A scalar.
. The total quota rate. A scalar.
Table 6. The summary of mobility-related notations. Subscript can be either or .

Appendix B Experiment Settings and Reproducibility

B.1. Dataset

We list the dataset details in Table 7, where counts the mean probability for an individual to move in one hour.

City Regions Mean Population Duration
Beijing 1686 0.18 744 Days
Table 7. The summary of the prolonged dataset.

Privacy and ethical concerns: We have taken the following procedures to address privacy and ethical concerns. First, all of the researchers have been authorized by the data provider to utilize the data for research purposes only. Second, the data is completely anonymized. Third, we store all the data in a secured off-line server.

B.2. Implementation Details

Without loss of generality, we set the moving transmission rate , the staying transmission rate , the hospitalization rate and the recovery rate . Without intervention, the estimated basic reproduction rate is 2.1 at the initial stage of the epidemic. For the reward, we mainly set . For the pseudo expert, we set . For the extreme point threshold, we set .

During training, we randomly initialize an epidemic state at the start of each episode. We train DURLECA for 400,000 steps, using the Adam optimizer with a learning rate of 0.0001. During testing, we fix one epidemic-initialization setting and compare different baselines. Considering the randomness of training, we train DURLECA 5 times with different random seeds for each set of configurations, and report the run that achieves the best episode reward.
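The best-of-five-seeds reporting procedure described above can be sketched as follows. This is only an illustration of the selection step: `train_once` is a hypothetical placeholder that returns a seeded random number instead of actually training an agent, and the names `select_best_run`, `steps` and `learning_rate` are assumptions rather than identifiers from the released implementation.

```python
import random


def train_once(seed, steps=400_000, learning_rate=1e-4):
    """Hypothetical placeholder for one full training run.

    A real run would train the agent for `steps` steps with Adam at
    `learning_rate`; here a seeded random number stands in for the
    best episode reward of that run.
    """
    rng = random.Random(seed)
    return rng.random()


def select_best_run(seeds):
    """Train once per seed and return (best_seed, rewards_by_seed)."""
    rewards = {seed: train_once(seed) for seed in seeds}
    best_seed = max(rewards, key=rewards.get)
    return best_seed, rewards


best_seed, rewards = select_best_run(seeds=[0, 1, 2, 3, 4])
```

Keeping the per-seed rewards alongside the selected seed makes it possible to inspect how much the runs differ, which is relevant to the seed sensitivity discussed in the ablation study.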

We mainly implement DURLECA based on Keras-RL (Plappert, 2016) with our modifications.

B.3. Disease

In a classical SIR model, the basic reproduction rate is calculated as (Jones, ). In our model, as the infection has been divided into two parts, we estimate an averaged over and according to their corresponding population size,


Then we estimate .

In Section 4.5, we vary the setting of and keep the same. To make a fair comparison, we also vary for each simulated disease to make sure the city has nearly the same number of hospitalized people when we start the intervention. For , we set . For , we set . For , we set .
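For orientation, the textbook relation for the classical SIR model and the generic form of a population-weighted average are written out below. The symbols $\beta$ (transmission rate), $\gamma$ (recovery rate), $x_1, x_2$ (two group-specific quantities) and $N_1, N_2$ (the corresponding population sizes) are generic assumptions and not necessarily the notation or the exact expression used in this paper.

```latex
% Classical SIR model: basic reproduction number
R_0 = \frac{\beta}{\gamma}

% Generic population-weighted average of two group-specific quantities
\bar{x} = \frac{N_1 x_1 + N_2 x_2}{N_1 + N_2}
```

With such an average, the larger group dominates the estimate, which is the intuition behind weighting the two parts of the infection by their population sizes.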

Appendix C Ablation Study

To evaluate the effectiveness of our proposed Flow-GNN and RL exploration strategies, we conduct ablation studies in this section.

GNN-Baselines: To evaluate the effectiveness of our proposed Flow-GNN, we use the well-known GraphSageConv layer (Hamilton et al., 2017) and our modified GraphSageConv layer to replace the proposed Flow-GNN layer in the actor network and the critic network. We name the two baselines GNN-Mean and GNN-Softmax.

The layer calculation of GNN-Mean follows Equation (22):


where denotes the connected regions of .

The layer calculation of GNN-Softmax follows Equation (23):


RL-Baselines: To evaluate the effectiveness of our RL exploration strategies, we remove the pseudo-expert strategy and the avoiding-extreme-points strategy, respectively. We refer to the two baselines as RL-NoEP and RL-NoThre.

                   Mean/Max       Total
No Intervention    27.21/157.55   1069.29   1      0    0
GNN-Mean           -/-            -         -      -    -
GNN-Softmax        0.53/1.88      7.79      0.06   26   28
RL-NoEP            0.41/1.45      5.87      0.00   27   27
RL-NoThre          1.43/3.68      86.82     0.75   0    9
DURLECA            0.60/2.28      19.07     0.76   0    0
Table 8. Ablation study when .

Results and Analysis: As shown in Table 8, without Flow-GNN or the proposed RL exploration strategies, the agent fails to learn a good policy.

The failure of GNN-Mean comes from its inability to learn weighted edge information, i.e., how many people move from one region to another. Even when weighted edge information is considered, GNN-Softmax still fails to retain mobility, as it cannot describe traffic flows and the epidemic transmission upon them. These results demonstrate the effectiveness of our proposed Flow-GNN.
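The difference between ignoring and using weighted edge information can be illustrated with a small NumPy sketch of the neighbor-aggregation step alone, without the learned weight matrices of a full layer. The mean variant follows the idea of a GraphSAGE-style mean aggregator; the softmax variant is only one plausible reading of a flow-weighted modification and is an assumption, as are the function names and the convention that `inflow[i, j]` holds the flow arriving at region `i` from region `j`. Neither function is the paper's Flow-GNN.

```python
import numpy as np


def mean_aggregate(features, inflow):
    """Unweighted mean over connected regions; flow volume is ignored."""
    adjacency = (inflow > 0).astype(float)
    degree = adjacency.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # regions without neighbors aggregate to zero
    return adjacency @ features / degree


def softmax_aggregate(features, inflow):
    """Softmax over flow volumes, so larger flows receive larger weights."""
    shifted = inflow - inflow.max(axis=1, keepdims=True)  # numerical stability
    exps = np.where(inflow > 0, np.exp(shifted), 0.0)     # mask unconnected pairs
    total = exps.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0
    return (exps / total) @ features


# Region 0 receives a large flow from region 1 and a small one from region 2.
inflow = np.array([[0.0, 10.0, 1.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
features = np.array([[0.0], [1.0], [0.0]])  # e.g. an infection indicator per region

mean_out = mean_aggregate(features, inflow)
softmax_out = softmax_aggregate(features, inflow)
```

In this toy case the mean aggregator treats both neighbors of region 0 equally, while the softmax aggregator is dominated by the region that sends the larger flow.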

RL-NoEP gives a long-term lockdown to the whole city, which is a typical local optimum. As for RL-NoThre, the agent successfully finds one policy that achieves relatively low hospitalizations and high mobility. However, this solution is worse than DURLECA’s. Besides, we find that the success of RL-NoThre relies heavily on luck: in four of our five repeated experiments, the agent was stuck in local optima, giving a long-term lockdown to the whole city. These failures, as discussed earlier in Section 1, are due to the difficulty of exploration. The agent easily encounters extreme points during exploration, and the extreme points force the agent to adopt conservative policies, i.e., locking the whole city down. Compared with RL-NoEP and RL-NoThre, DURLECA is guided by a pseudo-expert and is designed to avoid extreme points. Thus, DURLECA can find much better solutions.