1. Introduction
Epidemics have always threatened human society by forcing a dilemma between saving lives and saving the economy. A virus infects people who gather and spreads through daily commutes (Poletto et al., 2012; Balcan et al., 2010; Wesolowski et al., 2012). Controlling its spread requires cutting off daily mobility, which is a pillar of the modern economy. For instance, the recent outbreak of COVID-19 has caused millions of infections and hundreds of thousands of deaths. The epidemic has forced many municipal governments to issue stay-at-home orders, i.e., a Fully LockDown (FLD) policy. FLD lasts for weeks in most cities and thus deeply hurts the economy (Barua and others, 2020). Some municipalities try to quarantine only symptomatic people and their close contacts at the early stage of the epidemic. However, this infected-individual-quarantine policy is implementable only when governments can accurately and comprehensively trace risky people. It is also unreliable when there are many asymptomatic infected people. Current computer-science research explores using smartphone data to infer and trace high-risk people (Ferretti et al., 2020; Oliver et al., 2015). However, fully tracing individual mobility and contacts requires full smartphone coverage and raises privacy concerns (Cho et al., 2020). According to a survey by the University of Maryland and The Washington Post, around 60% of respondents either prefer not to share their private information or do not own a smartphone (Timberg et al., 2020). In summary, the vast amount of complex individual mobility and the presence of asymptomatic infected people prevent current epidemic-control policies from cutting off virus spread without hurting the economy when private information cannot be fully captured.
In this research, we demonstrate that a smart epidemic-control policy is still available even if private mobility information is unavailable. We develop a dynamic control framework that avoids an epidemic outbreak by limiting the probability that risky mobility occurs. Instead of targeting and limiting risky individuals’ mobility according to private data, our framework estimates each urban region’s risk of having a large infected population and uses the estimate to control interregional mobility. High-risk interregional mobility is limited to suppress the probability of infected people’s movement. Because infected people are a small proportion of the population even in a seriously infected city, only a small amount of mobility must be restricted. It is therefore possible to avoid an epidemic outbreak by heterogeneously limiting a small share of interregional mobility. Furthermore, the estimation is based on the regional aggregate demand for mobility and the regional epidemic statistics. Thus, private data is dispensable.
However, three specific challenges complicate estimating and controlling interregional mobility to suppress infection while protecting the economy. First, urban mobility is vast and varies over time, making it hard to target the truly risky mobility. Second, the requirement of practicality complicates the design of the epidemic-control policy: an implementable control policy cannot continuously quarantine the same urban region for too long. Last but not least, the search for a policy is difficult. Because epidemics grow exponentially, the number of future infections is a highly non-convex function of each previous decision, making it hard to explore the policy space. Furthermore, the dual objectives often cause the policy exploration to get stuck in local optima, which is exacerbated by the non-convexity of infections.
To address the above challenges, we develop a DUal-objective Reinforcement-Learning Epidemic Control Agent (DURLECA) framework that combines a Graph Neural Network (GNN) with a Reinforcement Learning (RL) approach to search for an effective mobility-control policy. DURLECA employs a GNN to estimate the virus-transmission risk induced by urban mobility, which is a dynamic flow on a graph. Based on the estimated risk, the RL agent periodically determines the extent of the restriction on each interregional mobility. The GNN of DURLECA has a novel architecture, namely Flow-GNN, designed to fit the virus-spread process on mobility flows, which existing GNN architectures cannot characterize. We also carefully construct a reward function for the RL agent that captures the natural trade-off between epidemic control and retaining urban mobility. The reward function also accounts for the difference between continuous and intermittent restrictions on the same region. Furthermore, we develop two RL exploration strategies that incorporate epidemic expert knowledge to guide and stabilize policy exploration.
Supported by a Susceptible-Infected-Hospitalized-Recovered (SIHR) epidemic simulation environment developed from the traditional SIR model (Kermack and McKendrick, 1927), DURLECA successfully finds a mobility-control policy that suppresses the epidemic while retaining most of the mobility. Our experiments on a real-world mobility dataset collected in Beijing demonstrate the effectiveness of DURLECA. Even if the city starts to suppress an epidemic whose basic reproduction number is 2.1 (footnote 1) 20 days after discovering the first patient, DURLECA still finds a policy where:
Footnote 1: Reference values of the basic reproduction number for SARS, influenza, and COVID-19 are listed at https://en.wikipedia.org/wiki/Basic_reproduction_number.

The peak demand for hospitalization is under 1.3‰ of the whole population (footnote 2). The average demand for hospitalization is kept under 0.4‰.
Footnote 2: The hospital bed density is 2.9‰ in the U.S., 4.2‰ in China, and 13.4‰ in Japan, according to https://www.indexmundi.com/g/r.aspx?v=2227&l=en.

76% of the total mobility is retained. In more than intervened days, twothirds regions retain over mobility. No region ever experiences a stringent day, i.e., daily retained mobility lower than 20%.
In summary, the contributions of this paper are threefold:

We bypass the privacy concern for smart epidemic control. Instead of directly tracing and quarantining risky individuals, we suppress the risk of an epidemic outbreak by estimating and restricting risky interregional aggregate mobility.

We develop DURLECA to dynamically generate customized control actions for interregional mobility, which allows a smart solution to the life-or-economy dilemma of epidemic control.

We propose innovative approaches to guarantee DURLECA’s capability. We design a novel GNN architecture that can fit the epidemic transmission dynamics. Our RL reward function captures the trade-off between epidemic suppression and mobility retention, and reflects practical requirements. We also develop two RL exploration strategies that incorporate epidemic expert knowledge to guide and stabilize policy exploration.
2. Preliminary and Problem Formulation
During the stay-at-home orders of COVID-19, governments distribute daily mobility quotas to each household to retain basic economic activities, such as food procurement. Building on this quota regulation, we develop a new policy environment. We assume that the government periodically predicts or collects the aggregate demand for interregional mobility of every Origin-Destination (OD) pair. The government also collects information about the number and location of currently discovered patients. This information is used to determine the quota for each interregional mobility. The quota distribution aims at minimizing the risk of an epidemic outbreak in the foreseeable future while retaining as much of the mobility demand as possible. In this section, we present the mobility and epidemic modeling that supports quota allocation.
2.1. Mobility Modeling
We model a city’s urban-mobility demand at each time step as a mobility matrix, whose elements represent the interregional mobility demand, i.e., the number of people who demand to move from one region to another. According to this demand and the epidemic information, the city government determines a mobility quota matrix at that time step, whose elements are the quota rates assigned to the mobility demand of each origin-destination pair. The allowed interregional mobility is then calculated according to the following equations.
(1)  
(2) 
In these equations, the quota matrix is produced by the mobility control function and is applied to the demand by element-wise multiplication. Note that the demand, quota, and allowed-mobility matrices are all square, with one row and one column per region of the studied city. We summarize the mobility-related notations in the Appendix.
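The quota step described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `apply_quota`, the variable names, and the toy numbers are assumptions made here, and the quota matrix is simply supplied by hand rather than produced by a learned control function.

```python
import numpy as np


def apply_quota(demand, quota):
    """Return the allowed OD flows: demand scaled element-wise by quota rates.

    demand[i, j] is the number of people who want to move from region i to j;
    quota[i, j] is the quota rate in [0, 1] granted to that OD pair.
    """
    demand = np.asarray(demand, dtype=float)
    quota = np.asarray(quota, dtype=float)
    if demand.ndim != 2 or demand.shape[0] != demand.shape[1]:
        raise ValueError("demand must be a square matrix (regions x regions)")
    if quota.shape != demand.shape:
        raise ValueError("quota must have the same shape as demand")
    if np.any(quota < 0) or np.any(quota > 1):
        raise ValueError("quota rates must lie in [0, 1]")
    return demand * quota  # element-wise multiplication


# Toy example with three regions (numbers are illustrative only).
demand = np.array([[0.0, 120.0, 30.0],
                   [80.0, 0.0, 40.0],
                   [10.0, 60.0, 0.0]])
quota = np.array([[1.0, 0.5, 1.0],
                  [0.25, 1.0, 1.0],
                  [1.0, 1.0, 1.0]])
allowed = apply_quota(demand, quota)
print(allowed)
```

In this toy case only the two restricted OD pairs lose mobility, while every other flow is left untouched, which is the heterogeneous restriction the framework aims for.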
2.2. Epidemic Modeling
The main challenge for urban epidemic control comes from infected people who are infectious but asymptomatic. Therefore, we develop a new epidemic model to capture the difference between asymptomatic and symptomatic people. Our model is based on the traditional Susceptible-Infected-Recovered (SIR) model in the public-health literature (Kermack and McKendrick, 1927). We introduce a new state beyond those of SIR, called Hospitalized. People in the Hospitalized state are infected with symptoms and thus will be quarantined or hospitalized. They do not participate in urban mobility and do not contribute to new infections. We refer to our model as the SIHR model (footnote 3).
Footnote 3: This modeling is different from the SEIR model (Li and Muldowney, 1995), which assumes that asymptomatic people are not infectious.
We use our SIHR model to capture the dynamic process of infection spread over urban mobility. The epidemic state of each region at each time step consists of its susceptible, infected, hospitalized, and recovered populations. The visible state of a region is the observable part of this state, in which healthy people cannot be differentiated from infectious asymptomatic people. We also track the total population of each region at each time step.
The epidemic state is updated in each time step. We separate each time step into two sub-steps: mobility happens, and then infection occurs. At the mobility-happening sub-step, people complete their moves between regions. We distinguish the epidemic state of the people who stay in a region from that of the new arrivals. The overall epidemic state at the mobility-happening sub-step is calculated as follows:
(3)  
(4)  
(5) 
At the infection-occurring sub-step, people who stay in a region infect each other. Simultaneously, new arrivals at the region infect each other. Therefore, the epidemic state is updated as follows:
(6)  
(7)  
(8)  
(9) 
The equations use two transmission rates, one for the staying people and one for the moving people, together with a hospitalization rate and a recovery rate. We use one set of these parameters for all regions at all time steps for simplification. We describe how we estimate them in the Appendix.
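The two sub-steps described above can be illustrated with a minimal NumPy sketch. It is not a reproduction of Equations (3) to (9): the function name `sihr_step`, the parameter names, the proportional split of movers across compartments, the mass-action infection term normalized by group size, and all numeric values are assumptions chosen for illustration. Hospitalized people neither move nor infect, and stayers and arrivals are infected within their own groups, as the text describes.

```python
import numpy as np


def sihr_step(S, I, H, R, flows, beta_stay, beta_move, gamma_h, gamma_r):
    """One time step: mobility happens, then infection occurs (a sketch)."""
    S, I, H, R = (np.asarray(x, dtype=float) for x in (S, I, H, R))
    flows = np.asarray(flows, dtype=float)  # flows[i, j]: allowed moves i -> j

    # Sub-step 1: mobility. Only S, I, R move; movers are assumed to be
    # drawn proportionally from each compartment of the origin region.
    mobile = S + I + R
    out_frac = np.divide(flows, mobile[:, None],
                         out=np.zeros_like(flows), where=mobile[:, None] > 0)
    leave_frac = out_frac.sum(axis=1)
    if np.any(leave_frac > 1 + 1e-9):
        raise ValueError("more people leave a region than can move")

    def split(x):
        # Returns (people staying in each region, people arriving at each region).
        return x * (1.0 - leave_frac), out_frac.T @ x

    S_stay, S_arr = split(S)
    I_stay, I_arr = split(I)
    R_stay, R_arr = split(R)

    # Sub-step 2: infection inside the staying group and inside the arrivals.
    def new_infections(beta, s, i, r):
        n = s + i + r
        return beta * np.divide(s * i, n, out=np.zeros_like(n), where=n > 0)

    new_inf = (new_infections(beta_stay, S_stay, I_stay, R_stay)
               + new_infections(beta_move, S_arr, I_arr, R_arr))

    S_mid, I_mid, R_mid = S_stay + S_arr, I_stay + I_arr, R_stay + R_arr
    to_hospital = gamma_h * I_mid   # infected people who develop symptoms
    to_recovered = gamma_r * H      # hospitalized people who recover
    return (S_mid - new_inf,
            I_mid + new_inf - to_hospital,
            H + to_hospital - to_recovered,
            R_mid + to_recovered)


# Toy example with two regions; all rates are arbitrary illustrative values.
S0, I0 = np.array([990.0, 1000.0]), np.array([10.0, 0.0])
H0, R0 = np.zeros(2), np.zeros(2)
flows = np.array([[0.0, 100.0], [50.0, 0.0]])
S1, I1, H1, R1 = sihr_step(S0, I0, H0, R0, flows,
                           beta_stay=0.3, beta_move=0.1,
                           gamma_h=0.2, gamma_r=0.1)
print(I1, H1)
```

The sketch conserves the total population, and the second region acquires infected people purely through arrivals, which is the mechanism that mobility restrictions are meant to suppress.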
2.3. MultiObjective Sequential Control Problem Formulation
The above mobility and epidemic modeling allows us to formulate the dynamic interregional mobility control problem of minimizing infections while maximizing retained mobility, shown in Equations (10) and (11):
(10)  
(11) 
Here, the objective function must reflect the trade-off between epidemic control and mobility retention, and it should also meet some practical requirements. In the next section, we detail the design of the objective function and use it as the reward function of the RL module of DURLECA. In addition, we account for the fact that government interventions are less frequent than mobility: mobility and infection are updated every hour, while the government determines mobility quotas every four hours.
3. DURLECA
The DUal-objective Reinforcement-Learning Epidemic Control Agent (DURLECA) is a GNN-enhanced RL agent that estimates regional infection risk and determines mobility quotas. An overview of DURLECA is shown in Figure 2. At each time step, DURLECA acquires an observation from the environment. According to this observation and the mobility demand, the RL agent gives a control action for the optimization problem in Equation (11). In the rest of this section, we provide the details of DURLECA.
3.1. Reinforcement Learning
We now reformulate the multi-objective sequential control problem using the basic elements of RL, i.e., state, action, reward, and learning algorithm.
State: We take the visible epidemic state, its first-order temporal derivatives, the mobility demand, and the historical mobility loss (defined later) as the state for RL.
Action: The action of RL is the mobility restriction that determines the quota rate for each interregional mobility at the current time step. Each element of the action is a real number between 0 and 1.
Reward: The reward function is designed to reflect the objective of the optimization problem in Equation (11). It includes two terms: an infection-spread-cost term and a mobility-restriction-cost term. To guide the RL agent toward an effective and practical mobility-control policy, we design the reward function to satisfy the following three requirements:

Reflecting the tradeoff relation between infection control and mobility retaining.

Capturing the exponential growth of the social cost caused by infection spread. The social cost is low when the infected population is small. However, the social cost will skyrocket once the infection population exceeds the capacity of the city’s healthcare system.

Penalizing continuous mobility restrictions in the same region. People’s tolerance for mobility restrictions is limited. Thus, the reward function has to include a growing penalty for continuously restricting the same region.
According to the above three requirements, we separately design the infection-spread-cost term and the mobility-restriction-cost term. We model the infection-spread cost as follows:
(12) 
Here, one hyperparameter determines the start-up social cost of a city having its first patient, while another determines how the social cost increases with the number of patients. The mobility-restriction-cost term is defined below:
(13)  
(14) 
Here, the historical mobility loss is the amount of mobility restricted in the past, and it induces an exponentially growing penalty on the current restriction. One hyperparameter determines the discount rate of the impact of historical restrictions; another determines how large the penalty is for continuously limiting the same region. Finally, we develop the RL reward function as follows:
(15) 
Note that this design makes the reward function reflect two facts: the infection-spread cost booms once the whole city’s hospitalized population exceeds the capacity of the city’s healthcare system, and the mobility-restriction cost skyrockets if any single region is continuously restricted for multiple periods.
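The qualitative behavior of the two cost terms can be illustrated with a short NumPy sketch. The functional forms below are assumptions chosen only to match the description (an exponential infection cost, a discounted restriction history, and a restriction cost that grows exponentially with that history); they are not the paper's Equations (12) to (15), and every name and constant is hypothetical.

```python
import numpy as np


def infection_cost(mean_hospitalized, start_cost, growth):
    """Assumed form: cost grows exponentially with the hospitalized population."""
    return start_cost * np.exp(growth * mean_hospitalized)


def update_history(history, restricted_ratio, decay):
    """Discounted accumulation of past restrictions, per region (assumed form)."""
    return decay * history + restricted_ratio


def mobility_cost(restricted_ratio, history, tolerance):
    """Assumed form: current restriction weighted by exp(history / tolerance)."""
    return float(np.sum(restricted_ratio * np.exp(history / tolerance)))


def reward(mean_hospitalized, restricted_ratio, history,
           start_cost=1.0, growth=0.5, tolerance=1.0):
    """Negative sum of the two costs (sign convention is an assumption)."""
    return -(infection_cost(mean_hospitalized, start_cost, growth)
             + mobility_cost(restricted_ratio, history, tolerance))


def last_step_cost(restrictions, decay=0.9, tolerance=1.0):
    """Restriction cost of the final step for one region's restriction schedule."""
    history = np.zeros(1)
    cost = 0.0
    for r in restrictions:
        ratio = np.array([r])
        history = update_history(history, ratio, decay)
        cost = mobility_cost(ratio, history, tolerance)
    return cost


# Same final restriction, but one schedule restricts continuously.
continuous = last_step_cost([0.5, 0.5, 0.5])
intermittent = last_step_cost([0.5, 0.0, 0.5])
print(continuous, intermittent)
```

Under these assumed forms, the same restriction costs more when it follows uninterrupted earlier restrictions, which is the behavior the third requirement asks for.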
Learning Algorithm: DURLECA employs a Deep Deterministic Policy Gradient (DDPG) agent (Lillicrap et al., 2015) to search for a mobility-control policy because the action space is continuous. The DDPG agent is composed of a critic network and an actor network. The critic network estimates the expected reward gained by a control action. The actor network searches for the best action, which gives quota rates for all interregional mobility, by maximizing the critic network’s output. We use Parameter Noise (Plappert et al., 2017) to improve exploration during RL training.
3.2. FlowGNN
Both the critic and the actor networks have to capture the graph nature of urban mobility, where regions are nodes connected by OD flows. Therefore, we build both of them on GNNs. We design a novel GNN architecture that can characterize the epidemic transmission process driven by regional infection aggregation upon interregional mobility. We refer to the proposed GNN as Flow-GNN; it is developed on the basis of GraphSage (Hamilton et al., 2017).
In particular, we design Flow-GNN to fit low-frequency mobility control associated with high-frequency mobility dynamics. Because we determine mobility quotas every four hours, we include four Flow-GNN layers in our network and input the edge information chronologically. Each layer receives its own edge-input information, and the feature of each region output by each GNN layer is calculated according to the following equations:
(16)  
(17)  
(18) 
The equations involve a concatenation operator, a nonlinear activation function, and trainable parameters. The above equations correspond to our modeling of epidemic transmission in Section 2.2, where we separate each time step into a mobility-happening sub-step and an infection-occurring sub-step. Equation (16) describes the epidemic feature of the staying population at the mobility-happening sub-step, while Equation (17) represents the epidemic feature of the new-arrival population in the same sub-step. Equation (18) characterizes the epidemic transmission in the staying population and the new-arrival population.
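A flow-based layer of this kind can be sketched with NumPy. This is an assumed simplification, not the paper's Equations (16) to (18): the names `flow_gnn_layer` and `flow_gnn`, the use of outflow fractions as edge weights, a single weight matrix per layer, and ReLU as the activation are all illustrative choices.

```python
import numpy as np


def flow_gnn_layer(h, flows, population, weight):
    """One flow-based message-passing layer (illustrative sketch).

    h:          (N, d) region features
    flows:      (N, N) OD flows for one hour, flows[i, j] from i to j
    population: (N,)   regional population used to normalize the flows
    weight:     (2 * d, d_out) parameters applied to [stay || arrive]
    """
    h = np.asarray(h, dtype=float)
    flows = np.asarray(flows, dtype=float)
    population = np.asarray(population, dtype=float)
    out_frac = np.divide(flows, population[:, None],
                         out=np.zeros_like(flows),
                         where=population[:, None] > 0)
    stay = h * (1.0 - out_frac.sum(axis=1))[:, None]   # feature of stayers
    arrive = out_frac.T @ h                            # feature of new arrivals
    combined = np.concatenate([stay, arrive], axis=1)  # concatenation
    return np.maximum(combined @ weight, 0.0)          # ReLU activation


def flow_gnn(h, hourly_flows, population, weights):
    """Stack one layer per hourly flow matrix, fed in chronological order."""
    for flows, weight in zip(hourly_flows, weights):
        h = flow_gnn_layer(h, flows, population, weight)
    return h


# Toy example: 3 regions, 4 features, 4 hourly flow matrices (random values).
rng = np.random.default_rng(0)
h0 = rng.random((3, 4))
population = np.array([1000.0, 800.0, 1200.0])
hourly_flows = [rng.random((3, 3)) * 50 for _ in range(4)]
weights = [rng.normal(size=(8, 4)) * 0.1 for _ in range(4)]
out = flow_gnn(h0, hourly_flows, population, weights)
print(out.shape)
```

Feeding one hourly flow matrix per layer is what lets a single forward pass cover the four hours of mobility between two consecutive quota decisions.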
3.3. Exploration Strategies
The exponential growth of epidemics and our dual objectives make RL exploration difficult and increase the agent’s risk of falling into local optima. We design two RL exploration strategies to address this problem. The first incorporates pseudo-expert knowledge to improve RL search efficiency. The second protects the agent from falling into local optima by stopping it from exploring apparently unreasonable policies.
Generating and Incorporating a Pseudo Expert: We can generate simple but dynamic policies from current epidemic-management experience, which provide a good starting point for RL exploration. For instance, most cities currently restrict the mobility of regions with a large symptomatic population, while a region urgently needs to reopen if it has been continuously locked down for a long time. Thus, we design a pseudo expert, which controls mobility as follows:
(19) 
The expert locks a region down when two conditions hold: 1) the number of hospitalized (i.e., symptomatic) patients in the region exceeds a hospitalization threshold; 2) the region has not been restricted very much in the past, i.e., its historical mobility loss does not exceed a second threshold. During testing, this expert is also used as a baseline for comparison.
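The two lock-down conditions can be written as a small rule. The sketch below is illustrative only: the function name, the threshold names, and the choice to set the quota of an OD pair to zero whenever either endpoint is locked are assumptions, since Equation (19) itself is not reproduced here.

```python
import numpy as np


def pseudo_expert_quota(hospitalized, history, h_threshold, l_threshold):
    """Return an (N, N) quota matrix from the two lock-down conditions.

    A region is locked when its hospitalized count exceeds h_threshold AND
    its historical mobility loss does not exceed l_threshold.
    """
    hospitalized = np.asarray(hospitalized, dtype=float)
    history = np.asarray(history, dtype=float)
    locked = (hospitalized > h_threshold) & (history <= l_threshold)
    open_region = ~locked
    # Assumption: an OD pair is allowed only if both endpoints are open.
    return np.outer(open_region, open_region).astype(float)


# Region 0 has patients and little past restriction, so it is locked.
# Region 2 has patients but has already been restricted a lot, so it reopens.
quota = pseudo_expert_quota(hospitalized=[5, 0, 5],
                            history=[0, 0, 200],
                            h_threshold=0, l_threshold=168)
print(quota)
```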
We let the agent first explore with the expert’s guidance and then gradually learn to explore by itself and outperform the expert. The idea is inspired by the approach adopted to develop AlphaGo (Silver et al., 2016). Specifically, during training we set an adaptive probability with which the agent directly chooses the expert action instead of its own action. This design enables the agent to compare the pseudo-expert strategy with its own, which prevents the agent from moving in inefficient directions at the initial stage of training. The adaptive probability decreases with the number of training steps, which enables the agent to explore broadly and outperform the expert at the later stage of training.
Avoiding Extreme Points: Wide exploration might lead the agent into extreme points. Training might become unstable due to a sudden large loss caused by a poor control action. Meanwhile, the strong incentive to avoid the large loss can push the agent into locally optimal control policies, such as a permanent full lockdown. To avoid such extreme points, we set two rules:

The infection threshold: If the agent explores into a state where the regional mean number of infected people exceeds this threshold, the episode ends and the agent receives a large penalty.

The lockdown threshold: If the agent explores into a state where some region exceeds this threshold, the episode ends and the agent receives a large penalty.
The two rules are straightforward but effective in helping the RL agent avoid potential local optima.
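Both exploration strategies can be sketched as small helper functions. The linear decay schedule, the function names, and the reading of the lockdown rule as a bound on each region's historical mobility loss are assumptions made for illustration; the text only states that the expert probability decreases with training steps and that episodes end when a threshold is exceeded.

```python
import random

import numpy as np


def expert_probability(step, decay_steps, p_start=1.0, p_end=0.0):
    """Probability of following the expert; assumed to decay linearly."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def choose_action(agent_action, expert_action, step, decay_steps, rng):
    """Pick the expert action with the current probability, else the agent's."""
    if rng.random() < expert_probability(step, decay_steps):
        return expert_action
    return agent_action


def should_terminate(infected, history, infection_threshold, lockdown_threshold):
    """End the episode early if either extreme-point rule is violated."""
    too_many_infections = np.mean(infected) > infection_threshold
    too_much_lockdown = np.max(history) > lockdown_threshold
    return bool(too_many_infections or too_much_lockdown)


rng = random.Random(0)
early = choose_action("agent", "expert", step=0, decay_steps=100, rng=rng)
late = choose_action("agent", "expert", step=100, decay_steps=100, rng=rng)
print(early, late)
```

In a training loop, an early termination flagged by `should_terminate` would be paired with the large penalty mentioned above, so that the critic learns to value such states poorly without the agent dwelling in them.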
4. Experimental Evaluations
In this section, we conduct extensive experiments to answer the following research questions:
RQ1: Can DURLECA resolve the life-or-economy dilemma?
RQ2: Can DURLECA adapt to both early intervention and late intervention?
RQ3: Can DURLECA be generalized to different cities and different diseases?
Besides, we conduct ablation studies in Appendix to evaluate the effectiveness of our proposed FlowGNN and RL exploration strategies.
4.1. Dataset
We use a real-world OD dataset collected by a mobile operator in Beijing to evaluate DURLECA. The dataset divides Beijing into regions and covers 544,623 residents. The dataset covers 24-hour OD flows for the whole month of January 2019. We repeat the one-month data 24 times to obtain a prolonged dataset of 24 months, so that we have a sufficiently long period for discussing epidemic control. We list other details in the Appendix.
4.2. Metrics and Settings
We design six metrics to evaluate the performance of DURLECA in resolving the life-or-economy dilemma. We introduce the metrics below and summarize them in Table 1.
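Table 1:
Metric | Value range | Physical meaning
Mean/Max hospitalized | 0-1686 | Temporal mean/max of the hospitalized population
Total infected | 0-1686 | Total infected population after the epidemic
Quota rate | 0-1 | Total quota rate
City duration | 0-744 | The city 20%-mobility duration
Region duration | 0-744 | The region 20%-mobility duration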
We select three metrics to assess the epidemic-suppressing performance of an epidemic-control policy: the total number of infected people by the end of the epidemic period, the mean number of hospitalized people over time, and the peak demand for hospitalization capacity, i.e., the maximum hospitalized population over time. The total determines the overall social medical cost, while the mean and the maximum reflect the sustained and peak pressure on the healthcare system. We also select three metrics to assess the mobility-retaining performance of an epidemic-control policy: the total ratio of retained mobility, the duration of stringent mobility restrictions on the whole city, and the duration of stringent mobility restrictions on the most restricted region.
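The three mobility-retaining metrics can be computed from the demand and allowed OD flows, as in the following sketch. The function name, the attribution of each flow to its origin region, and the handling of periods without demand are assumptions; the 20% threshold follows the definition of a stringent restriction used in this paper, and each entry along the first axis stands for one aggregation period.

```python
import numpy as np


def mobility_metrics(demand, allowed, threshold=0.2):
    """Compute (total quota rate, city duration, region duration).

    demand, allowed: arrays of shape (T, N, N), one OD matrix per period.
    A period is stringent when retained mobility falls below `threshold`.
    """
    demand = np.asarray(demand, dtype=float)
    allowed = np.asarray(allowed, dtype=float)

    def ratio(num, den):
        # Periods or regions without demand count as fully retained.
        return np.divide(num, den, out=np.ones_like(den), where=den > 0)

    total_quota = allowed.sum() / demand.sum()
    city_ratio = ratio(allowed.sum(axis=(1, 2)), demand.sum(axis=(1, 2)))
    region_ratio = ratio(allowed.sum(axis=2), demand.sum(axis=2))  # by origin
    city_duration = int((city_ratio < threshold).sum())
    region_duration = int((region_ratio < threshold).sum(axis=0).max())
    return total_quota, city_duration, region_duration


# Toy example: two regions, three periods.
demand = np.array([[[0, 10], [10, 0]]] * 3, dtype=float)
allowed = np.array([[[0, 10], [10, 0]],
                    [[0, 1], [10, 0]],
                    [[0, 1], [1, 0]]], dtype=float)
total_quota, city_duration, region_duration = mobility_metrics(demand, allowed)
print(total_quota, city_duration, region_duration)
```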
Epidemic Settings: Without the loss of generality, we set in most of our experiments. The estimated basic reproduction number is 2.1.
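Intervention time: We define the intervention time as the time when the policymakers discover the epidemic and start to intervene. In our experiments, we compare results for interventions that start 0, 10, and 20 days after the first patient is discovered.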
For more details about our experiment settings, please refer to Appendix. Our implementation is available online at https://github.com/anyleopeace/DURLECA.
4.3. Performance Comparison
Baselines: We set four expert baselines to simulate different real-world expert policies and compare them with DURLECA in resolving the life-or-economy dilemma.

EP-Fixed: In the real world, a simple but inflexible control is to restrict all mobility in the city. To simulate it, we design EP-Fixed to give a fixed quota rate to all interregional mobility during the whole epidemic period. We set the quota rate to 15% and 20% in our experiments, as we find these values to be at the boundary of successfully controlling the epidemic.

EP-Soft: We design an expert baseline following Equation (19), which softly depends on the historical mobility loss and the current hospitalized population to determine whether to lock down a region. One of its thresholds is chosen so that the expert receives information equivalent to that of DURLECA; the other corresponds to the real-world control policy in some countries: a continuous 7-day (168-hour) lockdown.

EP-Hard: Without softly depending on the historical mobility loss, an expert can reopen a region if it has been locked down for a given number of successive days. This expert, namely EP-Hard, gives daily quotas as follows:
(20)
Its parameters are set for reasons similar to those of EP-Soft.

EP-Lockdown: The most robust and conservative policy is to lock down the whole city until the epidemic ends. To simulate it, we design an expert following Equation (19) but without a limit on the historical mobility loss. It can lock down a region for an arbitrarily long time until the hospitalized population becomes zero.
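Table 2:
Policy | Mean/Max hospitalized | Total infected | Quota rate | City duration | Region duration
No Intervention | 27.21/157.55 | 1069.29 | 1 | 0 | 0
EP-Fixed (20%) | 4.03/17.96 | 877.32 | 0.20 | 724 | 724
EP-Fixed (15%) | 0.44/1.55 | 10.18 | 0.15 | 724 | 724
EP-Soft | 4.66/53.80 | 1040.68 | 0.57 | 9 | 19
EP-Hard | 0.45/1.78 | 8.31 | 0.13 | 36 | 36
EP-Lockdown | 0.41/1.42 | 5.75 | 0 | 27 | 27
DURLECA | 0.60/2.28 | 19.07 | 0.76 | 0 | 0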
Results and Analysis: We compare DURLECA with all baselines for an intervention that starts 20 days after the first patient is discovered in Table 2. We also visualize three selected results in Figure 3. Expert baselines can achieve only one goal in the life-or-economy dilemma, while DURLECA can achieve both.
EP-Soft retains 57% of the total mobility. However, it leads to an epidemic outbreak, reflected by the very high mean, maximum, and total values, and the healthcare system would break down. EP-Fixed (15%), EP-Hard, and EP-Lockdown keep the mean and maximum hospitalized population at a low level so that the healthcare system is not overwhelmed. However, their low quota rates indicate that all of them fail to retain mobility. Their long city and region durations show that some regions and the city have to experience long-term lockdown, which is unacceptable damage to the economy. Besides, the differences between EP-Fixed (15%) and EP-Fixed (20%) in the mean, maximum, and total values indicate that the expert control is very vulnerable to mobility perturbation. These results show that all those expert policies fail to resolve the life-or-economy dilemma of epidemic control.
Compared with those baselines, DURLECA simultaneously suppresses the epidemic and retains a large amount of mobility. DURLECA achieves low mean and maximum hospitalized populations, which guarantees that the demand for hospitalization will not exceed the capacity of most countries. DURLECA also keeps the total infected population at a low level. The red curve in Figure 3 presents the performance of DURLECA in epidemic suppression.
DURLECA also retains the most mobility: 76% of the total mobility during the intervened period is retained. Furthermore, no region is fully locked down. DURLECA retains 70-100% of mobility for most regions in the city, so the economic loss due to epidemic control can be significantly reduced. In all, DURLECA successfully resolves the life-or-economy dilemma.
DURLECA’s control is highly customized and dynamic, which is hard for human experts to mimic. In Figure 4, we visualize the spatial distribution of quota rates and the associated histogram in four selected periods. Figure 4 shows DURLECA’s smartness in distributing quotas according to both epidemic risks and mobility patterns. The agent tends to give more quotas in a low-risk and low-mobility period and fewer quotas in either a high-risk or a high-mobility period.
4.4. Comparison of the Scenarios of Early/Late Intervention
To examine whether DURLECA is still effective if the government’s intervention comes later than the discovery of the first patient, we compare DURLECA’s performance in three scenarios. We have discussed the very-late-intervention scenario, where the government starts to act 20 days after discovering the first patient, in Section 4.3. Here, we compare the early-intervention scenario (day 0) and the late-intervention scenario (day 10). The results are shown in Table 3.
We find that: 1) EP-Soft can control the epidemic in the early-intervention scenario. Because the virus has not widely spread, restricting the few infected areas is enough to suppress the epidemic. However, it fails to avoid an epidemic outbreak in the late-intervention scenario. 2) EP-Hard and EP-Lockdown can control the epidemic under both scenarios. However, they lock all risky regions down and cut off most mobility. 3) DURLECA successfully suppresses the epidemic while retaining the majority of urban mobility in both scenarios.
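Table 3:
Policy | Intervention day | Mean/Max hospitalized | Total infected | Quota rate
EP-Soft | 0 | 0.02/0.03 | 0.08 | 0.18
EP-Hard | 0 | 0.03/0.04 | 0.10 | 0.13
EP-Lockdown | 0 | 0.01/0.03 | 0.09 | 0
DURLECA | 0 | 0.03/0.04 | 0.27 | 0.73
EP-Soft | 10 | 4.67/62.10 | 1041.64 | 0.57
EP-Hard | 10 | 0.08/0.14 | 0.64 | 0.11
EP-Lockdown | 10 | 0.08/0.14 | 0.55 | 0
DURLECA | 10 | 0.07/0.16 | 1.31 | 0.74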
4.5. Generalization Ability
We examine the generalization ability of DURLECA under different urban settings and diseases. Cities have different capacities for hospitalization treatment. Heterogeneous economic structures also cause cities to differ in their tolerance for mobility restrictions. We vary the reward hyperparameters that represent these urban features and examine DURLECA’s performance. The results are shown in Table 4. They demonstrate that DURLECA finds different policies in response to changes in urban settings. For instance, a higher value of the capacity-related hyperparameter leads to more mobility and more hospitalizations, which suggests that cities with higher hospitalization capacities can take more patients and retain more mobility. We also examine DURLECA’s adaptiveness to various diseases. We vary the epidemic parameters to simulate diseases with different basic reproduction numbers. We find that DURLECA is also able to adjust its epidemic-control policy to different diseases (Table 5). For instance, DURLECA provides loose mobility restrictions for diseases with a low basic reproduction number but stringent mobility restrictions for those with a high one. DURLECA’s adaptiveness to changes in urban and disease settings demonstrates not only its generalization ability but also its smartness.
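Table 4:
Capacity-related hyperparameter | Tolerance-related hyperparameter | Mean/Max hospitalized | Total infected | Quota rate
1 | 72 | 0.54/1.74 | 10.02 | 0.38
3 | 72 | 0.60/2.28 | 19.07 | 0.76
10 | 72 | 2.79/6.34 | 223.29 | 0.90
3 | 48 | 1.69/4.60 | 153.54 | 0.88
3 | 72 | 0.60/2.28 | 19.07 | 0.76
3 | 168 | 0.45/1.58 | 16.69 | 0.71

Table 5:
Basic reproduction number | Mean/Max hospitalized | Total infected | Quota rate
1.4 | 0.35/1.20 | 10.96 | 0.86
2.1 | 0.60/2.28 | 19.07 | 0.76
3.5 | 0.81/4.08 | 23.18 | 0.46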
5. Related work
Epidemic Modeling: The SIR model is a widely used mathematical model in epidemiology, which divides the population into three states: susceptible, infected, and recovered (Kermack and McKendrick, 1927). Based on the SIR model, Ogren and Martin used an embedded Newton algorithm to help find an optimal control strategy (Ogren and Martin, 2000). Distributed and discrete delays in the SIR model were also studied (McCluskey, 2010). Considering a more practical epidemic scenario, the SEIR model added an Exposed state to deal with the incubation period (Li and Muldowney, 1995). Others extended the differential equations to consider the consequences of vaccination for a measles epidemic (Allen, ). Later works also tried to incorporate human spatial patterns into the epidemic model. Sattenspiel et al. presented how contacts occur between individuals from different regions and how they influence epidemic spread (Sattenspiel et al., 1995). Balcan et al. presented the GLobal Epidemic and Mobility model, which integrated sociodemographic and population mobility data in a spatially structured stochastic approach (Balcan et al., 2010). Different from previous works, we distinguish visible and invisible infections and model epidemic transmission upon traffic flows, so as to support exploring mobility-control policies for epidemic control.
Graph Neural Network for OD-flows: The problem of estimating, predicting, and controlling human flows between regions has been addressed using neural networks since (Lorenzo and Matteo, 2013). Because most OD flows are modeled on graphs, the Graph Neural Network (GNN), first suggested in (Scarselli et al., 2009), is of great importance. Later, GNNs were used to predict future mobility flows (Chai et al., 2018; Geng et al., 2019; Wang et al., 2019). Besides, (Derr et al., 2020) borrowed knowledge from epidemic models to design a GNN for node prediction in documents. However, existing GNN architectures lack the ability to model the virus-spreading flow. Our Flow-GNN allows our model to characterize the virus-spreading flow and guarantees DURLECA’s capability.
Deep Reinforcement Learning: Deep Reinforcement Learning (DRL) has proved effective for control problems that have a large action space (Mnih et al., 2015; Lillicrap et al., 2015; Van Hasselt et al., 2016). DQN (Mnih et al., 2015) and DDPG (Lillicrap et al., 2015) are two representative DRL algorithms, proposed for discrete control problems and continuous control problems, respectively. To enable the agent to find an optimal solution, later works proposed to enhance exploration (Plappert et al., 2017; Fortunato et al., 2017; Henderson et al., 2018). Imitation learning is another area of RL, where the goal is to enable the agent to behave like a human expert (Hussein et al., 2017). AlphaGo proposed to start from imitation and then explore further to outperform the expert (Silver et al., 2016). In (Wijayanto and Murata, 2019), DQN was also used for node protection against epidemics under a single objective. Compared with it, both our control action and our objectives are more complex and practical, and thus our RL training is more challenging. We design two strategies to address the exploration challenge.
6. Future research and Conclusion
A series of problems call for future study on smart and privacy-protected epidemic control, for which DURLECA can serve as the framework. In this research, we do not consider the uncertainty of mobility and epidemic information when DURLECA explores epidemic policies. Future work should explore algorithms that search for a robust policy when the information is uncertain, since a practical policy has to remain robust even when the input data contain errors.
To conclude, this research demonstrates several important facts that broaden society’s view of epidemic control. They are listed below:

Private data is dispensable because restricting aggregated inter-regional mobility sufficiently lowers the probability of infectious people’s movement and thus suppresses the risk of epidemic transmission.

Resolving the life-or-economy dilemma of epidemic control requires dynamic and customized regional policies.

The effectiveness of our GNN-enhanced RL in epidemic control shows that field knowledge is critical for AI-system architecture and valuable for neural network training.

In all, smart governance empowered by AI can help protect future society from the loss of lives due to epidemics and from the economic risk caused by epidemic control.
Acknowledgement
This work was supported in part by The National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, 61861136003, Beijing Natural Science Foundation under L182038, Beijing National Research Center for Information Science and Technology under 20031887521, research fund of Tsinghua University-Tencent Joint Laboratory for Internet Innovation Technology, and The Information Core Technology Center at Institute for Interdisciplinary.
References
 [1] An introduction to mathematical biology. 2007. ISBN 10, pp. 0–13. Cited by: §5.
 Modeling the spatial spread of infectious diseases: the global epidemic and mobility computational model. Journal of computational science 1 (3), pp. 132–145. Cited by: §1, §5.
 Understanding coronanomics: the economic implications of the coronavirus (covid19) pandemic. SSRN Electronic Journal https://doi org/10/ggq92n. Cited by: §1.
 Bike flow prediction with multigraph convolutional networks. pp. 397–400. Cited by: §5.
 Contact tracing mobile apps for covid19: privacy considerations and related tradeoffs. arXiv preprint arXiv:2003.11511. Cited by: §1.
 Epidemic graph convolutional network. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 160–168. Cited by: §5.
 Quantifying sarscov2 transmission suggests epidemic control with digital contact tracing. Science 368 (6491). Cited by: §1.
 Noisy networks for exploration. arXiv preprint arXiv:1706.10295. Cited by: §5.

 Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3656–3663. Cited by: §5.
 Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: Appendix C, §3.2.
 Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §5.
 Imitation learning: a survey of learning methods. ACM Comput. Surv. 50 (2). External Links: ISSN 03600300, Link, Document Cited by: §5.
 [13] Notes on r0. Cited by: §B.3.
 A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london. Series A, Containing papers of a mathematical and physical character 115 (772), pp. 700–721. Cited by: §1, §2.2, §5.
 Global stability for the seir model in epidemiology. Mathematical biosciences 125 (2), pp. 155–164. Cited by: §5, footnote 3.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §3.1, §5.
 OD matrices network estimation from link counts by neural networks. Journal of Transportation Systems Engineering and Information Technology 13 (4), pp. 84–92. Cited by: §5.
 Complete global stability for an sir epidemic model with delay—distributed or discrete. Nonlinear Analysis: Real World Applications 11 (1), pp. 55–59. Cited by: §5.
 Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §5.
 Optimal vaccination strategies for the control of epidemics in highly mobile populations. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), Vol. 2, pp. 1782–1787. Cited by: §5.
 Mobile network data for public health: opportunities and challenges. Frontiers in public health 3, pp. 189. Cited by: §1.
 Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: §3.1, §5.
 Kerasrl. GitHub. Note: https://github.com/kerasrl/kerasrl Cited by: §B.2.
 Heterogeneous length of stay of hosts’ movements and spatial epidemic spread. Scientific reports 2, pp. 476. Cited by: §1.
 A structured epidemic model incorporating geographic mobility among regions. Mathematical biosciences 128 (1), pp. 71–92. Cited by: §5.
 The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §5.
 Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–489. Cited by: §3.3, §5.
 Most americans are not willing or able to use an app tracking coronavirus infections. that’sa problem for big tech’s plan to slow the pandemic. Washington Post. Retrieved from https://www. washingtonpost. com/technology/2020/04/29/mostamericansarenotwillingorableuseanapptrackingcoronavirusinfectionsthatsproblembigtechsplanslowpandemic. Cited by: §1.
 Deep reinforcement learning with double qlearning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §5.
 Origindestination matrix prediction via graph convolution: a new perspective of passenger demand modeling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1227–1235. Cited by: §5.
 Quantifying the impact of human mobility on malaria. Science 338 (6104), pp. 267–270. Cited by: §1.
 Effective and scalable methods for graph protection strategies against epidemics on dynamic networks. Applied Network Science 4 (1), pp. 18. Cited by: §5.
Appendix A Notation Summary
Term/Notation  Definition 

superscript  At time step . 
subscript  The original demand without restrictions. 
subscript  With restriction . 
subscript ,  Region index 
The mobility. A matrix.  
The OD flow from to . A scalar.  
. The outflow from . A scalar.  
. The mean outflow from . A scalar.  
. The region quota rate. A scalar.  
. The city quota rate. A scalar.  
. The total quota rate. A scalar. 
Appendix B Experiment Settings and Reproducibility
B.1. Dataset
We list the dataset details in Table 7, where counts the mean probability for an individual to move in one hour.
City  Regions  Mean Population  Duration  

Beijing  1686  0.18  744 Days 
Privacy and ethical concerns: We have taken the following steps to address privacy and ethical concerns. First, all of the researchers have been authorized by the data provider to use the data for research purposes only. Second, the data is completely anonymized. Third, we store all the data on a secured offline server.
B.2. Implementation Details
Without loss of generality, we set the moving transmission rate , the staying transmission rate , the hospitalization rate and the recovery rate . Without intervention, the estimated basic reproduction rate is 2.1 at the initial stage of the epidemic. For the reward, we mainly set . For the pseudo expert, we set . For the extreme point threshold, we set .
During training, we randomly initialize an epidemic state at the start of each episode. We train DURLECA for 400,000 steps, using the Adam optimizer with a learning rate of 0.0001. During testing, we fix one epidemic-initialization setting and compare different baselines. To account for the randomness of training, we train DURLECA 5 times with different random seeds for each set of configurations, and report the run that achieves the best episode reward.
We mainly implement DURLECA based on KerasRL (Plappert, 2016) with our modifications.
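The seed-selection protocol described above can be sketched as follows. This is a minimal illustration rather than the actual training script: `train_and_evaluate` is a hypothetical placeholder for one full training run that returns the final episode reward, `best_of_seeds` is a hypothetical helper name, and the seed values are arbitrary.

```python
import random
from typing import Callable, Dict, Iterable, Tuple


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for one full training run.

    A real run would train the agent with this seed and return the
    episode reward on the fixed test setting. Here we return a
    seed-dependent pseudo-random number so the sketch is runnable.
    """
    return random.Random(seed).uniform(0.0, 1.0)


def best_of_seeds(
    seeds: Iterable[int],
    run: Callable[[int], float] = train_and_evaluate,
) -> Tuple[int, float, Dict[int, float]]:
    """Run once per seed and keep the run with the highest episode reward."""
    rewards = {seed: run(seed) for seed in seeds}
    best_seed = max(rewards, key=rewards.get)
    return best_seed, rewards[best_seed], rewards


if __name__ == "__main__":
    # Five seeds per configuration, as in the protocol described in the text.
    seed, reward, all_rewards = best_of_seeds(range(5))
    print(f"best seed = {seed}, episode reward = {reward:.3f}")
```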
B.3. Disease
In a classical SIR model, the basic reproduction rate is calculated as (Jones, ). In our model, as the infection has been divided into two parts, we estimate an averaged over and according to their corresponding population size,
(21) 
Then we estimate .
In Section 4.5, we vary the setting of and keep the same. To make a fair comparison, we also vary for each simulated disease to make sure the city has nearly the same number of hospitalized people when we start the intervention. For , we set . For , we set . For , we set .
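As a rough illustration of the estimate in this subsection, the sketch below computes the classical SIR value R0 = β/γ and a population-weighted average of two transmission rates. It is only a sketch under stated assumptions: the weighting formula, the function names, and every numeric value are hypothetical placeholders, and they are not the paper's Equation (21) or its parameter settings.

```python
def sir_r0(beta: float, gamma: float) -> float:
    """Classical SIR basic reproduction number: R0 = beta / gamma."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return beta / gamma


def weighted_beta(
    beta_move: float, beta_stay: float, pop_move: float, pop_stay: float
) -> float:
    """Average two transmission rates, weighted by population size.

    Assumed form (not taken from the paper):
        (beta_move * pop_move + beta_stay * pop_stay) / (pop_move + pop_stay)
    """
    total = pop_move + pop_stay
    if total <= 0:
        raise ValueError("total population must be positive")
    return (beta_move * pop_move + beta_stay * pop_stay) / total


if __name__ == "__main__":
    # Illustrative numbers only: a quarter of the people are moving,
    # and moving people transmit more readily than staying people.
    beta_avg = weighted_beta(beta_move=0.4, beta_stay=0.1,
                             pop_move=0.25, pop_stay=0.75)
    print("averaged beta:", beta_avg)
    print("estimated R0:", sir_r0(beta_avg, gamma=0.1))
```

Weighting by population size means the averaged rate moves toward the staying rate when few people travel, which is why restricting mobility lowers the estimate in this simplified picture.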
Appendix C Ablation Study
To evaluate the effectiveness of our proposed FlowGNN and RL exploration strategies, we conduct ablation studies in this section.
GNN-Baselines: To evaluate the effectiveness of our proposed FlowGNN, we replace the FlowGNN layer in the actor network and the critic network with either the well-known GraphSageConv layer (Hamilton et al., 2017) or our modified GraphSageConv layer. We name the two baselines GNNMean and GNNSoftmax, respectively.
The layer calculation of GNNMean follows Equation (22):
(22) 
where denotes the connected regions of .
The layer calculation of GNNSoftmax follows Equation (23):
(23) 
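The two aggregation schemes can be illustrated with a minimal NumPy sketch, under assumptions. `mean_aggregate_layer` follows the GraphSAGE-style mean aggregator: average the connected regions' features, concatenate them with the region's own features, then apply a linear map and a nonlinearity. `softmax_aggregate_layer` is only a guess at what a softmax-based modification might look like: it weights neighbours by a softmax over edge weights such as OD flows. The function names, tensor shapes, and the choice of ReLU are all hypothetical and are not taken from Equations (22) and (23).

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def mean_aggregate_layer(h: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """GraphSAGE-style mean aggregation.

    h:      (n, d) node features, one row per region.
    adj:    (n, n) binary adjacency; adj[i, j] = 1 if region j is connected to i.
    weight: (2 * d, d_out) linear map applied to [self, neighbour mean].
    """
    degree = adj.sum(axis=1, keepdims=True)
    # Regions without neighbours receive an all-zero neighbour summary.
    neigh_mean = (adj @ h) / np.maximum(degree, 1.0)
    return relu(np.concatenate([h, neigh_mean], axis=1) @ weight)


def softmax_aggregate_layer(h: np.ndarray, flow: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Assumed variant: neighbours weighted by a softmax over edge weights.

    flow: (n, n) non-negative edge weights (e.g. OD flows); zero entries are
          treated as "no edge" and receive zero weight.
    """
    mask = flow > 0
    scores = np.where(mask, flow, -np.inf)
    row_max = np.max(scores, axis=1, keepdims=True)
    # Rows with no edges have max = -inf; replace it to avoid inf - inf.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    exp = np.where(mask, np.exp(scores - row_max), 0.0)
    attn = exp / np.maximum(exp.sum(axis=1, keepdims=True), 1e-12)
    return relu(np.concatenate([h, attn @ h], axis=1) @ weight)


if __name__ == "__main__":
    h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    adj = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    flow = np.array([[0.0, 5.0, 1.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    w = np.vstack([np.eye(2), np.eye(2)])  # output = relu(self + neighbour summary)
    print(mean_aggregate_layer(h, adj, w))
    print(softmax_aggregate_layer(h, flow, w))
```

In this sketch the mean aggregator sees only which regions are connected, whereas the softmax variant also uses how much flow each edge carries, which mirrors the distinction between the two baselines.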
RL-Baselines: To evaluate the effectiveness of our RL exploration strategies, we remove the pseudo-expert strategy and the avoiding-extreme-points strategy, respectively. We refer to the two baselines as RLNoEP and RLNoThre.
Mean/Max  Total  

No Intervention  27.21/157.55  1069.29  1  0  0 
GNNMean  /         
GNNSoftmax  0.53/1.88  7.79  0.06  26  28 
RLNoEP  0.41/1.45  5.87  0.00  27  27 
RLNoThre  1.43/3.68  86.82  0.75  0  9 
DURLECA  0.60/2.28  19.07  0.76  0  0 
Results and Analysis: As shown in Table 8, without FlowGNN or the proposed RL exploration strategies, the agent fails to learn a good policy.
The failure of GNNMean comes from its inability to learn weighted edge information, i.e., how many people move from one region to another. Even with weighted edge information, GNNSoftmax still fails to retain mobility, as it cannot describe traffic flows and the epidemic transmission over them. These results demonstrate the effectiveness of our proposed FlowGNN.
RLNoEP imposes a long-term lockdown on the whole city, which is a typical local optimum. As for RLNoThre, the agent successfully finds one policy that achieves relatively low hospitalizations and high mobility. However, this solution is worse than that of DURLECA. Moreover, we find that the success of RLNoThre relies heavily on luck. In our five repeated experiments, the agent got stuck in a local optimum four times, imposing a long-term lockdown on the whole city. These failures, as discussed earlier in Section 1, are due to the difficulty of exploration. The agent easily encounters extreme points during exploration, and the extreme points force the agent to adopt conservative policies, i.e., locking the whole city down. In contrast to RLNoEP and RLNoThre, DURLECA is guided by a pseudo-expert and is designed to avoid extreme points. Thus, DURLECA can find much better solutions.