1 Introduction
With the COVID-19 pandemic soaring across the world, a reliable model is needed to describe the observed spread of the disease, make predictions about the future, and guide public policy design to control the spread.
Existing Epidemic Models
There are many existing macroscopic epidemic models.[daley2001epidemic] For example, the SI model describes the growth of the infection rate as the product of the current infection rate and the current susceptible rate. The SIR model further incorporates the effect of recovery, i.e., the infected population turns into an immune population after a certain period of time. The SIRS model considers the case where immunity is not lifelong and the immune population can become susceptible again. In addition to these models, the SEIR model incorporates the incubation period, which refers to the duration before symptoms show up.[yan2006seir] The most important factor in all these models is the regeneration number (also known as the reproduction number), which tells how fast the disease can spread. It can be regressed from data.
Limitations of Existing Models
Although these models are useful in predicting the spread of epidemics, they lack the granularity needed for analyzing individual behaviors during an epidemic and understanding the relationship between individual decisions and the spread of the disease.[barrett2009estimating] For example, many countries have now announced “lockdown”, “shelter-in-place”, “stay-at-home”, or similar orders. However, their effects differ greatly across countries, and even across counties in the same country. One factor that can possibly explain these differences is cultural difference. In different cultures, individuals make different choices. For instance, in the west, people exhibit greater inertia in giving up their working and life routines, so they do not follow the orders strictly, while in the east, people tend to obey the rules better. These different individual choices can result in significantly different outcomes in disease propagation that cannot be captured by a macroscopic model.
A Microscopic Epidemic Model
In this paper, we develop a microscopic epidemic model by explicitly considering individual decisions and the interaction among different individuals in the population, in the framework of multiagent systems. The aforementioned cultural difference can be understood as a difference in agents’ cost functions, which then affect their behaviors when they are trying to minimize their cost functions. The details of the microscopic epidemic model will be explained in the next section, followed by the analysis of the dynamics of the multiagent system, and the prediction of system trajectories using multiagent reinforcement learning. The model is still in its preliminary form. In the discussion section, future directions are pointed out to make the model more realistic.
2 Microscopic Epidemic Model
Suppose there is a finite population of agents in the environment, a subset of which is initially infected. Every agent has an index, its own state, and its own control input. The model is in discrete time, with a time interval of one day. The evolution of the infection rate over consecutive days depends on the agents’ actions. The questions of interest are: How many agents will eventually be infected? How fast will they be infected? How can we slow down the growth of the infection rate?
2.1 Agent Model
We consider two state values for an agent: for agent $i$, the state $x_i = 0$ means healthy (susceptible) and $x_i = 1$ means infected. Every day, every agent decides its level of activities $u_i \in [0, 1]$. The level of activities for agent $i$ can be understood as the expected percentage of other agents in the system that agent $i$ wants to meet. For example, an activity level equal to the reciprocal of the population size means that agent $i$ expects to meet one other agent. The actual number of agents that agent $i$ meets depends not only on agent $i$’s activity level, but also on the other agents’ activity levels. For example, if all other agents choose the activity level $0$, then agent $i$ will not be able to meet any other agent no matter what it chooses. Mathematically, the chance for agent $i$ and agent $j$ to meet each other depends on the minimum of the activity levels of these two agents, i.e., $\min\{u_i, u_j\}$. In the extreme cases, if agent $i$ decides to meet everyone in the system by choosing $u_i = 1$, then the chance for agent $i$ to meet with agent $j$ is $u_j$. If agent $i$ decides to not meet anyone in the system by choosing $u_i = 0$, then the chance for agent $i$ to meet with agent $j$ is $0$.
Before we derive the system dynamic model, the assumptions are listed below. These assumptions can all be relaxed in future work. They are introduced mainly for the simplicity of the discussion.

In the agent model, we only consider two states: healthy (susceptible) and infected. All healthy agents are susceptible to the disease. There is no recovery and no death for infected agents. There is no incubation period for infected agents, i.e., once infected, the agent can start to infect other healthy agents. To relax this assumption, we may introduce more states for every agent.

The interactions among agents are assumed to be uniform, although it is not true in the real world. In the real world, given a fixed activity level, agents are more likely to meet with close families, friends, colleagues than strangers on the street. To incorporate this nonuniformity into the model, we need to redefine the chance for agent and agent to meet each other to be , where is a coefficient that encodes the proximity between agent and agent and will affect the chance for them to meet with each other. For simplicity, we assume that the interaction patterns are uniform in this paper.

Meeting with infected agents will result in immediate infection. To relax this assumption, we may introduce an infection probability to describe how likely it is for a healthy agent to be infected if it meets with an infected agent.
2.2 System Dynamic Model
On day , denote agent ’s state and control as and . By definition, the agent state space is and the agent control space is . The system state space is denoted . The system control space is denoted . Define as the number of infected agents at time . The set of infected agents is denoted:
(1) 
The state transition probability for the multiagent system is a mapping
(2) 
According to the assumptions, an infected agent will always remain infected. Hence the state transition probability for an infected agent does not depend on other agents’ states or any control. However, the state transition probability for a healthy agent depends on others. The chance for a healthy agent $i$ with activity level $u_i$ to not meet an infected agent $j$ with activity level $u_j$ is $1 - \min\{u_i, u_j\}$. A healthy agent can stay healthy if and only if it does not meet any infected agent, the probability of which is the product $\prod_{j} (1 - \min\{u_i, u_j\})$ taken over all infected agents $j$. Then the probability for a healthy agent to be infected is $1 - \prod_{j} (1 - \min\{u_i, u_j\})$. From the product expression, we can infer that the chance for a healthy agent to stay healthy is higher if

the agent limits its own activity by choosing a smaller ;

the number of infected agents is smaller;

the infected agents in limit their activities.
The state transition probability for an agent is summarized in table 1.
Example
Consider a fouragent system shown in section 2.2. Only agent is infected. And the agents choose the following activity levels: . Then the chance for agents and to meet with each other is , , and . Note that . The chance for agents , , and to stay healthy is , although they have different activity levels.
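The stay-healthy probability described above can be sketched in a few lines of Python. This is only an illustration: the function name `stay_healthy_prob` and the activity levels below are hypothetical and are not the values used in the example. With these assumed numbers, the infected agent's activity level is lower than every healthy agent's level, so the minimum is always the infected agent's level and all healthy agents share the same chance to stay healthy.

```python
import numpy as np

def stay_healthy_prob(u_i, infected_levels):
    """Probability that a healthy agent with activity level u_i meets no infected agent."""
    infected_levels = np.asarray(infected_levels, dtype=float)
    # Chance to meet infected agent j is min(u_i, u_j); meetings are treated as independent.
    return float(np.prod(1.0 - np.minimum(u_i, infected_levels)))

# Hypothetical four-agent setup: agent 1 is infected, agents 2-4 are healthy.
infected_levels = [0.1]            # assumed activity level of the infected agent
healthy_levels = [0.2, 0.5, 0.9]   # assumed activity levels of the healthy agents
probs = [stay_healthy_prob(u, infected_levels) for u in healthy_levels]
print(probs)
```

If the infected agent instead chose a level above some healthy agent's level, that healthy agent's own choice would become the binding term in the minimum.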
2.3 Case Study
Before we start to derive the optimal strategies for individual agents and analyze the closed-loop multiagent system, we first characterize the (open-loop) multiagent system dynamics by Monte Carlo simulation according to the state transition probability in table 1.
Suppose we have agents. At the beginning, only agent is infected. We consider two levels of activities: normal activity level and reduced activity level . The two activity levels are assigned to different agents following different strategies as described below. In particular, we consider “no intervention” case where all agents continue to follow the normal activity level, “immediate isolation” case where the activity levels of infected agents immediately drop to the reduced level, “delayed isolation” case where the activity levels of infected agents drop to the reduced level after several days, and “lockdown” case where the activity levels of all agents drop to the reduced level immediately.
For each case, we simulate 200 system trajectories and compute the average, maximum, and minimum (number of infected agents) versus from all trajectories. A system trajectory in the “no intervention” case is illustrated in section 2.3, where for all agents. The trajectories under different cases are shown in Fig. 1, where the solid curves illustrate the average and the shaded area corresponds to the range from min to max . The results are explained below.

Case 0: no intervention.
All agents keep the normal activity level. The scenarios for two different normal activity levels are illustrated in Fig. 1. As expected, a higher activity level for all agents will lead to faster infection. The trajectory of the number of infected agents has an S shape: its growth rate is relatively slow when either the infected population is small or the healthy population is small, and peaks in between. It will be shown in the following discussion that (empirical) macroscopic models also generate S curves.

Case 1: immediate isolation of infected agents.
The activity levels of infected agents immediately drop to the reduced level, while the others remain at the normal level. The scenario is illustrated in Fig. 1. Immediate isolation significantly slows down the growth of the infection rate. As expected, it has the best performance in terms of flattening the curve, same as the lockdown case. The trajectory also has an S shape.

Case 2: delayed isolation of infected agents.
The activity levels of infected agents drop to the reduced level after a delay of several days, while the others remain at the normal level. The scenarios for different delays are illustrated in Fig. 1. As expected, the longer the delay, the faster the infection rate grows, though the growth of the infection rate is still slower than in the “no intervention” case. Moreover, the peak growth rate is higher when the delay is longer.

Case 3: lockdown.
The activity levels of all agents drop to the reduced level. The scenario is illustrated in Fig. 1. As expected, it has the best performance in terms of flattening the curve, same as the immediate isolation case. When the infected population can be asymptomatic or have a long incubation period before showing any symptoms, as observed for COVID-19, immediate identification and isolation of infected persons is not achievable. Then lockdown is the best way to control the spread of the disease in our model.
Since the epidemic model is monotone, every agent will eventually be infected as long as the probability to meet infected agents does not drop to zero. Moreover, we have not discussed decision making by individual agents yet. The activity levels are just predefined in the simulation.
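The open-loop Monte Carlo study above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code behind Fig. 1: the population size, horizon, number of runs, and the activity levels `NORMAL` and `REDUCED` are hypothetical, and the delayed isolation case is omitted because it needs extra bookkeeping of infection times.

```python
import numpy as np

def step(x, u, rng):
    """One day of the model: x is a 0/1 state vector, u holds the activity levels."""
    infected = np.flatnonzero(x == 1)
    x_next = x.copy()
    for i in np.flatnonzero(x == 0):
        # Healthy agent i stays healthy only if it meets no infected agent.
        p_stay = np.prod(1.0 - np.minimum(u[i], u[infected]))
        if rng.random() >= p_stay:
            x_next[i] = 1
    return x_next

def simulate(policy, n_agents=50, n_days=30, seed=0):
    """Return the number of infected agents on each day under a given policy."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_agents, dtype=int)
    x[0] = 1                          # one initially infected agent
    counts = [int(x.sum())]
    for _ in range(n_days):
        x = step(x, policy(x), rng)
        counts.append(int(x.sum()))
    return counts

NORMAL, REDUCED = 0.05, 0.005         # hypothetical activity levels

def no_intervention(x):
    return np.full(x.shape, NORMAL)

def immediate_isolation(x):
    return np.where(x == 1, REDUCED, NORMAL)

def lockdown(x):
    return np.full(x.shape, REDUCED)

for name, policy in [("no intervention", no_intervention),
                     ("immediate isolation", immediate_isolation),
                     ("lockdown", lockdown)]:
    runs = np.array([simulate(policy, seed=s) for s in range(20)])
    print(name, runs.mean(axis=0)[-1])   # average infected count on the last day
```

Because the meeting chance is the minimum of two activity levels, immediate isolation and lockdown yield the same transition probabilities in this sketch, which is consistent with the two cases performing equally well in the discussion above.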
Remark
The model we introduced is microscopic, in the sense that interactions among individual agents are considered. The simulated open-loop trajectories are indeed similar to those from a macroscopic model. Since only susceptible and infected populations are considered in the proposed microscopic model, we compare it with the macroscopic Susceptible-Infected (SI) model. Define the state $x \in [0, 1]$ as the fraction of infected population, so that $1 - x$ is the susceptible fraction. The growth of the infected population is proportional to the susceptible population and the infected population. Suppose the infection coefficient is $\beta$; then the system dynamics in the SI model follow:
(3) $\dot{x} = \beta x (1 - x)$
We simulate the system trajectory under different infection coefficients as shown in eq. 3. The trajectories also have S shapes, similar to the ones in the microscopic model. However, since this macroscopic SI model is deterministic, there is no “uncertainty” range as shown in the microscopic model. The infection coefficient depends on the agents’ choices of activity levels. However, there is not an explicit relationship yet. It is better to directly use the microscopic model to analyze the consequences of individual agents’ choices.
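A minimal sketch of such a macroscopic simulation is given below. It assumes the standard continuous-time SI form, in which the growth of the infected fraction equals the infection coefficient times the infected fraction times the susceptible fraction, and integrates it with forward Euler. The function name `simulate_si`, the initial fraction, the step size, and the coefficients are all hypothetical choices for illustration.

```python
def simulate_si(beta, x0=0.01, dt=0.1, n_steps=1000):
    """Forward-Euler integration of dx/dt = beta * x * (1 - x)."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        # Growth is proportional to both the infected and the susceptible fraction.
        xs.append(x + dt * beta * x * (1.0 - x))
    return xs

for beta in (0.1, 0.2, 0.4):          # hypothetical infection coefficients
    traj = simulate_si(beta)
    print(beta, round(traj[-1], 4))
```

Each trajectory is deterministic and S-shaped: growth is slow near zero and near one, and fastest around the midpoint.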
3 Distributed Optimal Control
This section tries to answer the following question: in the microscopic multiagent epidemic model, what is the best control strategy for individual agents? To answer that, we need to first specify the knowledge and observation models as well as the cost (reward) functions for individual agents. Then we will derive the optimal choices of agents in a distributed manner. The resulting system dynamics correspond to a Nash Equilibrium of the system.
3.1 Knowledge and Observation Model
A knowledge and observation model for agent includes two aspects: what does agent know about itself, and what does agent know about others? The knowledge about any agent includes the dynamic function of agent and the cost function of agent . The observation corresponds to runtime measurements, i.e., the observation of any agent includes the runtime state and the runtime control . In the following discussion, regarding the knowledge and observation model, we make the following assumptions:

An agent knows its own dynamics and cost function;

All agents are homogeneous in the sense that they share the same dynamics and cost functions. Agents also know that all agents are homogeneous, hence they know others’ dynamics and cost functions. (Not knowing other agents’ dynamics or cost functions would result in information asymmetry, which creates difficulty in the analysis. Nonetheless, the assumption can be relaxed in the future.)

At time , agents can measure for all . But they cannot measure until time . Hence, the agents are playing a simultaneous game. They need to infer others’ decisions when making their own decisions at any time .
3.2 Cost Function
We consider two conflicting interests for every agent, listed below. The identification of these two conflicting interests is purely empirical; to build realistic cost functions, we need to either study real-world data or conduct human subject experiments.

Limit the activity level to minimize the chance to get infected;

Maintain a certain activity level for living.
We define the runtime cost for agent at time as
(4) 
where the first term corresponds to the first interest, the second term corresponds to the second interest, and a weighting coefficient adjusts the preference between the two interests. The function of the activity level in the second term is assumed to be smooth. It can be a decreasing function of the activity level, meaning that the higher the activity level, the better. It can also be a convex parabolic function of the activity level with the minimum attained at some preferred level, meaning that the activity level should be maintained around that level. Due to our homogeneity assumption on agents, they should have identical preferences, i.e., the same weighting coefficient for all agents.
Agent chooses its action at time by minimizing the expected cumulative cost in the future:
(5) 
where is a discount factor. The objective function depends on all agents’ current and future actions. It is difficult to directly obtain an analytical solution of (5). Later we will use multiagent reinforcement learning to obtain a numerical solution.
In this section, to simplify the problem, we consider a single stage game where the agents place no weight on the future, i.e., the discount factor is zero. The formulation (5), by contrast, corresponds to a repeated game. Repeated games capture the idea that an agent will have to take into account the impact of its current action on the future actions of others. This impact is called the agent’s reputation. The interaction is more complex in a repeated game than in a single stage game. With a zero discount factor, the objective function is reduced to
(6) 
which only depends on the current actions of agents. According to the state transition probability in table 1, the expected cost is
(7) 
3.3 Nash Equilibrium
According to (7), the expected cost for an infected agent only depends on its own action. Hence an infected agent simply chooses the activity level that minimizes its own cost, independently of the others. Then the optimal choice for a healthy agent satisfies:
(8)  
(9) 
Note that the term is positive and is increasing for and then constant for . Hence, the optimal solution for (9) should be smaller than .If , then (9) becomes , whose optimal solution is with cost . If , then (9) becomes where . Since , the optimal solution satisfies that with cost . Note that equals to the smallest cost for the case . Hence the optimal solution for (9) satisfies that . Then the objective in (9) can be simplified as . In summary, the optimal actions for both the infected and the healthy agents in the Nash Equilibrium can be compactly written as
(10) 
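The healthy agent's single-stage problem can be explored numerically with a simple grid search. The sketch below rests on assumptions that are not taken from the text: the expected cost is taken to be the infection probability plus a weighted activity cost, the activity cost `activity_cost` is a hypothetical decreasing quadratic, the weight `alpha` is arbitrary, and the infected agents are assumed to choose the level that minimizes that activity cost (here, level 1).

```python
import numpy as np

def activity_cost(u):
    """Hypothetical smooth, decreasing activity cost on [0, 1]."""
    return (1.0 - u) ** 2

def expected_cost_healthy(u, infected_levels, alpha):
    """Assumed expected single-stage cost: infection probability plus weighted activity cost."""
    levels = np.asarray(infected_levels, dtype=float)
    p_stay = np.prod(1.0 - np.minimum(u, levels))
    return (1.0 - p_stay) + alpha * activity_cost(u)

def best_response(infected_levels, alpha, n_grid=1001):
    """Grid search for the healthy agent's cost-minimizing activity level."""
    grid = np.linspace(0.0, 1.0, n_grid)
    costs = [expected_cost_healthy(u, infected_levels, alpha) for u in grid]
    return float(grid[int(np.argmin(costs))])

# One infected agent at activity level 1 (assumed), two hypothetical weights.
print(best_response([1.0], alpha=0.25))
print(best_response([1.0], alpha=2.0))
```

Under these assumptions, a small weight on the activity cost drives the healthy agent to zero activity, while a large weight yields an interior level that stays below the infected agents' level.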
Example
Consider the previous example with four agents shown in section 2.2. Define
(11) 
which is a monotonically decreasing function as illustrated in section 3.3. Then the optimal actions in the Nash Equilibrium for this specific problem satisfy:
(12) 
Solving for (12), for infected agents, . For healthy agents, the choice also depends on as illustrated in eq. 12. We have assumed that which is identical for all agents. We further assume that such that the optimal solution for healthy agents should be . The optimal actions and the corresponding costs for all agents are listed in table 2. In the Nash Equilibrium, no agent will meet each other, since all agents except agent reduce their activity levels to zero. The actual cost (received at the next time step) equals to the expected cost (computed at the current time step).
Agent ID  State  Optimal action  Optimal expected cost  Actual cost

1  1  1  1  1 
2,3,4  0  0  
Total 
However, let us consider another situation where the infected agent chooses activity level 0 and all healthy agents choose activity level 1. The resulting costs are summarized in table 3. Obviously, the overall cost is reduced in the new situation. However, this better situation cannot be attained spontaneously by the agents, due to the externality of the system, which will be explained below.
Agent ID  State  Optimal action  Optimal expected cost  Actual cost

1  1  0  1+  1+ 
2,3,4  0  1  0  0 
Total 
3.4 Dealing with Externality
For a multiagent system, define the system cost as a summation of the individual costs:
(13) 
The system cost in the Nash Equilibrium is denoted , which corresponds to the evaluation of under agent actions specified in (10). On the other hand, the optimal system cost is defined as
(14) 
The optimization problem (14) is solved in a centralized manner, which is different from how the Nash Equilibrium is obtained. To obtain the Nash Equilibrium, all agents solve their own optimization problems independently. Although their objective functions depend on other agents’ actions, they do not jointly make the decisions, but only “infer” what others will do. By definition, the optimal system cost is no greater than the system cost in the Nash Equilibrium. In the example above, the former is the total cost in table 3 and the latter is the total cost in table 2. The difference between the two is called the loss of social welfare. In the epidemic model, the loss of social welfare is due to the fact that bad consequences (i.e., infecting others) are not penalized in the cost functions of the infected agents. Those unpenalized consequences are called externality. There can be both positive externality and negative externality. Under positive externality, agents lack the motivation to do things that are good for the society. Under negative externality, agents lack the motivation to prevent things that are bad for the society. In the epidemic model, there is negative externality associated with infected agents.
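The gap between the two joint actions in the four-agent example can be illustrated numerically. The sketch below is not the paper's computation: it assumes that each agent's expected single-stage cost is its infection probability plus a weighted activity cost, with a hypothetical quadratic `activity_cost` and a hypothetical weight `ALPHA`. It compares the joint action where only the infected agent is active with the one where only the healthy agents are active.

```python
import numpy as np

ALPHA = 0.25                          # hypothetical preference weight

def activity_cost(u):
    return (1.0 - u) ** 2             # hypothetical decreasing activity cost

def system_cost(x, u, alpha=ALPHA):
    """Sum of the assumed expected single-stage costs over all agents."""
    x, u = np.asarray(x), np.asarray(u, dtype=float)
    infected_levels = u[x == 1]
    total = 0.0
    for xi, ui in zip(x, u):
        if xi == 1:
            p_infected = 1.0          # infected agents stay infected
        else:
            p_infected = 1.0 - np.prod(1.0 - np.minimum(ui, infected_levels))
        total += p_infected + alpha * activity_cost(ui)
    return float(total)

x = [1, 0, 0, 0]                              # agent 1 infected, agents 2-4 healthy
nash = system_cost(x, [1, 0, 0, 0])           # infected active, healthy inactive
alternative = system_cost(x, [0, 1, 1, 1])    # infected inactive, healthy active
print(nash, alternative, nash - alternative)
```

With these assumed numbers the alternative joint action has the lower total cost, and the printed difference plays the role of the loss of social welfare.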
To improve social welfare, we need to “internalize” externality, i.e., add penalty for “spreading” the disease. Now let us redefine agent ’s runtime cost as
(15) 
where is a monotonically increasing function. The last term does not affect healthy agents since , but adds a penalty for infected agents if they choose large activity level. One candidate function for is . In the real world, such “cost shaping” using can be achieved through social norms or government regulation. The expected cost becomes
(16) 
Suppose the function is well tuned such that the . Then although the expected costs for infected agents are still independent from others, their decision is considerate to healthy agents. When the infected agents choose , then for healthy agents, the expected cost becomes , meaning that they do not need to worry about getting infected. Let us now compute the resulting Nash Equilibrium under the shaped costs using the previous example.
Example
In the fouragent example, set . Then . Hence agent 1 will choose . For agents , they will choose since they are only minimizing . The resulting costs are summarized in table 4. With the shaped costs, the system enters into a better Nash Equilibrium which indeed aligns with the system optimum in (14). A few remarks:

Cost shaping did not increase the overall cost for the multiagent system.

The system optimum remains the same before and after cost shaping.

Cost shaping helped agents to arrive at the system optimum without centralized optimization.
Agent ID  State  Optimal action  Optimal expected cost  Actual cost

1  1  0  1+  1+ 
2,3,4  0  1  0  0 
Total 
4 MultiAgent Reinforcement Learning
We have shown how to compute the Nash Equilibrium of the multiagent epidemic model in a single stage. However, it is analytically intractable to compute the Nash Equilibrium when we consider repeated games (5). The complexity further grows when the number of agents increases and when there is information asymmetry. Nonetheless, we can apply multiagent reinforcement learning[bucsoniu2010multi] to numerically compute the Nash Equilibrium. Then the evolution of the pandemic can be predicted by simulating the system under the Nash Equilibrium.
4.1 Q Learning
As evident from (10), the optimal action for agent at time is a function of and . Hence we can define a Q function (action value function) for agent as
(17) 
According to the assumptions made in the observation model, all agents can observe the number of infected agents at every time step. For a single stage game, the Q function follows directly from the derivation leading to (10). For repeated games (5), we can learn the Q function using temporal difference learning. At every time step, each agent chooses its action as
(18) 
After taking the action , agent observes and and receives the cost at time . Then agent updates its Q function:
(19)  
(20) 
where is the learning gain and is the temporal difference error.
All agents can run the above algorithm to learn their functions during the interaction with others. However, the algorithm introduced above has several problems:

Exploration and limited rationality.
There is no exploration in (18). Indeed, Q-learning is usually applied together with $\epsilon$-greedy, where with probability $1 - \epsilon$ the action is chosen to be the optimal action in (18), and with probability $\epsilon$ the action is randomly chosen with a uniform distribution over the action space. The $\epsilon$-greedy approach is introduced mainly from an algorithmic perspective to improve convergence of the learning process. When applied to the epidemic model, it has a unique societal implication. When agents are randomly choosing their behaviors, it represents the fact that agents have only limited rationality. Hence in the learning process, we apply $\epsilon$-greedy as a way to incorporate exploration for faster convergence as well as to take into account limited rationality of agents.
Data efficiency and parameter sharing.
Keeping separate Q functions for individual agents is not data efficient. An agent may not be able to collect enough samples to properly learn the desired Q function. Due to the homogeneity assumptions we made earlier about agents’ cost functions, it is more data efficient to share the Q function among all agents. Its societal implication is that agents are sharing information and knowledge with each other. Hence, we apply parameter sharing[gupta2017cooperative] as a way to improve data efficiency as well as to consider information sharing among agents during the learning process. In a more complex situation where agents are not homogeneous, it is desirable to share parameters within a smaller group of agents, instead of sharing parameters with all agents.
With the above modifications, the multiagent Q learning algorithm[hu2003nash] is summarized below.

For every time step , agents choose their actions as:
(21) 
At the next time step , agents observe the new states and receive rewards for all . Then the Q function is updated:
(22) (23)
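The algorithm above can be sketched as tabular Q-learning with exploration and a single shared table. The code below is an illustrative sketch, not the authors' implementation. It assumes a runtime cost equal to the next state plus a weighted activity cost, a hypothetical quadratic `activity_cost`, a shared table indexed by an agent's own state, the number of infected agents, and the action, and arbitrary values for the weight, discount factor, learning gain, and exploration schedule. Since agents minimize cost, the greedy action is an argmin.

```python
import numpy as np

LEVELS = np.array([0.0, 0.5, 1.0])    # hypothetical low, medium, high activity levels
ALPHA, GAMMA, ETA = 0.25, 0.5, 0.1    # hypothetical weight, discount, learning gain

def activity_cost(u):
    return (1.0 - u) ** 2             # hypothetical activity cost

def train(n_agents=10, n_days=10, n_episodes=200, seed=0):
    rng = np.random.default_rng(seed)
    # One table shared by all agents: (own state, number infected, action).
    Q = np.zeros((2, n_agents + 1, len(LEVELS)))
    for episode in range(n_episodes):
        eps = 1.0 - episode / n_episodes          # decaying exploration rate
        x = np.zeros(n_agents, dtype=int)
        x[0] = 1                                  # one initially infected agent
        for _ in range(n_days):
            m = int(x.sum())
            # Epsilon-greedy: random action with probability eps, else argmin of Q.
            greedy_a = np.argmin(Q[x, m], axis=1)
            random_a = rng.integers(len(LEVELS), size=n_agents)
            a = np.where(rng.random(n_agents) < eps, random_a, greedy_a)
            u = LEVELS[a]
            # Sample next states: infected agents stay infected.
            x_next = x.copy()
            for i in np.flatnonzero(x == 0):
                p_stay = np.prod(1.0 - np.minimum(u[i], u[x == 1]))
                x_next[i] = int(rng.random() >= p_stay)
            m_next = int(x_next.sum())
            cost = x_next + ALPHA * activity_cost(u)
            # Temporal difference update of the shared table, one agent at a time.
            for i in range(n_agents):
                target = cost[i] + GAMMA * Q[x_next[i], m_next].min()
                Q[x[i], m, a[i]] += ETA * (target - Q[x[i], m, a[i]])
            x = x_next
    return Q

Q = train()
print(Q[1, 1])   # Q values of an infected agent when one agent is infected
```

Sharing one table means every agent's experience updates the same entries, which is the data-efficiency argument made above. Entries for situations that never occur, such as zero infected agents, keep their initial value.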
Example
In this example, we consider agents in the system. Only one agent is infected in the beginning. The runtime cost is the same as in the example in the distributed optimal control section, i.e., where is chosen to be . For simplicity, the action space is discretized to be , called as low, medium, and high. Hence the Q function can be stored as a matrix. In the learning algorithm, the learning rate is set to . The exploration rate is set to decay in different episodes, i.e., where denotes the current episode and the maximum episode is . The Q function is initialized to be for all entries. Three different cases are considered. For each case, we illustrate the Q function learned after 200 episodes as well as the system trajectories for episodes , blue for earlier episodes and red for later episodes. The results are shown in Fig. 2.

Case 1: discount with runtime cost .
With the discount factor set to zero, this case reduces to a single stage game as discussed in the distributed optimal control section. The result should align with the analytical Nash Equilibrium in (10). As shown in the left plot in Fig. 2(a), the optimal action for a healthy agent is always low (solid green), while the optimal action for an infected agent is always high (dashed magenta). The Q values for infected agents do not depend on the number of infected agents. The Q values for healthy agents increase with the number of infected agents if the activity level is not zero, because, for a fixed activity level, the chance to get infected is higher when there are more infected agents in the system. All these results align with our previous theoretical analysis. Moreover, as shown in the right plot in Fig. 2(a), the agents are learning to flatten the curve across different episodes.

Case 2: discount with runtime cost .
Since the agents are now computing cumulative costs as in (5), the corresponding Q values are higher than those in case 1. However, the optimal actions remain the same: low (solid green) for healthy agents and high (dashed magenta) for infected agents, as shown in the left plot in Fig. 2(b). The trends of the Q curves also remain the same: the Q values do not depend on the number of infected agents for infected agents and for healthy agents whose activity levels are zero. However, as shown in the right plot in Fig. 2(b), the agents learned to flatten the curve faster than in case 1, mainly because healthy agents are more cautious (converge faster to low activity levels) when they start to consider cumulative costs.

Case 3: discount with shaped runtime cost in (15).
The shaped cost changes the optimal actions for all agents as well as the resulting Q values. As shown in the left plot in Fig. 2(c), the optimal action for an infected agent is low (dashed green), while that for a healthy agent is high (solid magenta) when the number of infected agents is small and low (solid green) when the number of infected agents is large. Note that when the number of infected agents is high, the healthy agents still prefer a low activity level, even though the optimal actions for infected agents are low. That is because, due to the randomization introduced in $\epsilon$-greedy, there is still a chance for infected agents to have medium or high activity levels. When the number of infected agents is high, the healthy agents would rather limit their own activity levels to avoid the risk of meeting infected agents that are taking random actions. This result captures the fact that agents understand others may have limited rationality and prefer more conservative behaviors. We observe the same trends for the Q curves as in the previous two cases: the Q values do not depend on the number of infected agents for infected agents and for healthy agents whose activity levels are zero. In terms of absolute values, the Q values for infected agents are higher than those in case 2 due to the additional penalty term in the shaped cost. The Q values for healthy agents are smaller than those in case 2 for medium and high activity levels, since the chance to get infected is smaller as infected agents now prefer low activity levels. The Q values remain the same for healthy agents with zero activity levels. With shaped costs, the agents learned to flatten the curve even faster than in case 2, as shown in the right plot in Fig. 2(c), since the shaped cost encourages infected agents to lower their activity levels.



5 Discussion and Future Work
Agents vs humans
The epidemic model can be used to analyze real-world societal problems. Nonetheless, it is important to understand the differences between agents and humans. We can directly design and shape the cost function for agents, but not for humans. For agents, their behavior is predictable once we fully specify the problem (i.e., cost, dynamics, measurement, etc.). Hence we can optimize the design (i.e., the cost function) to get the desired system trajectory. For humans, their behavior is not fully predictable due to limited rationality. We need to constantly modify the knowledge and observation model as well as the cost function to match true human behavior.
Future work
The proposed model is in its preliminary form. Many future directions can be pursued.

Relaxation of assumptions.
We may add more agent states to consider recovery, incubation period, and death. We may consider the fact that the interaction patterns among agents are not uniform. We may consider a wide variety of agents who are not homogeneous. For example, health providers and equipment suppliers are key parts in fighting the disease. They should receive lower cost (higher reward) for maintaining or even expanding their activity levels than ordinary people. Their services can then lead to higher recovery rate. In addition, we may relax the assumptions on agents’ knowledge and observation models, to consider information asymmetry as well as partial observation. For example, agents cannot get immediate measurement whether they are infected or not, or how many agents are infected in the system.

Realistic cost functions for agents.
The cost functions for agents are currently hand-tuned. We may learn those cost functions from data through inverse reinforcement learning. Those cost functions can vary for agents from different countries, different age groups, and different occupations. Moreover, the cost functions carry important cultural, demographic, economic, and political information. A realistic cost function can help us understand why we observe significantly different outcomes of the pandemic around the world, as well as enable more realistic predictions into the future by fully considering those cultural, demographic, economic, and political factors.

Incorporation of public policies.
For now, the only external intervention we introduced is cost shaping. We may consider a wider range of public policies that can change the closed-loop system dynamics. Examples include shutdown of transportation, isolation of infected agents, contact tracing, and antibody testing.

Transient vs steady state system behaviors.
We have focused on the steady state system behaviors in the Nash Equilibrium. However, as agents live in a highly dynamic world, it is not guaranteed that a Nash Equilibrium can always be attained. While agents are learning to deal with unforeseen situations, there are many interesting transient dynamics, some of which are captured in Fig. 2, e.g., agents may learn to flatten the curve at different rates. Methods to understand and predict transient dynamics may be developed in the future.

Validation against real world historical data.
To use the proposed model for prediction in the real world, we need to validate its fidelity against historical data. The validation can be performed on the trajectories, i.e., for the same initial condition, the predicted trajectories should align with the ground truth trajectories.
6 Conclusion
This paper introduced a microscopic multiagent epidemic model, which explicitly considered the consequences of individuals’ decisions on the spread of the disease. In the model, every agent can choose its activity level to minimize its cost function consisting of two conflicting components: staying healthy by limiting activities and maintaining high activity levels for living. We solved for the optimal decisions for individual agents in the framework of game theory and multiagent reinforcement learning. Given the optimal decisions of all agents, we can make predictions about the spread of the disease. The system had negative externality in the sense that infected agents did not have enough incentives to protect others, which then required external interventions such as cost shaping. Future directions were pointed out to make the model more realistic.