I Introduction
The outbreak of the novel coronavirus (COVID-19) is unfolding as a major international crisis whose influence extends to every aspect of our daily lives. Testing is critical for identifying patients and carriers. Effective testing allows infected individuals to be quarantined, thus reducing the spread of COVID-19, saving countless lives, and helping to restart the economy safely and securely [1, 2, 3, 4]. Testing capacity will remain a constraint for the foreseeable future. This means that we need to develop highly efficient testing strategies that make optimal use of our testing resources in order to minimize the number of infected individuals.
These testing strategies can be greatly aided by contact tracing, which provides health care providers with information about the whereabouts of infected patients in order to determine whom to test. There have been significant efforts to improve contact tracing by developing apps that leverage the ubiquity of smart phones to automatically detect contacts between individuals within a predetermined distance of each other (e.g., within 6 feet), the time duration of the contact, etc. [5, 6]. Countries that have been more successful in corralling the virus typically use a test, treat, trace, test strategy that begins with testing individuals with symptoms, traces contacts of positively tested individuals via a combination of patient memory, apps, WiFi, GPS, etc., followed by testing their contacts, and repeating this procedure. The problem is that such strategies are myopic and greedy, and do not efficiently use the testing resources. This is especially the case with COVID-19, where symptoms may show up several days after the infection, or not at all; there is evidence to suggest that many COVID-19 carriers are asymptomatic, but may spread the virus [7]. Such greedy strategies, often referred to as "exploitation" rules in learning theory, miss population areas where the virus may be dormant and flare up in the future.
In this paper, we show that the testing problem can be formally cast as a sequential learningbased resource allocation problem with constraints, where the input to the problem is provided by a timevarying social contact graph obtained through various contact tracing tools. Our goal is to develop efficient learning strategies that appropriately balance exploitation (testing high confidence individuals) as well as exploration (testing lower confidence individuals to identify potential unexplored areas, e.g., using group testing) to minimize the number of infected individuals. We will investigate fundamental performance bounds, and ensure that our solution is robust to errors in the input graph as well as in the tests themselves.
II A Partially Observable Markov Decision Process Model with Contact Graph
We formulate the problem of sequential testing for COVID-19 as a Partially Observable Markov Decision Process (POMDP). The system of interest consists of $N$ individuals and evolves in discrete time $t = 0, 1, 2, \ldots$. Let $X_i(t)$ denote the "hidden" state of individual $i$ at time $t$, where $X_i(t) = 0$ means that $i$ is free of disease at $t$ and $X_i(t) = 1$ indicates that $i$ is infected. We use the vector $X(t) = (X_1(t), \ldots, X_N(t))$ to represent the state of the entire system. The state space of the network is $\{0,1\}^N$. Note that the state vector is never fully revealed to the learner, so the setup involves a partially observable MDP (POMDP), which is nontrivial to solve in the general case.

Test and Quarantine: At each time $t$, the learner has a unit budget to choose an individual in order to "sample" (test for infection). Sampling individual $i$ at time $t$ reveals the state $X_i(t)$. We let $u(t)$ denote the sampling decision at time $t$. In case no one is sampled at $t$, we let $u(t) = \emptyset$. The observation at $t$ is denoted $Y(t)$, and is given by $Y(t) = X_{u(t)}(t)$ when $u(t) \neq \emptyset$. Note that if $u(t)$ is $\emptyset$, we assume $Y(t)$ to be deterministic, and hence it reveals no information. If sampled individuals are found to be infected, then they are "quarantined," and hence cannot spread the disease to their neighbors. We let $Q(t)$ denote the set of individuals quarantined until time $t$.
Contact Graph: COVID-19 is spread by social contacts. We model the social contacts as a time-varying, weighted, and undirected graph $G(t)$ over a fixed node set $V$, which denotes the individuals, i.e., $|V| = N$. Edges in the graph correspond to social contacts, and weights measure the extent of social contact (e.g., contact duration, contact distance, number of contacts, etc.). The social contact graph could be obtained from a combination of mobile apps, GPS/WiFi data, patient memory, etc. Note that the length of each time slot in this testing system could be as small as seconds or minutes. Hence the graph could be highly dynamic or piecewise static, depending on the data-updating frequency of the mobile app.
Active Edge: In order to provide a unified framework for different sources of the contact graph, and to simplify exposition, we assume that only a single edge in the graph is active at any time $t$, denoted $e(t)$. Let $F(t)$ be the set of individuals that are not quarantined at $t$ (and hence "free"), and let $G_F(t)$ be the vertex-induced subgraph of $G(t)$ on $F(t)$. At each time $t$, the active edge $e(t)$ is sampled from the social contact subgraph $G_F(t)$ according to the edge weights. This means that the number of active social contacts is reduced as more and more confirmed cases are quarantined. We assume that $e(t)$ is revealed to the learner. Given $e(t) = (i, j)$, individuals $i$ and $j$ "share" the disease with probability $p$, i.e., both of them become infected at time $t+1$ with probability $p$ if either one of them was infected at time $t$. We will assume that the infection transmission probability $p$ is known. The case when $p$ is unknown, and needs to be learnt, is considered separately.

State Transition: Let us now look at the controlled transition probabilities of the controlled Markov process $X(t)$. We first introduce some notation. For states $s, s' \in \{0,1\}^N$, define
$$\chi(s, s') := \mathbb{1}\left\{ s \le s' \text{ and } \|s' - s\|_1 = 1 \right\}, \qquad (1)$$
and
$$m(s, s') := \text{the unique index } i \text{ such that } s_i \ne s'_i \quad (\text{defined when } \chi(s, s') = 1). \qquad (2)$$
Clearly, $\chi(s, s')$ assumes the value $1$ only if $s$ and $s'$ differ in a single position. Since in our model we explicitly assume that the disease can spread to only one more person between two consecutive times, this function is $0$ if $s$ cannot evolve to $s'$ in a single time step. $m(s, s')$ provides us the node that "transitioned" to the diseased state when the system evolved in a unit step from $s$ to $s'$. Thus, the single-step controlled transition probability associated with the process can be written as follows:
$$\mathbb{P}\left( X(t+1) = s' \,\middle|\, X(t) = s,\, e(t) = (i, j) \right) = \begin{cases} p & \text{if } \chi(s, s') = 1,\ m(s, s') \in \{i, j\}, \text{ and the other endpoint is infected in } s, \\ 1 - p & \text{if } s' = s \text{ and exactly one of } i, j \text{ is infected in } s, \\ 1 & \text{if } s' = s \text{ and } i, j \text{ are both infected or both healthy in } s, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$
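To make the transition dynamics above concrete, the following sketch simulates one time slot of the contact process. This is an illustrative implementation under the stated assumptions; the function name `step` and its argument conventions are ours, not the paper's.

```python
import random

def step(state, quarantined, edge, p, rng=random):
    """One slot of the (illustrative) contact process: the active edge
    (i, j) spreads infection with probability p when exactly one
    endpoint is infected and neither endpoint is quarantined."""
    i, j = edge
    state = list(state)
    if i not in quarantined and j not in quarantined:
        if state[i] != state[j] and rng.random() < p:
            state[i] = state[j] = 1  # the susceptible endpoint becomes infected
    return tuple(state)
```

With `p = 1.0` an infected endpoint always transmits, while quarantining either endpoint blocks the edge, matching the restriction to the vertex-induced subgraph of free individuals.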
Objective: Let $\mathcal{F}_t$ be the observation history of the learner until time $t$. A policy $\pi$ makes a sampling decision at $t$ on the basis of $\mathcal{F}_t$, i.e., $u(t) = \pi(\mathcal{F}_t)$. Our goal is to find a policy $\pi$ that solves the following problem:
$$\min_{\pi} \ \mathbb{E}\left[ \sum_{t=1}^{T} \| X(t) \|_1 \right] \qquad (4)$$
$$\text{s.t.} \quad \sum_{t=1}^{T} \mathbb{1}\{ u(t) \neq \emptyset \} \le C, \qquad (5)$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm and $C$ is the total testing capacity. The instantaneous cost $\|X(t)\|_1$ encourages the policy to keep the total number of infected individuals as low as possible, as early as possible. The capacity constraint (5) is crucial because not many testing kits are available during epidemics. An alternative, somewhat equivalent and simpler objective is to remove the capacity constraint altogether and include a cost $\lambda$ for using testing kits:
$$\min_{\pi} \ \mathbb{E}\left[ \sum_{t=1}^{T} \left( \| X(t) \|_1 + \lambda \, \mathbb{1}\{ u(t) \neq \emptyset \} \right) \right], \qquad (6)$$
where $\lambda > 0$. In the remaining discussion, we restrict ourselves to (6).
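The objective (6) for a given candidate policy can be estimated by straightforward Monte Carlo simulation. The sketch below assumes a caller-supplied `simulate` transition function and a per-test price `lam`; both names are illustrative, not from the paper.

```python
import random

def expected_cost(policy, simulate, initial, horizon, lam=1.0,
                  n_runs=500, rng=random):
    """Monte Carlo estimate of objective (6): cumulative number of
    infected individuals plus lam for every testing kit used."""
    total = 0.0
    for _ in range(n_runs):
        s = initial
        for t in range(horizon):
            u = policy(s, t)                 # None means "do not test"
            s = simulate(s, u, rng)
            total += sum(s) + (lam if u is not None else 0.0)
    return total / n_runs
```

Such an estimator is useful for comparing the heuristic policies developed later without solving the dynamic program exactly.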
Remark 1
A natural but incorrect objective is to find a $\pi$ that maximizes the number of infections detected, i.e.,
$$\max_{\pi} \ \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}\{ Y(t) = 1 \} \right]. \qquad (7)$$
However, we highlight the following issue with the formulation (7): the policy/algorithm is rewarded for catching as many infections as possible. We also note that the policy affects the evolution of the global state $X(t)$. It does so by controlling the links indirectly, by quarantining those individuals whose tests turn out to be positive (recall that an infected person is quarantined, and is then not allowed to form links with any other person in the network). Hence, the objective (7) encourages the development of a policy that infects as many people as possible (so that it can, at later stages, catch these cases). This dual effect of control [8] is clearly not desirable.
Belief State MDP Formulation: We now introduce a belief state, which is the posterior distribution of $X(t)$ over the state space $\{0,1\}^N$. This allows us to transform the POMDP into a continuous-state MDP that involves the evolution of the belief state. We denote the belief state by $b_t$, where $b_t(s) = \mathbb{P}(X(t) = s \mid \mathcal{F}_t)$ denotes the conditional probability that the system state equals $s$. The belief can be computed recursively by utilizing Bayes' rule:
$$b_{t+1}(s') = \frac{ \mathbb{P}\left( Y(t+1) \mid X(t+1) = s', u(t+1) \right) \sum_{s} \mathbb{P}\left( s' \mid s, e(t) \right) b_t(s) }{ \sum_{s''} \mathbb{P}\left( Y(t+1) \mid X(t+1) = s'', u(t+1) \right) \sum_{s} \mathbb{P}\left( s'' \mid s, e(t) \right) b_t(s) }, \qquad (8)$$
where the state transition probabilities $\mathbb{P}(s' \mid s, e(t))$ are as discussed in (3).
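For small populations the update (8) can be carried out by enumerating all $2^N$ states. The sketch below is a generic prediction/correction filter under our assumptions; `trans_prob` stands in for the kernel (3), and the dict-of-tuples belief representation is an illustrative choice.

```python
def update_belief(belief, trans_prob, tested, outcome):
    """Bayes update (prediction, then correction) of a belief over
    {0,1}^N, stored as a dict mapping state tuples to probabilities.
    trans_prob(s, s2) is the one-step kernel for the current active
    edge; `tested` is the sampled individual (or None)."""
    states = list(belief)
    # prediction: push the belief through the transition kernel
    pred = {s2: sum(trans_prob(s, s2) * belief[s] for s in states)
            for s2 in states}
    if tested is None:
        return pred  # no test this slot, so no correction step
    # correction: zero out states inconsistent with the revealed bit
    post = {s: (q if s[tested] == outcome else 0.0)
            for s, q in pred.items()}
    z = sum(post.values())
    return {s: v / z for s, v in post.items()}
```

The enumeration makes the exponential blow-up in $N$ explicit, which motivates the approximation schemes that follow.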
Optimal Policy: The sampling policy that is optimal for the problem (4)-(5) can be obtained by solving the following set of nonlinear dynamic programming equations [9]:
$$V_t(b) = \min_{u \in \{1, \ldots, N\} \cup \{\emptyset\}} \left\{ c(b) + \mathbb{E}_{Y}\left[ V_{t+1}\left( \Phi(b, u, Y) \right) \right] \right\}, \qquad (9)$$
$$V_T(b) = c(b), \qquad (10)$$
where $b$ denotes a representative belief state in the simplex of distributions on $\{0,1\}^N$, $c(b)$ is the expected instantaneous cost under $b$, $\Phi(b, u, Y)$ is the belief update (8), and the function $V_t$ denotes the value function at time $t$. The optimal sampling action at time $t$ in belief state $b$ corresponds to the minimizer of the r.h.s. in the above equation. Equations (9), (10) are computationally intractable since the belief simplex has dimension $2^N - 1$. Thus, we propose tractable, provably approximate solutions next.
II-A Provably Suboptimal Value Iteration Approximation
We describe an approximation method with low computational complexity for the POMDP (6). Despite (6) being a continuous-state MDP, it has a finite-dimensional characterization [10, 11]. This characterization is exploited in [12, 13, 14] in order to develop approximate solutions that are computationally tractable. Among these approaches, [14] provides upper and lower bounds for the proposed approximation scheme, and hence also has theoretical guarantees. The following result is taken from [14].
Theorem 1
Consider the Bellman equations (9), (10), the associated value functions $V_t$, and the optimal policy. They have the following finite-dimensional characterization.

$V_t$ is piecewise-linear and concave with respect to the belief $b$. Thus, $V_t(b) = \min_{\alpha \in A_t} \langle \alpha, b \rangle$ for any belief $b$, where $A_t$ is a finite set of $2^N$-dimensional vectors.

The optimal policy has the following finite-dimensional characterization: the belief space can be partitioned into at most $|A_t|$ convex polytopes. In each such polytope, the optimal policy is a constant corresponding to a single action.
Since the sets $A_t$ can be quite large, we can reduce the computational cost by cleverly choosing "approximation sets" $\hat{A}_t$ having small cardinalities. The resulting "approximate value function" would then yield an approximately optimal policy. This is the basis of the approximation scheme of [14] that is stated below, which gives an upper bound on the true value functions $V_t$.

Initialize: $\hat{A}_T = \{c_T\}$, where $c_T$ is the terminal cost vector.

Step 1. Given a set of vectors $A_{t+1}$, construct the set $\hat{A}_{t+1}$ by pruning as follows: pick belief states in the belief simplex (any homotopy algorithm for solving equations without special structure, and which uses Freudenthal triangulation, can be used for this step; interested readers can see [15] for the technical terms), and retain only those vectors of $A_{t+1}$ that attain the minimum at some chosen belief state.

Step 2. With $\hat{A}_{t+1}$, obtain $\hat{A}_t$ by using any standard POMDP algorithm.

Step 3. Set $t \leftarrow t - 1$ and go to Step 1.
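The pruning of Step 1 can be sketched as follows: at each sampled belief, keep only the alpha vector attaining the pointwise minimum there. Dropping vectors can only raise a pointwise minimum, which is why the pruned set induces an upper bound on the true value function. Names and the list-of-tuples representation are illustrative.

```python
def prune_alphas(alphas, sample_beliefs):
    """Lovejoy-style pruning: retain, for each sampled belief b, only
    the alpha vector minimizing <alpha, b> (cost-minimization
    convention). The surviving set upper-bounds the original minimum."""
    keep = set()
    for b in sample_beliefs:
        best = min(range(len(alphas)),
                   key=lambda k: sum(a * q for a, q in zip(alphas[k], b)))
        keep.add(best)
    return [alphas[k] for k in sorted(keep)]
```

At beliefs not in the sample, the pruned minimum can only be larger than or equal to the true one, never smaller.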
To get a lower bound, choose any belief states, evaluate the value function at these points, and construct a linear interpolation between the resulting points. It then follows from the concavity of $V_t$ that the resulting curve lies below $V_t$. Hence, the true value function is "sandwiched" between the upper and lower bounds, as depicted in Figure 1.

III Provably Sub-Optimal Low Complexity Algorithms
We now introduce two broad classes of algorithms that are easy to implement, and we provide guarantees on their performance.
III-A Policy Iteration Approximation
The idea is to begin with a naive sampling policy $\pi_0$ for which the value function is easily computable, and then employ one step of policy iteration in order to obtain a policy that is better than $\pi_0$. Since the policy iteration operator corresponds to Newton's method applied on the policy space [16], a single application of policy iteration can be expected to yield vast improvements. More details regarding the "convergence rates" of such procedures can be found in [17]. We now give an example of an easily computable $\pi_0$, and also describe the policy iteration technique.
Open-Loop Policy $\pi_0$: At time $t = 0$ the user picks a set of nodes, arranges them in some order, and decides to sample them according to this order. These nodes are then sampled during the subsequent time slots. This policy is clearly an open-loop policy, since it makes decisions in a non-adaptive manner, i.e., it does not change its decision regarding which node to sample despite gaining more information during the experiment. The value function corresponding to $\pi_0$ can be obtained by solving the following set of equations:
$$V^{\pi_0}_t(b) = c(b) + \mathbb{E}_{Y}\left[ V^{\pi_0}_{t+1}\left( \Phi(b, \pi_0(t), Y) \right) \right], \qquad (11)$$
$$V^{\pi_0}_T(b) = c(b), \qquad (12)$$
where $\Phi$ is calculated from (8) with $u(t) = \pi_0(t)$, and the superscript denotes that the value function is associated with the policy $\pi_0$.
Policy Iteration: The sampling decision at time $t$ is obtained by solving the following equation:
$$u(t) \in \arg\min_{u} \left\{ c(b_t) + \mathbb{E}_{Y}\left[ V^{\pi_0}_{t+1}\left( \Phi(b_t, u, Y) \right) \right] \right\}. \qquad (13)$$
We summarize the discussion of this section as the following result.
Theorem 2
Consider the sampling policy $\pi_1$ which makes decisions as in (13), and is obtained by utilizing a single step of the policy improvement operator upon the policy $\pi_0$. We then have that $\pi_1$ yields a better performance than $\pi_0$, i.e., their value functions satisfy $V^{\pi_1}_t \le V^{\pi_0}_t$ for all $t$.
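When the base policy's value function is only available by simulation, the single improvement step of (13) can be approximated by rollouts. The sketch below is a generic rollout improver under our assumptions; `simulate`, `cost`, and `base_policy` are caller-supplied placeholders.

```python
import random

def rollout_improve(actions, cost, simulate, base_policy, state,
                    horizon, n_rollouts=200, rng=random):
    """One policy-improvement step via Monte Carlo rollout: try each
    candidate first action, then follow the easily computable base
    policy, and return the first action with the lowest estimated
    cumulative cost."""
    def estimate(first_action):
        total = 0.0
        for _ in range(n_rollouts):
            s = simulate(state, first_action, rng)
            total += cost(s)
            for t in range(1, horizon):
                s = simulate(s, base_policy(s, t), rng)
                total += cost(s)
        return total / n_rollouts
    return min(actions, key=estimate)
```

This mirrors the structure of (13): a one-step minimization over actions with the base policy's cost-to-go behind it.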
III-B Cost-to-go Approximations via Look-Ahead Rules
The idea behind this approach is that instead of solving the dynamic programming equations (9), (10) exactly, we derive only an approximation of the true value functions. Such approximations yield more computationally tractable approaches, but yield only a suboptimal policy. There are many approaches to derive such approximations; however, we will restrict ourselves to look-ahead rules [18, 19]. Another approach yields an index rule that attaches an index to each "arm" (an individual), and then samples the individual with the largest value of the index. Some examples of such index rules, and more details on how to derive these policies, can be found in [20, 21, 22, 23, 24, 25]. We next discuss the look-ahead approach.
Let $\tilde{V}_{t+1}$ be an approximation of the value function at time $t+1$. If $b_t$ denotes the belief state at time $t$, then the decision at time $t$ is obtained by solving the following optimization problem:
$$u(t) \in \arg\min_{u} \left\{ c(b_t) + \mathbb{E}_{Y}\left[ \tilde{V}_{t+1}\left( \Phi(b_t, u, Y) \right) \right] \right\}. \qquad (14)$$
We will make the following assumption in order to analyze the performance of lookahead rules.
Assumption 1
For all and times , we have that
where $s$ in the above denotes a representative state (an $N$-dimensional vector comprising $0$s and $1$s).
Under the above assumption, we can prove the following appealing property of the look-ahead policy.
Theorem 3
We now provide some examples of such lookahead rules.
We begin with a simpler problem in which we have to make a sampling decision for only a single time step/resource. In this case, a greedy policy makes a sampling decision that minimizes the instantaneous cost, as follows:
$$u(t) \in \arg\max_{i \notin Q(t)} \sum_{s \in S_i} b_t(s), \qquad (15)$$
where $S_i = \{ s : s_i = 1 \}$. The set $S_i$ represents those possibilities in which user $i$ is infected and not quarantined. Let $V^{g}$ denote the value function for this greedy rule.
Note that the greedy policy is an exploitation-only policy, as it will only test individuals with a high confidence of being infected. This is not a good policy, as it does not explore, and hence misses the chance to control the virus as early as possible. To introduce a certain level of exploration, we now apply one step of policy improvement, i.e., the Bellman operator (9), (10), to the greedy policy. The resulting policy is a one-step look-ahead policy; it generates sampling decisions via a one-step minimization with the greedy value function $V^g$ as the cost-to-go. Note that the computational complexity of this look-ahead policy is low.
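The exploitation-only rule of (15) can be sketched directly from the belief: pick the free individual with the largest marginal probability of infection. The dict-of-state-tuples belief representation is an illustrative assumption.

```python
def greedy_test(belief, quarantined):
    """Exploitation-only rule in the spirit of (15): test the
    non-quarantined individual with the largest marginal probability
    of infection under the current belief over {0,1}^N."""
    n = len(next(iter(belief)))
    def marginal(i):
        # probability mass of all states in which individual i is infected
        return sum(q for s, q in belief.items() if s[i] == 1)
    free = [i for i in range(n) if i not in quarantined]
    return max(free, key=marginal)
```

The one-step look-ahead policy differs from this rule exactly in that it also scores the informational consequences of each test, rather than only the current marginals.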
IV More Complex Environments
The models considered in the previous sections are too simplistic, and may not be adequate to capture many realworld scenarios. In this section, we briefly discuss how to enhance these models and algorithms in order to provide solutions for more complex scenarios.
IV-A Group Testing
One way to improve the efficiency of resource allocation in (6) is to employ group testing [26]. In this procedure, samples of multiple individuals are combined into a single "mixture" sample and a test is performed on the mixture. In case the result is negative, all the component individuals are declared negative, which is especially useful when testing lower-confidence population areas (e.g., during exploration). However, if the test is positive, a subset of these individuals is carefully selected for conducting further tests, allowing identification of all positive individuals. Such a procedure generally reduces the number of tests required, and is immensely useful during testing-kit shortages. Group testing can be incorporated into the model of Section II as follows. The cost incurred by the system remains the same as in (6). However, the action space, i.e., the choice of controls, now consists of all possible subsets of individuals, and $u(t)$ denotes the set of individuals chosen for collective testing at time $t$. We will seek to develop adaptive algorithms that perform group testing in an efficient manner, and to quantify the additional gains (in terms of the additional number of people that were prevented from getting infected) from group testing.
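A simple adaptive instance of group testing is binary splitting: pool the group, and on a positive result recurse on each half. This is a textbook scheme [26], sketched here for intuition; it is not the adaptive algorithm the paper proposes to develop, and `is_infected` stands in for a noise-free pooled-test oracle.

```python
def group_test(pool, is_infected, counter):
    """Adaptive binary-splitting group test: one pooled test on the
    whole pool; if it is positive and the pool has more than one
    member, recurse on the two halves. counter[0] tallies tests."""
    counter[0] += 1  # one pooled test consumed
    if not any(is_infected(i) for i in pool):
        return []    # the whole pool cleared by a single test
    if len(pool) == 1:
        return list(pool)
    mid = len(pool) // 2
    return (group_test(pool[:mid], is_infected, counter)
            + group_test(pool[mid:], is_infected, counter))
```

With a single infection among 16 individuals this uses 9 pooled tests instead of 16 individual ones, and an all-negative pool costs just one test.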
IV-B Inaccurate Testing
The model considered in Section II assumes that the testing result reveals the current state of the tested individual. However, in practice, the tests are not 100% accurate (e.g., PCR tests for COVID-19 have a high false-negative rate). Our model can be readily extended to accommodate this noisy testing, as one can apply Bayes' rule to the observation to infer the current state of the tested individual. Yet, this introduces an interesting question when we deploy group testing. Note that although group testing itself introduces false negatives if the samples get sufficiently diluted, it is also an efficient way to deal with testing error, as one individual can be tested multiple times while keeping the overall average number of samples per individual much less than 1. We will study how to efficiently allocate group tests in order to reduce the overall testing error.
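With imperfect tests, a single noisy observation updates the infection probability of the tested individual via Bayes' rule. The sensitivity and specificity defaults below are illustrative placeholders, not measured PCR characteristics.

```python
def posterior_infected(prior, positive, sensitivity=0.7, specificity=0.95):
    """Posterior probability of infection after one noisy test.
    sensitivity = P(test + | infected); specificity = P(test - | healthy).
    Both default values are hypothetical, for illustration only."""
    if positive:
        like_inf, like_healthy = sensitivity, 1.0 - specificity
    else:
        like_inf, like_healthy = 1.0 - sensitivity, specificity
    z = like_inf * prior + like_healthy * (1.0 - prior)
    return like_inf * prior / z
```

Repeating the update across several independent tests of the same individual drives the posterior toward certainty, which is the mechanism by which repeated (group) tests combat noise.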
IV-C Noisy Contact Graph
The model considered in Section II assumes that the social contact graph is known and that COVID-19 is spread through these contacts. However, in practice, this is rarely the case, because contact tracing only provides approximate coverage and noisy linkages. Hence, we need to extend the state transition kernel to allow some unknown source of infection. Moreover, it is hard to know the transmission probability $p$ of a contact a priori. We will extend the problem (6) to the case where these quantities are unknown. Such an optimization necessarily entails learning the unknown parameters. The algorithm thus has to perform a trade-off in which it makes suboptimal choices for sampling people, which enables it to learn these parameters. We plan to develop learning algorithms that perform this trade-off in an optimal manner.
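One natural way to learn the unknown transmission probability $p$ is a conjugate Beta-Bernoulli posterior, treating each observed contact between an infected and a susceptible individual as a Bernoulli($p$) trial. This is a sketch of the estimation component only, not the paper's proposed learning algorithm.

```python
class TransmissionEstimator:
    """Beta-Bernoulli posterior over the unknown transmission
    probability p. Each infected-susceptible contact whose outcome is
    observed counts as one Bernoulli(p) trial."""
    def __init__(self, a=1.0, b=1.0):
        self.a, self.b = a, b          # Beta(a, b) prior, uniform by default

    def observe(self, transmitted):
        """Record one observed contact outcome."""
        if transmitted:
            self.a += 1.0
        else:
            self.b += 1.0

    def mean(self):
        """Posterior mean estimate of p."""
        return self.a / (self.a + self.b)
```

Sampling from this posterior (rather than plugging in the mean) is what a Thompson-sampling style exploration rule would do.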
IV-D Information-Directed Sampling Approach for Learning
The approximation methods we proposed in the previous sections are computationally tractable compared to the optimal policy (9), (10). These methods are practical for a moderate population, e.g., a city. However, their computational complexity does not allow them to scale to a large population, e.g., nationwide. One promising approach to further reduce the computational complexity is to consider a compressed/kernelized policy space. This is reasonable for practical situations, where trade-offs are often made between optimality and feasibility. For example, in practice, we may only be able to adjust the portion of the total testing budget that is used to explore asymptomatic individuals. This motivates us to consider a class of policies parametrized by a parameter $\theta$. Our goal is to "learn" the best policy from amongst the class $\{\pi_\theta\}$. One possibility is to employ Thompson sampling, or efficient Bayesian information-collection types of learning rules, e.g.,
[27, 28]. We briefly describe the approach below. Let $H$ denote a "sufficiently" large time period. The total time horizon of $T$ steps is divided into episodes of $H$ slots each. We employ a fixed policy $\pi_{\theta_k}$ in the $k$-th episode, which begins at time $kH$. An optimization problem is solved at time $kH$ in order to derive $\theta_k$: it balances the estimated mean performance of each candidate policy over the episode against the uncertainty in that estimate. The performance of a policy during an episode is a random variable, because it depends upon unknown parameters, and we track its mean value and variance. A suitably chosen step size, which converges to $0$ as $k \to \infty$, trades off exploration against exploitation. It will be interesting to characterize the performance of this learning rule, more specifically how it scales with the population size $N$.

V Future Work
The approximation procedure of Section II-A lacks a characterization of the gap between the upper and lower bounds. This gap depends upon the user's choice of the approximation sets and the sampled belief states. An important problem would be to characterize this dependence, thereby allowing us to make "optimal" choices for these hyperparameters. This would also provide us with a "convergence" rate for the approximation algorithm of [14], i.e., how fast the approximation error goes to $0$ as the granularity is increased.

The look-ahead policies have performance guarantees under certain conditions on the system parameters, which in our case translate to conditions on the social contact graph. It is often the case that a look-ahead policy is near-optimal, since it "looks into the future" while making decisions. It is of interest to investigate the performance of the one-step look-ahead policy; more specifically, to seek a characterization of its suboptimality gap.
References
 [1] We need smart coronavirus testing, not just more testing, (accessed April 23, 2020), https://www.statnews.com/2020/03/24/weneedsmartcoronavirustestingnotjustmoretesting/.
 [2] NSF NeTS Community First Call to Arms Workshop, 2020 (accessed April 22, 2020), https://sites.google.com/tamu.edu/netscovid/firstcalltoarmsworkshop.
 [3] Community Mitigation, (accessed April 23, 2020), https://www.cdc.gov/coronavirus/2019ncov/php/openamerica/communitymitigation.html.
 [4] Opening Up America Again, (accessed April 23, 2020), https://www.whitehouse.gov/openingamerica/.
 [5] M. Gurman, Apple, Google Bring Covid19 ContactTracing to 3 Billion People, 2020 (accessed April 22, 2020), https://www.bloomberg.com/news/articles/20200410/applegooglebringcovid19contacttracingto3billionpeople.
 [6] Wikipedia, Covid19 ContactTracing Apps, 2020 (accessed April 22, 2020), https://en.wikipedia.org/wiki/COVID19_apps.
 [7] C. Heneghan, J. Brassey, and T. Jefferson, COVID19: What proportion are asymptomatic?, 2020 (accessed April 22, 2020), https://www.cebm.net/covid19/covid19whatproportionareasymptomatic/.
 [8] A. A. Feldbaum, “Dual control theory,” Automation and Remote Control, vol. 21, no. 9, pp. 874–1039, 1960.
 [9] V. Krishnamurthy, Partially Observed Markov Decision Processes. Cambridge University Press, 2016.
 [10] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, vol. 21, no. 5, pp. 1071–1088, 1973.
 [11] E. J. Sondik, "The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs," Operations Research, vol. 26, no. 2, pp. 282–304, 1978.
 [12] G. E. Monahan, "State of the art—a survey of partially observable Markov decision processes: theory, models, and algorithms," Management Science, vol. 28, no. 1, pp. 1–16, 1982.
 [13] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman, "Acting optimally in partially observable stochastic domains," in AAAI, vol. 94, 1994, pp. 1023–1028.
 [14] W. S. Lovejoy, "Computationally feasible bounds for partially observed Markov decision processes," Operations Research, vol. 39, no. 1, pp. 162–175, 1991.
 [15] J. R. Munkres, Elementary Differential Topology.(AM54). Princeton University Press, 2016, vol. 54.
 [16] P. Whittle, Optimal control: basics and beyond. John Wiley & Sons, Inc., 1996.
 [17] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [18] D. P. Bertsekas, “Dynamic programming and suboptimal control: A survey from adp to mpc,” European Journal of Control, vol. 11, no. 45, pp. 310–334, 2005.
 [19] Dimitri P. Bertsekas, Dynamic programming and optimal control. Athena scientific Belmont, MA, 1995, vol. 1, no. 2.
 [20] J.C. Gittins, K. Glazebrook and R. Weber, Multiarmed Bandit Allocation Indices. John Wiley & Sons, 2011.
 [21] I. Kadota, A. Sinha, E. UysalBiyikoglu, R. Singh, and E. Modiano, “Minimizing the age of information in broadcast wireless networks,” in Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 844–851.
 [22] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, "Scheduling policies for minimizing age of information in broadcast wireless networks," IEEE/ACM Transactions on Networking (TON), 2018, to appear.
 [23] X. Guo, R. Singh, T. Zhao, and Z. Niu, “An index based task assignment policy for achieving optimal powerdelay tradeoff in edge cloud systems,” in 2016 IEEE International Conference on Communications, ICC 2016, Kuala Lumpur, Malaysia, May 2227, 2016, 2016, pp. 1–7. [Online]. Available: https://doi.org/10.1109/ICC.2016.7511147
 [24] X. Guo, R. Singh, P. R. Kumar and Z. Niu, “Optimal energyefficient regular delivery of packets in cyberphysical systems,” in IEEE International Conference on Communications (ICC), June 2015, pp. 3186–3191.
 [25] R. Singh, X. Guo and P. R. Kumar, “Index policies for optimal meanvariance tradeoff of interdelivery times in realtime sensor networks,” in IEEE Conference on Computer Communications (INFOCOM), 2015, pp. 505–512.
 [26] D. Du, F. K. Hwang, and F. Hwang, Combinatorial group testing and its applications. World Scientific, 2000, vol. 12.
 [27] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.
 [28] D. Russo and B. Van Roy, “Learning to optimize via posterior sampling,” Mathematics of Operations Research, vol. 39, no. 4, pp. 1221–1243, 2014.