Network-Side Digital Contact Tracing on a Large University Campus

01/25/2022
by   Matthew L. Malloy, et al.

We describe a study conducted at a large public university campus in the United States that shows the efficacy of network log information for digital contact tracing and prediction of COVID-19 cases. Over the period of January 18, 2021 to May 7, 2021, more than 216 million client-access-point associations were logged across more than 11,000 wireless access points (APs). The association information was used to find potential contacts for approximately 30,000 individuals. Contacts are determined using an AP-colocation algorithm, which supposes contact when two individuals connect to the same WiFi AP at approximately the same time. The approach was validated with a truth set of 350 positive COVID-19 cases inferred from the network log data by observing associations with APs in isolation residence halls reserved for individuals with a confirmed (clinical) positive COVID-19 test result. The network log data and AP-colocation have a predictive value of greater than 10%; that is, the contacts of an individual with a confirmed positive COVID-19 test have greater than a 10% chance of testing positive in the following 7 days (compared with a 0.79% chance when contacts are chosen at random, a ratio of 12.6). Moreover, a cumulative exposure score is computed to account for exposure to multiple individuals that test positive. Over the duration of the study, the cumulative exposure score predicts positive cases with a true positive rate of 16.5% and a missed detection rate of 79% at a particular operating point.


1. Introduction

Digital contact tracing, the use of digital devices such as mobile phones to establish potential epidemiological contacts, has received significant attention as a potential tool in the COVID-19 pandemic. Most proposed digital contact tracing approaches require installation of a mobile app or operating-system-level access, which can be a significant hurdle in countries where high public compliance is expected (saw2021predicting; munzert2021tracking) and a fatal flaw elsewhere (munzert2021tracking; time_nevada).

Conversely, network-side approaches to digital contact tracing do not require installation of a mobile application, operating system customization (e.g., Google and Apple’s exposure notification system (apple_google)), or any information collected on a client device. Instead, they rely on network-side information such as connection logs to trace potential contacts. As network-side approaches can be enabled by network operators without burden on end-users, they are attractive for large-scale, automated contact tracing.

Figure 1. Network-based digital contact tracing. Two WiFi APs (denoted X and Y) are shown. Two users (denoted A and B) connect simultaneously (i.e., colocate) to access point X, supposing epidemiological contact.

Access point (AP) colocation is the co-occurrence of users on the same WiFi access point (AP) at the same time. As WiFi APs have limited range, AP-colocation is a proxy for physical colocation, which supposes epidemiological contact. Like physical colocation, prolonged AP-colocation with an infected individual may correlate with increased risk of contracting the infectious disease. To infer contacts and predict future infections, we present an AP-colocation algorithm. The algorithm generates a confidence score between two individuals that increases as the duration and number of AP-colocations increase. More precisely, with an input corpus of connection logs, the algorithm outputs a time-varying weighted contact graph. The nodes in the graph correspond to individuals (or proxies for individuals, such as digital devices), and edges in the graph represent contacts between individuals. The weights of the edges are the confidence scores, which depend on the number of times two individuals colocate on a single AP and the number of other individuals connected to that AP. The approach also assigns a cumulative exposure score to individuals, which increases as multiple neighboring nodes in the contact graph test positive.

While approaches for digital contact tracing based on colocation have been proposed in the past (malloy2020digital; trivedi2020empirical; trivedi2021wifitrace), the primary contribution of this paper is a study of AP-colocation for digital contact tracing of COVID-19 at a large public university campus in the United States. The campus wireless network consists of more than 11,000 WiFi APs covering approximately one square mile and under normal circumstances serves around 50,000 students, employees, and visitors on a typical day. The APs are located primarily inside residence halls, classroom buildings, libraries, dining halls, research and administrative buildings, shared outdoor spaces, and other spaces typical of a large university campus. As students and employees move throughout the campus, their mobile devices connect and disconnect from the APs. The APs log connections and disconnections, creating a record of the approximate location of the user and others in their proximity. We apply the AP-colocation algorithm to a dataset of over 216 million WiFi association records collected over the duration of the Spring 2021 semester. The resulting contact graph exhibits immense scale, and we report on its statistics.

To validate our approach, a truth set of positive (350) and negative (6,101) COVID-19 cases is inferred from the WiFi association dataset. Positive cases are inferred by observing client-AP associations in dormitories reserved for COVID-19 isolation of individuals with a confirmed clinical positive. Likewise, negative cases are inferred from associations in residence halls not reserved for isolation, which require a twice-a-week negative test result. The ground truth dataset enables validation of the utility of the contact graph and the exposure scores for prediction of positive COVID-19 cases.

Results indicate that the use of network log data and AP-colocation has a predictive value of greater than 10% over the course of the study (above 16% under some parameter choices resulting in limited scale). More precisely, when tuned to return 2 contacts per positive case (on average), the returned contacts have greater than 10% chance of having a confirmed positive COVID-19 result in the following 7 days. This is contrasted with the 0.79% chance of a positive result in the next 7-days when a contact is selected at random. To exploit when an individual is exposed to multiple positive cases, an exposure score is described and computed for each individual in the study. For particular algorithm parameters, the cumulative exposure score predicts positive cases with a true positive rate of 16.5% and missed detection rate of 79%.

While the approach shows promise in settings such as a university or large corporate campus, there are significant shortcomings to using WiFi log data for digital contact tracing. First, individuals must carry on their person a digital device that associates with the network. Estimation of the percentage of individuals on the campus that do not associate with the enterprise WiFi was outside the scope of this study. Second, it is possible and likely common for individuals that connect to the same AP to never come within a distance that supposes disease transmission. As such, there are inevitable false positives (and missed detections), and the approach is best suited to establishing contacts associated with repeated and long term interaction between individuals. As with any digital contact tracing, there are significant privacy considerations that must be addressed, and potential privacy risk must be contrasted with the benefit of such a system. We discuss these trade-offs and note this study is meant to be a starting point for further conversation. In light of these limitations, while the study suggests that digital contact tracing using network-side information can be effective, we recommend that it augment traditional contact tracing.

Lastly, although this study was conducted at a large university campus, the ideas and techniques can be extrapolated to other settings in which network log data is collected. In particular, both cellular network operators and entities in the digital advertising ecosystem collect the information required to implement network-side digital contact tracing in some form. We refer the reader to (malloy2020digital).

In summary, this paper proposes an approach for digital contact tracing based on network log data and describes a study conducted at a large public university campus in the United States. To the best of our knowledge, this is the first study in which ground truth COVID-19 cases are used for validation of network-side contact tracing.

2. Data

Data was collected at a large university campus in the United States during the Spring 2021 semester, from January 18, 2021 through May 7, 2021. The data was collected from 11,964 physical WiFi APs and 16 Aruba Networks enterprise network controllers. Automated log files containing association and disassociation event notifications were collected from the network controllers on a nightly basis. During February 2021 (the first full month of the study in which students were present on campus and classes were in session), a single day’s log files contained on average 2,390,087 association events and 1,252,451 disassociation events.

Our study relies on the association records, which take the form (anonymized MAC address, AP ID, timestamp), approximately localizing a user’s device at a specific time. Associations are only logged if the user also successfully authenticates with the network. This excludes records corresponding to randomized MAC addresses (such as those present in datasets collected by sniffing WiFi traffic, for example probe requests outside an enterprise network). For security and auditing purposes, the network also records logs of authentication events. A sanitized dataset comprising one-to-many mappings of securely hashed user IDs to MAC addresses was collected. Using the authentication dataset, we are able to group devices owned by the same user, which is crucial to interpreting our results. The anonymized mapping was used to convert the association records to the form (anonymized user ID, AP ID, timestamp).

One limitation of the logging of the Aruba Networks controllers is that approximately half of all recorded WiFi sessions ended without an explicit disassociation message, likely due to client devices roaming out of the AP’s range. The absence of a clear session duration makes it challenging to implement contact tracing based on a precise calculation of the duration of AP-colocation. Instead, in a method that also facilitates faster calculation, we discretize time into 15-minute epochs and record instances of (anonymized user ID, AP ID, epoch timestamp). Repeated associations during an epoch are merged into a single record, and disassociation events are disregarded for this study.
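The epoch discretization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(user_id, ap_id, timestamp)` follows the text, but the function names `to_epoch` and `discretize`, the use of UTC timestamps, and the sample records are hypothetical and not taken from the study's pipeline.

```python
from datetime import datetime, timezone

EPOCH_SECONDS = 15 * 60  # 15-minute epochs, as described in the text


def to_epoch(ts: datetime) -> int:
    """Map a timestamp to the index of the 15-minute epoch containing it."""
    return int(ts.timestamp()) // EPOCH_SECONDS


def discretize(records):
    """Collapse (user_id, ap_id, timestamp) records into unique
    (user_id, ap_id, epoch) tuples. Repeated associations within one
    epoch merge into a single record because the result is a set."""
    return {(user, ap, to_epoch(ts)) for user, ap, ts in records}


# Hypothetical sample records (not from the study's data).
records = [
    ("u1", "ap7", datetime(2021, 2, 1, 9, 1, tzinfo=timezone.utc)),
    ("u1", "ap7", datetime(2021, 2, 1, 9, 12, tzinfo=timezone.utc)),  # same epoch
    ("u1", "ap7", datetime(2021, 2, 1, 9, 16, tzinfo=timezone.utc)),  # next epoch
    ("u2", "ap7", datetime(2021, 2, 1, 9, 5, tzinfo=timezone.utc)),
]
epochs = discretize(records)
```

Using a set makes the merge of repeated associations implicit, and it ignores disassociation events entirely, which matches the choice described above of not relying on session durations.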

Despite the scale and complexity, the data pipeline was remarkably reliable. Nonetheless, some level of attrition in collection was experienced, as software issues gradually prevented a subset of APs from reporting log messages. On three separate occasions, the APs from entire buildings were removed from our collection pool permanently as summarized in the table below. Affected buildings included dormitories, teaching spaces, and dining halls, but none of the isolation dorms were impacted. Visibility into 66% of the campus was retained through the end of the Spring semester.

Date APs Reporting Buildings Included
January 18, 2021 11,964 (100%) 207 (100%)
March 24, 2021 10,011 (84%) 174 (84%)
April 9, 2021 8,066 (67%) 170 (82%)
April 25, 2021 7,927 (66%) 164 (79%)
Table 1. APs and buildings during the study duration.

2.1. Truth Set

A truth set of positive and negative COVID-19 cases of residents of on-campus housing was inferred from the network traffic. In the Spring 2021 semester, on-campus students were required to take twice-weekly rapid saliva-based COVID-19 tests. Students who tested positive (and were residents of university housing) were required to move into one of five designated isolation dormitories. Since all campus dormitories including the isolation dorms are covered by campus WiFi infrastructure, this allowed inference of positive cases based on extended and repeated observation of MAC address association to WiFi APs in the isolation dormitories.

Likewise, an assumed negative was inferred by extended observation of a MAC address in campus residence halls not reserved for isolation. Since students were tested twice weekly and positive cases were quickly moved to the isolation dormitories, an individual in a residence hall not reserved for isolation was an assumed negative. Full details of the approach are included in the Appendix.

Ultimately, the inferred ground truth dataset consisted of anonymized user IDs, an indicator of whether the user had an inferred positive test, and if so, the date and time at which they were observed to connect to an AP in the isolation dorm.

3. Methodology

In this section we discuss the methodology used to predict potential exposure to infected individuals. The approach requires first constructing a contact graph followed by using the graph to predict contacts and ultimately new positive cases.

3.1. Contact Graph

To analyze and predict future cases from the WiFi association data, we construct a weighted, undirected, time-varying graph $G_t$. Following standard notation, a graph (or network) $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E$. An edge $\{i, j\}$ is a two-element subset of the node set with an associated weight, $w_{i,j}$.

In an epidemiological contact graph, nodes correspond to individuals (or a surrogate for an individual, such as a device identifier or a MAC address). An edge represents a potential epidemiological contact between two individuals. To control precision and recall, confidence scores, denoted $w_{i,j}(t)$, are assigned to edges. Larger weights represent a higher potential for epidemiological exposure and a higher likelihood of disease transmission. In the presentation of the algorithm we assume time has been discretized into epochs $t = 1, 2, \dots$. The algorithm is described as follows.

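parameters: look-back duration $T$, scaling parameter $\alpha$
input: AP associations $(i, a, t')$: device/user ID $i$, AP ID $a$, epoch $t'$
$V \leftarrow$ set of unique device/user IDs
for each AP $a$, each epoch $t' \in [t - T, t]$
    $n \leftarrow$ number of IDs on AP $a$ on epoch $t'$
    for all pairs of IDs $\{i, j\}$ on AP $a$
        $w_{i,j}(t) \leftarrow w_{i,j}(t) + n^{-\alpha}$
$E_t \leftarrow \{\{i, j\}\}$ for all $w_{i,j}(t) > 0$
$G_t \leftarrow (V, E_t)$
return $G_t$
Algorithm 1 AP Colocation Contact Graph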

Algorithm 1 takes client-AP associations as input, and computes a weighted (undirected) graph, similar to the approach proposed in (malloy2020digital). The algorithm has two parameters: an optional scaling parameter $\alpha$, and a look-back duration $T$. Colocations between IDs prior to time $t - T$ are excluded from the contact graph $G_t$. $T$ is chosen to be longer than the incubation period for the disease, and excludes contacts that happened sufficiently far in the past. The parameter $\alpha$ captures how the confidence score scales with the number of devices connected to a common AP. If $\alpha = 1$, the score is inversely proportional to the number of devices on the AP, while when $\alpha = 0$, the edge weight between two users is the count of epochs and APs for which the users colocated. A large value of $\alpha$ dilutes the effect of high-volume APs, such as those found in dining halls.

Alg. 1 assumes that time has been discretized into epochs. The algorithm proceeds as follows: for each epoch $t'$ and access point $a$, the algorithm computes a corresponding weight between all IDs colocated on the AP. Two users colocate on an AP if they both associate during a single epoch. Weights are summed over valid epochs and APs to create edge weights. The pairs of users that colocate on at least one epoch and their associated weights define the (time-varying) graph $G_t$.
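The procedure described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes that each colocation of `n` users on one AP during one epoch adds `n ** -alpha` to the weight of every pair, which is the form suggested by the description of the scaling parameter (inverse proportionality in one setting, a plain count in the other). The names `contact_graph`, `now`, `lookback`, and `alpha`, and the sample data, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def contact_graph(associations, now, lookback, alpha=1.0):
    """Build edge weights from (user_id, ap_id, epoch) records.

    Only epochs in [now - lookback, now] contribute. Each colocation
    of n users on one AP in one epoch adds n ** -alpha to every pair
    (an assumed weight form, see the lead-in).
    """
    # Group users by (AP, epoch), skipping epochs outside the window.
    present = defaultdict(set)
    for user, ap, epoch in associations:
        if now - lookback <= epoch <= now:
            present[(ap, epoch)].add(user)

    # Accumulate a weight for every unordered pair of colocated users.
    weights = defaultdict(float)
    for users in present.values():
        n = len(users)
        if n < 2:
            continue  # a lone user has no one to colocate with
        for i, j in combinations(sorted(users), 2):
            weights[(i, j)] += n ** -alpha
    return dict(weights)


# Hypothetical associations: users a, b, c on APs X and Y.
associations = [
    ("a", "X", 1), ("b", "X", 1),
    ("a", "X", 2), ("b", "X", 2), ("c", "X", 2),
    ("a", "Y", 50),  # outside the look-back window, ignored
]
graph = contact_graph(associations, now=2, lookback=10, alpha=1.0)
counts = contact_graph(associations, now=2, lookback=10, alpha=0.0)
```

With `alpha=1.0` the pair `("a", "b")` accumulates 1/2 from the first epoch and 1/3 from the second, while with `alpha=0.0` the same pair simply counts two colocations. Sorting the users before pairing keeps each undirected edge under a single key.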

3.2. Predicting Positive Cases

After construction of the contact graph, positive cases can be predicted. If user $i$ has a (clinical) positive result, all neighbors of user $i$ (at the time of the positive result) with an edge weight above a threshold $\tau$ are returned as predicted positives.

When a disease is highly prevalent in a population, multiple contacts of an individual may test positive, increasing the chances of transmission to that individual. In general, this is not captured by traditional contact tracing, as the contacts of each positive case are identified independently.

To capture the potential predictive power of knowledge that multiple contacts have tested positive, we introduce the notion of an exposure score. The exposure score is cumulative: for a single individual, the confidence scores associated with all positive contacts are summed. More precisely, let $t_i$ be the time at which user $i$ tests positive. The exposure score for user $j$ at time $t$ is defined as

$$ s_j(t) = \sum_{i \in \mathcal{P} \,:\, t_i \le t} w_{i,j}(t), $$

where $\mathcal{P}$ is the set of nodes that test positive during the study. Since $w_{i,j}(t) = 0$ for non-neighboring nodes, the sum is taken over neighbors of node $j$ that have a clinical positive test result.
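A cumulative exposure score of this kind can be illustrated with a short Python sketch. It makes simplifying assumptions that are not taken from the paper: a single snapshot of edge weights is used rather than a time-varying graph, a contact counts as positive once its test time is at or before the evaluation time, and the names `exposure_score`, `weights`, and `positive_time` are hypothetical.

```python
def exposure_score(user, weights, positive_time, now):
    """Sum the edge weights between `user` and every contact whose
    positive test time is at or before `now`."""
    total = 0.0
    for (i, j), w in weights.items():
        if user not in (i, j):
            continue  # edge does not touch this user
        other = j if i == user else i
        t_pos = positive_time.get(other)
        if t_pos is not None and t_pos <= now:
            total += w
    return total


# Hypothetical snapshot of edge weights and positive-test times.
weights = {("a", "b"): 0.8, ("a", "c"): 0.3, ("b", "c"): 0.5}
positive_time = {"b": 10, "c": 40}
score_early = exposure_score("a", weights, positive_time, now=20)
score_late = exposure_score("a", weights, positive_time, now=50)
```

In this toy example the score of user `a` grows from the weight of one positive contact to the sum over two once the second contact tests positive, which is the behavior the cumulative score is meant to capture.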

4. Results and Validation

4.1. AP-Colocation Graph

The methodology of Sec. 3 was applied to a campus-wide dataset of more than 216 million client-AP associations over the duration of the study. As the graph is time-varying, we first report on the characteristics of the contact graph $G_t$ during the time period February 1, 2021 through February 8, 2021. We note that since a mapping between a MAC address and a user was only available for a subset of this data, the campus-wide contact graph was generated such that each node corresponds to a MAC address as opposed to an individual.

Campus-wide contact graph    count
client-AP associations    16,124,734
nodes, $|V|$    47,415
edge count, $|E|$    2,242,934
average degree ($2|E|/|V|$)    94.6
Table 2. Statistics of the campus-wide contact graph $G_t$, February 1, 2021 through February 8, 2021.

4.2. Validation

While the methodology was applied to a campus-wide dataset, only a subset of these associations correspond to individuals for which ground truth data was available. The subset of data corresponding to these individuals was used to create a labeled contact graph. Of the nodes (individuals) in the labeled contact graph, 350 have a ground truth positive result during the course of the study. The remaining individuals are assumed to be negative as described in Sec. 2 and the Appendix, for a positivity rate of 5.7% over the duration of the study.

For each individual $i$ that tests positive in the labeled dataset, the graph $G_{t_i}$ was used to predict contacts, where $t_i$ denotes the time at which the individual tests positive. A predicted positive contact is a neighbor of $i$ with a confidence score above a threshold $\tau$; i.e., a predicted positive contact is a node $j$ such that $w_{i,j}(t_i) > \tau$. If such a neighbor tests positive in the following $d$ days, a true positive (TP) event is recorded. If such a neighbor does not test positive, a false positive (FP) event is recorded.

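Let $\mathcal{P}$ denote the set of individuals with a ground truth positive over the course of the study and $V_t$ denote the set of individuals in the study at time $t$ (an individual is excluded from the study after testing positive). The positive predictive value (PPV) is defined as

$$ \mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{\sum_{i \in \mathcal{P}} \sum_{j \in V_{t_i}} \mathbb{1}\{w_{i,j}(t_i) > \tau\}\, \mathbb{1}\{t_i < t_j \le t_i + d\}}{\sum_{i \in \mathcal{P}} \sum_{j \in V_{t_i}} \mathbb{1}\{w_{i,j}(t_i) > \tau\}}, $$

where TP and FP represent the counts of true positives and false positives over the duration of the study, $\tau$ is the confidence score threshold, $d$ is the prediction window in days, $t_j = \infty$ for an individual $j$ that never tests positive, and $\mathbb{1}\{\cdot\}$ is the indicator function.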

Validation Set count
individuals (nodes) at start,
individuals (nodes) at end,
positive cases (nodes) 350
Table 3. Statistics of the validation set.

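For comparison, the positive predictive value of contacts chosen at random was calculated. Again let $i$ and $j$ index the individuals (nodes), and let $t_i$ and $t_j$ denote the respective times at which the individuals test positive; then

$$ \mathrm{PPV}_{\mathrm{random}} = \frac{\sum_{i \in \mathcal{P}} \sum_{j \in V_{t_i} \setminus \{i\}} \mathbb{1}\{t_i < t_j \le t_i + d\}}{\sum_{i \in \mathcal{P}} \left| V_{t_i} \setminus \{i\} \right|}, $$

where $\mathcal{P}$ is the set of positive cases, $V_{t_i}$ is the set of individuals in the study at time $t_i$, and $d$ is the prediction window in days.
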
parameters (days) PPV scale
Alg. 1 5.0% 5.2
Alg. 1 10.0% 1.7
Alg. 1 12.5% 0.9
Alg. 1 5.0% 8.4
Alg. 1 10.0% 2.1
Alg. 1 12.5% 1.1
NA
NA
Table 4. Validation results at various operating points, compared with predicted contacts chosen at random.

Figure 2. Percent of predicted contacts that test positive in the $d$ days after a positive case as a function of the threshold $\tau$, denoted PPV.

Figure 3. Scale (average number of predicted contacts) vs. positive predictive value.

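Scale is defined as the average number of contacts that are returned for a given contact score threshold. More specifically,

$$ \mathrm{scale} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \sum_{j \in V_{t_i}} \mathbb{1}\{w_{i,j}(t_i) > \tau\}, $$

where $\mathcal{P}$ is the set of positive cases, $V_{t_i}$ is the set of individuals in the study at the time $t_i$ at which case $i$ tests positive, and $\tau$ is the contact score threshold.
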
Scale and PPV are shown in Fig. 2 and Fig. 3. Additional results for a variety of parameter choices are shown in Fig. 7 through Fig. 9.
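The two validation quantities, PPV and scale, can be computed together in a few lines. The sketch below is illustrative only: it assumes times are expressed in days, that `weights_at[i]` already holds the edge weights of positive case `i` at the time of its positive test, and that contacts who tested positive before case `i` have left the study. The function name `ppv_and_scale` and all sample values are hypothetical.

```python
def ppv_and_scale(weights_at, positive_time, threshold, window):
    """Return (PPV, scale) for predicted contacts of positive cases.

    weights_at[i] maps each neighbor j of positive case i to the edge
    weight at the time i tested positive (times are in days).
    """
    tp = fp = 0
    for i, neighbors in weights_at.items():
        t_i = positive_time[i]
        for j, w in neighbors.items():
            t_j = positive_time.get(j)
            if t_j is not None and t_j <= t_i:
                continue  # j already tested positive, so j left the study
            if w <= threshold:
                continue  # not a predicted contact
            if t_j is not None and t_j <= t_i + window:
                tp += 1  # contact tests positive within the window
            else:
                fp += 1
    predicted = tp + fp
    ppv = tp / predicted if predicted else float("nan")
    scale = predicted / len(weights_at) if weights_at else float("nan")
    return ppv, scale


# Hypothetical example: two positive cases and their neighbors.
weights_at = {"p1": {"a": 2.0, "b": 0.5, "c": 3.0}, "p2": {"a": 1.5}}
positive_time = {"p1": 0, "p2": 3, "a": 5}
ppv, scale = ppv_and_scale(weights_at, positive_time, threshold=1.0, window=7)
```

Raising `threshold` in this sketch returns fewer contacts per positive case (lower scale) while typically keeping the stronger edges, which is the trade-off between scale and PPV discussed in this section.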

Fig. 10 shows an analysis of the sensitivity to participation in the study. In particular, 25% and 50% of the individuals were excluded at random, and the positive predictive value and scale analysis was repeated for the resulting 75% and 50% participation levels, each with correspondingly fewer users and positive cases than under full participation. The positive predictive value does not change significantly, since both the numerator and denominator of the PPV (Sec. 4.2) are reduced by approximately the same amount. The scale (average returned contacts per positive) decreases in proportion to the number of devices in the study.


Figure 4. Histogram showing the number of plausible transmissions from each positive case. The x-axis indicates the number of contacts with an edge weight above the threshold $\tau$ and a confirmed positive in the following $d$ days, and the y-axis shows the count of occurrences out of the 350 positive cases.

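Fig. 4 shows a histogram of the number of plausible transmissions, denoted $N_i$. The number of plausible transmissions is the number of contacts of a positive case (with edge weight above the threshold $\tau$) that test positive in the following $d$ days. More precisely, let $i$ index a user with a confirmed positive at time $t_i$. Then the number of plausible transmissions for case $i$ is given as

$$ N_i = \sum_{j \in V_{t_i}} \mathbb{1}\{w_{i,j}(t_i) > \tau\}\, \mathbb{1}\{t_i < t_j \le t_i + d\}, $$

where $V_{t_i}$ is the set of individuals in the study at time $t_i$ and $t_j$ is the time at which individual $j$ tests positive. Fig. 4 shows the histogram of $N_i$ over the 350 positive cases.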

In addition to the validation of the confidence score produced by the graph, the exposure score as a predictor of a future positive COVID-19 test was validated. The exposure score aims to capture the potential predictive power of knowledge that multiple contacts of an individual have tested positive. The exposure scores of ten individuals (chosen randomly) over the course of the study are shown in Fig. 5. The left panel of Fig. 5 shows the scores of individuals that test positive. The date of the positive test is shown on the plot. In all cases, the exposure score is elevated prior to the positive result. The right panel shows example traces of individuals that do not test positive. Note that the y-axis is scaled for each subplot.

The cumulative exposure score was also used to predict positive cases. For each individual, a positive prediction was declared at the first time the exposure score exceeded a threshold $\gamma$. If the time of the prediction was after the individual had a ground truth positive, the individual was excluded. If the time of the prediction preceded a ground truth positive test by less than $d$ days, a true positive (TP) was recorded; if the time of the prediction preceded a ground truth positive test by more than $d$ days, or the individual did not have a ground truth positive over the course of the study, a false positive (FP) was recorded. Likewise, if the exposure score of the individual was below $\gamma$ for the duration of the study and the individual did not test positive, a true negative (TN) was recorded. If the exposure score of the individual was below the threshold for the duration of the study and the individual tested positive, a false negative (FN) was recorded. The true positive rate is the percentage of predicted positive cases that are ground truth positives:

$$ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. $$

Likewise, the missed detection rate is the percentage of total positive cases that are not predicted:

$$ \mathrm{MDR} = \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}}. $$
Figure 5. Exposure scores of five users that test positive (a) and five users that do not test positive (b) during the course of the study. The date of the positive test result is indicated by an ‘x’. Note the scale of the y-axis is different for each sub-plot.
Figure 6. Receiver operating characteristic curve for the exposure score over the duration of the study. The true positive rate is plotted against the missed detection rate for a range of values of the exposure score threshold $\gamma$.

Fig. 6 shows the true positive rate and missed detection rate for a number of values of the exposure score threshold $\gamma$ over the duration of the study. Note that $\gamma$ can be used to select an appropriate operating point: the approach can operate with a true positive rate of 16.5% and a missed detection rate of less than 80%, or, for example, a missed detection rate below 20% and a true positive rate above 5%.
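The labeling rules for exposure-score predictions can be sketched as a small classifier. This is an illustration under assumptions rather than the study's code. The true positive rate is read here as the fraction of predictions that are followed by a positive test, which is the reading consistent with the operating points quoted above (the two rates do not sum to one). Ties that the text leaves open are resolved arbitrarily: a prediction at exactly the time of the positive test counts as preceding it, and a lead of exactly `window` days counts as a false positive. All names and sample values are hypothetical.

```python
from collections import Counter


def classify(first_exceed, positive_time, window):
    """Label one individual as 'TP', 'FP', 'TN', 'FN', or None (excluded).

    first_exceed: day the exposure score first exceeded the threshold,
                  or None if it never did.
    positive_time: day of the ground truth positive, or None.
    """
    if first_exceed is None:
        return "TN" if positive_time is None else "FN"
    if positive_time is None:
        return "FP"
    if first_exceed > positive_time:
        return None  # prediction came after the positive test: excluded
    return "TP" if positive_time - first_exceed < window else "FP"


def rates(individuals, window):
    """Return (true positive rate, missed detection rate)."""
    counts = Counter(classify(f, p, window) for f, p in individuals)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    tpr = tp / (tp + fp) if tp + fp else float("nan")
    mdr = fn / (tp + fn) if tp + fn else float("nan")
    return tpr, mdr


# Hypothetical (first_exceed, positive_time) pairs, in days.
individuals = [(3, 6), (3, 20), (5, None), (None, None), (None, 9), (10, 4)]
tpr, mdr = rates(individuals, window=7)
```

Sweeping the threshold that produces `first_exceed` and recomputing the two rates for each value would trace out a curve of operating points of the kind shown in Fig. 6.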

Fig. 11 and Fig. 12 show the true positive rate, missed detection rate, and risk ratio when positive cases are predicted by an exposure score that exceeds a threshold $\gamma$ for a variety of algorithm parameters, as a function of time during the study. Note that Fig. 11 and Fig. 12 show results on a per-individual basis. In other words, the figures show the percent of the predicted individuals that later test positive. This is in contrast to Fig. 2 and Fig. 3, which show the average number of contacts of individuals who test positive.

For reference, Fig. 13 shows the count of new positive cases in the days following the date on the x-axis.

Lastly, we highlight the power of the dataset with a table that shows potential ‘high-spread’ events. Table 5 shows the top (AP, hour) pairs when sorted by the highest percentage of positive users in the following 14 days. We stress that the table does not imply that transmission occurred during these events.

AP  AP location     date and time              total users  positive users (shorter window)  positive users (14 days)
A   residence hall  2021-02-05, 20:00 - 21:00  14           8                                10
B   residence hall  2021-02-10, 02:00 - 03:00  10           5                                6
C   residence hall  2021-02-11, 12:00 - 13:00  13           7                                7
D   residence hall  2021-02-11, 09:00 - 10:00  12           4                                6
E   residence hall  2021-02-10, 13:00 - 14:00  11           3                                5
F   dining hall     2021-02-02, 19:00 - 20:00  499          17                               25
G   residence hall  2021-02-04, 19:00 - 20:00  165          6                                20
H   dining hall     2021-02-02, 19:00 - 20:00  361          12                               20
I   residence hall  2021-02-07, 20:00 - 21:00  69           1                                13
J   residence hall  2021-02-01, 19:00 - 20:00  53           8                                12
Table 5. Potential high-spread events, February 2021. Rows A-E: top (AP, hour) pairs with 10 or more users, sorted by highest percentage of users that test positive in the following 14 days. Rows F-J: top (AP, hour) pairs sorted by largest number of positive users in the following 14 days. Each AP is only listed once.

5. Discussion

The results indicate that the use of network log data and the AP-colocation algorithm has a positive predictive value of greater than 10% (and above 16% under some parameter choices resulting in limited scale). More specifically, at a scale of 1.7 contacts per positive case (on average), those returned contacts have greater than a 10% chance of having a clinical positive COVID-19 result in the following 7 days. At a higher scale of 5.2 predicted contacts per positive case, the predicted contacts have a 5.0% chance of a clinical positive in the following 7 days (see Table 4 and Fig. 3). This is contrasted with the 0.79% chance of a positive result in the next 7 days when contacts are selected at random from the ground truth dataset. Comparing the 10% predictive value of Alg. 1 to contacts chosen at random, the WiFi colocation contacts have a more than 12-fold increase in the chance of a clinical positive test result in the following 7 days.
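As a quick check of the comparison above, the ratios of the predictive values can be worked out directly from the figures quoted in this section (10% and 5.0% for Alg. 1 at the two scales, 0.79% for contacts chosen at random). Because the text reports the 10% figure as a lower bound, the first ratio is a lower bound as well:

```latex
\[
\frac{10\%}{0.79\%} \approx 12.66 > 12,
\qquad
\frac{5.0\%}{0.79\%} \approx 6.33 .
\]
```

So even at the larger scale of 5.2 predicted contacts per positive case, the returned contacts are roughly six times as likely to test positive as contacts chosen at random.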

To account for exposure to multiple individuals that test positive, the exposure score can be employed, providing stronger predictive power. Note that the false positive and false negative rates associated with a fixed threshold vary as the prevalence of the disease in the population varies. Depending on the overall prevalence of the disease, users with an exposure score above a threshold exhibit a more than five-fold increase in odds of testing positive for COVID-19 over those with a score below the threshold (see Fig. 11).

6. Ethical Considerations

With the collection of any network data there are natural privacy concerns, and there must be effort to balance the tradeoffs between these concerns and any potential benefit of such a system to the community. During a global pandemic, the potential benefits of better contact tracing are extremely high. Positive cases of reportable medical diseases (such as COVID-19) must be reported to public health officials, who have broad authority to collect data related to contact tracing, whether those records are provided through oral history or electronically. Outside a pandemic, it is unlikely the risk/benefit to such a system is viable. During a pandemic, the specific risk/benefit depends on the prevalence of the disease, the severity of the disease, and the efficacy of other contact tracing approaches. All factors must be considered to establish if such a system is within an acceptable risk tolerance as it pertains to privacy, and we reference work that focuses on the privacy aspects of digital contact tracing (hekmati2021contain; tang2020privacy; baumgartner2020mind; tahiliani2021privacy; tang2021another; ahmed2021dimy; hatamian2021privacy; holzapfel2020digital; ocheja2020quantifying; legendre2020contact; sanderson2021balancing).

Graph datasets are attractive from a privacy perspective as their edges capture pair-wise relationships between individuals, and do not disclose the location (i.e., the AP or physical location). Even if a contact graph is constructed from information that may be considered sensitive (e.g., physical location), it can be stored and analyzed without any of the sensitive information. Specific to our study, additional measures were taken to mitigate any disclosure of sensitive information. All identifiers (MAC addresses, authentication information, and AP identifiers) were anonymized via a one-way hash before analysis was completed, and the authors of the study never had access to plain text user identifiers.
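One way such one-way anonymization could look in practice is sketched below. The study states only that identifiers were anonymized via a one-way hash. The use of a keyed hash (HMAC-SHA-256), the normalization step, the function name `anonymize`, and the key value are all assumptions made for illustration.

```python
import hashlib
import hmac


def anonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of an identifier as a hex string."""
    # Normalize so the same device always maps to the same token.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be held by the data custodian only.
key = b"hypothetical-secret-held-by-the-network-operator"
token_a = anonymize("AA:BB:CC:DD:EE:FF", key)
token_b = anonymize("aa:bb:cc:dd:ee:ff", key)
token_c = anonymize("aa:bb:cc:dd:ee:00", key)
```

A keyed hash is chosen in this sketch because the space of MAC addresses is small enough that an unkeyed hash could be reversed by enumerating candidates. With a secret key, tokens remain stable for building the contact graph while analysts who do not hold the key cannot recompute them.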

Lastly, this work has been determined by the Minimal Risk IRB at the University of Wisconsin as not research involving human subjects as defined by the United States Department of Health and Human Services (DHHS) and United States Food and Drug Administration (FDA). We reference the Menlo Report (Dittrich12) as it establishes ethical principles and provides context for these principles in computing and communication research. The Menlo Report provides guidance with respect to Institutional Review Board (IRB) and self evaluation.

7. Related Work

Digital contact tracing has received a surge of attention since the start of the COVID-19 pandemic. Many of the initial approaches required installation of a mobile application for data collection and rely on location services (e.g., GPS) or proximity sensing using technologies such as Bluetooth low energy (BLE). Similar to app-based approaches, operating system level approaches use the operating system (OS) to collect data after a user ‘opts-in’, but in both cases, data is collected on the client device. Notably, in a joint venture, Google and Apple developed and released contact tracing technology known as GAEN (Google/Apple Exposure Notification) (apple_google), which relies on a BLE protocol to sense surrounding devices. Both app-based and OS-based technologies have been adopted by governments with varying degrees of success. In some instances, where state-level governments highly recommended contact tracing applications, installation was noted to be under 3% of adults (time_nevada). Furthermore, while the relatively short range of Bluetooth would initially seem an advantage in identifying close contacts, growing evidence of aerosol transmission of SARS-CoV-2 at distances of up to 60 feet (Bazante2018995118), which more closely matches indoor WiFi transmission range, suggests that tools based on WiFi would be a useful addition to the contact tracing tool set.

In contrast to approaches based on client-side data collection, network-side contact tracing, such as the approach proposed in this paper, can be implemented without the installation of an app. Closely related to the work in this paper is that of (trivedi2021wifitrace; zakaria2020analyzing). In (trivedi2021wifitrace), the authors propose use of WiFi network association logs gathered by enterprise networks to create a graph data structure which can be used for contact tracing. The approach (including Alg. 1 of (trivedi2021wifitrace)) is similar to proposed co-location graph algorithms for other applications, including (malloy2017internet; funkhouser2018device). The authors of (trivedi2021wifitrace) implement their approach and demonstrate its efficacy with WiFi datasets but simulated disease outbreak data, as they do not have access to ground truth COVID test results. Other campus-wide studies related to the COVID-19 pandemic using network log data include (zhang2021wlan) which studies super-spreader events using data collected on a university campus. Many additional papers focused on cellular or WiFi networks have surfaced during the COVID-19 pandemic (dmitrienko2020proximity; tu2021epidemic; mcheick2021d2d; zagatti2021large; yoo2020bim; manavi2020review; liu2021wibeacon; yi2021cellular; monroe2021location; giustiniano20215g; oikonomidis2021role; zhao2020accuracy; abowd2020using; thakare2021p; bressandata; nguyen2020epidemic; zang2021building; li2021vcontact) which do not include ground truth datasets. Overview articles that explore WiFi technologies include (petrovic2021iot; sahraouitraceme; sun2021mitigating; basheeruddinasdaq2021wireless; braithwaiteautomated; braithwaite2020automated; roy2020efficient).

Another closely related work is the approach proposed in (malloy2020digital), which relies on data collected by entities in the digital advertising ecosystem, and IP-colocation (as opposed to AP-colocation). The techniques and construction of the co-location graph are similar. IP-colocation and AP-colocation are likely to exhibit significant overlap. In many homes and small businesses, WiFi users spend significant time in close proximity and share a single public IP address through NAT. Enterprise networks, such as the WiFi network in this study, have more flexibility in managing their IP address space with policy options ranging from assigning each device a unique public address to arbitrarily distributing devices across a pool of public addresses through NAT.

Finally, network-side contact tracing has some overlap with the problem of WiFi localization, particularly from the vantage point of locating transmitters within a network. Approaches based on received signal strength (bahl2000radar), angle of arrival (joshi2013pinpoint), and time of flight (giustiniano2011caesar) all have the potential to improve contact tracing accuracy by providing an improved measure of physical proximity. We speculate that narrowing the spatial resolution to the room level is likely to be of epidemiological value. Additionally, if the wireless network facilitates tracking of user devices with finer granularity than association and disassociation events, it would enable a more precise calculation of AP colocation duration despite devices roaming without explicit disassociation messages.

8. Summary

Our results suggest that network-side contact tracing can help identify individuals at increased risk of disease, plausibly through exposure in a shared environment. Where available, e.g., on academic and corporate campuses, WiFi networks can be effectively leveraged to help control disease spread. The approaches discussed here can be extrapolated to other settings in which network log data is collected.

While the confidence scores defined by Algorithm 1 are predictive of positive cases, determining actionable and effective criteria for implementing interventions remains a sizable outstanding effort. More work must be done before such a system is viable, in particular working with health and administrative officials to determine whether notification of possible exposure to SARS-CoV-2 (or other communicable diseases) is warranted.

Appendix

Figure 7. Percent of predicted contacts that test positive in the days after a positive case, as a function of the threshold. A Wilson confidence interval is shown.
Figure 8. Figure a. . Percent of predicted contacts that test positive in the -days () after a positive case as a function of the threshold , denoted PPV. Contact graph with . Figure b. Average number of predicted contacts as a function of contact score for and . Note that the plots do not depend on ; two traces are obscured.
Figure 9. Scale (average number of predicted contacts) vs. positive predictive value for and . Figure a, . Figure b, . Note that at some scale (i.e, 5), has a higher positive predictive value than .
Figure 10. Sensitivity analysis when 100%, 75%, and 50% of the original users participate in the study. 95%-Wilson confidence intervals are shown. Note the positive predictive value is largely unaffected by decreased participation, but the average number of contacts (per positive case) decreases accordingly. and . .
Figure 11. True positive rate – predicted positive individuals that test positive in days after date, divided by the total number of predicted positive individuals. Missed detection rate – predicted negatives that test positive in days after the date, divided by count of positive tests in days after the date. A predicted positive is an individual with a cumulative exposure score above , and a predicted negative is an individual with a cumulative exposure score below . , . Figure a, . Figure b .
Figure 12. Risk ratio. The risk ratio is defined as the odds of testing positive (in the following days) given a cumulative exposure score above , divided by the odds of testing positive (in the following days) given a cumulative exposure score below . , . Figure a, . Figure b .
Figure 13. Count of new positive cases in the and day period after date on the x-axis.

Ground Truth

At the start of the Spring 2021 semester, students returned to campus housing facilities but were required to test frequently with rapid saliva-based COVID-19 tests. Students who were residents of university housing and tested positive were required to move into one of five designated isolation dormitories or self-isolate in off-campus housing. Access to the isolation facilities was controlled by the university administration, which closely monitored and recorded the daily occupancy of each dormitory. Since all of the campus dormitories, including the isolation dorms, are covered by campus WiFi infrastructure, we hypothesized that the WiFi logs would offer insight into devices that moved from regular dorms to isolation dorms without revealing personally identifiable information about the device owner. To assess the validity of this hypothesis, we compared the number of devices detected in isolation dorms with the true count of admitted residents supplied by the university.

While visualizing device activity over time, we quickly realized that time of day has a large impact on observed WiFi activity. During busy hours of the day, we would overestimate the number of residents by a factor of two or more because of the presence of staff. A prior study of device type and user behavior (trivedi2020empirical) suggests differences in the types of devices most likely to be found at varying hours of the day. In the early hours of the morning, when dormitory residents are likely to be present and sleeping and staff or visitors are likely absent, we expect that any devices still connected to the WiFi network are most likely the personal phones of residents. Whereas phones may maintain a network connection in order to receive updates, most other devices such as laptops and entertainment devices are likely to be in a low-power, non-transmitting state. If true, this tendency could not only assist us in inferring residence but could also help identify the devices most likely to be carried with the person for contact tracing.

To test the feasibility of estimating building occupancy, we compare the number of unique devices (anonymized MAC addresses) detected during different times of the day with the daily resident counts from the campus health authorities. We frame this as an optimization problem of choosing the best hour of the day at which to use WiFi device counts to predict the number of residents in a dormitory. More specifically, we count the number of devices detected in the five isolation dorms during a given one-hour window for each day over the course of the study and compute the mean squared error (MSE) against the true counts in the respective dorms. We repeat this for each of the 24 hours of the day and find that the one-hour window of 4:00-5:00 AM minimizes the prediction error, as shown in Figure 16. We find that the resulting device counts are a good approximation of the true number of residents. Figure 14 shows the five isolation dorms with inferred and true resident counts. Compared to the general problem of predicting a building’s occupancy from WiFi activity, highly predictable device and user behavior make dorm residency a relatively easy problem.
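The hour-selection step described above can be sketched as follows. This is a minimal illustration under assumed data layouts, not the code used in the study: the function name `best_hour`, the tuple format of `detections`, and the dictionary format of `true_counts` are hypothetical.

```python
from collections import defaultdict


def best_hour(detections, true_counts):
    """Pick the hour whose unique-device counts best match true occupancy.

    detections: iterable of (device_id, dorm, day, hour) with hour in 0..23.
    true_counts: dict mapping (dorm, day) -> reported number of residents.
    Returns (hour with the lowest MSE, dict of MSE per hour).
    """
    if not true_counts:
        raise ValueError("true_counts must not be empty")

    # Unique devices seen in each (hour, dorm, day) window.
    devices = defaultdict(set)
    for device, dorm, day, hour in detections:
        devices[(hour, dorm, day)].add(device)

    mse_by_hour = {}
    for hour in range(24):
        squared_errors = [
            (len(devices.get((hour, dorm, day), ())) - truth) ** 2
            for (dorm, day), truth in true_counts.items()
        ]
        mse_by_hour[hour] = sum(squared_errors) / len(squared_errors)

    return min(mse_by_hour, key=mse_by_hour.get), mse_by_hour
```

Windows with no detections contribute a device count of zero, so hours with little WiFi activity are penalized by the full resident count rather than being skipped.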

Inferring Dorm Residents

After finding a correlation between the number of devices detected in the early morning and the number of isolation dorm residents, we turn our attention to identifying devices and their corresponding owners who reside in any of the campus dormitories. Recall that for the purpose of evaluating the effectiveness of our contact tracing approach, we must limit our population to the users who live in campus housing because this is the subset of users for whom we can infer COVID-19 cases based on presence in the isolation dorms. Although we do not have ground truth numbers for Spring semester residents, we take the reported capacity from university housing web pages as an estimate of the true numbers. We find that simply considering any device detected during the hour of 4:00-5:00 AM as belonging to a resident leads to a large overcount, perhaps explained in part by visitors, so it is necessary to filter the more probable residents.

This time operating over the range of user IDs, we count the number of days on which any device belonging to a user is detected in a building during the hour of 4:00-5:00 AM. If multiple devices belonging to the same user are detected on the same morning, we attribute only one detection to that building. If the number of times the user is detected in a given building meets or exceeds a threshold, we consider the user a probable resident of that building. For a given threshold value we count the number of inferred residents in each building, which decreases as the threshold increases. Iterating over threshold values, we select the value that minimizes the MSE against reported dorm capacity numbers, as seen in Figure 16. We take the users meeting this threshold as the population of dorm residents for the remainder of our study and disregard non-residents. Although we are able to compute AP colocation metrics for non-resident devices and perform digital contact tracing, we are only able to evaluate the efficacy of contact tracing in the COVID-19 pandemic on the subset of devices that belong to residents.
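The resident-inference procedure can be sketched as below. This is an illustrative reading of the description, not the study's implementation: the function names, the record layout of `detections` (assumed to be pre-filtered to the 4:00-5:00 AM window), and the `capacity` dictionary are assumptions.

```python
from collections import defaultdict


def morning_counts(detections):
    """Count distinct mornings on which each user is seen in each building.

    detections: iterable of (user_id, device_id, building, day), assumed
    to contain only records from the 4:00-5:00 AM window.
    Returns dict mapping (user_id, building) -> number of distinct days.
    """
    days = defaultdict(set)
    for user, _device, building, day in detections:
        # A set of days means several devices on one morning count once.
        days[(user, building)].add(day)
    return {key: len(seen) for key, seen in days.items()}


def inferred_residents(counts, threshold):
    """Users whose morning detection count meets or exceeds the threshold."""
    residents = defaultdict(set)
    for (user, building), n in counts.items():
        if n >= threshold:
            residents[building].add(user)
    return residents


def best_threshold(counts, capacity, candidates):
    """Threshold minimizing MSE between inferred residents and capacity.

    capacity: dict mapping building -> reported capacity (non-empty).
    candidates: iterable of threshold values to try.
    """
    def mse(threshold):
        residents = inferred_residents(counts, threshold)
        return sum(
            (len(residents.get(b, ())) - cap) ** 2 for b, cap in capacity.items()
        ) / len(capacity)

    return min(candidates, key=mse)
```

Because a user is evaluated per building, the same user can pass the threshold in more than one building, which matches the multi-building caveat noted in the text.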

We briefly note two potential issues with this approach. First, this choice of threshold is still expected to result in overestimation because distancing measures and more students choosing to live off campus likely meant that dorms were under capacity during the semester. However, we felt it important not to falsely exclude too many users from our evaluation, and furthermore, overestimating the population of residents is unlikely to bias the results in our favor. Second, under this approach a small number of users may be considered as probable residents of more than one building. Since our goal is to identify the set of dorm residents, we do not consider that a problem, but if we needed to infer a user’s building of residence, some refinements could be made such as selecting the building with the highest detection count.

Figure 14. Device counts detected in the early morning hour of 4:00-5:00 AM compared with true number of residents in five isolation dorms.

Inferring Isolation Residents

Finally, we apply an approach similar to that of the previous section to identify dorm residents who temporarily move into one of the five isolation dorms. Using the same threshold, we scan the association logs for users whose device(s) appear in an isolation dorm on at least that many mornings and label these users as positive cases. Starting from the first time one of the user’s devices is detected in isolation, we find the longest uninterrupted length of time until any of the user’s devices is detected in a different building, and consider that the user’s length of stay in isolation. Figure 15 shows the number of users inferred as residents in each isolation dorm compared to the ground truth data for the semester, which we received from campus health authorities.
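The length-of-stay step can be sketched as follows, under one possible reading of the description: the stay begins at the first detection in an isolation dorm and ends at the last detection in that dorm before the user is seen in any other building. The function name, the event layout, and this interpretation are assumptions; the sketch covers only the stay computation, not the morning-count threshold.

```python
def isolation_stay(events, isolation_dorms):
    """Estimate one user's stay in isolation from association events.

    events: list of (timestamp, building) for all of the user's devices,
    in any order; timestamps must be mutually comparable (e.g. numbers).
    isolation_dorms: set of building names designated for isolation.
    Returns (start, end) timestamps, or None if never seen in isolation.
    """
    start = end = dorm = None
    for timestamp, building in sorted(events):
        if start is None:
            # Skip everything before the first isolation detection.
            if building in isolation_dorms:
                start = end = timestamp
                dorm = building
        elif building == dorm:
            end = timestamp  # still in the same isolation dorm
        else:
            break  # first detection elsewhere ends the stay
    return None if start is None else (start, end)
```

For example, with timestamps in hours, a user seen in a regular dorm at hour 0, in an isolation dorm at hours 10, 34, and 58, and back in the regular dorm at hour 60 would be assigned a stay from hour 10 to hour 58; a later return to isolation is ignored by this sketch.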

Figure 15. Number of inferred users in each isolation dorm compared with true counts.
Figure 16. (left) MSE of device count at different times of the day as a predictor of isolation dorm residents. (right) MSE of inferred resident counts compared to reported building capacities for 19 dormitories.

Acknowledgements

Suman Banerjee and Lance Hartung are supported in part through the following US National Science Foundation grants: CNS-1719336, CNS-1647152, CNS-1629833, and CNS-2003129 and an award from the US Department of Commerce with award number 70NANB21H043.

References