Providing early indication of regional anomalies in COVID19 case counts in England using search engine queries

07/23/2020
by Elad Yom-Tov, et al.

Abstract

COVID19 was first reported in England at the end of January 2020, and by mid-June over 150,000 cases were reported. We assume that, similarly to influenza-like illnesses, people who suffer from COVID19 may query for their symptoms prior to accessing the medical system (or in lieu of it). Therefore, we analyzed searches to Bing from users in England, identifying cases where unexpected rises in relevant symptom searches occurred at specific areas of the country.

Our analysis shows that searches for “fever” and “cough” were the most correlated with future case counts, with searches preceding case counts by 16-17 days. Unexpected rises in search patterns were predictive of future case counts multiplying by 2.5 or more within a week, reaching an Area Under Curve (AUC) of 0.64. Similar rises in mortality were predicted with an AUC of approximately 0.61 at a lead time of 3 weeks.

Thus, our metric provided Public Health England with an indication which could be used to plan the response to COVID19 and could possibly be utilized to detect regional anomalies of other pathogens.

1 Introduction

COVID19 was first reported in England in late January 2020 [10]. By the middle of June 2020, over 150,000 cases and 39,000 deaths had been reported.

In early March 2020, Public Health England (PHE), University College London (UCL) and Microsoft began investigating the possibility of using Bing search data to detect areas where outbreaks of the disease might be occurring or are soon to occur, so as to assist PHE in better planning their response.

Internet data in general, and search data in particular, have long been used to track Influenza-Like Illness (ILI) [7, 14, 12], norovirus [5], and dengue fever [3] in the community. The main reason for the utility of these data for this purpose is that most people with, for example, ILI will not visit a medical facility but will search for information about it or mention it in social media postings [16]. We assume that the similarity of symptoms between ILI and COVID19, together with public fear of accessing medical facilities during an epidemic, may drive people to similarly search the web for relevant symptoms, making these searches predictive of COVID19.

Models of ILI which are based on internet data are usually trained using past seasons’ data. Since this was infeasible for COVID19, we opted for a different approach to prediction, one which requires less training data. Our methodology examined consecutive weeks: during the first of those weeks we found, for each Upper Tier Local Authority (UTLA), other UTLAs with similar rates of queries for symptoms. These UTLAs were then utilized to predict the rate of queries for relevant symptoms during the following week. The difference between the actual and predicted rate of searches served as an indication of an unusual number of searches in a given area, i.e., an anomaly.

This methodology is similar to a difference-in-difference analysis [4], albeit one where differences are calculated between actual and predicted symptom rates. As such, it shares similarities with the methodology used to predict the effectiveness of childhood flu vaccinations using internet data [9, 13].

2 Methods

2.1 Symptom list and area list

The list of 25 relevant symptoms for COVID19 was extracted from PHE reports; the symptoms are listed in Table 1 together with their synonyms or related expressions.

In order to maximise the utility of the analysis, we conducted it at the level of UTLA, a subnational administrative division of England into 173 areas, over which local government has a public health remit.

COVID19 symptom: synonyms or related expressions
Altered consciousness: altered consciousness
Anorexia: appetite loss, loss of appetite, lost appetite
Anosmia: loss of smell, can’t smell
Arthralgia: joint ache, joint aching, joints ache, joints aching
Chest pain: chest pain
Chills: chills
Cough: cough
Diarrhea: diarrhea, diarrhoea
Dry cough: dry cough
Dyspnea: breathing difficult, short breath, shortness of breath
Epistaxis: nose bleed, nose bleeding
Fatigue: fatigue
Head ache: head ache, headache
Myalgia: muscle ache, muscular pain
Nasal congestion: blocked nose, nasal congestion
Nausea: nausea, nauseous
Pyrexia: fever, high temperature
Pneumonia: pneumonia, respiratory infection, respiratory symptoms
Rash: rash
Rhinorrhea: runny nose
Seizure: seizure
Sore throat: sore throat, throat pain
Sternutation: sneeze, sneezing
Tiredness: tiredness
Vomiting: vomit, vomiting
Table 1: 25 symptoms related to COVID19 (as identified by PHE) and their synonyms or related expressions.

2.2 Search data

We extracted all queries submitted to the Bing search engine from users in England. Each query was mapped to a UTLA according to the postcode (derived from the IP address of the user) from which the user was querying. We counted the number of users per week who queried for each of the keywords from each UTLA, and normalized by the number of users who queried for any topic during that week from each UTLA. The fraction of users who queried for keyword k during week w in UTLA a is denoted by f_k(a, w).

Data were extracted for the period from January 1st, 2020 to May 28th, 2020. For privacy reasons, UTLAs with fewer than 10,000 Bing users were removed from the analysis. Additionally, any keyword queried by fewer than 10 users in a given week at a specific UTLA was removed from the analysis by setting its count to zero.
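To make the normalization and the privacy thresholds concrete, the following is a minimal sketch (not the production pipeline) of how f_k(a, w) could be computed with pandas, assuming a hypothetical query log with columns user_id, utla, week, and keyword, where keyword holds the matched symptom term or NaN for non-symptom queries.

```python
import pandas as pd

def weekly_keyword_fractions(queries: pd.DataFrame,
                             min_users_per_utla: int = 10_000,
                             min_keyword_users: int = 10) -> pd.DataFrame:
    """Sketch of computing f_k(a, w): the fraction of Bing users in UTLA `a`
    who queried for keyword `k` during week `w`, with the privacy thresholds
    described above. Expects columns: user_id, utla, week, keyword."""
    # Denominator: distinct users who queried for any topic, per UTLA and week.
    totals = (queries.groupby(["utla", "week"])["user_id"]
                     .nunique().rename("total_users").reset_index())

    # Numerator: distinct users who queried for each tracked symptom keyword.
    counts = (queries.dropna(subset=["keyword"])
                     .groupby(["utla", "week", "keyword"])["user_id"]
                     .nunique().rename("keyword_users").reset_index())

    df = counts.merge(totals, on=["utla", "week"])

    # UTLAs with fewer than 10,000 Bing users are excluded for privacy reasons.
    df = df[df["total_users"] >= min_users_per_utla].copy()

    # Keywords with fewer than 10 users in a UTLA-week are set to zero.
    df.loc[df["keyword_users"] < min_keyword_users, "keyword_users"] = 0

    df["fraction"] = df["keyword_users"] / df["total_users"]
    return df[["utla", "week", "keyword", "fraction"]]
```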

This study was approved by the Institutional Review Board of the Technion, Israel Institute of Technology.

2.3 Validation data

We compared our detection methodology (described below) to mortality data and case reports. The former were obtained at a weekly resolution from the UK Office of National Statistics Death Registrations and Occurrences by Local Authority and Health Board (https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/causesofdeath/datasets/deathregistrationsandoccurrencesbylocalauthorityandhealthboard). Case report numbers per day were accessed from the UK government’s dashboard for COVID19 (https://coronavirus.data.gov.uk/).

2.4 Analysis

Analysis was conducted at a weekly resolution, beginning on Mondays of each week, starting on March 4th, 2020. For a given week w we found, for each UTLA a, a set of control UTLAs C(a) such that f_k(a, w) could be predicted from the rates f_k(c, w) of the control areas c in C(a). To do this, a greedy procedure was followed for each UTLA a:

  1. Find a UTLA c which is at least 50km distant from UTLA a, and for which the linear function mapping the symptom rates at c to the symptom rates at a reaches the highest coefficient of determination (R²). That is, find the c whose rates f_k(c, w) best approximate f_k(a, w), across all keywords k, in a least-squares sense.

  2. Repeat (1), each time adding another area that maximally increases R² when added to the previously established set of areas, until five control areas are selected.

The linear function was optimized for a least squares fit, with an intercept term.

The result of this procedure is a linear function which predicts the symptom rate for each UTLA a given the symptom rates at the 5 other UTLAs in C(a) at week w. We denote this prediction as f̂_k(a, w).

The function is applied at week w+1 to each UTLA, and the difference between the actual and estimated symptom rate for each symptom is calculated: d_k(a, w+1) = f_k(a, w+1) - f̂_k(a, w+1). We refer to this difference as the UTLA outlier measure.

To facilitate comparison between the differences of different keywords, the values of d_k(a, w+1) are normalized to zero mean and unit variance (standardized) for each keyword.
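As an illustration of the procedure above, the following sketch implements the greedy selection of control UTLAs and the resulting outlier measure with NumPy and scikit-learn. It assumes a hypothetical matrix X_w of shape (UTLAs x keywords) containing f_k(a, w) for one week, plus a matrix of pairwise distances between UTLA centroids; it is not the code used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def select_controls(X_w, dist_km, target, n_controls=5, min_dist=50.0):
    """Greedily pick control UTLAs for `target` (a row index into X_w).

    X_w:     (n_utlas, n_keywords) symptom fractions f_k(a, w) for week w.
    dist_km: (n_utlas, n_utlas) pairwise distances between UTLA centroids.
    Returns the chosen control indices and the fitted linear model."""
    y = X_w[target]                         # symptom rates at the target UTLA
    eligible = [a for a in range(X_w.shape[0])
                if a != target and dist_km[target, a] >= min_dist]
    chosen, best_model = [], None
    for _ in range(n_controls):
        best_r2, best_a = -np.inf, None
        for a in eligible:
            if a in chosen:
                continue
            # Map the candidate set's per-keyword rates to the target's rates.
            Z = X_w[chosen + [a]].T         # (n_keywords, n_candidates)
            model = LinearRegression().fit(Z, y)   # least squares + intercept
            r2 = model.score(Z, y)                 # coefficient of determination
            if r2 > best_r2:
                best_r2, best_a, best_model = r2, a, model
        if best_a is None:
            break
        chosen.append(best_a)
    return chosen, best_model

def outlier_measure(X_next, controls, model, target):
    """UTLA outlier measure at week w+1: actual minus predicted symptom rates,
    one value per keyword, i.e. d_k(a, w+1)."""
    predicted = model.predict(X_next[controls].T)
    return X_next[target] - predicted
```

In the full analysis, the per-keyword differences returned by outlier_measure would then be standardized (zero mean, unit variance) across UTLAs for each keyword, as described above.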

3 Results

3.1 Correlation of individual keywords with case counts

For illustrative purposes, Figure 1 shows the daily number of COVID19 cases and percentage of Bing users who queried for “cough” in one of the UTLAs. We calculated the cross-correlation between these time series for each keyword and each UTLA. The highest correlation and its lag in days were noted, and the median values (across UTLAs) are shown for each keyword in Table 2.

Figure 1: Number of COVID19 cases (red circles) and percentage of Bing users who queried for “cough” in a sample UTLA (blue). Curves are smoothed using a moving average filter of length 7. The correlation between the curves is 0.837 at a delay of 20 days.
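For readers who wish to reproduce this kind of lag analysis, a minimal sketch of the cross-correlation computation is shown below. It assumes equal-length daily arrays of case counts and query fractions for a single UTLA; the 7-day smoothing mirrors the moving average used in Figure 1.

```python
import numpy as np

def best_lag(cases: np.ndarray, fraction: np.ndarray,
             max_lag: int = 40, window: int = 7):
    """Return (correlation, lag) maximizing the Pearson correlation between
    smoothed daily case counts and the smoothed query fraction.
    A positive lag means searches precede case counts by `lag` days."""
    kernel = np.ones(window) / window
    c = np.convolve(cases, kernel, mode="valid")     # 7-day moving average
    f = np.convolve(fraction, kernel, mode="valid")
    best_r, best_lag_days = -np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = c[lag:], f[:len(f) - lag]         # searches lead cases
        else:
            a, b = c[:len(c) + lag], f[-lag:]        # cases lead searches
        n = min(len(a), len(b))
        if n < 3:
            continue
        r = np.corrcoef(a[:n], b[:n])[0, 1]
        if r > best_r:
            best_r, best_lag_days = r, lag
    return best_r, best_lag_days
```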

As Table 2 shows, the best correlations are reached for cough, sore throat, and fever, at a delay of 16-19 days. Based on initial results and using PHE’s case definition of COVID19 at the time, we focused on two keywords, cough and fever, for the remaining analysis.

Keyword Median correlation Median lag (days)
chest pain 0.589 13
cough 0.746 17
diarrhea 0.606 22
fatigue 0.509 -13
fever 0.695 16
head ache 0.624 13
nausea 0.590 -4
pneumonia 0.667 34
rash 0.612 -8
seizure 0.579 6
sneezing 0.593 4
sore throat 0.775 19
vomiting 0.575 15
Table 2: Median correlation and the lag (in days) at which it is achieved, between case numbers and the fraction of Bing users querying for each keyword. A positive lag means that Bing searches appear before case counts, and vice versa.

Figure 2 shows the improvement in model fit (R²) as more areas are added to C(a). As the figure shows, improvement continues, but the marginal gain decreases with the number of areas, as expected.

Figure 2: Average R² values of the model as the number of control areas increases.

3.2 Detection ability of the outlier measure

On average, predictions were given for 116 of the 173 UTLAs per week, those in which at least 10,000 users queried on Bing.

Figure 3 shows the Receiver Operating Characteristic (ROC) curve for two lags, 3 days (AUC: 0.56) and 8 days (AUC: 0.63), for the composite signal, that is, the product of the normalized UTLA outlier measures for “cough” and “fever”.

Figure 3: ROC curve of the composite measure, derived from the product of the normalized UTLA outlier measures for “fever” and “cough”, for two time lags (3 and 8 days).

Figure 4 shows the Area Under Curve (AUC) of the ROC curve where the independent attribute is the UTLA outlier measure and the dependent variable is whether there was a week-over-week jump of more than two standard deviations in the number of COVID19 cases in a UTLA. As Figure 4 shows, “fever” reaches a slightly higher AUC than “cough”, but precedes case numbers by only around 2 days, meaning that it can predict cases with only limited lead time. In contrast, “cough” reaches a lower AUC at a lead time of 4-6 days. The product of the values of both symptoms (denoted in the figure as “Both”) reaches the highest AUC, at a lead of one week.

Figure 4: AUC of the UTLA outlier measure for detecting unusually large rises in COVID19 cases per UTLA, as a function of the lag between case counts and Bing data.
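As a sketch of how such an AUC curve could be produced, the snippet below evaluates a single lag: the composite outlier measure for each UTLA-week is scored against a binary label indicating whether an unusually large rise in cases occurred lag_days later. The data structures (dictionaries keyed by UTLA and date) are hypothetical stand-ins for the study's data.

```python
from datetime import timedelta
from sklearn.metrics import roc_auc_score

def outlier_auc(outlier: dict, case_jump: dict, lag_days: int) -> float:
    """AUC of the composite outlier measure (product of the standardized
    "fever" and "cough" measures) for predicting a large rise in cases
    `lag_days` later.

    outlier:   {(utla, date): composite outlier value}
    case_jump: {(utla, date): 1 if an unusually large week-over-week rise
                 in cases began on `date`, else 0}"""
    scores, labels = [], []
    for (utla, day), value in outlier.items():
        key = (utla, day + timedelta(days=lag_days))
        if key in case_jump:
            scores.append(value)
            labels.append(case_jump[key])
    return roc_auc_score(labels, scores)

# Sweeping lag_days over a range of days yields curves like those in Figure 4.
```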

Figure 5 shows a similar analysis to the one above, albeit for mortality data on a weekly basis. As the figure shows, cough is the better predictor, at a delay of 3 weeks.

Figure 5: AUC of the UTLA outlier measure for detecting unusually large rises in COVID19 mortality per UTLA, as a function of the lag (in weeks) between mortality and Bing data.

3.3 Changes in detections over time

Figure 6 (left) shows the number of UTLAs with sufficient data, meaning that enough users queried for the relevant terms, over the weeks of the analysis. As the figure shows, the number of users querying about fever was relatively high throughout the analysis, but queries for cough dropped sufficiently that the number of UTLAs for which detections could be given was reduced by around a fifth. Figure 6 (center) shows the number of UTLAs per week whose outlier measure values were greater than 2 standard deviations. Here too cough decreases quickly, while fever remains at a relatively constant level. Finally, Figure 6 (right) shows the number of UTLAs which experienced a rise of 2.5 times or more in the number of cases, week over week. Here too the number drops significantly over time.

Figure 6: Number of UTLAs with sufficient data over time (left), number of UTLAs with outlier measure values over the threshold over time (center), and number of UTLAs with week-over-week case rises greater than 2.5 times (right). Week numbers correspond to the weeks since the beginning of 2020.

3.4 Demographic attributes of outlying areas

The 10 highest and 10 lowest correct and mistaken detections at each week were identified to assess whether they could be associated with specific demographic characteristics of their areas.

Demographic characteristics of UTLAs were collected from the UK Office of National Statistics (ONS), and include population density, male and female life expectancy and healthy life expectancy, male to female ratio, and the percentage of the population under the age of 15.

Association was estimated using a logistic regression model. However, none of the variables was statistically significantly associated with the detection outcome (after Bonferroni correction).
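For illustration, the association test could be set up roughly as follows with statsmodels; the feature matrix and outcome vector are hypothetical stand-ins for the ONS attributes and the correct/mistaken detection labels described above.

```python
import numpy as np
import statsmodels.api as sm

def demographic_association(X: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """Logistic regression of detection outcome (1 = correct detection,
    0 = mistaken detection) on UTLA demographic attributes, with a
    Bonferroni-corrected significance test per coefficient.

    X: (n_utlas, n_attributes) demographic attributes, e.g. population
       density, life expectancy, male-to-female ratio, % under 15.
    y: (n_utlas,) binary detection outcome."""
    model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
    corrected_alpha = alpha / X.shape[1]               # Bonferroni correction
    significant = model.pvalues[1:] < corrected_alpha  # skip the intercept
    return model, significant
```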

4 Discussion

Internet data, especially search engine queries, have been used for tracking influenza-like illness and other illnesses for over a decade, because of the frequency at which people query for the symptoms of these illnesses and the fact that more people search for them than visit a health provider [7, 14]. COVID19, a novel disease, seemed to present similar opportunities for tracking using web data, and current indications suggest that search data could be used to track the disease [8]. However, as there was little past information to enable model training, we developed a method for detecting outbreaks using a variant of a difference-in-difference model at the local level.

Our results demonstrate good correlation between case numbers and the use of the keywords “cough”, “fever” and “sore throat”, with queries leading case numbers by 16-19 days (similar to the findings of Lampos et al. [8]). Based on early indications from PHE we focused on using the first two keywords in our detection methodology.

The detections provided to PHE gave a lead time of approximately one week for case numbers, with an AUC of approximately 0.64. This modest accuracy is nonetheless useful as long as exceedance of the 2 standard deviations threshold is not interpreted at face value as an increase in disease incidence, but as an early warning signal that triggers further investigation and correlation with outputs from other disease surveillance systems. This is particularly true at the local level, and the outputs of this analysis are being incorporated into local routine PHE surveillance reports alongside outputs from clinical and laboratory systems. We also note that the detection accuracy for mortality was greatest at 3 weeks, which is congruent with the time difference between illness onset and death [17].

The threshold at which a UTLA should be alerted can be set in a number of ways. In our work with PHE, we reported UTLAs where the product of the values of both symptoms exceeded the 95th percentile of the values computed for all other symptoms that week, similar to the procedure used in the False Discovery Rate test [1].
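A minimal sketch of this alerting rule (with hypothetical inputs) is shown below: the composite fever-and-cough value for each UTLA is compared against the 95th percentile of the standardized outlier values of all other symptoms in the same week.

```python
import numpy as np

def utlas_to_alert(composite: dict, other_symptom_values: np.ndarray,
                   percentile: float = 95.0) -> list:
    """Return the UTLAs whose composite fever*cough outlier measure exceeds
    the `percentile`-th percentile of the standardized outlier values
    observed for all other symptoms in the same week.

    composite:            {utla: composite outlier value for this week}
    other_symptom_values: pooled standardized outlier values of the other
                          symptoms across UTLAs, for the same week."""
    threshold = np.percentile(other_symptom_values, percentile)
    return [utla for utla, value in composite.items() if value > threshold]
```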

The reasons for the modest detection accuracy include problems in the source data as well as in the data used as ground truth. Search data are noisy [15] and Bing’s market share in England is estimated at around 5% [2]. We compared our results to the number of positive COVID19 cases. These numbers are affected by testing policy, which may have caused a non-uniform difference between known and actual case numbers in different UTLAs. Additionally, COVID19 has a relatively high asymptomatic rate (currently estimated at 40-45% [11]), and people who do not experience symptoms would be missed by our method. On the other hand, current serological surveys [6] suggest that by the end of May 2020, between 5% and 17% of the population (depending on the area of England) had been exposed to COVID19, compared to only 0.3% who had tested positive in a screening test. This suggests that a large number of people who may have experienced symptoms of COVID19 and queried for them were not subsequently tested, leading to errors in our comparison between detections and known case numbers.

References

  • [1] Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1), pp. 289–300. Cited by: §4.
  • [2] M. Capala (2020-03) Global search engine market share in the top 15 GDP nations (updated for 2020). https://alphametic.com/global-search-engine-market-share. Accessed on 2020-06-29. Cited by: §4.
  • [3] P. Copeland, R. Romano, T. Zhang, G. Hecht, D. Zigmond, and C. Stefansen (2013) Google disease trends: an update. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41763.pdf. Cited by: §1.
  • [4] J. B. Dimick and A. M. Ryan (2014) Methods for evaluating changes in health care policy: the difference-in-differences approach. JAMA 312 (22), pp. 2401–2402. Cited by: §1.
  • [5] M. Edelstein, A. Wallensten, I. Zetterqvist, and A. Hulth (2014) Detecting the norovirus season in Sweden using search engine data – meeting the needs of hospital infection control teams. PLoS ONE 9 (6), pp. e100309. Cited by: §1.
  • [6] Public Health England (2020-06) Sero-surveillance of COVID-19. https://www.gov.uk/government/publications/national-covid-19-surveillance-reports/sero-surveillance-of-covid-19. Accessed on 2020-06-29. Cited by: §4.
  • [7] V. Lampos, A. C. Miller, S. Crossan, and C. Stefansen (2015) Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5 (12760). Cited by: §1, §4.
  • [8] V. Lampos, S. Moura, E. Yom-Tov, M. Edelstein, M. Majumder, Y. Hamada, M. X. Rangaka, R. A. McKendry, and I. J. Cox (2020) Tracking COVID-19 using online search. arXiv preprint arXiv:2003.08086. Cited by: §4, §4.
  • [9] V. Lampos, E. Yom-Tov, R. Pebody, and I. J. Cox (2015) Assessing the impact of a health intervention via user-generated internet content. Data Mining and Knowledge Discovery 29 (5), pp. 1434–1457. Cited by: §1.
  • [10] P. Moss, G. Barlow, N. Easom, P. Lillie, and A. Samson (2020) Lessons for managing high-consequence infections from first COVID-19 cases in the UK. The Lancet 395 (10227), pp. e46. Cited by: §1.
  • [11] D. P. Oran and E. J. Topol (2020) Prevalence of asymptomatic SARS-CoV-2 infection: a narrative review. Annals of Internal Medicine. Cited by: §4.
  • [12] M. Wagner, V. Lampos, I. J. Cox, and R. Pebody (2018) The added value of online user-generated content in traditional methods for influenza surveillance. Scientific Reports 8 (1), pp. 1–9. Cited by: §1.
  • [13] M. Wagner, V. Lampos, E. Yom-Tov, R. Pebody, and I. J. Cox (2017) Estimating the population impact of a new pediatric influenza vaccination program in England using social media content. Journal of Medical Internet Research 19 (12), pp. e416. Cited by: §1.
  • [14] S. Yang, M. Santillana, and S. C. Kou (2015) Accurate estimation of influenza epidemics using Google search data via ARGO. PNAS 112 (47), pp. 14473–14478. Cited by: §1, §4.
  • [15] E. Yom-Tov, I. Johansson-Cox, V. Lampos, and A. C. Hayward (2015) Estimating the secondary attack rate and serial interval of influenza-like illnesses using social media. Influenza and Other Respiratory Viruses 9 (4), pp. 191–199. Cited by: §4.
  • [16] E. Yom-Tov (2016) Crowdsourced health: how what you do on the internet will improve medicine. MIT Press. Cited by: §1.
  • [17] F. Zhou, T. Yu, R. Du, G. Fan, Y. Liu, Z. Liu, J. Xiang, Y. Wang, B. Song, X. Gu, L. Guan, Y. Wei, H. Li, X. Wu, J. Xu, S. Tu, Y. Zhang, H. Chen, and B. Cao (2020) Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet 395 (10229), pp. 1054–1062. Cited by: §4.