In just over seven months, COVID-19 – the disease caused by the betacoronavirus SARS-CoV-2 – has caused over 503,000 deaths worldwide, 125,000 of which are in the US . In the absence of a vaccine or an effective treatment, authorities have employed non-pharmaceutical interventions (NPIs) to slow epidemic growth, including school and business closures, work-from-home policies, and travel bans. Recently, many US states have begun progressively reopening their economies, despite estimates of cumulative US COVID-19 incidence suggesting that fewer than 10-15% of the US population has been exposed to SARS-CoV-2 . Serological studies also indicate low levels of seroprevalence even in parts of the US heavily affected by the virus (e.g., 23% in New York City by May 29, 2020) [66, 53]. The long development timeline for a vaccine  coupled with the possibility that immunity to SARS-CoV-2 may decline over time (as is the case with other coronaviruses) portends the emergence of new epidemic waves . The availability of a reliable, robust, real-time indicator of emerging COVID-19 outbreaks would aid immensely in appropriately timing a response.
Despite efforts by the research community to aggregate and make available data streams that are representative of COVID-19 activity, it is not immediately clear which of these data streams is the most dependable for tracking outbreaks. Most metrics for tracking COVID-19, such as confirmed cases, hospitalizations, and deaths, suffer from reporting delays, as well as uncertainties stemming from inefficiencies in the data collection, collation, and dissemination processes . For example, confirmed cases may be more reflective of testing availability than of disease incidence and, moreover, lag infections by days or weeks [29, 40]. Previous work has suggested that clinician-provided reports of influenza-like illness (ILI) aggregated by the Centers for Disease Control and Prevention (CDC) may be less sensitive to testing availability than confirmed cases, but these reports suffer from reporting lags of 5-12 days, depend on the thoroughness of clinician reporting, and do not distinguish COVID-19 from other illnesses that may cause similar symptoms.
Alternatively, forecasting models can assist in long-term planning, but the accuracy of their predictions is limited by the timeliness of data or parameter updates. Specifically, some models demonstrate predictive skill with respect to hospitalizations and deaths [16, 30], but these predictions are often too late to enable timely NPI implementation. Other models suffer from limited generalizability, with NPI implementation proposed only for a specific city. The CDC has launched a state-level forecasting initiative aimed at consolidating predictions from multiple models to estimate future COVID-19-attributable deaths, but the use of these predictions in state-level decision-making is still pending.
Over the last decade, new methodologies have emerged to track population-level disease spread using data sources not originally conceived for that purpose. These approaches have exploited information from search engines [22, 57, 71, 82, 45, 39, 70], news reports [4, 43, 44], crowd-sourced participatory disease surveillance systems [73, 55], Twitter microblogs [56, 50], electronic health records [79, 69], Wikipedia traffic, wearable devices, smartphone-connected thermometers, and travel websites to estimate disease prevalence in near real-time. Several have already been used to track COVID-19 [33, 42]. These data sources are liable to bias, however; for example, Google Search activity is highly sensitive to the intensity of news coverage [71, 35, 3]. Methodologies to mitigate biases in digital data sources commonly involve combining disease history, mechanistic models, and surveys to produce ensemble estimates of disease activity [68, 61].
Our Contribution: We propose that several digital data sources may provide earlier indication of epidemic spread than traditional COVID-19 metrics such as confirmed cases or deaths. Six such sources are examined here: (1) Google Trends patterns for a suite of COVID-19-related terms, (2) COVID-19-related Twitter activity, (3) COVID-19-related clinician searches from UpToDate, (4) predictions by GLEAM, a state-of-the-art metapopulation mechanistic model, (5) anonymized and aggregated human mobility data from smartphones, and (6) Kinsa Smart Thermometer measurements. We first evaluate each of these “proxies” of COVID-19 activity for their lead or lag relative to traditional measures of COVID-19 activity: confirmed cases, deaths attributed, and ILI. We then propose the use of a metric combining these data sources into a multi-proxy estimate of the probability of an impending COVID-19 outbreak. Finally, we develop probabilistic estimates of when such a COVID-19 outbreak will occur conditional on proxy behaviors. Consistent behavior among proxies increases the confidence that they capture a real change in the trajectory of COVID-19.
Visualizing the behavior of COVID-19-tracking data sources: motivation for designing an early-warning system. Figure 1 displays the temporal evolution of all available signals considered in this study for three US states - Massachusetts (MA), New York (NY), and California (CA) - over five lengthening time intervals. These states illustrate different epidemic trajectories within the US, with NY among the worst affected states to date and CA experiencing a more gradual increase in cases than both MA and NY.
The top row of Figure 1 for each state displays normalized COVID-19 activity as captured by daily reported confirmed cases, deaths, and “Excess ILI” (hospitalizations were discounted due to data sparseness). Excess ILI refers to hospital visits due to influenza-like illness in excess of what is expected from a normal flu season , which we attribute to COVID-19 in 2020. ILI data were taken from the CDC’s US Outpatient Influenza-like Illness Surveillance Network (ILINet). The middle row for each state displays time series for five proxies of COVID-19 activity. The bottom row for each state displays state-level anonymized and aggregated human mobility data as collected by mobile phones; mobility data is viewed as a proxy for adherence to social distancing recommendations. Similar visualizations for all states are shown in Figures S1 to S17 in the Supplementary Materials.
Figure 1 demonstrates that for MA, NY, and CA, COVID-19-related clinicians’ and general population’s Internet activity, smart thermometers, and GLEAM model predictions exhibit early increases that lead increases of confirmed cases and deaths due to COVID-19. Analogously, decreases in other proxies - especially in mobility - mirror later decreases in COVID-19-attributable confirmed cases and deaths for the three states represented. This is not universally observable, however, as some states such as North Carolina, Arizona, Florida, and Texas have not seen decreases in COVID-19 activity.
Quantifying the timing of growth in proxies of COVID-19 activity.
To quantify the relative leads and lags in our collection of disease proxies, we formulated a change of behavior “event” for each proxy and compared it to three “gold standards” of COVID-19 activity: confirmed cases, deaths attributed, and Excess ILI. In keeping with classical disease dynamics, we defined an event as any initiation of exponential growth (“uptrend”). Using a Bayesian approach, we obtained a joint posterior probability distribution for parameter values in a function $y(t) \sim \beta e^{\gamma (t - t_0)} + \epsilon(t)$ over a time window of 14 days, evaluating $\beta$, $\gamma$, and $\sigma^2$ (the variance of $\epsilon$). A $p$-value was then calculated per proxy per day, representing the posterior probability that the growth rate $\gamma$ is not greater than zero. As the $p$-values decrease, we grow more confident that a given time series is exhibiting sustained growth. When the $p$-value decreases below 0.05, we define this as an individual proxy’s “uptrend” event.
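The uptrend test can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an exponential curve with an amplitude, a rate, and Gaussian noise, flat priors over a parameter grid, and a scale-invariant prior on the noise level. The function name `uptrend_p_value` and the grid bounds are hypothetical, and the input window is assumed to contain positive values.

```python
import numpy as np

def uptrend_p_value(y, rate_grid=None, n_amp=200):
    """Posterior probability that the exponential rate is <= 0 for one window."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    if rate_grid is None:
        rate_grid = np.linspace(-0.5, 0.5, 201)  # assumed prior range (per day)
    amp_grid = np.linspace(1e-6, 2.0 * y.max() + 1e-6, n_amp)  # assumed range
    # Model predictions for every (rate, amplitude) pair: shape (R, A, T).
    pred = amp_grid[None, :, None] * np.exp(rate_grid[:, None, None] * t)
    sse = ((y - pred) ** 2).sum(axis=-1)
    # Noise level integrated out under a 1/sigma prior: posterior ~ SSE^(-T/2).
    log_post = -0.5 * len(y) * np.log(sse + 1e-12)
    post = np.exp(log_post - log_post.max())
    rate_post = post.sum(axis=1)  # marginalize over the amplitude
    rate_post /= rate_post.sum()
    return float(rate_post[rate_grid <= 0].sum())

# Synthetic 14-day window with a growth rate of 0.2 per day plus noise.
rng = np.random.default_rng(0)
days = np.arange(14)
window = 10 * np.exp(0.2 * days) + rng.normal(0, 1, size=14)
is_uptrend = uptrend_p_value(window) < 0.05
```

Sliding this window one day at a time and recording the first day on which the value drops below 0.05 would give an event date in the sense used here.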
The sequences of proxy-specific uptrends for an example state, New York (NY), are depicted in Figure 2. Upward-pointing triangles denote the date on which a growth event is identifiable. For the example state, COVID-19-related Twitter posts gave the earliest indication of increasing COVID-19 activity, exhibiting an uptrend around March 2. This was closely followed by uptrends in GLEAM-modeled infections, Google Searches for “fever”, fever incidence, and COVID-19-related searches by clinicians.
The activation order of COVID-19 early warning indicators in NY is characterized by earlier growth in proxies reflecting public sentiment than in more clinically-specific proxies. This ordering is broadly repeated across states (Figure 3a). COVID-19-related Twitter posts and Google Searches for “fever” were among the earliest proxies to activate, with Twitter activating first for 35 states and Google activating first for 7 states. UpToDate showed the latest activation among proxies, activating last in 37 states albeit still predating an uptrend in confirmed cases (Supplementary Figure S67). This analysis was conducted for all states excepting those with data unavailable due to reporting delays; event information was missing for deaths in 2 states, Kinsa in 1 state, and Excess ILI in 5 states.
Although data streams that directly measure public sentiment (i.e., Google and Twitter) are sensitive to the intensity of news reporting, we note that median growth in Google Searches for “fever” occurs within 3 days of median growth in fever incidence (as measured by Kinsa), suggesting that many searches may be driven by newly-ill people (Figure 3a). We additionally observed that the median lags between deaths and either fever incidences or Google Searches for “fever” were respectively 22 days and 21 days, broadly consistent with previous estimates that the average delay between the onset of symptoms and death is 20 days. Detailed time-series and event activation dates are displayed in Figures S18 to S66 in the Supplementary Materials.
As described in more detail in Section 4, a harmonic mean was taken across the $p$-values associated with each of the indicators. The harmonic mean was used because it does not require $p$-values across different proxies to be independent. Similar to the case for individual proxies, we defined detection of a growth event to occur when the harmonic mean $p$-value (HMP) decreases below 0.05.
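As a small worked illustration of the combination step, the sketch below computes a harmonic mean across per-proxy values for a single day. Equal weights are an assumption made here for simplicity, and the function name and example numbers are hypothetical.

```python
def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of per-proxy p-values (equal weights by default)."""
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: every proxy counts equally
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))

# Hypothetical values for three proxies on one day.
hmp = harmonic_mean_p([0.01, 0.5, 0.5])
combined_event = hmp < 0.05
```

Because the harmonic mean is dominated by its smallest inputs, one strongly significant proxy can pull the combined value below the threshold even when the others remain uninformative.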
Quantifying the timing of decay in proxies. An examination analogous to that made for the uptrend events was made to identify the timing of exponential decay (“downtrend”) in proxies and gold standard time series. On each day of the time series and for each proxy, a $p$-value is calculated to represent the posterior probability that the exponential rate $\gamma$ is not less than zero. An individual proxy’s downtrend was defined to occur on days when the associated $p$-value decreases below 0.05. A sample sequence of downtrends is also depicted in Figure 2, where downward-pointing triangles denote the date on which a decay event is identifiable. For the example state, Cuebiq and Apple mobility data gave the earliest indication of decreasing COVID-19 activity, exhibiting downtrends around March 15.
The added value of extending our analysis to include decay events is the ability to characterize downstream effects of NPIs. Specifically, opportunities to rapidly assess NPI effectiveness may arise if NPI influence on transmission rates is recorded in proxy time series before it is recorded in confirmed case or death time series. We used two smartphone-based metrics of human mobility that are indicators of population-wide propensity to travel within a given US county as provided by the location analytics company Cuebiq and by Apple. These mobility indicators are used as proxies for adherence to social distancing policies and are described in detail in section 4. Apple Mobility and the Cuebiq Mobility Index (CMI) are imperfect insofar as they do not account for several important transmission factors, including importing of cases from other states. Although local travel distances are incomplete proxies for the scale at which NPIs are performed, reductions in mobility have been shown to lead subsequent reduction in fever incidence by an average of 6.5 days - an interval approximately equal to the incubation period of COVID-19 - across hundreds of US counties , suggesting that they capture NPI-induced reductions in transmission rates. Figure 3b supports this observation, with median fever incidence lagging median CMI and Apple Mobility activation by an average of 8.5 days and 5.5 days, respectively. Our use of two distinct mobility metrics is intended to reduce the influence of systematic biases arising from the methodology of either metric.
The timing of the first downtrend is consistent between Apple Mobility and CMI (maximum difference of 4 days in median activation across all states with available data), with median downtrend activation for CMI preceding median activation of all other proxies and gold standard time series (Figure 3b). Median decay in these indices predated median decay in deaths and confirmed cases by a median of 6 and 5 weeks, respectively; CMI was first to activate in 60% of states (refer to Figure S68). GLEAM, Google Searches for “covid”, and UpToDate were among the latest proxies to activate across states. Median downtrend activation for Google Searches for “quarantine” - included as a surrogate measure of mobility - lagged CMI and Apple Mobility median downtrend activation by an average of 12 days and 10.5 days, respectively. Statistically significant events were not detected, or data were not available, for GLEAM in 22 states, Excess ILI in 5 states, Apple mobility in 1 state, deaths in 2 states, and confirmed cases in 7 states.
To complement our lead-lag analysis, we conducted a diagnostic post-hoc analysis using correlation coefficients between lagged time-series, described in further detail in Supplemental Materials (Supplementary Figures S70-S76).
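A lagged-correlation diagnostic of this kind can be sketched as follows. The details of the authors' analysis are in the Supplementary Materials and are not reproduced here; this sketch simply assumes two equal-length daily series and Pearson correlation, and the function name `best_lead` is hypothetical.

```python
import numpy as np

def best_lead(proxy, target, max_lag=30):
    """Return the lead (in days) at which the proxy best correlates with the target."""
    proxy = np.asarray(proxy, dtype=float)
    target = np.asarray(target, dtype=float)
    scores = {}
    for lag in range(max_lag + 1):
        a = proxy[: len(proxy) - lag]  # proxy values observed `lag` days earlier
        b = target[lag:]               # target values they are compared against
        scores[lag] = np.corrcoef(a, b)[0, 1]
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Synthetic check: the target is the proxy delayed by 5 days.
rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=120))
delayed = np.concatenate([np.full(5, x[0]), x[:-5]])
lead, r = best_lead(x, delayed, max_lag=20)
```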
Applying early signals by proxies to the prediction of future outbreak timing. We hypothesize that an early warning system for COVID-19 could be derived from uptrend event dates across a network of proxies. For each state, $\Delta x_{i,t}$ is defined as the number of days since an uptrend event for proxy $i$, where $t$ is the current date. A posterior probability distribution of an uptrend in confirmed COVID-19 cases is then estimated for each state conditional on the collection of proxies, $p(y \mid \Delta x_{1,t}, \ldots, \Delta x_{n,t})$, where each proxy is treated as an independent expert predicting the probability of a COVID-19 event. In this case, $n$ is the number of proxies. See Section 4 for a more detailed explanation. A similar analysis is also feasible for downtrends. This method is introduced to better formalize predictions of growth in gold standard indicators using a network of proxies, but further evaluation is required, including relative to subsequent “waves” of COVID-19 cases.
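One way to picture the "independent experts" idea is a product of per-proxy densities over candidate event dates. The sketch below is an illustration under strong assumptions, not the method of Section 4: each proxy is assumed to predict the case uptrend at its own event day plus a typical lag, with Gaussian uncertainty, and the lag means and spreads are hypothetical inputs.

```python
import numpy as np

def combine_proxy_experts(event_days, lag_means, lag_sds, day_grid):
    """Multiply independent Gaussian 'expert' densities over candidate event days."""
    day_grid = np.asarray(day_grid, dtype=float)
    log_density = np.zeros_like(day_grid)
    for day, mean_lag, sd in zip(event_days, lag_means, lag_sds):
        # Each proxy predicts the case uptrend at (its event day + typical lag).
        log_density += -0.5 * ((day_grid - (day + mean_lag)) / sd) ** 2
    density = np.exp(log_density - log_density.max())
    return density / density.sum()

grid = np.arange(0, 61)
# Two hypothetical proxies: events on days 0 and 5, typical lags of 20 and 15 days.
posterior = combine_proxy_experts([0, 5], [20, 15], [5, 5], grid)
single = combine_proxy_experts([0], [20], [5], grid)
```

Under these assumptions, each additional proxy event narrows the combined distribution, which mirrors the concentration of probability mass described for Figure 4.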
Figure 4a shows uptrend events from proxies (vertical solid lines) and the predicted uptrend probability distribution for confirmed COVID-19 cases (in red) overlaid on confirmed COVID-19 cases (gray). As more proxy-derived uptrend events are observed, the probability mass of the predicted uptrend event becomes increasingly concentrated in the vicinity of identified exponential growth in confirmed COVID-19 cases (long vertical solid line). In the case of NY, exponential growth is identified in confirmed COVID-19 cases in the earlier part of the prediction distribution, though for most states it occurs near the center as follows from how the prediction is estimated. The right panel of Figure 4b similarly shows downtrend events in proxies and the estimated downtrend posterior probability distribution for decay in daily reports of confirmed COVID-19 cases. The downtrend posterior probability distribution has a greater variance than the uptrend posterior probability distribution, with the true downtrend event again occurring earlier in this high variance distribution. A visualization of the probability distribution for all the states is included in the Supplementary Materials (Figures S68 and S69).
Figure 3: Event detection results for pairwise comparisons between COVID-19 proxies and gold standards for US states with available data. (a) Boxplots showing proxy-specific uptrends, or intervals of significant exponential growth relative to deaths, confirmed COVID-19 cases, and Excess ILI. (b) Boxplots showing proxy-specific downtrends. Boxplots indicate the median (central vertical line), interquartile range (vertical lines flanking the median), extrema (whiskers), and outliers (dots); differences between input variable (y-axis) and response variable (title) exceeding 50 days are omitted. Negative differences indicate the input variable event activation preceded the response variable event activation. Deaths, cases, and Excess ILI, as well as the combined measure defined in Figure 2, are also included for purposes of intercomparing gold standards. Boxplots are sorted according to median value and shifted to offset delays in real-time availability. Only the event activations within the first wave are considered; the box plots therefore do not account for subsequent minor activations.
Here we have assessed the utility of various digital data streams, individually and collectively, as components of a near real-time COVID-19 early warning system. Specifically, we focused on identifying early signals of impending outbreak characterized by significant exponential growth and subsiding outbreak characterized by exponential decay. We found that COVID-19-related activity on Twitter showed significant growth 2-3 weeks before such growth occurred in confirmed cases and 3-4 weeks before such growth occurred in reported deaths. We also observed that for exponential decay, NPIs - as represented by reductions in human mobility - predated decreases in confirmed cases and deaths by 5-6 weeks. Clinicians’ search activity, fever data, estimates from the GLEAM metapopulation mechanistic epidemiological model, and Google Searches for COVID-19-related terms were similarly found to anticipate changes in COVID-19 activity. We also developed a consensus indicator of COVID-19 activity using the harmonic mean of all proxies. This combined indicator predated an increase in COVID-19 cases by a median of 19.5 days, an increase in COVID-19 deaths by a median of 29 days, and was synchronous with Excess ILI. Such a combined indicator may provide timely information, like a “thermostat” in a heating or cooling system, to guide intermittent activation, intensification, or relaxation of public health interventions as the COVID-19 pandemic evolves.
The most reliable metric for tracking the spread of COVID-19 remains unclear, and all metrics discussed in this study feature important limitations. For example, a recent study has shown that confirmed US cases of COVID-19 may not necessarily track the evolution of the disease considering limited testing frequency at early stages of the pandemic . While deaths may seem a more accurate tracker of disease evolution, they are limited in their real-time availability, as they tend to lag cases by nearly 20 days . Influenza-like illness (ILI) activity, anomalies in which may partly reflect COVID-19 , similarly suffers from a lag in availability because reports are released with a 5-12 day lag; for simplicity, we approximated this lag as 10 days in our analysis. Furthermore, a decrease in ILI reporting is frequently observed after flu season (which ended April 4 2020 and will begin again in October 2020), rendering ILI-based analyses useful only when surveillance systems are fully operational. Ref.  supports this conjecture, reporting a rapid decrease in the number of ILI patients reported in late March 2020 despite the number of reporting providers remaining largely unchanged. This decrease may also be attributed to patients foregoing treatment for milder, non-COVID-19-attributed ILI. Hospitalizations, though possibly less biased than confirmed case numbers, were ultimately omitted due to sparseness and poor quality of data.
The near real-time availability of digital data streams can facilitate tracking of COVID-19 activity by public health officials. Increases in discussions of disease terminology on Twitter and Google, for example, may be early signals of increase in COVID-19 activity (Figures 2, 3). These data streams have been used in the past to track other infectious diseases in the US [56, 82, 23]. Although Twitter and Google Trends both show growth and decay well ahead of confirmed cases, deaths, and ILI, it is unclear if their activity is in fact representative of disease prevalence. This activity may instead reflect the intensity of news coverage [71, 35, 3] and could perhaps be triggered by “panic searches” following the identification of several cases. Such false positives in social and search activity may be reduced by confirmatory use of UpToDate, whose clinician-restricted access and clinical focus limit the likelihood that searches are falsely inflated . Kinsa data may be used in a similar confirmatory capacity as they directly report user symptoms. However, the number of users and their demographics, as well as several aspects of the incidence estimation procedure, are not disclosed by Kinsa , limiting our ability to account for possible sources of bias in their data.
Given the near-ubiquity of smartphones in the US , smartphone-derived mobility data may reflect aspects of the local population response to COVID-19. We found that decreasing mobility - as measured by Apple and Cuebiq - predated decreases in deaths and cases by 6 and 5 weeks, respectively. Our results may be compared to documented findings that decreases in mobility preceded decreases in COVID-19 cases by up to 1 week in China [8, 32], by up to 2 weeks in Brazil , by up to 3 weeks in certain US states , and by up to 3 weeks globally . This variability may be attributed to differences in swiftness and strictness of implementing mobility restrictions, as well as discrepancies in definitions of “decrease.”
In contrast to the aforementioned digital data streams, GLEAM assimilates many epidemiological parameters (e.g., literature-derived incubation period and generation time [1, 30, 78]) essential for describing COVID-19. Coupled with Internet sources, GLEAM can provide a more robust outbreak detection method because its different framework yields errors that are at least partially uncorrelated with those of the Internet-derived sources. Uptrends in the estimates produced by this model precede increases in cases and deaths by a median of 15 and 22 days, respectively (Figure 3a). However, some parameters are excluded from the model due to lack of availability (e.g., age-related COVID-19 susceptibility). These excluded parameters - coupled with the need to regularly update existing parameters - may lead to sub-optimal downstream performance by the model.
The analysis we presented focuses largely on temporally analyzing different data streams that are aggregated to the state level. This level of aggregation is able to provide a coarse overview of regional differences within the US. Smart thermometer, Twitter, and mobility data may help us replicate our analysis at finer spatial resolutions, making them suitable for capturing both regional and local effects. It follows that a promising future research avenue is the detection of local COVID-19 clusters (“hotspots”) through more spatially-resolved approaches. Such an approach would better identify at-risk populations and, therefore, permit more targeted NPIs. Whereas the data streams that we analyze do not capture important population dynamics, integrated spatial, temporal, and semantic analysis of web data could give a more nuanced understanding of public reaction, such as estimating changes in the public emotional response to epidemics.
Using an exponential model to characterize the increase (and decrease) in activity of a COVID-19 proxy offers various advantages in event detection. Our current procedure is capable of estimating the value of the exponential rate $\gamma$ together with a measure of the confidence that $\gamma > 0$ or $\gamma < 0$. In this work, we provide event dates based on a confidence of 95% ($p$-value $< 0.05$). The degree of confidence can be adjusted to provide earlier event dates (at the cost of less confidence and, consequently, more frequent false positives). $p$-values are combined into a single metric using a harmonic mean, but a more sensitive design may be realized by assuming independence between the proxies and using Fisher’s method. Although this would generally lead to a lower combined $p$-value given decreases in any individual proxy $p$-value (i.e., higher sensitivity), assuming independence would make such an approach more prone to false positives (i.e., lower specificity) than the HMP, which makes no independence assumption. The choice of method for combining proxy indicators requires a trade-off between specificity and sensitivity.
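The sensitivity difference between the two combination rules can be seen in a small numerical sketch. The values below are hypothetical, and Fisher's method is written out with the closed-form chi-square tail for an even number of degrees of freedom so that only the standard library is needed.

```python
import math

def fisher_p(p_values):
    """Fisher's combined p-value; valid only if the inputs are independent."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X/2, with X = -2 * sum(ln p)
    # Survival function of a chi-square variable with 2k degrees of freedom.
    terms = (half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * sum(terms)

# Hypothetical per-proxy values for one day.
p_values = [0.04, 0.2, 0.3]
hmp = len(p_values) / sum(1.0 / p for p in p_values)  # equal-weight harmonic mean
fisher = fisher_p(p_values)
```

For these inputs the harmonic mean is 0.09 while Fisher's method gives roughly 0.06, which illustrates the lower combined value (higher sensitivity) obtained when independence is assumed.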
The ability to detect future epidemic changes depends on the stability of historical patterns observed in multiple measures of COVID-19 activity. We posit that using multiple measures in our COVID-19 early warning system leads to improved performance and robustness to measure-specific flaws. The probabilistic framework we developed (Figure 4) also gives decision-makers the freedom to decide how conservative they want to be in interpreting and consolidating different measures of COVID-19 activity (i.e., by revising the $p$-value required for event detection). Although we can expect COVID-19 activity to increase in the future given continued opportunities for transmission, the human population in which said transmission occurs may not remain identical in terms of behavior, immune status, or risk. For example, death statistics in the early pandemic have been driven by disease activity in nursing homes and long-term care facilities, but the effect of future COVID-19 waves in these settings may be attenuated by better safeguards implemented after the first wave. Our approach combines multiple measures of COVID-19 such that these changes in population dynamics - if reflected in any measure - would presumably be reflected in the event-specific probability distribution.
4 Data and Methods
For our study, we collected the following data streams, reported daily unless otherwise noted: 1) official COVID-19 reports from three different organizations, 2) ILI cases, as reported weekly by the ILINet, 3) COVID-19-related search term activity from UpToDate and Google Trends, 4) Twitter microblogs, 5) fever incidence as recorded by a network of digital thermometers distributed by Kinsa, and 6) human mobility data, as reported by Cuebiq and Apple.
COVID-19 Case Reports: Every state in the US is providing daily updates about its COVID-19 situation as a function of testing. Subject to state-to-state and organizational variability in data collection, these reports include information about the daily number of positive, negative, pending, and total COVID-19 tests, hospitalizations, ICU visits, and deaths. Daily efforts in collecting data by research and news organizations have resulted in several data networks, from which official health reports have been made available to the public. The three predominant data networks are: the Johns Hopkins Resource Center, the CovidTracking project, and the New York Times Repository [12, 46, 10]. We obtained daily COVID-19 testing summaries from all three repositories with the purpose of analyzing the variability and quality in the data networks.
ILINet: Influenza-like illnesses (ILI) are characterized by fever and either a cough or sore throat. An overlap in symptoms of COVID-19 and ILI has been observed, and it has further been shown that ILI signals can be useful in the estimation of COVID-19 incidence when testing data is unavailable or unreliable .
ILINet is a sentinel system created and maintained by the US CDC [6, 17] that aggregates information from clinicians’ reports on patients seeking medical attention for ILI symptoms. ILINet provides weekly estimates of ILI activity with a lag of 5-12 days; because detailed delay information is unavailable, we arbitrarily apply a lag of 10 days throughout this work. At the national-level, ILI activity is estimated via a population-weighted average of state-level ILI data. ILINet data are unavailable for Florida.
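The population-weighted national average mentioned above amounts to a single weighted mean. A minimal sketch follows; the state values and populations are made up for illustration and the function name is hypothetical.

```python
import numpy as np

def national_ili(state_ili, state_population):
    """Population-weighted average of state-level ILI activity."""
    return float(np.average(state_ili, weights=state_population))

# Two hypothetical states: 2% ILI with 1M residents, 4% ILI with 3M residents.
national = national_ili([2.0, 4.0], [1_000_000, 3_000_000])
```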
The CDC also releases data on laboratory test results for influenza types A and B, shared by labs collaborating with the World Health Organization (WHO) and the National Respiratory and Enteric Virus Surveillance System (NREVSS). Both ILI activity and virology data are available from the CDC FluView dashboard .
We followed the methodology of Ref. to estimate unusual ILI activity, a potential signal of a new outbreak such as COVID-19. In particular, we employed the divergence-based methods, which treat COVID-19 as an intervention and try to measure the impact of COVID-19 on ILI activity by constructing two control time series representing the counterfactual 2019-2020 influenza season had the COVID-19 outbreak not occurred.
The first control time series is based on an epidemiological model, specifically the Incidence Decay and Exponential Adjustment (IDEA) model. IDEA models disease prevalence over time while accounting for factors such as control activities that may dampen the spread of a disease. The model is written as follows:
$$I(t) = \left( \frac{R_0}{(1 + d)^{t}} \right)^{t}$$
where $I(t)$ represents the incident case count at serial interval time step $t$, $R_0$ represents the basic reproduction number, and $d$ is a discount factor modeling reductions in the effective reproduction number over time.
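The IDEA curve is simple enough to evaluate and fit directly. The sketch below uses the standard IDEA form, in which incidence at step t equals the basic reproduction number divided by (1 + d)^t, all raised to the power t. The grid-search fit is an illustrative choice rather than the fitting procedure used in this work, and time steps are assumed to start at 1.

```python
import numpy as np

def idea_incidence(t, r0, d):
    """IDEA model: incident cases at serial-interval step t."""
    t = np.asarray(t, dtype=float)
    return (r0 / (1.0 + d) ** t) ** t

def fit_idea(counts, r0_grid, d_grid):
    """Least-squares grid search for (r0, d); assumes steps t = 1, 2, ..."""
    counts = np.asarray(counts, dtype=float)
    t = np.arange(1, len(counts) + 1)
    best = None
    for r0 in r0_grid:
        for d in d_grid:
            sse = float(((counts - idea_incidence(t, r0, d)) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, float(r0), float(d))
    return best[1], best[2]

# Recover known parameters from noise-free synthetic counts.
synthetic = idea_incidence(np.arange(1, 9), 1.8, 0.02)
r0_hat, d_hat = fit_idea(synthetic, np.linspace(1.0, 3.0, 201), np.linspace(0.0, 0.1, 101))
```

With d = 0 the curve reduces to pure exponential growth in the basic reproduction number; a positive d bends it downward over time.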
In line with the approach of Ref., we fit the IDEA model to ILI case counts from the start of the 2019-2020 flu season to the last week of February 2020, where the start of flu season is defined as the first occurrence of two consecutive weeks with an ILI activity above . The serial interval used was half a week, consistent with the influenza serial interval estimates from .
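The season-start rule (first occurrence of two consecutive weeks above a threshold) can be written as a short scan. The threshold is left as a parameter because its value is not given in this text; the function name and the example series are hypothetical.

```python
def season_start(ili_activity, threshold):
    """Index of the first of two consecutive weeks with ILI activity above threshold."""
    for week in range(len(ili_activity) - 1):
        if ili_activity[week] > threshold and ili_activity[week + 1] > threshold:
            return week
    return None  # no qualifying pair of weeks found

# Week 1 exceeds the threshold alone; weeks 3 and 4 are the first consecutive pair.
start = season_start([1.0, 3.0, 1.0, 3.0, 3.0, 4.0], threshold=2.0)
```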
The second control time series used the CDC’s virological influenza surveillance data. For any week $t$ the following was computed:
$$F_t = \frac{F_t^{+}}{N_t} \, I_t$$
where $F_t^{+}$, $N_t$, $I_t$, and $F_t$ denote positive flu tests, total specimens, ILI visit counts, and the true underlying flu counts, respectively. This can be interpreted as the extrapolation of the positive test percentage to all ILI patients. Least squares regression (fit on pre-COVID-19 data) is then used to map $F_t$ to an estimate of ILI activity.
The differences between the observed ILI activity time series and these counterfactual control time series can then be used as signals of COVID-19 activity. In particular, we used the difference between observed ILI activity and the virology-based counterfactual control time series to produce Excess ILI. The supplementary materials show Excess ILI computed using both the virology time series and IDEA model-based counterfactuals for all states.
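The virology-based counterfactual and the resulting Excess ILI can be sketched end to end. This is a simplified illustration under stated assumptions: a straight-line least-squares fit on the pre-COVID-19 weeks, made-up weekly numbers, and hypothetical names (`excess_ili`, `n_pre`).

```python
import numpy as np

def excess_ili(flu_positive, specimens, ili_visits, ili_activity, n_pre):
    """Observed ILI activity minus a virology-based counterfactual."""
    flu_positive = np.asarray(flu_positive, dtype=float)
    specimens = np.asarray(specimens, dtype=float)
    ili_visits = np.asarray(ili_visits, dtype=float)
    ili_activity = np.asarray(ili_activity, dtype=float)
    # Extrapolate the positive-test fraction to all ILI visits.
    flu_est = flu_positive / specimens * ili_visits
    # Linear map from estimated flu counts to ILI activity, fit on the first n_pre weeks.
    slope, intercept = np.polyfit(flu_est[:n_pre], ili_activity[:n_pre], 1)
    counterfactual = slope * flu_est + intercept
    return ili_activity - counterfactual

# Four pre-COVID-19 weeks follow the linear relation exactly; week 5 sits 1.5 above it.
excess = excess_ili(
    flu_positive=[10, 20, 30, 40, 20],
    specimens=[100] * 5,
    ili_visits=[1000] * 5,
    ili_activity=[2.0, 3.0, 4.0, 5.0, 4.5],
    n_pre=4,
)
```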
UpToDate Trends: UpToDate is a private-access search database - part of Wolters Kluwer, Health - with clinical knowledge about diseases and their treatments. It is used by physicians around the world and the majority of academic medical centers in the US as a clinical decision support resource given the stringent standards on information within the database (in comparison to Google Trends, information provided within the database is heavily edited and authored by experienced clinicians) .
Recently, UpToDate has made available a visualization tool on their website in which they compare their search volumes of COVID-19-related terms to Johns Hopkins University official health reports. The visualization shows that UpToDate trends may have the potential to track confirmed cases of COVID-19. From this tool, we obtained UpToDate’s COVID-19-related search frequencies for every US state. These search frequencies consist only of one time series described as “normalized search intensity” for selected COVID-19-related terms, where normalization is calculated as the total number of COVID-19-related search terms divided by the total number of searches within a location. At the time of analysis, the website visualization appeared to update with a 3 day delay; however, UpToDate has since become operationally capable of producing time series with delays of 1 day. More details are available at https://covid19map.uptodate.com/.
Google Trends: Google Search volumes have been shown to track successfully with various diseases such as influenza [39, 41], Dengue, and Zika [45, 75], among others. In recent months, Google has even created a “Coronavirus Search Trends” page that tracks trending pandemic-related searches. We obtained such daily COVID-19-related search trends through the Google Trends Application Programming Interface (API). The original search terms queried using the Google Trends API were composed of: a) official symptoms of COVID-19, as reported by the WHO, b) a list of COVID-19-related terms shown to have the potential to track confirmed cases, and c) a list of search terms previously used to successfully track ILI activity. The list of terms can be seen in Table 1. For purposes of the analysis, we narrowed down the list of Google Search terms to those we felt to be most representative of the pandemic to date: “fever”, “covid”, and “quarantine.” Because the lexicon is naturally subject to change as the pandemic progresses, however, other terms may become more suitable downstream.
|anosmia, chest pain, chest tightness, cold, cold symptoms, cold with fever, contagious flu, cough, cough and fever, cough fever, covid, covid nhs, covid symptoms, covid-19, covid-19 who, dry cough, feeling exhausted, feeling tired, fever, fever cough, flu and bronchitis, flu complications, how long are you contagious, how long does covid last, how to get over the flu, how to get rid of flu, how to get rid of the flu, how to reduce fever, influenza, influenza b symptoms, isolation, joints aching, loss of smell, loss smell, loss taste, nose bleed, oseltamivir, painful cough, pneumonia, pregnant and have the flu, quarantine, remedies for the flu, respiratory flu, robitussin, robitussin cf, robitussin cough, rsv, runny nose, sars-cov 2, sars-cov-2, sore throat, stay home, strep, strep throat, symptoms of bronchitis, symptoms of flu, symptoms of influenza, symptoms of influenza b, symptoms of pneumonia, symptoms of rsv, tamiflu dosage, tamiflu dose, tamiflu drug, tamiflu generic, tamiflu side effects, tamiflu suspension, tamiflu while pregnant, tamiflu wiki, tessalon|
Twitter API: We developed geocrawler software to collect as much georeferenced social media data as possible in a reasonable time. This software requests data from Twitter’s APIs. Twitter provides two types of APIs to collect tweets: REST and streaming. The REST API offers various endpoints to use Twitter functionalities, including the “search/tweets” endpoint that enables, with limitations, the collection of tweets from the last seven days. These limitations complicate the collection process, necessitating a complementary strategy to manage the fast-moving time window of the API in order to harvest all offered tweets with a minimal number of requests. In contrast, the streaming API provides a real-time data stream that can be filtered using multiple parameters.
Our software uses both APIs to request tweets that carry location information, either as a point coordinate from the positioning device of the mobile device used for tweeting or as a rectangular outline based on a geocoded place name. The combination of APIs makes crawling robust against interruptions or backend issues that would lead to missing data. For example, if data from the streaming API cannot be stored in time, the missing data can be retrieved via the redundant REST API.
All collected tweets are located within the US. To limit the dataset to COVID-19-relevant tweets, we performed simple keyword-based filtering using the keywords listed in Table 2. This method was chosen for reasons of performance, delivery of results in near real-time, and its simplicity. While a machine learning-based semantic clustering method like Guided Latent Dirichlet Allocation (LDA) may deliver more comprehensive results (e.g., through identifying co-occurring and unknown terms), controlling the ratio between false positives and false negatives requires extensive experimental work and expert knowledge.
|covid, corona, epidemic, flu, influenza, face mask, spread, virus, infection, fever, panic buying, state of emergency, masks, quarantine, sars, 2019-ncov|
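A keyword filter of the kind described above might look like the following sketch. The keyword list is taken from Table 2; the choice of case-insensitive substring matching (rather than whole-word matching) and the function name `is_covid_relevant` are assumptions made for illustration.

```python
import re

# Keywords from Table 2.
KEYWORDS = [
    "covid", "corona", "epidemic", "flu", "influenza", "face mask", "spread",
    "virus", "infection", "fever", "panic buying", "state of emergency",
    "masks", "quarantine", "sars", "2019-ncov",
]

# One compiled alternation; re.escape guards the hyphen in "2019-ncov".
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_covid_relevant(text):
    """Return True if the tweet text contains any keyword (substring match, an assumption)."""
    return PATTERN.search(text) is not None


tweets = ["Stuck in QUARANTINE again", "Nice weather today"]
relevant = [t for t in tweets if is_covid_relevant(t)]
```

Substring matching is fast and simple, but it also admits false positives (for example, "flu" matches inside "influence"), which is the kind of trade-off between false positives and false negatives noted above.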
Kinsa Smart Thermometer Data: County-level estimates of US fever incidence are provided by Kinsa Insights using data from a network of volunteers who have agreed to regularly record and share their temperatures (https://healthweather.us/). Estimates from past years have been shown to correlate closely with reported ILI incidence from the US CDC across regions and age groups. Such historical data, coupled with county-specific characteristics (e.g., climate and population size), are used to establish the “expected”, or forecast, number of fevers as a function of time [47, 11]. An “excess fevers” time series presumed to represent COVID-19 cases is approximated as the difference between the observed fever incidence and the forecast, with values truncated at zero so that negative excess fevers are not possible. County-level data are aggregated up to the state level by a population-weighted average. A limitation of tracking febrility as a proxy for COVID-19 is that it is not a symptom exclusive to COVID-19, nor is it present in all COVID-19 patients. In a study with more than 5000 patients from New York who were hospitalized with COVID-19, only 30% presented with fever (100.4°F/38°C) at triage.
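The truncation and aggregation steps for excess fevers can be written compactly. The sketch below uses hypothetical function names and toy numbers; it only illustrates the arithmetic described above, not Kinsa's forecasting model.

```python
def excess_fevers(observed, forecast):
    """Observed minus forecast fever incidence, truncated at zero."""
    return [max(o - f, 0.0) for o, f in zip(observed, forecast)]


def population_weighted_average(county_values, county_populations):
    """Aggregate county-level values to one state-level value."""
    total = sum(county_populations)
    return sum(v * p for v, p in zip(county_values, county_populations)) / total


# Toy example: two counties on a single day (values are made up).
county_excess = [1.0, 3.0]
state_excess = population_weighted_average(county_excess, [100, 300])
```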
Cuebiq Mobility Index: Data are provided by the location analytics company Cuebiq, which collects first-party location information from smartphone users who opted in to anonymously provide their data through a General Data Protection Regulation-compliant framework; this first-party data is not linked to any third-party data. Cuebiq further anonymizes, then aggregates location data into an index defined as the base-10 logarithm of median distance traveled per user per day; “distance” is measured as the diagonal distance across a bounding box that contains all GPS points recorded for a particular user on a particular day. A county index of 3.0, for example, indicates that a median county user has traveled 1,000 m. We were provided with county-level data, derived from these privacy preservation steps, which we then aggregated up to the state level by a population-weighted average.
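To make the index definition concrete, the sketch below computes the base-10 logarithm of the median bounding-box diagonal across users for one day. It is an illustration only: the use of the haversine great-circle formula for the diagonal, the Earth radius constant, and all function names are assumptions, and Cuebiq's anonymization steps are not represented.

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bbox_diagonal_m(points):
    """Diagonal of the bounding box containing all of one user's points for a day."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return haversine_m(min(lats), min(lons), max(lats), max(lons))


def mobility_index(daily_points_by_user):
    """log10 of the median per-user bounding-box diagonal (metres)."""
    med = median(bbox_diagonal_m(pts) for pts in daily_points_by_user.values())
    if med <= 0:
        raise ValueError("median distance must be positive to take a logarithm")
    return math.log10(med)


# One user who moved 1,000 m along the equator yields an index close to 3.0.
lon_1km = math.degrees(1000.0 / EARTH_RADIUS_M)
index = mobility_index({"user_a": [(0.0, 0.0), (0.0, lon_1km)]})
```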
Apple Mobility: Apple mobility data is generated by counting the number of requests made to Apple Maps for directions in select countries/regions, sub-regions, and cities. Data that is sent from users’ devices to the Maps service is associated with random, rotating identifiers so Apple does not have a profile of users’ movements and searches. The availability of data in a particular country/region, sub-region, or city is based on a number of factors, including minimum thresholds for direction requests per day. More details are available at https://www.apple.com/covid19/mobility.
Global Epidemic and Mobility Model (GLEAM): GLEAM is a spatially structured epidemic model that integrates population and mobility data with an individual-based stochastic infectious disease dynamic to track the global spread of a disease [8, 2, 24, 83]. The model divides the world into more than 3,200 subpopulations, with mobility data between subpopulations including air travel and commuting behavior. Air travel data are obtained from origin-destination traffic flows from the Official Aviation Guide (OAG) and the IATA databases [27, 52], while short-range mobility flows as commuting behavior are derived from the analysis and modeling of data collected from the statistics offices for 30 countries on five continents. Infectious disease dynamics are modeled within each subpopulation using a compartmental representation of the disease where each individual can occupy a single disease state: Susceptible ($S$), Latent ($L$), Infectious ($I$) and Removed ($R$). The infection process is modeled by using age-stratified contact patterns at the state level. These contact patterns incorporate interactions that occur in four settings: school, household, workplace, and the general community. Latent individuals progress to the infectious stage with a rate inversely proportional to the latent period. Infectious individuals progress into the removed stage with a rate inversely proportional to the infectious period. The sum of the average latent and infectious periods defines the generation time. Removed individuals represent those who can no longer infect others, as they were either isolated, hospitalized, died, or have recovered. To take into account mitigation policies adopted widely across the US, we incorporated a reduction in contacts in the school, workplace, and community settings (which is reflected in the contact matrices used for each state). Details on the timeline of specific mitigation policies implemented are described in Ref. . 
A full discussion of the model for COVID-19 is reported in Ref. .
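The compartmental progression described above (Susceptible, Latent, Infectious, Removed, with transition rates inversely proportional to the latent and infectious periods) can be illustrated with a heavily simplified sketch. This is not GLEAM: it is a deterministic, single-population, discrete-time approximation with no mobility, age structure, or contact matrices, and the transmission rate `beta` and all numerical values are hypothetical.

```python
def slir_step(state, beta, latent_period, infectious_period, dt=1.0):
    """Advance (S, L, I, R) by one Euler step of length dt (days)."""
    s, l, i, r = state
    n = s + l + i + r
    new_latent = beta * s * i / n * dt          # S -> L through contact with I
    new_infectious = l / latent_period * dt     # L -> I at rate 1 / latent period
    new_removed = i / infectious_period * dt    # I -> R at rate 1 / infectious period
    return (
        s - new_latent,
        l + new_latent - new_infectious,
        i + new_infectious - new_removed,
        r + new_removed,
    )


def simulate(state, days, beta, latent_period, infectious_period):
    """Return the trajectory of (S, L, I, R) over the given number of days."""
    trajectory = [state]
    for _ in range(days):
        state = slir_step(state, beta, latent_period, infectious_period)
        trajectory.append(state)
    return trajectory


# Hypothetical parameters chosen only to exercise the code.
trajectory = simulate((990.0, 0.0, 10.0, 0.0), days=30,
                      beta=0.5, latent_period=3.0, infectious_period=5.0)
```

A reduction in contacts, such as the mitigation policies mentioned above, would correspond in this toy version to lowering `beta` over time.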
Growth and decay event detection: Here we describe a simple Bayesian method that estimates the probability of exponential growth in a time series that features uncertain error variance. We model a proxy time series as following exponential growth,

$$y(t) = \beta \, e^{\gamma (t - t_0)} + \epsilon(t),$$

over successive 14-day intervals beginning at $t_0$. Before inference, proxies are adjusted to have a common minimum value of 1. The error, $\epsilon$, is assumed Gaussian with zero mean and standard deviation $\sigma$. We assess the probability that $\gamma$ is greater than zero over each successive window. The joint distribution of $\beta$ and $\gamma$, conditional on $\sigma$, is proportional to $p(y \mid \beta, \gamma, \sigma)\, p(\beta)\, p(\gamma)$. Prior distributions, $p(\beta)$ and $p(\gamma)$, are specified as uniform and uninformative, and posterior draws are obtained using the Metropolis-Hastings algorithm. The first 500 samples are discarded to remove the influence of initial parameter values and, to reduce autocorrelation between draws, only every fifth sample is retained. The conditional posterior distribution for $\sigma^2$ is inverse-Gamma and is obtained using Gibbs sampling [20, 14],

$$\sigma^2 \mid y, \beta, \gamma \sim \Gamma^{-1}\!\left(a + \frac{N}{2},\; b + \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \beta \, e^{\gamma (t_i - t_0)}\right)^2\right),$$

where $\Gamma^{-1}$ is the inverse-Gamma distribution, $y$ is the vector of observations, and $N$ is the number of observations. Terms $a$ and $b$ are, respectively, specified to equal 4 and 1. On any given day, a $p$-value for rejecting the null hypothesis of no exponential growth is obtained as the fraction of posterior draws with $\gamma \le 0$. The procedure is repeated on successive days to obtain a time series of $p$-values.
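A Metropolis-within-Gibbs sampler of the kind described above could be sketched as follows, for a model with amplitude `beta`, growth rate `gamma`, and Gaussian error variance `sigma2`. The burn-in of 500, the thinning interval of 5, and the inverse-Gamma hyperparameters 4 and 1 follow the text. Everything else is an assumption made for illustration: the total number of draws (`n_draws`), the random-walk proposal widths (`steps`), the starting values, and the function names. The input series is assumed to have already been adjusted to a minimum value of 1.

```python
import math
import random


def sse(y, t, beta, gamma):
    """Sum of squared residuals for y_i = beta * exp(gamma * (t_i - t_0))."""
    t0 = t[0]
    return sum((yi - beta * math.exp(gamma * (ti - t0))) ** 2 for yi, ti in zip(y, t))


def growth_p_value(y, t, n_draws=5000, burn_in=500, thin=5,
                   a=4.0, b=1.0, steps=(0.1, 0.01), seed=0):
    """Fraction of retained posterior draws with gamma <= 0."""
    rng = random.Random(seed)
    n = len(y)
    beta, gamma, sigma2 = y[0], 0.0, 1.0  # arbitrary starting values (assumption)
    current = sse(y, t, beta, gamma)
    kept = []
    for draw in range(n_draws):
        # Metropolis-Hastings: symmetric random-walk proposal for (beta, gamma).
        beta_new = beta + rng.gauss(0.0, steps[0])
        gamma_new = gamma + rng.gauss(0.0, steps[1])
        proposed = sse(y, t, beta_new, gamma_new)
        # Flat priors cancel, leaving the Gaussian likelihood ratio.
        log_ratio = (current - proposed) / (2.0 * sigma2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            beta, gamma, current = beta_new, gamma_new, proposed
        # Gibbs: sigma^2 | rest ~ Inv-Gamma(a + n/2, b + SSE/2),
        # drawn as scale / Gamma(shape, 1).
        shape = a + n / 2.0
        scale = b + current / 2.0
        sigma2 = scale / rng.gammavariate(shape, 1.0)
        # Discard burn-in and keep every `thin`-th draw thereafter.
        if draw >= burn_in and (draw - burn_in) % thin == 0:
            kept.append(gamma)
    return sum(g <= 0.0 for g in kept) / len(kept)


# Synthetic 14-day windows: one growing, one decaying (both with minimum 1).
days = list(range(14))
rising = [math.exp(0.2 * k) for k in days]
falling = [math.exp(0.2 * (13 - k)) for k in days]
p_rising = growth_p_value(rising, days)
p_falling = growth_p_value(falling, days)
```

A fixed-width random-walk proposal is the simplest choice but can mix slowly once the error variance becomes small; an adaptive proposal would be a natural refinement.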
Our current approach has some important limitations. The mean value in a time series is not inferred, and a highly simplified treatment of errors neglects the possibility of autocorrelation and heteroscedasticity. A more complete treatment might employ a more sophisticated sampling strategy and jointly infer (rather than impose) a mean proxy value, non-zero error mean, error autoregressive parameter(s), and heteroscedasticity across each 14-day window. Extensions to the current work will include these considerations.
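Multi-Proxy $p$-Value: $p$-values estimated across multiple proxies are combined into a single metric representing the family-wise probability that the growth rate is not positive ($\gamma \le 0$). Because proxies cannot be assumed independent, we use the harmonic mean $p$-value,

$$\mathring{p} = \frac{\sum_{i=1}^{k} w_i}{\sum_{i=1}^{k} w_i / p_i},$$

where $p_i$ is the $p$-value of proxy $i$ among $k$ proxies, and $w_i$ are weights that sum to 1 and, for the purposes of our analyses, are treated as equal.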
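The weighted harmonic mean of p-values is straightforward to compute. In the sketch below, the function name and the small floor `eps` (used to avoid division by zero when a sampled p-value is exactly 0) are assumptions added for illustration.

```python
def harmonic_mean_p(p_values, weights=None, eps=1e-12):
    """Weighted harmonic mean of p-values; equal weights by default."""
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k
    # eps is an assumed floor so that p = 0 does not divide by zero.
    return sum(weights) / sum(w / max(p, eps) for w, p in zip(weights, p_values))


combined = harmonic_mean_p([0.1, 0.4])
```

With equal weights of 0.5, the two p-values 0.1 and 0.4 combine to 1 / (0.5/0.1 + 0.5/0.4) = 0.16, which shows how the smaller p-value dominates the combination.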
Time to Outbreak Estimation: A time to outbreak estimation strategy can be formulated to provide probability estimates for when the next outbreak will occur given early indicators. We propose a strategy based on the timing of detected events among input data sources with respect to the eventual COVID-19 outbreak event in each state, as defined in the preceding section. We first modeled the behavior of the data sources in each state as a function of the state’s underlying COVID-19 case trajectory over the time period studied. Specifically, we modeled the detected events in each data source as conditionally independent given the state-level latent COVID-19 dynamics. This follows from the assumption that exponential behavior in each data source is a causal effect of a COVID-19 outbreak and that other correlations unrelated to COVID-19 are mostly minor in comparison.
The time intervals between each event and the eventual outbreak event were then pooled across states to form an empirical conditional distribution for each dataset. Since observations were sparse relative to the time window, we used kernel density estimation with Gaussian kernels to smooth the empirical distributions, where bandwidth selection for the kernel density estimation was performed using Scott’s Rule.
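The smoothing step can be illustrated with a one-dimensional Gaussian kernel density estimate. The sketch below assumes the bandwidth form h = s * n^(-1/5), which is Scott's rule for one-dimensional data (s is the sample standard deviation and n the number of pooled intervals). The function names, the toy intervals, and the evaluation grid of 0 to 180 days are assumptions made for illustration.

```python
import math
from statistics import stdev


def scott_bandwidth(samples):
    """Scott's rule in one dimension: h = s * n**(-1/5)."""
    return stdev(samples) * len(samples) ** (-1 / 5)


def kde_density(samples, grid):
    """Gaussian kernel density estimate evaluated at each grid point."""
    h = scott_bandwidth(samples)
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        for x in grid
    ]


def discretize(density):
    """Renormalize density values on a daily grid into a discrete distribution."""
    total = sum(density)
    return [d / total for d in density]


# Toy pooled intervals (days between a detected event and the outbreak event).
intervals = [3, 5, 7, 10, 14]
grid = list(range(181))
probs = discretize(kde_density(intervals, grid))
```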
Thus, for any given dataset, a detected event implies a distribution of likely timings for the COVID-19 outbreak event. We define $\tau_i(t)$ as the number of days since an uptrend event for signal $i$, where $t$ is the current date. Within a state, the relative intervals of the events for each data source specify a posterior distribution over the probability of a COVID-19 event in $x$ days from current date $t$:

$$P\big(x \mid \tau_1(t), \ldots, \tau_n(t)\big) \propto P(x) \prod_{i=1}^{n} P\big(\tau_i(t) \mid x\big),$$

where we decomposed the joint likelihood $P\big(\tau_1(t), \ldots, \tau_n(t) \mid x\big)$ using conditional independence.
A uniform prior over the entire range of possible delays (a period of 180 days) was assumed, and additive smoothing was used when combining the probabilities. Because we modeled at the daily level, the distributions are discrete, allowing the posterior to be evaluated explicitly using marginalization. This process was repeated for each state. Such an approach can be viewed as pooling predictions from a set of “experts” when they have conditionally independent likelihoods given the truth, with each expert corresponding to a data source. We note as a limitation that the assumption of conditionally independent expert likelihoods given the truth is unlikely to hold perfectly as, for example, an increase in measured fevers could be correlated with an increase in fever-related Google Searches even when the underlying COVID-19 infection dynamics are similar. Such dependencies may manifest heterogeneously as correlations among locations with similar COVID-19 outbreak timings, but are likely to be small since most of our inputs represent disparate data sources.
For the purposes of this model, we excluded mobility data when modeling the uptrend because it is a direct consequence of government intervention rather than of COVID-19 activity. When no event is detected for a data source, that data source’s expert’s prediction is zero across all possible timings, which translates with smoothing to a uniform distribution.
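The pooling of conditionally independent "experts" under a uniform prior with additive smoothing can be sketched as below. The 180-day horizon comes from the text; the smoothing constant `alpha`, the function name, and the representation of each expert as a list of daily likelihoods (or `None` when no event was detected) are assumptions made for illustration.

```python
def combine_experts(expert_likelihoods, horizon=180, alpha=1e-3):
    """Posterior over days-until-outbreak from per-source likelihoods.

    Each expert is a list of length `horizon` giving the likelihood of its
    observed event timing for each candidate delay, or None if the source
    has no detected event.
    """
    posterior = [1.0 / horizon] * horizon  # uniform prior over delays
    for lik in expert_likelihoods:
        if lik is None:
            lik = [0.0] * horizon  # no event: zero everywhere before smoothing
        # Additive smoothing; an all-zero expert becomes exactly uniform.
        total = sum(lik) + alpha * horizon
        smoothed = [(v + alpha) / total for v in lik]
        posterior = [p * s for p, s in zip(posterior, smoothed)]
    z = sum(posterior)
    return [p / z for p in posterior]


# Toy experts: one source points to day 10, one has no detected event.
peaked = [0.0] * 180
peaked[10] = 1.0
posterior = combine_experts([peaked, None])
```

Because a source without a detected event smooths to a uniform distribution, it leaves the posterior unchanged, matching the behavior described above.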
-  (2020) Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20–28 January 2020. Eurosurveillance 25 (5), pp. 2000062. Cited by: §3.
-  (2010) Modeling the spatial spread of infectious diseases: The GLobal Epidemic and Mobility computational model. Journal of computational science 1 (3), pp. 132–145. Cited by: §4.
-  (2020) Evidence from internet search data shows information-seeking responses to news of local COVID-19 cases. Proceedings of the National Academy of Sciences 117 (21), pp. 11220–11222. Cited by: §1, §3.
-  (2008) Surveillance sans frontieres: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS medicine 5 (7). Cited by: §1.
-  (2009) Google trends: a web-based tool for real-time surveillance of disease outbreaks. Clinical infectious diseases 49 (10), pp. 1557–1564. Cited by: §4.
-  CDC FluView. Note: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html Cited by: §4, §4.
-  (2020) Forecasts of total COVID-19 deaths. Cited by: §1.
-  (2020) The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science 368 (6489), pp. 395–400. Cited by: §3, §4.
-  (2020) A strategic approach to COVID-19 vaccine R&D. Science 368 (6494), pp. 948–950. External Links: Cited by: §1.
-  Coronavirus in the US: Latest map and case count https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html. Cited by: §4.
-  (2018) Urbanization and humidity shape the intensity of influenza epidemics in US cities. Science 362 (6410), pp. 75–79. Cited by: §4.
-  (2020) An Interactive Web-Based Dashboard to Track COVID-19 in Real Time. The Lancet Infectious Diseases 20 (5), pp. 533–534. Cited by: §1, §4.
-  (2020) Staged strategy to avoid hospital surge and preventable mortality, while reducing the economic burden of social distancing. The University of Texas COVID-19 Modeling Consortium (), pp. . Cited by: §1.
-  (1997) A compendium of conjugate priors. 46. Note: http://www.people.cornell.edu/pages/df36/CONJINTRnew%20TEX.pdf (accessed July 1, 2020). Cited by: §4.
-  (2013) An IDEA for short term outbreak projection: nearcasting using the basic reproduction number. PloS one 8 (12). Cited by: §4.
-  (2020) Report 13: Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries. Imperial College London. Cited by: §1.
-  U.S. Influenza Surveillance System: Purpose and Methods. Centers for Disease Control and Prevention. External Links: Cited by: §4.
-  (2020) The impact of early social distancing at COVID-19 Outbreak in the largest Metropolitan Area of Brazil. medRxiv 2020.04.06.20055103. Cited by: §3.
-  (2013) Bayesian data analysis. CRC press. Cited by: §4.
-  (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian analysis 1 (3), pp. 515–534. Cited by: §4.
-  (2014) Global disease monitoring and forecasting with Wikipedia. PLoS computational biology 10 (11). Cited by: §1.
-  (2009) Detecting influenza epidemics using search engine query data. Nature 457 (7232), pp. 1012–1014. Cited by: §1.
-  (2014) Evaluation of Internet-based dengue query data: Google Dengue Trends. PLoS neglected tropical diseases 8 (2). Cited by: §3, §4.
-  (2014) Assessing the international spreading risk associated with the 2014 West African Ebola outbreak. PLoS currents 6. Cited by: §4.
-  (2020) The study of the effects of mobility trends on the statistical models of the COVID-19 virus spreading. Electronic journal of general medicine 17 (6), pp. em243. Cited by: §3.
-  (2020) Coronavirus Search Trends. Cited by: §4.
-  International Air Transportation Association, https://www.iata.org/. Cited by: §4.
-  (2020) Population-level COVID-19 mortality risk for non-elderly individuals overall and for nonelderly individuals without underlying diseases in pandemic epicenters. medRxiv 2020.04.05.20054361. Cited by: §3.
-  (2020) COVID-19 positive cases, evidence on the time evolution of the epidemic or an indicator of local testing capabilities? A case study in the United States. arXiv:2004.3128874. Cited by: §1, §3.
-  (2020) Projecting the transmission dynamics of SARS-CoV-2 through the post-pandemic period. Science, pp. 860–868. Cited by: §1, §1, §3.
-  (2019) Wet markets and food safety: TripAdvisor for improved global digital surveillance. Journal of medical Internet research 5 (2), pp. e11477. Cited by: §1.
-  (2020) The effect of human mobility and control measures on the COVID-19 epidemic in China. Science 368 (6490), pp. 493–497. Cited by: §3.
-  (2020) Tracking COVID-19 using online search. arXiv:2003.08086. Cited by: §1, §4.
-  (2020) Timing of community mitigation and changes in reported COVID-19 and community mobility ― Four U.S. metropolitan areas, February 26–April 1, 2020. MMWR morbidity and mortality weekly report 69 (0), pp. 451–457. Cited by: §3.
-  (2014) The parable of Google Flu: traps in big data analysis. Science 343 (6176), pp. 1203–1205. Cited by: §1, §3.
-  (2020) Fever and mobility data indicate social distancing has reduced incidence of communicable disease in the United States. arXiv:2004.09911. Cited by: §2.
-  (2020) Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data. Journal of clinical medicine 9 (2), pp. 538. Cited by: §2, §3.
-  (2019) Enhancing Situational Awareness to Prevent Infectious Disease Outbreaks from Becoming Catastrophic. Global Catastrophic Biological Risks, pp. 59–74. Cited by: §1.
-  (2019) Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nature communications 10 (1), pp. 1–10. Cited by: §1, §4.
-  (2020) Estimating the Early Outbreak Cumulative Incidence of COVID-19 in the United States: Three Complementary Approaches. medRxiv 2020.04.18.20070821. Cited by: §1, §1, §2, §3, §4, §4, §4.
-  (2018) Accurate influenza monitoring and forecasting using novel internet data streams: a case study in the Boston Metropolis. JMIR public health and surveillance 4 (1), pp. e4. Cited by: §4.
-  (2020) Internet Search Patterns Reveal Clinical Course of Disease Progression for COVID-19 and Predict Pandemic Spread in 32 Countries. medRxiv. Cited by: §1.
-  (2015) 2014 Ebola outbreak: Media events track changes in observed reproductive number. PLoS currents 7. Cited by: §1.
-  (2016) Utilizing nontraditional data sources for near real-time estimation of transmission dynamics during the 2015-2016 Colombian Zika virus disease outbreak. JMIR public health and surveillance 2 (1), pp. e30. Cited by: §1.
-  (2017) Forecasting Zika incidence in the 2016 Latin America outbreak combining traditional disease surveillance with search, social media, and news report data. PLoS neglected tropical diseases 11 (1), pp. e0005295. Cited by: §1, §4.
-  (2020) The COVID Tracking Project. Cited by: §4.
-  (2018) A smartphone-driven thermometer application for real-time population-and individual-level influenza surveillance. Clinical Infectious Diseases 67 (3), pp. 388–397. Cited by: §4.
-  (2018) A smartphone-driven thermometer application for real-time population-and individual-level influenza surveillance. Clinical Infectious Diseases 67 (3), pp. 388–397. Cited by: §1.
-  (2020) Inferring high-resolution human mixing patterns for disease modeling. arXiv:2003.01214. Cited by: §4.
-  (2014) A case study of the New York City 2012-2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. Journal of medical Internet research 16 (10), pp. e236. Cited by: §1.
-  Cited by: §3.
-  Official Aviation Guide, https://www.oag.com/. Cited by: §4.
-  (2020) Have deaths from COVID-19 in Europe plateaued due to herd immunity?. Lancet (London, England). Cited by: §1.
-  (1995) Local Spatial Autocorrelation Statistics: Distributional Issues and an Application. Geographical Analysis. External Links: Cited by: §3.
-  (2014) Web-based participatory surveillance of infectious diseases: the Influenzanet participatory surveillance experience. Clinical Microbiology and Infection 20 (1), pp. 17–21. Cited by: §1.
-  (2014) Twitter improves influenza forecasting. PLoS currents 6. Cited by: §1, §3.
-  (2008) Using internet searches for influenza surveillance. Clinical infectious diseases 47 (11), pp. 1443–1448. Cited by: §1.
-  COVID-19 Modeling: United States. https://covid19.gleamproject.org/. Cited by: §4.
-  (1999) Bayesian approaches to multi-sensor data fusion. Cambridge University, Cambridge. Cited by: §4.
-  (2020) Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: a population-based study. The Lancet Digital Health. Cited by: §1.
-  (2019) A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences 116 (8), pp. 3146–3154. Cited by: §1.
-  (2016) Citizen-centric Urban Planning through Extracting Emotion Information from Twitter in an Interdisciplinary Space-Time-Linguistics Algorithm. Urban Planning 1 (2), pp. 114–127. External Links: Cited by: §3.
-  (2018) Combining machine-learning topic models and spatiotemporal analysis of social media data for disaster footprint and damage assessment. Cartography and Geographic Information Science 45 (4), pp. 362–376. External Links: Cited by: §4.
-  (2015) Urban Emotions — Geo-Semantic Emotion Extraction from Technical Sensors, Human Sensors and Crowdsourced Data. In Progress in Location-Based Services 2014, G. Gartner and H. Huang (Eds.), pp. 199–212. External Links: Cited by: §3.
-  (2020-05) Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area. JAMA 323 (20), pp. 2052–2059. External Links: Cited by: §4.
-  (2020) Cumulative incidence and diagnosis of SARS-CoV-2 infection in New York. medRxiv. Cited by: §1.
-  (2012) Digital epidemiology. PLoS computational biology 8 (7). Cited by: §1.
-  (2015) Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS computational biology 11 (10). Cited by: §1.
-  (2016) Cloud-based electronic health records for real-time, region-specific influenza surveillance. Scientific reports 6, pp. 25732. Cited by: §1.
-  (2014) Using clinicians’ search query data to monitor influenza epidemics. Clinical infectious diseases: an official publication of the Infectious Diseases Society of America 59 (10), pp. 1446. Cited by: §1, §3, §4.
-  (2014) What can digital disease detection learn from (an external revision to) Google Flu Trends?. American journal of preventive medicine 47 (3), pp. 341–347. Cited by: §1, §3.
-  (2015) Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Cited by: §4.
-  (2015) Flu near you: Crowdsourced symptom reporting spanning 2 influenza seasons. American journal of public health 105 (10), pp. 2124–2130. Cited by: §1.
-  (2020-06) Nursing Home COVID-19 Data. Center for Medicare and Medicaid Services. External Links: Cited by: §3.
-  (2017) Dynamic forecasting of Zika epidemics using Google Trends. PloS one 12 (1). Cited by: §4.
-  (2020) Developer Docs https://developer.twitter.com/en/docs. Cited by: §4.
-  (2020) COVID-19 search intensity monitoring dashboard. https://covid19map.uptodate.com/. Cited by: §4.
-  (2020) Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases. Cited by: §3.
-  (2014) Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. PloS one 9 (7). Cited by: §1.
-  (2014-10) Serial Intervals of Respiratory Infectious Diseases: A Systematic Review and Analysis. American Journal of Epidemiology 180 (9), pp. 865–875. External Links: Cited by: §4.
-  (2019) The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences 116 (4), pp. 1195–1200. Cited by: §2, §4.
-  (2015) Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences 112 (47), pp. 14473–14478. Cited by: §1, §3.
-  (2017) Spread of Zika virus in the Americas. Proceedings of the National Academy of Sciences 114 (22), pp. E4334–E4343. External Links: Cited by: §4.
MS was partially supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM130668. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.