The novel coronavirus (COVID-19) had, as of 6th October 2021, resulted in over 4.55 million deaths and 219 million confirmed cases worldwide. By January 2020, new cases had been seen throughout Asia, and by the time the World Health Organisation (WHO) declared a global pandemic in March 2020, COVID-19 had spread to over 100 countries. It was therefore imperative to establish reliable data feeds relating to the pandemic so that researchers and analysts could model the ongoing spread of the disease and inform decision-making by government and public health officials. These data sets and models must be open-source to facilitate collaboration between researchers and to allow published results to be replicated and scrutinised. A popular interactive dashboard collates total daily counts of confirmed cases and deaths for countries and, in some cases, regions within countries. These variables are traditionally used to calculate metrics such as the reproduction number ($R$), which is vital in understanding both the average number of people an infected person infects and the infection growth rate, or daily rate of new infections. The quality of the calculated metrics depends heavily on the model and the ingested data.
In the United Kingdom (UK), there has been a joint effort to produce estimates of the $R$ number, with some notable examples seen here . Laboratory-confirmed COVID-19 diagnoses are used in , the UK’s National Health Service (NHS) Pathways data is used in  and hospital admissions data is used in . The statistical model developed by Moore, Rosato and Maskell  contributed to these estimates by incorporating death, hospital admission and NHS 111 call data.
Terms such as “Infodemiology” and “Infoveillance”, described in , refer to the ability to process and analyse, in real time, digitally created and stored data that is pertinent to disease outbreaks. The availability of these datasets, particularly at the beginning of an outbreak when very little is known, could provide a noisy but accurate representation of disease dynamics. A popular method is to extract data from social media and, in particular, Twitter. Before the pandemic, tweets relating to influenza-like-illness symptoms were seen to substantially improve the models’ predictive capacity in  and boost nowcasting accuracy by 13% in . During the COVID-19 pandemic, many published research papers have used social media to gain valuable information from what people tweet in real time. Public sentiment relating to prevention strategies was analysed in , while  showed that emotion changed from fear to anger during the first stages of the pandemic. Misinformation and conspiracy theories have been shown to have propagated rapidly through the Twittersphere during the pandemic . Studies have used machine learning algorithms to automatically detect tweets containing self-reported symptoms mentioned by users , with  finding that the symptoms reported by Twitter users were similar to those used in a clinical setting. To the best of our knowledge, researchers have yet to use these symptomatic tweets to calibrate epidemiological models.
The contribution of this paper is twofold. Firstly, we outline how to identify symptomatic tweets that correspond to COVID-19-related symptoms in multiple languages. The geolocation information associated with each tweet, when available, is extracted, and counts per country or region are aggregated to produce estimates for the previous 24 hours. Secondly, we present a comprehensive study of how these symptomatic tweets differ from other open-source datasets when used to calibrate the extended SEIR model described in Section 3 for different geographic locations. For each combination of surveillance data outlined in Section 2, we calculate the Mean Absolute Error (MAE) and Normalised Estimation Error Squared (NEES) of 7-day death forecasts.
An outline of the paper is as follows. In Section 2 we describe the open-source data feeds included in the comparative study and provide the methodology for extracting the symptomatic tweets in real-time. A description of the model is presented in Section 3, with an outline of the computational experiments and results in Section 4. Concluding remarks and directions for future work are described in Section 5.
The surveillance data used for each geographical location are summarised in Table 1. Death and positive case data for the US States and the rest of the world were downloaded from the dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It should be noted that testing methods and criteria for classifying deaths as COVID-19-related differ between geographic locations.
For NHS region-specific data, the number of deaths includes individuals with COVID-19 as the cause of death on their death certificate or those who died within 60 days of a positive test result. Patients admitted to hospital with COVID-19 symptoms and individuals who input symptoms to the ZOE COVID Symptom Study app comprise the hospital admissions and Zoe app datasets, respectively. Individuals who reported symptoms via the NHS Pathways triage and online Dashboard comprise the 111 calls and 111 online assessments datasets, respectively. Note that the Zoe app, 111 calls and 111 online assessments datasets may include individuals who have COVID-19 symptoms but have not tested positive, as well as individuals who perceive that they have symptoms but do not have COVID-19.
All code and datasets can be found on the CoDatMo GitHub repository . The authors set this up to facilitate the sharing of code, data and ideas when modelling COVID-19.
Table 1: Surveillance data feeds and their start dates for each geographic location.

| Geographic Location | Data Feed | Start Date | Reference |
|---|---|---|---|
| US States and the rest of the world | Deaths | 24th March 2020 | |
| | Tests | 1st March 2020 | |
| | Twitter | 13th April 2020 | Section 2.1 |
| UK NHS Regions | Deaths | 24th March 2020 | |
| | Hospital admissions | 19th March 2020 | |
| | Twitter | 9th April 2020 | Section 2.1 |
| | Zoe app | 12th May 2020 | |
| | 111 calls | 18th March 2020 | |
| | 111 online | 18th March 2020 | |
We created an interactive website (https://pgb.liv.ac.uk/johnheap/) that maps symptomatic tweets to geographical locations, with daily counts representing the total number of symptomatic tweets from the previous 24 hours. Information on how to download the data as JSON can be found on the website.
The Twitter streaming API is filtered using keywords that align with COVID-19 symptoms from the MedDRA database  in English, German, Italian, Portuguese and Spanish, including terms for fever, cough and anosmia. We note that our analysis indicated that explicit COVID-19 terms (e.g. ‘coronavirus’) rarely related to individuals with symptoms. Such terms were therefore excluded. Official retweets or tweets beginning with #RT were removed to avoid duplication of tweets within the dataset.
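The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword set is a hypothetical three-word English sample rather than the MedDRA-derived lists used in the study, the function names are ours, and retweets are detected from the text prefix alone, whereas official retweets would normally be identified from the metadata returned by the API.

```python
import re

# Hypothetical English sample; the study uses MedDRA-derived terms in five languages.
# Explicit disease terms such as "coronavirus" are deliberately absent from this set.
SYMPTOM_KEYWORDS = {"fever", "cough", "anosmia"}


def is_retweet(text: str) -> bool:
    """Treat tweets whose text starts with 'RT' or '#RT' as retweets."""
    return re.match(r"\s*#?RT\b", text) is not None


def keep_tweet(text: str) -> bool:
    """Keep a tweet if it is not a retweet and mentions a symptom keyword."""
    if is_retweet(text):
        return False
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & SYMPTOM_KEYWORDS)


if __name__ == "__main__":
    for tweet in ["I have a bad cough and fever today",
                  "RT @user: I have a cough",
                  "coronavirus news update"]:
        print(keep_tweet(tweet), "|", tweet)
```

Matching on whole words keeps the sketch simple; a production filter would also need to handle accents, hashtags and multi-word symptom phrases.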
2.1.2 Symptom Classifier Breakdown
A multi-class support vector machine (SVM) was trained with a set of annotated tweets that were vectorised using a skip-gram model. The annotated tweets were labelled according to the following classes:
User currently has symptoms,
User had symptoms in the past,
Someone else currently has symptoms,
Someone else had symptoms in the past.
The sum of tweets in classes 2-5 (the four symptom-related classes listed above), which is the total number of tweets that mention symptoms, was calculated for each 24-hour period. Geo-tagged tweets were mapped to their location (e.g. the corresponding city, town or village) via a series of tests using shapefiles of different countries. Previous studies demonstrate that approximately 1.65% of tweets are geo-tagged , meaning that the exact position of the tweeter when the tweet was posted is recorded as longitude and latitude. For tweets that are not geo-tagged, we look at the author’s profile to ascertain whether they provide an appropriate location. We deemed the server to have been offline during any 15-minute period of the previous 24 hours that contained no recorded tweets. After checking all 96 fifteen-minute periods, the count in each geographical area was multiplied by a correction factor: reported tweet count = total tweet count × 96/(96 − downtime periods).
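As a worked illustration of the downtime correction, the sketch below counts empty 15-minute periods and rescales a daily count accordingly. The function names and the representation of tweet times as minutes since the start of the 24-hour window are assumptions made for this example, not part of the original pipeline.

```python
def count_downtime_periods(tweet_minutes, period_minutes=15, periods_per_day=96):
    """Count the periods in the 24-hour window that contain no tweets.

    tweet_minutes: arrival times in minutes since the start of the window.
    """
    window = period_minutes * periods_per_day
    occupied = {int(m // period_minutes) for m in tweet_minutes if 0 <= m < window}
    return periods_per_day - len(occupied)


def corrected_count(total_count, downtime_periods, periods_per_day=96):
    """Scale a daily count by 96 / (96 - downtime periods)."""
    if not 0 <= downtime_periods < periods_per_day:
        raise ValueError("downtime_periods must lie in [0, periods_per_day)")
    return total_count * periods_per_day / (periods_per_day - downtime_periods)


if __name__ == "__main__":
    # 90 tweets observed with 6 of the 96 periods empty gives 90 * 96 / 90.
    print(corrected_count(90, 6))
```

The correction implicitly assumes that tweets arrive at a roughly constant rate over the day, so a day in which every period is empty cannot be corrected and raises an error here.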
For each language, the corpus of tweets was labelled with the associated class label with the help of native speakers, and we randomly up- and down-sampled under- and over-represented classes, respectively, so that the classifier was trained with a balanced dataset. A subset of the data was used to train the classifier, which was then tested on the remainder. The total number of labelled tweets used for training and testing of the classifiers and the resulting performance metrics can be seen in Table 2.
|Language||Number of Data Used||Performance Measures|
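The class balancing described above could be implemented along the following lines. This is a minimal sketch using only the Python standard library; the function name, the fixed per-class target and the toy data are assumptions for illustration, and the study's own resampling procedure may differ.

```python
import random
from collections import Counter, defaultdict


def balance_classes(samples, target_per_class, seed=0):
    """Return a dataset with exactly target_per_class items per label.

    samples: list of (text, label) pairs.
    Over-represented classes are down-sampled without replacement;
    under-represented classes are up-sampled with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    balanced = []
    for items in by_label.values():
        if len(items) >= target_per_class:
            balanced.extend(rng.sample(items, target_per_class))
        else:
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=target_per_class - len(items)))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    toy = [("no symptoms here", 1)] * 10 + [("I have a fever", 2)] * 2
    print(Counter(label for _, label in balance_classes(toy, 5)))
```

One design point worth noting: if up-sampling is applied before the train/test split, duplicated tweets can appear on both sides of the split, so resampling only the training portion gives a more honest estimate of performance.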
We repurpose the statistical model developed by Moore, Rosato and Maskell  by adapting the observation model to be compatible with each group of surveillance data types that we use to calibrate the model in the computational experiments. Death data are used to calibrate the model in every experiment, and the associated component of the observation model is unchanged. We extend the observation model to assimilate the other types of surveillance data that feature in Table 1, including Twitter and Zoe app data, by adding an extra component for each additional data type. These extra components of the observation model, the number of which can change between experiments, mirror the structure for symptom report data in the original model. More explicitly, we assume for these extra components that a generic count on day $t$, $y_t$, has a negative binomial distribution,
$$y_t \sim \mathrm{NegBin}(\mu_t, \phi),$$
parameterised by a mean $\mu_t$ and overdispersion parameter $\phi$.
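To make the mean and overdispersion parameterisation concrete, the sketch below evaluates the log probability mass of a negative binomial count. It assumes the common convention in which the variance equals the mean plus the squared mean divided by the overdispersion parameter (as in Stan's `neg_binomial_2`); the source does not state which convention the model uses, so this is an assumption, and the function name is ours.

```python
import math


def neg_binomial_logpmf(y: int, mean: float, phi: float) -> float:
    """Log pmf of a count y with the given mean and overdispersion phi.

    Assumed convention: variance = mean + mean**2 / phi, so larger phi
    means less overdispersion and the Poisson limit is phi -> infinity.
    """
    return (math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
            + y * math.log(mean / (mean + phi))
            + phi * math.log(phi / (mean + phi)))


if __name__ == "__main__":
    mean, phi = 10.0, 5.0
    probs = [math.exp(neg_binomial_logpmf(y, mean, phi)) for y in range(2000)]
    est_mean = sum(y * p for y, p in enumerate(probs))
    est_var = sum((y - est_mean) ** 2 * p for y, p in enumerate(probs))
    print(sum(probs), est_mean, est_var)  # variance should be near 10 + 100 / 5
```

Under this convention a small overdispersion parameter allows the noisy day-to-day variation typical of self-reported data, which is one plausible reason for preferring it over a Poisson observation model.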
4 Computational Experiments
The time series we consider begins on 17th February 2020, with the start dates of the data feeds outlined in Table 1. The analysis ends on 1st February 2021 for the US States and the rest of the world, and on 7th January 2021 for the NHS regions. In all cases, a forecast covers seven days. For US States and the rest of the world, we include three prediction periods in the analysis: 9th July 2020 to 16th July 2020, 17th October 2020 to 24th October 2020, and 25th January 2021 to 1st February 2021. The analysis for UK NHS regions includes six prediction periods: 11th November 2020 to 18th November 2020, 21st November 2020 to 28th November 2020, 1st December 2020 to 8th December 2020, 11th December 2020 to 18th December 2020, 21st December 2020 to 28th December 2020, and 31st December 2020 to 7th January 2021.
Similar to the experiments in , the analysis in this paper was run on the University of Liverpool’s High-Performance Computer (HPC). Each node has two Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz processors, a total of 40 cores and 384 GB of memory. In the following experiments, six independent Markov Chains draw 2000 samples each, with the first 1000 discarded as burn-in. Run-time differs depending on the country and at what point in the time series the prediction is made, but it typically takes 4.5 hours per Markov Chain to complete.
We initially calibrate the model solely with death data and produce posterior predictive distributions of deaths for the following geographic locations independently:
- Rest of World: 2 European and 16 Latin American countries,
- UK: 7 NHS regions.
We consider the final seven daily death counts in this forecast to be the baseline against which we compare forecasts of deaths that incorporate low-latency data feeds. We use two metrics in our analysis to determine the accuracy of the resulting forecasts. Firstly, we calculate the MAE, which shows the average error over a set of predictions:
$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N} \left|\hat{x}_t - x_t\right|,$$
where $N$ is the number of predictions, and $\hat{x}_t$ and $x_t$ are the predicted and true number of deaths on day $t$, respectively.
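For illustration, the MAE over a forecast window can be computed as in the following sketch. The function name and the example numbers are hypothetical and are not taken from the study's results.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and true daily deaths."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical three-day example: absolute errors of 1, 0 and 3 deaths.
    print(mean_absolute_error([10, 12, 9], [11, 12, 12]))
```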
Secondly, we consider the uncertainties associated with the forecasts by assessing the NEES, a popular metric in the field of signal processing and tracking  that was recently applied to epidemiological forecasts in , to determine whether the estimated variance of forecasts from an algorithm differs from the true variance. If the estimated variance is larger than the true variance, then the algorithm is over-cautious, and if the estimated variance is smaller than the true variance, it is over-confident. The NEES is defined by:
$$\mathrm{NEES} = \frac{1}{N}\sum_{t=1}^{N} (\hat{x}_t - x_t)^{\top} C_t^{-1} (\hat{x}_t - x_t),$$
where $N$ is the number of predictions, $\hat{x}_t$ and $x_t$ are the predicted and true values on day $t$, and $C_t$ is the estimated variance at day $t$, as approximated using the variance of the samples for that day. If $x_t$ is $d$-dimensional, then $C_t$ should be a $d \times d$ matrix, and the NEES should be equal to $d$ if the algorithm is consistent. Therefore, in assessing death forecasts, an ideal NEES value is 1.
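In the scalar case relevant to death forecasts, the NEES reduces to the average squared error divided by the estimated variance. The sketch below assumes that each day's forecast is available as a list of posterior predictive samples, and that the point forecast and variance are taken as the sample mean and sample variance; the function name and data layout are ours.

```python
from statistics import mean, variance


def nees_scalar(samples_per_day, observed):
    """Scalar NEES: average of squared error divided by estimated variance.

    samples_per_day: one list of forecast samples per day (at least two each).
    observed: the true value for each day.
    """
    if len(samples_per_day) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for samples, truth in zip(samples_per_day, observed):
        estimate = mean(samples)
        estimated_variance = variance(samples)  # unbiased sample variance
        total += (estimate - truth) ** 2 / estimated_variance
    return total / len(observed)


if __name__ == "__main__":
    # Day 1: mean 10, variance 4, truth 12. Day 2: mean 20, variance 4, truth 20.
    print(nees_scalar([[8, 10, 12], [18, 20, 22]], [12, 20]))
```

Values well above 1 indicate that the errors are large relative to the stated uncertainty (over-confidence), while values well below 1 indicate over-caution. The sketch will fail with a division by zero if all samples for a day are identical.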
The NEES values and MAE percentage differences between the baseline, of ingesting solely deaths, and the incorporation of low-latency data feeds for US States and the rest of the world and UK’s NHS regions can be seen in Tables 3 and 4, respectively. The results in these tables are averaged over the prediction periods described in section 4.
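The text does not spell out how the MAE percentage difference is calculated. A natural reading, assumed in the sketch below, is the relative change of the MAE obtained with an additional data feed against the deaths-only baseline, so that negative values indicate a reduction in error. The function name and numbers are illustrative only.

```python
def mae_percentage_difference(mae_with_feed: float, mae_baseline: float) -> float:
    """Relative change in MAE against the deaths-only baseline, in percent.

    Assumed definition: negative values mean the extra feed reduced the error.
    """
    if mae_baseline <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (mae_with_feed - mae_baseline) / mae_baseline


if __name__ == "__main__":
    # Hypothetical values: baseline MAE of 10 deaths, 8 with an extra feed.
    print(mae_percentage_difference(8.0, 10.0))
```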
When forecasting deaths using the data available from , we have shown that calibrating the model with tests and tweets gives comparable increases in performance for US States. For the rest of the world, however, tweets give a change in MAE of -17% compared with just -6% for tests, where a negative percentage difference indicates a reduction in error. When both tests and tweets are used to calibrate the model, the MAE changes by -5% for US States and -24% for the rest of the world. An example of this improvement can be seen in Figure 1 for the prediction period 25th January 2021 to 1st February 2021. Comparing the mean sample, outlined in red, the forecast that incorporates tests and tweets follows the trend in deaths more accurately.
When comparing the NEES values for NHS regions in Table 4 against the baseline, it is evident that including a data feed improves the consistency of forecasts. The exception is the Zoe App data, which, on average, produces estimates that are over-confident. This can be seen in Figure 2 for the prediction period 31st December 2020 to 7th January 2021. When comparing the MAE percentage differences, the results are similar. Ingesting hospital admissions, 111 calls and 111 online assessments data changes the MAE by -22%, -17% and -22%, respectively. However, the MAE percentage difference is far larger when including the Zoe App data (124%) than when including the Tweet data (2%). We believe this issue arises because, in the context of these feeds, the symptoms are self-diagnosed. Consequently, the counts may well include relatively large numbers of people who do not have COVID-19. We do not currently consider such ‘false alarms’ in the model described in Section 3 but hope to extend it to handle these in the future.
Figure 1: Deaths forecast for Colombia using solely death data (top) and deaths, tests and Twitter data (bottom). The orange ribbon is 1 standard deviation from the mean, the red line is the mean sample and the start of the predictions is the vertical dashed blue line. The black and green dots are the true deaths.
5 Conclusions and Future Work
We have shown that calibrating the epidemiological model outlined in section 3 with certain low-latency data feeds provides more accurate and consistent nowcasts of daily deaths when compared with using death data alone.
Incorporating tweets for UK regions does not provide the same level of improvement as for other geographies. This reduced improvement could be down to many factors, including the total daily counts for NHS regions being less plentiful than for the US States or the rest of the world. We used the free Twitter streaming API for this research, which limits the number of tweets available to download to 1%. However, it is possible to pay for a Premium API that allows the user to download a higher percentage of tweets. A second way to potentially increase the hit rate of geo-located tweets is to use natural language processing techniques, such as those outlined in the review , to estimate the location of the tweet user. Another direction for future work is to train a more sophisticated classifier such as the Bidirectional Encoder Representations from Transformers (BERT) classifier .
As with the symptom report data in the original model introduced by Moore, Rosato and Maskell , we have assumed that all types of low-latency surveillance data are a weighted sum of current and lagged instances of the time series of new infections. In future work, we want to consider alternative structures for the observation model that link directly to the intermediate states of the transmission model.
This work was supported by a Research Studentship jointly funded by the EPSRC and the ESRC Centre for Doctoral Training on Quantification and Management of Risk and Uncertainty in Complex Systems Environments [EP/L015927/1]; an ICASE Research Studentship jointly funded by EPSRC and AWE [EP/R512011/1]; the EPSRC Centre for Doctoral Training in Distributed Algorithms [EP/S023445/1]; and the EPSRC through the Big Hypotheses grant [EP/R018537/1].
The authors would like to thank Serban Ovidiu and Chris Hankin from Imperial College London, Ronni Bowman, Riskaware and John Harris for their support and helpful discussions of this work. We would also like to thank the team at the Universidade Nove de Julho - UNINOVE in Sao Paulo, Brazil, for the help they provided in labelling the Portuguese tweets, and Breck Baldwin for helping to progress CoDatMo.
-  Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases, 20(5):533–534, 2020.
-  Reproduction number (r) and growth rate: methodology. https://www.gov.uk/government/publications/reproduction-number-r-and-growth-rate-methodology/reproduction-number-r-and-growth-rate-methodology. Accessed: 1 October 2021.
-  Paul Birrell, Joshua Blake, Edwin Van Leeuwen, Nick Gent, and Daniela De Angelis. Real-time nowcasting and forecasting of covid-19 dynamics in england: the first wave. Philosophical Transactions of the Royal Society B, 376(1829):20200279, 2021.
-  Quentin J Leclerc, Emily S Nightingale, Sam Abbott, and Thibaut Jombart. Analysis of temporal trends in potential covid-19 cases reported through nhs pathways england. Scientific reports, 11(1):1–8, 2021.
-  Matt J Keeling, Louise Dyson, Glen Guyver-Fletcher, Alex Holmes, Malcolm G Semple, Michael J Tildesley, Edward M Hill, ISARIC4C Investigators, et al. Fitting to the uk covid-19 outbreak, short-term forecasts and estimating the reproductive number. medRxiv, 2020.
-  Moore RE, Rosato C, and Maskell S. Assessing the uncertainty of epidemiological forecasts with normalised estimation error squared. Submitted to Philosophical Transactions of the Royal Society A: Mathematical, Physical, and Engineering Sciences.
-  Gunther Eysenbach et al. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the internet. Journal of medical Internet research, 11(1):e1157, 2009.
-  Harshavardhan Achrekar, Avinash Gandhe, Ross Lazarus, Ssu-Hsin Yu, and Benyuan Liu. Predicting flu trends using twitter data. In 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS), pages 702–707. IEEE, 2011.
-  Ovidiu Șerban, Nicholas Thapen, Brendan Maginnis, Chris Hankin, and Virginia. Real-time processing of social media with sentinel: A syndromic surveillance system incorporating deep learning for health classification. Information Processing & Management, 56(3):1166–1184, 2019.
-  Richard J Medford, Sameh N Saleh, Andrew Sumarsono, Trish M Perl, and Christoph U Lehmann. An “infodemic”: leveraging high-volume twitter data to understand early public sentiment for the coronavirus disease 2019 outbreak. 7(7):ofaa258, 2020.
-  May Oo Lwin, Jiahui Lu, Anita Sheldenkar, Peter Johannes Schulz, Wonsun Shin, Raj Gupta, and Yinping Yang. Global sentiments surrounding the covid-19 pandemic on twitter: analysis of twitter trends. JMIR public health and surveillance, 6(2):e19447, 2020.
-  Karishma Sharma, Sungyong Seo, Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Covid-19 on social media: Analyzing misinformation in twitter conversations. arXiv preprint arXiv:2003.12309, 2020.
-  Mohammed Ali Al-Garadi, Yuan-Chi Yang, Sahithi Lakamana, and Abeed Sarker. A text classification approach for the automatic detection of twitter posts containing self-reported covid-19 symptoms. 2020.
-  Abeed Sarker, Sahithi Lakamana, Whitney Hogg-Bremer, Angel Xie, Mohammed Ali Al-Garadi, and Yuan-Chi Yang. Self-reported covid-19 symptoms on twitter: an analysis and a research resource. Journal of the American Medical Informatics Association, 27(8):1310–1315, 2020.
-  Codatmo. 2021 welcome to the codatmo site. https://codatmo.github.io. Accessed: 1 October 2021.
-  Uk government. 2021 coronavirus (covid-19) in the uk. https://coronavirus.data.gov.uk/details/deaths. Accessed: 1 October 2021.
-  Uk government. 2021 coronavirus (covid-19) in the uk. https://coronavirus.data.gov.uk/details/healthcare. Accessed: 1 October 2021.
-  Zoe app: covid-public-data. https://console.cloud.google.com/storage/browser/covid-public-data;tab=objects?prefix=&forceOnObjectsSortingFiltering=false. Accessed: 1 October 2021.
-  Nhs digital. 2021 potential coronavirus (covid-19) symptoms reported through nhs pathways and 111 online. https://digital.nhs.uk/data-and-information/publications/statistical/mi-potential-covid-19-symptoms-reported-through-nhs-pathways-and-111-online/latest. Accessed: 1 October 2021.
-  Covid-19 terms and meddra. https://www.meddra.org/COVID-19-terms-and-MedDRA. Accessed: 1 October 2021.
-  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
-  Kalev Leetaru, Shaowen Wang, Guofeng Cao, Anand Padmanabhan, and Eric Shook. Mapping the global twitter heartbeat: The geography of twitter. First Monday, 2013.
-  Zhaozhong Chen, Christoffer Heckman, Simon Julier, and Nisar Ahmed. Weak in the nees?: Auto-tuning kalman filters with bayesian optimization. In 2018 21st International Conference on Information Fusion (FUSION), pages 1072–1079. IEEE, 2018.
-  Huang C-Y, Tong H, He J, and Maciejewski R. Location prediction for tweets. Frontiers in Big Data, 2:5, 2019.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.