The rapid progression of the COVID-19 pandemic has provoked large-scale data collection efforts at an international level to study the epidemiology of the virus and inform policies. Various studies have been undertaken to predict the spread, severity, and unique characteristics of COVID-19 infection across a broad range of clinical, imaging, and population-level datasets Gostic et al. (2020); Liang et al. (2020); Menni et al. (2020); Shi and others (2020). For instance, Menni et al. (2020) use self-reported data from a mobile app to predict a positive COVID-19 test result based upon symptom presentation. Anosmia was shown to be the strongest predictor of disease presence, and a model for disease detection using symptom-based predictors was reported to have a sensitivity of about 65%. Studies like Parma and others (2020) have shown that ageusia and anosmia are widespread sequelae of COVID-19 pathogenesis. Since the onset of COVID-19, there has also been a significant amount of work in mathematical modeling to understand the outbreak under different situations for different demographics Menni and others (2020); Saad-Roy and others (2020); Wilder et al. (2020).
Carnegie Mellon University (CMU) and the University of Maryland (UMD) have built chronologically aggregated datasets of self-reported COVID-19 symptoms by conducting surveys at national and international levels Fan and others (2020); group, Carnegie Mellon University (2020). The surveys ask whether the respondent has experienced several of the common symptoms of COVID-19 (e.g. anosmia, ageusia, cough), in addition to various behavioral questions concerning the number of trips a respondent has taken outdoors and whether they have received a COVID-19 test.
In this work, we perform several studies using the CMU, UMD and OxCGRT datasets Fan and others (2020); group, Carnegie Mellon University (2020); Hale et al. (2020). Our experiments examine correlations among variables in the CMU data to determine which symptoms and behaviors are most correlated with a high % of COVID-like illness (CLI). We examine how different symptoms impact the % of the population with CLI across different spatio-temporal and demographic (age, gender) settings. We also predict the % of the population that tested positive for COVID-19 and achieve a Mean Relative Error (MRE) of 60%.
Further, our experiments involve time-series analysis of these datasets to forecast CLI over time. Here we identify how different spatial window trends vary across different temporal windows. We aim to use the findings from this method to understand the possibilities of modelling CLI for geographic areas in which data collection is sparse or non-existent. Furthermore, results from our experiments can potentially guide public health policies for COVID-19.
To the best of our knowledge, this is the first study to use self-reported symptoms collected across spatio-temporal windows to understand the prevalence and outbreak of COVID-19.
The CMU Symptom Survey aggregates the results of a survey run by CMU group, Carnegie Mellon University (2020), which was distributed across the US to ~70k random Facebook users daily. It has 104 columns, including weighted signals (adjusted for sampling bias), unweighted signals, and demographic columns (age, gender, etc.) for county- and state-level data. We use the data from April 4, 2020 to September 11, 2020.
The UMD Global Symptom Survey aggregates the results of a survey conducted by the UMD through Facebook Fan and others (2020). We use the data of 968 regions, available from May 1 to September 11. There are 28 unweighted signals provided, each with a weighted form (adjusted for sampling bias). These signals include self-reported symptoms, exposure information, general hygiene, etc.
The Oxford COVID-19 Government Response Tracker (OxCGRT) Hale et al. (2020) contains government COVID-19 policy data as a numerical scale value representing the extent of government action.
3 Method and Experiments
Correlation Studies: Correlation between features of the dataset provides crucial information about the features and the degree of influence they have over the target value. We conduct correlation studies on different subgroups, such as symptomatic respondents, asymptomatic respondents and varying demographic regions in the CMU dataset, to discover relationships among the signals and with the target variable. We also investigate the significance of obesity and population density for susceptibility to COVID-19 at the state level CDC (2020). Please refer to the Appendix for more information.
Outbreak Prediction: For outbreak prediction, we predict the % of the population that tested positive from the CMU state data. After pruning unweighted and other signals, we are left with 36 input signals (refer to the Appendix for details about the signal pruning process). We rank these 36 signals according to their f_regression score (the F-statistic of the correlation with the target variable) and predict the target variable using the top n ranked features. We experiment with different values of n, and several models from scikit-learn Pedregosa et al. (2011) are tested. Only the results for the best-performing model (Gradient Boosting) are shown.
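The ranking step can be illustrated with a small standard-library sketch. It assumes the univariate F-statistic F = r² / (1 − r²) × (n − 2), where r is the Pearson correlation between one signal and the target, which is the quantity scikit-learn's f_regression reports per feature. The signal names and values below are made up for illustration and are not taken from the CMU data.

```python
import math

def f_statistic(x, y):
    """Univariate regression F-statistic: F = r^2 / (1 - r^2) * (n - 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return r * r / (1.0 - r * r) * (n - 2)

def rank_signals(signals, target):
    """Return signal names sorted by descending F-statistic."""
    scores = {name: f_statistic(values, target) for name, values in signals.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up values (hypothetical signal names, not real survey data).
target = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]
signals = {
    "cli_in_community": [0.9, 2.0, 3.1, 4.0, 5.2, 5.9],     # tracks the target closely
    "worked_outside_home": [3.0, 1.0, 4.0, 2.0, 6.0, 5.0],  # loosely related
    "high_blood_pressure": [2.0, 5.0, 1.0, 4.0, 3.0, 2.5],  # mostly noise
}

ranking = rank_signals(signals, target)
top_n = ranking[:2]  # keep the top n = 2 signals as model inputs
print(ranking)
```

In practice the selected top n signals would then be passed to a regressor such as Gradient Boosting; that step is omitted here to keep the sketch dependency-free.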
Time Series Analysis: We predict the % of people that tested positive using the CMU dataset and the % of people with CLI using the UMD dataset, using various combinations of features in the CMU (36) and UMD (56) datasets for multivariate multi-step time series forecasting. Given that the data is spread across different spatial windows (geographies) at a state level, we employ an agglomerative clustering method independently on symptoms and behavioural/external patterns, and sample locations which are not in the same cluster for our analysis. Using the Augmented Dickey-Fuller test Cheung and Lai (1995), we found the time series samples for these spatial windows to be stationary. Furthermore, we bucket the data based on the age and gender of the respondents to provide granular insights into the model performance on various demographics. With a total of 12 demographic buckets [(age, gender) pairs] available, we use a Vector Auto Regressive (VAR) Holden (1995) model and an LSTM Gers et al. (1999) model for the experiments. We also look at the impact of government policies (contact tracing, etc.) on the spread of the virus.
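For reference, the standard textbook form of a VAR model of order p is sketched below. The symbols are assumptions introduced here for illustration (the paper does not state its notation or the lag order used): y_t is the vector of k signals for one spatial window on day t, c is an intercept vector, A_i are k × k coefficient matrices, and ε_t is a noise term.

```latex
\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} A_i \, \mathbf{y}_{t-i} + \boldsymbol{\varepsilon}_t,
\qquad \mathbf{y}_t \in \mathbb{R}^{k}, \quad A_i \in \mathbb{R}^{k \times k}
```

Multi-step forecasts are obtained by applying this recursion repeatedly, feeding each predicted vector back in as an input for the next step. The stationarity check with the Augmented Dickey-Fuller test matters because this recursion assumes the coefficients are stable over time.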
4 Results and Discussion
Correlation Studies: State-level analysis revealed a mild positive correlation between the % of people who tested positive and the statewide obesity level. Here obesity is defined as a BMI of 30.0 or higher NIH (2020). These results are consistent with prior clinical studies like Chan and others (2020) and indicate that further research is required to see whether a lack of certain nutrients like Vitamin B, Zinc, or Iron, or having a BMI of 30.0 or higher, could make an individual more susceptible to COVID-19. Figure 1 shows the correlation among multiple self-reported symptoms, with the symptoms that have a significant positive correlation highlighted. This reveals that anosmia, ageusia and fever are relatively strong indicators of COVID-19. From Figure 5, we see that contact with a COVID-19 positive individual is strongly correlated with testing COVID-19 positive. Conversely, the % of the population who avoid outside contact and the % of the population testing positive for COVID-19 have a negative correlation. We also find a mild positive correlation between population density and the % of the population reporting COVID-19 positivity, which indicates easier transmission of the virus in congested environments. These observations reaffirm the highly contagious nature of the virus and the need for social distancing.
The above results motivate us to estimate the % of people tested COVID-19 positive based on % of people who had a direct contact with anyone who recently tested positive. In doing so, we achieve an MRE of 2.33% and MAE of 0.03.
|Data split||n||MAE||MRE (%)||95% CI for MRE|
|Age 18-34||30||1.23||66.35||(65.59, 67.12)|
|Age 35-54||35||1.29||67.59||(67.13, 68.04)|
|Age 55+||33||1.20||66.40||(65.86, 66.94)|
Results for prediction of the % of population tested positive across demographics. The 95% confidence interval (CI) for MRE is calculated over 20 runs (data shuffled randomly every time). The MRE and MAE are averages over the 20 runs.
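One common way to obtain such an interval from 20 runs is a Student's t interval around the mean. The sketch below assumes that construction, which the text does not confirm, and uses made-up MRE values rather than the reported results. The constant 2.093 is the two-sided 95% critical value of the t distribution with 19 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom (20 runs).
T_CRIT_95_DF19 = 2.093

def mean_with_ci(values, t_crit):
    """Return the sample mean and a t-based confidence interval around it."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, (mean - half_width, mean + half_width)

# Made-up MRE values (%) for 20 shuffled runs; not the paper's results.
mre_runs = [66.0 + 0.1 * (i % 5) for i in range(20)]

mean_mre, (ci_low, ci_high) = mean_with_ci(mre_runs, T_CRIT_95_DF19)
print(f"MRE = {mean_mre:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```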
Policies vs CLI/Community Sick Impacts: The impact of different non-pharmaceutical interventions (NPIs) can be analysed by combining the CMU and UMD data with the Oxford data Hale et al. (2020). We report one such analysis here: lifting stay-at-home restrictions resulted in a sudden spike in the number of cases. This can be seen in Figure 4.
Error Metric - We find that a low MAE value is misleading when predicting the spread of the virus: the MAE for outbreak prediction is low and has a small range (1-1.4), but more than 75% of the target values lie between 0 and 2.6, because only a small percentage of the entire population has COVID-19. An absolute error of about 1 is therefore large relative to the target, which makes MRE a better metric to use.
Outbreak prediction on CMU Dataset: Table 1 shows the best accuracy achieved per dataset. For every dataset, the best n is in the 30s. We achieve an MRE of 60.40% for the entire dataset. The performance is better on Female-only data than on Male-only data, and slightly better on the 55+ age data than on other age groups. This can also be observed in Figure 2.
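A small worked example shows why the two metrics diverge when the target is small. The definitions here are assumptions, since the text does not spell them out: MAE is the mean of |y − ŷ|, and MRE is the mean of |y − ŷ| / |y| expressed as a percentage. The numbers are invented to mimic targets in the 0 to 2.6 range.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mre(y_true, y_pred):
    """Mean relative error in percent (assumes no true value is zero)."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

# Invented targets in the low range typical of "% tested positive".
y_true = [0.5, 1.0, 2.0, 2.5]
y_pred = [1.5, 2.0, 3.0, 3.5]  # every prediction is off by exactly 1 point

print(round(mae(y_true, y_pred), 2))  # 1.0: looks small in absolute terms
print(round(mre(y_true, y_pred), 2))  # 97.5: the same errors are large relative to the target
```

An MAE of 1 reads as a small error, yet on targets this small the predictions are off by nearly 100% on average, which is the effect described above.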
Top Features - Except for minor reordering, the top 5 features across every data split are CLI in community, loss of smell, CLI in household (HH), fever in HH, and fever. The top 6-10 features per data split are given in Table 3. We can see that ’worked outside home’ and ’avoid contact most time’ are useful features for the male, female and 55+ age data. Figure 2 shows MRE vs. the number of features selected for different data splits. Overall, the error decreases as we add more features. However, the decrease in error is small (about 1%) when we go beyond 20 features.
Time Series Analysis - As seen in Tables 2, 3, 4 and 5, we are able to forecast PCT_CLI with an MRE of 15.11% using just 23 features from the UMD dataset. VAR performs better than LSTM on average, which can be explained by the dearth of data available. Furthermore, the outbreak forecast for New York achieved an MRE of 11.28% using only 10 features. This might be caused by an inherent bias in the sampling strategy or participant responses. For example, the high correlation noted between anosmia and COVID-19 prevalence suggests several probable causes of confounding relationships between the two. This could also occur if both symptoms are specific and sensitive for COVID-19 infection.
|New York||11.28, 95% CI [10.9, 11.6]||0.15|
|California||13.48, 95% CI [13.4, 13.5]||0.23|
|Florida||17.49, 95% CI [17.5, 17.5]||0.38|
|New Jersey||17.93, 95% CI [17.9, 18]||0.26|
|New York||23.61, 95% CI [23.6, 23.7]||0.36|
|California||45.06, 95% CI [45, 45.2]||0.91|
|Florida||64.98, 95% CI [64.8, 65.1]||1.51|
|New Jersey||15.78, 95% CI [15.7, 15.9]||0.26|
|Tokyo||17.77, 95% CI [17.7, 17.8]||0.28|
|British Columbia||21.35, 95% CI [21.3, 21.4]||0.34|
|Northern Ireland||42.72, 95% CI [42.7, 42.8]||0.87|
|Lombardia||15.31, 95% CI [15.3, 15.4]||0.22|
|Tokyo||30.00, 95% CI [29.9, 30.1]||0.53|
|British Columbia||31.11, 95% CI [30.9, 31.3]||0.56|
|Northern Ireland||42.46, 95% CI [42.1, 42.9]||1.21|
|Lombardia||16.11, 95% CI [16, 16.2]||0.21|
Symptoms vs CLI overlap: The % of the population with symptoms like cough, fever and runny nose is much higher than the % of people who suffer from CLI or the % of people who are sick in the community. Only 4% of the people in the UMD dataset who reported having CLI were not suffering from chest pain and nausea.
Ablation Studies: Here, we perform ablation studies to verify and investigate the relative importance of the features that were selected using the f_regression feature ranking algorithm from scikit-learn (2007-2020). In the following experiments, the top features obtained from the f_regression algorithm are considered as the subset for evaluation.
In this experiment, the target variable, which is the percentage of people affected by COVID-19, is estimated from a given set of top features by dropping 1 feature at a time in every iteration, in descending order of rank. The results are visualised in Figure 6, from which it is clear that there is a considerable increase in error when the most significant feature is dropped, while the loss in performance is not as drastic when any other feature is dropped. This reaffirms our feature selection method.
Cumulative Feature Dropping: In this experiment, we estimate the target variable based on the top N = 10 features and then repeat the experiment with N - i features in every iteration, where i is the iteration count. The features are dropped in descending order of rank. Figure 7 shows the results. The change in slope from the start to the end of the graph strongly supports our previous inference that the most important feature has a large effect on the error rate, and reaffirms our feature selection algorithm.
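The two ablation loops can be sketched as follows. This is a schematic under stated assumptions, not the authors' code: `evaluate` is a hypothetical callback that would train a model on the given features and return its error (e.g. MRE), and the toy `toy_error` function with its invented importance weights merely stands in for that training step.

```python
def single_feature_dropping(features, evaluate):
    """Drop one feature at a time (keeping all others) and record the error."""
    results = {}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        results[dropped] = evaluate(kept)
    return results

def cumulative_feature_dropping(features, evaluate):
    """At iteration i, evaluate with the N - i features left after dropping the top i."""
    return [evaluate(features[i:]) for i in range(len(features))]

# Features in descending rank order; the weights are invented for illustration.
importance = {
    "cli_in_community": 20.0,
    "loss_of_smell": 8.0,
    "cli_in_household": 5.0,
    "fever_in_household": 4.0,
    "fever": 3.0,
}
ranked = list(importance)

def toy_error(kept):
    """Stand-in for 'train a model and return its error': fewer useful features, more error."""
    return 100.0 - sum(importance[f] for f in kept)

single = single_feature_dropping(ranked, toy_error)
cumulative = cumulative_feature_dropping(ranked, toy_error)
print(single)
print(cumulative)
```

With a real model in place of `toy_error`, the first dictionary corresponds to the single-feature experiment and the list to the cumulative one, indexed by iteration count i.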
5 Conclusion And Future Work
In this work, we analyse the benefits of the self-reported COVID-19 symptoms present in the CMU, UMD, and Oxford datasets. We present correlation analysis, outbreak prediction, and time series prediction of the % of respondents with positive COVID-19 tests and the % of respondents who show COVID-like illness. By clustering datasets across different demographics, we reveal micro- and macro-level insights into the relationship between symptoms and outbreaks of COVID-19. These insights might form the basis for future analysis of the epidemiology and manifestations of COVID-19 in different patient populations. Our correlation and prediction studies identify a small subset of features that can predict measures of COVID-19 prevalence to a high degree of accuracy. Using this, more efficient surveys can be designed to measure only the features most relevant to predicting COVID-19 outbreaks. Shorter surveys will increase the likelihood of respondent participation and decrease the chances that respondents provide false information. We believe that our analysis will be valuable in shaping health policy and in COVID-19 outbreak prediction for areas with low levels of testing, by providing prediction models that rely on self-reported symptom data.
In the future, we plan to use advanced deep learning models for predictions. Furthermore, given the promise shown by population-level symptom data, we see more relevant and timely problems that can be solved with individual-level data. Machine learning systems built on data from mobile/wearable devices can help understand users’ vitals, sleep behavior, etc.; sharing such data at an individual level can augment the participatory surveillance dataset and thereby the predictions made. This can be achieved without compromising the privacy of the individual. We also plan to compare the reliability of such participatory surveillance methods against the actual number of cases in the corresponding regions, and to assess their generalisability across the population.
We acknowledge the inputs of Seojin Jang, Chirag Samal, Nilay Shrivastava, Shrikant Kanaparti, Darshan Gandhi and Priyanshi Katiyar. We further thank Prof. Manuel Morales (University of Montreal), Morteza Asgari and Hellen Vasques for helping to develop a dashboard to showcase the results. We also thank Dr. Thomas C. Kingsley (Mayo Clinic) for his suggestions on future work and his input on many comments.
- CDC (2020). Data and statistics. https://www.cdc.gov/obesity/data/index.html
- Chan and others (2020). Type I interferon sensing unlocks dormant adipocyte inflammatory potential. Nature Communications 11 (1).
- Cheung and Lai (1995). Lag order and critical values of the augmented Dickey–Fuller test. Journal of Business & Economic Statistics 13 (3), pp. 277–280.
- Fan and others (2020). COVID-19 world symptom survey data API.
- Gers et al. (1999). Learning to forget: continual prediction with LSTM. 1999 Ninth International Conference on Artificial Neural Networks (ICANN 99).
- Gostic et al. (2020). Estimated effectiveness of symptom and risk screening to prevent the spread of COVID-19. eLife 9.
- group, Carnegie Mellon University (2020).
- Hale et al. (2020). Oxford COVID-19 Government Response Tracker. Blavatnik School of Government.
- Holden (1995). Vector auto regression modeling and forecasting. Journal of Forecasting 14 (3), pp. 159–166.
- Liang et al. (2020). Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19. JAMA Internal Medicine 180 (8), pp. 1081–1089.
- Menni and others (2020). Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature Medicine, pp. 1–4.
- Menni et al. (2020). Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature Medicine 26 (7), pp. 1037–1040.
- NIH (2020). Adult body mass index (BMI). https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmicalc.htm
- Parma and others (2020). More than smell. COVID-19 is associated with severe impairment of smell, taste, and chemesthesis. medRxiv.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Saad-Roy and others (2020). Immune life history, vaccination, and the dynamics of SARS-CoV-2 over the next 5 years. Science.
- Shi and others (2020). Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Reviews in Biomedical Engineering, pp. 1–1.
- Sklearn f_regression (2007-2020). https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
- Wilder et al. (2020). Tracking disease outbreaks from sparse data with Bayesian inference. arXiv preprint arXiv:2009.05863.
Sample features present in the datasets are listed in Table 6.
|Dataset||Sample features|
|UMD||COVID-like illness symptoms, influenza-like illness symptoms, mask usage|
|CMU||sore throat, loss of smell/taste, chronic lung disease|
|OxCGRT||containment and closure policies, economic policies, health system policies|
The detailed plots of the correlation analysis of the CMU dataset are shown in Figure 11.
We drop demographic columns such as date, gender, age etc. Next, we drop the unweighted columns because their weighted counterparts exist. We also drop features like the % of people who tested negative and the weighted % of people who tested positive, as these are directly related to testing and would make the prediction trivial. Further, we drop derived features, like the estimated % of people with influenza-like illness, because they were not directly reported by the respondents. Finally, we drop the features that report a mean (such as the average number of people in the respondent’s household who have CLI) because their range was in the order of . After the entire process we are left with 36 features.
|Rank||Signal||f_regression score|
|1||COVID-like Illness in Community||14938.48816456|
|2||Loss of smell or taste||9498.89229794|
|3||COVID-like Illness in Household||6050.88250153|
|4||Fever in Household||5490.15612527|
|6||Sore Throat in Household||1787.42269067|
|7||Avoid contact with others most of the time||1494.25038393|
|8||Difficulty breathing in Household||1330.48793481|
|9||Persistent Pain Pressure in Chest||1257.78331468|
|11||Worked outside home||1023.50285601|
|12||Nausea or Vomiting||1016.94758914|
|13||Shortness of breath in Household||1004.67944587|
|17||Shortness of Breath||440.88344033|
|18||Cough in Household||322.05679444|
|19||No symptoms in past 24 hours||241.72819985|
|21||Chronic Lung Disease||224.24651285|
|23||Other Pre-existing Disease||158.31567587|
|24||Tiredness or Exhaustion||134.36715409|
|26||No Above Medical Conditions||84.40193799|
|28||Multiple Medical Conditions||52.61630823|
|32||Average people in Household with COVID-like illness||14.52969291|
|34||Muscle Joint Aches||1.72398411|
|35||High Blood Pressure||0.48328156|
In Table 8, we continue to experiment with different spatial windows, predicting PCT_CLI for locations like "Tokyo" and "British Columbia" using different combinations of features. In Table 10, we extend the analysis to more US states with an LSTM-based deep learning model to predict PCT_CLI, and we notice that there is no significant gain from using DL models (probably due to lack of data). We also try to predict pct_community_sick, and the results can be seen in Table 9.
In Figures 13 and 15, we use Dynamic Time Warping (DTW) to compare how well our forecasted time series matches the original curve. DTW was chosen for its flexibility in comparing time series of different lengths. This enables us to compare different temporal windows across different spatial windows to understand the effectiveness of the model in different contexts.
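For readers unfamiliar with DTW, a minimal dynamic-programming sketch is given below. It assumes the classic formulation with absolute difference as the local cost and no window constraint; the variant and any library used for the figures are not stated in the text. The series are invented.

```python
def dtw_distance(a, b):
    """Classic DTW: minimum cumulative |a_i - b_j| cost over monotone alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Invented series of different lengths: the second one lingers at the value 2.
observed = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 2.0, 3.0]

print(dtw_distance(observed, forecast))         # 0.0: same shape, different pacing
print(dtw_distance(observed, [2.0, 3.0, 4.0]))  # 2.0: shifted up by one
```

Because the alignment may repeat points, two curves with the same shape but different lengths or pacing score a distance of zero, which is what makes the measure usable across temporal windows of unequal length.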
|Location||Bucket||RMSE||MAE||MRE (%)||Features Used|
|Abu Dhabi||male and age 18-34||2.43||2.23||167.86||difficulty breathing + anosmia ageusia (weighted)|
|Tokyo||female and age 35-54||0.56||0.47||30.16||difficulty breathing + anosmia ageusia (weighted)|
|British Columbia||male and age 55+||1.09||0.59||28.68||difficulty breathing + anosmia ageusia (weighted)|
|Lombardia||male and age 55+||0.95||0.67||28.72||difficulty breathing + anosmia ageusia (weighted)|
|Lombardia||male and age 55+||0.95||0.67||28.72||Behavioural / external features (weighted)|
|British Columbia||male and age 55+||1.07||0.76||50.17||Behavioural / external features (weighted)|
|Tokyo||female and age 35-54||0.58||0.49||31.38||Behavioural / external features (weighted)|
|Abu Dhabi||male and age 18-34||2.91||2.78||207.94||Behavioural / external features (weighted)|
|Location||Bucket||RMSE||MAE||MRE (%)|
|Abu Dhabi||male and age 18-34||9.99||8.94||73.11|
|Tokyo||female and age 35-54||1.13||1.02||41.67|
|British Columbia||male and age 55+||3.21||2.65||137.13|
|Lombardia||male and age 55+||1.25||1.25||24.49|
|State||Bucket||RMSE||MAE||MRE (%)||Model|
|TX||male and age overall||1.56||1.21||43.00||VAR|
|CA||male and age overall||1.22||0.93||23.44||VAR|
|NY||female and age overall||0.7||0.56||21.59||VAR|
|FL||female and age overall||1.48||1.18||19.35||VAR|
|TX||male and age overall||6.28||4.06||89.4||LSTM|
|CA||male and age overall||2.83||2.68||71.24||LSTM|
|NY||female and age overall||2.02||1.9||68.17||LSTM|
|FL||female and age overall||4.33||4.19||73.34||LSTM|
|Data split||Feature removed||MAE||MRE (%)|
|Male||no feature removed||1.389806313||77.42367322|
|Female||no feature removed||1.100879926||57.63336087|
|Young||no feature removed||1.231519891||67.07207641|
|Mid||no feature removed||1.276053866||67.05778653|
|Old||no feature removed||1.172592164||63.98633923|
|Overall||no feature removed||1.143995128||60.83421503|