Since the identification of the SARS-CoV-2 virus, the COVID-19 pandemic has led to over 236 million confirmed cases and 4.8 million confirmed deaths. In response to this, one of the largest ever public health responses has been mounted, with over 6.2 billion vaccine doses administered and a large variety of non-pharmaceutical interventions that have fundamentally changed behaviour and healthcare provision around the world since the start of 2020 (World Health Organization, 2021; Hale et al., 2021; Google LLC, 2021).
Understanding the clinical presentation and course of SARS-CoV-2 infection remains central for transmission control, particularly in determining policy for identification of cases for isolation and tracing of their contacts (Fyles et al., 2021; Crozier et al., 2021), and potentially to predict clinical outcomes. COVID-19 cases can present with symptoms from a wide range of categories: respiratory, systemic, cardiovascular and gastrointestinal (Struyf et al., ), with high variability between individuals depending on factors such as age and comorbidities (Williamson et al., 2020; Clift et al., 2020). In addition, a significant proportion of infections are estimated to remain asymptomatic (Buitrago-Garcia et al., ).
Assessment of diversity of genotypes and (endo-)phenotypes is a long-standing
tool in both infectious diseases and chronic non-communicable diseases, which
has been significantly accelerated by modern experimental and theoretical
techniques (Hofmann and Zeuzem, 2011; Deliu et al., 2017; Geifman et al., 2018). In particular, such
analysis often helps with the standard process of identifying multiple disease
aetiologies with the same presentation, or vice versa a single disease with
highly variable outcomes. This latter distinction is particularly important for
COVID-19, where different courses of action, including public health
interventions, are taken depending on symptom status (NHS, 2021).
Here, we report patterns of symptom occurrence, co-occurrence and clustering
in PCR-positive symptomatic SARS-CoV-2 cases – previously considered predominantly in hospitalisation data heavily skewed towards more severe infections (Swann et al., 2020; Millar et al., ; Sudre et al., 2021) – in four very large community-based datasets for the time period May 2020 to March 2021 in the UK. Due to this data collection time period, and the effects of vaccination on preventing disease, we expect the datasets to contain predominantly unvaccinated individuals. These datasets are diverse in their sampling and data collection methods and include: (a) 1,637,965 symptomatic cases from ‘Pillar 2’ testing data from the National Health Service (NHS) Test and Trace system, designed to capture cases in the general population; (b) 112,925 symptomatic cases from the Second Generation Surveillance System (SGSS) from England’s national laboratory reporting system, which includes cases associated with healthcare settings among patients and healthcare staff; (c) 52,084 symptomatic self-reported cases from the COVID-19 Symptom Study (CSS), which uses a smartphone app associated with https://covid.joinzoe.com/ to collect daily symptom reports; and (d) 9,166 symptomatic cases from The Office for National Statistics COVID-19 Infection Survey (CIS), a longitudinal study of a representative sample of UK households.
From each dataset, we extract all PCR-positive individuals, and associate them with symptoms experienced within a time window of the test appropriate for the dataset. Further details of each dataset, its collection and extraction are given in the Supplementary Materials. For the $i$-th individual and $j$-th symptom, we let $x_{ij} = 1$ if the symptom is present during the time window around the positive test and $x_{ij} = 0$ otherwise. For a dataset with $n$ individuals measuring $m$ symptoms, we can then construct an $n \times m$ matrix $X = (x_{ij})$, where the rows of this matrix form a set of length-$m$ feature vectors for individuals, $\{x_{i\cdot}\}_{i=1}^{n}$, and the columns form a set of length-$n$ feature vectors for symptoms, $\{x_{\cdot j}\}_{j=1}^{m}$, each of which can then be used as input for unsupervised learning algorithms. In addition to descriptive analysis of the data, we used three complementary approaches to looking at clustering and co-occurrence of symptoms.
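As an illustration only, the binary individual-by-symptom matrix described above could be assembled as in the following sketch. The function names, the case identifiers and the toy symptom reports are all hypothetical and are not taken from the study's data or code; the sketch uses only the Python standard library.

```python
def build_symptom_matrix(reports, symptoms):
    """Build a binary matrix with one row per individual and one column per symptom.

    reports:  dict mapping a (hypothetical) case identifier to the set of symptoms
              recorded in the time window around the positive test.
    symptoms: ordered list of symptom names defining the columns.
    """
    ids = sorted(reports)
    column = {name: j for j, name in enumerate(symptoms)}
    matrix = [[0] * len(symptoms) for _ in ids]
    for i, case_id in enumerate(ids):
        for name in reports[case_id]:
            if name in column:  # ignore symptoms not collected in this dataset
                matrix[i][column[name]] = 1
    return ids, matrix


def symptom_vectors(matrix):
    """Return the columns of the matrix: one binary vector per symptom."""
    return [list(col) for col in zip(*matrix)]


# Toy example (invented data).
reports = {
    "case_a": {"cough", "fever"},
    "case_b": {"headache"},
    "case_c": {"cough", "headache", "fatigue"},
}
symptoms = ["cough", "fever", "headache", "fatigue"]
ids, X = build_symptom_matrix(reports, symptoms)
columns = symptom_vectors(X)
```

The rows of `X` serve as individual-level feature vectors and `columns` as symptom-level feature vectors, matching the two views of the data used in the analyses that follow.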
We first performed hierarchical clustering using complete linkage (Hastie et al., 2009) and the Jaccard distance between symptom vectors as appropriate for binary data, with results shown in Fig. 1. This figure shows the matrix of such distances as a heatmap, with a dendrogram to its right. We read these dendrograms from right to left, with splitting points representing points at which the algorithm suggests a separation of symptoms into groups on the basis of their occurrence in infected individuals.
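The following is a minimal, standard-library sketch of agglomerative clustering with complete linkage on Jaccard distances between binary symptom vectors. It is not the study's implementation: the function names and the toy vectors are invented for illustration, and the naive search over cluster pairs is only suitable for a small number of symptoms.

```python
from itertools import combinations


def jaccard_distance(u, v):
    """1 - |intersection| / |union| for two equal-length binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def complete_linkage(vectors):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, height).

    vectors: dict mapping symptom name -> binary vector over individuals.
    Under complete linkage the distance between two clusters is the largest
    pairwise distance between their members.
    """
    names = list(vectors)
    dist = {
        frozenset((a, b)): jaccard_distance(vectors[a], vectors[b])
        for a, b in combinations(names, 2)
    }

    def cluster_distance(c1, c2):
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy symptom vectors over six hypothetical individuals.
toy = {
    "cough":     [1, 1, 1, 1, 0, 0],
    "fever":     [1, 1, 1, 0, 0, 0],
    "nausea":    [0, 0, 0, 0, 1, 1],
    "diarrhoea": [0, 0, 0, 1, 1, 1],
}
merges = complete_linkage(toy)
```

In this toy example the two gastrointestinal symptoms merge with each other before joining the respiratory/systemic pair at the greatest height, which is the kind of pattern read off the dendrograms. For real data, library routines such as SciPy's `pdist` (with the `jaccard` metric) and `linkage` (with `method="complete"`) would be the more usual choice.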
Of the plots, the CIS data in panel D shows the clearest signal of separation of symptoms under this analysis method: gastrointestinal symptoms form a separate symptom grouping, joining the rest of the hierarchy only at the
highest level; the distinctive loss of taste and smell joins the tree at the
next; and the remaining symptoms join individually at remaining levels. In
Pillar 2 and SGSS data (Fig. 1 panels A&B) a similar pattern is observed, except for
additional complexity associated with uncommon symptoms in of
positives and for Pillar 2 loss of smell or taste joining at a similar point on
the tree to upper respiratory tract symptoms. For the CSS data in panel C,
we see that shortness of breath and hoarse voice, symptoms not collected in
other studies, appear before gastrointestinal symptoms join the tree.
Secondly, we performed Logistic Principal Component Analysis (LPCA), an extension of Principal Component Analysis to binary data (Landgraf and Lee, 2020). This method is used to project the set of individual feature vectors for each dataset onto (in our case two) components that sequentially are as close to the original set of vectors as possible. The results of this analysis are shown in Fig. 2, and show quite strikingly consistent patterns across datasets, despite the various biases and data collection techniques employed.
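For orientation, the LPCA objective can be written as follows. This is our reading of the formulation in Landgraf and Lee (2020), not a statement of how the analysis here was configured. The notation is assumed for this sketch: $X = (x_{ij})$ is the $n \times m$ binary matrix of individuals by symptoms, $c$ is a large positive constant used to approximate the natural parameters of the saturated model, $\mu$ is a vector of main effects, and $U$ is an $m \times k$ matrix with orthonormal columns (here $k = 2$).

```latex
\tilde{\theta}_{ij} = c\,(2x_{ij} - 1), \qquad
\hat{\Theta} = \mathbf{1}_n \mu^{\top} + \bigl(\tilde{\Theta} - \mathbf{1}_n \mu^{\top}\bigr)\, U U^{\top},
\qquad U^{\top} U = I_k,

\min_{\mu,\,U}\; D(X; \hat{\Theta})
  = -2 \sum_{i=1}^{n} \sum_{j=1}^{m}
    \Bigl[ x_{ij}\,\hat{\theta}_{ij} - \log\bigl(1 + e^{\hat{\theta}_{ij}}\bigr) \Bigr].
```

The columns of $U$ play the role of the principal component loadings over symptoms, so a first column with all entries of the same sign corresponds to a component that tracks the overall number of symptoms.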
The first strong signal in the data is that the first principal component
involves all symptoms in the same direction, meaning that the closest
one-dimensional description of community symptoms is number of symptoms
experienced. The second principal component, with some exceptions that vary by
dataset, suggests that a source of variation is negative correlation between
upper respiratory tract symptoms and systemic (Pillar 2, SGSS and CIS) and
gastrointestinal symptoms (SGSS, CIS and CSS). The overall interpretation of
these results is that a parsimonious description of COVID-19 symptoms at the
individual level can be provided by quantifying the total number of symptoms
experienced, followed by the relative contribution of upper respiratory
symptoms versus systemic or gastrointestinal symptoms to the total number of
symptoms experienced. The contribution of upper respiratory versus systemic and
gastrointestinal symptoms is also seen and in fact strengthened when examining
the age-stratified data (children 0-17 years, adults 18-54 years and elder
adults aged 55 years and older, see Supplementary Materials).
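The two-number summary suggested by this interpretation can be illustrated with a short sketch: the total symptom count, and the balance between upper respiratory symptoms and systemic or gastrointestinal symptoms. This is a descriptive illustration, not LPCA itself. The assignment of symptoms to groups below is an assumption made for the example, as are the function name and the definition of the balance score.

```python
# Assumed groupings for illustration only; the study's own categorisation may differ.
UPPER_RESPIRATORY = {"sore throat", "rhinitis", "sneezing"}
SYSTEMIC_OR_GI = {"fever", "fatigue", "muscle ache", "headache",
                  "diarrhoea", "nausea", "abdominal pain"}


def summarise_case(symptoms):
    """Return (total symptom count, balance score) for one case.

    symptoms: set of symptom names reported by the case.
    The balance score lies in [-1, 1]: positive when upper respiratory symptoms
    dominate, negative when systemic or gastrointestinal symptoms dominate.
    """
    total = len(symptoms)
    upper = len(symptoms & UPPER_RESPIRATORY)
    other = len(symptoms & SYSTEMIC_OR_GI)
    balance = (upper - other) / total if total else 0.0
    return total, balance


mostly_upper = summarise_case({"sore throat", "sneezing", "fever"})
mostly_systemic = summarise_case({"fever", "fatigue", "diarrhoea", "rhinitis"})
```

Symptoms outside both groups (for example cough) count towards the total but not towards the balance.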
The fact that different symptoms are highlighted by the symptom-level view of clustering taken in the hierarchical analysis and by the individual-level view of co-occurrence taken in LPCA is explained by the different questions these methods address. LPCA attempts to find a description of the overall variation in the symptoms of individuals within the dataset, while hierarchical clustering groups symptoms by suitably defined co-occurrence to find natural clusters of symptoms within the dataset. Our third main analysis method aims to provide an overall picture by considering low-dimensional embeddings of the data based on the structure of interactions encoded in the datasets. In particular, Uniform Manifold Approximation and Projection (UMAP) and associated algorithms (McInnes and Healy, 2018; McInnes et al., ) produce a low-dimensional embedding using the local structure of the data (i.e. groups of commonly co-occurring symptoms) and, provided the intrinsic dimension of the system is not too large, can capture some of the global structure of the data (i.e. the relationships between such groups of data points). The result is that symptoms which commonly co-occur are placed close to each other in the resulting low-dimensional embeddings. Hyperparameters are important for UMAP, so we performed the analysis for two different hyperparameter choices: one that focuses more on the local structure (shown in Fig. S13); and one that focuses less on the local structure and attempts to preserve more of the global structure of the data (shown in Fig. S12).
To more explicitly compare findings across datasets we extend the UMAP analyses above by using the AlignedUMAP algorithm (McInnes et al., ). AlignedUMAP takes several different datasets as inputs and finds the optimal embedding for each inputted dataset, subject to the loose constraint that data points that are shared between datasets are placed in similar positions in the low-dimensional embeddings. These are produced through a trade-off between finding the optimal embedding for individual datasets, and aligning the embedding of shared symptoms across datasets. By aligning embeddings we gain several useful insights, most importantly that an embedding can be directly compared with the others it was aligned against, allowing better assessment of similarities and differences.
We produce embeddings of each dataset that are aligned based upon the core symptoms shared by all the datasets in our analysis: cough, diarrhoea, fatigue, fever, headache, muscle ache, and sore throat. These embeddings, shown in Fig. 3, allow us to explore whether the datasets share a common underlying structure of symptom co-occurrence.
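Aligning on shared symptoms requires telling the algorithm which data points correspond across datasets. As we understand the umap-learn interface, `AlignedUMAP` takes a `relations` argument: a list of dictionaries, one per consecutive pair of datasets, mapping the index of a point in one dataset to the index of the same point in the next. The sketch below builds such dictionaries from per-dataset symptom lists. The function name and the toy symptom orderings are hypothetical, and the exact call into umap-learn is deliberately left out.

```python
CORE_SYMPTOMS = {"cough", "diarrhoea", "fatigue", "fever",
                 "headache", "muscle ache", "sore throat"}


def build_relations(symptom_lists, shared):
    """Map positions of shared symptoms between each consecutive pair of datasets.

    symptom_lists: list of symptom-name lists, one per dataset, in the order the
                   symptoms appear as data points in that dataset.
    shared:        set of symptom names used for alignment.
    Returns a list with one dict per consecutive pair: index in dataset k ->
    index of the same symptom in dataset k + 1.
    """
    relations = []
    for current, following in zip(symptom_lists, symptom_lists[1:]):
        position_in_next = {name: j for j, name in enumerate(following)}
        relations.append({
            i: position_in_next[name]
            for i, name in enumerate(current)
            if name in shared and name in position_in_next
        })
    return relations


# Toy orderings for three hypothetical datasets.
dataset_a = ["cough", "fever", "rhinitis"]
dataset_b = ["fever", "nausea", "cough"]
dataset_c = ["cough", "fever", "hoarse voice"]
relations = build_relations([dataset_a, dataset_b, dataset_c], CORE_SYMPTOMS)
```

Symptoms collected in only one dataset (such as hoarse voice in the toy example) are left unconstrained and are positioned by that dataset's own co-occurrence structure.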
Inspection of the embeddings with alignment based upon the core symptoms shared by different datasets provides some evidence of a broad structure shared across all datasets. The embeddings produced can be broadly described by a central cluster of systemic symptoms, and cough. Lower respiratory tract symptoms are typically placed nearby, in particular with shortness of breath often being placed close to fatigue. The upper respiratory tract symptoms (sore throat, rhinitis, sneezing) are typically placed further away from gastrointestinal symptoms, with the exception of lost/altered smell or taste symptoms. On most plots, the gastrointestinal symptoms exist as a tail or are slightly separated from the main central group of systemic symptoms. This complements the LPCA analysis, which suggested that individuals separate between those who experience upper respiratory tract symptoms, or those who experience a mixture of systemic and gastrointestinal symptoms.
As we did with hierarchical clustering and LPCA, we stratified each dataset based upon age bands that represent children and adolescents, adults and elders, and produced aligned embeddings for ease of comparison (see Supplementary Materials). However, AlignedUMAP allows us to directly compare more embeddings than is possible for dendrograms or symptom loadings, as there exist explicit relationships between the embeddings. We perform an additional analysis where we again age-stratify each dataset, this time into 10-year strata, and produce aligned embeddings. These embeddings can then be visualised in a 3-dimensional space to describe how patterns of symptom co-occurrence change as age increases (see Fig. 4), where linear interpolation has been used to connect the different embeddings from each ten-year age stratum. Across all datasets, we observe changes to the local structure, indicated by the splitting of the rope/ribbon-like structures for the youngest age stratum (under 10 years old), and for the older age strata (around 70 years old). The changes indicate that, despite the attempt to align symptoms in adjacent embeddings, the symptom co-occurrence patterns of the data have changed too substantially for that to be achieved.
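The linear interpolation between age-stratified embeddings can be sketched as below: each symptom's two-dimensional position is interpolated between consecutive strata, with age as the third coordinate, giving one curve per symptom. The function name, the representation of embeddings as dictionaries, and the toy coordinates are assumptions for illustration, not the study's plotting code.

```python
def interpolate_tracks(embeddings, ages, steps=5):
    """Linearly interpolate each symptom's 2-D position across age strata.

    embeddings: list of dicts, one per age stratum, mapping symptom -> (x, y).
    ages:       list of representative ages (e.g. stratum midpoints), same length.
    steps:      number of interpolation intervals between consecutive strata.
    Returns a dict mapping symptom -> list of (x, y, age) points.
    """
    shared = set(embeddings[0]).intersection(*embeddings[1:])
    tracks = {}
    for name in sorted(shared):
        points = []
        for k in range(len(embeddings) - 1):
            (x0, y0), (x1, y1) = embeddings[k][name], embeddings[k + 1][name]
            for t in range(steps):
                w = t / steps  # weight runs from 0 up to (but excluding) 1
                points.append((x0 + w * (x1 - x0),
                               y0 + w * (y1 - y0),
                               ages[k] + w * (ages[k + 1] - ages[k])))
        x_last, y_last = embeddings[-1][name]
        points.append((x_last, y_last, ages[-1]))  # close the curve at the last stratum
        tracks[name] = points
    return tracks


# Toy embeddings for two hypothetical age strata.
toy_embeddings = [
    {"cough": (0.0, 0.0), "fever": (1.0, 1.0)},
    {"cough": (2.0, 4.0), "fever": (1.0, 3.0)},
]
tracks = interpolate_tracks(toy_embeddings, ages=[5, 15], steps=2)
```

Plotting each symptom's track in three dimensions gives the rope-like structures described in the text; where adjacent embeddings differ strongly, neighbouring tracks visibly diverge.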
This is clearest in the CIS dataset, where some gastrointestinal symptoms (diarrhoea, nausea/vomiting, abdominal pain) are separated out from the main body of symptoms for the youngest and older age strata. In Pillar 2 and SGSS, we find the formation of new clusters of symptoms in the older age strata, with a first cluster containing vomiting and nausea, and a second cluster containing headache, sore throat, muscle ache and joint pain. For the CSS dataset, separation into two main symptom clusters is observed, with one cluster containing abdominal pain, muscle ache, headache, sore throat, chest pain and cough, and the second containing loss of appetite, altered/loss of smell, diarrhoea and hoarse voice, with slightly separated shortness of breath, fever, delirium and fatigue.
For the under-10s, the produced embeddings typically consist of small clusters of symptoms. The CIS dataset is the exception however, again separating out gastrointestinal symptoms from the main body of symptoms. Inspection of the Jaccard distance matrices for the youngest age strata suggests that a possible explanation may be that fewer total symptoms are reported for young children. The observed clusters in the embeddings appear to consist mainly of pairs, or triplets of symptoms that do commonly co-occur, e.g. rhinitis and sneezing. However, the level of co-occurrence between these distinct small clusters is very small leading to separation in the low dimensional embeddings.
In summary, we have shown that considerable complexity and variation exists in COVID-19 symptoms in community infections. We find that the primary source of variation is in the number of symptoms experienced by a case, but conditional on this there are various ways to be ill that provide a more fine-grained description of phenotypes. In particular, we find evidence for separation between upper respiratory and systemic symptoms, both including commonly reported symptoms, and between upper respiratory and gastrointestinal symptoms, though the latter are less common overall. While the deep structure of symptom clustering was similar across the middle range of age groups, we found some evidence that patterns of symptom reporting changed among the youngest and oldest - though further work may be required to understand whether this is due to symptom reporting differences, or differences in the symptoms experienced.
While there are some differences in our findings across the four datasets, this is unsurprising given their very different case sampling designs, data collection methods, symptom reporting windows and specific symptom data collected. Routinely tested cases, for instance, will be selected based on the symptoms that qualify cases for testing (Pillar 2) leading to lower expected variations in the presence of these symptoms compared to cases identified via random sampling. Indeed, the broad consistency of findings across these datasets, which derive from routine, representative household and participatory surveillance methods respectively, increases our confidence that our findings are robust.
Our findings have implications for the evaluation of symptomatic testing criteria in the community, school settings, and high risk settings such as care homes or hospitals. The existence of phenotypes would suggest that the one-size-fits-all criteria used in the UK may be sub-optimal in these sub-populations where multiple phenotypes are plausible. Differences by age could imply that symptomatic testing criteria should be tailored for different settings, though this would need to be balanced with what is feasible and understandable for the public. Studies that examine the optimal combination of symptoms to initiate testing of symptomatic community cases (Elliott et al., 2021; Fragaszy et al., 2021) may be implicitly assuming the existence of a single phenotype; to ensure that a set of symptom testing criteria is optimal, the possible existence of multiple phenotypes and the wide spectrum of disease must be considered. Several of the datasets in this study include only positive SARS-CoV-2 cases, a result of which is that we cannot evaluate the specificity of symptom testing criteria combinations informed by the symptom co-occurrence structures we have identified here, and this would limit our evaluation of symptomatic testing policies. Emphasis should be placed on the extent of symptom variation across COVID cases in communication with the public. This messaging is critical for the initiation of transmission control interventions including isolation, testing and contact tracing. Other studies have found that adding further specific symptoms to the criteria for community symptomatic testing in the UK could make a wider set of cases eligible (with implications for increasing testing demand) (Elliott et al., 2021; Fragaszy et al., 2021).
However, surveys have also found that a large proportion of the public is unaware of the existing symptom criteria (Smith et al., 2021), so messaging focusing strongly on this variation could improve detection of cases and control of transmission, alongside a testing and isolation policy adapted to evolving epidemic circumstances (Crozier et al., 2021). Further, it may be the case that the different characterisation of cases could inform clinical outcomes, for example the finding that cases can be described by the contribution of upper respiratory symptoms versus systemic or gastrointestinal symptoms to the total number of symptoms experienced.
With increasing vaccination, re-infections and ongoing SARS-CoV-2 evolution, as well as the resurgence of other previously suppressed respiratory infections, understanding the variability of COVID-19 symptom presentation is critical in planning community interventions for control of transmission, identification of cases potentially requiring greater care, and possibly understanding long-term presentation of the disease (Antonelli et al., 2021). Beyond even the current pandemic, the application of unsupervised learning analyses such as this one, in conjunction with clinical, epidemiological and behavioural understanding, is likely to yield important insights for other infectious diseases.
- Antonelli et al. (2021) M. Antonelli, R. S. Penfold, J. Merino, C. H. Sudre, E. Molteni, S. Berry, L. S. Canas, M. S. Graham, K. Klaser, M. Modat, B. Murray, E. Kerfoot, L. Chen, J. Deng, M. F. Österdahl, N. J. Cheetham, D. A. Drew, L. H. Nguyen, J. C. Pujol, C. Hu, S. Selvachandran, L. Polidori, A. May, J. Wolf, A. T. Chan, A. Hammers, E. L. Duncan, T. D. Spector, S. Ourselin, and C. J. Steves. Risk factors and disease profile of post-vaccination SARS-CoV-2 infection in UK users of the COVID symptom study app: a prospective, community-based, nested, case-control study. The Lancet Infectious Diseases, 2021.
- (2) D. Buitrago-Garcia, D. Egli-Gany, M. J. Counotte, S. Hossmann, H. Imeri, A. M. Ipekci, G. Salanti, and N. Low. Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis. 17(9):e1003346. ISSN 1549-1676. doi: 10.1371/journal.pmed.1003346. URL https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003346. Publisher: Public Library of Science.
- (3) E. C. Chan, Y. Sun, K. J. Aitchison, and S. Sivapalan. Mobile app–based self-report questionnaires for the assessment and monitoring of bipolar disorder: Systematic review. 5(1):e13770. doi: 10.2196/13770. URL https://formative.jmir.org/2021/1/e13770. Publisher: JMIR Publications Inc., Toronto, Canada.
- Clift et al. (2020) A. K. Clift, C. A. C. Coupland, R. H. Keogh, K. Diaz-Ordaz, E. Williamson, E. M. Harrison, A. Hayward, H. Hemingway, P. Horby, N. Mehta, J. Benger, K. Khunti, D. Spiegelhalter, A. Sheikh, J. Valabhji, R. A. Lyons, J. Robson, M. G. Semple, F. Kee, P. Johnson, S. Jebb, T. Williams, and J. Hippisley-Cox. Living risk prediction algorithm (QCOVID) for risk of hospital admission and mortality from coronavirus 19 in adults: national derivation and validation cohort study. BMJ, 371:m3731, 10 2020.
- Crozier et al. (2021) A. Crozier, J. Dunning, S. Rajan, M. G. Semple, and I. E. Buchan. Could expanding the COVID-19 case definition improve the UK’s pandemic response? BMJ, 374, 2021. doi: 10.1136/bmj.n1625. URL https://www.bmj.com/content/374/bmj.n1625.
- Deliu et al. (2017) M. Deliu, D. Belgrave, M. Sperrin, I. Buchan, and A. Custovic. Asthma phenotypes in childhood. Expert Review of Clinical Immunology, 13(7):705–713, 2017.
- Drew et al. (2020) D. A. Drew, L. H. Nguyen, C. J. Steves, C. Menni, M. Freydin, T. Varsavsky, C. H. Sudre, M. J. Cardoso, S. Ourselin, J. Wolf, T. D. Spector, A. T. Chan, and COPE Consortium. Rapid implementation of mobile technology for real-time epidemiology of COVID-19. Science, 368(6497):1362–1367, 2020. ISSN 0036-8075. doi: 10.1126/science.abc0473. URL https://science.sciencemag.org/content/368/6497/1362.
- Elliott et al. (2021) J. Elliott, M. Whitaker, B. Bodinier, S. Riley, H. Ward, G. Cooke, A. Darzi, M. Chadeau-Hyam, and P. Elliott. Symptom reporting in over 1 million people: Community detection of COVID-19. medRxiv, 2021. doi: 10.1101/2021.02.10.21251480. URL https://www.medrxiv.org/content/early/2021/02/12/2021.02.10.21251480.
- Fragaszy et al. (2021) E. Fragaszy, M. Shrotri, C. Geismar, A. Aryee, S. Beale, I. Braithwaite, T. Byrne, W. L. E. Fong, J. Gibbs, P. Hardelid, J. Kovar, V. Lampos, E. Nastouli, A. M. D. Navaratnam, V. Nguyen, P. Patel, R. W. Aldridge, A. Hayward, and on behalf of Virus Watch Collaborative. Symptom profiles and accuracy of clinical definitions for COVID-19 in the community. results of the virus watch community cohort. medRxiv, 2021. doi: 10.1101/2021.05.14.21257229. URL https://www.medrxiv.org/content/early/2021/06/11/2021.05.14.21257229.
- Fyles et al. (2021) M. Fyles, E. Fearon, C. Overton, University of Manchester COVID-19 Modelling Group, T. Wingfield, G. F. Medley, I. Hall, L. Pellis, and T. House. Using a household-structured branching process to analyse contact tracing in the SARS-CoV-2 pandemic. Philosophical Transactions of the Royal Society B: Biological Sciences, 376(1829):20200267, 2021.
- Geifman et al. (2018) N. Geifman, R. E. Kennedy, L. S. Schneider, I. Buchan, and R. D. Brinton. Data-driven identification of endophenotypes of Alzheimer’s disease progression: implications for clinical trials and therapeutic interventions. Alzheimer’s Research & Therapy, 10:4, 2018.
- Google LLC (2021) Google LLC. Community Mobility Reports, 2021. URL https://www.google.com/covid19/mobility/.
- Hale et al. (2021) T. Hale, N. Angrist, R. Goldszmidt, B. Kira, A. Petherick, T. Phillips, S. Webster, E. Cameron-Blake, L. Hallas, S. Majumdar, and H. Tatlow. A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Human Behaviour, 5(4):529–538, 2021.
- Hastie et al. (2009) T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2nd edition, 2009.
- Hofmann and Zeuzem (2011) W. P. Hofmann and S. Zeuzem. A new standard of care for the treatment of chronic HCV infection. Nature Reviews Gastroenterology & Hepatology, 8(5):257–264, 2011.
- Landgraf and Lee (2020) A. J. Landgraf and Y. Lee. Dimensionality reduction for binary data through the projection of natural parameters. Journal of Multivariate Analysis, 180:104668, 2020.
- (17) D. Lyu, Z. Wu, Y. Wang, Q. Huang, Z. Wu, T. Cao, J. Zhao, Y. Cao, Y. Hu, J. Chen, Y. Wang, Y. Su, C. Zhang, D. Peng, Z. Li, L. Cao, W. Hong, and Y. Fang. Disagreement and factors between symptom on self-report and clinician rating of major depressive disorder: A report of a national survey in china. 253:141–146. ISSN 0165-0327. doi: 10.1016/j.jad.2019.04.073. URL https://www.sciencedirect.com/science/article/pii/S0165032718331823.
- (18) L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction – umap 0.5 documentation. URL https://umap-learn.readthedocs.io/en/latest/index.html.
- McInnes and Healy (2018) L. McInnes and J. Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, 2018. [arXiv:1802.03426].
- (20) J. E. Millar, L. Neyton, S. Seth, J. Dunning, L. Merson, S. Murthy, C. D. Russell, S. Keating, M. Swets, C. H. Sudre, T. D. Spector, S. Ourselin, C. J. Steves, J. Wolf, A. B. Docherty, E. M. Harrison, P. J. Openshaw, M. G. Semple, and J. K. Baillie. Robust, reproducible clinical patterns in hospitalised patients with COVID-19. page 2020.08.14.20168088. doi: 10.1101/2020.08.14.20168088. URL http://medrxiv.org/content/early/2020/09/10/2020.08.14.20168088.abstract.
- NHS (2021) NHS. Get tested for coronavirus (COVID-19), 2021. URL https://www.nhs.uk/conditions/coronavirus-covid-19/testing/get-tested-for-coronavirus/.
- Pouwels et al. (2021) K. B. Pouwels, T. House, E. Pritchard, J. V. Robotham, P. J. Birrell, A. Gelman, K. D. Vihta, N. Bowers, I. Boreham, H. Thomas, J. Lewis, I. Bell, J. I. Bell, J. N. Newton, J. Farrar, I. Diamond, P. Benton, A. S. Walker, and COVID-19 Infection Survey Team. Community prevalence of SARS-CoV-2 in England from April to November, 2020: results from the ONS Coronavirus Infection Survey. Lancet Public Health, 6(1):e30–e38, Jan 2021. doi: 10.1016/S2468-2667(20)30282-6.
- (23) M. J. Silverstein, S. V. Faraone, S. Alperin, J. Biederman, T. J. Spencer, and L. A. Adler. How informative are self-reported adult attention-deficit/hyperactivity disorder symptoms? an examination of the agreement between the adult attention-deficit/hyperactivity disorder self-report scale v1.1 and adult attention-deficit/hyperactivity disorder investigator symptom rating scale. 28(5):339–349. ISSN 1044-5463. doi: 10.1089/cap.2017.0082. URL http://www.liebertpub.com/doi/10.1089/cap.2017.0082. Publisher: Mary Ann Liebert, Inc., publishers.
- Smith et al. (2021) L. E. Smith, H. W. W. Potts, R. Amlôt, N. T. Fear, S. Michie, and G. J. Rubin. Adherence to the test, trace, and isolate system in the UK: results from 37 nationally representative surveys. BMJ, 372, 2021. doi: 10.1136/bmj.n608. URL https://www.bmj.com/content/372/bmj.n608.
- (25) T. Struyf, J. J. Deeks, J. Dinnes, Y. Takwoingi, C. Davenport, M. M. Leeflang, R. Spijker, L. Hooft, D. Emperador, S. Dittrich, J. Domen, S. R. A. Horn, A. Van den Bruel, and Cochrane COVID-19 Diagnostic Test Accuracy Group. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID‐19 disease. (7). ISSN 1465-1858. doi: 10.1002/14651858.CD013665. URL https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD013665/full. Publisher: John Wiley & Sons, Ltd.
- Sudre et al. (2021) C. H. Sudre, K. A. Lee, M. N. Lochlainn, T. Varsavsky, B. Murray, M. S. Graham, C. Menni, M. Modat, R. C. E. Bowyer, L. H. Nguyen, D. A. Drew, A. D. Joshi, W. Ma, C.-G. Guo, C.-H. Lo, S. Ganesh, A. Buwe, J. C. Pujol, J. L. du Cadet, A. Visconti, M. B. Freidin, J. S. El-Sayed Moustafa, M. Falchi, R. Davies, M. F. Gomez, T. Fall, M. J. Cardoso, J. Wolf, P. W. Franks, A. T. Chan, T. D. Spector, C. J. Steves, and S. Ourselin. Symptom clusters in COVID-19: A potential clinical prediction tool from the COVID Symptom Study app. Science Advances, 7(12), 2021. doi: 10.1126/sciadv.abd4177. URL https://advances.sciencemag.org/content/7/12/eabd4177.
- Swann et al. (2020) O. V. Swann, K. A. Holden, L. Turtle, L. Pollock, C. J. Fairfield, T. M. Drake, S. Seth, C. Egan, H. E. Hardwick, S. Halpin, M. Girvan, C. Donohue, M. Pritchard, L. B. Patel, S. Ladhani, L. Sigfrid, I. P. Sinha, P. L. Olliaro, J. S. Nguyen-Van-Tam, P. W. Horby, L. Merson, G. Carson, J. Dunning, P. J. M. Openshaw, J. K. Baillie, E. M. Harrison, A. B. Docherty, and M. G. Semple. Clinical characteristics of children and young people admitted to hospital with COVID-19 in United Kingdom: prospective multicentre observational cohort study. BMJ, 370:m3249, 2020.
- (28) D. Tomlinson, E. Plenert, G. Dadzie, R. Loves, S. Cook, T. Schechter, J. Furtado, L. L. Dupuis, and L. Sung. Discordance between pediatric self-report and parent proxy-report symptom scores and creation of a dyad symptom screening tool (co-SSPedi). 9(15):5526–5534. ISSN 2045-7634. doi: 10.1002/cam4.3235. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cam4.3235.
- Williamson et al. (2020) E. J. Williamson, A. J. Walker, K. Bhaskaran, S. Bacon, C. Bates, C. E. Morton, H. J. Curtis, A. Mehrkar, D. Evans, P. Inglesby, J. Cockburn, H. I. McDonald, B. MacKenna, L. Tomlinson, I. J. Douglas, C. T. Rentsch, R. Mathur, A. Y. S. Wong, R. Grieve, D. Harrison, H. Forbes, A. Schultze, R. Croker, J. Parry, F. Hester, S. Harper, R. Perera, S. J. W. Evans, L. Smeeth, and B. Goldacre. Factors associated with COVID-19-related death using OpenSAFELY. Nature, 584(7821):430–436, 08 2020.
- (30) O. W. A. Wilson, C. M. Bopp, Z. Papalia, and M. Bopp. Objective vs self-report assessment of height, weight and body mass index: Relationships with adiposity, aerobic fitness and physical activity. 9(5):e12331. ISSN 1758-8111. doi: 10.1111/cob.12331. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/cob.12331.
- World Health Organization (2021) World Health Organization. Coronavirus disease (COVID-19) pandemic, 2021. URL https://www.who.int/emergencies/diseases/novel-coronavirus-2019. 21 June 2021.
Funding:
CSS funding: ZOE provided in-kind support for all aspects of building, running, and supporting the ZOE app and service to all users worldwide. Support for this study was provided by the National Institute for Health Research (NIHR)-funded Biomedical Research Centre based at Guy’s and St Thomas’ (GSTT) NHS Foundation Trust. This work was supported by the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value-Based Healthcare (104691). Investigators also received support from the Wellcome Trust (WT203148/Z/16/Z, WT213038/Z/18/Z, and W212904/Z/18/Z), Medical Research Council (MRC; MR/V005030/1 and MR/M004422/1), British Heart Foundation, Alzheimer’s Society, EU, NIHR, COVID-19 Driver Relief Fund, Innovate UK, the NIHR-funded BioResource, and the Clinical Research Facility and Biomedical Research Centre based at GSTT NHS Foundation Trust, in partnership with King’s College London. This work was also supported by the National Core Studies, an initiative funded by UK Research and Innovation, NIHR, and the Health and Safety Executive. The COVID-19 Longitudinal Health and Wellbeing National Core Study was funded by the MRC (MC_PC_20030).
CIS funding: The ONS CIS is funded by the Department of Health and Social Care with in-kind support from the Welsh Government, the Department of Health on behalf of the Northern Ireland Government and the Scottish Government.
Individual funding: MF is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. K-DV is supported by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at the University of Oxford in partnership with Public Health England (PHE) (NIHR200915). RD and TH are supported by the Engineering and Physical Sciences Research Council (Award numbers 2373157 and EP/V027468/1). EF is supported by the Medical Research Council award MR/S020462/1; MF, EF, TW and TH are supported by the Medical Research Council award MR/V028618/1; TH is supported by the JUNIPER consortium (MR/V038613/1), the Royal Society (INF/R2/180067) and Alan Turing Institute for Data Science and Artificial Intelligence. SO was supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR-19-P3IA-0002). CHS was supported by the Alzheimer’s Society Junior Fellowship (AS-JF-17–011). TW is supported by grants from the Wellcome Trust, the Medical Research Council, and the Foreign, Commonwealth and Development Office Joint Global Health Trials (MR/V004832/1 and 209075/Z/17/Z), and the Swedish Research Council.
Author contributions: All authors contributed to collection and processing of data, choice and implementation of analysis methods, and writing of the paper.
Competing interests: None declared.
Data and materials availability: Datasets are too sensitive for public release, and can be accessed by researchers through secure research environments. SAIL acknowledgement: This study makes use of anonymised data held in the Secure Anonymised Information Linkage (SAIL) Databank. ONS acknowledgement: This work contains statistical data from ONS which is Crown Copyright. The use of the ONS statistical data in this work does not imply the endorsement of the ONS in relation to the interpretation or analysis of the statistical data. This work uses research datasets which may not exactly reproduce National Statistics aggregates. We would like to acknowledge all the data providers who make anonymised data available for research. Code and datasets that have been approved for publication are available at: https://github.com/martyn1fyles/CovidSymptomsAnalysisPublic
Appendix A Materials and Methods
a.1 Population and setting
This is a secondary data analysis of SARS-CoV-2 PCR-positive cases from the general community in the United Kingdom between May 2020 and March 2021. We extract data about positive symptomatic cases from four diverse datasets: two datasets from the NHS Test and Trace routine testing and contact tracing programme; the Office for National Statistics COVID-19 Infection Survey (CIS), a population-representative survey of randomly selected households; and the COVID Symptom Study (CSS), a participatory surveillance mobile app project.
a.1.1 NHS Test and Trace routine testing data
NHS Test and Trace data is further split into two parts: Pillar 2, cases detected in the community, usually on the basis of symptoms to initiate testing; and the Second Generation Surveillance System (SGSS), for people tested in healthcare settings. In May 2020, the UK government made PCR testing available for individuals who had one of the following symptoms: a new, continuous cough; fever; loss of taste or loss of smell. These tests are reported through Pillar 2, through which several different avenues to testing are available. Individuals can book a test appointment through a government website for either a drive-in or walk-through testing centre, where they self-swab their nose and throat (under some supervision, with an adult carer conducting the swabbing for children), with the swab then sent to a lab for PCR testing. Alternatively, individuals can order home test kits where they self-swab at home and post the kit back, with the swab again sent to a lab for PCR testing. If the individual tests positive, their case is transferred to NHS Test and Trace who contact cases to inform them of their result and ask them to conduct a questionnaire including symptoms experienced. The questionnaire is conducted either via a web-form or over the phone with a trained contact tracer. Since the end of 2020, Pillar 2 has also included positive cases identified using rapid antigen tests among people not experiencing one of the PCR test prompting symptoms. These tests also use a nasopharyngeal swab and are conducted at the home, workplace or school and if positive are requested to be followed up by a confirmatory PCR test (though policy has varied over time). Reported positive cases from asymptomatic testing are also followed up by NHS Test and Trace.
The Second Generation Surveillance System (SGSS) dataset includes people who test because they work in a healthcare setting or have been tested in one as a patient. This latter group includes both those in hospital because of severe COVID-19 symptoms and those in hospital for other reasons who receive SARS-CoV-2 testing. Thus cases in the SGSS data are likely, on average, to be more severe than those in the Pillar 2 data, though not exclusively so. Again, individuals are swabbed and PCR tested and, if they test positive, their case is transferred to NHS Test and Trace for symptom reporting and contact tracing.
a.1.2 COVID Symptom Study (CSS)
The CSS is a participatory surveillance study collecting data via a smartphone app. It is led by King’s College London and Zoe Global Ltd and was initiated in March 2020 in the United Kingdom and the United States (Drew et al., 2020). Individuals are asked to report daily whether they are feeling ’physically normal’ that day and, if not, what symptoms they are experiencing. Demographic data and data about underlying conditions are collected at first registration. Participants are also asked to self-report whether they have had any tests for SARS-CoV-2 infection and, if so, the date of the test and its result. Participants can also proxy-report for children or for others they care for (e.g. elderly adults).
As well as enabling individuals to self-report COVID-19 testing that they have undertaken via the UK’s routine testing programmes or surveillance studies, the CSS invites individuals to complete a PCR test via routine testing if they 1) have made at least one report of no symptoms in the previous week and 2) report a new symptom not on the list to prompt symptomatic testing (e.g. sore throat). This means that we might expect the CSS reporting to be less dominated by the symptoms required to initiate symptom-based testing than the Pillar 2 routine testing dataset.
a.1.3 ONS COVID-19 Infection Survey (CIS)
The CIS is a UK population-representative survey of households randomly selected continuously since April 2020 from address lists and previous surveys (Pouwels et al., 2021). Households are followed longitudinally with weekly visits for the first month and monthly visits for 12 months from enrolment. A fieldworker attends enrolled households each visit for testing for household members aged 2 years and above and to conduct an interview including, among other topics, demographic data (reported at first visit) and symptoms experienced over the previous 7 days. At each visit, participants conduct a nose and throat swab under supervision of a fieldworker. These swabs are sent for PCR testing and the result communicated to participants. At the same time as swabbing, all participants are also interviewed by the fieldworker to complete a symptom questionnaire.
a.2 Data extraction and preparation
a.2.1 Sample populations
The dates over which cases are collected from each study are shown in Table S1 and in total cover the period from April 2020 to March 2021, with the largest overlap between November 2020 and January 2021. We make no exclusions based on age or other characteristics.
From each dataset, we include only cases who report at least one symptom within the symptom reporting window around a positive test (detailed below and listed in Table S2). For NHS Test and Trace data, positive cases who are never reached and interviewed post-testing are not included in this dataset. The definition of ’symptomatic’ necessarily varies across the datasets because there are differences in the full list of symptoms asked about. Symptoms that were not core dataset variables and were instead recorded by manual entry were not included. For each dataset, we chose to include all dataset symptoms from each study (except for ’write-in’ symptoms), rather than excluding symptoms that were not common across all. This was with the intention of maximising the amount of symptom information available for analysis. We also extracted demographic information.
It is expected that symptom data from the same PCR-positive cases is captured across the NHS Test and Trace, CIS and CSS datasets. Explicit deduplication of individuals across datasets was not performed but is expected to have no impact on the findings.
The proportion of symptomatic cases varies substantially between datasets, reflecting their different sampling. NHS Test and Trace Pillar 2 and CSS have the highest proportions of symptomatic cases (86.3% and 84.5%, respectively), which is not surprising given that both datasets mainly focus on symptom-initiated testing. The NHS Test and Trace SGSS dataset has the next highest proportion of symptomatic cases (62.9%). We expect to see some asymptomatic screening in SGSS populations, which may explain the lower proportion of symptomatic cases in SGSS compared with Pillar 2. In the CIS, we see a much smaller proportion of symptomatic cases (32.8%), likely because the sampling strategy is independent of symptoms, so that asymptomatic and pre-symptomatic individuals test positive and are included in the study.
|Data collection variables||Pillar 2||SGSS||CSS||CIS|
|Location of participants||England||England||UK||UK|
|Male||736,906 (45.0%)||42,355 (40.3%)||23,540 (38.2%)||4,142 (45.2%)|
|Female||875,545 (55.0%)||62,808 (59.7%)||38,051 (61.8%)||5,024 (54.8%)|
|Total dataset size||1,898,273||179,550||61,623||27,903|
|Symptomatic cases||1,637,965 (86.3%)||112,925 (62.9%)||52,084 (84.5%)||9,166 (32.8%)|
|Asymptomatic cases||260,308 (13.7%)||66,625 (37.1%)||9,539 (15.5%)||18,737 (67.2%)|
The Pillar 2 routine testing contributed by far the largest number of cases to the study, with CIS the fewest. While all datasets contained a slight female majority, with CSS the largest (61.8%), there was some variability in the age distribution of cases (Figure S1); Pillar 2 routine testing was the youngest, while SGSS included the oldest groups. This is likely to be because SGSS more heavily represents a hospitalised population. CSS and CIS are UK-wide, while NHS Test and Trace data contains cases testing in England.
The characteristics of the infected sub-population relative to the general UK population have likely changed over time for a multitude of reasons: different levels of restrictions and lockdown across different localities, vaccination coverage and uptake, varying prevalence, weather, levels of outdoor mixing, incentives to ignore social distancing, workplace/school closures and changing availability of testing. Moreover, each study and route of data collection results in a different sample of the infected population. The CIS is a population-based household sample and thus should be broadly representative (participation biases aside). Cases discovered through routine testing (NHS Test & Trace) may over-represent those adherent to testing guidance, those prone to more severe infections, and the sub-populations with the highest prevalence and test-seeking behaviour. For the CSS’s app-based reporting, sub-populations with high levels of smartphone ownership and compliance are likely to be over-represented.
a.2.2 Symptom data
Data is collected at the level of the symptoms experienced by an individual, and for the majority of datasets we have a binary outcome of whether an individual experienced a symptom or not. Exact symptom questions and lists are given in Table S2. In CSS an individual is able to choose from several levels of fatigue: “none”, “mild” or “severe”. Our planned analyses are designed to work with binary data, and as a result we map multiple levels into a binary outcome variable. When performing this mapping, we choose which levels to merge with the aim of making the symptoms as comparable as possible to what is reported in other datasets. Datasets with a binary fatigue variable report fatigue in 40-60% of cases, which is consistent with most cases only reporting severe fatigue; if we included mild fatigue, close to 80% of CSS cases would report fatigue, which is inconsistent with what is reported in the other datasets.
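The mapping from a multi-level response to a binary variable can be sketched as follows. This is a minimal illustration and not the study code: the function name, the `count_mild` switch and the example responses are hypothetical, and the default (only “severe” counted as fatigue) is an assumption based on the frequency comparison described above.

```python
from typing import Optional

def binarise_fatigue(level: Optional[str], count_mild: bool = False) -> Optional[int]:
    """Map a multi-level fatigue response to 0/1, keeping missing values missing."""
    if level is None:
        return None
    # Assumption: by default only "severe" is treated as presence of fatigue.
    positive = {"severe", "mild"} if count_mild else {"severe"}
    return int(level in positive)

# Hypothetical responses from four app users (None = no response recorded).
responses = ["none", "mild", "severe", None]
strict = [binarise_fatigue(r) for r in responses]
lenient = [binarise_fatigue(r, count_mild=True) for r in responses]
```

Comparing the reported frequency under the strict and lenient mappings against the datasets with a natively binary fatigue variable is one way to check which mapping is more comparable.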
a.2.3 Symptom reporting windows
The symptom reporting window and its timing relative to the positive test varies across the datasets. For CIS, participants are asked about symptoms in the previous 7 days prior to testing. For cases contacted and interviewed by NHS Test and Trace (Pillar 2 and SGSS), individuals are asked to report symptoms that they are currently experiencing. For CSS, individuals are prompted to report symptoms daily but for this dataset we include all symptoms reported in the 14 days before and 14 days after the date a positive test is reported (note this does not mean that all participants report symptoms with that level of frequency).
From the time of infection, individuals usually have a few days before they become symptomatic, while test sensitivity also varies over the course of infection, peaking around the time or just before symptom onset. Previous studies have also found patterns in the types of symptoms that present earlier versus later in the course of an infection (Sudre et al., 2021). Across each of the included datasets, the time in an individual’s infection at which they are tested on average and over which they are asked to report symptoms varies. For CIS time of testing over the course of an infection should be random over the period at which someone will test PCR-positive; for data primarily from symptomatic testing it should be a few days post symptom onset (reflecting a delay between onset and testing, test result and follow-up interview with Test and Trace). For CSS, the time of testing for many will reflect symptomatic testing in the community and some proportion of individuals with particular symptom reporting patterns are asked to obtain a test through NHS Test and Trace symptomatic testing routes.
|Symptom||Test and Trace (Pillar 2, SGSS)||COVID-19 Infection Survey||COVID Symptom Study|
|Question asked||Are you experiencing any of the following symptoms? Please select at least one. (cases may select “I have no symptoms”)||Have you had any of the following symptoms in the last 7 days?||Are you feeling physically normal today? (I feel physically normal; I do not feel physically normal)|
|Abdominal pain||-||Abdominal pain||Do you have an unusual abdominal pain?|
|Altered consciousness||Altered consciousness||-||-|
|Altered/loss of smell||-||-||Do you have a loss of smell/taste?|
|Chest pain||-||-||Are you feeling an unusual chest pain or tightness in your chest?|
|Cough||A new, continuous cough||Cough||Do you have a persistent cough?|
|Delirium||-||-||Do you have any of the following symptoms: confusion, disorientation, or drowsiness?|
|Diarrhoea||Diarrhoea||Diarrhoea||Are you experiencing diarrhoea?|
|Fatigue||Extreme tiredness||Weakness/tiredness||Are you experiencing unusual fatigue? (mild; severe)*|
|Fever||High temperature or fever (higher than 38C)||Fever||Do you have a fever?|
|Headache||Headache||Headache||Do you have a headache?|
|Hoarse voice||-||-||Do you have an unusually hoarse voice?|
|Joint pain||Joint pain||-||-|
|Loss of appetite||Loss of appetite||-||Have you been skipping meals?|
|Loss of smell||-||Loss of smell||-|
|Loss of smell or taste||Loss or change to your sense of smell or taste (you cannot smell or taste anything, or things smell or taste different to normal)||-||-|
|Loss of taste||-||Loss of taste||-|
|Muscle ache||Muscle ache||Muscle ache||Do you have unusual strong muscle pains?|
|Nausea||Feeling sick (nausea)||-||-|
|Nose bleed||Nose bleed||-||-|
|Shortness of breath||-||Shortness of breath||Are you experiencing unusual shortness of breath? (no; yes mild symptoms/ slight shortness of breath during ordinary activity; yes significant symptoms -breathing is comfortable only at rest; yes, severe symptoms/ breathing is difficult even at rest)**|
|Sore throat||Sore throat||Sore throat||Do you have a sore throat?|
a.2.4 Symptom classification
To aid interpretation we classify symptoms according to their clinical characteristics. These classifications were made a priori in consultation with an infectious diseases clinician (TW) with experience of caring for people with COVID-19 and without input from observed clustering patterns. We included systemic symptoms, lower respiratory, upper respiratory, gastrointestinal, altered state symptoms and ’other’ symptoms that did not fit into any of these categories.
a.2.5 Ethical approval
The secondary analyses described in this paper received ethical approval from the London School of Hygiene and Tropical Medicine (22752). The COVID Symptom Study was approved by the Partners Human Research Committee (Protocol 2020P000909) and King’s College London ethics committee (REMAS ID 18,210, LRS-19/20–18,210) and the CIS received ethical approval from the South Central Berkshire B Research Ethics Committee (20/SC/0195).
We describe the frequency with which each symptom was reported in each dataset, categorising them using our symptom classification. We then perform three unsupervised learning techniques, each with a different but complementary aim. Our goal is to understand patterns of symptom co-occurrence and if there is any evidence of symptom clustering, as multiple distinct clusters would be evidence for the existence of distinct COVID-19 symptom phenotypes.
We use a variety of methods to understand the behaviour of symptoms, and the analyses are sometimes performed on the Jaccard distance matrix of symptoms. The Jaccard distance between symptoms $i$ and $j$ is defined as
$$d_J(x_i, x_j) = 1 - \frac{|x_i \cap x_j|}{|x_i \cup x_j|},$$
where $x_i$ is the binary feature vector constructed from the presence or absence of symptom $i$ in cases, $|x_i \cap x_j|$ is the number of cases who experienced both symptoms and $|x_i \cup x_j|$ is the number of cases who experienced at least one of them. The simple interpretation of the Jaccard distance is then one minus the proportion of cases who experienced both symptoms $i$ and $j$, given that they experienced at least one of symptoms $i$ or $j$. In the case of missing data, the Jaccard distance is computed using only the subset of individuals for which there is no missing data for either of symptoms $i$ and $j$.
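As an illustration of the pairwise handling of missing data, the following sketch computes the Jaccard distance between two symptom vectors using only cases observed for both. It is a minimal example, not the study code; the function name and the toy vectors are hypothetical, and missing responses are assumed to be encoded as NaN.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance between two binary symptom vectors; NaN marks missing."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Keep only cases with a recorded response for both symptoms.
    observed = ~np.isnan(a) & ~np.isnan(b)
    a_present = a[observed] == 1
    b_present = b[observed] == 1
    union = np.sum(a_present | b_present)
    if union == 0:
        return np.nan  # undefined when no case reports either symptom
    return 1.0 - np.sum(a_present & b_present) / union

# Toy data for six cases; the last two each have one missing response.
cough = [1, 1, 0, 1, np.nan, 0]
fever = [1, 0, 0, 1, 1, np.nan]
d = jaccard_distance(cough, fever)  # 2 of 3 cases with either symptom have both
```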
Hierarchical clustering starts with a set of symptoms, and the feature vector for each is constructed from their presence or absence in individuals with a positive test and report of at least one symptom (i.e. those positive cases not excluded as asymptomatic). The Jaccard distance is used as an appropriate metric for such binary data. Clusters of symptoms are agglomeratively joined on the dendrogram on the basis of the maximum distance between cluster members (called ‘complete linkage’). Symptoms that are joined at a small distance on the final dendrogram tend to co-occur, and those joined only at a large distance are rarely both present. Clusters can also be identified by ‘cutting’ the dendrogram at a given distance.
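The procedure described above can be sketched with SciPy. This is a minimal illustration under stated assumptions rather than the study pipeline: the symptom names, the toy presence/absence matrix and the cut height of 0.5 are all hypothetical, and there is no missing data in the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

symptoms = ["cough", "sore throat", "fever", "fatigue"]
# Rows are symptoms, columns are (hypothetical) symptomatic cases.
X = np.array([
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 1],
], dtype=bool)

# Condensed matrix of pairwise Jaccard distances between symptoms.
condensed = pdist(X, metric="jaccard")
# Agglomerative clustering with complete linkage (maximum inter-member distance).
Z = linkage(condensed, method="complete")
# "Cut" the dendrogram at distance 0.5 to obtain flat cluster labels.
labels = fcluster(Z, t=0.5, criterion="distance")
clusters = dict(zip(symptoms, labels))
```

In this toy example the two respiratory symptoms and the two systemic symptoms never co-occur across groups, so the cut yields two clusters; `scipy.cluster.hierarchy.dendrogram(Z, labels=symptoms)` would draw the corresponding tree.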
Logistic principal component analysis (LPCA) is an extension of principal component analysis (PCA) to binary data, and reduces the dimension of the symptom space in a manner that preserves the maximum level of variance between individuals (rather than symptoms) (Landgraf and Lee, 2020). The projection values of symptoms onto the lower dimensional basis are called loadings, and these demonstrate the directions in which individual phenotypes most commonly vary. In practice, the first component is likely to have relatively even contributions from each of the symptoms, and will represent an overall severity of illness at the individual level, with subsequent components demonstrating more subtle ways in which symptoms can vary.
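Given an $n \times d$ binary data matrix $X$, our aim is to find a low dimensional representation of the natural parameter matrix $\Theta$, where $\theta_{ij} = \operatorname{logit}(p_{ij})$ and $p_{ij} = P(x_{ij} = 1)$. This is achieved by finding $\hat{\Theta}$, a rank $k$ approximation to $\Theta$ such that the Bernoulli deviance,
$$D(X; \hat{\Theta}) = -2 \sum_{i=1}^{n} \sum_{j=1}^{d} \left[ x_{ij} \hat{\theta}_{ij} - \log\left(1 + \exp(\hat{\theta}_{ij})\right) \right],$$
is minimised. This is conceptually related to logistic regression models, as these also attempt to minimise the Bernoulli deviance. In practice, the minimisation is solved over $\mu \in \mathbb{R}^{d}$ and $U \in \mathbb{R}^{d \times k}$, such that $\hat{\Theta} = \mathbf{1}_n \mu^{T} + (\tilde{\Theta} - \mathbf{1}_n \mu^{T}) U U^{T}$, with $U^{T} U = I_k$, where $\tilde{\Theta}$ denotes the natural parameter matrix of the saturated model. The column vectors of $U$ are the loadings of the principal components.
As with all dimensionality reduction techniques, we need to choose the number of dimensions in our low-dimensional approximation. We follow the recommendation of Landgraf and Lee (Landgraf and Lee, 2020), and examine the change in the Bernoulli deviance as we increase $k$. Consider a rank 0 approximation, $\hat{\Theta}_0$, where $\hat{\theta}_{ij} = \mu_j$ for $i = 1, \dots, n$. That is to say that the natural parameter matrix contains a constant value in every column. This is treated as the null model, against which all other models are compared.
For a model with $k$ components, with fitted natural parameter matrix $\hat{\Theta}_k$, the proportion of Bernoulli deviance explained relative to the null model is given by
$$R_k = 1 - \frac{D(X; \hat{\Theta}_k)}{D(X; \hat{\Theta}_0)}.$$
If $k = d$, then $\hat{\Theta}_d = \tilde{\Theta}$, as the model is saturated and $D(X; \hat{\Theta}_d) = 0$, resulting in $R_d = 1$. This means that $R_k$ can be interpreted similarly to standard PCA, in the sense that a proportion $R_k$ of the variance is explained by the first $k$ components. The marginal Bernoulli deviance, $\Delta_k$, is defined as the change in the Bernoulli deviance explained by adding the $k$-th component, for $k \geq 1$, defined as
$$\Delta_k = R_k - R_{k-1}.$$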
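The deviance-based quantities used for model selection can be sketched numerically. This is an illustrative sketch under stated assumptions, not the fitting code used in the study: it takes a matrix of natural parameters as given (it does not fit logistic PCA), the function names and toy data are hypothetical, the null model uses the logit of each column mean (the constant that minimises the Bernoulli deviance in each column), and the scale of 20 used to approximate the saturated model is an arbitrary illustrative value.

```python
import numpy as np

def bernoulli_deviance(X, theta):
    """-2 times the Bernoulli log-likelihood of binary X under natural parameters theta."""
    # log(1 + exp(theta)) is evaluated stably as logaddexp(0, theta).
    return -2.0 * np.sum(X * theta - np.logaddexp(0.0, theta))

def null_theta(X):
    """Null (rank 0) model: every column is constant at the logit of its mean."""
    p = X.mean(axis=0)  # requires each column mean to lie strictly in (0, 1)
    return np.tile(np.log(p / (1.0 - p)), (X.shape[0], 1))

def proportion_deviance_explained(X, theta):
    """Proportion of the null-model Bernoulli deviance explained by theta."""
    return 1.0 - bernoulli_deviance(X, theta) / bernoulli_deviance(X, null_theta(X))

# Toy binary data: four cases (rows) by two symptoms (columns).
X = np.array([[1, 0], [1, 1], [0, 0], [1, 0]], dtype=float)

r_null = proportion_deviance_explained(X, null_theta(X))
# A near-saturated model: large positive parameters where x = 1, large negative where x = 0.
r_saturated = proportion_deviance_explained(X, 20.0 * (2.0 * X - 1.0))
```

Given the proportions explained for a sequence of fitted models, the marginal contribution of each added component is the difference between successive values, for example `np.diff(np.array([0.0, r_1, r_2]))` for hypothetical fitted values `r_1` and `r_2`.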
When selecting the number of components in our low-dimensional representation of the data, we primarily focus upon the marginal Bernoulli deviance, and aim to find the largest $k$ such that, for $k' \leq k$, the marginal Bernoulli deviance decreases rapidly. We also examine the proportion of Bernoulli deviance explained: if this gets close to 1, this suggests that we have selected too many components and are over-fitting.
In practice, there are in fact two hyperparameters that need to be chosen for logistic PCA: the number of components $k$, and $m$, which controls the magnitude of the loadings. The optimal choice of $m$ varies depending upon $k$, and is selected by leave-one-out cross validation for a range of proposed values.
An example of the model selection is plotted in Figure S2. In the plotted example, we would choose $k = 2$, indicated by the vertical dashed line. This is because the first two components have a significantly higher marginal Bernoulli deviance than all models with $k > 2$ components. The marginal Bernoulli deviances for models where $k > 2$ have small differences between successive values of $k$, making it hard to favour one model over the other. For larger values of $k$, the marginal Bernoulli deviance does decrease rapidly; however, at this point we have explained close to 100% of the Bernoulli deviance and are over-fitting the model. Hence, for this example we choose $k = 2$.
If we select $k = 2$ components, then approximately 33% of the Bernoulli deviance of the saturated model is explained. In classic PCA, ideally the model would find a number of components that explains as close to 100% of the variance as possible while not over-fitting to noise. In logistic principal component analysis, however, if a model explains close to 100% of the Bernoulli deviance relative to the null model, then this is indicative of dramatic over-fitting. For example, the saturated model where
$$\tilde{\theta}_{ij} = \begin{cases} +\infty & \text{if } x_{ij} = 1, \\ -\infty & \text{if } x_{ij} = 0, \end{cases}$$
will exactly reproduce the input data and explains 100% of the Bernoulli deviance. The true natural parameter matrix will not explain 100% of the Bernoulli deviance: it describes the probability of a symptom occurring, which leads to a non-zero Bernoulli deviance. As such, our goal is not to explain 100% of the Bernoulli deviance relative to the null model.
Model selection plots for the number of components can be found in the code repository for this paper (https://github.com/martyn1fyles/CovidSymptomsAnalysisPublic). For some plots, the model selection process would suggest that the optimal number of components is . During the model selection process, we find that is rejected. Throughout this paper we have opted to present LPCA results where we take , given that several datasets show clear and strong signals that we reason that is it likely that the true number of components is across all datasets. However, we acknowledge uncertainty in the model selection, and make available in the code repository LPCA plots where we have taken . Unlike in traditional PCA, LPCA components are dependent upon the total number of components selected, and as such PC1 for example differs depending upon the total number of components selected. This is why we must rerun the analysis when we take , and present these results separately to results produced where we set .
UMAP (Uniform Manifold Approximation and Projection) is a technique for dimensionality reduction of complex data, here based on pairwise distances between symptoms. In contrast to the other methods, it is designed to achieve good separation between unknown classes in the low dimensional space, and as such complements the other machine learning methods used above. We compute the Jaccard distance matrix for the observed symptoms, and configure UMAP to attempt to place the symptoms in a 2-dimensional Euclidean space such that symptoms with a smaller distance between them are considered to be more similar. To enable comparison across the datasets and across age strata, we present results using the AlignedUMAP algorithm, where we align the embeddings: 1) between different datasets, aligning core symptoms common across datasets; and 2) for each dataset, across ten-year age strata. As AlignedUMAP necessitates some trade-off between finding the optimal embedding and performing the alignment, we complement the analysis by also producing non-aligned embeddings of the datasets.
There are a large number of UMAP hyperparameters that can be adjusted, and finding the optimal combination is not a solved problem to our knowledge. We have opted to produce two UMAP outputs for each dataset, one configured to produce a tight clustering of symptoms, and another configured to produce a loose clustering of symptoms. This is achieved by changing the number of neighbouring points that UMAP considers when it is placing points, similar to the $k$-nearest neighbours algorithm. As a result, smaller values of the n_neighbors parameter will configure UMAP to focus on local structures, and it may not capture the global structure - this produces what we refer to as a tight clustering, with well separated clusters. Setting the n_neighbors parameter to higher values will configure UMAP to focus less on the local structures of the symptoms, but produce a more general clustering of the data - this produces what we refer to as a loose clustering. Loose and tight UMAP embeddings capture different parts of the symptom topology, and produce complementary analyses.
When we produce the loose UMAP embeddings that focus more upon the global structure of the data, we take n_neighbors = 4, and for the tight UMAP embeddings that focus more upon the local structure of the data, we take n_neighbors = 2. When using AlignedUMAP for the fine age strata, we align each slice of data with the two slices before and the two slices after the current slice. Aligning with fewer slices produced plots with less smoothing, and aligning with more slices obscured signals in the data (e.g. under-10s might be aligned with those much older than they are, who might have significantly different symptom occurrence patterns). For the broad age strata, all three strata were aligned. When performing alignment, there is a regularisation parameter that controls the ’strength’ of the alignment. To select the value of the regularisation parameter, we started with values that produced little to no alignment, and increased the strength of the alignment until visual inspection suggested that datasets were aligned, without being over-aligned. We also explored varying the min_dist parameter, which controls the minimum distance between points in the produced embeddings - an assumption of the algorithm is that two points cannot have 0 distance between them. Overall, we find that this parameter does not change the structure of embeddings, and largely produces visual changes that are useful when producing plots without significant overlap of points.
We first run hierarchical clustering, LPCA and AlignedUMAP for all the included cases from each dataset, and then repeat them stratified by age group (0-17 years, 18-54 years and 55 years and older). For only three age strata, it is not necessary to plot the results in 3D space as was required for the results from the finer age strata shown in Fig. 4. This produces a complementary analysis to the results from the main paper.
a.3.1 Pre-hoc considerations for comparison across datasets
Because of the different sampling of positive cases and resulting sample composition, data collection methods, and symptom questions across the datasets, we expect potential differences in findings arising from:
Sampling: The majority of routinely detected community cases in the UK were detected via symptom-prompted tests, particularly prior to the widescale availability of rapid antigen testing for asymptomatic individuals in the Spring of 2021. Thus we expect Pillar 2 to over-represent individuals with at least one of cough, fever and loss of taste and/or smell. This bias is also likely to exist within the CSS as a majority of self-reported tests would also have been performed because they met the symptom criteria for routine community testing, though the study also invited a proportion of regular app-user participants to test based on reporting other symptoms. These biases are not present within the ONS study sample.
Data collection method: Across all datasets, symptoms are assessed via self-report, including fever. The experience of symptoms and their description is likely to vary across individuals and across demographic characteristics, such as by gender, ethnicity, region, and age. People are likely to report symptoms differently depending on whether they are doing so via an in-person interview, a weekly or bi-weekly survey or a daily symptom tracking app, and the design of the app or questionnaire interface, as well as the preceding questions, will likely affect reporting. The majority of studies examining the efficacy of symptom self-report have focused on psychiatric disorders. These have generally found agreement between patient self-report and clinician assessment, although this varies from 60% to 90% (Lyu et al.; Silverstein et al.; Chan et al.). In major depressive disorder, self-reported symptoms are more severe than clinician assessed symptoms (Lyu et al.). When self-tracking for health and fitness purposes, BMI is systematically under-reported (Wilson et al.). Knowledge of test status could also affect symptom reporting, though this will be less of an issue in the CIS dataset, where individuals will not yet have received their test result. Some studies involve reporting on behalf of others, particularly children or adults receiving care, and communicating the subjective experience of symptoms might be challenging in these cases. When reporting symptoms related to cancer treatment, a dyad (parent and child) approach to reporting symptoms was found to be more effective and preferable to child self-reporting or parent proxy reporting alone (Tomlinson et al.).
Phase of infection: The symptom reporting window around positive test time varies across the different datasets. There is evidence from previous studies (Sudre et al., 2021) that some symptoms tend to appear earlier in infection while some appear later. We also know that people who test negative, who are not included in this dataset, report a wide range of symptoms that are not related to SARS-CoV-2 infection (Elliott et al., 2021); widening the symptom reporting window around a test date might include symptoms that are non-specific to the SARS-CoV-2 infection. Our approach collapses across time and these variations in the reporting window could affect our findings regarding symptom frequency and clustering. While there is not a way of varying this for the routinely collected NHS Test and Trace data, we do conduct sensitivity analyses to examine a wider symptom reporting window around the day of testing for the ONS dataset, making it more comparable to CSS. We arbitrarily define positive episodes as a new positive occurring more than 90 days after an index positive or after 4 consecutive negative tests, and consider symptoms reported in [-7,+35] days around the index positive. We do not find that this wider symptom window affects our clustering and co-occurrence findings.
Epidemic phase: The characteristics of cases differ over the course of the epidemic, for example by age, region, socioeconomic characteristics or variant of SARS-CoV-2 infection, which in turn could plausibly affect the symptoms experienced and the likelihood that they are reported. Some positive cases could be from single or double-vaccinated individuals, particularly from later time periods in Winter/Spring 2021. Similarly to our AlignedUMAP embeddings for age-stratified data, we could also produce AlignedUMAP embeddings for time-stratified data, allowing us to investigate how symptom co-occurrence patterns change over time. This would be of particular interest as vaccination effects build, or as a new variant with a different disease profile becomes dominant. The requirement of such an analysis is that each time stratum has a sufficient number of points such that the estimated Jaccard distance matrix is not subject to significant uncertainty. An initial exploration of this analysis was performed for the Pillar 2 and SGSS datasets by stratifying into week-long strata; however, no significant changes to the symptom co-occurrence patterns were observed during this time period.
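The episode definition used in the ONS sensitivity analysis (under ‘Phase of infection’ above) can be sketched as follows. This is one possible reading of the rule and not the study code: the function names and the example test history are hypothetical, the 90-day gap is assumed to be measured from the most recent index positive, and “consecutive negatives” is assumed to mean negative tests with no intervening positive.

```python
from datetime import date

def index_positives(tests, gap_days=90, negatives_needed=4):
    """Return the index-positive date of each episode.

    tests: list of (test_date, is_positive) tuples sorted by date.
    """
    index_dates = []
    negatives_since_positive = 0
    for day, positive in tests:
        if not positive:
            negatives_since_positive += 1
            continue
        new_episode = (
            not index_dates
            or (day - index_dates[-1]).days > gap_days
            or negatives_since_positive >= negatives_needed
        )
        if new_episode:
            index_dates.append(day)
        negatives_since_positive = 0  # any positive breaks the run of negatives
    return index_dates

def in_window(report_day, index_day, before=7, after=35):
    """True if a symptom report falls within [-before, +after] days of the index positive."""
    return -before <= (report_day - index_day).days <= after

# Hypothetical test history for one participant.
tests = [
    (date(2020, 11, 1), True), (date(2020, 11, 8), True),
    (date(2020, 11, 15), False), (date(2020, 11, 22), False),
    (date(2020, 11, 29), False), (date(2020, 12, 27), False),
    (date(2021, 1, 24), True), (date(2021, 3, 1), True),
]
episodes = index_positives(tests)
```

In this example the positive on 24 January 2021 starts a second episode because it follows four consecutive negatives, even though it falls within 90 days of the first index positive.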
Appendix B Supplementary Text
B.1 Symptom frequencies
All datasets include only cases reporting at least one symptom for these analyses. The most commonly reported symptom across all datasets was headache, reported by approximately half of cases in the Pillar 2, SGSS and CIS datasets and by almost two thirds of those from CSS (Fig. 3). The frequency of systemic symptom reports is high across the datasets. Fever, a systemic symptom intended to prompt isolation and testing in the UK, was experienced by less than one third of all symptomatic cases. Cough, another symptom that initiates isolation and testing (when new and continuous, which was not captured in these datasets), was also common (39% to 59%). NHS Test and Trace did not include any other lower respiratory tract symptoms, but shortness of breath was experienced by 24% in CIS and by 5% in CSS, while 26% of symptomatic cases participating in CSS reported chest pain (not collected in other datasets). Each dataset includes information about altered/loss of smell and/or taste but collected it differently, though all variations were commonly reported. Altered/loss of smell was most frequently reported in the CSS (52%), while loss of taste and smell separately (CIS) and in combination (NHS Test & Trace) was reported by over 30%. These symptoms also trigger isolation and testing. Sore throat was a common upper respiratory symptom in all datasets (30% to 42%). Sneezing and rhinitis, only collected by Test and Trace, were reported by around one quarter of symptomatic cases. Gastrointestinal symptoms tended to be less frequent than systemic and respiratory symptoms, but were not unusual (mainly reported by 10-20%, though less frequently for vomiting alone), with the exception of loss of appetite, which was reported by between one quarter and one third of cases in Test and Trace and CSS, the datasets in which it was collected. Symptoms that we described as ‘altered state’ were rarer and not collected in CIS.
Rash and nosebleeds were reported by 2% of symptomatic cases in Test and Trace, but not collected in CIS or CSS.
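The frequencies above are proportions among symptomatic cases only. A minimal sketch of that calculation is given below; the function name and the representation of each case as a set of symptom names are assumptions made for illustration.

```python
from collections import Counter


def symptom_frequencies(cases):
    """Proportion of symptomatic cases reporting each symptom.

    cases: list of sets of symptom names; cases with an empty set are
    excluded, so the denominator is the number of symptomatic cases.
    """
    symptomatic = [c for c in cases if c]
    if not symptomatic:
        return {}
    counts = Counter(symptom for c in symptomatic for symptom in c)
    return {symptom: n / len(symptomatic) for symptom, n in counts.items()}
```

Because the denominator excludes cases reporting no symptoms, these proportions are not directly comparable to frequencies computed over all positive cases.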
B.2 UMAP results without alignment between datasets
Looking at Fig. S12, we see a global structure similar to what we observe in the main paper using the AlignedUMAP algorithm. The embeddings of most datasets can be described by a central cluster of systemic and lower respiratory tract symptoms. Upper respiratory tract symptoms, such as rhinitis, sneezing, hoarse voice and sore throat, are typically placed close to the systemic symptoms cluster, with the exception of loss of smell and taste symptoms. Gastrointestinal symptoms are often placed further away from the upper respiratory tract symptoms, and often form a tail leading to some of the rarer symptoms. We note that these embeddings synthesise the results we observed from the LPCA loadings, where the second loading suggested that cases could be separated based upon whether they predominantly experienced upper respiratory tract symptoms, or systemic and gastrointestinal symptoms. The relatively low rates of occurrence of gastrointestinal symptoms explain their appearance high in the hierarchical tree, while the higher frequency of systemic and respiratory symptoms explains their relative importance in LPCA loadings, within the general structure revealed by UMAP.
We repeat the UMAP analysis without alignment, this time with the algorithm tuned to focus more on the local structure of the data and less on the global structure. This produces better separation of the symptoms into clusters in the low-dimensional embeddings; however, some of the relationships between these clusters may be lost. In the resulting embeddings, we observe several pairs of symptoms that commonly co-occur but appear to be distinct from the main body of other symptoms, notably sneezing and rhinitis in the Pillar 2 and SGSS datasets, headache and sore throat in the CSS dataset, and loss of smell and taste in the CIS dataset. The remaining symptoms are often packed into 2 tight clusters. For Pillar 2 and SGSS, a clear separation is observed between systemic and upper respiratory tract symptoms on the one hand, and the less frequently occurring gastrointestinal, altered state and other symptoms on the other. Similarly, gastrointestinal symptoms are placed into their own cluster in CIS, and in CSS with the exception of loss of appetite. Focusing more on the local structure can make the resulting embeddings more variable between datasets, as the choice of symptoms included in the dataset appears to make more of a difference. We note that the embeddings focusing more on local structure can also be more variable between repeats; however, they do highlight small local structures in the data. The aligned UMAP results in the main paper focus more on local structure; however, the requirement to align several related slices of the datasets appears to make these results more consistent between runs.
Looking at Fig. S13 and Fig. S12, we see a global structure to the relationship between symptoms that synthesises other results. This is clearest in the CIS data, where we can draw a line from gastrointestinal through systemic to respiratory tract symptoms, but with sore throat closer to cough than it is to loss of taste and smell. Such a line could be interpreted as describing a spectrum of COVID-19 symptoms. In the other datasets, this pattern is complicated by other types of symptom, which typically occur closest to gastrointestinal symptoms. The relatively low frequency of these symptoms explains their appearance high in the hierarchical tree, while the higher frequency of systemic and respiratory symptoms explains their relative importance in LPCA components within the general structure revealed by UMAP.
B.3 Age stratified findings
We repeated our main analyses (hierarchical clustering, Logistic PCA and AlignedUMAP) on each dataset, stratified by broad age groups: children (0-17 years), adults (18-54 years) and elder adults (55+ years); see Supplementary Figures S4-S21.
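The age stratification can be sketched as follows. The group boundaries are those stated above; the function names and the input layout (pairs of age in whole years and a set of reported symptoms) are assumptions made for illustration.

```python
def age_group(age):
    """Map an age in whole years to one of the three broad age groups."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return "children"
    if age <= 54:
        return "adults"
    return "elder adults"


def stratify_by_age(cases):
    """Group cases by broad age group.

    cases: iterable of (age, symptoms) pairs, where symptoms is a set of
    symptom names. Returns a dict mapping group name to a list of symptom sets.
    """
    strata = {}
    for age, symptoms in cases:
        strata.setdefault(age_group(age), []).append(symptoms)
    return strata
```

Each stratum returned by `stratify_by_age` can then be analysed separately, in the same way as the unstratified data.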
Broadly, we did not find strong differences in the clustering and co-occurrence patterns of symptoms across age groups and studies. The unstratified findings reflect more strongly the middle age category (18-54 years), which accounts for the majority of the sample in each dataset. It is possible that symptom data collection, particularly among young children, where it relies upon caregiver reports, could contribute to explaining some of the differences observed.
The clear separation of gastrointestinal symptoms and loss of taste and smell is observed across the age strata in the CIS (Supplementary Figure S7), with minor differences in the order in which some other individual symptoms join the tree (e.g. shortness of breath among children and sore throat amongst elder adults). In the Pillar 2 and SGSS datasets (Supplementary Figures S4 and S5, respectively), the rarer symptoms separate earlier from other symptoms across age groups, with some later separation between systemic and upper respiratory symptoms observable. Patterns did not differ greatly across the age strata. Across age strata, symptoms among cases in the CSS (Supplementary Figure S6) show shortness of breath and delirium (rare symptoms) separating early, followed by some gastrointestinal symptoms (diarrhoea and abdominal pain) and then, most clearly among adults aged 18-54, a split between systemic and gastrointestinal symptoms on one side and primarily lower and upper respiratory symptoms on the other.
For all age-stratified LPCA analyses plotted in Supplementary Figures S8-S10, the first principal component essentially describes variation in severity, followed by characterisation according to either upper respiratory (loss of taste and smell) or upper respiratory symptoms. For CIS, plotted in Supplementary Figure S11, cough had a high loading on the second component among children but not adults or elder adults, pointing in the opposite direction to upper respiratory symptoms. The presence of gastrointestinal symptoms was more important in describing cases among elder adults, compared to children, with adults aged 18-54 years in between.
Similar patterns of separation between upper respiratory, systemic and gastrointestinal symptoms are seen across age groups in the UMAP embeddings when hyperparameters are selected that produce well-separated clusters (Supplementary Figures S18-S21). Despite the age strata being coarser here than in the results of the main paper (Fig. 4), we do observe similar structural changes to the data: in the stratum for children, we often observe the formation of several small clusters of symptoms; in the stratum for adults, the embeddings tend to resemble a larger cluster; and in the stratum for elder adults, the embeddings again start to fragment into two smaller clusters of symptoms. The structural changes are less striking than in the results in Fig. 4, where finer age slices are used. However, this is expected, given that the coarser age strata used in Supplementary Figures S18-S21 make it harder for UMAP to detect structural changes to patterns of symptom co-occurrence that occur over small changes in age.
The results from tuning the UMAP algorithm to focus more on global structure are plotted in Supplementary Figures S14-S17. Unlike in embeddings that focus more on the local structure of the dataset, we do not observe strong separation of symptoms into several small clusters in the youngest population, or separation into two main clusters in the elderly population. This is to be expected, as focusing more on the global structure results in an embedding that attempts to describe more of the spectrum of the disease, and less of the small groups of commonly co-occurring symptoms, providing a complementary analysis. Our interpretation is that, in the youngest and oldest age groups, patterns of co-occurrence of reported symptoms do change, particularly for pairs of symptoms; however, we do not observe significant changes to the overall spectrum of the disease, which can still be broadly described by the number of symptoms experienced, and then by the relative contribution of upper respiratory tract symptoms or gastrointestinal symptoms. Across Pillar 2, SGSS and CIS we consistently observe a central cluster of systemic and lower respiratory tract symptoms. Upper respiratory tract symptoms are clustered close to the systemic symptoms, but further away from the gastrointestinal symptoms. The CSS dataset differs most: there, shortness of breath, fatigue and delirium are clustered close to gastrointestinal symptoms, but further away from the main cluster of systemic, upper respiratory tract and lower respiratory tract symptoms.