Since the outbreak of the novel SARS-CoV-2 in late 2019, the spread of COVID-19 has changed nearly every aspect of our daily life, challenging modern society to find a way to function under conditions never seen before. In the absence of an effective treatment to cure COVID-19 or a vaccine to prevent it, governmental public-health plans have played a crucial role in its control. The variables typically reported in the COVID-19 pandemic, available in public repositories (such as [1, 2], among others), are the total cases, active cases, discharged/recovered cases, and deaths. These variables serve as input for the development and evaluation of governmental plans and for fitting the vast variety of SIR-like mathematical models recently proposed (see, e.g., [3, 4, 5, 6] and references therein for a brief review of them). Among the several factors conditioning the quality and reliability of these variables are the sensitivity and specificity of diagnostic tests, delays between sampling and diagnosing, and the delay between presenting symptoms and getting tested. The latter varies from country to country, depending heavily on the local government’s testing strategy and resources.
Different parameters can be used to evaluate the evolution of the SARS-CoV-2 outbreak. Among them are the documentation rate, the secondary infection rate, the serological response to infection, the number of vacancies at ICUs, and the Basic Reproduction Number $R_0$ [10, 11], which is one of the most widely used. This parameter represents the number of persons a single infected individual might infect before either recovering or dying. Traditional ways of estimating $R_0$ are rather complex, depending heavily on the fitting of SIR models to local data [13, 14, 15, 16]. In a previous work [17], we proposed a methodology to obtain real-time estimations of $R_0$ directly from raw data, which was satisfactorily applied to evaluate the panorama of the COVID-19 spread in different countries and to forecast its evolution [18]. Nevertheless, its heavy dependence on the reported data required the study of the common error sources affecting it, and the development of methodologies to control, correct, and quantify their impact.
In this work, we analyze the sources of error in the typically reported epidemiologic variables and their impact on our understanding of COVID-19 spreading dynamics. We address the existence of different delays in the report of new cases, induced by the incubation time of the virus and testing-diagnosis time gaps, and provide a straightforward methodology to avoid the propagation of delay-induced errors to model-derived parameters. Using our statistically-based algorithm, we perform a temporal reclassification of individuals to the day where they were -statistically- most likely to have acquired the virus, building a new smooth curve with corrected variables. We present an analogous methodology to estimate the number of discharged/recovered individuals, based on the reported evolution of the viral infection, the performance of the different tests for its diagnosis, and the case fatality, which can be easily adapted for a particular country. We used our methodology to assess the evolution of the pandemic in Chile, identifying different moments in which data was misleading governmental actions.
2 On the performance of tests for diagnosing COVID-19
Different methods for diagnosing COVID-19 have been developed and reported in the literature, with real-time RT-PCR being the standard applied worldwide. Nevertheless, techniques such as the IgG and IgM rapid tests, Chest Computed Tomography (Chest CT), and CRISPR-Cas systems are also being used. In this section we provide a brief analysis of them, highlighting the characteristics of each technique and its basic approach to detecting the viral infection.
2.1 Real-time RT-PCR
Real-time reverse transcription polymerase chain reaction (RT-PCR) is a mechanism for amplification and detection of RNA in real time (see [20] for an exhaustive description of the technique). Initially, RNA obtained from samples is retrotranscribed to DNA using a reverse transcriptase enzyme. By applying temperature cycles, the conditions are created for new copies of the DNA to be synthesized from the initial one. The lower the initial DNA concentration, the lower the probability of a synthesis reaction in a given cycle. It is presumed that the minimum time from contagion until testing positive in the RT-PCR test is ( CI, –) before the onset of symptoms, which typically appear 5.2 days ( CI, –) after contagion. Unsurprisingly, [21, 23] showed that to of the total infections occur in the pre-symptomatic period. It has been inferred that the viral load reaches a peak value 0.7 days before the onset of symptoms ( CI, –), after which it falls monotonically together with the infectivity. Finally, the virus has been detected for a median of 20 days after the onset of symptoms, but infectivity may decrease significantly eight days after that moment.
A high false-negative rate [25, 26] and a sensitivity of 71 to [27, 28] have been reported for the real-time RT-PCR technique, and several of its vulnerabilities have been identified and quantified. Considering sample collection, handling, testing, and reporting, the total time necessary to obtain the RT-PCR results may range from 2 to 3 days. However, the RT-PCR experiment itself takes only about 2 to 3 hours.
2.2 IgG and IgM rapid tests
Part of the immune response to the SARS-CoV-2 infection is the production of specific antibodies against it, including IgG and IgM . Serological tests detect the presence of those antibodies and, unlike the other detection methods, take only to produce results . These tests have a sensitivity of and a specificity of . This technology was developed for the SARS-CoV epidemic, which was caused by a virus belonging to the same family of coronaviruses as SARS-CoV-2, providing satisfactory results after 2–3 days from the onset of symptoms (for IgG), and after eight days (for IgM) .
2.3 Chest Computed Tomography
The principle behind Chest Computed Tomography (Chest CT) is the analysis of cross-sectional lung images to identify viral pneumonia characteristics, like ground-glass opacity, consolidation, reticulation/thickened interlobular septa, or nodules. Chest CT has shown a sensitivity between and [28, 27]. However, due to the similarities between CT images of COVID-19 and CT images of other types of viral pneumonia, false positives are likely to occur. Compared to RT-PCR, Chest CT tends to be more reliable, practical, and quick to diagnose COVID-19. Nevertheless, it requires the potentially infected patient to be present in a health center, so it lacks the flexibility that rapid tests provide and can undermine movement restriction measures. COVID-19 pneumonia manifests with abnormalities on computed tomography images of the chest, even in asymptomatic patients.
2.4 CRISPR-Cas systems
In CRISPR-Cas systems, a guide RNA (gRNA) is designed to recognize a specific RNA sequence, like any particular gene of SARS-CoV-2 coronavirus. Endonuclease enzymes of the Cas family and the specific gRNA will search for the sequence match. This match will deliver a signal that confirms the presence of SARS-CoV-2 RNA in the sample . DETECTR and SHERLOCK are two examples of CRISPR-Cas technologies for the detection of SARS-CoV-2, being able to obtain results in less than 1 hour at a significantly lower cost compared to the RT-PCR technique. DETECTR showed a 95% positive predictive agreement and 100% negative predictive agreement , while SHERLOCK has not been validated using real patient samples and is not suitable for clinical use at this time .
Our work aims to expose and quantify, both theoretically and in a case study, the impact of different sources of error in commonly reported data of the COVID-19 spread, such as the newly reported cases, the total cases, and the infected and recovered fractions of the population.
First, we define random variables associated with the delay in both sampling and diagnosing new cases, and by modeling their probability distribution functions, we derive a method to reclassify the newly reported cases accordingly. As the reclassification occurs backward, re-evaluating the current scenario through our methodology requires predictions of the reported new cases, which we obtain using an ARIMA auto-regression model. Having the corrected variables, we evaluate differences in the values of $R_0$, following the methodology presented in [17]:
Auto-regression models for the forecast of new cases were implemented using the statsmodels Python library [38]. All other calculations and visualizations were made in MATLAB R2018a.
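As a rough illustration of the forecasting step, the sketch below fits an ARIMA(1,1,0) model by conditional least squares in plain Python. It is a hand-rolled stand-in, not the statsmodels implementation used in this work, and the model order (1,1,0) as well as the function names `fit_arima_110` and `forecast_arima_110` are assumptions made only for this example.

```python
def fit_arima_110(series):
    """Fit d_t = c + phi * d_{t-1} on first differences d_t by least squares.

    This is ARIMA(1,1,0) with a drift term; the order is an assumption
    made for illustration only.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least four observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0.0:
        # Constant differences: no autoregressive signal, pure drift.
        return mean_y, 0.0
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = cov_xy / var_x
    return mean_y - phi * mean_x, phi


def forecast_arima_110(series, steps):
    """Iterate the fitted recursion forward and return the forecast levels."""
    c, phi = fit_arima_110(series)
    level, diff = series[-1], series[-1] - series[-2]
    forecast = []
    for _ in range(steps):
        diff = c + phi * diff
        level += diff
        forecast.append(level)
    return forecast
```

A call such as `forecast_arima_110(reported_new_cases, steps=7)` would extend a hypothetical list of daily counts by one week; these extra values are what the backward reclassification needs in order to evaluate the most recent days.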
4.1 Temporal misclassification of new cases
We can assume the incubation period follows an exponential distribution with:
This incubation time is especially relevant in the case of a symptoms-based testing strategy or when the spread has reached the non-traceability stage. Moreover, even though the time required for performing the test is short, delays between testing and diagnosis have been reported. We will sum up secondary delays, such as the symptom-testing and testing-diagnosing time gaps, into a random variable, which, for the sake of simplicity, will be assumed to follow a uniform distribution between a lower and an upper bound:
Consequently, we may postulate a reclassification that obtains the real new contagions on a given day as the sum of the contributions of the cases reported after each possible delay:
where represents the fraction of patients that were notified at time but had acquired the virus at . Note that the different delays are referred to the random variable . The probability distribution function for is obtained by the convolution method, assuming and are independent and combining equations 2 and 3:
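As a worked sketch of this convolution, assume the incubation time is $X$, exponential with mean $\mu$, the secondary delay is $Y$, uniform on $[a, b]$, and the total delay is $Z = X + Y$. The symbols $X$, $Y$, $Z$, $\mu$, $a$, and $b$ are labels chosen for this illustration. Under the independence assumption stated above, the density of the total delay would read:

```latex
f_X(x) = \frac{1}{\mu}\, e^{-x/\mu}, \quad x \ge 0,
\qquad
f_Y(y) = \frac{1}{b-a}, \quad a \le y \le b,

f_Z(z) = \int_{a}^{\min(z,\,b)} f_X(z-y)\, f_Y(y)\, dy
       \quad \text{for } z \ge a,

f_Z(z) =
\begin{cases}
0, & z < a,\\[4pt]
\dfrac{1 - e^{-(z-a)/\mu}}{b-a}, & a \le z \le b,\\[8pt]
\dfrac{e^{-(z-b)/\mu} - e^{-(z-a)/\mu}}{b-a}, & z > b.
\end{cases}
```

The density is zero before the minimum secondary delay $a$, rises while $z$ sweeps the uniform window, and decays exponentially at rate $1/\mu$ afterwards.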
Assuming that data is reported on a daily basis, we can calculate the probability associated to having a delay of days:
For practical reasons, we can define a threshold for truncating the probability mass distribution (equation 6), which otherwise would assign a probability to every . Let be the first positive integer for which equation 7 holds,
we may rewrite equation 4 as:
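The whole reclassification can be sketched in a few lines of Python. The sketch assumes an exponential incubation time with mean `mean_incubation`, a uniform secondary delay on `[a, b]`, daily probabilities defined as the mass of the total delay on `[k, k+1)`, and a truncation at the first day whose cumulative probability reaches a chosen `coverage`. The function names, the `coverage` parameter, and the final renormalisation are assumptions of this example rather than the exact definitions used in the equations of this section.

```python
import math


def delay_cdf(z, mean_incubation, a, b):
    """CDF of Z = X + Y with X ~ Exponential(mean) and Y ~ Uniform(a, b)."""
    if z <= a:
        return 0.0
    mu, width = mean_incubation, b - a
    if z <= b:
        return ((z - a) - mu * (1.0 - math.exp(-(z - a) / mu))) / width
    tail = mu * (math.exp(-(z - b) / mu) - math.exp(-(z - a) / mu)) / width
    return 1.0 - tail


def delay_pmf(mean_incubation, a, b, coverage=0.99):
    """Daily probabilities P(k <= Z < k + 1), truncated and renormalised."""
    pmf, k = [], 0
    # Keep adding days until the cumulative probability reaches `coverage`.
    while delay_cdf(k, mean_incubation, a, b) < coverage:
        upper = delay_cdf(k + 1, mean_incubation, a, b)
        lower = delay_cdf(k, mean_incubation, a, b)
        pmf.append(upper - lower)
        k += 1
    total = sum(pmf)
    return [p / total for p in pmf]  # renormalisation is an assumption


def reclassify(reported, pmf):
    """Move cases reported on day t + k back to day t with weight pmf[k]."""
    n_days = len(reported)
    corrected = [0.0] * n_days
    for t in range(n_days):
        for k, p in enumerate(pmf):
            if t + k < n_days:
                corrected[t] += p * reported[t + k]
    return corrected
```

Because each corrected day draws on reports from later days, the last `len(pmf) - 1` entries of the corrected series are incomplete until those reports (or forecasts of them) are appended to `reported`.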
The magnitude of the total delay between infection and diagnosis can be estimated through the expected value of equation 5 (or equivalently, equation 6). A schematic representation of the proposed methodology is presented in Figure 1. Assuming the lowest reported value for the average incubation time, , and a conservative timeframe for the delay between the appearance of symptoms, testing and diagnosing, (2–5 days), the expected delay is about .
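As a quick check of the order of magnitude, and assuming the total delay is the sum of an exponential incubation time with mean $\mu$ and a uniform secondary delay on $[a, b]$ (symbols chosen for this illustration), linearity of expectation gives:

```latex
\mathbb{E}[Z] = \mathbb{E}[X] + \mathbb{E}[Y] = \mu + \frac{a+b}{2},
\qquad
a = 2,\; b = 5 \;\Rightarrow\; \mathbb{E}[Z] = \mu + 3.5 \text{ days}.
```

With the 2–5 day window quoted above, the secondary delays alone contribute 3.5 days on average, on top of whatever mean incubation time is adopted.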
4.2 Case discharge/recovery criteria
As discussed previously, errors in the number of discharged/recovered patients are likely to be greater only when no quantitative criteria are applied. In such cases, some countries (like Chile) have adopted the following criteria (officially reported in [42]), possibly based on the recommendations published by the WHO [43].
If there were no previously existing pathology, a patient would be discharged 14 days after testing positive for COVID-19.
If there were previously existing pathologies, the patient would be discharged 28 days after testing positive for COVID-19.
This criterion turns out to be quite simplistic, especially considering the existence of uncertainties regarding the diagnosis and contagion days. If we try to model the probability of recovering from COVID-19, some assumptions are necessary. Let be the random variable for the time of discharge/recovery:
Further assumptions are necessary to estimate the probability distribution function , as it depends on local diagnosis criteria, testing strategy, and the fraction of the population having preexisting pathologies. In particular, depending on the test applied for diagnosis and its sensitivity/specificity – which were carefully described in Section 2– the probability profile would change. The simplest form that can be assumed, and which, for clarity reasons, is adopted herein, is a triangular distribution:
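A minimal sketch of how a triangular recovery-time distribution could be turned into daily recovery estimates is given below. The bounds of 14 and 28 days in the usage note are borrowed from the discharge criteria above purely for illustration, the mode of 21 days is hypothetical, and the sketch ignores case fatality and test performance; none of these choices should be read as the parameters adopted in this work.

```python
import math


def triangular_cdf(x, low, mode, high):
    """CDF of a triangular distribution on [low, high] peaking at `mode`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


def recovery_pmf(low, mode, high):
    """Probability of recovering d days after diagnosis, per whole day d."""
    days = range(int(math.floor(low)), int(math.ceil(high)))
    return {
        d: triangular_cdf(d + 1, low, mode, high)
        - triangular_cdf(d, low, mode, high)
        for d in days
    }


def expected_recoveries(new_cases, pmf):
    """Spread each day's new cases over their likely recovery days."""
    recovered = [0.0] * len(new_cases)
    for t, cases in enumerate(new_cases):
        for d, p in pmf.items():
            if t + d < len(new_cases):
                recovered[t + d] += p * cases
    return recovered
```

For example, `expected_recoveries(new_cases, recovery_pmf(14, 21, 28))` would return, for a hypothetical list `new_cases`, the expected number of recoveries per day under these illustrative parameters.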
5 Case study: COVID-19 spreading dynamics in Chile
The spread of COVID-19 in Chile is far from being controlled, as shown by the exponential growth that new cases have had in recent weeks. In order to apply our methodology, we need to cast predictions on the trend of newly reported cases. Figure 2 presents the current and forecast trends, using an auto-regression ARIMA model.
First, we perform the temporal reclassification of new cases to obtain the corrected curve of new contagions, presented in Figure 3. It can be seen that our methodology, besides producing a horizontal semi-displacement, generates a smooth curve. The last part of the red curve is dashed because it partially contains contributions from the forecast of new cases, and therefore might change in the upcoming days, when the data required for completing the reclassification of cases become available.
Using such values, we proceed to calculate $R_0$, using equation 1 with raw data, moving averages of raw data, and the methodology proposed herein. As shown in Figure 4, an abrupt growth in $R_0$ was evidenced around April 22nd, consistent with the relaxation of the restrictive measures that were applied in Santiago and the apogee of the governmental plan for a “safe return to work”. Even though the different trends seem to decrease again in the second week of May, this is not totally clear, as the forecast of new cases strongly influences the statistical correction for that week, and it is well known that forecast models fail to predict exponential growth [44, 45].
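For reference, the moving-average baseline mentioned above can be sketched as a trailing window over the raw series. The trailing alignment, the shorter windows at the start of the series, and the `window` parameter are assumptions of this example; the window actually used for the figures is not stated here.

```python
def moving_average(values, window):
    """Trailing moving average; early entries use the data available so far."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Unlike the reclassification, this smoothing does not shift cases toward their likely contagion day, so delay-induced errors remain in the smoothed curve.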
The different iconic dates highlighted in Figure 4 were obtained from the chronology presented in [46] and references therein. By analyzing the corrected trends of $R_0$, we can assess both the success and the misleading effect that the different governmental actions have had on the spreading dynamics of COVID-19 in Chile. The different actions were strongly related to the values locally observed from raw data, yet appear to have come too late according to the statistically corrected trend. In particular, the apogee of the governmental plan for a safe return to work happened when the raw-data-driven values were at a minimum, but the corrected trends showed steep growth.
We have presented an exhaustive assessment of error sources in reported data of the COVID-19 pandemic and provided a methodology to minimize –and correct– their effect on both reported variables and model-derived parameters by applying a statistically-driven reclassification of newly reported cases, and corrections to discharge/recovery criteria. By using the corrected variables, SIR-like models could be fitted directly, as every value would represent the real dynamics. We present the methodology in a general framework, aiming to provide a useful tool for researchers and decision-making actors looking to adapt it for their particular interests.
In a case study on the spreading dynamics of COVID-19 in Chile, we observed the effects that different iconic actions taken by the government had on $R_0$ and discussed the reasons behind them in the light of both raw data and our methodology. The delay-induced error in raw data slowed down the reaction time, so the actions taken came too late. Our statistically-driven method corrected such error, exposing the real dynamics at a given time. The proposed methodology also serves as a non-invasive smoothing process, as it does nothing but temporally re-sort the cases according to their most likely delay.
We expect our methodology to serve as a valuable input for researchers trying to add statistical value to their calculations and to raise public awareness on the need for a proper (and standardized) strategy for the report and curation of data in the COVID-19 pandemic.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Conceptualization, SC, HAV; methodology, SC; validation, JPB-L, DM-O, AO-N; investigation, HAV, JPB-L, NL-K; writing, original draft preparation, JPB-L, HAV, NL-K, DM-O, SC; writing, review and editing, SC, JPB-L, DM-O, AO-N; supervision, SC, HAV; project administration, AO-N; funding resources, AO-N.
The authors gratefully acknowledge support from the Chilean National Agency for Research and Development through ANID PIA Grant AFB180004, and the Centre for Biotechnology and Bioengineering - CeBiB (PIA project FB0001, Conicyt, Chile). DM-O gratefully acknowledges Conicyt, Chile, for PhD fellowship 21181435.
-  Worldometers.info. Official numbers for the coronavirus outbreak in chile. https://www.worldometers.info/coronavirus/, 20 May, 2020. Accessed: 2020-05-20.
-  Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases, 2020.
-  Sebastian Contreras, H. Andres Villavicencio, David Medina-Ortiz, Juan Pablo Biron-Lattes, and Alvaro Olivera-Nappa. A multi-group seira model for the spread of covid-19 among heterogeneous populations, 2020.
-  Zifeng Yang, Zhiqi Zeng, Ke Wang, Sook-San Wong, Wenhua Liang, Mark Zanin, Peng Liu, Xudong Cao, Zhongqiang Gao, Zhitong Mai, et al. Modified seir and ai prediction of the epidemics trend of covid-19 in china under public health interventions. Journal of Thoracic Disease, 12(3):165, 2020.
-  Weijie Pang. Public health policy: Covid-19 epidemic and seir model with asymptomatic viral carriers. arXiv preprint arXiv:2004.06311, 2020.
-  Jacob B Aguilar, Jeremy Samuel Faust, Lauren M Westafer, and Juan B Gutierrez. Investigating the impact of asymptomatic carriers on covid-19 transmission. medRxiv, 2020.
-  Bryan Wilder, Marie Charpignon, Jackson A Killian, Han-Ching Ou, Aditya Mate, Shahin Jabbari, Andrew Perrault, Angel Desai, Milind Tambe, and Maimuna S Majumder. The role of age distribution and family structure on covid-19 dynamics: A preliminary modeling assessment for hubei and lombardy. Available at SSRN 3564800, 2020.
-  World Health Organization et al. Protocol for assessment of potential risk factors for coronavirus disease 2019 (covid-19) among health workers in a health care setting, 23 march 2020. Technical report, World Health Organization, 2020.
-  Mei Fong Liew, Wen Ting Siow, Graeme MacLaren, and Kay Choong See. Preparing for covid-19: early experience from an intensive care unit in singapore. Critical Care, 24(1):1–3, 2020.
-  Mattia Allieta, Andrea Allieta, and Davide Rossi Sebastiano. Covid-19 outbreak in italy: estimation of reproduction numbers over two months toward the phase 2. medRxiv, 2020.
-  Jana Gevertz, James Greene, Cynthia Hixahuary Sanchez Tapia, and Eduardo D Sontag. A novel covid-19 epidemiological model with explicit susceptible and asymptomatic isolation compartments reveals unexpected consequences of timing social distancing. medRxiv, 2020.
-  Antoine Perasso. An introduction to the basic reproduction number in mathematical epidemiology. ESAIM: Proceedings and Surveys, 62:123–138, 2018.
-  Johan Andre Peter Heesterbeek. A brief history of r 0 and a recipe for its calculation. Acta biotheoretica, 50(3):189–204, 2002.
-  Paul L Delamater, Erica J Street, Timothy F Leslie, Y Tony Yang, and Kathryn H Jacobsen. Complexity of the basic reproduction number (r0). Emerging infectious diseases, 25(1):1, 2019.
-  Y Wang, XY You, YJ Wang, LP Peng, ZC Du, S Gilmour, D Yoneoka, J Gu, C Hao, YT Hao, and JH Li. [estimating the basic reproduction number of covid-19 in wuhan, china]. Zhonghua liu xing bing xue za zhi = Zhonghua liuxingbingxue zazhi, 41(4):476–479, April 2020.
-  Junling Ma. Estimating epidemic exponential growth rate and basic reproduction number. Infectious Disease Modelling, 5:129–141, 2020.
-  Sebastian Contreras, H. Andres Villavicencio, David Medina-Ortiz, Claudia P Saavedra, and Alvaro Olivera-Nappa. Real-time estimation of r0 for supporting public-health policies against covid-19. medRxiv, 2020.
-  David Medina-Ortiz, Sebastian Contreras, Y Barrera-Saavedra, Gabriel Cabas-Mora, and Alvaro Olivera-Nappa. Country-wise forecast model for the basic reproduction number in the covid-19 outbreak. Under review in Frontiers in Physics, 2020.
-  Giuseppe Lippi, Ana-Maria Simundic, and Mario Plebani. Potential preanalytical and analytical vulnerabilities in the laboratory diagnosis of coronavirus disease 2019 (covid-19). Clinical Chemistry and Laboratory Medicine (CCLM), 1(ahead-of-print), 2020.
-  UE Gibson, Christian A Heid, and P Mickey Williams. A novel method for real time quantitative rt-pcr. Genome research, 6(10):995–1001, 1996.
-  Xi He, Eric HY Lau, Peng Wu, Xilong Deng, Jian Wang, Xinxin Hao, Yiu Chung Lau, Jessica Y Wong, Yujuan Guan, Xinghua Tan, et al. Temporal dynamics in viral shedding and transmissibility of covid-19. Nature Medicine, pages 1–4, 2020.
-  Stephen A Lauer, Kyra H Grantz, Qifang Bi, Forrest K Jones, Qulu Zheng, Hannah R Meredith, Andrew S Azman, Nicholas G Reich, and Justin Lessler. The incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine, 2020.
-  Tapiwa Ganyani, Cecile Kremer, Dongxuan Chen, Andrea Torneri, Christel Faes, Jacco Wallinga, and Niel Hens. Estimating the generation interval for covid-19 based on symptom onset data. medRxiv, 2020.
-  Fei Zhou, Ting Yu, Ronghui Du, Guohui Fan, Ying Liu, Zhibo Liu, Jie Xiang, Yeming Wang, Bin Song, Xiaoying Gu, et al. Clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, china: a retrospective cohort study. The Lancet, 2020.
-  Yafang Li, Lin Yao, Jiawei Li, Lei Chen, Yiyan Song, Zhifang Cai, and Chunhua Yang. Stability issues of rt-pcr testing of sars-cov-2 for hospitalized patients clinically diagnosed with covid-19. Journal of Medical Virology, 2020.
-  Ai Tang Xiao, Yi Xin Tong, and Sheng Zhang. False-negative of rt-pcr and prolonged nucleic acid conversion in covid-19: Rather than recurrence. Journal of Medical Virology, 2020.
-  Yicheng Fang, Huangqi Zhang, Jicheng Xie, Minjie Lin, Lingjun Ying, Peipei Pang, and Wenbin Ji. Sensitivity of chest ct for covid-19: comparison to rt-pcr. Radiology, page 200432, 2020.
-  Chunqin Long, Huaxiang Xu, Qinglin Shen, Xianghai Zhang, Bing Fan, Chuanhong Wang, Bingliang Zeng, Zicong Li, Xiaofen Li, and Honglu Li. Diagnosis of the coronavirus disease (covid-19): rrt-pcr or ct? European journal of radiology, page 108961, 2020.
-  Trieu Nguyen, Dang Duong Bang, and Anders Wolff. 2019 novel coronavirus disease (covid-19): paving the road for rapid detection and point-of-care diagnostics. Micromachines, 11(3):306, 2020.
-  Zhengtu Li, Yongxiang Yi, Xiaomei Luo, Nian Xiong, Yang Liu, Shaoqiang Li, Ruilin Sun, Yanqun Wang, Bicheng Hu, Wei Chen, et al. Development and clinical application of a rapid igm-igg combined antibody test for sars-cov-2 infection diagnosis. Journal of medical virology, 2020.
-  Xiaowei Li, Manman Geng, Yizhao Peng, Liesu Meng, and Shemin Lu. Molecular immune pathogenesis and diagnosis of covid-19. Journal of Pharmaceutical Analysis, 2020.
-  Patrick CY Woo, Susanna KP Lau, Beatrice HL Wong, Kwok-hung Chan, Chung-ming Chu, Hoi-wah Tsoi, Yi Huang, JS Malik Peiris, and Kwok-yung Yuen. Longitudinal profile of immunoglobulin g (igg), igm, and iga antibodies against the severe acute respiratory syndrome (sars) coronavirus nucleocapsid protein in patients with pneumonia due to the sars coronavirus. Clin. Diagn. Lab. Immunol., 11(4):665–668, 2004.
-  Tao Ai, Zhenlu Yang, Hongyan Hou, Chenao Zhan, Chong Chen, Wenzhi Lv, Qian Tao, Ziyong Sun, and Liming Xia. Correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases. Radiology, page 200642, 2020.
-  Heshui Shi, Xiaoyu Han, Nanchuan Jiang, Yukun Cao, Osamah Alwalid, Jin Gu, Yanqing Fan, and Chuansheng Zheng. Radiological findings from 81 patients with covid-19 pneumonia in wuhan, china: a descriptive study. The Lancet Infectious Diseases, 2020.
-  Julianna LeMieux. Covid-19 drives crispr diagnostics: Crispr’s role in dna detection, not editing, may fill the gap in covid-19 testing. Genetic Engineering & Biotechnology News, 40(5):21–22, 2020.
-  James P Broughton, Xianding Deng, Guixia Yu, Clare L Fasching, Venice Servellita, Jasmeet Singh, Xin Miao, Jessica A Streithorst, Andrea Granados, Alicia Sotomayor-Gonzalez, et al. Crispr–cas12-based detection of sars-cov-2. Nature Biotechnology, pages 1–5, 2020.
-  Yuefei Jin, Haiyan Yang, Wangquan Ji, Weidong Wu, Shuaiyin Chen, Weiguo Zhang, and Guangcai Duan. Virology, epidemiology, pathogenesis, and control of covid-19. Viruses, 12(4):372, 2020.
-  Skipper Seabold and Josef Perktold. Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, volume 57, page 61. Scipy, 2010.
-  Qun Li, Xuhua Guan, Peng Wu, Xiaoye Wang, Lei Zhou, Yeqing Tong, Ruiqi Ren, Kathy SM Leung, Eric HY Lau, Jessica Y Wong, et al. Early transmission dynamics in wuhan, china, of novel coronavirus–infected pneumonia. New England Journal of Medicine, 2020.
-  Stephen A Lauer, Kyra H Grantz, Qifang Bi, Forrest K Jones, Qulu Zheng, Hannah R Meredith, Andrew S Azman, Nicholas G Reich, and Justin Lessler. The incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine, 2020.
-  JA Backer, D Klinkenberg, and J Wallinga. The incubation period of 2019-ncov infections among travellers from wuhan, china. medRxiv, 2020.
-  MINSAL. Tech report: Criteria for discharging a covid-19 infected individual (criterios que se consideran para un paciente covid-19 sin riesgo de contagio). https://www.minsal.cl/wp-content/uploads/2020/04/2020.04.13_ALTA-DE-CUARENTENA.pdf, 14 Apr, 2020. Accessed: 2020-05-23.
-  WHO. Considerations for quarantine of individuals in the context of containment for coronavirus disease (covid-19): interim guidance, 19 march 2020. Technical report, World Health Organization, 2020.
-  C. Cervellera, D. Macciò, and T. Parisini. Learning robustly stabilizing explicit model predictive controllers: A non-regular sampling approach. IEEE Control Systems Letters, pages 1–1, 2020.
-  Neil M. Lancastle. Is the impact of social distancing on coronavirus growth rates effective across different settings? a non-parametric and local regression approach to test and compare the growth rate. medRxiv, 2020.
-  Wikipedia.org. Cronología de la pandemia de enfermedad por coronavirus de 2020 en chile. https://es.wikipedia.org/wiki/Anexo:Cronolog%C3%ADa_de_la_pandemia_de_enfermedad_por_coronavirus_de_2020_en_Chile, 23 May, 2020. Accessed: 2020-05-23.