Nonlinear mixed-effects model applied to COVID-19 deaths data
The analysis of complex longitudinal data such as COVID-19 deaths is challenging due to several inherent features: (i) Similarly-shaped profiles with different decay patterns; (ii) Unexplained variation among repeated measurements within each country, these repeated measurements may be viewed as clustered data since they are taken on the same country at roughly the same time; (iii) Skewness, outliers or skew-heavy-tailed noises are possibly embodied within response variables. This article formulates a robust nonlinear mixed-effects model based in the class of scale mixtures of skew-normal distributions for modeling COVID-19 deaths, which allows the analysts to model such data in the presence of the above described features simultaneously. An efficient EM-type algorithm is proposed to carry out maximum likelihood estimation of model parameters. The bootstrap method is used to determine inherent characteristics of the nonlinear individual profiles such as confidence interval of the predicted deaths and fitted curves. The target is to model COVID-19 deaths curves from some Latin American countries since this region is the new epicenter of the disease. Moreover, since a mixed-effect framework borrows information from the population-average effects, in our analysis we include some countries from Europe and North America that are in a more advanced stage of their COVID-19 deaths curve.READ FULL TEXT VIEW PDF
Nonlinear mixed-effects model applied to COVID-19 deaths data
The world is facing a global pandemic of coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Initiated at the end of 2019 in Wuhan, China, we are coming, on the 26th of June of 2020, to more than 9 millions cases and 500 thousands deaths spread in 216 countries. The World Health Organization (WHO) has made a joint effort with nations to tackle the disease. Some countries as Germany, Italy, Spain and United Kingdom faced the peak of the disease in March 2020, and now appear to have the disease under control. In other regions such as South-East Asia and Africa it seems to be at the beginning (panel at WHO, https://covid19.who.int/). Currently, Latin American countries are the new epicenter of the disease.
Some institutions around the world have dedicated to data collection and graphical analysis. For example, the panel of the WHO presents graphs of the number of cases and deaths over time for regions and countries. The Johns Hopkins University collects and makes available the daily data, besides producing maps of the occurrence by countries (https://coronavirus.jhu.edu/map.html). Another available repository is the website Worldometers (https://www.worldometers.info/coronavirus/).
There are many institutions in the world modeling and performing forecast of the number of cases and deaths of COVID-19. For example, at the beginning of March 2020, an study of the Oxford University projected more than thousand of deaths in Brazil (https://www.pnas.org/content/117/18/9696). Another study from the University of Washington’s Institute for Health Metrics and Evaluation (IHME) projects a total of thousand of deaths (https://covid19.healthdata.org/projections). The Imperial College presented an specific study to Brazil that estimates with high confidence that the reproduction number remained above , indicating that the epidemic was not yet controlled and would continue to grow (ImperialColledge). In Brazil, there are some organizations studying the evolution of the disease over time allowing online apps with daily updates, for instance, the Department of Statistics of the Federal University of Minas Gerais (covidlp) and the Department of Statistics of the Federal University of Juiz de Fora (JFsalvandotodos).
The Americas have become the epicenter of the global coronavirus outbreak, logging nearly million infections and deaths. Brazil, Chile, Colombia, Mexico, and Peru have been particularly hard hit in recent weeks. To point out the possible impact of the disease in Latin America, the covidlp, in June , estimates a total of more than thousand deaths only for these countries until the end of the pandemic.
Some studies for modeling the COVID-19 data are based on nonlinear models. TsallisT2020 proposed a nonlinear model based on volume of stock-markets to predict the number of active cases:
where , , , , and . The constant indicates the first day of appearance of the epidemic in that particular region; it is conventionally chosen to be zero for China; for other countries, it is the number of days elapsed between the appearance of the first case in China and the first case in that country. The normalizing constant reflects the total population of that particular country. We refer to TsallisT2020, for more details about the interpretation of the parameters .
The covidlp proposed an hierarchical Bayesian non-Gaussian and non-linear model to capture the dynamics of the epidemic (last accessed on the
of June on 2020). The daily counts are assumed to come from a Poisson distribution and the daily pandemic evolving by a generalized logistic dynamic:
where is responsible to capture subnotification on specific day(s), control the infection rate, is an asymmetry parameter, , and control the asymptote of the curve, given by , with the peak occurring at time . However, to the best of our knowledge, most models developed up to this date are univariate, not taking into account a possible structure of dependence on the disease among countries or regions. Also, as the onset of the disease occurs at different periods between countries, those at a more advanced stage of the disease can provide valuable information to countries at an early stage.
In recent years, nonlinear mixed-effects (NLME) models have been proposed for modeling many complex longitudinal data (Lindstrombates90; Wu2010). However, one often assumes that both random error and random effects are normally distributed, which may not always give reliable results if the data exhibit excessive skewness and heavy tailedness, as is the case of COVID-19 deaths data. In this paper, we present a novel class of asymmetric NLME models that provides an efficient estimation of the parameters in the analysis of longitudinal data. We assume that, marginally, the random effects follow a scale mixtures of skew-normal distribution (SMSN, Branco_Dey01) and that the random errors follow a symmetric scale mixtures of normal distribution (SMN, Lange93) providing an appealing robust alternative to the usual normal distribution in NLME models. We propose an approximate likelihood analysis for maximum likelihood (ML) estimation via an EM-type algorithm that produces accurate ML estimates and significantly reduces the numerical difficulty associated with the exact ML estimation. The newly approximate procedures are applied to COVID-19 deaths data and it is showed that models with skew-heavy-tailed assumption may provide more reasonable results and these may be important for COVID-19 research in providing quantitative guidance to better understand the stages and future development of the disease.
The purpose of this study is to predict the number of deaths caused by COVID-19 to short and long term, in countries at early stage such as Peru, Mexico, Chile, Brazil and Colombia along with some countries at a more advanced stage of the disease such as Belgium, Italy, the USA and the United Kingdom. Furthermore, a general bootstrap method is used for constructing confidence intervals for the fitted COVID-19 deaths curves and its mode (date of the peak) for the selected countries. The method proposed in this paper is implemented in R (rmanual), and the codes are available for download from Github (https://github.com/fernandalschumacher/NLMECOVID19).
The rest of the paper is organized as follows. In Section 2, we describe the motivating COVID-19 deaths datasets obtained from the John Hopkins repository. In Section 3, we present the methodology and associated ML estimation procedure via the EM-algorithm. In Section 4, we present our result, where we forecast the total number of deaths in the selected countries for the short term and days and long term and days with starting point at June 25 2020. Finally, the paper concludes with some discussions in Section 5.
The Johns Hopkins University through the Center for Systems Science and Engineering created a COVID-19 repository (https://github.com/CSSEGISandData/COVID-19/) that is updated daily with data from many locations of the world. The repository combine data from a variety of official agencies and others reliable sources and unify them.
Due to different testing capacity between countries, subnotification is a challenge to understand the true contamination numbers of COVID-19 in the population. Despite that subnotification might also happen in number of deaths, they are less likely to be affected by detection biases. In fact, recent studies use the number of deaths as a proxy measure for COVID-19 cases (see, for example, maugeri2020estimation; ribeiro2020underreporting; amaro2020global). Figure 1 shows the reported number of daily deaths in different countries.
As it can be seen the selected countries are in different stages of the COVID-19 pandemic. Belgium, Italy, the United Kingdom (UK), and the United States of America (US) present a controlled number of deaths by the disease. Other countries like Brazil, Mexico, and Peru seem to be around their peak, while Chile and Colombia apparently are in the increasing part of the curve. Figure 1 also presents a LOESS (cleveland1979robust) fit and a
days prediction for each series. Non-parametric curves are alternatives for model fitting because of their flexibility. As it can be seen, the curves have difficulties to capture the rapid increase in deaths in some countries but overall provide a reasonable fit. However, when using this model to make prediction we can clearly see that their extrapolation capacities are not reliable, making necessary the usage of a more adequate statistical parametric model to capture the nature of the pandemic and provide sensible fit and prediction.
Although that Peru moved quickly to lock down its citizens as the pandemic took hold in early March and has extended the lockdown until the end of June 2020, cases nonetheless exploded in May, reaching a peak of more than cases per day late in May, in part explained because the informal employment reaches of the Peruvian labor market. Peru has now reported cases and deaths of COVID-19 (Worldometers), the second-highest number of confirmed cases of the disease in Latin America, behind Brazil, and the seventh-highest globally.
In Mexico, the accumulated COVID-19 cases reached and a death total of until June . Moreover, the statistics show that the curve is ascending (Worldometers). TorrealbaCH2020
presented a modeling and prediction of accumulated cases of COVID-19 infection in Mexico using the models of Gompertz and Logistic and a framework of Artificial Neural Network. These models predict the peak between Mayand June , which might be unrealistic as presented in Section 4.
Colombia is in an ascending stage of the curve of cases and deaths, having accumulated cases and deaths until June, according to Worldometers. Rivera-Rodriguez2020.04.14.20065466 used a SEIR model to estimate the number of patients that would required Intensive Care Units (ICU) care (critical), and only hospital care (severe) in order to manage their limited resources. arm2020sir used a SIR model to estimate the number of cases and deaths in Colombia using data of Italy and South Korea where the peak of infections would be by June . As shown in Figure 1, Colombia presents an increasing pattern and the peak should be observed somewhere in the future.
For Chile, the statistics are of accumulated cases and a total of deaths until June (Worldometers). It is important to observe that there are two days with observations far above the standard: number of deaths of (June ) and number of cases of (June ).
Lastly, Brazil presented cases and deaths until June according to Worldometers and is the country with second-highest number of confirmed cases and deaths in the world. covidlp estimates that the Brazilian peak of deaths should happen in June and a estimated total number of death of which is closely related to the predictions presented in Section 4. From Figure 1 we can see that Brazil is the country in more advanced stage of the evolution of the disease in the Latin American region.
The different stages of spread of the disease are an important information that should be used into modeling. In the next section we introduce our methodology that jointly accommodates the different stages of the diseases and borrow information of the different time series to provide a more robust and reliable fit and prediction.
The idea of the SMSN distributions originated from an early work by Branco_Dey01, which included the skew-normal (SN) distribution as a special case. We say that a
random vectorY follows a SN distribution with location vector , positive definite dispersion matrix and skewness parameter vector (often known as the shape parameter) and write
if its probability density function (pdf) is given by
where stands for the pdf of the –variate normal distribution with mean vector and dispersion matrix , say, and
is the cumulative distribution function (cdf) of the standard univariate normal. Note for, (2) reduces to the symmetric -pdf, while for non-zero values of , it produces a perturbed (asymmetric) family of -pdf’s. Let Since for all scalar the SMSN family can be defined as follows: a SMSN distribution is that of a dimensional random vector
is a positive random variable with the cdfand pdf , and independent of the –random vector Here is a scalar or vector parameter indexing the distribution of the mixing scale factor . Given , follows a multivariate skew–normal distribution with location vector , scale matrix and skewness parameter vector , i.e., . Thus, by (2), the marginal pdf of is
The notation will be used when has pdf (4). The asymmetrical class of SMSN distributions includes the skew– (ST), the skew–slash (SSL), and the skew–contaminated normal (SCN). All these distributions have heavier tails than the skew-normal and can be used for robust inferences. When , the SMSN class reduces to the scale mixtures of normal (SMN) class, which is represented by the pdf . We use the notation when has distribution in the SMN class. We refer to schumacher2020scale for details and additional properties related to this class of distributions.
In this section, we present the models and methods in general forms, illustrating that our methods may be applicable in other applications as well. Denote the number of subjects by and the number of measurements on the th subject by . For notational convenience, let be a vector incorporating independent variables such as number of ICU beds, , . The NLME model can be written as
where the subscript is the subject index, , with being the response value for individual at time , , with being a nonlinear known function, is random error vector, is an -dimensional linear function, is the vector of random effects .
Following schumacher2020approximate, we assume that
where , with , with ,
is the unknown within-subject variance,is the variance-covariance matrix of , which depends on unknown and reduced parameters of dimension and is a vector of skewness parameters for the random effects. Using the definition of a SMSN random vector and (6), it follows that marginally
so that . Thus this model considers that the ’s, related to within-subject errors are symmetrically distributed, while the distribution of random effects is assumed to be asymmetric and with mean zero. In addition, under this consideration the regression parameters are all comparable.
which generally does not have a closed form expression because the model function is not linear in the random effects. In the normal case, to make the numerical optimization of the likelihood function in a tractable problem, different approximations to (8) have been proposed. Some of these methods consist of taking a first-order Taylor expansion of the model function around the conditional models of the random effects (Lindstrombates90). Following this idea, the marginal distribution of for can be approximated as
is an expansion point in a neighborhood of ,
, and denotes approximated
The approximated empirical Bayes estimator of , denoted by , obtained by the conditional mean of given , is
with , ,
We refer to Lachos_Ghosh_Arellano_2009, schumacher2020scale and schumacher2020approximate, for further details.
In this section, we demonstrate how to use the EM algorithm (Dempster77) to obtain approximate maximum likelihood (ML) estimator of a SMSN–NLME model. We denote the current estimates of by . In this case, the linearization procedure (Wu2010) consists of taking the first-order Taylor expansion of around the current parameter estimate and the random effect estimates , which is equivalent to iteratively solving the following linear mixed-effect (LME) model
for , where
and . The model defined in (11) can be seen as a slight modification of the general SMSN-LME model proposed by Lachos_Ghosh_Arellano_2009 and schumacher2020scale, where a simple and efficient EM-type algorithm for iteratively computing ML estimates of the parameters in the SMSN-LME model has been proposed and for which we use the
skewlmm package in R (skewlmm-manual; rmanual). The approximated likelihood function, derived from (9),
can be easily computed as a byproduct of the EM-algorithm and is used for monitoring convergence and for model selection, such as, the Akaike (AIC) and Bayesian Information Criterion (BIC) (wit2012all).
Suppose now that we are interested in the prediction of , a future vector of measurement of , given the observed measurement , where . The minimum mean square error (MSE) predictor of , defined as the conditional expectation given and , is given next.
Let be an expansion point in a neighborhood of , be an vector of future measurement of (or possibly missing), and be an matrix of known prediction regressors. Then, under the SMSN–NLME model the predictor (or minimum MSE predictor) of can be approximated as
, , , and
with The expression for are given in schumacher2020scale.
Bootstrap methods are statistical tools used in many statistical problems such as calculus of bias, variance, confidence intervals, sampling distributions of the estimators, among others. For technical details, we refer to EfronH2016. In this paper, we use the parametric Bootstrap to calculate a confidence interval for the mode (date of the peak) of COVID-19 deaths by country. Also, we construct confidence bands to the estimated COVID-19 deaths curve along the time. We generate random samples of length of the model (eq. 5–6) using the estimated parameters and for each random sample, we fit the proposed model and calculate the statistical mode, the fitted and predicted curve (). So, for each country, we will have estimates of the mode and . We calculate the percentiles and to construct a confidence interval of level for the mode and for the fitted curve of the COVID-19 deaths. In this notes, we use .
To propose the confidence interval for the peak of deaths, we select the two dates such that the percentile curve coincides with the highest value of the percentile curve. Thus, we guarantee that all curves belonging to the confidence band will peak within this interval.
In order to analyse the COVID-19 data described in Section 2, we propose to fit the SMSN-NLME model given in (5) with the derivative of the generalized logistic model as nonlinear function, which can be written as follows:
where , , and , for , with the exponential transformation being used to ensure positiveness of the parameters. Note that this function is similar to the one considered in the univariate approach of covidlp, as presented in (1), except that in (13) random effects are included to enable a multivariate approach and borrow information between the different time series.
For numerical stability, we use the linear transformation, where is the number of registered deaths at the th country and th day since first death, and
is chosen to be the sample standard deviation from the Colombia data, which is the smaller one in the observed data. For model selection, we consider AIC and BIC, as described in Section3.3, for distributions normal (N), skew-normal (SN), student’s-t (t), and skew-t (ST). As shown in Table 1, the distribution with lowest AIC and BIC is the ST, which will be used in the analysis hereafter. It is worth noting that the fitted ST distribution has (Table 2), evidencing the need for heavy tail models.
To obtain long term evolution estimates, we computed the 300-days-ahead prediction for all countries considered in this study. Figure 2 presents the fitted model, along with the prediction, the real data and confidence intervals information. To preserve the individual properties observed for each country, the random error was generated conditioned on the mixing scale factor , , which is estimated as a byproduct of the EM algorithm and is responsible to preserve the observed characteristics of the series for each country. Additionally, the 95% confidence intervals were obtained as described in Subsection 3.5, where samples were generated, from which the model estimate or prediction for samples resulted in numerical error and/or non-convergence. For the remaining samples we performed a 15% trim of the series to remove those noise series that were not able to replicate the characteristics of the data. To identify such series we used as metric the mean square error (MSE) between the fitted values and observed data and removed those such that the MSE have the highest values. This procedure was performed to prevent convergence problems to affect the interval estimates, resulting in a final of samples.
From Figure 2, we can see that for countries in more advanced stage of the evolution of COVID-19, the confidence intervals are narrow. On the other hand, for countries in early stage such as Chile and Colombia, the uncertainty regarding the disease evolution is much bigger, which is reflected by the wider confidence intervals and very vague information about the peak.
Table 3 reports predictions for the total number of deaths for each country and up to , , , and -days-ahead of the date from the latest observation considered (2020-06-24), using the fitted ST-NLME model (our selected model), along with its 95% confidence interval, obtained from the bootstrap results. As can be seen, based on the observed data, the European countries and the US have the deaths by COVID-19 practically controlled, since the death estimates are stable in all future prediction times. Brazil, Mexico and Peru are around its peak (Figure 3). Their prediction mildly increase with time and they present a little more imprecision as observed by their 95% confidence interval estimates if compared to the controlled countries. Finally, Chile and Colombia seem to be in the increasing phase. The predicted number of deaths changes drastically between the time intervals with a wide uncertainty.
|Country||Total expected by 2020-07-25||Total expected by 2020-08-24|
|Belgium||10 296 (9 675; 11 206)||10 301 (9 678; 11 212)|
|Italy||34 100 (32 779; 35 204)||34 128 (32 798; 35 230)|
|United Kingdom||42 181 (39 420; 44 465)||42 279 (39 495; 44 562)|
|US||125 165 (117 506; 128 895)||126 340 (118 297; 129 911)|
|Brazil||75 275 (70 409; 79 020)||86 786 (78 978; 91 442)|
|Mexico||47 037 (42 284; 53 337)||68 476 (56 279; 87 550)|
|Peru||12 664 (11 795; 14 170)||15 384 (13 937; 18 089)|
|Chile||12 386 (9 477; 16 034)||21 640 (13 024; 35 669)|
|Colombia||5 597 (4 306; 6 449)||9 359 (5 852; 12 001)|
|Country||Total expected by 2020-09-23||Total expected by 2020-11-22|
|Belgium||10 302 (9 679; 11 213)||10 302 (9 679; 11 213)|
|Italy||34 132 (32 800; 35 234)||34 133 (32 801; 35 235)|
|United Kingdom||42 295 (39 506; 44 577)||42 299 (39 509; 44 580)|
|US||126 627 (118 467; 130 158)||126 713 (118 512; 130 230)|
|Brazil||91 894 (82 288; 96 861)||94 889 (83 967; 100 163)|
|Mexico||83 630 (64 215; 113 683)||98 263 (69 998; 143 861)|
|Peru||16 761 (14 942; 20 432)||17 692 (15 551; 22 243)|
|Chile||30 517 (15 382; 64 188)||42 754 (17 081; 134 072)|
|Colombia||12 782 (6 944; 18 186)||17 242 (7 783; 29 461)|
Table 4 shows the total estimated number of deaths at the end of the pandemic and the estimated peak date by each country. This results corroborates and reinforce the analysis presented in the previous paragraph. Moreover, Table 4 presents the estimates of the parameters of the logistic dynamic of the model. As it can be seen, showing a large right skewness for the pandemic dynamics. In other words, the increase phase occurs much faster than the decrease phase as observed in many places. This is a clear effect of borrowing strength from curves in different stages. For example, for the curves from the Latin American countries, where observations are mainly on the left side of the peak, estimation of the skewness parameter is unstable due to the lack of information after the peak. Although in similar scale, we can see some variation between and , , which are important to capture the unique characteristic of each time series and provide a good fit and meaningful predictions.
|Country||Tot. Est. Death||Est. Peak Date|
This article proposes a robust modeling of COVID-19 deaths based on a NLME model, where the Gaussian distribution of the random terms are replaced by the SMSN class of distributions. We believe that the proposed methods may have a significant impact on COVID-19 research because, in the presence of skewness and heavy tails in the data, appropriate statistical inference for COVID-19 research can lead to more accurate and reliable clinical decisions, in addition, our model can be useful to guide some policies that need to be taken by the selected Latin American governments in order to overcome the COVID-19 pandemic. To the best of our knowledge, this is the first attempt in working on such general distributional structure related to COVID-19 deaths data. Our proposed method is quite general and is not limited to the analysis of the selected countries and reported deaths, thus we have made theR codes available for download from Github (https://github.com/fernandalschumacher/NLMECOVID19), which will encourage other researchers to use NLME models and the SMSN class of distributions in their studies of other characteristics of COVID-19 data.
Understanding the dynamics of the pandemic and being able to predict its future behavior is of extreme importance. While many countries and consortia try to create an effective vaccine for the disease, the best current alternative is to flatten the contamination curve to guarantee that the health systems are able to provide the necessary care for the population without being overwhelmed. With the use of countries that have controlled the pandemic number of deaths we strongly believe that our methodology can provide more reliable and meaningful prediction that allow policies makers in Latin America to make effective decisions.
Extension of the presented methodology to the number of cases can also be performed, however, because of the large subnotification in the data and due to the different testing capability among countries, this approach is challenging. A possible alternative to overcome this fact is to incorporate into modeling time changing covariates that adequately capture the subnotification dynamics of each country. Another interesting future development and work is to model the cases and deaths by COVID-19 jointly since these series are necessarily correlated. Moreover, the WHO has warned that the coronavirus may never go away and concerns are growing about a second (perhaps a third) wave of infections. Thus, a natural generalization of our method is to consider a finite mixture of SMSN-NLME models (lachos2017finite; zeller2019finite). An in-depth investigation of such extensions are beyond the scope of the present paper, but certainly an interesting topic for (near) future research.
This paper was written while Marcos O. Prates was a visiting professor in the Department of Statistics at the University of Connecticut (UConn). In addition to the support of UConn, the professor would like also to thank the CovidLP group for the discussion about the topic and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for partial financial support. Fernanda L. Schumacher acknowledges the partial support of Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brasil (CNPq).