Learning trends of COVID-19 using semi-supervised clustering

09/14/2021
by   Semhar Michael, et al.
South Dakota State University

A finite mixture model is used to learn trends from the currently available data on coronavirus (COVID-19). Data on the number of confirmed COVID-19 related cases and deaths for European countries and the United States (US) are explored. A semi-supervised clustering approach with positive equivalence constraints is used to incorporate country and state information into the model. The analysis of trends in the rates of cases and deaths is carried out jointly using a mixture of multivariate Gaussian non-linear regression models with a mean trend specified using a generalized logistic function. The optimal number of clusters is chosen using the Bayesian information criterion. The resulting clusters provide insight into different mitigation strategies adopted by US states and European countries. The obtained results help identify the current relative standing of individual states and show a possible future if they continue with the chosen mitigation technique.


1 Introduction

According to the World Health Organization (WHO), as of April 28, 2020, 213 countries have reported close to 3 million cases and more than 202 thousand deaths due to the novel coronavirus disease (COVID-19), a respiratory tract infection [22]. The virus was first reported in Wuhan city, Hubei province of China, on December 31, 2019. WHO declared the outbreak a public health emergency of international concern on January 30 and a pandemic on March 11, 2020. As the spread of the virus widened, different countries and territories reacted to the pandemic in various ways. The response generally starts by attempting to contain the virus by quarantining infected individuals, then moves to mitigation when containment is not possible. Without any intervention, it is estimated that there would be around 40 million deaths worldwide [19]. Governments have used many mitigation strategies, ranging from strict shelter-in-place orders (e.g., China, Italy) to advice on social distancing without other substantial restrictions (e.g., Netherlands, Sweden), with some targeting herd immunity [4, 6, 5]. Another important factor in the spread of this virus is the availability and use of diagnostic tests [23]. Countries had different timelines for the availability of diagnostic tests and implemented different criteria on when and whom to test. Hence, the numbers that are officially reported reflect this difference [23]. As a result, there can be discrepancies between the officially reported numbers and the true numbers of cases and deaths due to this disease. For the reasons listed above and others, the trends reported by countries and states have varied significantly, exhibiting heterogeneity even after some preprocessing and standardization of the data. One interesting and important question regarding this pandemic is which regions have similar trends in the spread of the disease. More specifically, we are interested in investigating how the state-level trends within the United States (US) compare with those of some European countries that encountered the virus before the US.

One of the flexible models capable of capturing heterogeneity with great interpretability is a finite mixture model [10]. Model-based clustering is a probabilistic modeling approach to finding groups of similar observations in heterogeneous data [10, 1]. Model-based clustering can be accomplished through finite mixture models with an assumption that each mixture component represents a standalone cluster. Further, a mixture of nonlinear regression models finds groups of similar observations and fits a nonlinear regression model within each group simultaneously [20, 8]. If there is an additional restriction on the memberships of observations, semi-supervised mixture modeling can be employed. In this setting, there exists partial information about the equivalence of observations. In particular, some observations must be restricted to be in the same cluster and some others can be required to be in different clusters. Using the terminology in [11], these are called positive and negative equivalence constraints, respectively. In our setting, we know ahead of time that observations (the daily cumulative counts) coming from the same region must belong to the same cluster. Hence, semi-supervised clustering with positive equivalences will be developed. In the nonlinear regression setting, the mean of the model can take different forms. In this paper, the mean will be modeled using a generalized logistic function [14], which is a flexible growth model applicable in our framework.

The rest of the paper is organized as follows. Section 2 describes the data collection and cleaning followed by the development of the proposed methodology. Section 3.1 shows the applicability of the generalized logistic function to modeling country and state-related trends. Section 3.2 discusses the results of the proposed semi-supervised clustering. Finally, a conclusion and limitations of the work are provided in Section 4.

2 Methodology

2.1 Data collection and preprocessing

In this analysis, the daily cumulative number of confirmed cases and deaths per region are used to find regions with similar trends. For country-related data, the R package coronavirus [9] is used for the daily summary of COVID-19 data. This dataset provides daily confirmed cases and deaths by country. For the US, the New York Times GitHub page [13] provides cumulative counts of confirmed cases and deaths at the state and county levels. In our analysis, we considered the fifty states in the US and the District of Columbia (we will refer to them as US states). In addition, twenty-three European countries that have had the virus for a longer period and have implemented strategies to mitigate against COVID-19 similar to the US states are considered. The countries included in our analysis are the following: Austria, Belgium, Croatia, Denmark, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Romania, Serbia, Slovenia, Spain, Sweden, Switzerland, Turkey, United Kingdom. In addition to COVID-19 related data, the population of each country was gathered from The World Bank [17] database. For the state-level data, the US Census Bureau database [18] was used to gather the estimated population of each state for the year 2019.

As a result, we have 74 regions included in our data (23 European countries and 51 US states). These regions differ in population size; therefore, the data have been standardized by adjusting for population size: for each region, the cumulative numbers of confirmed cases and deaths on each date are expressed per 100,000 of the population. In addition, the regions reported their first cases of the virus on different dates. However, the objective of this study is to identify similar trajectories in the development of the epidemic, regardless of the starting date. Therefore, for each region we computed the first date on which at least 1 case per 100,000 of the population was reported, and the regions were aligned according to this date. We will refer to this time point as the time of onset, representing an estimated time of the beginning of the epidemic for a given region. As a result, our analysis emphasizes how the observed time series trends for each region compare when aligned by this time of onset. Figure 1 shows the sequences of population-adjusted cumulative cases and deaths for each region in our data. The figure is color-coded and line-type-coded to differentiate US states and European countries. In the figure, the value of zero on the x-axis indicates the time of onset. As we can see from the plot, most of the European countries are ahead of the US states. In our data, the state of West Virginia had the shortest time of 30 days and Italy had the longest time of 55 days after the time of onset. This data preprocessing allows for comparison of the regions since they are on the same scale in both the rates of confirmed cases and deaths and the time variable. Therefore, the modeling can be done to identify clusters in this collection of 74 regions.
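The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `per_100k` and `align_by_onset`, the toy counts, and the toy population are all hypothetical.

```python
def per_100k(cumulative_counts, population):
    """Convert cumulative counts to rates per 100,000 people."""
    return [100_000 * c / population for c in cumulative_counts]


def align_by_onset(case_rates, death_rates, threshold=1.0):
    """Drop the days before the first day with at least `threshold` cases per 100,000.

    Returns (cases, deaths) re-indexed so that position 0 is the time of onset,
    or two empty lists if the threshold is never reached.
    """
    for day, rate in enumerate(case_rates):
        if rate >= threshold:
            return case_rates[day:], death_rates[day:]
    return [], []


# Hypothetical toy region: population 2,000,000 and five days of cumulative counts.
cases = per_100k([2, 10, 25, 60, 140], population=2_000_000)
deaths = per_100k([0, 0, 1, 2, 4], population=2_000_000)
aligned_cases, aligned_deaths = align_by_onset(cases, deaths)
```

In this toy example the third day is the first one at or above 1 case per 100,000, so it becomes day zero of the aligned sequence.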

Figure 1: Population adjusted cumulative counts aligned by the first date on which at least 1 case per 100,000 is reported (the time of onset).

2.2 Mixture modeling and model-based clustering

Let $y_1, \ldots, y_n$ be a sample of $n$ independent observations identically distributed according to the probability density function (pdf) given by

$$f(y \mid \Theta) = \sum_{k=1}^{K} \tau_k f_k(y \mid \theta_k). \qquad (1)$$

Here, $\Theta = (\tau_1, \ldots, \tau_{K-1}, \theta_1^\top, \ldots, \theta_K^\top)^\top$ is the full parameter vector, $\tau_k$ represent mixing proportions or weights subject to constraints $0 < \tau_k \le 1$ and $\sum_{k=1}^{K} \tau_k = 1$, and $f_k$ is the $k$th mixture component of known functional form with component-specific parameter vector $\theta_k$. $K$ is the number of components, also known as the mixture order. The estimation of parameters is usually carried out by means of the expectation-maximization (EM) algorithm [3]. The EM algorithm consists of two steps called E (expectation) and M (maximization) that are iterated until some pre-specified stopping criterion is met. The EM algorithm is known for its convenience in handling so-called missing information. In the mixture modeling framework, one assumes that the group labels are missing. Then, the complete-data likelihood can be constructed based on the full data, that is, the observations together with their unobserved labels. At the E step, the conditional expected value of the complete-data log likelihood function given the observed data (traditionally denoted as the $Q$ function) is calculated. At the M step, the $Q$ function is optimized with respect to the parameter vector $\Theta$. Upon convergence, the EM algorithm yields the maximum likelihood estimate $\hat{\Theta}$. If the mixture order is unknown and has to be estimated, the most traditional approach is to employ an information criterion, among which the Bayesian information criterion (BIC) [15] is the most popular in the mixture modeling context. BIC is calculated for different numbers of mixture components and the model producing the lowest value is declared the winner.
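To make the model selection step concrete, the sketch below computes BIC in its common form, minus twice the maximized log likelihood plus the number of free parameters times the log of the sample size, and picks the order with the lowest value. The fitted log likelihoods, the parameter counts, and the sample size are made-up numbers, and what is counted as the sample size is an assumption of this illustration.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """BIC in the 'lower is better' form: -2 log L + M log n."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


def select_order(fits, n_observations):
    """Pick the mixture order with the lowest BIC.

    `fits` maps a candidate order K to a pair
    (maximized log likelihood, number of free parameters).
    """
    scores = {k: bic(ll, m, n_observations) for k, (ll, m) in fits.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fits for K = 1, 2, 3 (values invented for illustration).
fits = {1: (-520.0, 5), 2: (-480.0, 11), 3: (-476.0, 17)}
best_k, scores = select_order(fits, n_observations=200)
```

Here the three-component fit has the highest log likelihood, but its extra parameters are penalized enough that the two-component model is selected.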

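A popular variant of finite mixtures is the mixture of regression models. In this setting, the Gaussian pdf is usually employed for each component and the mixture is specified as

$$f(y \mid x; \Theta) = \sum_{k=1}^{K} \tau_k \, \phi\big(y;\, \mu_k(x; \beta_k),\, \sigma_k^2\big), \qquad (2)$$

where $\tau_k$ are the mixing proportions, $\phi$ denotes the Gaussian pdf, $\sigma_k^2$ is the variance parameter, and $\mu_k(x; \beta_k)$ is the mean of the $k$th component represented by the regression function with regression-specific parameters $\beta_k$ and $x$ being the vector of explanatory variables.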

The most famous application of finite mixture modeling is model-based clustering, which relies on the existence of a one-to-one relationship between mixture components and data groups. This relationship provides a highly intuitive interpretation of clustering results, as data groups can be seen as samples from heterogeneous subpopulations constituting the superpopulation. The model-based clustering result is obtained based on the estimated posterior probabilities $\hat{\pi}_{ik}$ that observation $i$ belongs to component $k$, produced at the last E step of the EM algorithm. According to the Bayes decision rule, estimated membership labels are found as follows: $\hat{z}_i = \arg\max_{k} \hat{\pi}_{ik}$.

2.3 Generalized logistic function

Figure 2: Generalized logistic function.

Consider a nonlinear regression setting with denoting a sample of independent observations

distributed according to normal distribution

with variance and mean . The functional form of the mean is chosen to be a four-parameter generalized logistic function defined as for with and . It is straightforward to notice that and are two horizontal asymptotes representing the lower and upper bounds of the function. Parameter is known as the carrying capacity, can be interpreted as a shift, and is the growth rate parameter. It can be shown that the inflection point is observed at , where and . If , the inflection point is located at . Hence, parameter is responsible for the shift of the inflection point toward the lower (if ) or upper (if ) bound. The considered form of the generalized logistic function illustrated in Figure 2 presents a flexible tool for modeling the cumulative number of observed disease cases or deaths. In our framework, we model the total number of cases and overall number of deaths jointly. This leads to the multivariate nonlinear regression setting with bivariate response , where the first and second vector coordinates represent the cumulative numbers of reported disease cases and deaths, respectively. Then, the sequence of independent data points observed at times is given by -dimensional matrix with , where is the mean vector and is the covariance matrix. Vectors and represent variable specific parameters of the generalized logistic function and .
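For intuition about the roles of the four parameters, the following sketch evaluates a generalized logistic curve. It assumes a Richards-type parameterization, a / (1 + d exp(-c (t - b)))^(1/d), chosen here only because it has a lower asymptote of zero, an upper asymptote a, and an inflection value of a/2 when d = 1; the form and the parameter names a, b, c, d are assumptions of this illustration and may differ from the parameterization used in the paper.

```python
import math


def generalized_logistic(t, a, b, c, d):
    """Richards-type growth curve (assumed form, for illustration only).

    a: upper asymptote (carrying capacity)
    b: shift along the time axis
    c: growth rate
    d: shape parameter controlling where the inflection point sits
    """
    return a / (1.0 + d * math.exp(-c * (t - b))) ** (1.0 / d)


def inflection_value(a, d):
    """Height of the curve at its inflection point t = b under the assumed form."""
    return a * (1.0 + d) ** (-1.0 / d)


# Hypothetical parameter values; d = 1 gives the ordinary symmetric logistic curve.
curve = [generalized_logistic(t, a=500.0, b=20.0, c=0.25, d=1.0) for t in range(0, 61, 10)]
```

Under this assumed form, values of d below one pull the inflection point toward the lower bound and values above one push it toward the upper bound.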

2.4 Semi-supervised clustering of regression mixture models

Let each of countries and states considered in this paper be represented by the observed response matrix of dimensions , where and is the length of the bivariate sequence. In other words, we observe with . The objective of this paper is to reveal common trends among heterogeneous data matrices . However, the direct application of the mixture model

where the covariate represents time and each component density is the bivariate Gaussian pdf, is not possible in our context, as data points observed at different times within the same country-specific sequence cannot be assigned to different components. In other words, there should be a condition that prevents assigning two observations from the same sequence to different clusters. To address this undesired feature of the model, one can employ so-called semi-supervised clustering with positive equivalence constraints. Such constraints assume that some observations in the data set are known to belong to the same group and hence must be treated jointly. In this framework, it is easier to work with blocks of observations tied by positive constraints. In our context, this implies that each country or state represents a separate block with all elements required to belong to the same cluster. Then, the clustering procedure needs to be applied to blocks. The EM algorithm corresponding to the semi-supervised clustering setting needs to take into consideration the fact that membership labels of observations belonging to the same block must be equal. For more details on positive and negative equivalence constraints, we refer the reader to [11]. It can be shown that the $Q$-function associated with our proposed model is given by

(3)

where two dots on the top of posterior probabilities

represent the estimates at the current iteration of the EM algorithm, while one dot on the top of refers to the previous iteration. It can be shown that the E step reduces to updating posterior probabilities by the expression

It is worth mentioning that posterior probabilities refer to the chances of the entire block to belong to the component. The M step consists of the following expressions for updating parameters:

Parameters associated with the generalized logistic functions can be updated by the numerical maximization of the function

with respect to for . This concludes the steps of the EM algorithm.
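The block-level E step can be illustrated with a short sketch. It relies on the usual consequence of positive constraints with independent observations: the posterior probability of a block is proportional to the mixing proportion times the product of the component densities over all observations in the block. The helper names, the linear toy mean curves, and the covariance values are hypothetical, and the M-step updates for the mean curve and covariance parameters are omitted.

```python
import math


def log_bivariate_normal(y, mean, cov):
    """Log density of a bivariate Gaussian with a 2x2 covariance matrix."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = y[0] - mean[0], y[1] - mean[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def block_posteriors(block, weights, mean_curves, covs):
    """Posterior probability of each component for one whole block.

    `block` is a list of (cases, deaths) pairs observed at t = 0, 1, 2, ...
    All observations in the block share a single membership label.
    """
    log_terms = []
    for tau, curve, cov in zip(weights, mean_curves, covs):
        loglik = sum(log_bivariate_normal(y, curve(t), cov) for t, y in enumerate(block))
        log_terms.append(math.log(tau) + loglik)
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(v - top) for v in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def update_weights(posteriors):
    """Mixing proportions as the average block posterior per component."""
    n_blocks = len(posteriors)
    n_components = len(posteriors[0])
    return [sum(p[k] for p in posteriors) / n_blocks for k in range(n_components)]


# Hypothetical toy setup: two components with linear mean curves (cases, deaths).
mean_curves = [lambda t: (10.0 * t, 1.0 * t), lambda t: (50.0 * t, 5.0 * t)]
covs = [((4.0, 0.0), (0.0, 1.0)), ((4.0, 0.0), (0.0, 1.0))]
weights = [0.5, 0.5]
blocks = [
    [(10.0 * t + 0.5, t + 0.1) for t in range(5)],
    [(50.0 * t - 0.5, 5.0 * t - 0.1) for t in range(5)],
]
posteriors = [block_posteriors(b, weights, mean_curves, covs) for b in blocks]
labels = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
new_weights = update_weights(posteriors)
```

Working with sums of log densities rather than products of densities avoids numerical underflow for long sequences, which is why the posterior is normalized after subtracting the largest log term.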

2.5 Computation aspects

In our analysis, the stopping criterion is based on the relative difference in log likelihood values from two consecutive iterations. In addition, the initialization of the EM algorithm is crucial in obtaining a reasonable solution. We used a multiple random starts approach for each candidate number of components, and the best solution was chosen based on BIC. The EM algorithm is stopped in one of two scenarios: 1) the relative difference in log likelihood values from two consecutive iterations falls below a pre-specified tolerance level, or 2) one or more spurious solutions are obtained. In the latter case, the solution is not retained. A Linux cluster, "Roaring Thunder", at South Dakota State University is used to run the jobs for the random starts at each number of components in parallel. Looking forward, as the data on the disease are still being updated, the model can also be updated by initializing the EM algorithm using the best parameter estimates obtained from the previous version of the data.
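A relative-difference stopping rule of the kind described above might look as follows. The tolerance value of 1e-6 and the log likelihood trace are hypothetical and are not taken from the paper.

```python
def relative_change(prev_loglik, curr_loglik):
    """Relative difference between two consecutive log likelihood values."""
    return abs(curr_loglik - prev_loglik) / max(abs(prev_loglik), 1e-300)


def first_converged_iteration(logliks, tol=1e-6):
    """Index of the first iteration whose relative change falls below `tol`.

    Returns None if the trace never satisfies the criterion.
    """
    for i in range(1, len(logliks)):
        if relative_change(logliks[i - 1], logliks[i]) < tol:
            return i
    return None


# Hypothetical log likelihood trace from one EM run.
trace = [-1500.0, -1200.0, -1150.0, -1149.9, -1149.8999]
stop_at = first_converged_iteration(trace)
```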

3 Results

3.1 The utility of the generalized logistic function

In this section, we discuss the utility of the generalized logistic function in the framework of the considered problem. Figure 3 shows plots constructed for cumulative numbers of disease cases (left plot) and deaths (right plot) for Switzerland, Netherlands, Illinois, and Massachusetts. The two countries are just two examples of European countries included in our study. Netherlands and Switzerland are not the first countries that have been hit with the virus and not the last in Europe either. The two US states have been chosen based on similar reasoning.

Figure 3: Generalized logistic function fitted to four regions.

Each plot in Figure 3 presents inflection points, marked by crossed dotted lines, and horizontal asymptotes (dashed lines) showing the expected total number of disease cases and deaths. It can be noticed that the generalized logistic function fits the data very well in all cases. An important observation can be made about the onset time of the epidemic. Both European countries encountered a steady growth in the number of cases and deaths around March 14, while both states faced the spread of the disease roughly two weeks later in March. This motivates our paper, as the analysis of European trends can be used for selecting an optimal public health policy for US states. Indeed, trends observed for different countries and states are not identical. For example, Switzerland is considerably closer to the horizontal asymptote than the Netherlands and the two states. The fitted curves can be particularly helpful for making decisions with regard to when quarantine measures can be relaxed. For example, Switzerland is expected to overcome the outbreak by the middle of May, while the Netherlands would require roughly an extra month to handle the disease. As expected, Illinois and Massachusetts would need even more time to approach the upper bound. It is important to mention that the fitted curves are based on current public health policies. If a country or state changes its response to the disease, for instance, by terminating the stay-at-home order, a different data trend will be observed and the generalized logistic function is likely to produce an unsatisfactory fit. For example, we observed such a trend for China after the quarantine was relaxed and new disease cases were primarily observed among visitors arriving from outside the country.

3.2 Clustering results

Figure 4: Model selection using BIC.

The proposed clustering model was fitted to the data (74 total observations: 23 European countries and 51 US states) while varying the number of clusters. The BIC values for the candidate models can be seen in Figure 4. The optimal model for the considered dataset based on BIC is the six-component mixture, suggesting there are 6 distinct groups (Groups 1-6) in our dataset. The parameter estimates of the best model fit are reported in Table 1. The six groups are ordered in the table according to the expected upper bounds on the rate of cases in each group.

According to the best model, the upper bound of the number of cases ranges from 232 (in Group 1) to 7,497 (in Group 6) cases per 100,000. On the other hand, the number of deaths ranges from 9 (in Group 1) to 822 (in Group 6) per 100,000. Figure 5 shows a color-coded assignment to the six clusters and the corresponding cluster mean trend in each cluster. Individual cluster plots including the members of the cluster are given in Figure 6. The first column in this plot represents population-adjusted cases and the second column shows population-adjusted deaths. The legend of the plot lists the countries and US states that have been assigned to the group according to the highest posterior probability. The regions in the legend are listed in descending order of the cumulative rates.

Group  Mixing  Upper bound  Upper bound
       prop.   (cases)      (deaths)
1      0.189    232.003       9.206      3.809  4.983  12.197  12.930   8.185   9.747    8.814  0.203   0.613
2      0.541    554.105      29.341      3.640  4.567  12.263  12.866   9.396  10.925   23.421  1.048   0.756
3      0.095    817.711     140.018      3.069  3.998  12.475  12.714  11.007  11.746   44.120  5.054   0.925
4      0.095   2193.871     199.540      4.103  6.403  12.086  12.639  10.387  11.682   46.557  1.903   0.809
5      0.014   3217.663      10.916      6.510  1.887  12.858   8.364  12.819  12.215    8.087  0.094  -0.036
6      0.068   7496.741     822.326      5.363  7.058  11.510  12.250   8.298  10.041  143.919  5.828   0.879
Table 1: Parameter estimates
Figure 5: The six groups detected by the best model-based clustering solution.

Description of clusters

Group 1 (black cluster): This group has 4 European countries (Finland, Croatia, Poland, Greece) and 10 US states (Nebraska, North Dakota, Arkansas, Texas, North Carolina, West Virginia, Oregon, Alaska, Hawaii, Montana), listed in descending order of their current infection rate. The group has the lowest rate of confirmed cases and deaths compared to the other groups. In addition, the regions in this group have relatively shorter sequences, with an average of 36 days since the time of onset (zero on the x-axis). Overall, the group appears to include Eastern European countries and US states that faced the disease at a later time. Besides, several of the US states within this group are characterized by low population density. The regions in this group should aim at maintaining this trend.

Group 2 (red cluster): This group contains the largest number of regions, with 10 European countries and 30 US states (see the list in Figure 6). The group has the second smallest rate of infection with a modest death rate. Relative to Group 1, this group has been in the epidemic for a slightly longer period, with an average of 38 days since the time of onset. Several observations can be made about individual regions within the group. Illinois and Maryland are the top two in terms of the population-adjusted case and death rates in this group, while Utah has the lowest death rate even with a mid-level rate of cases. Among the European countries in the group, Portugal has the highest rate of cases and deaths, while Slovenia has a low rate of cases with a mid-level rate of death. In general, this group represents trends observed in European countries handling the pandemic the best (Germany, Austria, Norway, Denmark) and some Eastern European countries that faced the pandemic later (Serbia, Hungary, Romania, and Turkey). It is of interest that the group includes relatively densely populated European countries such as Germany, Denmark, and Austria and sparsely populated states such as Wyoming, Utah, and Idaho. This could be due to the effective mitigating strategies adopted by the former. These European countries and US states such as California and Washington deserve mention as exemplary: they have large populations with high density and have had the pandemic for long periods, yet have ended up in a group with the second-best result.

Group 3 (green cluster): Group 3 is composed of 7 European countries and no US states. The countries in this group, ordered by the rate of cases, are Spain, Belgium, Italy, France, the United Kingdom (U.K.), Netherlands, and Sweden. This group contains countries in Europe that encountered the disease earlier than most (with an average of 46 days after the time of onset). They either are implementing limited restrictions (Netherlands, Sweden) or have implemented restrictions with some delay (Spain, Italy, and the United Kingdom). The group is characterized by high COVID-related mortality relative to the corresponding rate of detected cases. Belgium and Spain are the top two countries in both rates of cases and deaths. It is interesting to notice that even though Spain has had a higher rate of confirmed cases for most of the duration, Belgium has shown much faster growth in terms of death rate after about 35 days since the time of onset. However, according to the official Belgian government site [7], the country includes both confirmed and suspected deaths due to COVID-19 in the officially reported counts, as opposed to other countries, which often report only confirmed cases. The US is currently considering reporting both confirmed and suspected cases separately for each US state [13].

Group 4 (blue cluster): This group has two European countries (Ireland and Switzerland) and five US states, namely Massachusetts, Rhode Island, District of Columbia, Delaware, and Pennsylvania. Even though Group 4 has on average a higher rate of confirmed cases, the average death rate is lower than that of Group 3. An average of 41 days since the time of onset makes this group the second-longest since encountering the epidemic. In general, this group represents regions that are highly affected by COVID-19. Despite the very high disease rate, their mortality rate seems to be under control compared with that of countries in Group 3.

Group 5 (cyan cluster): This group contains only one region, the state of South Dakota (SD). This state has the unique characteristic of having the second-highest rate of cases, with an expected population-adjusted upper bound of 3217, and the lowest rate of deaths, with a population-adjusted expected upper bound of 11. The reason for this specific trend might be that a high number of cases is concentrated at Smithfield Foods, a pork processing plant in Minnehaha County, making the affected population slightly younger than the at-risk group. Currently, the US age distribution shows that around 80% of deaths are observed for individuals above the age of 65 [2]. On the other hand, 80% of cases in South Dakota are among individuals younger than 60 [16]. An additional reason for the low death rate might be the time lag from severe symptoms to death, which is estimated to be close to two weeks [21].

Group 6 (magenta cluster): The last group had the largest number of cases and deaths. This group contains five US states namely New York, New Jersey, Connecticut, Louisiana, and Michigan arranged in the descending order of infection rates. This group is the most severely affected with no European country encountering the disease to such an extent. The group averaged 39 days since the time of onset with New York being the longest at 43 days. Both infection and mortality rates are higher than any other group with an expected upper bound of 7497 cases and 822 deaths per 100,000.

Figure 6: Individual cluster memberships for population adjusted cumulative cases and deaths (Groups 1 - 3).
Figure 7: (cont.) Individual cluster memberships for population adjusted cumulative cases and deaths (Groups 4 - 6).

4 Conclusion

A semi-supervised clustering model is used for identifying similar trends in the number of cases and deaths for 23 European countries and 51 US states. The mixture of non-linear regression models based on generalized logistic function proved to be flexible and identified regions with similar trends of infection and death rates. This model provides insights into how different regions reacted to this global crisis, which of them have successfully mitigated against its unmanageable spread, and which of them are in a worse state. The dataset contained most European countries that were exposed to the virus earlier than the majority of US states. As a result, we can learn how each group’s trends differ and use this analysis to gain insight.

Although members of the same group show similar trends, the regions exhibit broad variation in demography, geography, and social behaviour, factors that are known to influence the transmission rate [12]. Their inclusion in a group appears to be due to a combination of mitigation strategies, time of onset, delay in mitigation, and demography. A positive case in point is the state of California which, despite its large population, relatively high density, and earlier time of onset, is assigned to one of the groups with good trends. The effect of demography is made clear in the case of South Dakota, where the death rate is very low owing to the majority of its cases being younger than the national average. On the other hand, the opposite effect of demography is also apparent in Group 3, where a higher mortality rate relative to the infection rate is observed. These countries are characterized by an aging population, with 18 - 22% older than 65 years [17]. The trend in the most impacted group, which comprises New York, New Jersey, and others, may be due to a combination of delayed mitigation, high population density, and socio-cultural circumstances (as in the case of Louisiana's spike in cases due to the annual 'Mardi Gras' gathering [12]). In addition, a recent report has identified that the virus arrived in New York weeks ahead of the first case from Europe. It appears that latent infection may have given the virus a foothold by the time the first case was officially announced.

Overall, each cluster exhibits unique characteristics that can, in most cases, be explained by the composition of its members. One of the reasons for this grouping was joint modeling of the death and case rates. The nonlinear regression models within each cluster can be extrapolated to give a precise estimate of where a given region might be in a few days or weeks if they continue with the same strategy and as a result the same trajectory. In particular, we hope this analysis may help provide an insight to make a data-driven decision. Generally, the groupings can help regions to make strategic decisions on possible changes in mitigation and reopening or relaxing of restrictions. The model can be implemented to look into behaviors of other parts of the world to help develop public health policies by learning from data and experience.

The main limitations of this analysis are the reliability of the reported cases and deaths and inconsistencies in testing practices. Therefore, some of the trends, and hence the groupings, might be an artifact of these issues. Additionally, to bring the regions to the same scale, the time of onset was chosen to be the date on which a region reaches 1 case per 100,000. This was a reasonable but somewhat arbitrary choice, and results might change slightly depending on the choice of this date. Finally, the factors that may be related to high or low infection and death rates are not extensively explored. Further studies are needed to assess different factors, including those that are inherent (e.g., geographic location, population density, or demographic structure) and mitigation strategies that can ameliorate these factors.

5 Data Availability Statement

The COVID-19 related data used in this paper are publicly available at https://cran.r-project.org/web/packages/coronavirus/index.html and The New York Times GitHub page https://github.com/nytimes/covid-19-data. The population data are available at https://www.census.gov/programs-surveys/popest/data/data-sets.html and https://data.worldbank.org/indicator/sp.pop.totl

References

  • Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 803–821.
  • CDC. (2020). Provisional death counts for coronavirus disease (COVID-19). https://cdc.gov/nchs/nvss/vsrr/COVID19/. Last accessed on 2020-04-28.
  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
  • Dogru, F. Z., Bulut, Y. M., & Arslan, O. (2016). Finite mixtures of matrix variate t distributions. Journal of Science, 25(2), 335–341.
  • Ferguson, N., Laydon, D., Nedjati Gilani, G., Imai, N., Ainslie, K., Baguelin, M., et al. (2020). Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand.
  • Gardner, J. M., Willem, L., van der Wijngaart, W., Kamerlin, S. C. L., Brusselaers, N., & Kasson, P. (2020). Intervention strategies against COVID-19 and their estimated impact on Swedish healthcare capacity. medRxiv.
  • Government of Belgium. (2020). Analysis on the excess mortality due to Covid-19. https://www.info-coronavirus.be/en/news/graph-excess-mortality/. Last accessed on 2020-04-28.
  • Grün, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51(11), 5247–5252.
  • Krispin, R. (2020). coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV) Dataset. R package version 0.1.0. https://CRAN.R-project.org/package=coronavirus
  • McLachlan, G. J., & Peel, D. (2000). Finite Mixture Models. New York: John Wiley and Sons, Inc.
  • Melnykov, V., Melnykov, I., & Michael, S. (2016). Semi-supervised model-based clustering with positive and negative constraints. Advances in Data Analysis and Classification, 10(3), 327–349.
  • MMWR. (2020). Geographic differences in COVID-19 cases, deaths, and incidence–United States, February 12–April 7, 2020 (Vol. 69).
  • New York Times. (2020). Coronavirus (Covid-19) Data in the United States. https://github.com/nytimes/covid-19-data. GitHub.
  • Richards, F. (1959). A flexible growth function for empirical use. Journal of Experimental Botany, 10(2), 290–301.
  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
  • South Dakota Department of Health. (2020). Covid-19 in South Dakota. https://doh.sd.gov/news/coronavirus.aspx. Last accessed on 2020-04-28.
  • The World Bank. (2020). Population total. https://data.worldbank.org/indicator/sp.pop.totl.
  • United States Census Bureau. (2020). Population and Housing Unit Estimates Datasets. https://www.census.gov/programs-surveys/popest/data/data-sets.html.
  • Walker, P., Whittaker, C., Watson, O., Baguelin, M., Ainslie, K., Bhatia, S., et al. (2020). Report 12: The global impact of COVID-19 and strategies for mitigation and suppression. https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/report-12-global-impact-covid-19/
  • Wedel, M., & DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. Journal of Classification, 12(1), 21–55.
  • Wilson, N., Kvalsvig, A., Barnard, L. T., & Baker, M. G. (2020). Case-Fatality Risk Estimates for COVID-19 Calculated by Using a Lag Time for Fatality. Emerging Infectious Diseases, 26(6).
  • World Health Organization. (2020a). Coronavirus disease 2019 (COVID-19) Pandemic. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
  • World Health Organization. (2020b). Coronavirus disease 2019 (COVID-19): situation report, 85: Laboratory testing strategy recommendations for COVID-19: interim guidance, 22 March 2020. WHO/COVID-19/lab_testing/2020.1