Generalized-additive-models-to-capture-the-death-rates-in-Canada-COVID-19
None
view repo
To capture the death rates and strong weekly, biweekly and probably monthly patterns in the Canada COVID-19, we utilize the generalized additive models in the absence of direct statistically based measurement of infection rates. By examining the death rates of Canada in general and Quebec, Ontario and Alberta in particular, one can easily figured out that there are substantial overdispersion relative to the Poisson so that the negative binomial distribution is an appropriate choice for the analysis. Generalized additive models (GAMs) are one of the main modeling tools for data analysis. GAMs can efficiently combine different types of fixed, random and smooth terms in the linear predictor of a regression model to account for different types of effects. GAMs are a semi-parametric extension of the generalized linear models (GLMs), used often for the case when there is no a priori reason for choosing a particular response function such as linear, quadratic, etc. and need the data to 'speak for themselves'. GAMs do this via the smoothing functions and take each predictor variable in the model and separate it into sections delimited by 'knots', and then fit polynomial functions to each section separately, with the constraint that there are no links at the knots - second derivatives of the separate functions are equal at the knots.
READ FULL TEXT VIEW PDF
Sparse generalized additive models (GAMs) are an extension of sparse
gen...
read it
Simple smooth additive models for the observed death-with-COVID-19 serie...
read it
Sparse generalized additive models (GAMs) are an extension of sparse
gen...
read it
Regression models are popular tools in empirical sciences to infer the
i...
read it
Analyzing data from multiple neuroimaging studies has great potential in...
read it
Count data with high frequencies of zeros are found in many areas, speci...
read it
Links between frequentist and Bayesian approaches to smoothing were
high...
read it
None
In late December 2019, some severe pneumonia cases of unknown cause were reported in Wuhan, Hubei province, China. These cases epidemiologically linked to a seafood wholesale market in Wuhan, although many of the first 41 cases were later reported to have no known exposure to the market. In early January 2020, the World Health Organisation (WHO) named this novel coronavirus as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and the associated disease as COVID-19 [14]. Following this report, there has been a rapid increase in the number of cases as on 24th June 2020, there were over 9.2 million confirmed cases and almost 475,000 deaths worldwide, while the confirmed cases in Canada was 103,000, with 8,500 deaths mostly distributed in three provinces of Quebec, Ontario, and Alberta. The death rates for these cases which are available from https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html
are shown in figure 1.
From the box plot one can see that there are a few outliers in the Alberta data only, see figure 2.
Also the trend, seasonal and random components of each case can be obtained by decomposition function of time series where in figure 3 we plotted only the Canada deaths.
To see the normality of the data, we can plot the histogram as well as the density of the deaths, figure 4. The plots clearly show that there are substantial overdispersion relative to the Poisson for all cases except possibly for the Alberta data.
In general, generalized additive models (GAMs), see e.g. S. Wood, 2017, [10] are a semi-parametric extension of the generalized linear models (GLMs), used often for the case when there is no a priori reason for choosing a particular response function (such as linear, quadratic, etc.) and need the data to speak for themselves. GAMs first introduced by Hastie and Tibshirani, 1986 [7] and Hastie and Tibshirani, 1990 [8] and are widely used in practice (Ruppert et al. 2003) [6]; (Fahrmeir, L., Lang, S. 2001) [1]; (Fahrmeir, et al. 2004) [2] (Fahrmeir et al. 2013) [3]
. On the other hand, GLMs themselves are extension of the linear models (LMs). To understand the thing better, let us start with LMs. For a univariate response variable of multiple predictors one may simply write
(2.1) |
It is clear that the response variable
is normally distributed with mean
, and variance
and the linearity of the model is apparent from the equation. One of the issues with this model is that, the assumptions about the data generating process are limited. One remedy for this is to consider other types of distribution, and include a link function relating the mean , i.e.,(2.2) |
In fact the typical linear regression model is a generalized linear model with a Gaussian distribution and identity link function. To further illustrate the generalization, we may consider a distribution other than the Gaussian for example Poisson or a negative binomial distribution for a count data where the link function is natural log function.
(2.3) |
where is any exponential family distribution. We can still generalize more to add nonlinear terms in the above equation namely
where and an exponential or non-exponential family distribution and
are any univariate or multivariate functions of independent variables called smooth and nonparametric part of the model, which mean that the shape of predictor functions are fully characterized by the data as opposed to parametric terms that are defined by a set of parameters like the parameter vector
in the linear part in which both to be estimated.
Having said that, the ordinary least square problem generalizes into
(2.4) |
where , controlling the extent of penalization and establishes a trade-off between the goodness of fit and the smoothness of the model.
On the other hand, these functions are represented using appropriate intermediate rank spline basis expansions of modest rank , as
(2.5) |
Substituting the function under the integral sign with the above equation we get
(2.6) |
where the right hand side is a quadratic form with respect to the known matrix .
Collecting both the parametric and non-parametric coefficients into a double parameter , we obtain
(2.7) |
Writing we get the more compact form
(2.8) |
Let be the maximizer of and the negative Hessian of at . From a Bayesian viewpoint is a posterior mode for . The Bayesian approach views the smooth functions as intrinsic Gaussian random fields with prior given by , where is a Moore–Penrose orpseudoinverse of . Furthermore in the large sample limit
(2.9) |
Writing the density in (2.8) as , and the joint density of and as , the Laplace approximation to the marginal likelihood for the smoothing parameters and is . Finally, nested Newton iterations are used to find the values of and ; maximizing ) and the corresponding . Wood, S. N. 2016 [11]
Let denotes the deaths reported on day . We will study the Canada deaths in general and the provinces of Quebec, Ontario, and Alberta in particular cases. Due to overdispersion seen in the deaths, we assume that follows a negative binomial distribution with mean and variance
. Note that , where .
Then let
(3.1) |
where is a link function log in our discussion, is a smooth function of time, , measured in days, is a zero mean cyclic smooth function of day of the week , set up so that , is a zero mean cyclic smooth function of day of the biweek , set up so that , is a zero mean cyclic smooth function of day of the month , set up so that , and denotes the order of the derivative in , and , see S. Wood 2020 [13]. Based on the discussion and notations in section.1, , , and are basis functions representing the underlying death rate and the strong weekly, biweekly and monthly cycles seen in the data respectively.
From (2.8) we can easily compute the confidence intervals for each f and make inferences about when the peak in f occurs. This can be done by executing gam function of mgcv library in R code. The fitted models to the reported deaths in Canada, Quebec, Ontario, and Alberta are shown in figures 5, 6, 7 and 8 respectively. In the Canada case, we model the deaths with respect to day, day of week and day of month. As the results of figure 5 shows, all the variables are statistically significant with
as R-squared.For the deaths of Quebec, we see that the day, day of week is an appropriate choice for the predicted variables. As the results of figure 6 shows, all the variables are statistically significant with as R-squared.
In the case of the Ontario we see that also the day, day of week and day of month is an appropriate choice for the predicted variables. As the results of Figure 7 shows, all the variables are highly significant with as R-squared.
We figured out that the day, day of biweek as predictors for the Alberta case is an appropriate choice for the model. As the results of figure 8 shows, all the variables are statistically significant with as R-squared.
The next statistics is the results of the gam.check consists of Q-Q plot where for a good fit the data should lie on the red line. The second is the histogram of the residuals. In this plot, the histogram must be symmetric with respect to the line . Third plot is the Residuals vs. Linear predictors. This plot must be symmetric with respect to the line . Finally, the last plot is the Response vs. Fitted values. The more closer the data to the line , the better the fitted model. These plots for the corresponding data of Canada, Quebec, Ontario and Alberta are given in figures 9, 10, 11 and 12 respectively.
The other statistics of the model fits to the reported deaths in all cases are shown in figures 13, 14, 15 and 16 respectively. The posterior modes (solid) and 95% confidence intervals for the model functions as well as auto-correlation functions and the deviance residuals against day for the Quebec and Alberta.
With gam model it is also straightforward to make inferences about when the peak in f occurs. To this end, it is enough to use the model matrix by removing cyclic part, the coefficients and variance-covariance matrix of the model to estimate the model functions and confidence intervals. To find the day of occurrence of the peak for each corresponding underlying death rate function, , simply we generates multivariate normal random deviates using ’rmvn’ function from ’mgcv’ package in which it takes 3 arguments as number of simulations, the coefficients and variance-covariance matrix of the model. The results for all 4 cases are shown on the figures 17 and 18 respectively.
To obtain the model priors with auto-generated code and associated data is to simulate jagam from rjags library in GAMs. We also load glm to improve samplers for GLMs. This is useful for inference about models with complex random effects structure. The new mgcv function jagam is designed to be called in the same way that the modeling function gam would be called. That is, a model formula and family object specify the required model structure, while the required data are supplied in a data frame or list or on the search path. However, unlike gam, jagam does no model fitting. Rather it writes JAGS code to specify the model as a Bayesian graphical model for simulation with JAGS, and produces a list containing the data objects referred to in the JAGS code, suitable for passing to JAGS via the rjags function jags.model (Plummer 2016)[5]. To infer the sequence of past fatal infections one needs to produce the observed sequence of deaths. Verity et al. 2020 [9] show that the distribution of time from onset of symptoms to death for fatal cases can be modelled by a gamma density with mean 17.8 and variance 71.2 (s.d. 8.44). Let be the function describing the variation in the number of eventually fatal cases over time. Let be the lower triangular square matrix of order , describing the onset-to-death gamma density, given above where is the number of day of pandemic under consideration. Then we have if and 0 otherwise. Now , where is the expected number of deaths on day . Then can be represented using an intermediate rank spline, again with a cubic spline penalty. We can then employ exactly the model of the previous section but setting . The only difference is that we need to infer over a considerable period before the first death occurs where 15 days is clearly sufficient given the form of deaths data. After executing the jagam code for the data frame of deaths, day, day of week, day of biweek and day of month variables - note that we have used different combination of the above variables to see which one is appropriate of Canada deaths in general and Quebec, Ontario and Alberta in particular cases. Jagam code with Poisson family distribution produces a model containing all the priors. Then by adding the matrix
and a bit of extra regularization to the output we pass the model to jags.model function JAGS is Just Another Gibbs Sampler. It is a program for the analysis of Bayesian models using Markov Chain Monte Carlo (MCMC) simulation. Then by passing the result to the jags.samples function with the parameters of thin=300 and iteration=1000000 we get the values of
, , and the Monte carlo array of 3 dimension with the first dimension equals to the number of parameters in the gam model. As in the gam model, we take the data in the first component of the jags model as the model.matrix and the data in first 2 dimension of the to simulate the fatal infection profiles and get all the necessary statistics such as median with confidence intervals, the peak points of the median profile, the squared second difference of the median infection profile on the log scale which is proportional to the smoothing penalty and the absolute gradient of the infection profile. See Wood, S. N. 2016 [12] These are depicted in figure 19.Acknowledgement. I am very grateful to Simon S. Wood as his paper [13] was very inspirational to my work.
Fahrmeir, L., and Lang, S. 2001, Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors, Applied Statistics, 50, 201–220.
Comments
There are no comments yet.