Simple models for COVID-19 death and fatal infection profiles

by   Simon N Wood, et al.

Simple smooth additive models for the observed death-with-COVID-19 series adequately capture the underlying death rate and strong weekly pattern in the data. Clear inference about peak timing is then possible. Further, inference about the earlier infection rate dynamics driving the death rate dynamics can be treated as a simple Bayesian inverse problem, which can be readily solved by imposing a smoothness assumption on the infection rate. This straightforward semi-parametric approach is substantially better founded than the running mean smoothers which generally form the basis for public debate. In the absence of direct statistically based measurement of infection rates, it also offers a usefully assumption-light approach to data analysis, for comparison with the results of the more assumption-rich process simulation models used to inform policy. An interesting result of the analysis is that it suggests that the number of new daily infections in the UK peaked some days before lock down was implemented, although it does not completely rule out a slightly later peak.



There are no comments yet.


page 1

page 2

page 3

page 4


Machine Learning the Phenomenology of COVID-19 From Early Infection Dynamics

We present a data-driven machine learning analysis of COVID-19 from its ...

Generalized additive models to capture the death rates in Canada COVID-19

To capture the death rates and strong weekly, biweekly and probably mont...

Exo-SIR: An Epidemiological Model to Analyze the Impact of Exogenous Infection of COVID-19 in India

Epidemiological models are the mathematical models that capture the dyna...

Semi-Parametric Estimation of Incubation and Generation Times by Means of Laguerre Polynomials

In epidemics many interesting quantities, like the reproduction number, ...

Total Effect Analysis of Vaccination on Household Transmission in the Office for National Statistics COVID-19 Infection Survey

We investigate the distribution of numbers of secondary cases in househo...

Bayesian Modeling of COVID-19 Positivity Rate – the Indiana experience

In this short technical report we model, within the Bayesian framework, ...

Network Analysis of SFU Course Registrations

We investigate the dynamics of disease infection via shared classes at S...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

At time of writing, it is five months into the COVID-19 pandemic. Despite millions of tests having been performed, there are still no results from statistically well founded sampling based testing programmes to establish basic epidemic quantities such as infection fatality rate and infection rates. In the absence of such direct data, epidemic management has to proceed on the basis of data produced largely as a side effect of the clinical response to the disease.

The purpose of this note is to point out that a very simple smooth additive model is able to satisfactorily represent country specific death rate data series, such as those shown in figure 1 and collated at Such a model allows resolution of inferential questions about the timing of the peak in the underlying death rate. Augmenting this model with information from Verity et al. (2020)

on the distribution of times from symptom onset to death, a simple extension to the model also allows inference about the infection rate dynamics required to generate the deaths series. The advantage of such simple semi-parametric models is that, in contrast to detailed epidemic models, they make only modest assumptions about the data. They can therefore offer relatively clear insight into what the data are telling us, without the influence of strong modelling assumptions, thereby providing a useful complement to more heavily model based analyses.

2 A smooth additive model for the death rate series

Let denote the deaths reported on day . Assume that

follows a negative binomial distribution with mean

and variance

. Then let

where is a smooth function of time, , measured in days, and is a zero mean cyclic smooth function of day of the week, , set up so that , where or 2 denotes order of derivative. is a known link function, either log or identity in this case.

and are represented using appropriate intermediate rank spline basis expansions, with associated cubic spline penalties (see e.g. Wood, 2017). represents the underlying death rate, while accounts for the strong weekly cycle seen in the data, presumably as a result of reporting artefacts and the well known weekly signal in hospital admissions. The model treats the weekly variability issue descriptively rather than mechanistically.

Figure 2: Results of fitting the model of section 2 to the data in figure 1. a and b

are posterior modes (solid) and 95% credible intervals for the model functions for the UK.

c and d are the equivalent for Sweden. e is the ACF of the deviance residuals for the UK, lag 1 is just significantly negative. f plots deviance residuals against day for the UK. g and h are the equivalent checking plots for Sweden.

Model inference uses the empirical Bayes approach of Wood et al. (2016)

in which the smooth functions are estimated by penalized likelihood maximisation

(e.g. Green and Silverman, 1994), with the smoothing parameters and estimated by Laplace approximate marginal likelihood maximization. Writing

for the combined vector of basis coefficients for

and , the penalized log likelihood can be written

where : and are known constant positive semi-definite matrices. Smoothing parameters, and , control the smoothness of and . Let be the maximizer of the penalized log likelihood, and its negative Hessian at . Viewing the penalty as being induced by an improper Gaussian prior, , is also the MAP estimate of . Furthermore in the large sample limit


Writing the density in (1) as , and the joint density of and as , the Laplace approximation to the marginal likelihood for the smoothing parameters and is . Nested Newton iterations are used to find the values of maximizing and the corresponding (for details see Wood et al., 2016).

Given (1) credible intervals for are readily computed, but it is also straightforward to make inferences about when the peak in occurs. Simply simulate replicate coefficient vectors from (1) and find the day of occurrence of the peak for each corresponding underlying death rate function, . This approach requires a few lines of R code to implement, which are provided in the supplementary material.

The model fits to the reported deaths in the UK and Sweden are shown in figure 2 along with basic model checking plots. Figure 3

shows the posterior mode of the underlying death rates against day, with credible intervals, as well as illustrating the posterior distribution of the day of peak underlying death rate. The UK peak is around day 28, something which could have been confidently declared around day 40 with this modelling approach. For Sweden the peak is later and by the end of the plotted data it is not yet quite possible to rule out that the peak has not yet occurred, although the probability of this is


Figure 3: Underlying death rates for the UK (left) and Sweden (right) with 95% credible intervals, excluding the weekly cycle. Day 0 is March 13th 2020. The scaled grey bar chart illustrates the distribution of the location of the peak of the death rate curve, obtained by simulation from the approximate posterior (1) for each model. The scales are chosen to be the same as figure 1, and the figures are therefore comparable on the deaths per million scale.

3 Inferring the past fatal infection profile

It is also of interest to infer the sequence of past (ultimately fatal) infections required to produce the observed sequence of deaths. This is not simply a matter of lagging the death sequence by the average infection to death time, because the distribution of time from symptom onset to death is broad and asymmetric. Verity et al. (2020) show that the distribution of time from onset of symptoms to death for fatal cases can be modelled by a gamma density with mean 17.8 and variance 71.2 (s.d. 8.44). Note that this mean is longer than that often quoted in clinical papers and the media, because it corrects for the bias associated with seeing short infections before long ones at the start of an epidemic.

Let be the function describing the variation in the number of eventually fatal cases over time. Let be the square matrix such that if and 0 otherwise, where denotes the onset-to-death gamma density, given above. If and then , where is the expected number of deaths on day . Now can be represented using an intermediate rank spline, again with a cubic spline penalty. We can then employ exactly the model of the previous section but setting . The only difference is that we need to infer over a considerable period before the first death occurs. 40 days is clearly sufficient given the form of . In fact it makes sense to reduce this interval, after inspecting a pilot run, to avoid a lengthy initial period of zero fatal cases, consequent lack of identifiability of and poor MCMC mixing. For the UK data a 20 day initial period is more than sufficient. For stable inference it also makes sense to explicitly include in the death data the fact that no deaths were observed in this initial period. Note that is rank deficient, so inferring can be viewed as an inverse problem, only soluble because of the smoothness assumption on .

Figure 4: Left: The inferred UK fatal infection profile over time (day 0 is March 13th 2020). The black curve is the posterior median, the dashed curves delimit an 80% credible interval and the dotted curves a 95% credible interval. The scaled bar chart shows the posterior distribution of the day of peak fatal infection. The full height vertical grey line shows the first day of UK lock down. The dashed grey curve is proportional to the squared second difference of the median infection profile on the log scale, which is proportional to the smoothing penalty. The grey curve is proportional to the absolute gradient of the infection profile. Right: Consistency checking. In grey are 100 replicate death profiles corresponding to the median infection profile shown left. Overlaid is the underlying death rate inferred in section 2.

In principle this model could be estimated using the framework of Wood et al. (2016), but some non-trivial work would be required to set it up. In the current context, in which results are needed rapidly, it is sensible to take the approach requiring the least amount of bespoke code. The most straightforward implementation is to use Gibbs sampling via JAGS (Plummer, 2003; Plummer et al., 2006). Implementation is particularly straightforward if the automatic code template generation described in Wood (2016) is used, as implemented in mgcv function jagam. The JAGS code template can be generated for the model given in the previous section. A few lines of additional code are then sufficient to implement the modification described here. The only disadvantage of the approach is that stochastic simulation is slow. While the previous section’s model took less than 1/20th of a second to compute, a couple of hours of computing was required for a mean effective sample size of 3.5 thousand for the coefficients of (and even then the slowest mixing coefficient had an effective sample size of only 250).

Figure 4 (left) shows the posterior median estimate of the fatal infection rate against day for the UK. Also shown are 80% and 95% credible intervals, and a scaled bar chart summarizing the posterior distribution of the day of peak fatal infection. The day of UK lock down is shown as a vertical grey line. The alignment with the time axis requires an allowance for the incubation period of the disease, i.e. the time from infection to onset of symptoms. This period is 2 to 11 days for 95% of people, with a median of 5 days (Lauer et al., 2020). In the absence of more detailed information about the incubation distribution I simply shifted the inferred curve back 5 days.

Figure 4 (right) shows a model checking comparison with the death rate models of section 2. 100 death rate profiles were simulated by taking the median number of fatal infections each day and, for each infection, randomly generating the corresponding death day from the onset-to-death gamma model. These profiles are shown in grey. Overlaid is the death rate estimate from the section 2 model. Clearly the models are consistent with each other, but at its end the median infection curve may be over-optimistic about how far the infection rate has declined.

Given that the peak is estimated to occur before lock down, it is worth considering whether this could be an artefact resulting from a smooth model trying to accommodate a sudden drop in infections at lock down. There is obviously a question about how sudden this drop would be, given that lock down initially substantially increases contact with household members at the expense of everyone else, and in addition there had already been substantial social distancing before lock down. However sticking with checking the model itself, the inferred curve has strange properties if it contains artefacts from trying to match a very steep drop. The most notable is that the magnitude of the slope of the curve is actually lower on the first day of lock down, after the peak, than before the peak. The second is that the log scale squared second differences, which measure the contribution to the smoothing penalty, are low around lock down compared to earlier: again there is no indication of the model trying to smooth through a step.

A peak before lock down is consistent with the absence of a catastrophe in the data from Sweden, which did not lock down, and is not entirely implausible, given the level of social distancing already in place before UK lock down. Alternatively, it could be that the onset to death distribution from Verity et al. (2020) gives too long durations for the UK context. For example, this could happen if age strongly effects time from onset-to-death and the age structure of the UK cases differs substantially from the age structure of the cases used by Verity et al. (2020). Similarly within hospital transmission to already very ill patients might lead to shorter disease duration, and in turn to the peak being estimated as earlier than it was, but the cross infection rate would have to be quite high for this to be a substantial factor.

The very wide confidence interval at the end of the data period relates to the rank deficiency of the

matrix and the mild smoothing assumptions on used to deal with this. Increase in at the end of the time period is biologically implausible, given earlier decline and the measures still in place at the end of the period. This implausibility is not built into the model, in keeping with the desire for the analysis to be as data driven as possible. However, the wide interval does emphasise the desirability of disease surveillance via a statistically sound randomized testing program in order to reduce uncertainty. Such testing would be inexpensive in the context of the £5.5Billion per day UK economy being put on hold for weeks.


  • Green and Silverman (1994) Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models. Chapman & Hall.
  • Lauer et al. (2020) Lauer, S. A., K. H. Grantz, Q. Bi, F. K. Jones, Q. Zheng, H. R. Meredith, A. S. Azman, N. G. Reich, and J. Lessler (2020). The incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine.
  • Plummer (2003) Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003)., pp. 20–22.
  • Plummer et al. (2006) Plummer, M., N. Best, K. Cowles, and K. Vines (2006). coda: Convergence diagnosis and output analysis for MCMC. R News 6(1), 7–11.
  • Verity et al. (2020) Verity, R., L. C. Okell, I. Dorigatti, P. Winskill, C. Whittaker, N. Imai, G. Cuomo-Dannenburg, H. Thompson, P. G. Walker, H. Fu, et al. (2020). Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases.
  • Wood (2016) Wood, S. N. (2016). Just another Gibbs additive modeller: Interfacing JAGS and mgcv. Journal of Statistical Software 75(7).
  • Wood (2017) Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2 ed.). Boca Raton, FL: CRC press.
  • Wood et al. (2016) Wood, S. N., N. Pya, and B. Säfken (2016). Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association 111, 1548–1575.

JAGS code

The following is modified from the template produced by the jagam call in the R code.

model {
  theta ~ dunif(1,1000)
  eta <- X[,1:20] %*% b[1:20] ## linear predictor for case
  fw <- X[,21:25] %*% b[21:25] ## lp for week cycle
  for (i in 1:n) { muc[i] <-  exp(eta[i]) } ## expected cases
  Z ~ dnorm(eta[1],.1) ## allow slight regularization at start
  mud <- B %*% muc ## expected deaths
  for (i in 1:n) { mu[i] <- exp(log(mud[i])+fw[i]) } ## model expected deaths
  for (i in 1:n) { y[i] ~ dnegbin(theta/(theta+mu[i]),theta) } ## response
  ## Parametric effect priors tau=1/56^2
  for (i in 1:1) { b[i] ~ dnorm(0,0.00032) }
  ## prior for s(day)...
  for (i in c(2:19)) { b[i] ~ dnorm(0, lambda[1]) }
  for (i in c(20)) { b[i] ~ dnorm(0, lambda[2]) }
  ## prior for s(dow)...
  for (i in c(21:25)) { b[i] ~ dnorm(0, lambda[3]) }
  ## smoothing parameter priors...
  for (i in 1:3) {
    lambda[i] ~ dgamma(.05,.005)
    rho[i] <- log(lambda[i]) ## for monitoring

R code

setwd("/foo/bar") ## directory for plots uk 0) lines(c(i,i),c(0,pt[i]),lwd=3,col="grey") ## plot } if (ps) ########################################## ## Now the infection profile modelling... ########################################## library(rjags);library(mgcv) load.module("glm") ## improved samplers for GLMs often worth loading setwd("foo/bar") ## directory for plots and JAGS code uk1 quantile,probs=c(.025,.1,.9,.975)) ## get profile CIs day 0) lines(c(i-26,i-26),c(0,pt[i]/4),lwd=3) ## Sanity check the median infection profile. fi