Modelling and predicting the spread of Coronavirus (COVID-19) infection in NUTS-3 Italian regions

by   Diego Giuliani, et al.

This paper represents a frst attempt to model and predict the spatio-temporal distribution of Coronavirus (COVID-19) infections at NUTS-3 regional level in Italy. The statistical endemic-epidemic multivariate areal count time-series model introduced by Paul and Held (2011), and further refined by Meyer and Held (2014), is employed. Despite being very preliminary, the results are promising with respect to the possibility to gain useful insights about the spatial and temporal characteristics of both the endemic and epidemic behaviours of this new infectious disease. Moreover, the result also show that the model has the potential to provide good forecasts of the number of infections at local level while controlling for under-reporting.



page 1

page 2

page 3

page 4


Modelling and predicting the spatio-temporal spread of Coronavirus disease 2019 (COVID-19) in Italy

Official freely available data about the number of infected at the fines...

A Link Mixture Model for Spatio-temporal Infection Data, with Applications to the COVID Epidemic

Spatio-temporal models for infection counts generally follow themes of t...

Modelling provincial Covid-19 epidemic data in Italy using an adjusted time-dependent SIRD model

In this paper we model and predict the spread of COVID-19 infection at a...

Modelling COVID-19 – I A dynamic SIR(D) with application to Indian data

We propose an epidemiological model using an adaptive dynamic three comp...

Spatio-Temporal Analysis of Surveillance Data

In this chapter, we consider space-time analysis of surveillance count d...

No increase in COVID-19 mortality after the 2020 primary elections in the USA

We examined the impact of voting on the spread of COVID-19 after the US ...

Predicting Future Opioid Incidences Today

According to the Center of Disease Control (CDC), the Opioid epidemic ha...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Since its appearance in China, Coronavirus (COVID-19) has caught the attention as a serious disease. The virus, which has made inter-species jumps probably from bats, is a severe respiratory syndrome characterized by a strong person-to-person contagiousness

(Guan et al., 2020). It is indeed non surprising that Coronavirus has spread throughout and outside of China in a short time but not following a uniform spreading pattern. Italy is the second country in the world and the first in Europe in terms of number of infected, due to a number of reasons, many of which not yet clearly detected. In addition to medical and epidemiological reasons that are not covered the present paper, certainly a relevant role is played by the particular shape of Italian territory, characterized by the poor presence of uninhabited and isolated lands, and of Italian population, being the European country with the higher share of population over 65, which seems to be more exposed to the risk of contagion. In addition to the described peculiarities, the diffusion of Coronavirus in Italy has not seen a uniform pattern on the territory. After the first case not directly connected with China has been discovered the 20th February 2020 in the NUTS-2 region of Lodi (within the NUTS-3 region Lombardy, located in the north-west part of the country), it has been possible to observe the diffusion of the disease (or, more precisely, symptomatic cases connected to them) mainly in three northern NUTS-2 regions of Lombardy, Veneto and Emilia-Romagna. It took some days to record contagions in other regions, whether near or far, and even today that Coronavirus is spread throughout Italy, some provinces seems to be more affected than others. Measures to contain the epidemic (like those already taken in China) started from the cited regions and were later applied to the entire country and the effects of restrictions may be evaluated only after a certain period. Meanwhile, it may be interesting to try to understand the contagion phenomenon and to predict its spatial diffusion. The present contribution represents essentially an attempt to model and predict the number of Coronavirus infections in NUTS-3 Italian regions (provinces) using the statistical endemic-epidemic modelling framework introduced by Paul and Held (2011). Analyses and predictions at local level can indeed lead to important insights for the identification of the proper policies and measures to contain the contagion, that tend to be more effective when organized and planned at local level. Many scholars are trying to give a contribution to the problem, mostly using deterministic epidemiological models, (see Chinazzi et al., 2020; Castorina et al., 2020; Sun et al., 2020, among others) but a substantial contribution to the problem can also be given by statistics and probabilistic models. Section 2 will be devoted to illustrating the statistical methodological framework to model the spatio-temporal distribution of Coronavirus (COVID-19) infections at regional level. In Section 3 we describe the collected data and in Section 4 we illustrate the application to the NUTS-3 Italian regions. Section 5 concludes.

2 Statistical framework

The data about the spatio-temporal distribution of Coronavirus (COVID-19) infections at regional level consist on multivariate count time series whose spatial references are in the form of irregular spatial lattices. Therefore, the proper regression modelling framework for this empirical circumstance is the class of the so-called areal Generalized Linear Models (GLMs). By extending the seminal model originally introduced by Held et al. (2005), Paul and Held (2011) proposed an endemic-epidemic multivariate time-series mixed-effects GLM for areal disease counts, which proved to provide good predictions of infectious diseases (see Adegboye and Adegboye, 2017; Cheng et al., 2016, among others).

According to this model (here called PH model), conditionally on antecedent observations, the number of Coronavirus infections reported in a region at time ,

, follows a negative binomial distribution with mean parameter


and region-level overdispersion parameter . If

the conditional variance of

is , while if

the negative binomial distribution reduces to a Poisson distribution with parameter


The model in Equation (1) specifies that, in each region , the expected conditional number of Coronavirus infections is due to three additive subcomponents: the endemic effect, , that accounts for the exogenous incidence of the disease; the epidemic within autoregressive effect, , that captures the infectiousness of the disease inside region ; and the epidemic between autoregressive effect, , that quantifies the infectiousness of the disease from the neighbouring regions . The PH model also includes an offset, , accounting for the regions’ population size and spatial weights, , indicating how much two regions and are spatially connected.

Paul and Held (2011) suggested that the endemic and epidemic subcomponents can be modelled themselves through log-linear specifications:


where the parameters are region-specific intercepts and represent observed covariates that can affect both the endemic and epidemic occurrences of infections. The varying intercepts allow to control for unobserved heterogeneity in the disease incidence levels across regions due, for example, to under-reporting of actual infections (Paul and Held, 2011). Given the regionally decentralized health system in Italy, non-negligible differences in case reporting of Coronavirus infections among Italian regions are very likely, which make the opportunity to have region-specific intercepts very important. Following Paul and Held (2011) region-specific intercepts can be obtained through the inclusion of random effects. In particular, we assume here that

The PH model with normally distributed random effects can be estimated through penalized likelihood approaches that have been implemented in the

R package surveillance (Meyer et al., 2017). See Paul and Held (2011) for futher details.

2.1 Endemic submodel

Since some first recent empirical evidences suggest that the number of Coronavirus infections seems to essentially grows exponentially over time (Liu et al., 2020; Maier and Brockmann, 2020), the endemic component model assessing the temporal dynamic of disease incidence, Equation (2

), is specified as a second-order polynomial log-linear regression

111We do not include higher-order terms to avoid the risk of introducing artificial endemic patterns and overfitting.:

In the global model of Equation (1) the endemic predictor is multiplied by the offset , which in our case is the regional share of resident population.

2.2 Epidemic within submodel

Given the brevity of the observed time series, the epidemic within autoregressive parameter is assumed to be constant over time and, in absence of useful exogenous covariates, the model of Equation (3) takes the form

that is, the “internal” infectiousness depends only on a spatially varying intercept.

2.3 Epidemic between submodel

Following Meyer and Held (2014), the subcomponent model for the epidemic within autoregressive parameter, Equation (4), is here specified as

which accounts for the fact that the regions may have different propensities to be affected by the other neighbouring regions, and this may depend by their resident population share. The rationale is that the more populated regions tend to be more susceptible to transmission across regions.

Meyer and Held (2014) also proposed to estimate the spatial weights in Equation (1) using a power-law model:

where is the adjacency order between regions and . For example, if then region is a direct adjacent neighbour to region ; if then region is a direct adjacent neighbour to a direct adjacent neighbour to region . The parameter is a decay constant that has to be estimated. The higher is , the less strong is the spatial interaction with higher order neighboring regions.

3 Data

The Italian Department of Civil Protection according with the Italian Ministry of Health, has started to release a daily bulletin about Coronavirus (COVID-19) infections in Italy since the February 2020. By the March 2020 data were also released at NUTS-3 regional level (provinces), having been previously released only at the highly aggregated NUTS-2 level. Data are available from the websites of the main Italian newspapers, such as, Il Sole 24 ore ( Therefore, we were able to collect the infectious disease count time series at NUTS-2 level for the period from February 26 to March 12 using two sources. For the first six days, data were provided only to media outlets and published by newspapers in the form of interactive maps. For this period, we have used web scraping techniques to track raw data upon which maps have been constructed. For the remaining days, we simply downloaded NUTS-3 regional level data from the institutional website of the Department of Civil Protection. Clearly, at this stage, data suffer from several problems of quality, both because contagions are continuously being updated and because of under-reporting. Therefore, the results presented in the next Section may benefit from availability of more accurate data. Figure (1) depicts the overall time series of daily counts of Coronavirus infections, while Figure (2) shows the analogous region-specific data.222The overall time series here depicted is given by the aggregation of regional time series and may not correspond exactly to the national time series because some cases could not being assigned precisely to the relative NUTS-3 regions. The two plots indicate that while the growth of cases over time in Italy has been globally exponential, it is characterized by strong local heterogeneity. The map in Figure (3), representing the disease incidence per 1000 inhabitants, shows that local heterogeneity has also a relevant spatial pattern in the form of an hot-spot in the northern part of the country.

Figure 1: Daily counts of Coronavirus (COVID-19) infections in Italy
Figure 2: Daily counts of Coronavirus (COVID-19) infections in the NUTS-3 Italian regions
Figure 3: Coronavirus (COVID-19) incidence in the NUTS-3 Italian regions (February 26 2020 - March 12 2020)

4 Results

4.1 Estimation

The maximum penalized-likelihood estimates of the parameters for the PM model are reported in Table (1).333All analyses presented in this paper were performed using the package surveillance (Meyer et al., 2017) in the statistical environment R (R Core Team, 2018). The most striking evidence is that the between epidemic potential is, on average, very strong; however, it is also highly heterogeneous among NUTS-3 regions, as indicated by the high value of the variance of the associated random effects.

Parameter Estimate St. Error
2.081 0.190
0.980 0.004
0.955 0.468
0.132 0.037
21.004 32.065
1.144 0.450
1.076 0.312
0.668 0.048
Table 1: Estimates for the PH model

To better assess the degree of spatial heterogeneity of the three subcomponents we can look at the map of the estimated region-specific effects (Figure (4)). We can see that some NUTS-3 regions are more influenced by the endemic phenomenon while others by the epidemic transmission. As a way of illustration, it is also interesting to examine in details the model fitted sub-means for the NUTS-3 regions with the higher number of cases (Figure (5)). If we focus on the aggregated time series of the daily counts in Italy, the contribution of the three subcomponents seems quite balanced. For the individual NUTS-3 regions, instead, they can exert a very different role. For example, for NUTS-3 regions Brgamo, Pavia and Cremona almost all cases of infection are explained by the transmission from the other neighbouring NUTS-3 regions while NUTS-3 regions Lodi and Piacenza are basically solely affected by the within autoregressive effect.

Figure 4: Estimated multiplicative effects of the endemic, within epidemic and between epidemic subcomponents
Figure 5: Fitted means of the three submodels for the eight NUTS-3 regions with a higher number of infections. The dots represent the observed daily counts.

4.2 Prediction

As a simple first attempt to assess the ability of the model in predicting the future daily counts of infections, we re-estimate the model using the data between February 26, 2020 - March 11, 2020 and we make prediction for March 12, 2020. The observed total number of cases at March 11, obtained by aggregating the regional data, is 2485; the model provides a prediction of 2759 cases. Therefore, the model overestimates by 274 cases. However, the count of 2485 exclude the cases that could not be assigned precisely to a NUTS-3 region. The actual national count is 2651, which seems to suggest that the PH model can predict the number of non-assigned cases. This is probably due to the inclusion of random effects that allow to control for randomly driven under-reporting at regional level.

Naturally, the advantage of the PH model is that it provides predictions for each individual NUTS-3 region. Figure (6) shows three maps depicting the 80 prediction intervals444Given the brevity of the observed time series to obtain a reasonable degree of precision we cannot require higher confidence levels of the regional counts of infections at March 12, 2020. In particular, they report, respectively, the lower endpoints of the interval, the point prediction and the upper endpoints of the interval. Despite the fact that the intervals are quite wide, the results are promising if compared with the known observed counts. We expect that the level of precision will improve as the data are updated and the observed time series become longer.

Figure 6: Prediction intervals of NUTS-3 regional counts of infections at March 12, 2020

5 Conclusions

In this paper we analyzed the spatio-temporal distribution of Coronavirus (COVID-19) infections in Italy using data aggregated at NUTS-3 regional level. The main and most striking evidence is that both the endemic and epidemic characteristics of the disease are very heterogeneous across local territories, thus suggesting that policies and measures to contain its spreading may potentially be more effective if planned having in mind the peculiarities of local territories.


  • O. A. Adegboye and M. Adegboye (2017) Spatially correlated time series and ecological niche analysis of cutaneous leishmaniasis in afghanistan. International Journal of Environmental Research and Public Health 14 (3). Cited by: §2.
  • P. Castorina, A. Iorio, and D. Lanteri (2020) Data analysis on coronavirus spreading by macroscopic growth laws. External Links: 2003.00507 Cited by: §1.
  • Q. Cheng, X. Lu, J.T. Wu, Z. Liu, and J. Huang (2016) Analysis of heterogeneous dengue transmission in guangdong in 2014 with multivariate time series model. Scientific Reports 6 (33755). Cited by: §2.
  • M. Chinazzi, J. T. Davis, M. Ajelli, C. Gioannini, M. Litvinova, S. Merler, A. Pastore y Piontti, K. Mu, L. Rossi, K. Sun, C. Viboud, X. Xiong, H. Yu, M. E. Halloran, I. M. Longini, and A. Vespignani (2020) The effect of travel restrictions on the spread of the 2019 novel coronavirus (covid-19) outbreak. Science. Cited by: §1.
  • W. Guan, Z. Ni, Y. Hu, W. Liang, C. Ou, J. He, L. Liu, H. Shan, C. Lei, D. S.C. Hui, B. Du, L. Li, G. Zeng, K. Yuen, R. Chen, C. Tang, T. Wang, P. Chen, J. Xiang, S. Li, J. Wang, Z. Liang, Y. Peng, L. Wei, Y. Liu, Y. Hu, P. Peng, J. Wang, J. Liu, Z. Chen, G. Li, Z. Zheng, S. Qiu, J. Luo, C. Ye, S. Zhu, and N. Zhong (2020) Clinical characteristics of coronavirus disease 2019 in china. New England Journal of Medicine 0 (0), pp. null. External Links: Document Cited by: §1.
  • L. Held, M. Höhle, and M. Hofmann (2005) A statistical framework for the analysis of multivariate infectious disease surveillance counts. Statistical Modelling 5 (3), pp. 187–199. Cited by: §2.
  • Y. Liu, A. A. Gayle, A. Wilder-Smith, and J. Rocklöv (2020) The reproductive number of COVID-19 is higher compared to SARS coronavirus. Journal of Travel Medicine. Cited by: §2.1.
  • B. F. Maier and D. Brockmann (2020) Effective containment explains sub-exponential growth in confirmed cases of recent covid-19 outbreak in mainland china. External Links: 2002.07572 Cited by: §2.1.
  • S. Meyer, L. Held, and M. Höhle (2017) Spatio-temporal analysis of epidemic phenomena using the r package surveillance. Journal of Statistical Software, Articles 77 (11), pp. 1–55. Cited by: §2, footnote 3.
  • S. Meyer and L. Held (2014) Power-law models for infectious disease spread. The Annals of Applied Statistics 8 (3), pp. 1612–1639. Cited by: Modelling and predicting the spread of Coronavirus (COVID-19) infection in NUTS-3 Italian regions, §2.3, §2.3.
  • M. Paul and L. Held (2011) Predictive assessment of a non-linear random effects model for multivariate time series of infectious disease counts. Statistics in Medicine 30 (10), pp. 1118–1136. Cited by: Modelling and predicting the spread of Coronavirus (COVID-19) infection in NUTS-3 Italian regions, §1, §2, §2, §2.
  • R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. External Links: Link Cited by: footnote 3.
  • K. Sun, J. Chen, and C. Viboud (2020) Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health, pp. . Cited by: §1.