Climate BCG: Effects on COVID-19 Death Growth Rates

by   Chris Finlay, et al.

Multiple studies have suggested the spread of COVID-19 is affected by factors such as climate, BCG vaccinations, pollution and blood type. We perform a joint study of these factors using the death growth rates of 40 regions worldwide with both machine learning and Bayesian methods. We find weak, non-significant (< 3σ) evidence for temperature and relative humidity as factors in the spread of COVID-19 but little or no evidence for BCG vaccination prevalence or PM_2.5 pollution. The only variable detected at a statistically significant level (>3σ) is the rate of positive COVID-19 tests, with higher positive rates correlating with higher daily growth of deaths.



There are no comments yet.


page 6

page 16


The Effects of Air Quality on the Spread of the COVID-19. An Artificial Intelligence Approach

The COVID-19 pandemic considerably affects public health systems around ...

Seasonal association between viral etiologies of hospitalized acute lower respiratory infections and meteorological factors in China

Background: Acute lower respiratory infections present pronounced season...

Bayesian Modeling of COVID-19 Positivity Rate – the Indiana experience

In this short technical report we model, within the Bayesian framework, ...

COVID-19 transmission risk factors

We analyze risk factors correlated with the initial transmission growth ...

Estimating County-Level COVID-19 Exponential Growth Rates Using Generalized Random Forests

Rapid and accurate detection of community outbreaks is critical to addre...

Unequal Impact and Spatial Aggregation Distort COVID-19 Growth Rates

The COVID-19 pandemic has emerged as a global public health crisis. To m...

Universal Masking is Urgent in the COVID-19 Pandemic: SEIR and Agent Based Models, Empirical Validation, Policy Recommendations

We present two models for the COVID-19 pandemic predicting the impact of...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The COVID-19 pandemic has triggered extensive efforts to predict the severity of COVID-19 to aid in decision making around interventions such as lockdown and the closure of schools 111 [1]. Regions hit hard by the pandemic, such as Wuhan, Lombardy and New York, where doubling times of 2-3 days and high crude mortality rates stand in stark contrast to other countries only mildly affected such as Hong Kong, South Korea and New Zealand.

Potential explanations for the apparent differences in the transmissivity (encoded by the time-dependent reproductive number, ) and lethality (encoded by the Infection Fatality Rate, IFR), currently fall into four broad categories. The first posits that differences are largely fictitious, driven by the heterogeneity of testing and reporting of cases and deaths; a known issue and one that we are particularly concerned with in this paper.

The next three categories posit that the differences are primarily real and are driven by (1) cultural and policy factors (swift lockdown, efficient contact tracing and quarantining, use of masks, obedient populations or social structures that are naturally distant or isolated), (2) local environmental factors (such as temperature and humidity variations, population density, comorbid factors, vaccinations, vitamin D levels, blood type etc…) or (3) existence of multiple strains with different transmissibility or lethality [4]. Our primary interest lies in disentangling the first category (testing) from a subset of potential factors in the second category.

Finding the relative contributions of each of these four categories is key in understanding and optimally fighting the pandemic. The widely different testing capabilities between countries, particularly between the developed and developing world, imply that if not correctly treated, testing variability will create spurious evidence that can lead to false hope and sub-optimal interventions.

In the wake of the spreading pandemic there have been a host of studies that have examined the possible impact of climate [8]-[19], blood type [20, 21], haplogroup [31], pollution [26]-[30] and BCG vaccination prevalence [22, 23, 24, 25] on the spread of COVID-19. Our main conclusion is that testing issues are significant and the factors above are likely not the main causes of variability in growth rates of deaths worldwide.

We note that both the basic reproductive number, , and the IFR will be affected by the four categories above in general. However, since we do not currently have access to the true number of infections we cannot address the potential effect of environmental factors on the IFR. Similarly the Case Fatality Rate cannot be used for this purpose due to testing differences around the world. We therefore focus on their potential effect on , which we quantify through the daily growth rate of deaths of countries around the world for which we have sufficiently good data.

2 Methodology

To explore the potential impact of climate (e.g. through temperature, relative humidity and UV Index), BCG vaccination prevalence, blood type and pollution (PM) on COVID-19, one must think carefully about choice of both data and methodology.

We choose to focus on deaths instead of the number of confirmed infections in the belief that these are significantly less affected by sampling and testing issues than infection numbers: the number of tests per 1000 population (Tests/1k) currently varies by more than three orders of magnitude across the world 222, making infection numbers highly biased and correlated with confounding variables such as GDP and healthcare. In countries with limited testing capability, only the more severe patients are typically tested.

Since these are also the patients most likely to die, it is likely that patients who die from COVID-19 are more likely to be tested than typical COVID-19 sufferers, who may be mostly asymptomatic. Without detailed knowledge of testing protocols in each country separating out the effects of the factors of interest (climate, BCG vaccination etc…) is extremely challenging. Although deaths are not immune to testing issues, with many missing deaths shown through excess mortality studies, especially in overwhelmed medical systems [47, 48], we argue that this is likely to be much less of an issue in the first month after initial COVID-19 deaths, which we focus on, since we are interested in .

The second key choice is whether to focus on absolute death counts or growth rates. Absolute death counts are also highly problematic. First, testing efficiencies may vary systematically from country to country. Second, the widely varying start dates of the pandemic in different countries are hard to deal with rigorously. As a result we focus on the daily growth rate of deaths, , for country during the initial phase of the epidemic. The initial daily growth correlates with , the base reproductive number in each country; in simple models , where is the period in days for which a person is infectious on average333Hence and days yields ..

Separating out the true causes of variation in or is still a highly complex problem, and we discuss our approach in detail in the additional material in section (5). Part of the complexity arises from the fact that depends explicitly both on properties of the virus and on social factors (e.g. average number and nature of contacts), as well as any of the other potential factors we wish to study. Hence, we expect there to be a large intrinsic variation in the growth of deaths from country to country, driven e.g. by the average number of people in a household, population density [5] etc… that may correlate with, and hence lead spurious evidence for, potential factors such as climate, vaccination coverage, blood type and pollution.

We model this complexity by allowing each country, , to have its own unique base growth rate, denoted

, that is estimated from its own data. This freedom allows the model to account for the myriad unmodelled factors specific to each country (population density, GDP, culture, health care quality etc…). However, we tie these base country growth rates,

, together through a parent distribution, expressing the prior belief that there is a single dominant strain of COVID-19 globally.

Our primary goal in this study is to examine worldwide data to investigate whether climate and the BCG vaccination prevalence are important drivers of COVID-19 spread. Because of the danger of confounding variables we used the base growth rate,

for each country in a machine learning feature extraction algorithm to pick the most important additional variables to include in our computationally intensive hierarchical Bayesian analysis. This lead us to exclude PM

 pollution. In addition we undertook a separate analysis including A+ blood type. The data for blood type came from unpublished online sources and is therefore kept separate since it is less trustworthy.

The remaining four most promising environmental factors were Temperature (T, C), Relative Humidity (RH, %), BCG vaccination coverage (BCG, %), Ultra-Violet Index (UVIndex); the latter included as a proxy for vitamin D production. To this set we added two testing-related variables to serve as diagnostics for potential contamination from testing issues: (1) the fraction of tests that return positive (Pos-Rate, %) and (2) the number of tests per 1,000 population (Tests/1k), yielding our final set of global parameters, (T, RH, BCG, UV, Pos-Rate, Tests/1k). Due to concerns over data integrity, we study the impact of A+ blood type separately in section (5.4.1).

We then explicitly fit for using our sample of 40 regions in 37 countries for which there is data for all of the parameters in . We cannot use all the death data for countries since a constant growth rate model fails quickly for most countries due to Non-Pharmaceutical Interventions (NPIs) or nonlinear effects. To address this we extract the initial pure exponential part of the data on a country-by-country basis, as illustrated in Fig. (1) and discussed in detail in section (5), which is the only data we use in our main analysis.

The early phase death data for each country , , are then fit to the model:


where converts our growths to percentages and the are the country-specific data corresponding to .

We use a hierarchical Bayesian analysis in which each country’s growth rate is drawn from a parent distribution which allows each country’s base growth rate, , to be intrinsically different rather than forcing any differences to be due only to the parameters in the factors encoded by .

Figure 1: Example fit to the log of deaths vs time for India, showing how we select the initial linear part of the logarithmic data for inclusion in our analysis. Data are shown as blue points. Data covered by red fits are those used in our main analysis, in this case at days after reaching three deaths. Green samples are drawn from the 5d sigmoid model which covers all the data and models the changes in the growth rate due to social interventions or nonlinear effects. This is one of the 40 countries/provinces in our analysis; the full set is shown in Fig. (9).

We use Monte Carlo Markov Chain (MCMC) methods to simultaneously fit the base growth rates,

, for all countries, our parameters of interest,

, and the parent distribution hyperpriors, implying a large Bayesian hierarchical model with 90 parameters in total. After marginalising over all country growth rates and parent distribution parameters, we are left with the marginal distributions on the

, our parameters of interest, providing our main results.

We assess the importance of each of the parameters both through standard model selection metrics such as the Bayesian evidence and Bayesian Information Criterion (see Table 1) and by computing the statistical significance with which the parameters deviate from zero in the marginalised posterior chains. We now discuss these results.

3 Results and Discussion

Figure (4) shows the main results of our paper: only the parameter associated to the positive rate or fraction of tests (Pos-Rate) is non-zero at more than . This result is stable to significant changes in our priors and hyperpriors and to including or removing the other parameters in in the analysis. In addition, the positive rate is selected as the most important variable by all model-selection metrics (see Table 1) and by our machine learning analysis, section (5.5). A plot of regional death growth rates versus positive rate is shown in Fig. (2) showing the clear correlation.

The simplest explanation of this result is that regions that experience the most rapid spread of the disease, i.e. those with the largest , were also the regions that were least able to keep up with testing demands and hence where the rate of Polymerase Chain Reaction (PCR) tests used for COVID-19 returning positive were the highest on average. This is then not a cause, but rather a result of, high growth rate. We find that Pos-Rate correlates positively with Tests per 1000 population (see Fig. 10): high death growth rate and growing positive rate likely spurred increased testing. Neither of these observations help us identify the cause of increased death growth rates, however.

Temperature is the only other variable which is non-zero at more than in our multivariate fits (see section 5.4.3 for more discussion) and relative humidity the only other parameter non-zero at more than , providing some weak evidence for climatic impact on the spread of the disease. Increasing temperature and humidity tends to decrease the spread of COVID-19, in agreement with previous studies.

In our model-selection metrics, where we compare fits with one variable at a time, relative humidity is preferred over temperature by the Akaike Information Criteria (AIC) and Bayesian Information Criteria (BIC) and is the only variable other than Pos-Rate, that has a BIC value more than 2 units lower than the No Factor model (which has )444 BIC is the standard demarcation of positive evidence [40].. The remaining parameters have poor or mixed results relative to the No Factor model, and all perform very poorly relative to the model with just the positive rate.

In particular, our regression and machine learning analyses provide no evidence in support of BCG fraction. This is not surprising if we look at figure (3) which plots regional death growth rates versus BCG coverage: there is no discernible correlation. Our results are therefore in agreement with [24, 25]. Further we do not find any correlation between UV index and death growth rate. This is pertinent since UV index is relevant to natural production of vitamin D [38] and vitamin D has been suggested as a protective factor against the spread of the disease [35, 36, 37],

Now we briefly discuss results for the other testing variable, Tests / 1k. We find that this is not a significant correlate of growth rate. This is perhaps not unexpected. If testing efficiency in each country is approximately constant, i.e. the rate of true infections detected changes only slowly with time, then this has little impact on the growth rate of deaths. It is only if the testing efficiency varies rapidly with time that we would expect this to be significant. Our results suggest that this was not the case, at least within the initial phase of the epidemic in each countries. Neither the Bayesian, nor the machine learning analyses found the number of tests per 1000 population to be significant. We present technical details of the data, algorithms and analysis in section (5).

Figure 2: The growth rates with errorbars for the 40 regions taken from our No Factor model plotted against the positive rate data, are shown in blue. The red lines are posterior samples from the univariate positive rate model. The intercept was taken to be the parent level mean of the growth rate and the slope is the coefficient for positive rate of the same model; showing why this variable is detected at .

Figure 3: BCG population coverage estimate against our estimated base growth rates. We see that there is no clear trend with increasing BCG coverage. The growth rates with error bars are taken from our No Factors model fit.

Finally, in our separate analysis of blood types discussed in sections 5.4.1 and 5.5 we analyse the potential correlation of blood type with growth rate. We find marginal evidence for A+ blood type being relevant, subject to the caveats discussed in section (5.4.1).

How should our results be taken in the context of the many claims of climate, blood, BCG etc… being significant factors in COVID-19 spread? First, many studies were based on confirmed COVID-19 test cases which, as we discussed earlier, are affected by differences in testing capability and protocols between countries.

Secondly, many of the studies present regressions that do not allow for unmodelled confounding sources of variation in the growth. Hence, if a country shows high growth the algorithm will try to force one of the potential factors under study to explain it. Instead the hierarchical Bayesian framework allows the base growth rate of each country to be different, and hence potential factors will only be given credit for the difference in the growth rate if they provide a genuinely better fit.

Further, many studies do not model the intrinsic uncertainties associated with the data as we have done. We too find that the best-fits are non-zero (as can be seen by looking at the peaks of the posteriors in Figure (4) or at the last rows of Table (6). Hence our results are not in disagreement with regression results, the issue is about the statistical significance of such claims.

Finally, although we do not detect environmental factors at more than , it is interesting to examine how big the environmental factors would be if our best-fit parameter values describe reality: a increase (decrease) in temperature implies a decrease (increase) in the base daily growth rate while a increase (decrease) of 20% in relative humidity would mean a decrease (increase) in base daily growth rate. The decreased spread at higher temperature and humidity agrees with previous work [9].

This may not seem significant, but for a city such as Johannesburg, where both temperature and humidity drop significantly in winter, the combined effect could add more than to the base daily growth rate. For a daily growth rate of , which was approximately the value in May 2020, this would halve the doubling time of the disease, a significant impact.

4 Conclusions

Contrary to previous claims our analysis of growth rates for deaths from countries worldwide are consistent with no effect from climate, pollution or BCG vaccination. The only significant correlation detected is with the positive rate of tests: a country that intrinsically had a high (due to high population density etc…), would naturally tend to be more overwhelmed and hence run low on testing kits earlier, leading to increased fraction of positive tests. We did find some weak suggestive evidence, at , that temperature and relative humidity correlate with death growth rates.

A separate analysis of blood type data shows that A+ type is the most important blood type, though the significance is marginal, both because the data quality is low and the statistical significance is weak. Our combined statistical and machine learning analysis finds no evidence for PM pollution, other blood types or UV Index as drivers of COVID-19.

More data could be obtained by dropping the requirement that all countries in the sample have data for all the potential factors, which could potentially allow for some of the effects to be detected at higher statistical significance but at the cost of making model comparison significantly more difficult.

The data and code for our Bayesian analysis is available at

4.1 Acknowledgements

We thank Michael van Niekerk for extracting data from BCG ATLAS, Niayesh Afshordi, Ewan Cameron, Inger Fabris-Rotelli, Ben Holder and Nadeem Oozeer for discussions and comments.

This research has been conducted using resources provided by the United Kingdom Science and Technology Facilities Council (UK STFC) through the Newton Fund and the South African Radio Astronomy Observatory.

Figure 4: Marginal distributions for the various factors we consider in the case where all factors are considered simultaneously (blue; multivariate) and individually (red; univariate). The results in both cases are consistent: only the coefficient associated with the positive rate of tests is non-zero at more than , with temperature non-zero at in the multivariate case. Notice that the significance of both the temperature and relative humidity increase in the multivariate fits relative to the univariate fits where they are fit alone. Both BCG and UV Index are consistent with no effect.
Positive Rate 0.0 0.0 0.0
Relative Humidity 5.5 5.5 31.8
No Factors 15.3 11.6 22.9
Temperature 13.7 13.7 24.0
BCG Vaccine 14.7 14.7 13.6
UV Index 14.7 14.7 30.5
Tests / 1k 15.2 15.2 28.6
All Incl.Tests 114.5 132.9 113.2
All Excl. Tests 125.9 137.0 127.4
Table 1: Model selection rankings relative to the best model (first row) for four metrics estimated from the MCMC chains: Akaike Information Criteria (AIC), Bayesian Information Criteria (BIC) and Deviance Information Criteria (DIC). Here we consider each factor in separately (univariate), as well as the model with no factors, the model with all factors and a model with all factors excluding the two testing factors (Positive Rate and Tests / 1k). The Positive Rate is unanimously selected as the most important feature, followed by the Relative Humidity. Temperature does not perform well here but this is not surprising: the statistical significance of the temperature increased by in the joint multivariate fit relative to the univariate fit alone.A similar increase in significance is visible for the Relative Humidity; see Fig. (4). Note that the model including the two testing parameters (All Incl. Tests) outperforms the model excluding the testing parameters, reinforcing the fact that testing is important. A+ blood type results are presented separately in section (5.4.1).

5 Additional Material

5.1 Data

5.1.1 Data on deaths

We chose to look at the growth rate in deaths as they are less likely than confirmed cases to be affected by the widely varying testing protocols between countries. The cumulative death data for each country comes from the Centre for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [44].

We cannot, however, take all data on deaths for all times since it is clear that a simple exponential model fails quickly: as soon as interventions occur the simple exponential model fails and we will obtain contaminated estimates of the growth rate due to the flattening of the curve which will skew our analysis; see e.g. Fig. (

1). As a result we want to know which data should be included in our analysis for each country. This reduces to knowing the time, , at which the growth of deaths deviates from a simple exponential. We then only use data at for country .

Doing this cut by hand introduces the possibility of human bias. Instead we compute for each country by fitting a non-linear sigmoid transition function to the growth rate of the death data of each country over time:


This function allows the growth rate to smoothly transition from an initial value (which we are interested in) to a final value, reflecting the impact of social interventions or nonlinearity in the system.

In this first phase of the analysis each country, , therefore has 5 parameters () to describe the trajectory of its deaths, where and represent the change in growth rate and suddenness of the transition with time from to for each country, (the index, , has been left out in Eq. (3) below):


To maximise the probability that this is a good fit to the data the non-linear model in Eq. (


) we only consider data up to the time when a country passes 1000 deaths. All five parameters are treated hierarchically with their own parent distribution where the means are Normally distributed and the standard deviation is HalfNormal distributed.

We only use data from in our main analysis (see section 5.2), where is given by the marginalised mean from the chains. Typical values for were around 15 days. We excluded countries where because of concerns about quality of the underlying model fit in such cases, i.e. a simple exponential model was not a good fit even at early times which could lead to spurious growth rates.

The resulting fits to country death data for both the full 5-d model (green) and just the region are given in Figure (9), showing that the technique performs well in isolating the initial exponential growth from a first-principles approach.

Once we have for each country we further cut our data by using the following rules:

  • We only use countries for which there were a minimum of 20 deaths by 25 April 2020.

  • We choose day zero as the first day a country passes 3 deaths. This leads to typical starting numbers of deaths of around 5.

  • Data for our variables of interest (Climate, BCG, etc…) were not available for all countries. To ensure the same amount of data for all models only countries with data for all our parameters in were included.

We did not model the potential correlations and interactions of our parameters with these data cuts. This could potentially alter our conclusions: if a parameter were extremely important in determining growth rates, then countries with very small numbers of deaths would systematically not make it past our data cuts and hence the signal from that data would be lost. However, it is unclear how to model this censorship rigorously and it is left to future work.

After these data cuts we were left with 40 provinces in 37 countries with 613 data points in total, shown in Figure (9). The only country with more than one province was Canada which include Alberta, Ontario, Quebec and British Columbia (BC).

5.1.2 Climate Data

For each region and country the climate data was gathered from the Dark Sky API, [43], where the locations sampled are taken from the latitudes and longitudes given for each country or region in the JHU COVID-19 data. For each country, , the mean temperature, mean relative humidity and mean UV Index are calculated as an average over a N day window starting 28 days prior to day 0 for each region, where N is the number of days of deaths data for that country. The latter is chosen as an estimate of the average time from infection to death 555 Since mean climate variables change relatively slowly the exact delay is not important. See Fig (9) for the distribution of times. We used UV Index as a proxy for natural vitamin D production. Data for all regions and countries is shown in Table (5).

5.1.3 BCG Vaccine Coverage

Bacillus Calmette–Guérin (BCG) vaccination policies have been in place in many countries across the world starting from widely differing start dates. We would like to estimate the percentage of the population in each country that has received a BCG vaccine in their lifetime. For this purpose we need the age demographics of each country as well as the dates when BCG vaccinations became/stopped being mandatory.

To estimate the percentage of the population vaccinated by BCG we need to draw on three sources of data. These are the BCG Atlas, World Health Organisation (WHO) BCG vaccination rates amongst 1-year olds and the age demographics for each country from the United Nations (UN). Firstly we looked to the BCG Atlas [45]. This is a heterogeneous dataset which implies that not all information was available for each country. We collected the following fields from the BCG Atlas:

  1. Current BCG vaccination?

  2. Which year was vaccination introduced?

  3. Year BCG stopped?

  4. Year of changes to BCG schedule

  5. Details of changes

  6. BCG coverage (%)

We did not use the last field directly due to missing information on this field, specifically if it was for a small age range of the population or its entirety. For some countries data was missing from fields 1-3. In these cases, where appropriate and possible, the missing data was obtained from fields 4 and 5.

Age demographics for each country were used to compute the expected fraction of the population who have had the BCG vaccination. In 15% of cases the BCG Atlas did not have data and we instead used WHO data [6] to perform the estimate. The derived BCG fractions and the origin of the data, are shown in Table (2) while a plot of the BCG fractions versus growth rates are shown in Fig. (3).

There is one caveat here: our BCG coverage is the estimate of the percentage of the population of each region and country that has had the BCG vaccination. This is arguably not the optimal quantity to use in our analyses however; it might be better to use the fraction of vaccinated population weighted by the probability of infection as a function of age. However, the latter is currently unknown and hard to compute even in the best of situations: how can we know how many people were exposed but never got infected? The large scatter in Figure (3) suggests that this is unlikely to make a significant difference to our conclusions that BCG vaccine is not important in the spread of COVID-19.

Country BCG Coverage Data Source
Argentina 58.5 % WHO
Australia / NSW 38.9 % BCG ATLAS
Austria 53.1 % BCG ATLAS
Canada / Alberta 0.0 % BCG ATLAS
Canada / BC 0.0 % BCG ATLAS
Canada / Ontario 0.0 % BCG ATLAS
Canada / Quebec 0.0 % BCG ATLAS
Chile 93.1 % BCG ATLAS
Colombia 86.4 % BCG ATLAS
Cuba 46.9 % WHO
Czechia 73.5 % BCG ATLAS
Denmark 50.7 % BCG ATLAS
Estonia 27.2 % WHO
Finland 80.4 % BCG ATLAS
France 71.4 % BCG ATLAS
Germany 49.7 % BCG ATLAS
Greece 13.9 % WHO
Hungary 82.3 % BCG ATLAS
India 96.9 % BCG ATLAS
Israel 29.0 % BCG ATLAS
Italy 0.0 % BCG ATLAS
Japan 70.0 % BCG ATLAS
Korea, South 51.3 % BCG ATLAS
Lithuania 28.2 % WHO
Luxembourg 0.0 % BCG ATLAS
Mexico 98.9 % BCG ATLAS
Netherlands 0.0 % BCG ATLAS
Norway 79.8 % BCG ATLAS
Pakistan 77.8 % BCG ATLAS
Peru 96.5 % BCG ATLAS
Philippines 61.8 % WHO
Poland 80.8 % BCG ATLAS
Slovenia 75.2 % BCG ATLAS
South Africa 79.2 % BCG ATLAS
Sweden 40.9 % BCG ATLAS
Switzerland 31.9 % BCG ATLAS
Thailand 53.4 % BCG ATLAS
Turkey 92.8 % BCG ATLAS
United Kingdom 67.2 % BCG ATLAS
Table 2: BCG vaccine coverage for populations for all our regions/countries estimating the fraction of the population who have received the vaccine.

5.1.4 Additional data

Data on prevalence of blood type for each country was taken from [32, 33] while PM pollution data came from [34]. We discuss the blood type data and results in section (5.4.1).

5.2 Model

As described earlier our basic regression model for the deaths in country at day is:


which is assumed valid for , as described before. is measured in percent and we model it’s potential dependence on our variables of interest as:


where and are the country-specific base growth rate and data for climate, BCG etc…, shown in Table (5) and are the global parameters we are interested in. In general the could be time-dependent. In this analysis, because of missing data and the fact that our data for each region typically spans a short period ( is less then 3 weeks as discussed in section 5.1.1), we use average values for .

Figure 5: Graphical model for our hierarchical Bayesian analysis showing the parameters, data and their connections. Rectangles with rounded corners represent parameters in the model with their priors shown as normal or half normal distributions with their respective parameters. Diamonds are fixed inputs/data. Squares represent repetition plates with repeated variables with index given in the top-left corner of the plate. are the observed deaths for country on day .

Our goal is to determine if any of the parameters are rigorously required to be non-zero by the data. One limitation of Eq. (5) is that it is linear in the country data . We justify this by noting that the growth rates and the underlying data vary over narrow ranges, so that retaining only the linear terms in the Taylor series expansion of is a reasonable step. For the temperature variables we have verified that assuming instead a step change in the growth at around C with two hierarchical growth parameters, one on each side of the step, did not lead to any increase in significance in the detection of temperature as an effect.

5.3 Posterior model and sampling

We write a hierarchical Bayesian probabilistic model by assuming a Poisson likelihood, suitable for count data, with mean given by the deterministic forward model defined in Eq. (5), together with priors and hyperpriors for all our 90 parameters, as shown in the schematic graphical representation in Fig. (5); see e.g. [39]. We do not try to model missing deaths due to testing irregularities. As long as the fraction of missing deaths remains approximately constant in the early phase of the spread within the country that we consider this should have little impact on our results.

Since our data covers 37 countries and 40 provinces worldwide and we know that the growth rate of the disease will depend both on properties of the disease (which are universal), and properties specific to each country (e.g. culture and population density) it is natural to model the data with a hierarchical Bayesian structure which allows growth rates to vary somewhat from country to country but to also be somewhat similar between countries.

The priors and hyperpriors for each variable to be estimated is chosen to be:

  • ; the hyperprior for the mean on the prior for initial deaths, , for each country ( as part of our data cuts).

  • ; the standard deviation on the prior for initial deaths for each country.

  • ; the mean (in percent) on the prior for the growth rate of each country, .

  • ; the standard deviation on the prior for the growth rate of each country.

  • , the prior on the initial deaths parameter for each country, . All countries together will be denoted by .

  • ; the prior on the growth rate parameter for each country, . All countries together will be denoted by .

  • ; the prior on the vector of parameters of key interest.

The complete vector of all parameters is therefore:


The joint prior over our full set of parameters is assumed factorizable, i.e. a product of the prior distributions listed above.

We assume our data to be statistically independent both between countries and from day-to-day. Our model simultaneously fits the death data from all countries. We use NumPyro[42] for Monte Carlo Markov Chain (MCMC) probabilistic sampling from our posterior using the No U-Turn Sampling (NUTS) algorithm [41]. NumPyro has diagnostic tools built in and allows for easy running on accelerators, such as GPUs, which were key in being able to iterate quickly through our many models including all the multivariate and univariate combinations, as well as looking at the effect of priors.

The MCMC simulations were generally run as 4 independent chains with each chain starting from a random position sampled from the prior for our parameters and run for 2000 steps in order for length scales (and other sampling options) to be determined automatically by NumPyro. After this initial “burn-in" each chain was typically run for 5000 samples. Chains were long enough to ensure that the Gelman-Rubin convergence test was always less than 1.01.

Each chain was then thinned by a factor of two in order to further improve independence of samples. This leads to the final 10k samples collected for each model and led to good, converged, trace plots and posteriors; see Fig. (4). We used ChainConsumer666 for the analysis of the chains, the model selection metrics in Table (1) and some of the plots [51].

5.4 Additional Bayesian Results

5.4.1 A+ Blood Type Analysis

The data for blood type prevalence comes from online heterogeneous collections of published and unpublished data sources covering a wide range of publication dates. As a result, data integrity is an issue and we have chosen to separate the blood analysis from our main Bayesian analysis.

A random forest analysis finds only A+ blood type as potentially relevant amongst the different ABO blood types (see section

5.5). Here we present the results of both univariate and multivariate A+ blood type Bayesian analyses using the formalism described in section (5.2).

Fig. (6) shows the result of the Bayesian analysis for the A+ blood type coefficient: it is consistent with zero in both the univariate and multivariate cases. If we look at the AIC, BIC and DIC for A+ blood type we find values of: compared to the no-factor models results of . The A+ AIC and BIC are better than the no-factor model, but the DIC value is worse. In addition, the random forest analysis found A+ blood type to be more important than both BCG and and Tests/1k.

As a result we conclude that there is somewhat conflicting evidence regarding the potential effect of A+ blood type, but that none of the evidence is strong.

Figure 6: The univariate and multivariate results for the coefficient of A+ blood type are consistent with zero, i.e. no effect.

5.4.2 Effect of on Base Parameters

In the results section we focused on the best-fit values of . The reverse question is interesting too: what is the impact of including the parameters on the hyperprior parameters, mean, , and standard deviation, , of the base growth rates, , for the 40 regions? If the

do explain some of the variation we would expect the standard deviation on the base growth rate (parent distribution) to get smaller, which is captured by the hyperparameter

. The mean of the base growth rates (parent distribution), given by , are nothing more than an offset dictating the value when all factors are zero i.e. .

This is what we see: in the No-Factor model which changes to when we allow all the to vary, a significant shrinkage. On the other hand, if we include all parameters except for the two testing parameters, returns to , while when we add the positive rate (Tests/1k) parameters alone; showing that it is primarily the positive rate parameter that drives the shrinkage in the uncertainty in the parent distribution on the base growth rates.

5.4.3 Comparison of univariate and multivariate fits

To assess the stability of our results we fit each of our key parameters in both alone (univariate) and simultaneously with all the other parameters (multivariate). As shown in Table (3) and Fig. (4), the results are quite stable. Positive rate and temperature are the still the most significantly parameters in both cases: the positive rate is non-zero at more than in both cases while the significance for temperature increases from to when going from the univariate to multivariate fit. Relative Humidity also increases in significance in the multivariate case.

To access correlations between parameters we show in Fig. (10) the one and two-dimensional marginalised posterior plots. Correlations are typically small, though there are is some small positive correlation between Tests per 1k and Positive Rate, and a small negative correlation between Temperature and UV Index.

Our full results for all the hyperpriors, the base country growth rates, and the parameters for each of the different univariate and multivariate models are shown in Table (6).

Parameter Multivariate Univariate
BCG Vaccine 0.01 0.04 0.00 0.04
Temperature -0.29 0.14 -0.19 0.13
Relative Humidity -0.10 0.08 -0.09 0.09
Tests / 1k -0.11 0.12 -0.14 0.11
Positive Rate 0.25 0.08 0.27 0.08
UV Index -0.01 0.95 -0.47 0.89
Table 3: Comparison of univariate and multivariate fits to the data. Parameters whose mean are more than 2 away from zero (temperature and positive rate) are shown in bold. Only Postive Rate is nonzero at more than .

5.4.4 Varying Hyperpriors

Since our goal in this analysis is to assess whether there is evidence for external factors such as climate, blood type etc… in addition to known country-specific factors, an important internal check is to look for potential sensitivity to our priors and hyperpriors.

In our hierarchical analysis, data for each country tries to pull the measured growth rates of countries to their own best-fit values, while the hierarchical nature of our model described in section (5.3) attempts to pull them all together. Between this tug of war, the algorithm searches for joint values of that will improve the fits to all the data. One concern might be that if we make the hyperpriors on the parent distributions much stronger or much weaker, we allow the algorithm to given less or more freedom to the base growth rates which in turn may affect the best-fit parameters and our main conclusions.

To test this we tightened the following hyperpriors and priors listed in section (5.3) by an order of magnitude to:

  • ; the mean (in percent) on the prior for the growth rate of each country, .

  • ; the standard deviation on the prior for the growth rate of each country.

  • ; the prior on the vector of parameters of key interest.

The results of these changes are shown in Fig. (8). As expected tightening the prior on the pulled most parameters slightly closer to zero but also tighten the posterior so that none of our conclusions were altered: the statistical significance of all variables was unaltered.

5.5 Machine Learning Analysis

To provide a largely independent test of our Bayesian results we also undertook a machine learning analysis of the deaths data using random forest regression [52]

. Random forests are a powerful ensemble method that naturally provide feature selection capability, and are hundreds of times faster to run than our computationally intensive hierarchical Bayesian framework. While random forest does not provide estimates of statistical significance of factors it does allow us to rank additional factors in terms of importance and hence to explore the potential of additional explanatory features for inclusion in the main set of parameters,

, for the Bayesian analysis.

To undertake the random forest regression we used the results from our No-Factor Bayesian run (i.e. ) to obtain the base growth rates for each country and used these as the target for the random forest regression with the data as features. In the random forest analysis we augmented the data both with PM pollution and extra blood types data (namely A–, AB+, O–, AB–, B–).

Feature selection was done using the standard Random Forest impurity method[52] and by ranking variables by the impact they had on the average Root-Mean-Square-Error (RMSE) of the regression: leaving out important explanatory variables is expected to significantly degrade the performance of the algorithm, leaving out irrelevant features does not. All feature importance scores were averages over 500 random 70-30 training-test splits of the data. We found that the only variables that made more than one percent difference to the RMSE value were the Postive Rate (3.98%) and A+ (2.28%). This lead us to select A+ blood type as a variable in our separate full Bayesian analysis shown in section (5.4.1).

The results from the impurity-based feature selection are shown in Table (4). Again Positive Rate and A+ were the most significant features, followed by Tests per 1k and BCG vaccine coverage, PM and the other blood types at significantly less importance. As a result blood types, other than A+, were not included in the Bayesian analysis since additional parameters significantly increased the computational complexity of the analysis, both because of increased time to convergence and an increase in the number of models to run (since we fit both multivariate and univariate models in all cases). The distribution of a selection of feature importances is shown in Fig. (7)

In summary the machine learning analysis confirms, to the extent that they overlap, the results of our much more intensive Bayesian analysis: Positive Rate is the most important feature, followed by A+ (subject to the caveats discussed in section (5.4.1), while BCG is not relevant.

Feature Mean Std. Dev
Pos-Rate 0.36 0.14
A+ 0.20 0.11
Tests/1k 0.08 0.04
BCG 0.08 0.05
B+ 0.08 0.06
PM 0.06 0.03
A– 0.04 0.03
AB+ 0.03 0.02
B– 0.02 0.01
AB– 0.02 0.02
O– 0.02 0.02
Table 4: Mean and standard deviation of impurity-based random forest feature importances from 500 runs with 100 estimators and a test-training split. The sample density of feature importances is shown in Fig. (7). Positive Fraction is selected as the most important feature, as with the Bayesian analysis. Climatic variables (temp, humidity and UV Index) were not included in the analysis.
Figure 7: Histograms of feature importances from 500 random forest runs showing, from most important to least: Positive-Fraction (beige), A+ blood type (teal), BCG (salmon), Tests per 1k (red) and PM (dark brown).

Figure 8: Marginal distributions for the various factors in the case of the standard priors/hyperpriors and in the case where all the priors are tightened by a factor of ten. We see that the significance of the best-fit changes by less than although the means are typically shifted towards zero, other than BCG and Positive Rate. Our main conclusions are unchanged and stable to changing the priors dramatically.
Country/Region BCG (%) T (C) RH (%) Tests/1k Pos-Rate (%) A+ (%) UV
Argentina 58.5 21.6 51.0 0.8 12.7 34.3 1.3
Australia / NSW 38.9 19.7 74.2 17.3 3.4 31.0 1.4
Austria 53.1 1.3 60.3 21.5 15.0 33.0 1.0
Canada / Alberta 0.0 -3.8 74.5 15.1 7.2 36.0 0.7
Canada / BC 0.0 7.9 70.9 15.1 4.4 36.0 0.8
Canada / Ontario 0.0 -2.9 72.5 15.1 10.5 36.0 0.8
Canada / Quebec 0.0 -10.8 68.8 14.8 27.3 36.0 0.7
Chile 93.1 15.8 62.8 6.6 13.7 8.7 1.5
Colombia 86.4 14.2 90.5 1.2 12.9 26.1 2.4
Cuba 46.9 22.3 79.3 2.6 7.8 32.8 2.5
Czechia 73.5 3.0 56.7 16.8 6.7 36.0 1.0
Denmark 50.7 4.8 61.6 17.3 16.1 37.0 0.7
Estonia 27.2 1.9 65.1 32.5 5.1 30.8 0.6
Finland 80.4 0.2 68.1 11.0 10.4 38.0 0.4
France 71.4 8.6 71.6 1.5 53.3 37.0 0.7
Germany 49.7 0.8 48.0 13.8 12.0 37.0 0.8
Greece 13.9 7.6 80.1 4.8 8.8 32.9 1.1
Hungary 82.3 6.2 57.4 5.2 8.9 33.0 0.9
India 96.9 30.1 29.3 0.3 14.5 20.8 2.4
Israel 29.0 15.3 58.3 28.4 10.4 34.0 1.7
Italy 0.0 4.5 46.5 1.2 40.0 36.0 1.2
Japan 70.0 4.4 64.8 1.4 16.8 39.8 1.2
Lithuania 28.2 2.7 54.9 24.8 4.5 33.0 0.7
Luxembourg 0.0 6.7 55.5 57.9 12.5 37.0 1.0
Mexico 98.9 20.9 29.7 0.3 57.0 29.9 2.8
Netherlands 0.0 4.5 44.9 4.0 29.9 35.0 0.8
Norway 79.8 -1.8 63.0 26.7 7.3 42.5 0.6
Pakistan 77.8 16.5 63.1 0.5 22.2 20.6 1.6
Peru 96.5 24.5 86.5 4.7 35.5 18.4 1.9
Philippines 61.8 27.5 82.5 0.5 14.7 28.9 2.4
Poland 80.8 4.3 51.1 5.6 11.6 31.3 0.8
Slovenia 75.2 6.3 47.1 20.6 4.4 33.0 1.1
South Africa 79.2 21.2 43.3 2.2 4.5 32.0 1.9
South Korea 51.3 8.2 50.7 11.1 3.3 32.8 1.3
Sweden 40.9 -0.1 62.4 7.7 25.8 37.0 0.3
Switzerland 31.9 -1.0 80.3 20.6 18.1 37.0 0.8
Thailand 53.4 31.1 60.8 0.6 10.0 16.9 2.4
Turkey 92.8 5.9 57.9 3.3 40.3 37.8 1.2
US 0.0 11.8 85.0 0.9 61.0 35.7 1.2
United Kingdom 67.2 3.5 71.3 1.6 19.6 35.0 0.5
Table 5: Factor data, for all regions and countries in our study. The columns are BCG fraction, temperature (T), relative humidity (RH), tests per 1000 population (Tests/1k), positive test rate (Pos-Rate), A+ blood type prevalence and UV Index (UV). We analyse blood type separately as discussed in section (5.4.1).
Figure 9: Example fits to all 40 regions from 37 countries in our data sample. Posterior samples from the simple exponential model is shown in red, plotted up to the cutoff date, showing the data we use in our analysis. The full model with time varying growth rate is shown in green while data points are shown in blue. Only the USA had an exponent larger after than before.
Figure 10: 1, 2 and 3- contours from the marginalised posterior samples of the parameters showing that most parameters are weakly correlated aside from one positive (Pos-Rate & Tests/1k) and one negative (Temp & UVIndex) correlation. Zero values for the parameters are shown by dashed lines to help assess statistical significance.
Parameter No Factors
(68% CI) BCG Vaccine
(68% CI) Temperature
(68% CI) Relative Humidity
(68% CI) Tests / 1k
(68% CI) Positive Rate
(68% CI) A+ Blood Type
(68% CI) UV Index
(68% CI) Excl. Tests
(68% CI) Incl. Tests
(68% CI)