The COVID-19 pandemic has triggered extensive efforts to predict the severity of COVID-19 to aid in decision making around interventions such as lockdown and the closure of schools 111https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf . Regions hit hard by the pandemic, such as Wuhan, Lombardy and New York, where doubling times of 2-3 days and high crude mortality rates stand in stark contrast to other countries only mildly affected such as Hong Kong, South Korea and New Zealand.
Potential explanations for the apparent differences in the transmissivity (encoded by the time-dependent reproductive number, ) and lethality (encoded by the Infection Fatality Rate, IFR), currently fall into four broad categories. The first posits that differences are largely fictitious, driven by the heterogeneity of testing and reporting of cases and deaths; a known issue and one that we are particularly concerned with in this paper.
The next three categories posit that the differences are primarily real and are driven by (1) cultural and policy factors (swift lockdown, efficient contact tracing and quarantining, use of masks, obedient populations or social structures that are naturally distant or isolated), (2) local environmental factors (such as temperature and humidity variations, population density, comorbid factors, vaccinations, vitamin D levels, blood type etc…) or (3) existence of multiple strains with different transmissibility or lethality . Our primary interest lies in disentangling the first category (testing) from a subset of potential factors in the second category.
Finding the relative contributions of each of these four categories is key in understanding and optimally fighting the pandemic. The widely different testing capabilities between countries, particularly between the developed and developing world, imply that if not correctly treated, testing variability will create spurious evidence that can lead to false hope and sub-optimal interventions.
In the wake of the spreading pandemic there have been a host of studies that have examined the possible impact of climate -, blood type [20, 21], haplogroup , pollution - and BCG vaccination prevalence [22, 23, 24, 25] on the spread of COVID-19. Our main conclusion is that testing issues are significant and the factors above are likely not the main causes of variability in growth rates of deaths worldwide.
We note that both the basic reproductive number, , and the IFR will be affected by the four categories above in general. However, since we do not currently have access to the true number of infections we cannot address the potential effect of environmental factors on the IFR. Similarly the Case Fatality Rate cannot be used for this purpose due to testing differences around the world. We therefore focus on their potential effect on , which we quantify through the daily growth rate of deaths of countries around the world for which we have sufficiently good data.
To explore the potential impact of climate (e.g. through temperature, relative humidity and UV Index), BCG vaccination prevalence, blood type and pollution (PM) on COVID-19, one must think carefully about choice of both data and methodology.
We choose to focus on deaths instead of the number of confirmed infections in the belief that these are significantly less affected by sampling and testing issues than infection numbers: the number of tests per 1000 population (Tests/1k) currently varies by more than three orders of magnitude across the world 222http://worldometers.info/coronavirus/, making infection numbers highly biased and correlated with confounding variables such as GDP and healthcare. In countries with limited testing capability, only the more severe patients are typically tested.
Since these are also the patients most likely to die, it is likely that patients who die from COVID-19 are more likely to be tested than typical COVID-19 sufferers, who may be mostly asymptomatic. Without detailed knowledge of testing protocols in each country separating out the effects of the factors of interest (climate, BCG vaccination etc…) is extremely challenging. Although deaths are not immune to testing issues, with many missing deaths shown through excess mortality studies, especially in overwhelmed medical systems [47, 48], we argue that this is likely to be much less of an issue in the first month after initial COVID-19 deaths, which we focus on, since we are interested in .
The second key choice is whether to focus on absolute death counts or growth rates. Absolute death counts are also highly problematic. First, testing efficiencies may vary systematically from country to country. Second, the widely varying start dates of the pandemic in different countries are hard to deal with rigorously. As a result we focus on the daily growth rate of deaths, , for country during the initial phase of the epidemic. The initial daily growth correlates with , the base reproductive number in each country; in simple models , where is the period in days for which a person is infectious on average333Hence and days yields ..
Separating out the true causes of variation in or is still a highly complex problem, and we discuss our approach in detail in the additional material in section (5). Part of the complexity arises from the fact that depends explicitly both on properties of the virus and on social factors (e.g. average number and nature of contacts), as well as any of the other potential factors we wish to study. Hence, we expect there to be a large intrinsic variation in the growth of deaths from country to country, driven e.g. by the average number of people in a household, population density  etc… that may correlate with, and hence lead spurious evidence for, potential factors such as climate, vaccination coverage, blood type and pollution.
We model this complexity by allowing each country, , to have its own unique base growth rate, denoted
, that is estimated from its own data. This freedom allows the model to account for the myriad unmodelled factors specific to each country (population density, GDP, culture, health care quality etc…). However, we tie these base country growth rates,, together through a parent distribution, expressing the prior belief that there is a single dominant strain of COVID-19 globally.
Our primary goal in this study is to examine worldwide data to investigate whether climate and the BCG vaccination prevalence are important drivers of COVID-19 spread. Because of the danger of confounding variables we used the base growth rate,
for each country in a machine learning feature extraction algorithm to pick the most important additional variables to include in our computationally intensive hierarchical Bayesian analysis. This lead us to exclude PMpollution. In addition we undertook a separate analysis including A+ blood type. The data for blood type came from unpublished online sources and is therefore kept separate since it is less trustworthy.
The remaining four most promising environmental factors were Temperature (T, C), Relative Humidity (RH, %), BCG vaccination coverage (BCG, %), Ultra-Violet Index (UVIndex); the latter included as a proxy for vitamin D production. To this set we added two testing-related variables to serve as diagnostics for potential contamination from testing issues: (1) the fraction of tests that return positive (Pos-Rate, %) and (2) the number of tests per 1,000 population (Tests/1k), yielding our final set of global parameters, (T, RH, BCG, UV, Pos-Rate, Tests/1k). Due to concerns over data integrity, we study the impact of A+ blood type separately in section (5.4.1).
We then explicitly fit for using our sample of 40 regions in 37 countries for which there is data for all of the parameters in . We cannot use all the death data for countries since a constant growth rate model fails quickly for most countries due to Non-Pharmaceutical Interventions (NPIs) or nonlinear effects. To address this we extract the initial pure exponential part of the data on a country-by-country basis, as illustrated in Fig. (1) and discussed in detail in section (5), which is the only data we use in our main analysis.
The early phase death data for each country , , are then fit to the model:
where converts our growths to percentages and the are the country-specific data corresponding to .
We use a hierarchical Bayesian analysis in which each country’s growth rate is drawn from a parent distribution which allows each country’s base growth rate, , to be intrinsically different rather than forcing any differences to be due only to the parameters in the factors encoded by .
We use Monte Carlo Markov Chain (MCMC) methods to simultaneously fit the base growth rates,, for all countries, our parameters of interest,
, and the parent distribution hyperpriors, implying a large Bayesian hierarchical model with 90 parameters in total. After marginalising over all country growth rates and parent distribution parameters, we are left with the marginal distributions on the, our parameters of interest, providing our main results.
We assess the importance of each of the parameters both through standard model selection metrics such as the Bayesian evidence and Bayesian Information Criterion (see Table 1) and by computing the statistical significance with which the parameters deviate from zero in the marginalised posterior chains. We now discuss these results.
3 Results and Discussion
Figure (4) shows the main results of our paper: only the parameter associated to the positive rate or fraction of tests (Pos-Rate) is non-zero at more than . This result is stable to significant changes in our priors and hyperpriors and to including or removing the other parameters in in the analysis. In addition, the positive rate is selected as the most important variable by all model-selection metrics (see Table 1) and by our machine learning analysis, section (5.5). A plot of regional death growth rates versus positive rate is shown in Fig. (2) showing the clear correlation.
The simplest explanation of this result is that regions that experience the most rapid spread of the disease, i.e. those with the largest , were also the regions that were least able to keep up with testing demands and hence where the rate of Polymerase Chain Reaction (PCR) tests used for COVID-19 returning positive were the highest on average. This is then not a cause, but rather a result of, high growth rate. We find that Pos-Rate correlates positively with Tests per 1000 population (see Fig. 10): high death growth rate and growing positive rate likely spurred increased testing. Neither of these observations help us identify the cause of increased death growth rates, however.
Temperature is the only other variable which is non-zero at more than in our multivariate fits (see section 5.4.3 for more discussion) and relative humidity the only other parameter non-zero at more than , providing some weak evidence for climatic impact on the spread of the disease. Increasing temperature and humidity tends to decrease the spread of COVID-19, in agreement with previous studies.
In our model-selection metrics, where we compare fits with one variable at a time, relative humidity is preferred over temperature by the Akaike Information Criteria (AIC) and Bayesian Information Criteria (BIC) and is the only variable other than Pos-Rate, that has a BIC value more than 2 units lower than the No Factor model (which has )444 BIC is the standard demarcation of positive evidence .. The remaining parameters have poor or mixed results relative to the No Factor model, and all perform very poorly relative to the model with just the positive rate.
In particular, our regression and machine learning analyses provide no evidence in support of BCG fraction. This is not surprising if we look at figure (3) which plots regional death growth rates versus BCG coverage: there is no discernible correlation. Our results are therefore in agreement with [24, 25]. Further we do not find any correlation between UV index and death growth rate. This is pertinent since UV index is relevant to natural production of vitamin D  and vitamin D has been suggested as a protective factor against the spread of the disease [35, 36, 37],
Now we briefly discuss results for the other testing variable, Tests / 1k. We find that this is not a significant correlate of growth rate. This is perhaps not unexpected. If testing efficiency in each country is approximately constant, i.e. the rate of true infections detected changes only slowly with time, then this has little impact on the growth rate of deaths. It is only if the testing efficiency varies rapidly with time that we would expect this to be significant. Our results suggest that this was not the case, at least within the initial phase of the epidemic in each countries. Neither the Bayesian, nor the machine learning analyses found the number of tests per 1000 population to be significant. We present technical details of the data, algorithms and analysis in section (5).
Finally, in our separate analysis of blood types discussed in sections 5.4.1 and 5.5 we analyse the potential correlation of blood type with growth rate. We find marginal evidence for A+ blood type being relevant, subject to the caveats discussed in section (5.4.1).
How should our results be taken in the context of the many claims of climate, blood, BCG etc… being significant factors in COVID-19 spread? First, many studies were based on confirmed COVID-19 test cases which, as we discussed earlier, are affected by differences in testing capability and protocols between countries.
Secondly, many of the studies present regressions that do not allow for unmodelled confounding sources of variation in the growth. Hence, if a country shows high growth the algorithm will try to force one of the potential factors under study to explain it. Instead the hierarchical Bayesian framework allows the base growth rate of each country to be different, and hence potential factors will only be given credit for the difference in the growth rate if they provide a genuinely better fit.
Further, many studies do not model the intrinsic uncertainties associated with the data as we have done. We too find that the best-fits are non-zero (as can be seen by looking at the peaks of the posteriors in Figure (4) or at the last rows of Table (6). Hence our results are not in disagreement with regression results, the issue is about the statistical significance of such claims.
Finally, although we do not detect environmental factors at more than , it is interesting to examine how big the environmental factors would be if our best-fit parameter values describe reality: a increase (decrease) in temperature implies a decrease (increase) in the base daily growth rate while a increase (decrease) of 20% in relative humidity would mean a decrease (increase) in base daily growth rate. The decreased spread at higher temperature and humidity agrees with previous work .
This may not seem significant, but for a city such as Johannesburg, where both temperature and humidity drop significantly in winter, the combined effect could add more than to the base daily growth rate. For a daily growth rate of , which was approximately the value in May 2020, this would halve the doubling time of the disease, a significant impact.
Contrary to previous claims our analysis of growth rates for deaths from countries worldwide are consistent with no effect from climate, pollution or BCG vaccination. The only significant correlation detected is with the positive rate of tests: a country that intrinsically had a high (due to high population density etc…), would naturally tend to be more overwhelmed and hence run low on testing kits earlier, leading to increased fraction of positive tests. We did find some weak suggestive evidence, at , that temperature and relative humidity correlate with death growth rates.
A separate analysis of blood type data shows that A+ type is the most important blood type, though the significance is marginal, both because the data quality is low and the statistical significance is weak. Our combined statistical and machine learning analysis finds no evidence for PM pollution, other blood types or UV Index as drivers of COVID-19.
More data could be obtained by dropping the requirement that all countries in the sample have data for all the potential factors, which could potentially allow for some of the effects to be detected at higher statistical significance but at the cost of making model comparison significantly more difficult.
The data and code for our Bayesian analysis is available at https://github.com/chrisfinlay/covid19/.
We thank Michael van Niekerk for extracting data from BCG ATLAS, Niayesh Afshordi, Ewan Cameron, Inger Fabris-Rotelli, Ben Holder and Nadeem Oozeer for discussions and comments.
This research has been conducted using resources provided by the United Kingdom Science and Technology Facilities Council (UK STFC) through the Newton Fund and the South African Radio Astronomy Observatory.
|Tests / 1k||15.2||15.2||28.6|
|All Excl. Tests||125.9||137.0||127.4|
5 Additional Material
5.1.1 Data on deaths
We chose to look at the growth rate in deaths as they are less likely than confirmed cases to be affected by the widely varying testing protocols between countries. The cumulative death data for each country comes from the Centre for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) .
We cannot, however, take all data on deaths for all times since it is clear that a simple exponential model fails quickly: as soon as interventions occur the simple exponential model fails and we will obtain contaminated estimates of the growth rate due to the flattening of the curve which will skew our analysis; see e.g. Fig. (1). As a result we want to know which data should be included in our analysis for each country. This reduces to knowing the time, , at which the growth of deaths deviates from a simple exponential. We then only use data at for country .
Doing this cut by hand introduces the possibility of human bias. Instead we compute for each country by fitting a non-linear sigmoid transition function to the growth rate of the death data of each country over time:
This function allows the growth rate to smoothly transition from an initial value (which we are interested in) to a final value, reflecting the impact of social interventions or nonlinearity in the system.
In this first phase of the analysis each country, , therefore has 5 parameters () to describe the trajectory of its deaths, where and represent the change in growth rate and suddenness of the transition with time from to for each country, (the index, , has been left out in Eq. (3) below):
To maximise the probability that this is a good fit to the data the non-linear model in Eq. (3
) we only consider data up to the time when a country passes 1000 deaths. All five parameters are treated hierarchically with their own parent distribution where the means are Normally distributed and the standard deviation is HalfNormal distributed.
We only use data from in our main analysis (see section 5.2), where is given by the marginalised mean from the chains. Typical values for were around 15 days. We excluded countries where because of concerns about quality of the underlying model fit in such cases, i.e. a simple exponential model was not a good fit even at early times which could lead to spurious growth rates.
The resulting fits to country death data for both the full 5-d model (green) and just the region are given in Figure (9), showing that the technique performs well in isolating the initial exponential growth from a first-principles approach.
Once we have for each country we further cut our data by using the following rules:
We only use countries for which there were a minimum of 20 deaths by 25 April 2020.
We choose day zero as the first day a country passes 3 deaths. This leads to typical starting numbers of deaths of around 5.
Data for our variables of interest (Climate, BCG, etc…) were not available for all countries. To ensure the same amount of data for all models only countries with data for all our parameters in were included.
We did not model the potential correlations and interactions of our parameters with these data cuts. This could potentially alter our conclusions: if a parameter were extremely important in determining growth rates, then countries with very small numbers of deaths would systematically not make it past our data cuts and hence the signal from that data would be lost. However, it is unclear how to model this censorship rigorously and it is left to future work.
After these data cuts we were left with 40 provinces in 37 countries with 613 data points in total, shown in Figure (9). The only country with more than one province was Canada which include Alberta, Ontario, Quebec and British Columbia (BC).
5.1.2 Climate Data
For each region and country the climate data was gathered from the Dark Sky API, , where the locations sampled are taken from the latitudes and longitudes given for each country or region in the JHU COVID-19 data. For each country, , the mean temperature, mean relative humidity and mean UV Index are calculated as an average over a N day window starting 28 days prior to day 0 for each region, where N is the number of days of deaths data for that country. The latter is chosen as an estimate of the average time from infection to death 555https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30243-7/fulltext. Since mean climate variables change relatively slowly the exact delay is not important. See Fig (9) for the distribution of times. We used UV Index as a proxy for natural vitamin D production. Data for all regions and countries is shown in Table (5).
5.1.3 BCG Vaccine Coverage
Bacillus Calmette–Guérin (BCG) vaccination policies have been in place in many countries across the world starting from widely differing start dates. We would like to estimate the percentage of the population in each country that has received a BCG vaccine in their lifetime. For this purpose we need the age demographics of each country as well as the dates when BCG vaccinations became/stopped being mandatory.
To estimate the percentage of the population vaccinated by BCG we need to draw on three sources of data. These are the BCG Atlas, World Health Organisation (WHO) BCG vaccination rates amongst 1-year olds and the age demographics for each country from the United Nations (UN). Firstly we looked to the BCG Atlas . This is a heterogeneous dataset which implies that not all information was available for each country. We collected the following fields from the BCG Atlas:
Current BCG vaccination?
Which year was vaccination introduced?
Year BCG stopped?
Year of changes to BCG schedule
Details of changes
BCG coverage (%)
We did not use the last field directly due to missing information on this field, specifically if it was for a small age range of the population or its entirety. For some countries data was missing from fields 1-3. In these cases, where appropriate and possible, the missing data was obtained from fields 4 and 5.
Age demographics for each country were used to compute the expected fraction of the population who have had the BCG vaccination. In 15% of cases the BCG Atlas did not have data and we instead used WHO data  to perform the estimate. The derived BCG fractions and the origin of the data, are shown in Table (2) while a plot of the BCG fractions versus growth rates are shown in Fig. (3).
There is one caveat here: our BCG coverage is the estimate of the percentage of the population of each region and country that has had the BCG vaccination. This is arguably not the optimal quantity to use in our analyses however; it might be better to use the fraction of vaccinated population weighted by the probability of infection as a function of age. However, the latter is currently unknown and hard to compute even in the best of situations: how can we know how many people were exposed but never got infected? The large scatter in Figure (3) suggests that this is unlikely to make a significant difference to our conclusions that BCG vaccine is not important in the spread of COVID-19.
|Country||BCG Coverage||Data Source|
|Australia / NSW||38.9 %||BCG ATLAS|
|Austria||53.1 %||BCG ATLAS|
|Canada / Alberta||0.0 %||BCG ATLAS|
|Canada / BC||0.0 %||BCG ATLAS|
|Canada / Ontario||0.0 %||BCG ATLAS|
|Canada / Quebec||0.0 %||BCG ATLAS|
|Chile||93.1 %||BCG ATLAS|
|Colombia||86.4 %||BCG ATLAS|
|Czechia||73.5 %||BCG ATLAS|
|Denmark||50.7 %||BCG ATLAS|
|Finland||80.4 %||BCG ATLAS|
|France||71.4 %||BCG ATLAS|
|Germany||49.7 %||BCG ATLAS|
|Hungary||82.3 %||BCG ATLAS|
|India||96.9 %||BCG ATLAS|
|Israel||29.0 %||BCG ATLAS|
|Italy||0.0 %||BCG ATLAS|
|Japan||70.0 %||BCG ATLAS|
|Korea, South||51.3 %||BCG ATLAS|
|Luxembourg||0.0 %||BCG ATLAS|
|Mexico||98.9 %||BCG ATLAS|
|Netherlands||0.0 %||BCG ATLAS|
|Norway||79.8 %||BCG ATLAS|
|Pakistan||77.8 %||BCG ATLAS|
|Peru||96.5 %||BCG ATLAS|
|Poland||80.8 %||BCG ATLAS|
|Slovenia||75.2 %||BCG ATLAS|
|South Africa||79.2 %||BCG ATLAS|
|Sweden||40.9 %||BCG ATLAS|
|Switzerland||31.9 %||BCG ATLAS|
|Thailand||53.4 %||BCG ATLAS|
|Turkey||92.8 %||BCG ATLAS|
|US||0.0 %||BCG ATLAS|
|United Kingdom||67.2 %||BCG ATLAS|
5.1.4 Additional data
As described earlier our basic regression model for the deaths in country at day is:
which is assumed valid for , as described before. is measured in percent and we model it’s potential dependence on our variables of interest as:
where and are the country-specific base growth rate and data for climate, BCG etc…, shown in Table (5) and are the global parameters we are interested in. In general the could be time-dependent. In this analysis, because of missing data and the fact that our data for each region typically spans a short period ( is less then 3 weeks as discussed in section 5.1.1), we use average values for .
Our goal is to determine if any of the parameters are rigorously required to be non-zero by the data. One limitation of Eq. (5) is that it is linear in the country data . We justify this by noting that the growth rates and the underlying data vary over narrow ranges, so that retaining only the linear terms in the Taylor series expansion of is a reasonable step. For the temperature variables we have verified that assuming instead a step change in the growth at around C with two hierarchical growth parameters, one on each side of the step, did not lead to any increase in significance in the detection of temperature as an effect.
5.3 Posterior model and sampling
We write a hierarchical Bayesian probabilistic model by assuming a Poisson likelihood, suitable for count data, with mean given by the deterministic forward model defined in Eq. (5), together with priors and hyperpriors for all our 90 parameters, as shown in the schematic graphical representation in Fig. (5); see e.g. . We do not try to model missing deaths due to testing irregularities. As long as the fraction of missing deaths remains approximately constant in the early phase of the spread within the country that we consider this should have little impact on our results.
Since our data covers 37 countries and 40 provinces worldwide and we know that the growth rate of the disease will depend both on properties of the disease (which are universal), and properties specific to each country (e.g. culture and population density) it is natural to model the data with a hierarchical Bayesian structure which allows growth rates to vary somewhat from country to country but to also be somewhat similar between countries.
The priors and hyperpriors for each variable to be estimated is chosen to be:
; the hyperprior for the mean on the prior for initial deaths, , for each country ( as part of our data cuts).
; the standard deviation on the prior for initial deaths for each country.
; the mean (in percent) on the prior for the growth rate of each country, .
; the standard deviation on the prior for the growth rate of each country.
, the prior on the initial deaths parameter for each country, . All countries together will be denoted by .
; the prior on the growth rate parameter for each country, . All countries together will be denoted by .
; the prior on the vector of parameters of key interest.
The complete vector of all parameters is therefore:
The joint prior over our full set of parameters is assumed factorizable, i.e. a product of the prior distributions listed above.
We assume our data to be statistically independent both between countries and from day-to-day. Our model simultaneously fits the death data from all countries. We use NumPyro for Monte Carlo Markov Chain (MCMC) probabilistic sampling from our posterior using the No U-Turn Sampling (NUTS) algorithm . NumPyro has diagnostic tools built in and allows for easy running on accelerators, such as GPUs, which were key in being able to iterate quickly through our many models including all the multivariate and univariate combinations, as well as looking at the effect of priors.
The MCMC simulations were generally run as 4 independent chains with each chain starting from a random position sampled from the prior for our parameters and run for 2000 steps in order for length scales (and other sampling options) to be determined automatically by NumPyro. After this initial “burn-in" each chain was typically run for 5000 samples. Chains were long enough to ensure that the Gelman-Rubin convergence test was always less than 1.01.
Each chain was then thinned by a factor of two in order to further improve independence of samples. This leads to the final 10k samples collected for each model and led to good, converged, trace plots and posteriors; see Fig. (4). We used ChainConsumer666https://samreay.github.io/ChainConsumer/ for the analysis of the chains, the model selection metrics in Table (1) and some of the plots .
5.4 Additional Bayesian Results
5.4.1 A+ Blood Type Analysis
The data for blood type prevalence comes from online heterogeneous collections of published and unpublished data sources covering a wide range of publication dates. As a result, data integrity is an issue and we have chosen to separate the blood analysis from our main Bayesian analysis.
A random forest analysis finds only A+ blood type as potentially relevant amongst the different ABO blood types (see section5.5). Here we present the results of both univariate and multivariate A+ blood type Bayesian analyses using the formalism described in section (5.2).
Fig. (6) shows the result of the Bayesian analysis for the A+ blood type coefficient: it is consistent with zero in both the univariate and multivariate cases. If we look at the AIC, BIC and DIC for A+ blood type we find values of: compared to the no-factor models results of . The A+ AIC and BIC are better than the no-factor model, but the DIC value is worse. In addition, the random forest analysis found A+ blood type to be more important than both BCG and and Tests/1k.
As a result we conclude that there is somewhat conflicting evidence regarding the potential effect of A+ blood type, but that none of the evidence is strong.
5.4.2 Effect of on Base Parameters
In the results section we focused on the best-fit values of . The reverse question is interesting too: what is the impact of including the parameters on the hyperprior parameters, mean, , and standard deviation, , of the base growth rates, , for the 40 regions? If the
do explain some of the variation we would expect the standard deviation on the base growth rate (parent distribution) to get smaller, which is captured by the hyperparameter. The mean of the base growth rates (parent distribution), given by , are nothing more than an offset dictating the value when all factors are zero i.e. .
This is what we see: in the No-Factor model which changes to when we allow all the to vary, a significant shrinkage. On the other hand, if we include all parameters except for the two testing parameters, returns to , while when we add the positive rate (Tests/1k) parameters alone; showing that it is primarily the positive rate parameter that drives the shrinkage in the uncertainty in the parent distribution on the base growth rates.
5.4.3 Comparison of univariate and multivariate fits
To assess the stability of our results we fit each of our key parameters in both alone (univariate) and simultaneously with all the other parameters (multivariate). As shown in Table (3) and Fig. (4), the results are quite stable. Positive rate and temperature are the still the most significantly parameters in both cases: the positive rate is non-zero at more than in both cases while the significance for temperature increases from to when going from the univariate to multivariate fit. Relative Humidity also increases in significance in the multivariate case.
To access correlations between parameters we show in Fig. (10) the one and two-dimensional marginalised posterior plots. Correlations are typically small, though there are is some small positive correlation between Tests per 1k and Positive Rate, and a small negative correlation between Temperature and UV Index.
Our full results for all the hyperpriors, the base country growth rates, and the parameters for each of the different univariate and multivariate models are shown in Table (6).
|BCG Vaccine||0.01 0.04||0.00 0.04|
|Temperature||-0.29 0.14||-0.19 0.13|
|Relative Humidity||-0.10 0.08||-0.09 0.09|
|Tests / 1k||-0.11 0.12||-0.14 0.11|
|Positive Rate||0.25 0.08||0.27 0.08|
|UV Index||-0.01 0.95||-0.47 0.89|
5.4.4 Varying Hyperpriors
Since our goal in this analysis is to assess whether there is evidence for external factors such as climate, blood type etc… in addition to known country-specific factors, an important internal check is to look for potential sensitivity to our priors and hyperpriors.
In our hierarchical analysis, data for each country tries to pull the measured growth rates of countries to their own best-fit values, while the hierarchical nature of our model described in section (5.3) attempts to pull them all together. Between this tug of war, the algorithm searches for joint values of that will improve the fits to all the data. One concern might be that if we make the hyperpriors on the parent distributions much stronger or much weaker, we allow the algorithm to given less or more freedom to the base growth rates which in turn may affect the best-fit parameters and our main conclusions.
To test this we tightened the following hyperpriors and priors listed in section (5.3) by an order of magnitude to:
; the mean (in percent) on the prior for the growth rate of each country, .
; the standard deviation on the prior for the growth rate of each country.
; the prior on the vector of parameters of key interest.
The results of these changes are shown in Fig. (8). As expected tightening the prior on the pulled most parameters slightly closer to zero but also tighten the posterior so that none of our conclusions were altered: the statistical significance of all variables was unaltered.
5.5 Machine Learning Analysis
To provide a largely independent test of our Bayesian results we also undertook a machine learning analysis of the deaths data using random forest regression 
. Random forests are a powerful ensemble method that naturally provide feature selection capability, and are hundreds of times faster to run than our computationally intensive hierarchical Bayesian framework. While random forest does not provide estimates of statistical significance of factors it does allow us to rank additional factors in terms of importance and hence to explore the potential of additional explanatory features for inclusion in the main set of parameters,, for the Bayesian analysis.
To undertake the random forest regression we used the results from our No-Factor Bayesian run (i.e. ) to obtain the base growth rates for each country and used these as the target for the random forest regression with the data as features. In the random forest analysis we augmented the data both with PM pollution and extra blood types data (namely A–, AB+, O–, AB–, B–).
Feature selection was done using the standard Random Forest impurity method and by ranking variables by the impact they had on the average Root-Mean-Square-Error (RMSE) of the regression: leaving out important explanatory variables is expected to significantly degrade the performance of the algorithm, leaving out irrelevant features does not. All feature importance scores were averages over 500 random 70-30 training-test splits of the data. We found that the only variables that made more than one percent difference to the RMSE value were the Postive Rate (3.98%) and A+ (2.28%). This lead us to select A+ blood type as a variable in our separate full Bayesian analysis shown in section (5.4.1).
The results from the impurity-based feature selection are shown in Table (4). Again Positive Rate and A+ were the most significant features, followed by Tests per 1k and BCG vaccine coverage, PM and the other blood types at significantly less importance. As a result blood types, other than A+, were not included in the Bayesian analysis since additional parameters significantly increased the computational complexity of the analysis, both because of increased time to convergence and an increase in the number of models to run (since we fit both multivariate and univariate models in all cases). The distribution of a selection of feature importances is shown in Fig. (7)
In summary the machine learning analysis confirms, to the extent that they overlap, the results of our much more intensive Bayesian analysis: Positive Rate is the most important feature, followed by A+ (subject to the caveats discussed in section (5.4.1), while BCG is not relevant.
|Country/Region||BCG (%)||T (C)||RH (%)||Tests/1k||Pos-Rate (%)||A+ (%)||UV|
|Australia / NSW||38.9||19.7||74.2||17.3||3.4||31.0||1.4|
|Canada / Alberta||0.0||-3.8||74.5||15.1||7.2||36.0||0.7|
|Canada / BC||0.0||7.9||70.9||15.1||4.4||36.0||0.8|
|Canada / Ontario||0.0||-2.9||72.5||15.1||10.5||36.0||0.8|
|Canada / Quebec||0.0||-10.8||68.8||14.8||27.3||36.0||0.7|
|(68% CI)||BCG Vaccine|
|(68% CI)||Relative Humidity|
|(68% CI)||Tests / 1k|
|(68% CI)||Positive Rate|
|(68% CI)||A+ Blood Type|
|(68% CI)||UV Index|
|(68% CI)||Excl. Tests|
|(68% CI)||Incl. Tests|