Systematic Biases in Aggregated COVID-19 Growth Rates

The COVID-19 pandemic has emerged as one of the greatest public health challenges of modern times. Policy makers rely on measuring how quickly the disease is spreading to make decisions about mitigation strategies. We analyze US county-level data about confirmed COVID-19 infections and deaths to show that its impact is heterogeneous. A small fraction of counties represent the majority of all infections and deaths. These hot spots are correlated with populous areas where the disease arrives earlier and grows faster. When county-level data is aggregated to create state-level and national statistics, these hot spots systematically bias the growth rates. As a result, infections and deaths appear to grow faster at those larger scales than they do within typical counties that make up those larger regions. Public policy, economic analysis and epidemic modeling have to account for potential distortions introduced by spatial aggregation.









The COVID-19 pandemic has spread rapidly around the globe, claiming hundreds of thousands of lives and wreaking havoc on the economy. Public health experts and policy makers often consider how quickly the disease is spreading within their communities when they decide when and how to enforce mitigation strategies, such as lockdowns and business closures. When the disease is spreading rapidly, with new cases doubling every two to three days, municipalities and states are likely to order residents to shelter at home to slow the spread of the disease, but after the growth has slowed sufficiently, they may consider lifting restrictions.

We show that estimated COVID-19 growth rates can be biased due to data aggregation. Specifically, using county-level COVID-19 infections and deaths data gathered by the New York Times [1], we show that spatial aggregation can systematically skew the estimated growth rates. As a result, the disease appears to grow faster at the state level than it does within most counties in that state. Similarly, the growth rate at the national level may overestimate how quickly the disease is spreading in most places within the country.

We explain the origins of the aggregation bias. Because the disease begins spreading in different places within the country at different times, the varying ages of outbreaks, coupled with differences in growth rates, create a heavy-tailed distribution of the number of infections and deaths. A small fraction of counties where the disease is growing faster (hot spots) represent the majority of all infections and deaths. When data is aggregated to create state-level and national statistics, these hot spots systematically bias the calculated growth rates. The growth rates in the number of deaths and infections are correlated with population density, and as a result, the hot spots are typically large cities. New York City is one such hot spot that has seen a heavy toll. However, the city is not exceptional. Hot spots at finer spatial resolutions, such as prisons and nursing homes, may skew the county-level disease data.

Epidemic modeling and public health policy need to focus on understanding and mitigating the effects of local hot spots. Similarly, economic analysis has to consider biases introduced by data aggregation. When calculating the costs and benefits of lockdowns, for example, analysts must account for aggregation bias potentially magnifying deaths and infection rates in some states.


Figure 1: Infection prevalence and growth. (a) The number of infections has a heavy-tailed distribution for both states and counties, with the most cases in New York and New York City, respectively. A stochastic model discussed in the main text captures the behavior of the distribution. (b) The growth of COVID-19 in New York state and the ten counties with the most cases as of April 14. (c) Growth rates in states are higher on average than growth rates in counties. This finding is also captured by the simulation. Findings are qualitatively similar for infection data (see Supplementary Figure 3).

Figure 1 demonstrates the impact of COVID-19 in the US. The number of deaths in each county has a heavy-tailed distribution (Figure 1a). This means that the disease’s impact varies enormously between places, with some regions almost unaffected and others hit hard by the pandemic. For example, New York City accounts for the bulk of all confirmed infections in New York state, which accounts for the bulk of all US cases.

How does the large variation in the impact arise? Figure 1b shows the growing toll of the disease in New York state and its ten hardest-hit counties. The growth in the number of deaths within each county (and state) can be roughly modeled by an exponential, which allows us to estimate the growth rate (agreement with exponential fits has been checked in Supplementary Figure 6). There is a fairly broad distribution of death growth rates for counties and states (Fig. 1c). However, this by itself is not enough to explain the heterogeneity of COVID-19 impact. In addition, we must consider that the disease appears in each county at a different time and has been growing for a different period of time. As a result of the different ages of local outbreaks across the US, the number of deaths has a long-tailed distribution (and similarly for the number of infections, Fig. 3, despite differences in the availability of testing).

This phenomenon is related to the Reed-Hughes mechanism [2], which explains how exponentially growing populations of different ages produce a power-law distribution of population sizes. However, the Reed-Hughes mechanism specifies that populations have the same growth rate and begin growing uniformly in time. In contrast, the start time of the outbreak in each county is approximately normally distributed, as is the growth rate. To validate the changes in the mechanism, we create synthetic data in which simulated counties have outbreaks that start at times drawn at random from a normal distribution, with growth rates chosen from another normal distribution, and coefficients drawn from a log-normal distribution (all distribution parameters are fit to data). Synthetic outbreaks within our simulated counties follow a heavy-tailed distribution (blue line in Fig. 1a) similar to the empirical distribution for counties. The situation is somewhat more complex for states. Simply dividing counties across states at random, so that each simulated state ends up aggregating cases over the same number of counties (the mean number of counties in a state), creates a very sharply peaked distribution, unlike what we observe in data. Instead, we divide up counties non-uniformly across states such that the number of counties in these simulated states matches the true distribution. We then re-create the heavy-tailed distribution of the number of deaths for states (orange line in Fig. 1a). These results demonstrate that large heterogeneity in the data creates an outcome qualitatively similar to that of the Reed-Hughes mechanism, and we now show that this heterogeneity has an enormous impact on the current crisis within each county and state.
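The synthetic-data procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation: every distribution parameter below (start-day mean and spread, growth-rate mean and spread, log-normal coefficient, number of counties and states) is a made-up placeholder rather than a value fit to data, and the uneven split into states uses random cut points instead of the true distribution of counties per state. The helper names are hypothetical.

```python
import math
import random
import statistics

def simulate_counties(n_counties, rng):
    """Draw (start_day, rate, coeff) per county; all parameters are illustrative."""
    counties = []
    for _ in range(n_counties):
        start = rng.gauss(60.0, 10.0)              # outbreak start day ~ normal
        rate = max(rng.gauss(0.08, 0.03), 0.005)   # log10 growth per day ~ normal, kept positive
        coeff = rng.lognormvariate(0.0, 0.5)       # scale coefficient ~ log-normal
        counties.append((start, rate, coeff))
    return counties

def count_on(county, day):
    """Exponential count; before the start day the value falls below coeff and is negligible."""
    start, rate, coeff = county
    return coeff * 10 ** (rate * (day - start))

def window_growth(counties, day, window=7):
    """Mean log10 growth per day of the summed count over the last `window` days."""
    now = sum(count_on(c, day) for c in counties)
    before = sum(count_on(c, day - window) for c in counties)
    return (math.log10(now) - math.log10(before)) / window

def split_into_states(counties, n_states, rng):
    """Partition counties into groups of uneven size using random cut points."""
    cuts = sorted(rng.sample(range(1, len(counties)), n_states - 1))
    bounds = [0] + cuts + [len(counties)]
    return [counties[a:b] for a, b in zip(bounds, bounds[1:])]

def top_share(values, fraction=0.05):
    """Share of the total held by the largest `fraction` of entries."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)
today = 100
counties = simulate_counties(2000, rng)
states = split_into_states(counties, 50, rng)

# A single county grows at exactly its own rate, so the rates are directly comparable.
county_rates = [rate for _, rate, _ in counties]
state_rates = [window_growth(s, today) for s in states]
national_rate = window_growth(counties, today)
hot_spot_share = top_share([count_on(c, today) for c in counties])

print("median county rate:", round(statistics.median(county_rates), 3))
print("median state rate: ", round(statistics.median(state_rates), 3))
print("national rate:     ", round(national_rate, 3))
print("share of counts in top 5% of counties:", round(hot_spot_share, 2))
```

Under these placeholder parameters, a small share of simulated counties holds most of the total count, and the aggregated state and national growth rates sit above the median county rate, which is the qualitative pattern the text describes.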

Figure 2: Correlates of COVID-19-related death rates. (a) Death rates are positively correlated with the number of deaths as of April 14, for both counties and states. (b) Death rates are correlated with population density, for both counties and states.

Figure 2a shows that the growth rate is correlated with the total number of deaths (as of April 14) for both counties and states, as measured by Spearman rank correlation. As a result of this correlation, the fastest-growing hot spots dominate the statistics, making deaths appear to grow faster in a state than in a typical county within that state. Similarly, the national death growth rate appears to be higher than the death growth rates within constituent states and counties (Fig. 1c). The systematic overestimation in aggregated death counts is an example of the Modifiable Areal Unit Problem [3], a statistical bias similar to Simpson’s paradox [4], that results in varying statistical trends at different levels of aggregation. A similar discrepancy can be observed in the correlations between death growth rate and how long the disease has been spreading: the correlation is not statistically significant for counties, yet it is positive for states.
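One way to see why hot spots dominate is the following short derivation. It is an illustrative sketch under the idealized assumption that each county grows as a pure exponential, and it uses the instantaneous growth rate of the sum rather than the fitted slope used in the analysis; the symbols $N_i$, $a_i$, $r_i$, $t_i$, and $w_i$ are introduced here for illustration only.

```latex
% County i: N_i(t) = a_i e^{r_i (t - t_i)}, with growth rate r_i and start time t_i.
\begin{align}
N(t) &= \sum_{i=1}^{n} N_i(t), \\
r_{\mathrm{agg}}(t) &= \frac{d}{dt}\ln N(t)
  = \frac{\sum_i r_i N_i(t)}{\sum_i N_i(t)}
  = \sum_i w_i(t)\, r_i, \qquad w_i(t) = \frac{N_i(t)}{N(t)}, \\
r_{\mathrm{agg}}(t) - \bar{r} &= \sum_i \left(w_i(t) - \tfrac{1}{n}\right)\left(r_i - \bar{r}\right),
  \qquad \bar{r} = \frac{1}{n}\sum_i r_i .
\end{align}
```

The aggregated growth rate is therefore a count-weighted average of the county growth rates. Whenever counties with larger counts also tend to have higher growth rates, the last sum is positive and the aggregate grows faster than the unweighted average county.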

Figure 2b shows that death rates are correlated with population density for both counties and states. As with deaths, growth rates of new infections (confirmed cases) are well correlated with population density for both counties and states. While it seems intuitive that denser places with more interpersonal interaction are at a greater risk for spreading the disease, this may not be the entire explanation. Instead, denser places are more likely to have local hot spots. In the US, prisons and nursing homes have proven to be local hot spots where the disease spreads unchecked. Denser regions could simply be more likely to have such local hot spots.

The results are quantitatively similar if we explore population (see Supplementary Figure 5), similar to what was shown in another paper [5]. We also find that the number of infections (instead of the growth rate) correlates with population density (correlation of 0.70 for states) and, to a larger extent, with population size, and does not correlate as well with the age of the outbreak, for both counties and states.
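For readers who want to reproduce this kind of rank correlation, the sketch below computes a Spearman coefficient using only the Python standard library. The density and growth-rate values are invented for illustration and are not taken from the data; the function names are hypothetical, and the sketch does not compute p-values.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: population density and fitted growth rate for five places.
density = [12.0, 85.0, 240.0, 1100.0, 27000.0]
growth = [0.04, 0.07, 0.06, 0.09, 0.15]
print(round(spearman(density, growth), 3))
```

Because the coefficient depends only on ranks, it is unchanged by any monotonic rescaling of either variable, such as taking the logarithm of population density.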


The impact of COVID-19 in the US is highly heterogeneous, with many regions seeing few infections and deaths, while a handful of regions are greatly affected. The heavy-tailed distribution of impact has important implications for policy makers. First, aggregating data at the state or country level only tells us what is happening in a few infection hot spots where the disease is far more prevalent, typically large counties. As a result, the disease will appear to grow systematically faster at the state and country level than within individual counties. Since the order in which infections appear is likely dictated by patterns of mobility [6] and therefore difficult to control, the best way to reduce the overall infection and death rates is to reduce the growth rate in hot spots, e.g., through early social distancing measures.

Analysis of the effects of interventions, such as lockdowns and other mitigation strategies, has to account for potential biases introduced by data aggregation. Local hot spots may effectively amplify the rates of the disease for some regions (i.e., states and countries), obscuring the benefits of early interventions.

From the modeling perspective, since epidemic statistics are driven by a few hot spots (typically large, dense cities), compartmental models [7] may be most effective for modeling the disease. The assumptions made by compartmental models, namely uniform mixing of populations, are best aligned with mobility patterns in cities that regularly bring people in contact with one another  [8]. Compartmental models typically have fewer fitting parameters than spatio-temporal models [6, 9, 10], and therefore, may be better at making intermediate-range forecasts [11].

Future work is needed to understand how these results generalize to other diseases. We see, for example, that growth rate is negatively correlated with population density for diseases such as Ebola [9], potentially due to a lack of healthcare infrastructure. But this may be a special case, due to the impoverished countries that were affected. Moreover, it is important to test the Reed-Hughes-like statistical model for other diseases and countries to see the degree to which it can help explain infection hot spots.

Methods and Materials

Data on cumulative COVID-19 infections is obtained from the New York Times [1] as of April 14, 2020. We also collect population and area within each county and state from the US Census, where population estimates are as of July 2019. States are defined as those with official statehood as well as the District of Columbia. Counties are defined the same as in the census except for New York City, where all boroughs are combined, and Kansas City, where the population and area are calculated separately. Because Kansas City overlaps with other Missouri counties, we do not remove the city area from our estimates of county areas. We do not expect a significant change in our results due to this decision.

Growth rates were calculated by taking the log base 10 of the cumulative infections (and deaths) and fitting a line. For these fits, data below 11 infections (deaths) are removed to reduce the effects of outliers. In addition, we only fit data with more than 5 datapoints. Calculations of

are based on this log-scaling of data.
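The growth-rate fit described in this section can be sketched as follows. This is a minimal illustration assuming one cumulative count per day and an ordinary least-squares line; the function names and the example series are hypothetical, and the thresholds follow the text (counts below 11 are dropped and more than 5 points are required).

```python
import math

def fit_growth_rate(cumulative, min_count=11, min_points=6):
    """Least-squares slope of log10(cumulative count) versus day index.

    Returns None when too few usable points remain after thresholding.
    """
    # Keep only days at or above the threshold (data below 11 are removed).
    points = [(day, math.log10(c)) for day, c in enumerate(cumulative) if c >= min_count]
    if len(points) < min_points:  # require more than 5 datapoints
        return None
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

def doubling_time(slope_log10):
    """Days needed to double, given a slope in log10 units per day."""
    return math.log10(2) / slope_log10

# Hypothetical series growing by a factor of 10 every 10 days.
series = [5 * 10 ** (0.1 * day) for day in range(20)]
slope = fit_growth_rate(series)
print("slope (log10 per day):", round(slope, 4))
print("doubling time (days): ", round(doubling_time(slope), 2))
```

Converting the slope to a doubling time connects the fitted rate to the doubling periods of two to three days mentioned in the introduction.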

Supplementary Figures

Figure 3: Infection prevalence and growth. (a) The number of infections (confirmed cases of COVID-19) has a heavy-tailed distribution for both states and counties. We point out the largest number of infections in New York state and New York City. (b) We observe initial exponential-like growth in COVID-19-related infections, although they begin at different times for different counties. New York is shown as an example of what we observe throughout the US: the vast majority of cases are highly localized, such as within New York City. Nine other hard-hit counties are also shown. (c) The estimated infection growth rates in states and the country as a whole are higher on average than growth rates in counties. Compare to Fig. 1 in the main text.
Figure 4: Correlates of infection growth rates. (a) Death growth rates correlate strongly with the number of infections for both counties and states (Spearman correlation). (b) Infection growth rates correlate with population density for both counties and states.
Figure 5: Correlates of COVID-19 infections and deaths with population. (a) For COVID-19-related deaths, the correlation is also statistically significant for counties and states. (b) Correlations for infections, for counties and for states.
Figure 6: Quality of the exponential fits. We measure fit quality using R² for a linear fit of log-scaled data. For (a) deaths and (b) infections, we see that a large majority of fits have extremely high R² (insets), and that fit quality improves for places with larger numbers of infections or deaths as of April 14.



This work is funded by DARPA TAILOR program (Award #HR00111990114).