The COVID-19 pandemic has spread rapidly around the globe, claiming hundreds of thousands of lives and wreaking havoc on the economy. Public health experts and policy makers often consider how quickly the disease is spreading within their communities when they decide when and how to enforce mitigation strategies, such as lock downs and business closures. When the disease is spreading rapidly, with new cases doubling every two to three days, municipalities and states are likely to order residents to shelter at home to slow the spread of the disease, but after the growth has slowed sufficiently, they may consider lifting restrictions.
We show that estimated COVID-19 growth rates can be biased due to data aggregation. Specifically, using county-level COVID-19 infections and deaths data gathered by the New York Times, we show that spatial aggregation can systematically skew the estimated growth rates. As a result, the disease appears to grow faster at the state level than it does within most counties of that state. Similarly, the growth rate at the national level may overestimate how quickly the disease is spreading in most places within the country.
We explain the origins of the aggregation bias. Because the disease begins spreading in different places within the country at different times, the varying ages of outbreaks, coupled with differences in growth rates, create a heavy-tailed distribution of the number of infections and deaths. A small fraction of counties where the disease is growing faster (hot spots) account for the majority of all infections and deaths. When data are aggregated to create state-level and national statistics, these hot spots systematically bias the calculated growth rates. The growth rates in the number of deaths and infections are correlated with population density, and as a result, the hot spots are typically large cities. New York City is one such hot spot that has seen a heavy toll. However, the city is not exceptional. Hot spots at finer spatial resolutions, such as prisons and nursing homes, may skew the county-level disease data.
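The argument above can be made concrete with a short derivation. This is a sketch under an idealization: each county $i$ whose outbreak has started grows exactly exponentially. The symbols are illustrative: $N_i(t)$ is the cumulative count, $r_i$ the growth rate, $t_i$ the start time, and $c_i$ a coefficient. Natural logarithms are used here for convenience; using base 10 only rescales the rates by a constant.

```latex
\begin{align}
N_i(t) &= c_i\, e^{r_i (t - t_i)}, \qquad N(t) = \sum_i N_i(t), \\
\frac{d}{dt} \ln N(t) &= \frac{\sum_i r_i\, N_i(t)}{\sum_i N_i(t)}
  = \sum_i w_i(t)\, r_i, \qquad w_i(t) = \frac{N_i(t)}{N(t)}.
\end{align}
```

Under these assumptions, the aggregate growth rate is a count-weighted mean of the county growth rates. Whenever counts and growth rates have positive covariance across counties, this weighted mean exceeds the unweighted mean, so a few large, fast-growing counties pull the aggregate rate upward.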
Epidemic modeling and public health policy need to focus on understanding and mitigating the effects of local hot spots. Similarly, economic analysis has to consider biases introduced by data aggregation. When calculating the costs and benefits of lock downs, for example, analysts must account for aggregation bias potentially magnifying deaths and infection rates in some states.
Figure 1 demonstrates the impact of COVID-19 in the US. The number of deaths in each county has a heavy-tailed distribution (Figure 1a). This means that the disease’s impact varies enormously between places, with some regions almost unaffected and others hit hard by the pandemic. For example, New York City accounts for the bulk of all confirmed infections in New York state, which accounts for the bulk of all US cases.
How does the large variation in the impact arise? Figure 1b shows the growing toll of the disease in New York state and its ten hardest-hit counties. The growth in the number of deaths within each county (and state) can be roughly modeled by an exponential, which allows us to estimate the growth rate (agreement with exponential fits has been checked in Supplementary Figure 6). There is a fairly broad distribution of death growth rates for counties and states (Fig. 1c). However, this by itself is not enough to explain the heterogeneity of COVID-19 impact. In addition, we must consider that the disease appears in each county at a different time and has been growing for a different period of time. As a result of the different ages of local outbreaks across the US, the number of deaths has a long-tailed distribution (and similarly for the number of infections, Fig. 3, despite differences in the availability of testing).
This phenomenon is related to the Reed-Hughes mechanism, which explains how exponentially growing populations of different ages produce a power-law distribution of population sizes. However, the Reed-Hughes mechanism specifies that populations have the same growth rate and begin growing uniformly in time. In contrast, the start time of the outbreak in each county is approximately normally distributed, as is the growth rate. To validate this modified mechanism, we create synthetic data in which simulated counties have outbreaks that start at times drawn at random from a normal distribution, with growth rates chosen from another normal distribution, and coefficients drawn from a log-normal distribution (all distribution parameters are fit to data). Synthetic outbreaks within our simulated counties follow a heavy-tailed distribution (blue line in Fig. 1a) similar to the empirical distribution for counties. The situation is somewhat more complex for states. Simply dividing counties across states at random, so that each simulated state aggregates cases over the same number of counties (the mean number of counties in a state), creates a very sharply peaked distribution, unlike what we observe in data. Instead, we divide up counties non-uniformly across states such that the number of counties in these simulated states matches the true distribution. We then re-create the heavy-tailed distribution of the number of deaths for states (orange line in Fig. 1a). These results demonstrate that large heterogeneity in the data creates an outcome qualitatively similar to that of the Reed-Hughes mechanism. We now show that this heterogeneity has an enormous impact on the current crisis within each county and state.
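The county-level part of the synthetic experiment described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names and every distribution parameter (means, standard deviations, observation day) are hypothetical placeholders rather than the values fit to data, and the non-uniform assignment of counties to states is omitted.

```python
import random
import statistics


def simulate_counties(n_counties=3000, day=100.0, seed=0):
    """Return simulated cumulative counts for counties on a given day.

    All distribution parameters below are hypothetical placeholders,
    not the values fit to data in the paper.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_counties):
        start = rng.gauss(80.0, 8.0)            # outbreak start day (normal)
        rate = max(rng.gauss(0.06, 0.02), 0.0)  # log10 growth per day (normal, clipped at 0)
        coeff = rng.lognormvariate(0.0, 1.0)    # coefficient (log-normal)
        age = day - start
        if age > 0:                             # keep only outbreaks that have started
            counts.append(coeff * 10 ** (rate * age))
    return counts


def top_share(counts, fraction=0.1):
    """Share of the total held by the largest `fraction` of entries."""
    ordered = sorted(counts, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


counts = simulate_counties()
print("counties with outbreaks:", len(counts))
print("median count:", round(statistics.median(counts), 1))
print("mean count:", round(statistics.mean(counts), 1))
print("share of total in top 10% of counties:", round(top_share(counts), 2))
```

A mean well above the median, together with a large share of the total concentrated in the top decile, is the signature of a heavy-tailed distribution. Changing the placeholder parameters shows how the spread of outbreak ages and growth rates controls the weight of the tail.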
Figure 2a shows that growth rate is correlated with the total number of deaths (as of April 14), with Spearman rank correlation and for counties and states, respectively, p-value . As a result of this correlation, the fastest-growing hot spots dominate the statistics, making deaths appear to grow faster in a state than in a typical county within that state. Similarly, the national death growth rate appears to be higher than the death growth rates within constituent states and counties (Fig. 1c). The systematic overestimation in aggregated death counts is an example of the Modifiable Areal Unit Problem, a statistical bias similar to Simpson’s paradox, that results in varying statistical trends at different levels of aggregation. A similar discrepancy can be observed in the correlations between death growth rate and how long the disease has been spreading: the correlation is not statistically significant for counties (p-value), yet it is positive for states (, p-value=).
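For readers who want to reproduce this kind of rank correlation, the following is a minimal standard-library sketch of the Spearman coefficient (the Pearson correlation of average ranks). The helper names and the five-county data set are hypothetical and purely illustrative; the sketch does not compute p-values.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: growth rates and total deaths for five counties.
growth_rates = [0.03, 0.05, 0.04, 0.09, 0.07]
total_deaths = [12, 40, 35, 150, 900]
rho = spearman(growth_rates, total_deaths)
print("Spearman rank correlation:", round(rho, 3))
```

Because the coefficient depends only on ranks, it is insensitive to the heavy tail in the death counts, which makes it a natural choice for data of this kind.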
Figure 2b shows that death growth rates are correlated with population density ( and , for counties and states, respectively, p-value ). As with deaths, growth rates of new infections (confirmed cases) are well correlated with population density ( and for counties and states, respectively, p-value). While it seems intuitive that denser places with more interpersonal interaction are at greater risk of spreading the disease, this may not be the entire explanation. Instead, denser places are more likely to have local hot spots. In the US, prisons and nursing homes have proven to be local hot spots where the disease spreads unchecked. Denser regions could simply be more likely to have such local hot spots.
The results are quantitatively similar if we consider population size instead of density (see Supplementary Figure 5), consistent with what was shown in another paper. We also find that the number of infections (instead of the growth rate) correlates with population density ( and 0.70 for counties and states, respectively, p-value ), and to a larger extent with population size ( and for counties and states, respectively, p-values ), and does not correlate as well with the age of the outbreak ( and for counties and states, respectively, p-values ).
The impact of COVID-19 in the US is highly heterogeneous, with many regions seeing few infections and deaths, while a handful of regions are greatly affected. The heavy-tailed distribution of impact has important implications for policy makers. First, aggregating data at the state or country level only tells us what is happening in a few infection hot spots where the disease is far more prevalent, typically large counties. As a result, the disease will appear to grow systematically faster at the state and country level than within individual counties. Since the order in which infections appear is likely dictated by patterns of mobility and therefore difficult to control, the best way to reduce the overall infection and death rates is to reduce the growth rate in hot spots, e.g., through early social distancing measures.
Analysis of the effects of interventions, such as lock downs and other mitigation strategies, has to account for potential biases introduced by data aggregation. Local hot spots may effectively amplify the rates of the disease for some regions (i.e., states and countries), obscuring the benefits of early interventions.
From the modeling perspective, since epidemic statistics are driven by a few hot spots (typically large, dense cities), compartmental models may be most effective for modeling the disease. The assumptions made by compartmental models, namely uniform mixing of populations, are best aligned with mobility patterns in cities that regularly bring people in contact with one another. Compartmental models typically have fewer fitting parameters than spatio-temporal models [6, 9, 10], and therefore may be better at making intermediate-range forecasts.
Future work is needed to understand how these results generalize to other diseases. We see, for example, that growth rate is negatively correlated with population density for diseases such as Ebola, potentially due to a lack of healthcare infrastructure. But this may be a special case, due to the impoverished countries that were affected. Moreover, it is important to test the Reed-Hughes-like statistical model for other diseases and countries to see the degree to which it can help explain infection hot spots.
Methods and Materials
Data on cumulative COVID-19 infections is obtained from the New York Times as of April 14, 2020. We also collect population and area within each county and state from the US Census (data.census.gov), where population estimates are as of July 2019. States are defined as those with official statehood as well as the District of Columbia. Counties are defined the same as in the census except for New York City, where all boroughs are combined, and Kansas City, where the population and area are calculated separately. Because Kansas City overlaps with other Missouri counties, we do not remove the city area from our estimates of county areas. We do not expect a significant change in our results due to this decision.
Growth rates were calculated by taking the log base 10 of the cumulative infections (and deaths) and fitting a line. For these fits, data below 11 infections (deaths) are removed to reduce the effect of outliers. In addition, we only fit data with more than 5 datapoints. Calculations of are based on this log-scaling of data.
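The fitting procedure described above can be sketched in a few lines. This is a minimal illustration using ordinary least squares, not the authors' code. The function name and parameters are hypothetical, and it assumes that the "more than 5 datapoints" requirement applies after values below 11 are removed.

```python
import math


def growth_rate(cumulative, min_count=11, min_points=6):
    """Slope of log10(cumulative count) versus day index, or None.

    Assumptions (not confirmed by the text): days are consecutive,
    and the minimum number of points is checked after filtering.
    """
    # Drop days with fewer than `min_count` cumulative cases or deaths.
    points = [
        (day, math.log10(count))
        for day, count in enumerate(cumulative)
        if count >= min_count
    ]
    if len(points) < min_points:  # require more than 5 data points
        return None
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx  # least-squares slope, in log10 units per day


# Hypothetical series that grows by a factor of 10 every 10 days.
series = [20 * 10 ** (0.1 * day) for day in range(10)]
rate = growth_rate(series)
doubling_time = math.log10(2) / rate
print("growth rate (log10 per day):", round(rate, 3))
print("doubling time (days):", round(doubling_time, 2))
```

In this convention, a growth rate of 0.1 per day corresponds to a doubling time of about three days, the regime described in the introduction.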
[1] M. Smith, et al., [Online] https://github.com/nytimes/covid-19-data (2020).
[2] W. J. Reed, B. D. Hughes, Phys. Rev. E 66, 067103 (2002).
[3] C. E. Gehlke, K. Biehl, Journal of the American Statistical Association 29, 169–170 (1934).
[4] E. H. Simpson, Journal of the Royal Statistical Society, Series B 13, 238–241 (1951).
[5] A. J. Stier, M. G. Berman, L. M. A. Bettencourt, arXiv preprint:2003.10376 (2020).
[6] D. Brockmann, D. Helbing, Science 342, 1337 (2013).
[7] B. Coburn, B. Wagner, S. Blower, BMC Med 7 (2009).
[8] L. M. A. Bettencourt, Science 340, 1438 (2013).
[9] K. Burghardt, et al., Sci. Rep. 6, 34598 (2016).
[10] D. Liu, et al., arXiv preprint:2004.04019 (2020).
[11] IHME COVID-19 health service utilization forecasting team, C. J. Murray, medRxiv preprint (2020).
This work is funded by the DARPA TAILOR program (Award #HR00111990114).