Exposure Density and Neighborhood Disparities in COVID-19 Infection Risk: Using Large-scale Geolocation Data to Understand Burdens on Vulnerable Communities

08/04/2020 ∙ by Boyeong Hong, et al. ∙ NYU

This study develops a new method to quantify neighborhood activity levels at high spatial and temporal resolutions and test whether, and to what extent, behavioral responses to social distancing policies vary with socioeconomic and demographic characteristics. We define exposure density as a measure of both the localized volume of activity in a defined area and the proportion of activity occurring in non-residential and outdoor land uses. We utilize this approach to capture inflows/outflows of people as a result of the pandemic and changes in mobility behavior for those that remain. First, we develop a generalizable method for assessing neighborhood activity levels by land use type using smartphone geolocation data over a three-month period covering more than 12 million unique users within the Greater New York area. Second, we measure and analyze disparities in community social distancing by identifying patterns in neighborhood activity levels and characteristics before and after the stay-at-home order. Finally, we evaluate the effect of social distancing in neighborhoods on COVID-19 infection rates and outcomes associated with localized demographic, socioeconomic, and infrastructure characteristics in order to identify disparities in health outcomes related to exposure risk. Our findings provide insight into the timely evaluation of the effectiveness of social distancing for individual neighborhoods and support a more equitable allocation of resources to support vulnerable and at-risk communities. Our findings demonstrate distinct patterns of activity pre- and post-COVID across neighborhoods. The variation in exposure density has a direct and measurable impact on the risk of infection.




1 Introduction

When Wuhan, China was struck by an outbreak of a novel coronavirus (2019-nCoV or COVID-19) in late 2019, scientists were concerned about the potential for rapid spread of the virus Callaway (2020). By the end of June 2020, a total of 10 million people had been infected and 500,000 people had lost their lives in more than 200 countries Organization et al. (2020); Van Bavel et al. (2020). The COVID-19 pandemic is considered the most severe public health crisis since the 1918 Spanish flu due to the transmission and infection characteristics of the disease, with few effective treatments at the time this article was written Organization et al. (2020); Sen-Crowe et al. (2020); Courtemanche et al. (2020); Gao et al. (2020). Given the global scale of the pandemic, a coordinated response is necessary to mitigate the spread of the virus both within and across country borders Remuzzi and Remuzzi (2020).

Social distancing (also referred to as physical distancing) is one of the most effective behavioral strategies to reduce the transmission rate of COVID-19 Sen-Crowe et al. (2020); Courtemanche et al. (2020); Chudik et al. (2020); Patrick et al. (2020); Gao et al. (2020); Jay et al. (2020). Social distancing reduces the probability of contacts between individuals who might be infected, resulting in reduced exposure risk Jay et al. (2020); Adolph et al. (2020). Governments have implemented a range of social distancing policies, including travel bans, restrictions on gatherings, school closures, non-essential business closures, and restaurant restrictions. In particularly hard-hit locations, mandatory “stay-at-home” orders have been issued to limit or avoid unnecessary close contacts outside of the home Adolph et al. (2020); Jay et al. (2020); Chinazzi et al. (2020).

Studies have found that social distancing measures help to prevent transmission of the virus and reduce the reproduction number (R0) Gao et al. (2020); Chudik et al. (2020); Jay et al. (2020); Abouk and Heydari (2020); Hatchett et al. (2007); Farboodi et al. (2020); Matrajt and Leung (2020). These practices help to avoid overwhelming hospital intensive care units (ICUs) and health care systems, control the doubling time of infections, and ultimately save lives Gao et al. (2020); Greenstone and Nigam (2020); Adolph et al. (2020). Although not without potentially significant hardship to individuals and communities, social distancing is an important public health tool to flatten the epidemic curve and support longer-term economic and public health benefits Sen-Crowe et al. (2020); Greenstone and Nigam (2020); Thunstrom et al. (2020).

However, the impact of, and response to, stay-at-home orders, and social distancing guidelines more broadly, is not uniform across neighborhoods and communities. In order to maximize the positive effects of social distancing, individuals need to change their typical behavior, often dramatically Van Bavel et al. (2020); Sen-Crowe et al. (2020). Despite government-mandated social distancing policies (e.g. New York State’s PAUSE order), socio-behavioral responses vary across neighborhoods, further contributing to disparities in risk of infection Courtemanche et al. (2020); Jay et al. (2020). Disparities in social distancing practices, namely geographic or population subgroup differences in adopting behavior changes in response to the same policy context, may stem from varying levels of awareness, perception, or belief in the severity of the virus threat; differences in social and cultural norms; or the ability of households and communities to alter normal activity patterns given economic constraints or other existing responsibilities Van Bavel et al. (2020); Jay et al. (2020); Wise et al. (2020); Caley et al. (2008). For example, lower-income households typically do not have the option to work from home and going to a place of work (often in essential services) is unavoidable, meaning higher risk of exposure to COVID-19 for themselves as well as their families and communities Jay et al. (2020); Atchison et al. (2020). Within specific neighborhoods and communities, norms can also be reinforcing; if large numbers of residents are essential workers and not socially distancing, other residents may have similar behavioral responses Van Bavel et al. (2020).

A growing number of outbreaks are occurring in densely populated areas Desai (2020), with disproportionate impacts on low-income and minority communities. Measuring and understanding social distancing and behavior change across neighborhoods during a pandemic can provide critical insight to the design and implementation of more effective - and equitable - public health policy. Given the potential heterogeneity in localized responses to social distancing recommendations, quantifying local patterns of activity represents an emerging tool to understand and eventually reduce local exposure risk and limit community outbreaks Jay et al. (2020). Although there has been increasing awareness of the troubling disparities in infection rates and outcomes in vulnerable communities, the effectiveness of behavioral interventions at the scale of individual neighborhoods has not been fully studied. Often, studies that do attempt to observe effects at higher spatial resolutions rely on simulations or are limited to relatively coarse areal units (e.g. county or state) due to data limitations and computational constraints.

This study develops a new method to quantify neighborhood activity levels at high spatial and temporal resolutions. We define exposure density ($E_x\rho$) as a measure of both the localized volume of activity in a defined area and the proportion of activity occurring in non-residential and outdoor land uses, areas that are associated with an increased risk of exposure to the virus. We utilize this approach to capture inflows/outflows of people as a result of the pandemic and changes in mobility behavior for those that remain. We test whether, and to what extent, behavioral responses to social distancing policies vary with socioeconomic and demographic characteristics. We focus on New York City (NYC), the first epicenter of the pandemic in the U.S., where a statewide “stay-at-home” order (NY on PAUSE) took effect on March 22, 2020. Through June 30, 2020, NYC had more than 212,000 confirmed cases of COVID-19, accounting for 8% of the nationwide total, resulting in at least 18,492 confirmed deaths and 4,604 probable deaths. Our methodology proceeds in three steps. First, we develop a generalizable method for assessing neighborhood activity levels using smartphone geolocation data over a three-month period covering more than 12 million unique users within the Greater New York area, together with land use information at 1 m grid resolution. Second, we measure and analyze disparities in community social distancing by estimating variations in neighborhood activity and associated patterns in community characteristics before and after the stay-at-home order. Finally, we evaluate the effect of social distancing in neighborhoods on COVID-19 infection rates and outcomes associated with localized demographic, socioeconomic, and infrastructure characteristics in order to identify disparities in health outcomes related to exposure risk. Our findings provide insight into the timely evaluation of the effectiveness of social distancing for individual neighborhoods and support a more equitable allocation of resources to support vulnerable and at-risk communities.

2 Background

Emerging research has begun to leverage a variety of new data resources to measure and evaluate social distancing practices. For example, the World Bank COVID-19 Mobility Task Force and Emergency Operation Center (EOC) uses anonymized Cell Detail Record (CDR) data from Mobile Network Operators (MNOs) to provide technical, operational, and decision-making support to mitigate the spread of COVID-19, particularly for data-poor countries Viotti (2020). Data fields include the date, number of devices, number of trips, and mean distance traveled for a given region. Specifically, the dataset contains origin–destination information aggregated to cell tower capture areas, providing insights into movement patterns and human activity density on a given day.

Social media platforms, such as Twitter and Facebook, have been another popular source for measuring social distancing Facebook (2020). Kayes et al. (2020) use Twitter data to develop an automated social distancing measurement method and sentiment detection model in Australia. The study finds that more than 80% of Twitter users express a positive sentiment and are willing to accept social distancing practices to protect their communities Kayes et al. (2020). Also, Xu et al. (2020) develop a social mobility index as a measure of social distancing and travel patterns by using public geotagged Twitter data. This research defines the social mobility index based on total travel distance between identified home locations and other visited locations for each user, and compares this metric before and after the pandemic declaration. The authors observe a decrease in mobility of 62% in the U.S. on average, ranging from 39% to 77% across different states.

Using a range of urban sensing modalities, Das and James (2020) at the Newcastle University Urban Observatory in the United Kingdom developed a dashboard to provide social distancing status and vehicle mobility patterns in real time. This group used thousands of sensors from vehicular and pedestrian environments, such as parking lot occupancy sensors and public transportation GPS trackers, to monitor urban dynamics. They analyzed more than 1.8 billion data points to compare daily mobility patterns after the lockdown against the same day from the previous year. The research observed that peaks of pedestrian movement disappeared and mobility volume decreased by more than 90% after the stay-at-home order. Other sensing approaches include the use of computer vision and synoptic imagery to quantify pedestrian and motorized activity.

Many researchers have turned to Point-of-Interest (POI) data: location information extracted from mobile phones and primarily used by retailers for customer tracking (e.g. visitor counts) and commercial space planning. As POI data typically provide the location of stores or restaurants and the number of visitors, activity levels at POI locations can be used to measure social distancing at specific places. Many studies rely on POI data provided by SafeGraph, in partnership with ESRI. For example, Bayham et al. (????) analyzed POI data to quantify mobility patterns in Colorado during the early stage of the COVID-19 epidemic at the state and county level. The report found that residents reduced their social activities by 80% after the statewide stay-at-home order, and residents of higher-income counties reduced their activity levels more than those of lower-income counties. Other recent studies, including Allcott et al. (2020) and Painter and Qiu (2020), use POI data to understand potential disparities in social distancing practices based on political affiliation.

Finally, cell phone GPS data, primarily collected from smartphone applications, have gained widespread attention for their potential to reveal large-scale population dynamics Coven and Gupta (2020). Google, for instance, uses anonymized and aggregated mobility data of users based on their Google Location History to measure mobility patterns and demonstrate the effectiveness of various social distancing policies Google (2020); Wellenius et al. (2020); Aktay et al. (2020). Based on these data, research has quantified changes in the amount of time individuals spend away from their place of residence and changes in the number of visits to non-residential places, such as retail stores, parks, and transit stops, aggregated to the county level. Similarly, Apple extracts user requests for directions in Apple Maps to create anonymized, daily mobility data for metropolitan areas Apple (2020). In addition to Google and Apple, there are third-party providers that collect users’ timestamped locations from individual smartphone applications and sell (or share) these data with a range of end users, including researchers. As an example, the New York Times acquired data on 15 million users in the U.S. from Cuebiq, a location data provider, to quantify the number of people staying home and changes in travel patterns across the country. Similarly, Unacast has developed a social distancing score across the U.S. at the county level using their location data products, which include an interactive dashboard UNACAST (2020).

Despite the range of data sources used to measure social distancing, many studies face nontrivial constraints caused by data limitations and methodological challenges. First, previous work is limited to relatively coarse geographical units, such as city, county, or state. In order to quantify and understand the range of effects of social distancing across socioeconomic and demographic diversity, observations of changes in activity are needed at higher spatial resolutions. Second, previous work does not account for land use classifications in determining where activity is occurring, and how those locations shift over time. While POI data can capture visits to certain types of establishments, they do not, by themselves, allow for measurement of activity taking place outdoors, in residential settings, or in other places where check-ins may not occur. Therefore, a scalable metric of neighborhood activity change and exposure risk is necessary to understand the heterogeneous effects of social distancing on vulnerable communities, to model localized spread of disease, and to help develop more equitable and effective public health interventions targeted to at-risk communities.

3 Data and Methods

| Dataset | Time range | Resolution (spatial/temporal) | Source and description |
|---|---|---|---|
| Mobility data | 2020-02-01 to 2020-04-30 | (X,Y)/second | Geotagged data points collected from more than 200 mobile applications, provided by VenPath, Inc. |
| NYC Primary Land Use Tax Lot Output (PLUTO) | updated 2020-02-24 | Parcel/n/a | Land use and building type information provided by the NYC Department of City Planning |
| NYC Building Footprints | updated 2020-07-06 | Footprint/n/a | Perimeter outlines of more than 1 million buildings in NYC, provided by the Department of Information Technology & Telecommunications (DoITT) |
| Road Network Data (LION) | updated 2020-04-28 | Street segment/n/a | Single line street base map provided by the NYC Department of Transportation |
| NYC COVID-19 data | 2020-04-01 to 2020-06-04 | Zipcode/daily | COVID-19 confirmed cases, deaths, and positivity rates provided by the NYC Health Department |
| American Community Survey (ACS) | 2018 5-year estimates | Zipcode/n/a | Household demographic and socioeconomic characteristics from the U.S. Census Bureau |
| NYC Hospital locations | updated 2017-09-08 | (X,Y)/n/a | List of hospitals of the NYC Health and Hospital Corporation and public hospital system |
| Nursing home data | updated 2020-05-24 | (X,Y)/n/a | Nursing home information, including the number of beds and occupancy across the country, provided by the Centers for Disease Control’s National Healthcare Safety Network system |

Table 1: Data sources and descriptions

Our primary data are anonymized smartphone geolocations collected by VenPath, Inc., a data marketplace company providing mobile application data and business analytics consulting, drawing on more than 200 smartphone applications across the United States. The approximately 5 TB dataset covers the period from February through April 2020 and contains more than 127 billion geotagged data points associated with 120 million unique devices every month across the country. For this study, we extract the subset of the data falling within the Greater New York area bounding box and adjust timestamps to the Eastern Standard Time (EST) zone, resulting in 12,858,781 unique devices. After filtering for devices active for at least 14 days over the study period, the processed dataset includes 744,147 unique devices, representing approximately 8.9% of the NYC population.
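The 14-day activity filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (device id plus ISO timestamp), the function name `active_devices`, and the reading of "active" as at least one data point on a calendar day are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def active_devices(pings, min_days=14):
    """Return device ids observed on at least `min_days` distinct calendar days.

    `pings` is an iterable of (device_id, iso_timestamp) pairs. This layout is
    hypothetical and stands in for the real mobility records.
    """
    days_seen = defaultdict(set)
    for device_id, ts in pings:
        days_seen[device_id].add(datetime.fromisoformat(ts).date())
    return {d for d, days in days_seen.items() if len(days) >= min_days}


# Toy data: device "a" appears on 14 different days, device "b" on only 2.
pings = [("a", f"2020-02-{day:02d}T08:00:00") for day in range(1, 15)]
pings += [("b", "2020-02-01T09:00:00"), ("b", "2020-02-02T09:00:00")]
kept = active_devices(pings)
```

Counting distinct days per device, rather than raw data points, keeps a device that reports thousands of points on a single day from passing the filter.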

To complement our mobility data, we use a range of ancillary data as described in Table 1 for data analysis and modeling. NYC Primary Land Use Tax Lot Output (PLUTO) data are used to obtain land use and building type information for every property in the city. The building footprint shapefile is used to identify the exact boundaries of individual buildings. NYC LION data – a single line street base map – are used to extract street segment geometries. We use daily NYC COVID-19 information by zipcode, which includes confirmed cases, deaths, and positive test rates, provided by the NYC Department of Health and Mental Hygiene (DOHMH). In order to contextualize neighborhood demographic, socioeconomic, housing, and public health-related characteristics, we use American Community Survey (ACS) data from the U.S. Census Bureau, NYC hospital locations from NYC OpenData, and nursing home data provided by the U.S. Centers for Disease Control (CDC). With the exception of the smartphone geolocation data, all data are publicly available and extracted from NYC or federal open data platforms.

We explore three hypotheses. First, large-scale mobility data can represent neighborhood activity levels over time, and neighborhood social distancing can be measured by changes in this observed activity. Second, disparities in community activity changes before and after a stay-at-home order are associated with neighborhood socioeconomic and demographic characteristics. Third, variations in neighborhood social distancing result in disparities in COVID-19 infections and outcomes, controlling for differences in population health risk.

3.1 Building the Exposure Density Metric

We introduce exposure density ($E_x\rho$) as a high spatiotemporal resolution social distancing metric using large-scale mobility data without identifying and tracking individual devices. The goal of social distancing is to reduce the probability of contact between potentially infected and non-infected people; therefore, it can be defined mathematically as inversely proportional to human activity density, represented by the number of people in a given area at a given time. Naively, a lower activity volume, holding spatial area constant, results in lower population density, thus decreasing the probability of close contacts. However, this metric needs to account for both the volume of activity in an area and the type of land use where activities occur. For example, activities in residential buildings can be a measure of people staying at home, while activities outside of residential buildings are more likely to increase exposure risk by raising the likelihood of contact with those outside of the family or household unit. Here, $E_x\rho$ is measured as the number of unique devices in a given geographical and temporal unit by land use type, specified as:

$$E_x\rho_{g,t,l} = \left|\left\{\, d : \text{device } d \text{ is observed in } g \text{ during } t \text{ on land use } l \,\right\}\right|$$

where $g$ is the selected geographical unit (e.g. grid cell or census block group), $t$ is the temporal unit (e.g. hourly or daily), and $l$ is the land use class.

In order to maintain a scalable and uniform areal unit that can be applied across different cities and regions, we divide the NYC study area into a 250 m grid (187 × 186 cells) which we use for aggregation of the mobility data. To enrich the mobility data with land use information, we create a 1 m resolution raster with the extents and the coordinate system matching the aforementioned 250 m grid. The land use raster combines the geographical city limits and land use classification derived from PLUTO data, together with street and sidewalk boundaries and building footprints for more than 1,000,000 buildings. Each category of land cover is then classified by an integer (e.g. 10 for residential property, 50 for outdoor open space, and so on). Each 1 m cell is therefore identified by its index, location and associated land cover category. This allows us to assign each geolocation data point from the mobility dataset to a specific land use cell. To estimate population density, we count the hourly number of unique devices by each 250 m grid cell and the corresponding land use category based on the raster cell. Our data processing workflow is visualized in Figure 1. The rasterization process was implemented in Python and deployed on the NYU Center for Urban Science and Progress’ Research Computing Facility, and the activity computation was performed with PySpark on a Hadoop distributed computing cluster using NYU’s High Performance Computing platform.
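The spatial join and hourly aggregation can be illustrated with a small single-machine sketch. It is not the PySpark implementation used in the study; the record layout, the function name `exposure_counts`, and the assumption that coordinates are already projected to meters from the grid origin are hypothetical. The land use codes 10 (residential) and 50 (outdoor open space) follow the examples given in the text.

```python
from collections import defaultdict

CELL_M = 250  # aggregation grid size in meters, as in the text


def exposure_counts(pings, land_use_raster):
    """Count unique devices per (grid cell, hour, land use code).

    pings: iterable of (device_id, x_m, y_m, hour), with x/y in meters from
        the grid origin (hypothetical layout).
    land_use_raster: 2D sequence indexed [row][col] at 1 m resolution holding
        integer land use codes.
    """
    devices = defaultdict(set)
    for device_id, x, y, hour in pings:
        land_use = land_use_raster[int(y)][int(x)]   # 1 m raster lookup
        cell = (int(y) // CELL_M, int(x) // CELL_M)  # enclosing 250 m cell
        devices[(cell, hour, land_use)].add(device_id)
    return {key: len(ids) for key, ids in devices.items()}


# Toy raster: one 1 m row, 500 m wide; left half residential, right half outdoor.
raster = [[10] * 250 + [50] * 250]
pings = [
    ("a", 10.2, 0.5, 8),
    ("a", 11.0, 0.5, 8),   # same device, same cell and hour: counted once
    ("b", 20.0, 0.5, 8),
    ("a", 300.0, 0.5, 8),  # same device in a different cell and land use
]
counts = exposure_counts(pings, raster)
```

Storing a set of device ids per key is what makes the count "unique devices" rather than raw data points; a distributed version would express the same idea as a distinct count grouped by cell, hour, and land use.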

Figure 1: Visualization of data processing workflow starting from land use rasterization on 1m x 1m grid (left), spatial join of mobility data with land use information by mapping activity location on the raster (center) and aggregation of hourly activity by type on each 250m x 250m grid (right).

Our 250 m grid cell level measurement can be aggregated into larger geospatial units in order to estimate neighborhood activities at different scales. In this work, we use zipcode aggregation to compute neighborhood activity to align with COVID-19 infection data provided by the DOHMH. The zipcode aggregated exposure density ($E_x\rho_{z,l}$) is defined as:

$$E_x\rho_{z,l} = \frac{1}{N_z} \sum_{g \in z} \overline{E_x\rho}_{g,l}$$

where $\overline{E_x\rho}_{g,l}$ is the average number of hourly unique devices in a 250 m × 250 m grid cell $g$ by land use type $l$ in a given zipcode $z$, and $N_z$ is the number of grid cells in zipcode $z$. Based on our social distancing metric, changes in mobility activity by residential, non-residential, and outdoor land uses in a neighborhood over the study time period are examined. We filter out activities from major roads used exclusively by motor vehicles (those without sidewalks or pedestrian access) to remove vehicular activity within a given neighborhood.
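The zipcode aggregation amounts to averaging per-cell values over the grid cells that belong to each zipcode. A minimal sketch follows; the cell-to-zipcode lookup, the function name, and the choice to treat cells with no recorded activity as zero are assumptions made for illustration.

```python
def zipcode_exposure(cell_means, cell_to_zip):
    """Average per-cell hourly device counts over all grid cells in each zipcode.

    cell_means: {cell_id: mean hourly unique devices} for one land use type.
    cell_to_zip: {cell_id: zipcode} covering every grid cell in the study area.
    Cells absent from `cell_means` are treated as zero activity (an assumption).
    """
    totals, n_cells = {}, {}
    for cell, zipcode in cell_to_zip.items():
        totals[zipcode] = totals.get(zipcode, 0.0) + cell_means.get(cell, 0.0)
        n_cells[zipcode] = n_cells.get(zipcode, 0) + 1
    return {z: totals[z] / n_cells[z] for z in totals}


# Toy example: two zipcodes with two grid cells each; cell 3 has no activity.
cell_to_zip = {0: "10001", 1: "10001", 2: "11364", 3: "11364"}
cell_means = {0: 12.0, 1: 8.0, 2: 3.0}
by_zip = zipcode_exposure(cell_means, cell_to_zip)
```

Dividing by the number of grid cells rather than the number of active cells normalizes for zipcode area, so large and small zipcodes are comparable.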

3.2 Analyzing disparities in exposure density

To understand disparities in exposure density and behavioral responses to social distancing mandates across neighborhoods, we apply an unsupervised machine learning clustering algorithm based on a pre/post comparative analysis. We extract $E_x\rho$ subsets for two two-week periods, defined as the pre-impact period (February 16 through February 29, 2020) and the impact period (March 29 through April 11, 2020), to measure changes in $E_x\rho$ before and after the state-mandated stay-at-home order. In order to take into account both the absolute change in activity volume and the change in the proportion of activity type, we create six (6) input variables for the zipcode clustering analysis.

The first three variables are the average hourly activity volume changes for residential, non-residential, and outdoor land uses in each zipcode, computed from the pre-impact period activity level and the impact period level. The remaining three are the average hourly changes in the proportion of activity occurring in each land use type, where the proportion is the share of a neighborhood's activity in grid cells of a given land use (residential, non-residential, or outdoor).
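As an illustration of how the six clustering inputs could be assembled for one zipcode, consider the sketch below. The exact change definitions are not recoverable from the text, so two assumptions are made and should be read as such: volume change is taken as a relative change against the pre-impact level, and proportion change is taken as the difference in land use shares. The function and key names are hypothetical.

```python
LAND_USES = ("residential", "non_residential", "outdoor")


def clustering_features(pre, post):
    """Build six change features for one zipcode.

    pre, post: {land_use: average hourly activity volume} for the pre-impact
    and impact periods. Volume change is computed as a relative change and
    proportion change as a difference in shares; both choices are assumptions.
    """
    pre_total = sum(pre[u] for u in LAND_USES)
    post_total = sum(post[u] for u in LAND_USES)
    volume_change = [(post[u] - pre[u]) / pre[u] for u in LAND_USES]
    share_change = [post[u] / post_total - pre[u] / pre_total for u in LAND_USES]
    return volume_change + share_change


# Toy zipcode: total activity halves and shifts toward residential land use.
pre = {"residential": 40.0, "non_residential": 40.0, "outdoor": 20.0}
post = {"residential": 30.0, "non_residential": 10.0, "outdoor": 10.0}
features = clustering_features(pre, post)
```

In this toy case residential volume falls by 25% while its share of activity rises by 20 percentage points, which shows why volume and proportion are tracked as separate features.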

To identify similarities in the change in $E_x\rho$, we use Ward's method for agglomerative clustering, a widely used bottom-up hierarchical unsupervised clustering method. It begins with each data point considered as an individual cluster. At each iteration, the closest two clusters merge with each other based on the proximity matrix measured by Euclidean distance until all data points form a single cluster Hastie et al. (2009). Input data is in the form of a 177 × 6 matrix (177 zipcode neighborhoods and 6 features), and the optimized number of clusters is determined by the corresponding dendrogram (hierarchical tree diagram) based on the similarities and dissimilarities of the objects. This clustering process is specified as:

$$\Delta(A, B) = \sum_{i \in A \cup B} \lVert x_i - m_{A \cup B} \rVert^2 - \sum_{i \in A} \lVert x_i - m_A \rVert^2 - \sum_{i \in B} \lVert x_i - m_B \rVert^2$$

where $\Delta(A, B)$ is the merging cost of combining clusters $A$ and $B$ (distance between clusters), $m_j$ is the centroid of cluster $j$, and $x_i$ is an individual element within a cluster. The resultant clustered neighborhood groups are then integrated with demographic and socioeconomic characteristics, housing and urban form features, and COVID-19 infection and outcome data. By using a one-way ANOVA (analysis of variance) test and a Tukey’s test for post-hoc analysis, we identify statistically significant differences between classified groups regarding behavioral responses and associated neighborhood characteristics.
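To make the merging rule concrete, the sketch below implements a naive version of Ward's agglomerative clustering with NumPy. It is for illustration only and is not the authors' code: the function name is hypothetical, the number of clusters is passed in directly instead of being read off a dendrogram, and the toy data are invented. The merging cost uses the standard identity that the increase in within-cluster sum of squares equals n_a n_b / (n_a + n_b) times the squared distance between the two centroids.

```python
import numpy as np


def ward_clusters(X, n_clusters):
    """Naive Ward agglomerative clustering; returns one label per row of X.

    At each step the pair of clusters with the smallest merging cost
    n_a * n_b / (n_a + n_b) * ||m_a - m_b||^2 is merged. That cost equals the
    increase in within-cluster sum of squares. Roughly O(n^3), so this is a
    teaching sketch rather than a production routine.
    """
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]  # start: every point is a cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ia, ib = clusters[a], clusters[b]
                gap = X[ia].mean(axis=0) - X[ib].mean(axis=0)
                cost = len(ia) * len(ib) / (len(ia) + len(ib)) * (gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Toy data: two well-separated groups of three points each.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = ward_clusters(X, n_clusters=2)
```

For real data, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be preferred, since it is much faster and also returns the merge history needed to draw a dendrogram.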

3.3 Identifying the impact of exposure density and neighborhood behavior change on infectivity

In order to evaluate the effect of neighborhood behavior changes on COVID-19 infection rates for the 177 neighborhoods included in the study, we first measure Pearson correlation coefficients for observed community activity changes before and after the stay-at-home order and disease infection case rates – daily new confirmed cases per 100,000 people and cumulative cases per 100,000 people – while accounting for an incubation period.
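A small sketch of this correlation step is given below. The text does not specify how the incubation period enters, so the handling here is an assumption: case counts are read a fixed number of days after the reference day. The function names, the two-day lag, and the toy zipcodes are hypothetical and chosen only to keep the example short.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_case_rates(daily_cases, population, day, lag_days):
    """New cases per 100,000 people observed `lag_days` after `day`, per zipcode."""
    return {
        z: 100_000 * series[day + lag_days] / population[z]
        for z, series in daily_cases.items()
    }


# Toy inputs for three hypothetical zipcodes.
change = {"A": -0.5, "B": -0.3, "C": -0.1}  # exposure density change
population = {"A": 50_000, "B": 100_000, "C": 200_000}
daily_cases = {"A": [0, 0, 5], "B": [0, 0, 20], "C": [0, 0, 60]}

rates = lagged_case_rates(daily_cases, population, day=0, lag_days=2)
zips = sorted(change)
r = pearson_r([change[z] for z in zips], [rates[z] for z in zips])
```

Normalizing to cases per 100,000 before correlating keeps large zipcodes from dominating the relationship simply because they have more residents.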

| Infection rate indicators | Description |
|---|---|
| Case rate | Number of confirmed cases per 100,000 by zipcode |
| Death rate | Number of confirmed deaths per 100,000 by zipcode |
| Positivity rate | Percentage of positive tests using a polymerase chain reaction (PCR) test |
| Deaths per case | Number of deaths per confirmed case |

Table 2: Description of COVID-19 infection rates

| Variable | Description | Statistics |
|---|---|---|
| Exposure density change | % change in activities outside of residential buildings | -0.18 (0.20) |
| Neighborhood clusters | Binary variables for each identified cluster | - |
| White | % non-Hispanic White population | 0.47 (0.27) |
| Black | % Black population | 0.21 (0.24) |
| Hispanic | % Hispanic population | 0.26 (0.20) |
| Asian | % Asian population | 0.15 (0.14) |
| Age group 25-34 | % of population 25-34 years old | 0.18 (0.06) |
| Age group over 65 | % of population over 65 years old | 0.14 (0.05) |
| Household size | Average household size in zipcode | 2.64 (0.50) |
| Household with children | % of households with children under 18 | 0.25 (0.08) |
| Educational attainment | % of population with Bachelor’s degree | 0.23 (0.10) |
| No health insurance | % of households without health insurance | 0.08 (0.04) |
| Public health insurance | % of households with public insurance | 0.39 (0.14) |
| Commute time | Average commute time (minutes) | 40.76 (7.12) |
| Median income | Median income in zipcode | 74K (37K) |
| Unemployment rate | % of labor force unemployed | 0.07 (0.03) |
| Owner occupied units | % of housing units occupied by owner | 0.37 (0.22) |
| One or two family home | % of one or two family home units | 0.30 (0.31) |
| Public housing | % of public housing units | 0.05 (0.08) |
| Residential area | Residential building area as % of total built area | 0.65 (0.20) |
| Office area | Office building area as % of total built area | 0.09 (0.15) |
| Commercial area | Commercial (non-office) building area as % of total built area | 0.28 (0.17) |
| Hospital | Hospital(s) located in zipcode (yes=1) | 0.21 (0.41) |
| Nursing home | Number of occupied nursing home beds | 507 (837) |

Note: Standard deviations are in parentheses for continuous variables with mean values.

Table 3: Summary statistics of input variables

Then, we develop bivariate and multivariate log-transformed regression models to identify any statistically significant effect of $E_x\rho$ on infections, controlling for neighborhood characteristics. Ordinary least squares (OLS) regression models are applied with four (4) dependent variables, representing four measures of COVID-19 infectivity (Table 2), including case rate, death rate, positivity rate, and deaths per case. Table 3 provides descriptive statistics for the included independent variables. The bivariate models take $E_x\rho$ change (as a percent) as a continuous variable to measure the marginal effects of activity change on infection rates. The multivariate models, on the other hand, use dummy variables for each clustered neighborhood group to evaluate disparities between groups. The linear models are specified as:

$$\ln(y_z) = \beta_0 + \boldsymbol{\beta} X_z + \varepsilon_z$$

where $\ln(y_z)$ is the logarithmic transformed zipcode-level COVID-19 outcome variable, $X_z$ for the bivariate model is exposure density change, $X_z$ for the multivariate model includes the cluster group dummy variables and the set of neighborhood demographic, socioeconomic, and built environment features, and $\varepsilon_z$ is the error term. We also consider interaction terms between the neighborhood groups and other social determinants of health. We use correlation tests and Variance Inflation Factor (VIF) analysis to identify multicollinearity as part of the feature selection process. The coefficients $\boldsymbol{\beta}$ evaluate the effects of neighborhood $E_x\rho$ on disparities in disease risk.
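The log-linear OLS fit and the VIF screen can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the study's modeling code: the function names `fit_log_ols` and `vif` are hypothetical, the data are synthetic, and no standard errors or significance tests are computed.

```python
import numpy as np


def fit_log_ols(X, y):
    """OLS fit of ln(y) on X with an intercept; returns [b0, b1, ...]."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # a single regressor (the bivariate case)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coef


def vif(X):
    """Variance inflation factor of each column of X: 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out.append(ss_tot / ss_res)  # equals 1 / (1 - R_j^2)
    return np.array(out)


# Synthetic bivariate example: the outcome is exactly exp(1 + 2 * change).
change = np.array([-0.5, -0.3, -0.1, 0.0])
outcome = np.exp(1.0 + 2.0 * change)
coef = fit_log_ols(change, outcome)

# Two orthogonal, centered regressors carry no shared variance.
features = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
vifs = vif(features)
```

Because the outcome is log-transformed, a slope b on a change expressed as a fraction corresponds to an approximate multiplicative effect of exp(b) on the outcome. Zipcodes with a zero outcome would need separate handling before taking logs.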

4 Results and Findings

4.1 Changes in exposure density by neighborhood

Figure 2: Neighborhood activity measurements by land use types at 250m grid cell level before and after the stay-at-home order (A) citywide average neighborhood activity volume changes (B) citywide average neighborhood activity proportion changes (C) Example neighborhoods - zipcode 10001 (Chelsea, Manhattan) and zipcode 11364 (Bayside, Queens)
Figure 3: Neighborhood exposure density. Left: Percentage change in exposure density by zipcode. Right: Individual time series representing exposure density change over time by zipcode with citywide moving average reference. Citywide, there is an approximately 20% decrease in exposure density after the stay-at-home order.

Citywide and neighborhood hourly activity changes by land use type are presented in Figure 2. The citywide overall activity volume (Figure 2A) decreased, especially outdoors (sidewalks or parks) and in non-residential spaces (office and retail), which show significant reductions (20% and 33% decrease, respectively) after the stay-at-home order when compared to the pre-COVID baseline. We also observe changes to typical activity peak periods before and after the stay-at-home order. In the pre-COVID period, there are two peaks (around 8am and 6pm) each weekday, a result of commute patterns during rush hour periods. The morning peak, however, is no longer discernible in the post-COVID period as remote work and school closures reduce commuting activity. In addition to population volume changes, the analysis of where activity is occurring reveals significant behavior changes across the City (Figure 2B). We observe residential activities increased by three percentage points (a +10% change) and non-residential activities decreased by three percentage points (a -13% change) after the stay-at-home order. This demonstrates both a decrease in overall population, as many residents left the city, and a shift in activities from non-residential and outdoor areas to residential buildings for those that remained. Figure 2C illustrates this change in activity (both volume and proportion) for two exemplar neighborhoods. In the case of zipcode 10001 in Midtown Manhattan, the number of people staying in the neighborhood decreased by more than 60%, with the activities of those who remained becoming more evenly distributed between residential, non-residential, and outdoor land uses. This highlights the exodus of residents from the city and the reduction in the number of visitors to the neighborhood. Of those who stayed, more remained at home than would be expected. Another example neighborhood, zipcode 11364 at the eastern edge of Queens, shows a substantial increase in residential activity volume and in the proportion of activities occurring in residential areas.

As transmission risk increases with a greater probability of close contacts outside of the household or family unit, we quantify exposure density based on activities in non-residential buildings and outdoor areas. We measure the average number of hourly users per grid cell outside of residential buildings for 177 zipcodes during the pre-impact period and after the stay-at-home order. Figure 3 illustrates the percentage change in exposure density. There is a 20% citywide decrease in exposure density after the stay-at-home order, but we observe significant disparities in behavior change across neighborhoods. For instance, there is a 50% decrease, on average, in exposure density in Manhattan, Long Island City, and Downtown Brooklyn, while activities in the south area of Staten Island, South Brooklyn, and the east side of Queens exhibit only minor variations.
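The measure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the record layout (zipcode, period, hour, grid cell, land use, user count), the land-use labels, and the function names are all assumptions, and the numbers are invented.

```python
from collections import defaultdict

# Hypothetical land-use labels; only non-residential and outdoor activity
# counts toward exposure density in this sketch.
EXPOSED_LAND_USES = {"non_residential", "outdoor"}


def exposure_density(records):
    """Average hourly users per grid cell outside residential buildings.

    `records` is an iterable of
    (zipcode, period, hour, cell_id, land_use, users) tuples.
    Returns {(zipcode, period): mean users per observed (hour, cell) pair}.
    """
    totals = defaultdict(float)  # users outside residential space
    slots = defaultdict(set)     # distinct (hour, cell) pairs observed
    for zipcode, period, hour, cell_id, land_use, users in records:
        key = (zipcode, period)
        slots[key].add((hour, cell_id))
        if land_use in EXPOSED_LAND_USES:
            totals[key] += users
    return {key: totals[key] / len(slots[key]) for key in slots}


def percent_change(density, zipcode):
    """Percentage change from the 'pre' period to the 'post' period."""
    pre = density[(zipcode, "pre")]
    post = density[(zipcode, "post")]
    return 100.0 * (post - pre) / pre


# Invented toy data: one zipcode, one grid cell, two hours per period.
records = [
    ("10001", "pre", 8, "c1", "non_residential", 60),
    ("10001", "pre", 8, "c1", "residential", 20),
    ("10001", "pre", 9, "c1", "outdoor", 40),
    ("10001", "post", 8, "c1", "non_residential", 20),
    ("10001", "post", 8, "c1", "residential", 30),
    ("10001", "post", 9, "c1", "outdoor", 30),
]
density = exposure_density(records)
change = percent_change(density, "10001")
```

In this sketch, hour and cell pairs with only residential activity still count in the denominator, so they contribute zeros to the average; whether the original analysis treats them that way is not stated.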

4.2 Disparities in neighborhood exposure density

We classify neighborhoods into distinct groups using a hierarchical clustering algorithm based on changes in community activity levels and proportions before and after the stay-at-home order. Figure 4 illustrates the spatial patterns of the clustering output with associated time series of neighborhood activity and where (by land use type) that activity is occurring. Descriptive statistics of input variables and neighborhood features for each group, shown in Table 4 and Table 5 respectively, reveal distinct neighborhood profiles based on changes in exposure density over time.
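The grouping step can be illustrated with a small agglomerative clustering routine. The text does not state the linkage criterion or distance metric, so average linkage with Euclidean distance is an assumption here, as are the function names and the six invented feature vectors (residential, non-residential, and outdoor volume change).

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with a < b, so deleting b
        # after the merge leaves index a valid.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(points, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Invented activity-change features for six hypothetical zipcodes.
points = [
    (-0.55, -0.60, -0.60),
    (-0.50, -0.58, -0.62),
    (-0.20, -0.20, -0.15),
    (-0.22, -0.18, -0.20),
    (0.20, 0.00, 0.05),
    (0.15, -0.02, 0.08),
]
groups = agglomerative(points, 3)
```

Each returned list holds the indices of the zipcodes assigned to one cluster. A production analysis would more likely rely on an established library implementation; the pure-Python version is used only to keep the sketch self-contained.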

Figure 4: Spatial patterns of the agglomerative clustering results and associated neighborhood activity change (top time series: activity volumes by land use, bottom time series: activity proportions by land use)

Based on this analysis, we identify five (5) neighborhood clusters. Group 1 (21 zipcodes) and Group 2 (21 zipcodes), which we label “outflow” neighborhoods, are primarily located in Manhattan and Downtown Brooklyn and represent substantial changes in exposure density after the stay-at-home order. As shown in Table 5, the average activity volume change for Group 1 and Group 2 is -56.5% and -33.5%, respectively, meaning these two neighborhood groups experienced nontrivial declines in normal activity levels, across all land use types, during the pandemic. Most neighborhoods in Group 1 and Group 2 have a higher percentage of younger, non-Hispanic white residents, relatively smaller average household size, and higher incomes and educational attainment. This indicates residents in these clusters are among the least vulnerable population groups and can afford to leave their home neighborhoods (or stay at home) by shifting to remote working environments to avoid the exposure risk, resulting in decreased exposure density. Even though these two clusters present similar outflow patterns with respect to neighborhood activity volume, the activity proportion changes exhibit some notable differences. While the proportion of residential activities in Group 1 increased by 12% without any significant changes in non-residential and outdoor activities, Group 2 shows a 14% increase in non-residential activity, a function of the relative pre-COVID resident population size. Therefore, we refine the labels for Group 1 and Group 2 as “outflow-mixed use” and “outflow-residential”, respectively.

| Activity change (%) | Group 1 (yellow) | Group 2 (blue) | Group 3 (orange) | Group 4 (green) | Group 5 |
|---|---|---|---|---|---|
| Residential volume | -0.52 | -0.37 | -0.20 | -0.01 | 0.20 |
| Residential proportion | 0.12 | 0.01 | -0.01 | 0.07 | 0.09 |
| Non-residential volume | -0.60 | -0.28 | -0.19 | -0.13 | -0.00 |
| Non-residential proportion | -0.01 | -0.14 | 0.00 | -0.07 | -0.09 |
| Outdoor volume | -0.61 | -0.42 | -0.18 | -0.07 | 0.07 |
| Outdoor proportion | -0.04 | -0.07 | 0.02 | -0.01 | -0.03 |
Table 5: Neighborhood cluster characteristics. Statistically significant differences between groups based on one-way ANOVA and Tukey’s multi-comparison method. Mean values with standard deviation in parentheses
| Feature | Group 1 (yellow) | Group 2 (blue) | Group 3 (orange) | Group 4 (green) | Group 5 |
|---|---|---|---|---|---|
| Demographic and socioeconomic features | | | | | |
| Age group 25-34 (%) | 0.28 (0.08) | 0.22 (0.05) | 0.19 (0.04) | 0.16 (0.06) | 0.13 (0.02) |
| Age group over 65 (%) | 0.12 (0.07) | 0.15 (0.06) | 0.12 (0.04) | 0.14 (0.05) | 0.17 (0.07) |
| Black (%) | 0.05 (0.04) | 0.14 (0.17) | 0.27 (0.22) | 0.31 (0.28) | 0.16 (0.25) |
| Non Hispanic (%) | 0.90 (0.05) | 0.77 (0.19) | 0.60 (0.21) | 0.72 (0.07) | 0.82 (0.11) |
| Foreign born (%) | 0.16 (0.08) | 0.14 (0.05) | 0.18 (0.08) | 0.15 (0.07) | 0.13 (0.08) |
| Avg household size | 1.92 (0.26) | 2.21 (0.35) | 2.61 (0.31) | 2.90 (0.45) | 2.91 (0.37) |
| College degree (%) | 0.40 (0.07) | 0.31 (0.09) | 0.20 (0.08) | 0.19 (0.07) | 0.20 (0.04) |
| Unemployment rate | 0.04 (0.01) | 0.05 (0.03) | 0.08 (0.03) | 0.08 (0.04) | 0.06 (0.02) |
| Healthcare support workers (%) | 0.01 (0.01) | 0.03 (0.02) | 0.06 (0.04) | 0.07 (0.04) | 0.05 (0.03) |
| Retail service workers (%) | 0.03 (0.01) | 0.04 (0.02) | 0.06 (0.01) | 0.05 (0.02) | 0.05 (0.02) |
| Median income ($) | 133K | 90K | 54K | 62K | 72K |
| Avg commute time (minutes) | 27.05 (3.00) | 33.83 (4.15) | 41.86 (3.23) | 44.7 (3.87) | 45.30 (3.73) |
| No health insurance (%) | 0.04 (0.02) | 0.06 (0.03) | 0.09 (0.03) | 0.09 (0.04) | 0.07 (0.04) |
| Owner occupied units (%) | 0.26 (0.12) | 0.23 (0.12) | 0.22 (0.14) | 0.41 (0.21) | 0.59 (0.20) |
| Urban form features | | | | | |
| Residential area (%) | 0.30 (0.20) | 0.71 (0.13) | 0.69 (0.14) | 0.69 (0.14) | 0.71 (0.18) |
| Office area (%) | 0.43 (0.24) | 0.05 (0.06) | 0.05 (0.03) | 0.04 (0.03) | 0.03 (0.02) |
| Commercial area (%) | 0.57 (0.22) | 0.24 (0.10) | 0.25 (0.12) | 0.25 (0.13) | 0.21 (0.13) |
| One or two family units (%) | 0.00 (0.00) | 0.03 (0.05) | 0.15 (0.15) | 0.41 (0.27) | 0.64 (0.26) |
| COVID-19 features | | | | | |
| Case rate | 1166.60 (431.88) | 1570.96 (621.38) | 2475.90 (786.84) | 2790.36 (777.17) | 2534.96 (630.57) |
| Death rate | 91.12 (76.79) | 150.63 (84.10) | 219.87 (83.11) | 224.46 (97.73) | 195.78 (116.87) |
| Positivity rate | 0.11 (0.03) | 0.15 (0.05) | 0.22 (0.05) | 0.24 (0.04) | 0.23 (0.04) |
Table 4: Descriptive statistics of neighborhood clusters
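The group comparisons rely on one-way ANOVA, followed by Tukey's multi-comparison method. As a reminder of what the ANOVA step computes, the following sketch derives the F statistic by hand for three hypothetical groups. The function name and the sample values are assumptions for illustration only, and the Tukey step is not shown.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of samples.

    F = (between-group mean square) / (within-group mean square).
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented household-size samples for three hypothetical clusters.
samples = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_stat = one_way_anova_f(samples)
```

For this toy input the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 13. A large F relative to the F distribution with those degrees of freedom indicates that at least one group mean differs.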

Group 3 (43 zipcodes) neighborhoods exhibit a 19% decrease, on average, in exposure density. This is driven by a reduction in population density from those leaving the city. These neighborhoods maintain a stable proportion between the different land uses, indicating that the residents who remain in these communities maintain their regular behavior patterns. When compared to the “outflow” groups, these “stable-outflow” communities have higher proportions of racial and ethnic minorities and foreign-born residents, lower median income, and significantly higher proportions of renter households and residents without health insurance. Additionally, a greater percentage of employees in these neighborhoods work in retail services and healthcare support occupations, essential businesses that were not required to close during the outbreak. Like Group 3 neighborhoods, communities in the Group 4 cluster have stable activity patterns over time; however, these neighborhoods did not see a significant out-mover population. These communities, which we label “stable-stable”, comprise socioeconomically vulnerable households and the highest proportion of racial minorities, coupled with the second lowest income, large average household size, high unemployment rate, low educational attainment, and a large share of healthcare support workers. Residents of such socially and economically vulnerable neighborhoods are less likely to be able to work from home because their occupations require physical presence at the workplace, leading to fewer opportunities to reduce exposure to others. We also find that the relatively modest change in exposure density in these “stable” groups (18% and 10% decrease in non-residential activity density for Group 3 and Group 4, respectively) is associated with significantly higher infection rates. In particular, the “stable-stable” neighborhood group shows the highest case rate (2790.36), death rate (224.46), and positivity rate (0.24) in the City.

In comparison to other clusters, Group 5 (“shelter-in-place”) neighborhoods demonstrate a 20% increase in local activity volume for residential activities and a 7% increase for outdoor activities. In addition to increasing neighborhood activity volume, residents staying in these neighborhoods are found to shift activity to residential buildings (by 10%) and away from non-residential and outdoor activities (by 6%). While non-residential activities are found to decrease as a proportion of the three activity types, the increase in the overall volume of activity leads to a net increase in exposure density. This group has the highest proportion of elderly population, the largest household size, moderate incomes, a relatively lower percentage of racial and ethnic minorities, and a significantly higher homeownership rate. This indicates that activity in these neighborhoods, where population density is the lowest in the city, became more localized. As a result, Group 5 experienced the second-highest infection rate (2534.96 case rate) in the city.

4.3 Disparities in health outcomes: Effects of exposure density on infection rates

Figure 5: Scatter plots of exposure density versus COVID-19 case rates, with incubation period. Left: cumulative cases per 100,000 people, Right: daily new confirmed cases on April 15. Colors represent individual clusters and black curves are exponential best-fit lines.
Figure 6: Scatter plot of exposure density versus the log–transformed cumulative COVID-19 1) case rate, 2) death rate, 3) positivity rate, and 4) deaths per case – with linear best fit lines.

The number of confirmed cases per 100,000 people (daily new cases and accumulated cases) on April 15 (five (5) days after the impact period to account for the virus incubation period) is plotted against the net change in exposure density before and after the social distancing order, as shown in Figure 5. We observe an exponential relationship based on the scatter plots with statistically significant positive correlations between exposure density and infection rates (correlation coefficients of 0.52 and 0.47, respectively).

The results of the bivariate regression model are shown in Figure 6. Exposure density is found to be correlated with most infection metrics, namely case rate, death rate, and positivity rate, while, as expected, not being a statistically significant determinant of mortality rate. These three models explain 34%, 15%, and 42% of the variance in outcomes, respectively, with the highest explanatory power exhibited in the positivity rate model. According to the model outputs, a one percentage point decrease in exposure density is associated with a 1.33% reduction in case rate, a 1.59% reduction in death rate, and a 1.16% reduction in positivity rate in NYC. Based on these results, if all neighborhoods reduced activities away from home by 10% as compared to normal activity levels prior to the stay-at-home order, the City could have avoided more than 28,000 COVID-19 cases and saved 2,000 lives.
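A bivariate model of this kind, with a log-transformed outcome regressed on the change in exposure density, can be fitted with ordinary least squares. The sketch below is an illustration on synthetic data, not a reproduction of the authors' models: the function name is hypothetical, and the slope of 0.0133 is chosen only to echo the magnitude reported above.

```python
import math


def fit_log_linear(x, y):
    """OLS fit of ln(y) = a + b * x; returns (a, b, r_squared)."""
    ln_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ln_y = sum(ln_y) / n

    # Closed-form simple regression: slope = Sxy / Sxx.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_ln_y) for xi, yi in zip(x, ln_y))
    b = sxy / sxx
    a = mean_ln_y - b * mean_x

    # R-squared: share of the variance in ln(y) explained by the line.
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, ln_y))
    ss_tot = sum((yi - mean_ln_y) ** 2 for yi in ln_y)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic data: exposure density change (percentage points) and a
# noise-free case rate generated from an exponential relationship.
x = [-60.0, -40.0, -20.0, 0.0, 20.0]
y = [1000.0 * math.exp(0.0133 * xi) for xi in x]
a, b, r2 = fit_log_linear(x, y)
```

Because the synthetic data contain no noise, the fit recovers the generating slope and an R-squared of essentially 1. With real neighborhood data the scatter around the line is what produces the lower explained-variance figures quoted above.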

| Feature | Model 1: case rate | Model 2: death rate | Model 3: positivity rate | Model 4: deaths per case |
|---|---|---|---|---|
| Num of obs. | 177 | 177 | 177 | 177 |
| F-stat. | 35.69 | 16.59 | 53.26 | 10.90 |
| Prob > F | 0 | 0 | 0 | 0 |
| R-squared | 0.77 | 0.61 | 0.83 | 0.50 |
| Intercept | 7.040 (0.171)*** | 2.862 (0.403)*** | 2.359 (0.116)*** | -3.848 (0.249)*** |
| Group “outflow-mixed” and “outflow-residential” | -.0632 (0.135)*** | -0.716 (0.318)*** | -0.443 (0.091)*** | 0.128 (0.197) |
| Group “outflow-stable” | -0.436 (0.142)*** | -0.003 (0.335) | -0.228 (0.096)** | 0.426 (0.207)** |
| Group “influx” | -0.051 (0.115) | -0.010 (0.273) | -0.130 (0.078)* | 0.050 (0.169) |
| % Black | 0.005 (0.001)*** | 0.007 (0.002)*** | 0.004 (0.001)*** | 0.002 (0.001)* |
| % Hispanic | 0.009 (0.001)*** | 0.003 (0.003)*** | 0.005 (0.001)*** | 0.006 (0.002)* |
| % units occupied by owner | 0.002 (0.001)* | -0.005 (0.003) | 0.003 (0.001)*** | -0.007 (0.002)*** |
| % household with kids | 0.012 (0.003)*** | 0.028 (0.008)*** | 0.013 (0.002)*** | 0.014 (0.005)*** |
| % employees working from home | -0.018 (0.008)** | 0.010 (0.019) | -0.016 (0.005)*** | 0.015 (0.011) |
| Num of occupied nursing home beds per 100 people | 0.036 (0.010)*** | 0.086 (0.024)*** | 0.008 (0.007) | 0.059 (0.015)*** |
| % household without health insurance | -0.018 (0.011)* | 0.046 (0.025)* | -0.003 (0.007) | 0.056 (0.015)*** |
| Insurance group effect 1 | 0.062 (0.017)*** | 0.088 (0.041)*** | 0.046 (0.012)*** | 0.001 (0.025)* |
| Insurance group effect 2 | 0.042 (0.014)*** | 0.010 (0.033) | 0.021 (0.010)** | -0.031 (0.021) |
| Insurance group effect 3 | 0.001 (0.013) | -0.008 (0.031) | 0.018 (0.009)** | -0.008 (0.019) |
| Age group over 65 | 0.014 (0.005)*** | 0.069 (0.011)*** | 0.008 (0.003)*** | 0.043 (0.007)*** |
| % Public housing area | -0.005 (0.003)* | 0.006 (0.007) | -0.002 (0.002) | 0.009 (0.004)** |

Note: Standard errors are in parentheses with regression coefficients.
*** p-value < 0.01, ** p-value < 0.05, * p-value < 0.1
Table 6: Results of the multivariate regression models

The results of the multivariate regression models, which control for neighborhood socioeconomic and demographic covariates, are described in Table 6. We combine both “outflow” groups (“outflow-mixed” and “outflow-residential”) and use the “stable” neighborhood group as the reference case. After accounting for various demographic characteristics, we continue to observe statistically significant coefficients for the exposure density variables. The positivity rate model (model 3) shows the most dramatic effects of behavior change on measured health outcomes. Neighborhoods that reduced exposure density are shown to have a 13% lower positivity rate compared to the reference group. For outflow neighborhoods that maintain the distribution of activities across land use types (classified as the “outflow-stable” group), the output is a 23% lower positivity rate. A similar pattern is also found in the case rate model (model 1).

Additionally, race and ethnicity, age group, and socioeconomic status are also found to have statistically significant effects on neighborhood infection rates and outcomes. Communities with larger proportions of minority and lower income populations are more likely to be at risk for virus transmission. For example, for every 10% increase of Hispanic residents in the community, the positivity rate increases by 5%, the case rate increases by 9%, and the death rate increases by 6%. This finding holds after accounting for changes in exposure density. As expected, exposure density is not shown to be a statistically significant feature in the death rate model (model 2), while the variables related to the presence of vulnerable populations have a significant negative impact on the survival probability. We find that elderly population, lack of health insurance coverage, and a high proportion of people living in public housing have positive and statistically significant associations with death rates across the city. Thus, the mortality risk of the virus in socially vulnerable neighborhoods is higher than in other communities, resulting from pre-existing health conditions and the lack of adequate healthcare access. This also explains, in part, why the “outflow-stable” group, with the highest proportion of lower income residents without health insurance, experienced an approximately 43% higher fatality rate compared to the reference group, despite observed lower infection rates.
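Percent effects such as those quoted for Hispanic residents follow from the coefficients if the outcomes are log-transformed, which is assumed here on the basis of the log-transformed outcomes in Figure 6. Under that assumption, a change of delta in a covariate with coefficient beta multiplies the outcome by exp(beta * delta). The helper name below is hypothetical; the coefficients are the % Hispanic row of Table 6 as printed.

```python
import math


def percent_effect(coefficient, delta):
    """Percent change in the outcome for a `delta` change in a covariate,
    assuming a model of the form ln(outcome) = ... + coefficient * covariate.
    """
    return 100.0 * (math.exp(coefficient * delta) - 1.0)


# % Hispanic coefficients as printed in Table 6 (per percentage point).
hispanic = {
    "case rate": 0.009,
    "death rate": 0.003,
    "positivity rate": 0.005,
    "deaths per case": 0.006,
}

# Effect of a 10 percentage point increase in Hispanic residents.
effects = {name: percent_effect(beta, 10) for name, beta in hispanic.items()}
```

With the printed coefficients, a 10 percentage point increase corresponds to roughly 9.4% (case rate), 3.0% (death rate), 5.1% (positivity rate), and 6.2% (deaths per case). For small coefficients the result is close to the simple product 100 * beta * delta.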

5 Discussion and Conclusion

We present a novel approach to measuring exposure density at high spatial and temporal resolution. By integrating geolocation data and land use classifications, we are able to establish both the extent of activity in a particular area and the nature of that activity across residential, non-residential, and outdoor land uses. This approach is scalable to any areal unit of interest: here we utilize a 250m grid and aggregate to the zipcode level to match the geography of reported health data. However, it is possible to apply the same methodology to point locations or grids of any size, and then aggregate the units to other common administrative or political boundaries, such as census tracts, counties, and metropolitan areas. We normalize our data to enable comparative studies between regions and to scale the analysis to other cities with similar land use data resources.

Our findings demonstrate distinct patterns of activity before and after the stay-at-home order across neighborhoods. These neighborhood patterns are clustered into five unique groups, each exhibiting statistically significant variations in socioeconomic and demographic characteristics of residents. In wealthier neighborhoods of Manhattan and Brooklyn, we observe an exodus of residents leaving for other areas around NYC or regions further afield. Presumably, these residents have the means to relocate to second homes or rental homes that provide a greater degree of (perceived) safety from the virus. In addition to the financial ability to make such a move, residents in these neighborhoods also had, in many cases, the option to work remotely, thus reducing the transaction costs of leaving their primary residence. Conversely, we observed clusters of poor, minority neighborhoods that faced greater infection risk. While some residents in neighborhoods in the “stable” groups did relocate, the large majority stayed in their communities and continued on with their typical (pre-COVID) routines. As a result, we found that the exposure density in these neighborhoods remained relatively constant over the study period, reflecting the need to commute to work and other places of responsibility, especially given that many of those employed worked in occupations deemed essential services. Finally, we find a cluster of neighborhoods that increased their exposure density due to an increased amount of localized activity. These neighborhoods, characterized by lower density, single-family homes in areas further from the Manhattan central business district, are found to have both a greater volume of activity and more activity taking place in non-residential and outdoor areas than normal. The effect of this local activity was an increase, compared to pre-COVID levels, in the probability of coming in contact with others outside of the household or family unit.

The variation in exposure density has a direct and measurable impact on the risk of infection. In neighborhoods where exposure density decreased the most, we find lower infection rates, positivity rates, and per capita death rates, controlling for other covariates associated with social determinants of health. The communities hardest hit by the virus were in the “stable-stable” neighborhoods, where residents faced multiple challenges and risk factors. In addition to continuing their normal activity patterns, and thus exposing themselves to greater risk of infection while commuting and in their place of work, these communities have the largest proportion of minorities, among the lowest median incomes, and the lowest rate of health insurance coverage. These compound risks resulted in these vulnerable communities facing the burden of the highest rate of infection, death rate, and positivity rate in the City during the study period.

Our work highlights the importance of understanding neighborhood activity patterns in evaluating the determinants of health outcomes and risk factors for future infection outbreaks. By measuring exposure density at the community scale, we are able to determine the differential behavioral response to social distancing policies based on local risk factors and socioeconomic inequality. Our results expose the significant disparities in health outcomes for racial and ethnic minorities and lower income households. Exposure density provides an additional metric to further explain and understand the disparate impact of COVID-19 on vulnerable communities.


The authors would like to thank the New York University Center for Urban Science and Progress (NYU CUSP) Research Computing Facility (RCF) for providing and managing database infrastructure and VenPath, Inc. for providing data.


This work was supported, in part, by grants from the National Science Foundation, No. 1653772 and No. 2028687, and from NYU C2SMART, a USDOT Tier 1 University Transportation Center. Any opinions, findings, and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of any supporting institution. All errors remain the authors'.

Competing interests

The authors declare that they have no competing interests.