Nowcasting Gentrification Using Airbnb Data

01/15/2021 ∙ by Shomik Jain, et al. ∙ 0

There is a rumbling debate over the impact of gentrification: presumed gentrifiers have been the target of protests and attacks in some cities, while they have been welcome as generators of new jobs and taxes in others. Census data fails to measure neighborhood change in real-time since it is usually updated every ten years. This work shows that Airbnb data can be used to quantify and track neighborhood changes. Specifically, we consider both structured data (e.g. number of listings, number of reviews, listing information) and unstructured data (e.g. user-generated reviews processed with natural language processing and machine learning algorithms) for three major cities, New York City (US), Los Angeles (US), and Greater London (UK). We find that Airbnb data (especially its unstructured part) appears to nowcast neighborhood gentrification, measured as changes in housing affordability and demographics. Overall, our results suggest that user-generated data from online platforms can be used to create socioeconomic indices to complement traditional measures that are less granular, not in real-time, and more costly to obtain.

1. Introduction

Gentrification is a revitalization process characterized by physical and socioeconomic changes in urban neighborhoods. These changes usually involve in-movers that are affluent, educated, or younger compared to out-movers that are poor, uneducated, or older (Freeman, 2005; Lees et al., 2008; Zuk et al., 2018). Disadvantaged neighborhoods are especially vulnerable to gentrification because their residents resemble these out-movers and because these neighborhoods have experienced disinvestment from the public or private sectors (Freeman, 2005). Indeed, over 20% of disadvantaged neighborhoods across the US have gentrified since 2000 (Maciag, 2015). Of particular concern in these neighborhoods is gentrification-induced displacement, a phenomenon in which out-movers are forced to move for reasons beyond their control (Zuk et al., 2018). To mitigate these negative effects of gentrification, governments and municipalities have tried many strategies such as rent-control and public investment (Levy et al., 2007).

Implementing gentrification policies requires first identifying gentrifying neighborhoods. Since gentrification is associated with socioeconomic changes, governments have measured gentrification using demographic data from public agencies such as the US Census Bureau and UK Office of National Statistics (ONS) (Zuk et al., 2018). However, these agencies rely on survey-based methods for obtaining demographic data, which pose several problems. First, government surveys are expensive: the 2020 US Census will cost over $15 billion (Government Accountability Office, 2019), and the 2021 UK Census will cost at least $1 billion (Cope, 2016). This amounts to $50-$100 just to survey a single household on average. Second, government data are quickly outdated because they represent a fixed point in time and are released with a delay. The main Census in both the US and UK occurs every 10 years. In addition, the US Census Bureau reports 5-year estimates of demographic data obtained through the American Community Surveys (ACS), and the UK ONS reports the Indices of Multiple Deprivation (IMD) every 4 years. Due to these problems, both the US and UK governments have expressed concerns about the future of survey-based methods (US Census Bureau, 2016; Shaw, 2020).

The nowcasting of socioeconomic indices related to gentrification would constitute a major improvement over the status quo of outdated government data. With nowcasted information, policymakers could make data-driven decisions to address the negative effects of gentrification in the present. For these reasons, there has been a growing interest in using alternative sources of data to measure and nowcast important urban and economic outcomes, as we shall see in Section 2. For example, Glaeser et al. (2018a) use Yelp data to quantify neighborhood changes; Naik et al. (2014) propose “Streetscore”, a scene-understanding algorithm that predicts the perceived safety of a streetscape using Google Maps Street View data; and Glaeser et al. (2018b) show that they can predict the median income of residents in New York City from Google Maps images using a computer vision model. It is worth pointing out that prior work has focused on nowcasting gentrification, as opposed to forecasting it, mainly because alternative data sources have grown substantially only in recent years, making the validation of any forecasting model very challenging.

In this paper, we use user-generated data from Airbnb, a popular peer-to-peer short-term rental platform, to nowcast gentrification. Specifically, our work contributes to the growing scientific and public debate about short-term rentals and their urban impact (Quattrone et al., 2018; Wachsmuth and Weisler, 2018; Yrigoy, 2016). We use a combination of structured data (e.g., listing information) and unstructured data (e.g., the textual content of reviews) processed with a variety of machine learning techniques. Given that gentrification is most prevalent in large cities (Zuk et al., 2018), this work focuses on two major US cities, New York City and Los Angeles, and one major European city, London. We measure gentrification as the change in socioeconomic variables between two temporal windows, 1998–2002 and 2013–2017, nowcast it from Airbnb data collected in 2013–2017, and make two main contributions:

  1. We mine a variety of data sources and profile neighborhoods in terms of gentrification scores and Airbnb features (Section 3). For each neighborhood, we construct a gentrification score based on changes in socioeconomic measures of age, education, housing affordability, and income (Section 3.1). We then collect both structured Airbnb data (e.g., number of listings, number of reviews, listing information) and unstructured Airbnb data (e.g., user-generated reviews), which we analyze and process using machine learning algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and representation learning (Bengio et al., 2013) (Section 3.2).

  2. We then study the ability of Airbnb to nowcast gentrification (Section 4). We start by showing that there is a high correlation between gentrification and Airbnb data (Section 4.1). We also find that unstructured data processed with machine learning tools tend to have more explanatory power (Section 4.2), suggesting that unstructured data can capture aspects of gentrification beyond those captured by structured data alone. Given these strong correlations, we develop models using Airbnb data that show the potential to nowcast gentrification with user-generated information (Section 4.3).

2. Related Work

Recent years have seen a rapid growth in online platforms and user-generated information. Social media platforms such as Facebook, Twitter, and Instagram allow users to freely share content and opinions with both friends and strangers; online review platforms such as TripAdvisor and Yelp allow anyone to write reviews about any kind of service, from hotels to restaurants to even health-care; and sharing economy platforms such as Airbnb allow people to review the homes and neighborhoods of strangers. The value of user-generated information is growing as the size of the data increases, as more people use these digital platforms, and as more people rely on such data to make decisions. For example, a growing body of research provides evidence suggesting that online reviews and ratings can affect firms’ sales and revenues (Chevalier and Mayzlin, 2006; Luca, 2016). Turning to social media, researchers have exploited user-generated content to predict consumer preferences and behavior (Zhang and Pennacchiotti, 2013), and brand perception (Liu et al., 2018; Culotta and Cutler, 2016), among other applications.

Additionally, there has been a growing interest in using alternative sources of data to measure and predict important urban and economic outcomes. This is for two reasons. First, user-generated content is readily available for free at any geographical granularity (zipcode, city, or state level) and, more importantly, at high frequency and in real-time. Second, in the last few years, there have been tremendous improvements in computation power and algorithms to manage and process large amounts of structured and unstructured data. Within this literature, we find a diverse set of papers with different goals. In terms of macroeconomic indices, Antenucci et al. (Antenucci et al., 2014) and Proserpio et al. (Proserpio et al., 2016) use Twitter data to predict labor market outcomes such as the unemployment rate. With respect to cities and more granular measures, Naik et al. (Naik et al., 2014) propose “Streetscore”, a scene-understanding computer vision algorithm that measures the perceived safety of neighborhoods using Google Maps Street View data. Their subsequent work (Glaeser et al., 2018b) shows that Google Maps images can predict the median income of residents in New York City neighborhoods. Cranshaw et al. (Cranshaw et al., 2012) and Venerandi et al. (Venerandi et al., 2015) also use Foursquare data to quantify urban trends, and Hristova et al. (Hristova et al., 2018) use Flickr data to quantify cultural capital and predict changes in house prices. Finally, the work closest to ours is that of Glaeser et al. (Glaeser et al., 2018a), who show that the number of businesses listed on Yelp is highly correlated with changes in socioeconomic variables related to gentrification, including age, education, and housing. The main difference between (Glaeser et al., 2018a) and our work is that we use a combination of structured and unstructured data to predict gentrification.

Moreover, several papers have explored the characteristics of the neighborhoods in which Airbnb enters and the effects of this platform on economic activity. Quattrone et al. (Quattrone et al., 2018, 2016) analyze the spatial penetration of Airbnb in the US and UK and show that Airbnb listings tend to appear more often in neighborhoods occupied by the “talented and creative” classes, which resemble the in-movers of the gentrification process. Their subsequent work (Quattrone et al., 2020) analyzes Airbnb reviews and shows how this unstructured data contains important nuances that are neighborhood dependent. Furthermore, Basuroy et al. (Basuroy et al., 2020) show that increases in Airbnb listings in Texas zipcodes are associated with increases in economic activity in these zipcodes.

While previous work considers unstructured image data to measure urban outcomes, in this work, we focus on textual information and the process of gentrification. Our focus on text is driven by the hypothesis that user-generated data related to short-term rentals could contain latent valuable information about neighborhoods and their economic conditions, thus helping cities and municipalities better measure and understand the process of gentrification.

3. Profiling Neighborhoods

To nowcast gentrification, we create a gentrification score that captures changes in neighborhood socioeconomic conditions and create features from Airbnb data to predict such changes. We define neighborhoods as zipcodes in the US and as wards in the UK because these represent the most granular administrative divisions for which governments provide socioeconomic data. Zipcodes and wards are defined based on population sizes: a zipcode contains 8,000 people on average whereas a ward contains about 5,500 people. Thus, the geographical sizes of each vary based on population density. In total, there are about 200 zipcodes in New York City, 130 zipcodes in Los Angeles, and 630 wards in Greater London.

3.1. Gentrification Score

We construct an overall gentrification score for each neighborhood (zipcode or ward) based on changes in four socioeconomic measures: age, education, housing affordability, and income (Table 1). Since these measures are not published in the same year and come from a variety of data sources, we group them by two temporal windows, 1998–2002 and 2013–2017. We then compute the gentrification score from the change between the two windows.

Our gentrification score definition is similar to prior work that quantifies gentrification by aggregating changes in socioeconomic conditions between two time periods (Freeman, 2005; Lees et al., 2008; Zuk et al., 2018; Bousquet, 2017). Historically, governments have also measured gentrification using similar demographic data from public agencies such as the US Census Bureau and UK Office of National Statistics (ONS) (Bousquet, 2017). We specifically use the socioeconomic variables of age, education, housing affordability, and income because public agencies in both the US and UK collect these variables. While race has also been found to be associated with gentrification (Zuk et al., 2018), this variable is only available in the US. Therefore, we decided to omit this variable in our main analysis for consistency. However, in Table 7 in the Appendix, we show that including race in the gentrification score of US cities leads to similar results.

| Country | Measure | Definition | Data Source, 1998–2002 | Data Source, 2013–2017 |
|---|---|---|---|---|
| US | Age | Percent aged between 25 and 34 | 2000 Decennial Census (Census Bureau American Community Surveys, 2013) | 2013–2017 Census American Community Surveys (Census Bureau American Community Surveys, 2013) |
| US | Education | Percent with a bachelor's degree | (as above) | (as above) |
| US | Housing | Median gross rent | (as above) | (as above) |
| US | Income | Median household income | (as above) | (as above) |
| UK | Age | Percent aged between 25 and 34 | 2002 ONS (Office of National Statistics, 2002) | 2014 ONS (Office of National Statistics, 2002) |
| UK | Education | Attainment and skills in the population | 2000 ONS Indices of Multiple Deprivation (Office of National Statistics, 2000) | 2015 & 2019 ONS Indices of Multiple Deprivation (Office of National Statistics, 2000) |
| UK | Housing | Lack of physical and financial accessibility to housing | (as above) | (as above) |
| UK | Income | Percent not deprived from low income | (as above) | (as above) |

Note: The ONS collects data for the Indices of Multiple Deprivation two years prior to release (Office of National Statistics, 2015).

Table 1. Socioeconomic Measures for US Zipcodes and UK Wards.

First, we aggregate our socioeconomic measures to create a neighborhood index for each zipcode or ward. To standardize each measure, we consider the percentile within a given city instead of the raw value. We also use percentiles because the raw Indices of Multiple Deprivation (IMD) values are not comparable across different years (Office of National Statistics, 2015). Then, we construct a neighborhood index for each neighborhood $i$ and temporal window $t$ as the equally weighted combination of the four percentile measures:

$$\text{NeighborhoodIndex}_{i,t} = \text{Age}_{i,t} + \text{Education}_{i,t} + \text{Housing}_{i,t} + \text{Income}_{i,t} \qquad (1)$$

A lower neighborhood index indicates that a neighborhood is more disadvantaged: it has older, less educated, or poorer residents as well as cheaper housing. Similar to prior work (Freeman, 2005; Lees et al., 2008; Zuk et al., 2018), we define (and limit our analysis to) disadvantaged neighborhoods as those having a neighborhood index in the bottom percentile in the first time window (1998–2002). In doing so, we are left with 83 disadvantaged zipcodes in New York City, 68 disadvantaged zipcodes in Los Angeles, and 230 disadvantaged wards in London.

After standardizing each neighborhood index using its percentile, we define the gentrification score for each disadvantaged neighborhood $i$ in line with previous definitions (Freeman, 2005; Lees et al., 2008; Zuk et al., 2018):

$$\text{GentrificationScore}_{i} = \text{NeighborhoodIndex}_{i,t_2} - \text{NeighborhoodIndex}_{i,t_1} \qquad (2)$$

where $t_2$ is 2013–2017 and $t_1$ is 1998–2002. A higher gentrification score indicates that a neighborhood has experienced more gentrification on the basis of an influx of young, educated, or wealthy residents as well as decreased housing affordability.
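
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the neighborhood index is an equal-weight sum of within-city percentile ranks and that the gentrification score is the change in the index percentile between the two windows. The function names, the input layout, and the `cutoff` parameter (the percentile threshold for "disadvantaged") are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Percentile rank (0-100) of each value within its own list (one city)."""
    ordered = sorted(values)
    n = len(values)
    # Share of values less than or equal to v, expressed as a percentage.
    return [100.0 * bisect_right(ordered, v) / n for v in values]


def neighborhood_index(measures):
    """Equal-weight sum of the percentile ranks of each measure (assumption).

    `measures` maps a measure name to a list with one value per neighborhood,
    oriented so that higher means younger, more educated, pricier, or wealthier.
    """
    ranked = [percentile_ranks(column) for column in measures.values()]
    return [sum(ranks) for ranks in zip(*ranked)]


def gentrification_scores(early, late, cutoff):
    """Change in index percentile for neighborhoods starting at or below `cutoff`."""
    index_early = percentile_ranks(neighborhood_index(early))
    index_late = percentile_ranks(neighborhood_index(late))
    return {
        i: index_late[i] - index_early[i]
        for i in range(len(index_early))
        if index_early[i] <= cutoff  # hypothetical threshold for "disadvantaged"
    }
```

Both inputs list the neighborhoods of a single city in the same order, so the percentile ranks are computed within that city, as in the text.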

In defining gentrification, we focus solely on disadvantaged neighborhoods for two reasons. First, non-disadvantaged neighborhoods experience significantly less change (t-test) in the neighborhood index between 1998–2002 and 2013–2017. Second, and more importantly, the concept of gentrification is usually discussed only in the context of disadvantaged neighborhoods. Even though affluent neighborhoods may experience some change in socioeconomic measures, this is typically not considered to be gentrification (Zuk et al., 2018).

Figure 1 shows the distribution of the gentrification score among disadvantaged neighborhoods. We observe that each distribution is centered around zero, indicating that neighborhoods experience no change on average in their gentrification score. However, there is significant variation in the gentrification score across all cities. Table 2 reports summary statistics for the gentrification score, neighborhood index, and each socioeconomic measure.

To further validate our gentrification score, we performed a sensitivity analysis to test the extent to which our results depend on its definition. First, we found a strong correlation with an existing gentrification score for the city of Los Angeles (Bousquet, 2017). Second, we performed a cross-correlation test among the four socioeconomic measures composing it (age, education, income, rent), and found a strong average cross-correlation (see Figure 7 in the Appendix). The strong cross-correlation scores also help justify our decision to equally weight the four socioeconomic measures, despite the possibility that each measure could have different levels of importance in each neighborhood. Third, to test the robustness of our results, we replicated our analysis for each socioeconomic variable individually and obtained similar results (see Tables 9-12 in the Appendix).

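Figure 1. Distribution of the Gentrification Score.

| Measure | Time Period | New York Median | New York SD | Los Angeles Median | Los Angeles SD | London Median | London SD |
|---|---|---|---|---|---|---|---|
| Age | 1998-2002 | | | | | | |
| Age | 2013-2017 | | | | | | |
| Education | 1998-2002 | | | | | | |
| Education | 2013-2017 | | | | | | |
| Income | 1998-2002 | | | | | | |
| Income | 2013-2017 | | | | | | |
| Rent | 1998-2002 | | | | | | |
| Rent | 2013-2017 | | | | | | |
| Neighborhood Index | 1998-2002 | | | | | | |
| Neighborhood Index | 2013-2017 | | | | | | |
| Gentrification Score | 1998-2017 | | | | | | |

Disadvantaged neighborhoods: 79 (New York), 58 (Los Angeles), 186 (London).

Table 2. Summary Statistics for Gentrification Measures in Disadvantaged Neighborhoods.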

3.2. Airbnb Features

Airbnb launched in 2008 as a peer-to-peer platform for short-term rental accommodations. Hosts can list their properties for rent on the platform, and guests can book these properties for a few days to multiple weeks. Airbnb has experienced exponential growth during the past decade, starting in 2008 with just 100 listings in San Francisco to now offering over 6 million listings in 192 countries. More than 500 million guests have stayed with Airbnb to date.

From the Airbnb website, we collected the complete set of listings and reviews data in New York City, Los Angeles, and London. We chose these cities because they have a history of gentrification (Lees et al., 2008; Maciag, 2015) as well as the most Airbnb data in the US and UK. Specifically, we consider Airbnb data for the 5-year period 2013–2017 so that it aligns with the most recent period for which we have socioeconomic data. The collected data amounts to over 180K listings and 3M reviews in New York City, Los Angeles, and Greater London. In our analysis, we focus on disadvantaged neighborhoods and exclude those with fewer than 5 listings (10th percentile), leaving us with 49,765 listings and 768,450 reviews in 79 New York City zipcodes; 22,908 listings and 477,758 reviews in 58 Los Angeles zipcodes; and 8,617 listings and 181,037 reviews in 186 London wards.

We create two types of features from the Airbnb data (Table 4): those obtained from structured data (e.g., number of listings, number of reviews, listing information) and those obtained from unstructured data (e.g., user-generated reviews). All features are aggregated at the neighborhood level (zipcode or ward) over the 5-year period for which they were collected, which helps reduce potential noise in Airbnb data and also avoids the use of any Personally Identifiable Information (PII) [1].

[1] While the Airbnb website publicly displays some PII in listing details and reviews, in our analyses we do not rely on any individual user data, but aggregate data at the neighborhood level.

Structured Data Features

Structured data features consist of the following information about listings, aggregated at the neighborhood level over 2013–2017:

  • # Listings and # Reviews: the total number of listings and reviews.

  • Price: the average price per listing.

  • # Bedrooms: the average number of bedrooms available for rent per listing.

  • Star-Rating and Location Star-Rating: the average overall and location star-rating per listing. We exclude star-ratings for other topics such as cleanliness, accuracy, value, communication, and check-in because we did not find these features to be correlated with gentrification.
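
As an illustration of the aggregation described in this list, the following sketch groups listing records by neighborhood and computes each structured feature. It is not the authors' code; the record layout and field names (`neighborhood`, `n_reviews`, `price`, `bedrooms`, `star_rating`, `location_star_rating`) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def structured_features(listings):
    """Aggregate listing records into per-neighborhood structured features.

    Each listing is a dict with hypothetical keys: neighborhood, n_reviews,
    price, bedrooms, star_rating, location_star_rating.
    """
    grouped = defaultdict(list)
    for listing in listings:
        grouped[listing["neighborhood"]].append(listing)

    features = {}
    for hood, rows in grouped.items():
        features[hood] = {
            # Totals over all listings in the neighborhood.
            "n_listings": len(rows),
            "n_reviews": sum(r["n_reviews"] for r in rows),
            # Averages per listing.
            "price": mean(r["price"] for r in rows),
            "bedrooms": mean(r["bedrooms"] for r in rows),
            "star_rating": mean(r["star_rating"] for r in rows),
            "location_star_rating": mean(r["location_star_rating"] for r in rows),
        }
    return features
```

Neighborhoods with too few listings (fewer than 5 in the text) would be dropped before this step.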

Unstructured Data Features

Unstructured data features are constructed from the review text using Natural Language Processing (NLP) techniques. We first preprocess [2] the review text to remove punctuation and commonly used words, and stem each remaining word to its root form. Then, we compute the following features, aggregated over all English reviews in a neighborhood over 2013–2017.

[2] We calculate the sentiment feature without preprocessing review text because VADER accounts for unorthodox text such as punctuation, slang, and acronyms.

  • Review Length: The average number of words in each review.

  • Location Words: The average percentage of location-related words in each review. Quattrone et al. (Quattrone et al., 2020) analyze Airbnb reviews and create a vocabulary of commonly-used social and business words in Airbnb reviews. Social words are those focusing on the interaction between guests and hosts (e.g., words like “sharing”, “talking”, “chatting”, “conversation”), whereas business words are those focusing on the business transaction between guests and hosts (e.g., words concerning the property, its location, or the professional conduct of the host). We use their shared and public dictionary to analyze different word categories and find that the frequency of location-related business words (e.g., location, place, neighborhood, area) is the most relevant for gentrification. [3]

    [3] Location Words Dictionary: https://figshare.com/s/991c8677e3e9ce013774

  • Sentiment: The average sentiment of each review. We calculate sentiment on a scale of -1 (most extreme negative) to +1 (most extreme positive) using the Valence Aware Dictionary and Sentiment Reasoner (VADER) (Hutto and Gilbert, 2015). VADER is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.

  • Sentiment in Location Reviews: The average sentiment of location-related reviews. We define location-related reviews as those in which 10% of the words are location words.

  • LDA Components: The average presence of each LDA topical component in each review. LDA (Blei et al., 2003) is an unsupervised topic extraction model that uses word frequencies to group text samples into latent topical components. First, LDA determines the associated words for a given number of latent topics. Then, for each text sample, it outputs topic scores to represent the probabilities that the sample corresponds to each topical component. Following standard practice, we determine the optimal number of topics (five in our case) using the perplexity score. We report the top-15 words for each topic in Table 3. Based on these words, we determine the latent topics in all cities to be related to four common subjects: “check-in”, “listing characteristics”, “location”, and “stay/host”. [4]

    [4] Additionally, New York City and London have a public transportation topic that does not appear in the city of Los Angeles, where public transportation is neither well developed nor widely used.

  • Doc2Vec Components: The average Doc2Vec vector coordinates for each review. Doc2Vec (Le and Mikolov, 2014) is an unsupervised representation learning method (Bengio et al., 2013) that maps text to vectors in an $n$-dimensional space. We use Doc2Vec with 25 dimensions and obtain a vector representation of each review. The output vectors of Doc2Vec preserve semantic information about the input text; in particular, reviews that have similar word frequencies are closer in the $n$-dimensional vector space.
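
The simpler text features in this list can be sketched with the standard library alone. The example below is illustrative only: the four-word `LOCATION_WORDS` set stands in for the full shared dictionary, the tokenizer skips the stop-word removal and stemming described above, treating the 10% rule as "at least 10%" is an assumption, and all function names are hypothetical. The last helper shows how per-review vectors (LDA topic scores or Doc2Vec coordinates, produced by whatever library is used) could be averaged to the neighborhood level.

```python
import re
from statistics import mean

# Hypothetical stand-in for the shared location-words dictionary.
LOCATION_WORDS = {"location", "place", "neighborhood", "area"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (no stemming here)."""
    return re.findall(r"[a-z']+", text.lower())


def review_features(text):
    """Review length in words and percentage of location words."""
    tokens = tokenize(text)
    n_location = sum(token in LOCATION_WORDS for token in tokens)
    share = 100.0 * n_location / len(tokens) if tokens else 0.0
    return {"length": len(tokens), "location_pct": share}


def is_location_review(text, threshold_pct=10.0):
    """Flag a review as location-related (assumes 'at least' the threshold)."""
    return review_features(text)["location_pct"] >= threshold_pct


def neighborhood_average(vectors):
    """Average per-review vectors component by component."""
    return [mean(component) for component in zip(*vectors)]
```

For instance, `review_features("Great location and a quiet area")` counts six words, two of which are location words.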

We report summary statistics for the Airbnb features in Table 4.

Topic: Check-In
  New York: arrive, host, day, cancel, reserve, post, automatic, check, flexible, late, smooth, process, early, flight, key
  Los Angeles: arrive, day, host, reserve, cancel, post, automatic, check, late, airbnb, instruct, time, book, text, didn’t, key
  London: day, arrive, host, room, bed, bathroom, night, kitchen, check, reserve, cancel, post, shower, automatic, work

Topic: Listing Characteristics
  New York: room, apartment, bed, bathroom, night, stay, clean, place, good, kitchen, time, didn’t, bedroom, check, sleep
  Los Angeles: room, bed, park, apartment, clean, place, bathroom, stay, nice, night, good, kitchen, work, bedroom, towel
  London: great, location, nice, good, clean, place, stay, room, apartment, host, easy, flat, communication, close, check

Topic: Location
  New York: apartment, great, restaurant, location, stay, love, walk, place, shop, perfect, view, bar, close, park, subway, area
  Los Angeles: walk, beach, location, santa, monica, great, close, apartment, place, distance, restaurant, park, stay, shop, hollywood
  London: great, flat, location, restaurant, love, stay, london, apartment, shop, close, area, walk, park, perfect, recommend

Topic: Stay/Host
  New York: great, stay, place, location, host, clean, apartment, recommend, nice, definitely, help, comfort, friend, perfect, time
  Los Angeles: great, stay, place, location, clean, host, nice, apartment, recommend, definitely, easy, help, comfort, good, perfect
  London: stay, place, great, host, love, london, recommend, clean, help, location, house, friend, room, definitely, comfort

Fifth topic (differs by city)
  New York (Topic: Public Transportation): subway, place, close, walk, great, location, nice, good, stay, apartment, minute, clean, station, time, manhattan
  Los Angeles (Topic: Stay/Host): stay, love, house, place, host, time, beautiful, comfort, friend, feel, wonder, perfect, recommend, amazing, la
  London (Topic: Public Transportation): station, walk, minute, london, tube, bus, place, close, min, 5, 10, train, house, stay, underground

Table 3. LDA Topics.
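
| Airbnb Data Features | New York Median | New York SD | Los Angeles Median | Los Angeles SD | London Median | London SD |
|---|---|---|---|---|---|---|
| Structured Data Features | | | | | | |
| # Listings | | | | | | |
| # Reviews | | | | | | |
| Price (USD) | | | | | | |
| # Bedrooms | | | | | | |
| Star-Rating | | | | | | |
| Location Star-Rating | | | | | | |
| Unstructured Data Features | | | | | | |
| Review Length (words) | | | | | | |
| Location Words (%) | | | | | | |
| Sentiment | | | | | | |
| Sentiment in Location Reviews | | | | | | |
| LDA Components (Altogether) | | | | | | |
| Doc2Vec Components (Altogether) | | | | | | |

Disadvantaged neighborhoods: 79 (New York), 58 (Los Angeles), 186 (London).

Table 4. Summary Statistics for Airbnb Data in 2013–2017 from Disadvantaged Neighborhoods.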

4. Nowcasting Gentrification Using Airbnb Data

We now test the extent to which we can nowcast gentrification from Airbnb data in disadvantaged neighborhoods. First, we examine the correlations between the gentrification score and both structured and unstructured Airbnb data features (Section 4.1). Next, we discuss the insights that we can obtain from unstructured data (Section 4.2). Finally, we nowcast the gentrification score using both in-sample linear regression and out-of-sample random forest regression (Section 4.3).

Figure 2 (panels a and b). Comparison of the Gentrification Score and Number of Airbnb Listings.

4.1. Correlation between Airbnb Data and Gentrification

We start by analyzing the correlation between gentrification and Airbnb data across disadvantaged neighborhoods. Table 5 reports the linear correlation coefficients between each Airbnb data feature and the gentrification score.

Structured Data Features

Among structured data features, we find a high, positive, and significant correlation in all cities for the number of listings (0.397 to 0.682), the number of reviews (0.370 to 0.637), and the listing price (0.298 to 0.431); see Table 5. In other words, gentrifying neighborhoods have more Airbnb listings (as Figure 2 confirms) as well as more reviews and higher listing prices. The overall star-rating does not have a significant correlation in any city. However, the location star-rating (which rates the Airbnb listings’ location) has a significant positive correlation in New York City (0.277) and London (0.300). The number of bedrooms available for rent has different effects across cities, being negatively correlated with gentrification in Los Angeles (-0.279) and positively correlated in London (0.145); we argue this is likely due to the feature capturing the unique geography and housing availability of each city.

Unstructured Data Features

Most unstructured data features are highly correlated with the gentrification score. For all cities, the location words feature has a correlation of 0.41 or higher, and the review length has a correlation of 0.27 or higher (Table 5). These positive correlations suggest that users write longer reviews that contain more location-related words in gentrifying neighborhoods. In addition, we find that the sentiment in reviews mentioning location has a higher correlation than the overall sentiment for all cities. This suggests that, if a review talks more positively about location, then it is more likely that the corresponding neighborhood is gentrifying. Finally, we find that most of the LDA topics are highly correlated with the gentrification score, with the location topic having the highest correlation among them in all three cities (0.42 or higher). Similarly, most of the Doc2Vec components are also correlated with the gentrification score, with the highest correlated component having a correlation (in absolute value) above 0.43 in all three cities. Next, we proceed to discuss the interpretation of the Doc2Vec and LDA features.

| Airbnb Data Variables | New York | Los Angeles | London |
|---|---|---|---|
| Structured Data Features | | | |
| # Listings | 0.682*** | 0.397*** | 0.547*** |
| # Reviews | 0.637*** | 0.370*** | 0.473*** |
| Price | 0.431*** | 0.298*** | 0.353*** |
| # Bedrooms | 0.027 | -0.279** | 0.145** |
| Star-Rating | 0.056 | -0.214 | 0.081 |
| Location Star-Rating | 0.277** | 0.204 | 0.300*** |
| Unstructured Data Features | | | |
| Review Length | 0.519*** | 0.271** | 0.344*** |
| Location Words | 0.412*** | 0.446*** | 0.427*** |
| Sentiment | 0.391*** | 0.108 | 0.209*** |
| Sentiment in Location Reviews | 0.469*** | 0.235* | 0.231*** |
| LDA Component (Check-In) | -0.024 | -0.012 | -0.169** |
| LDA Component (Listing Characteristics) | -0.332*** | -0.187 | 0.083 |
| LDA Component (Location) | 0.436*** | 0.464*** | 0.423*** |
| LDA Component (Stay/Host) | -0.320*** | -0.324** | -0.287*** |
| LDA Component (Public Transportation) | 0.009 | N/A | 0.069 |
| Doc2Vec Component (top correlated comp.) | -0.655*** | 0.476*** | -0.437*** |
| Disadvantaged Neighborhoods | 79 | 58 | 186 |

Significance levels: * p < 0.1; ** p < 0.05; *** p < 0.01

Table 5. Linear Correlation between Gentrification Score and Airbnb Data.
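
For readers who want to reproduce this kind of table on their own data, the sketch below computes a correlation coefficient between each neighborhood-level feature and the gentrification score. It assumes that the "linear correlation" is Pearson's coefficient, which the text does not state explicitly, and it omits the significance test. The function names and input layout are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate_features(features, scores):
    """Correlate each feature column with the gentrification scores.

    `features` maps a feature name to a list with one value per neighborhood,
    in the same order as `scores`.
    """
    return {name: pearson_r(values, scores) for name, values in features.items()}
```

Running `correlate_features` once per city gives one column of a table like Table 5.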

Figure 3 (panels a to c, one per city). The relationship between LDA and Doc2Vec components and the gentrification score. Each point represents a neighborhood, and points are colored with their level of gentrification, from lower (blue) to higher (red).

4.2. Insights from Unstructured Airbnb Data

To analyze and interpret our unstructured data features, we select the LDA and Doc2Vec components that have the highest correlation with the gentrification score. For each city, Figure 3 plots each neighborhood (as a point) according to its values for the two components. The neighborhoods are colored with their level of gentrification score, from lower (blue) to higher (red). We observe that the resulting arrangement highly corresponds with the coloring of the points; for example in Panel (c), London wards with a higher gentrification score cluster in the top-left. The ability to partition neighborhoods by gentrification score using these two unstructured features suggests that strong markers of gentrification exist in the actual content of Airbnb reviews for all of our three cities.

To further interpret these Doc2Vec and LDA latent components, we measure their correlation with the location words feature and find high correlation for both. (The interested reader can find the whole cross-correlation table for all considered features in the Appendix.) This suggests that these components capture the location information contained in the reviews.
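The correlation check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the arrays are synthetic, and the names `location_words`, `lda_location`, and `doc2vec_top` are hypothetical stand-ins for per-neighborhood feature values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson linear correlation between two equal-length 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic per-neighborhood values, for illustration only (assumption).
rng = np.random.default_rng(0)
location_words = rng.normal(size=200)
lda_location = 0.8 * location_words + rng.normal(scale=0.5, size=200)
doc2vec_top = -0.7 * location_words + rng.normal(scale=0.5, size=200)

r_lda = pearson_r(lda_location, location_words)
r_d2v = pearson_r(doc2vec_top, location_words)
print(f"LDA (Location) vs. location words: r = {r_lda:+.2f}")
print(f"Doc2Vec (top)  vs. location words: r = {r_d2v:+.2f}")
```

The sign of a Doc2Vec component is arbitrary, so a strongly negative correlation is as informative as a strongly positive one; what matters for interpretation is the magnitude.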

Next, we look at the distribution of topics in reviews for neighborhoods in the upper quartile of the gentrification score and reviews in the lower quartile of the gentrification score. We find that the location topic is present in a significantly larger proportion of reviews (t-test), again confirming that latent location information contained in the text of the reviews is crucial to identifying gentrifying neighborhoods.
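A quartile comparison of this kind might look like the sketch below. It rests on several assumptions: the data are synthetic, the unit of analysis is the neighborhood (the share of its reviews whose dominant topic is Location), and Welch's unequal-variance t-test is used because the text does not say which variant was applied.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 120
# Hypothetical gentrification scores and per-neighborhood location-topic shares.
score = rng.normal(scale=15.0, size=n)
location_share = np.clip(0.30 + 0.004 * score + rng.normal(scale=0.05, size=n), 0, 1)

# Split neighborhoods at the lower and upper quartiles of the score.
q1, q3 = np.percentile(score, [25, 75])
upper = location_share[score >= q3]
lower = location_share[score <= q1]

# Welch's t-test (assumption: the paper only says "t-test").
t_stat, p_value = stats.ttest_ind(upper, lower, equal_var=False)
print(f"upper mean={upper.mean():.3f}, lower mean={lower.mean():.3f}")
print(f"t={t_stat:.2f}, p={p_value:.2g}")
```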

Figure 4. Distribution of LDA topics for neighborhoods in the upper quartile of the gentrification score and neighborhoods in the lower quartile of the gentrification score.
Figure 5. Comparison of location words usage between neighborhoods in the upper quartile of the gentrification score and neighborhoods in the lower quartile of the gentrification score.

Finally, to understand which location words matter the most, we compare the frequency of these words for gentrifying and non-gentrifying neighborhoods. To do so, we first find the top-10 location words in all reviews. We then compute their frequency in reviews from neighborhoods in the upper quartile of the gentrification score and in reviews from neighborhoods in the lower quartile of the gentrification score. Figure 5 reports these results. We observe that location words do indeed appear significantly more frequently (t-test) in reviews from upper quartile neighborhoods. Moreover, we find that the most used location words describe the neighborhood (e.g., walk, subway, restaurant, or park), suggesting that Airbnb guests not only describe the listings in which they stay but also the surrounding neighborhood.
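The two-step procedure (find the top location words overall, then compare their frequency across the two groups) can be sketched as below. The lexicon `LOCATION_WORDS`, the toy reviews, and the choice to express frequency per 100 tokens are all assumptions made for illustration; the paper's own word list and normalization are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical lexicon (assumption): not the paper's actual location-word list.
LOCATION_WORDS = {"walk", "subway", "restaurant", "park", "station", "neighborhood"}

def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def location_counts(reviews):
    """Count occurrences of lexicon words across a list of reviews."""
    return Counter(t for r in reviews for t in tokenize(r) if t in LOCATION_WORDS)

def rates_per_100_tokens(reviews, words):
    """Occurrences of each word per 100 tokens in a group of reviews."""
    total = sum(len(tokenize(r)) for r in reviews) or 1
    counts = location_counts(reviews)
    return {w: 100.0 * counts[w] / total for w in words}

# Toy reviews standing in for upper- and lower-quartile neighborhoods.
upper_reviews = [
    "Short walk to the subway and a lovely park nearby",
    "Great restaurant scene, easy walk to the park",
]
lower_reviews = [
    "Clean room and a friendly host",
    "Comfortable bed, the host was helpful, close to the subway",
]

# Step 1: top-10 location words over all reviews; step 2: per-group rates.
top_words = [w for w, _ in location_counts(upper_reviews + lower_reviews).most_common(10)]
upper_rates = rates_per_100_tokens(upper_reviews, top_words)
lower_rates = rates_per_100_tokens(lower_reviews, top_words)
for w in top_words:
    print(f"{w:12s} upper={upper_rates[w]:5.2f}  lower={lower_rates[w]:5.2f}")
```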

4.3. Nowcasting Gentrification

In this section, we use regression models to predict the gentrification score with Airbnb data. We start with a simple linear regression to predict in-sample gentrification. Then, we turn to random forest regression for out-of-sample predictions. (We opt for random forest regression because our predictors are highly multicollinear, which linear regression tends to handle poorly; see Figure 8 in the Appendix for the cross-correlations among our Airbnb features.) Throughout this section, we compare our results across models using only structured features, only unstructured features, and a combination of both. Further, we compare these results to a baseline model that predicts no gentrification (gentrification score = 0). We use this baseline for two reasons. First, on average, neighborhoods in our dataset do not experience gentrification, i.e., the difference in socioeconomic variables between the time windows that we consider is zero (see Figure 1); second, this baseline is equivalent to a scenario in which we would only have access to socioeconomic data from the early period (1998–2002), which is often the case given the large gaps and delays in government data release.
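As a small worked example of the baseline, the sketch below computes the RMSE of a model that always predicts a gentrification score of zero. The scores are made up for illustration; `rmse` is a hypothetical helper, not a function from the authors' repository.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical gentrification scores for five neighborhoods (mean is zero).
scores = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])

# The baseline predicts "no gentrification" everywhere.
baseline_pred = np.zeros_like(scores)
baseline_rmse = rmse(scores, baseline_pred)
print(f"baseline RMSE = {baseline_rmse:.2f}")
```

When the scores average to zero, as in this toy sample, the baseline RMSE equals their standard deviation, so beating the baseline means explaining some of the spread in scores across neighborhoods.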

In-Sample Predictions

To predict in-sample gentrification scores, we use a linear regression model estimated using ordinary least squares (OLS). We report the Root Mean Square Errors (RMSE) for the four specifications (baseline, structured, unstructured, and all features) in Panel (a) of Figure 6. In-sample linear regression yields RMSE ranging from 5.63 to 12.09 using all features and across all cities. These results significantly outperform the baseline model, which obtains RMSE from 14.70 to 17.82. Interestingly, we achieve similar performance for New York and Los Angeles but much lower performance for London. The smaller geographical size of wards compared to zipcodes may explain this difference. For example, wards contain only 28 listings on average, while zipcodes contain about 150 listings (Table 4), suggesting the importance of having a sufficient number of listings and reviews for the geographic unit of analysis.

The specifications using only structured or unstructured data features also outperform the baseline, with unstructured features (6.32 ≤ RMSE ≤ 12.73) obtaining better results than structured features (9.16 ≤ RMSE ≤ 13.72) across all cities. However, the specification using all features outperforms those using structured or unstructured features alone. This result suggests that structured and unstructured data are complementary in nature, and that unstructured data can capture aspects of gentrification that structured data alone cannot. This also confirms our hypothesis that the text of the reviews written by Airbnb guests contains important latent information that can help cities and municipalities better measure and understand the process of gentrification.
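The four in-sample specifications can be sketched with plain least squares, as below. The data are synthetic and the feature blocks `X_struct` and `X_unstruct` are hypothetical placeholders; the helper `ols_in_sample_rmse` is illustrative and not taken from the authors' code.

```python
import numpy as np

def ols_in_sample_rmse(X, y):
    """Fit OLS with an intercept and return the RMSE on the same data."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

rng = np.random.default_rng(2)
n = 80
X_struct = rng.normal(size=(n, 3))    # e.g. listings, reviews, price (assumption)
X_unstruct = rng.normal(size=(n, 4))  # e.g. sentiment, LDA, Doc2Vec (assumption)
y = 6 * X_struct[:, 0] + 9 * X_unstruct[:, 0] + rng.normal(scale=5, size=n)

results = {
    "baseline": float(np.sqrt(np.mean(y ** 2))),  # always predict 0
    "structured": ols_in_sample_rmse(X_struct, y),
    "unstructured": ols_in_sample_rmse(X_unstruct, y),
    "all": ols_in_sample_rmse(np.hstack([X_struct, X_unstruct]), y),
}
for name, value in results.items():
    print(f"{name:13s} RMSE = {value:.2f}")
```

One caveat worth keeping in mind: in-sample, an OLS model with more regressors can never have a higher RMSE than a nested model with fewer, so the ranking of "all features" over either subset is guaranteed by construction. The size of the gap is the informative part, and the out-of-sample analysis is what tests whether the gain generalizes.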

Out-of-Sample Predictions

To predict out-of-sample gentrification scores, we use a random forest regression. In order to test the robustness of our model, we use default hyper-parameters, and average results for random 50%–50% train-test splits across 100 simulations. Through these simulations, we aim to show the generalizability of our model. In other words, we want to test whether we can predict the gentrification score by knowing the underlying socioeconomic measures only in certain neighborhoods. Ideally, if these types of analyses achieve good results (low errors), then they could be used to complement or substitute for traditional analyses based on expensive and outdated survey data.
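The repeated-split evaluation can be sketched with scikit-learn as follows. The data are synthetic, `repeated_split_rmse` is a hypothetical helper, and the per-round seeds are an assumption added for reproducibility; apart from `random_state`, the forest uses scikit-learn's default hyper-parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def repeated_split_rmse(X, y, n_rounds=100, seed=0):
    """Mean and standard deviation of test RMSE over random 50%-50% splits."""
    rmses = []
    for i in range(n_rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed + i
        )
        model = RandomForestRegressor(random_state=seed + i)  # defaults otherwise
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rmses.append(np.sqrt(np.mean((y_te - pred) ** 2)))
    return float(np.mean(rmses)), float(np.std(rmses))

# Synthetic neighborhoods: 100 rows, 5 features, one informative (assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 10 * X[:, 0] + rng.normal(scale=3.0, size=100)

# Fewer rounds than the 100 used in the text, to keep the example quick.
mean_rmse, sd_rmse = repeated_split_rmse(X, y, n_rounds=20)
baseline_rmse = float(np.sqrt(np.mean(y ** 2)))
print(f"random forest RMSE = {mean_rmse:.2f} (SD {sd_rmse:.2f}); baseline = {baseline_rmse:.2f}")
```

Reporting the standard deviation alongside the mean shows how sensitive the result is to which neighborhoods happen to land in the training half, which matters most when a city has few neighborhoods.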

Figure 6. Results from (a) in-sample linear regression (left) and (b) out-of-sample random forest regression (right). We compare our results to a baseline model in which we predict no gentrification. Random forest results represent the average over 100 simulations with a 50%-50% train-test split, and error bars represent the standard deviation.

Rank   New York               Los Angeles                    London
       Feature        MDI     Feature                MDI     Feature          MDI
1      # Listings     %       # Listings             %       # Listings       %
2      # Reviews      %       # Reviews              %       # Reviews        %
3      Doc2Vec (1)    %       Location Star-Rating   %       Doc2Vec (1)      %
4      Doc2Vec (2)    %       LDA (Location)         %       Location Words   %
5      Doc2Vec (3)    %       Doc2Vec (1)            %       Price            %
Note: Features are ranked by their Mean Decrease Impurity (MDI), which calculates feature importance as the percentage of times a feature is used to split a node, weighted by the number of samples it splits.
Table 6. Feature Importance from Random Forest Regression.

Random forest regression yields out-of-sample RMSE ranging from 9.23 to 13.95 when using all features and across all cities. Similar to the in-sample analysis, these errors are lower than those obtained from the baseline model, which has RMSE ranging from 13.36 to 18.05. However, we find that the results vary substantially across cities. We achieve the best performance in New York City, where the 9.23 out-of-sample RMSE is similar to the 5.72 in-sample RMSE from linear regression. London also performs comparably to its in-sample results (13.95 out-of-sample RMSE; 12.09 in-sample RMSE). As Figure 6(b) shows, the standard deviation of RMSE across simulations is also small for these cities (<1.0). On the other hand, Los Angeles performs much worse compared to its in-sample results (12.64 out-of-sample RMSE; 5.63 in-sample RMSE). We find that in this case the performance varies substantially across simulation rounds (RMSE SD = 3.4). The lower performance in Los Angeles may be due to the fact that it has fewer disadvantaged neighborhoods (54) compared to New York (79) and London (186). There may also be other unique aspects of gentrification in Los Angeles that limit the usefulness of Airbnb data compared to other cities; for example, gentrification in Los Angeles occurred more recently than in New York City and London and was spurred by external real-estate investment rather than market changes (Center for Opportunity Urbanism, 2019).

Similar to the in-sample case, the models using all features perform better than the models including only structured or unstructured features, again confirming the importance of unstructured text data. One additional advantage of random forest regression is that it allows us to compute feature importance, i.e., how effective the feature is at reducing uncertainty. Feature importance is calculated using the Mean Decrease Impurity (MDI) score, which represents the percentage of times that a feature is used to split a node weighted by the number of samples it splits. We report the top-5 most important features for the three cities in Table 6. We find the number of listings and reviews to be the top-2 features for all cities. However, we observe that among the remaining features, many of them are unstructured data features such as Doc2Vec components (the top-3 in terms of correlation with the gentrification score), the LDA Location component, and location words. This again validates the importance of unstructured data and in particular features related to the listings’ location.
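A feature ranking of this kind can be obtained from scikit-learn's impurity-based importances, as in the sketch below. The data and feature names are synthetic and hypothetical. Note also that scikit-learn's `feature_importances_` attribute is the normalized total impurity reduction attributed to each feature, which is the usual definition of MDI; whether it matches the paper's computation exactly is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["n_listings", "n_reviews", "doc2vec_1", "lda_location", "price"]
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 10 * X[:, 0] + 3 * X[:, 2] + rng.normal(scale=1.0, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features.
mdi = model.feature_importances_
ranking = sorted(zip(feature_names, 100 * mdi), key=lambda p: p[1], reverse=True)
for rank, (name, pct) in enumerate(ranking, start=1):
    print(f"{rank}. {name:13s} {pct:5.1f}%")
```

Impurity-based importances are computed on the training data and can be diluted across strongly correlated predictors, so with multicollinear features like those described here the ranking is best read as indicative rather than exact.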

5. Discussion and Conclusions

This work adds to the growing literature employing alternative data sources to predict urban and economic outcomes. We present the first application of Natural Language Processing (NLP) for nowcasting gentrification, and provide evidence that latent information contained in the unstructured text content of Airbnb reviews complements information contained in the structured data; i.e., combining both types of data helps us better explain the process of gentrification. These results have important implications. First, our results highlight the importance of machine learning algorithms for extracting information from unstructured data to create better measures and predictions of socioeconomic indices. Second, NLP tools can find important markers of gentrification from the text content of Airbnb reviews. Third, our work suggests that, while gentrification is generally considered a hidden and slow process (Florida, 2010), Airbnb guests can “see” this process and capture it through words.

Our work does not come without limitations, which may inspire future research. We focus on correlations and predictions, not causality. Estimating causal relationships in our setting is difficult because of simultaneity issues; it could be that Airbnb is more likely to enter gentrifying neighborhoods, or it could be that Airbnb is speeding up gentrification in the neighborhoods that it enters. To overcome these issues, we would require some exogenous shock that locally affects neighborhoods but not Airbnb.

In addition, Airbnb data itself has limitations. First, Airbnb data has only grown substantially in recent years (post-2013), which is one of the reasons why we focus on nowcasting and not forecasting gentrification. Airbnb data availability can also depend on a city's tourism patterns, characteristics, and policies, which means that our approach is far from being universally applicable; for example, it is likely that large and touristy cities like those studied in this work are more suitable for our approach. We further caution that models solely based on Airbnb data will skew toward the perspective of its user population, which is predominantly affluent, educated, and younger (similar demographics to in-movers in the gentrification process). Policymakers should weigh the insights and biases of alternative data sources like Airbnb when using such data to inform decisions.

As a potential way to reduce the biases associated with relying on data from a single platform, future work could leverage an ensemble of user-generated data from different platforms. As prior work suggests (Glaeser et al., 2018a; Naik et al., 2014), there is plenty of user-generated data that could be used to improve our predictions and create more accurate models, including data obtained from Yelp or Google. Moreover, in the future, the availability of additional historical data from Airbnb and other platforms may be used to forecast (instead of nowcast) gentrification. While forecasting may be more beneficial for policymakers, it may also be more challenging than nowcasting. Therefore, future research may provide more insights about forecasting gentrification and the challenges that need to be addressed to achieve good results.

Finally, in this work we focus on textual data, but other unstructured data such as photos could be used to improve the model predictions.

The availability of big data and machine learning tools to analyze user-generated information is becoming increasingly useful in many settings, from marketing (Liu et al., 2019) to economics (Glaeser et al., 2018b) to urban social science (Smith et al., 2013). In the context of this paper, urban social science, such data availability is rapidly changing and improving how cities and municipalities measure important outcomes which, in turn, will directly improve policymaking, and, more generally, our understanding of how cities change and evolve.

Acknowledgements.
The authors thank Isaac Gelman, Lauren Phillips, Nat Redfern, and Sahil Agarwal for their assistance with this research. We also acknowledge the USC Center for AI in Society’s Student Branch for providing us with computing resources.

Online Resources

Data and code for this work are available at: https://github.com/shomikj/airbnb_gentrification.

References

  • D. Antenucci, M. Cafarella, M. Levenstein, C. Ré, and M. D. Shapiro (2014) Using social media to measure labor market flows. Working Paper 20010, Working Paper Series, National Bureau of Economic Research.
  • S. Basuroy, Y. Kim, and D. Proserpio (2020) Estimating the impact of Airbnb on the local economy: evidence from the restaurant industry.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, pp. 993–1022. ISSN 1532-4435.
  • C. Bousquet (2017) Where is gentrification happening in your city?
  • Census Bureau American Community Surveys (2013) Tables S0101, S1501, S1901, K202507.
  • Center for Opportunity Urbanism (2019) Beyond gentrification: toward more equitable urban growth.
  • J. A. Chevalier and D. Mayzlin (2006) The effect of word of mouth on sales: online book reviews. Journal of Marketing Research 43 (3), pp. 345–354.
  • I. Cope (2016) The value of census statistics.
  • J. Cranshaw, R. Schwartz, J. I. Hong, and N. Sadeh (2012) The livehoods project: utilizing social media to understand the dynamics of a city. In International AAAI Conference on Weblogs and Social Media, pp. 58.
  • A. Culotta and J. Cutler (2016) Mining brand perceptions from Twitter social networks. Marketing Science 35 (3), pp. 343–362.
  • R. Florida (2010) The great reset: how the post-crash economy will change the way we live and work. HarperCollins e-books. ISBN 9780061991219, LCCN 2010002867.
  • L. Freeman (2005) Displacement or succession?: residential mobility in gentrifying neighborhoods. Urban Affairs Review 40 (4), pp. 463–491. https://doi.org/10.1177/1078087404273341
  • E. L. Glaeser, H. Kim, and M. Luca (2018a) Nowcasting gentrification: using Yelp data to quantify neighborhood change. In AEA Papers and Proceedings, Vol. 108, pp. 77–82.
  • E. L. Glaeser, S. D. Kominers, M. Luca, and N. Naik (2018b) Big data and big cities: the promises and limitations of improved measures of urban life. Economic Inquiry 56 (1), pp. 114–137.
  • Government Accountability Office (2019) 2020 decennial census.
  • D. Hristova, L. M. Aiello, and D. Quercia (2018) The new urban success: how culture pays. Frontiers in Physics 6, pp. 27. ISSN 2296-424X.
  • C.J. Hutto and E. Gilbert (2015) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, MI, USA.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1188–1196.
  • L. Lees, T. Slater, and E. Wyly (2008) Gentrification. Routledge, New York, NY.
  • D. K. Levy, J. Comey, and S. Padilla (2007) In the face of gentrification: case studies of local efforts to mitigate displacement. Journal of Affordable Housing and Community Development Law 16 (3), pp. 238–315. ISSN 10842268.
  • L. Liu, D. Dzyabura, and N. Mizik (2018) Visual listening in: extracting brand image portrayed on social media. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
  • X. Liu, D. Lee, and K. Srinivasan (2019) Large-scale cross-category analysis of consumer review content on sales conversion leveraging deep learning. Journal of Marketing Research 56 (6), pp. 918–943.
  • M. Luca (2016) Reviews, reputation, and revenue: the case of Yelp.com. Harvard Business School NOM Unit Working Paper.
  • M. Maciag (2015) Gentrification in America report.
  • N. Naik, J. Philipoom, R. Raskar, and C. Hidalgo (2014) Streetscore – predicting the perceived safety of one million streetscapes. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 793–799.
  • Office of National Statistics (2000) Indices of multiple deprivation.
  • Office of National Statistics (2002) Population estimates.
  • Office of National Statistics (2015) The English indices of deprivation technical report.
  • D. Proserpio, S. Counts, and A. Jain (2016) The psychology of job loss: using social media data to characterize and predict unemployment. In Proceedings of the 8th ACM Conference on Web Science, pp. 223–232.
  • G. Quattrone, A. Greatorex, D. Quercia, L. Capra, and M. Musolesi (2018) Analyzing and predicting the spatial penetration of Airbnb in U.S. cities. EPJ Data Sci. 7 (1), pp. 31.
  • G. Quattrone, A. Nocera, L. Capra, and D. Quercia (2020) Social interactions or business transactions? What customer reviews disclose about Airbnb marketplace. In Proceedings of The Web Conference 2020, WWW ’20, New York, NY, USA, pp. 1526–1536. ISBN 9781450370233.
  • G. Quattrone, D. Proserpio, D. Quercia, L. Capra, and M. Musolesi (2016) Who benefits from the “sharing” economy of Airbnb? In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, Republic and Canton of Geneva, CHE, pp. 1385–1394. ISBN 9781450341431.
  • D. Shaw (2020) UK’s 2021 census could be the last, statistics chief reveals.
  • C. Smith, D. Quercia, and L. Capra (2013) Finger on the pulse: identifying deprivation using transit flow analysis. In Proceedings of the 2013 conference on Computer supported cooperative work, pp. 683–692.
  • US Census Bureau (2016) Alternative futures for the conduct of the 2030 census.
  • A. Venerandi, G. Quattrone, L. Capra, D. Quercia, and D. Saez-Trumper (2015) Measuring urban deprivation from user generated content. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW ’15, New York, NY, USA, pp. 254–264. ISBN 9781450329224.
  • D. Wachsmuth and A. Weisler (2018) Airbnb and the rent gap: gentrification through the sharing economy. Environment and Planning A: Economy and Space 50 (6), pp. 1147–1170.
  • I. Yrigoy (2016) The impact of Airbnb in the urban arena: towards a tourism-led gentrification. The case-study of Palma old quarter (Mallorca, Spain). M. Blàzquez, M. Mir-Gual, I. Murray, & GX Pons, Turismo y crisis, turismo colaborativo y ecoturismo, pp. 281–289.
  • Y. Zhang and M. Pennacchiotti (2013) Predicting purchase behaviors from social media. In Proceedings of the 22nd International Conference on World Wide Web, pp. 1521–1532.
  • M. Zuk, A. H. Bierbaum, K. Chapple, K. Gorska, and A. Loukaitou-Sideris (2018) Gentrification, displacement, and the role of public investment. Journal of Planning Literature 33 (1), pp. 31–44.

Appendix A Appendix

Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London        N/A        N/A          N/A            N/A
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London        N/A        N/A          N/A            N/A
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 7. Regression for Gentrification Score with Race.
Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London        N/A        N/A          N/A            N/A
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London        N/A        N/A          N/A            N/A
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 8. Regression for Race.
Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 9. Regression for Age.
Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 10. Regression for Education.
Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 11. Regression for Income.
Model                                    City          Root Mean Squared Error
                                                       Baseline   Structured   Unstructured   All Features
In-Sample Linear Regression              New York
                                         Los Angeles
                                         London
Out-of-Sample Random Forest Regression   New York
                                         Los Angeles
                                         London
Note: Random Forest results averaged over 100 iterations with 50%-50% train-test split.
Table 12. Regression for Rent.
Figure 7. Panels (a)–(c): Cross-Correlation Among Gentrification Score and Socioeconomic Variables.
Figure 8. Panels (a)–(c): Cross-Correlation Among Airbnb Features.