Quantifying the Effects of the 2008 Recession using the Zillow Dataset

by Arunav Gupta, et al.

This report explores the use of Zillow's housing metrics dataset to investigate the effects of the 2008 US subprime mortgage crisis on various US locales. We begin by exploring the causes of the recession and the metrics available to us in the dataset. We settle on using the Zillow Home Value Index (ZHVI) because it is seasonally adjusted and able to account for a variety of inventory factors. Then, we explore three methodologies for quantifying recession impact: (a) Principal Components Analysis, (b) Area Under Baseline, and (c) ARIMA modeling and Confidence Intervals. While PCA did not yield usable results, the AUB and ARIMA analyses each produced six cities: the top 3 "losers" and top 3 "gainers" of the 2008 recession, as determined by each analysis. This gave us 12 cities in total. Finally, we tested the robustness of our analysis against three "common knowledge" metrics for the recession: geographic clustering, population trends, and unemployment rate. While we did find some overlap between the results of our analysis and geographic clustering, regressing our metrics against population trends and the unemployment rate showed no meaningful correlation.




1 Background of Crisis

The 2008 Recession, also known as the “subprime mortgage crisis,” was an economic recession that started in the US and quickly had global economic implications. From late 2007 to mid-2009, 8.4 million Americans lost their jobs, 1 in every 54 homes was subject to a foreclosure filing, and the US GDP fell by 4.3 percent.

Most economists agree that the crisis has its roots in the overuse of subprime mortgages, a type of loan usually given to borrowers who exhibit high financial risk and have low credit scores. Starting in the early 2000s, the housing industry was booming, and as houses steadily increased in value, more banks felt they could assume the risk of subprime mortgages and thus started giving them out en masse, even to those who would traditionally have been denied a loan. However, in 2007, the housing bubble burst and home values fell by about 31.8%. As demand for housing plateaued and home values dropped, people who had taken out subprime loans found that they were no longer able to make the high interest payments associated with a subprime mortgage. As thousands defaulted on their mortgages, the lending institutions lost money, too. Compounding these problems, many of these institutions had traded mortgage-backed securities (MBSs), which were backed by these risky subprime loans, to other institutions that expected to profit as housing prices increased. When that didn’t happen, MBSs lost value as well, causing banks such as Lehman Brothers to go bankrupt. The collapse of the real estate industry then caused banks and businesses to lose trust in each other, driving stock prices down. As publicly traded companies saw decreased valuations, businesses shut down and unemployment skyrocketed.

This “perfect storm” of financial disasters – subprime mortgages, housing value decline, mistrust in the stock market – hurt those at the bottom of the economic ladder the most, as without a job or a good credit rating to fall back on, they were unable to buy a home or provide for their families.

1.1 Maps

To visualise the recession effect on home values, we plotted the average ZHVI for each state per year, as well as the differences between each year.

(a) Figure A
(b) Figure B
Figure 1: ZHVI, 2007-2015

The first map is normalized to a scale of 0 to 550,000, and the second map is normalized to a scale of -50,000 to +150,000. Overall, the maps reflect a significant drop followed by a recovery.

2 Description of Dataset

Our focus dataset was the ‘Zillow Economics Data’ found on Kaggle, which includes records of transactions, contracts, and specs about public properties throughout neighbourhoods, cities, statistical metropolitan areas, counties and states. The data are in a time series format running from 1996 to around 2014.

The observations in this dataset are derived from property listings and user behaviour on Zillow, and statistics were computed for metrics including the sales price, listing price, and rent price. These statistics were acquired for properties per number of bedrooms, property type, property tier, and overall. Some other features include, but aren’t limited to, the number of days the property was listed on Zillow, the raw inventory, the percentage of annual increase or decrease in property value, sales turnover, and raw sales.

We decided to conduct our analyses on the property statistics across metropolitan areas. In our data, metropolitan areas are specified by a CBSA code. This code corresponds to a particular CBSA, or core-based statistical area, which consists of one or more counties anchored by an urban centre. The Zillow Economics Data include both micropolitan and metropolitan CBSAs, where the urban centre of a micropolitan CBSA has between 10,000 and 50,000 people and the urban centre of a metropolitan CBSA has more than 50,000 people.

For reasons described in our section on PCA, we decided not to use statistics for individual metrics as our dependent variables, but instead another provided measure, the Zillow Home Value Index (ZHVI). The ZHVI is based on the Zestimate home valuation model; it takes the median estimate in a geographic area on a given day. The median Zestimate is more sensible to use than the mean because it handles extreme values much better. Thus, we proceeded to use the ZHVI as the basis for our AUB and ARIMA implementations. There is an equivalent for rental spaces, the Zillow Rent Index (ZRI), but we didn’t run analyses for these properties.

3 Methodologies

3.1 PCA

The first methodology we tried was Principal Component Analysis (PCA). The original dataset has approximately 95 columns of variables on which to build our analysis. For our study, we wanted to extract only a few useful features and so reduce the amount of redundant information. PCA is one way to reduce our high-dimensional, 95-column data to a low-dimensional representation. In summary, PCA finds a few directions (combinations of the original features) that capture most of the variation in the data and compresses the data by ignoring the components that are not meaningful. The data can therefore be summarized in as few dimensions as possible. The process is illustrated in Figure 2 below:

Figure 2: Visualization of PCA

As shown in the process, we select the important components by keeping the dimensions with the highest variance and discarding those with the lowest variance. The highest-variance dimensions maximize the amount of variation (“randomness”) that is preserved in the compressed data. Equivalently, the compressed representation is the one that minimizes the Mean Square Error (MSE) between the original data and its projection onto the retained dimensions.

We applied PCA to our dataset and reduced the 95 variables to 2. The result is plotted in Figure 3:
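To make the "keep the highest-variance direction" idea concrete, here is a minimal sketch in plain Python. It is not the code used for this report: the function name `first_principal_component`, the toy data, and the use of power iteration to find the leading direction are all assumptions made for illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance_along_direction, total_variance).

    `rows` is a list of equal-length numeric tuples, one per observation.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalise
    # until the vector settles on the direction of largest variance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along v (the largest eigenvalue of cov).
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance, total_variance

# Hypothetical toy data lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, variance, total_variance = first_principal_component(points)
explained = variance / total_variance  # share of variance kept by one dimension
```

On this toy data a single direction retains almost all of the variance, which is the situation in which discarding the remaining dimensions loses little information.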

Figure 3: Visualization of PCA on US nationwide data (n=3)

Unfortunately, this graph does not give us much insight, as there is little useful information we can extract from it. The graph is messy, and the approach does not scale because most regions in the original dataset have incomplete data. In conclusion, PCA, the first methodology we tried, did not help us find any interesting correlation between our data and the recession.

3.2 Area Under Baseline (AUB)

The Area Under Baseline metric seeks to answer the question: “How much did the recession affect the ZHVI of a city?” To answer this question, we must look at the total impact the recession had on the city in question. The procedure has three parts: (a) transform the ZHVI data into a moving average (hereafter denoted MA), (b) find the “recession window” for the city, and (c) find the area between the MA trend and the baseline ZHVI across the recession window.

Below, we illustrate the procedure on Aberdeen, WA, whose ZHVI trend graph is very conducive to the process. Figure 4 shows Aberdeen’s MA, computed with a window of 5. The red lines denote the recession window: the beginning of the window is the greatest local maximum of the MA after Jan. 1, 2007. The end of the window is the point in time where the MA intersects its value at the start point. The intuition here is that the city is considered “recovered” once the ZHVI has returned to its pre-recession value. If the ZHVI never reaches its pre-recession value, the end date of the recession window is set at the last available value in the dataset. Figure 4 also shows the baseline (green line), which is defined as the MA value at the start point of the recession window. Finally, we compute the residual between the baseline and the MA for each point in time between the start and end dates of the recession window, then sum those values to get the AUB for Aberdeen, WA.

Figure 4: MA for Aberdeen, WA

3.2.1 Theory and Notation

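First we find the 5-month moving averages of the ZHVI values in the recession period. The 5-month moving average for month $i$ is:

$\mathrm{MA}_i = \frac{1}{5} \sum_{j=i-4}^{i} z_j$

where $z_i$ is the ZHVI value at month $i$.

We look for local maxima by finding dates where the change in the moving average was positive before but negative after. If such local maxima exist, we take the one with the largest value:

$a = \arg\max_{i \in M} \mathrm{MA}_i$

where $M$ is the set of local maxima. The time $a$ at which this local maximum occurs is declared the time window start, and $\mathrm{MA}_a$ is the baseline.

If it exists, we declare the time window end $b$ to be the date of the first value greater than the baseline. If not, we choose the date at the end of the recession period:

$b = \min \{ i > a : \mathrm{MA}_i > \mathrm{MA}_a \}$ if such an $i$ exists, and $b = T$ otherwise,

where $T$ is the date at the end of the recession period.

Finally, we find the area under the baseline by summing the differences between the baseline and each value in the time window:

$\mathrm{AUB} = \sum_{i=a}^{b} (\mathrm{MA}_a - \mathrm{MA}_i)$

where $a$ is the window start, $b$ is the window end, and $\mathrm{MA}_i$ is the $i$th smoothed ZHVI value in the window.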
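The AUB procedure can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the report's actual code: the function names, the trailing alignment of the moving-average window, the `search_from` parameter (standing in for the "after Jan. 1, 2007" cutoff), and the toy series are all assumptions.

```python
def moving_average(values, window=5):
    """Trailing moving average; the first window-1 entries are None."""
    out = [None] * (window - 1)
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1:i + 1]) / window)
    return out

def area_under_baseline(ma, search_from=0):
    """Return (start, end, aub) for a smoothed series `ma`, or None if no peak."""
    # Local maxima: the series rises into the point and falls after it.
    peaks = [i for i in range(max(search_from, 1), len(ma) - 1)
             if ma[i] - ma[i - 1] > 0 and ma[i + 1] - ma[i] < 0]
    if not peaks:
        return None
    start = max(peaks, key=lambda i: ma[i])   # greatest local maximum
    baseline = ma[start]
    # Window ends at the first later value above the baseline,
    # or at the last observation if the series never recovers.
    end = next((i for i in range(start + 1, len(ma)) if ma[i] > baseline),
               len(ma) - 1)
    aub = sum(baseline - ma[i] for i in range(start, end + 1))
    return start, end, aub

# Hypothetical smoothed series: peak at 120, trough at 80, recovery to 125.
smoothed = [100, 110, 120, 110, 90, 80, 90, 110, 125, 130]
start, end, aub = area_under_baseline(smoothed)

# Hypothetical series that never returns to its peak of 120.
unrecovered = area_under_baseline([100, 120, 110, 100, 105])
```

Note that the sum includes the recovery month itself, so a series that overshoots the baseline at the end of the window contributes a small negative term, as in the toy example above.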

3.2.2 Results of Analysis

Overall, the top 3 and bottom 3 AUB scores and their respective cities are listed in the tables below. Since a higher AUB score should be interpreted as an indicator of a higher recession impact, we will call the top 3 and bottom 3 cities “losers” and “gainers,” respectively:

Top "Loser" Cities | AUB | Top "Gainer" Cities | AUB
Key West, FL | | Mt. Vernon, IL | 17,080
Salinas, CA | | McAlester, OK | 132,480
Carson City, NV | | Norfolk, NE | 287,040
Table 1: Results of AUB analysis

3.3 ARIMA Model

3.3.1 What is Time Series Forecasting?

Many problems in predictive modeling involve a time component. In standard predictive modeling, all prior observations are treated equally when predicting a future outcome; in time series analysis, the order of the observations matters. Time series analysis has two goals: understanding and describing the time series data, and making predictions, or forecasting. Descriptive time series analysis can help with prediction, since its models aid in identifying underlying causes, but it is not required and can be a significant investment. Forecasting fits models to historical data and uses them to predict future observations. “The purpose of time series analysis is generally twofold: to understand or model the stochastic mechanisms that gives rise to an observed series and to predict or forecast the future values of a series based on the history of that series.”

To better understand time series analysis, we could decompose a time series into the following parts:

  1. Level - The average value in a series.

  2. Trend - The often linear increasing or decreasing behaviour of the series over time. This component is optional: it is present in non-stationary series and absent in stationary ones.

  3. Seasonality - Repeating patterns of cycles in behaviour over time. The ZHVI measure provided by the Kaggle data is already smoothed and seasonally adjusted.

  4. Noise - Variability in the observations that the model cannot explain.

We can combine these components additively to model an observed time series:

$y_t = \mathrm{Level} + \mathrm{Trend}_t + \mathrm{Seasonality}_t + \mathrm{Noise}_t$

Time series data can also require ample scaling and cleaning to adjust for uneven frequency, time spacing, outliers, missing values, etc. The ZHVI data have been cleaned to supply the metro_data.csv table.

The autoregressive integrated moving average (ARIMA) is a time-series fitted model designed to aid in descriptive analysis and forecasting of time-series data. ARIMA is often applied to data that show non-stationarity. As implied by the name, ARIMA has these key attributes:

3.3.2 Autoregression

ARIMA employs a simple autoregression (AR) model, in which observations from previous time steps are used as input to a regression equation to predict the next value. Formally, we can write an autoregressive model of order $p$, AR($p$), as follows:

$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$

where $c$ is a constant, $\varphi_1, \dots, \varphi_p$ are the parameters of the model, and $\varepsilon_t$ is noise. $p$, the order of the autoregressive model, represents the number of lags, or previous observed series values, to be included in the model.

Since the regression uses a neighborhood of preceding terms, we can express this model equivalently with a backshift operator, $B$:

$X_t = c + \sum_{i=1}^{p} \varphi_i B^i X_t + \varepsilon_t$

A backshift, or lag, operator operates on an element of a time series to produce the previous element. Let us define an arbitrary time series $X = \{X_1, X_2, \dots\}$. Then $B X_t = X_{t-1}$ for all $t > 1$. The backshift operator can be raised to arbitrary integer powers so that $B^k X_t = X_{t-k}$.
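As a small illustration of how an autoregressive model of order p turns past values into a forecast, the sketch below computes a one-step prediction as a constant plus a weighted sum of the last p observations (the noise term has mean zero, so it drops out of the point forecast). The function name `ar_predict` and the coefficients are made up for the example and are not fitted values from this study.

```python
def ar_predict(history, c, phi):
    """One-step AR(p) point forecast: c + sum_i phi[i-1] * X_{t-i}."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # history[-1] is X_{t-1}, history[-2] is X_{t-2}, and so on.
    return c + sum(phi[i] * history[-1 - i] for i in range(p))

# Hypothetical AR(2) model: X_t = 1.0 + 0.5 * X_{t-1} + 0.3 * X_{t-2} + noise
history = [10.0, 12.0, 13.0]
forecast = ar_predict(history, c=1.0, phi=[0.5, 0.3])  # 1.0 + 0.5*13 + 0.3*12
```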

3.3.3 Integration

The ARIMA model is integrated, meaning it uses a process known as differencing, in which observations at consecutive time steps are subtracted from one another. This makes our non-stationary time series stationary, stabilising the mean by removing trend. Where differencing also reduces seasonality, the variance of the time series is stabilised as well.
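Differencing is easy to demonstrate on toy data. In the sketch below (the function name `difference` and the series are illustrative assumptions), one round of differencing removes a linear trend and two rounds remove a quadratic one, leaving a constant series in each case.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

linear = [3 + 2 * t for t in range(6)]      # 3, 5, 7, 9, 11, 13
quadratic = [t * t for t in range(6)]       # 0, 1, 4, 9, 16, 25

flat_once = difference(linear)              # trend removed after one pass
flat_twice = difference(quadratic, d=2)     # needs two passes
```

Each pass shortens the series by one observation, which is why the order of differencing is usually kept as small as possible.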

3.3.4 Moving Average

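We express a moving average model of order $q$, MA($q$), as:

$X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$

where $\mu$ is the mean of the series, $\theta_1, \dots, \theta_q$ are the parameters of the model, and $\varepsilon_t, \varepsilon_{t-1}, \dots, \varepsilon_{t-q}$ are noise error terms. Written in terms of the backshift operator:

$X_t = \mu + (1 + \theta_1 B + \dots + \theta_q B^q) \varepsilon_t$

Simply put, the average here is the central value of our set of numbers, but it is calculated over values of the dependent variable at different time intervals. The order $q$ denotes the size of the moving average window.

The ARIMA() model in Python accepts $p$, $d$, and $q$, where $p$ is the autoregressive order, $d$ is the order of differencing, and $q$ is the moving average order.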
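To show how a moving average model of order q builds each value from the current and previous error terms, the following sketch evaluates the series for a given list of errors. The function name `ma_series`, the mean, the coefficient, and the error values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def ma_series(mu, theta, eps):
    """Values of an MA(q) process, X_t = mu + eps_t + sum_i theta_i * eps_{t-i},
    for every t that has q earlier error terms available."""
    q = len(theta)
    return [mu + eps[t] + sum(theta[i] * eps[t - 1 - i] for i in range(q))
            for t in range(q, len(eps))]

# Hypothetical MA(1) process with mean 10 and theta_1 = 0.5.
errors = [1.0, -1.0, 2.0, 0.0]
values = ma_series(10.0, [0.5], errors)
# t=1: 10 - 1 + 0.5*1    = 9.5
# t=2: 10 + 2 + 0.5*(-1) = 11.5
# t=3: 10 + 0 + 0.5*2    = 11.0
```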

After importing all necessary libraries and the data table, we ran the ARIMA analysis on a sample metropolitan area, San Diego. First, we produced visualisations of the ZHVI trend and a correlogram. The correlogram, or autocorrelation plot, shows the sample autocorrelation of the series at each lag value. We want to choose a nonzero value of $p$ for which the autocorrelation is high, so that the lags included in the model carry information about the next value and the forecasting model does not systematically overestimate or underestimate the true values.

Once the model has been fitted, we obtain a summary of the fit. We also plotted the distribution of the residual errors, which may reveal trend information the model has not captured. The density plot of the residual values shows that they are Gaussian but not centred at zero. This is indicative of a bias in the prediction.

We now test our model and produce a 95% prediction interval for our forecasted ZHVI in San Diego from 2017 to 2020. Note that there is a slight overlap between the in-sample and out-of-sample predictions.
Finally, we calculate the area of the prediction interval.
The smaller the area of the 95% prediction interval, the less volatile the recovery over a longer period of time, and the more certain we are that the ZHVI will continue its increasing trend. We proceeded to calculate the area of the 95% prediction interval for each metropolitan area. The distribution of the normalised areas across our observations is shown above.
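The quantity shown in a correlogram can be computed directly. The sketch below is a plain-Python version of the usual sample autocorrelation at a given lag; the function name and the toy series are assumptions for illustration and are not taken from the San Diego analysis.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# A steadily trending series is strongly correlated with its recent past.
trending = list(range(1, 11))
acf_trend = [autocorrelation(trending, k) for k in range(4)]

# A series that flips sign every step is negatively correlated at lag 1.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
acf_alt_1 = autocorrelation(alternating, 1)
```

Lags at which this value is far from zero are the ones that carry information about the next observation, which is why they are natural candidates when choosing the autoregressive order.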
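One way to turn an interval band into a single number is to integrate its width over the forecast horizon. The sketch below does this with the trapezoidal rule at one-month spacing. The report does not state how the areas were normalised, so dividing by a reference value such as the last observed ZHVI is an assumption, as are the function name and the bounds used here.

```python
def interval_area(lower, upper, normalise_by=None):
    """Area between the lower and upper bounds of a forecast interval,
    using the trapezoidal rule with unit spacing between time steps."""
    widths = [u - l for l, u in zip(lower, upper)]
    area = sum((w0 + w1) / 2 for w0, w1 in zip(widths, widths[1:]))
    # Optional scaling so that expensive and cheap metros are comparable.
    return area / normalise_by if normalise_by else area

# Hypothetical bounds that widen as the forecast horizon grows.
lower = [98.0, 95.0, 90.0]
upper = [102.0, 105.0, 110.0]
raw_area = interval_area(lower, upper)                      # widths 4, 10, 20
scaled_area = interval_area(lower, upper, normalise_by=100.0)
```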

4 Results

4.1 Geographic Clustering

By plotting the cities we obtained from each of the two working methodologies in the previous section (AUB and ARIMA), we can observe a couple of patterns:

Figure 5: Visualization of results from AUB and ARIMA

First, we notice that the gainer cities from both methods are primarily clustered in middle America and the Great Plains, while many loser cities are located in the West. This defies conventional wisdom, which holds that the West was not hit as hard as the Great Plains region thanks to its burgeoning technology industry. However, we can understand why Key West was marked as a loser: it has a big tourism industry that was severely hit during the Recession, when fewer people could afford vacations.

4.2 Metrics vs. Population

We plot our AUB and ARIMA outputs against population to check for correlation. Figure 6 shows the two resulting graphs.

(a) Figure A
(b) Figure B
Figure 6: Regressing both methods against population statistics

Each blue dot represents a metro area. Since the r-values for both graphs are low, we conclude that there is little to no linear correlation between population and our metrics. In the future we may consider transforming our data before testing for correlation.
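For reference, the r-value of a scatter plot can be computed as below. This is a generic sketch of the Pearson correlation coefficient in plain Python; the function name and the data points are invented for illustration and do not correspond to any metro area in the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical metric values for four metros.
xs = [1, 2, 3, 4]
ys = [2, 1, 4, 3]
r = pearson_r(xs, ys)
r_squared = r ** 2   # share of variance explained by a straight-line fit
```

Squaring r gives the R-squared of the simple linear regression, so a low r-value and a low R-squared describe the same weak linear relationship.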

4.3 Metrics vs. Unemployment

We also plotted our AUB and ARIMA outputs against unemployment rates in the hope of finding a correlation. Figure 7 shows the two resulting graphs.

(a) Figure A
(b) Figure B
Figure 7: Regressing both methods against unemployment rate

Each blue dot in the graphs represents a metro area. The regression lines of these two plots suggest a correlation in which the hardest-hit cities also had the biggest job losses. However, because the R-squared values are quite low, this correlation is weak. In conclusion, plotting population and unemployment rates against our AUB and ARIMA outputs unfortunately did not show significant results.

5 Conclusion

Although we were able to come up with two unique metrics for determining the impact of the recession, neither of them seemed to agree with the commonly accepted indicators of recession impact: unemployment rate and city size. We reach this conclusion because the AUB and ARIMA scores do not correlate well with population or unemployment rate. This could be for a variety of reasons. The most prominent is that housing data such as the ZHVI is perhaps not the best indicator of a recession after all. Another issue could be noise in the data. In many cases, the average ZHVI of a given metropolitan area was much larger than that of the others, a feature of the dataset we could have corrected for by normalizing each metro’s data before applying a moving average.

5.1 Future Steps

Moving forward, some steps we may take are as follows:

  1. An investigation of the spillover effect in our data.

    1. According to a CityLab article, the spillover effect occurs when an economic event in one city or metropolitan area is repeated in an adjacent city or metropolitan area. The study in the article pointed out that this was evident in a number of metros, including Chicago, New York, and Hartford, while Washington, D.C., Austin, and Providence were able to ride out the recession on their own. Essentially, we will want to check whether our data exhibit relationships between ZHVI trends in one city and those of its neighbouring cities.

  2. A new algorithm that combines the AUB method and ARIMA forecasting. This may involve tuning hyperparameters of our ARIMA model to get more detailed predictive curves/trends on which we can then apply the Area Under Baseline.

    1. We can visualise the median algorithm outputs per state on a map, similar to those shown toward the beginning of the paper.

    2. With enough tweaking, we could use this algorithm to predict the effect of the 2020 recession on ZHVI values. This will involve more research on how to work with non-stationary time data, and may lead us to use another, more flexible time-series module.

  3. Comparing the methodologies other researchers have used to measure recession effects on housing metrics, and seeing whether our results stack up. This will give us more insight into which machine-learning approach we should employ with the sort of data we are given.


  • [1] The great recession, 2019.
  • [2] Kimberly Amadeo. What caused the 2008 financial crisis and could it happen again?, 2019.
  • [3] Richie Bernardo. 2017’s most and least recession-recovered cities, Jan 2017.
  • [4] CityLab, University of Toronto’s School of Cities, and Rotman School of Management. Which u.s. cities suffer the most during a recession?, Jun 2016.
  • [5] Investopedia. How the 2008 housing crash affected the american dream, Nov 2019.
  • [6] Robert Rich. The great recession, 2019.
  • [7] Lauryn Ringwood, Philip Watson, and Paul Lewin. A quantitative method for measuring regional economic resilience to the great recession. Growth and Change, 50(1):381–402, 2019.