I Introduction
Online realestate information systems, such as Trulia, Zillow, Yahoo! Homes and Realtor.com, have gained enormous popularity in a trend termed as onlinetooffline commerce (O2O). One important feature offered by these online systems, e.g., Zestimate [1], Trulia Estimates [2]
, is the estimate of a home’s worth through machine learning based on diverse realestate information, including comparative sales of similar homes and home attributes like the numbers of bedrooms and bathrooms, square footage, lot sizes, build year, characteristics of the basement, presence of fireplaces or airconditioning, etc. For example, about 100 million homes across the US currently have Zestimate values
[1]. Although automatically computed home price estimates may not substitute official appraisals conducted by banks, governments or realestate agents, they serve as a starting point in determining home values, provide vast amounts of valuable data to homebuyers and investors, and can greatly aid offline realestate transaction decisions.House price estimation is an interdisciplinary research topic that has been widely studied by researchers from diverse backgrounds, including economists, statisticians and computer scientists. Early methods are based on parametric models, including Hedonic models
[3, 4]based on linear regression of house attributes and CS index based on repeat sales modeling
[5]. Later models take into account spatial correlation [3, 6] between houses in a same community or region to estimate a house’s market value based on neighboring houses with known values, e.g., those that are sold in the recent year.A milestone in modern dataintensive realestate valuation literature is the introduction of semiparametric or nonparametric models
[7, 8, 9], which assume that the house prices can be decomposed into a latent underlying land desirability and a regression of house attributes. In all these contemporary semi/nonparametric models, a common assumption is that the underlying land desirability of a house is a weighted average of its neighbours’ desirability, where the weights as well as the neighbourhood of the house are determined by certain kernel functions used to assess the closeness of two houses in the feature space. However, such kernelbased semi/nonparametric methods suffer from several intrinsic weaknesses. First, its hard to choose the specific form of kernel functions with a large number of hyperparameters to tune. More importantly, by using uniformly damping weights depending on the distance, kernel functions cannot handle irregularly shaped nonconvex regions or regions with interior uninhabited areas, which are common in reality. For example, when calculating the latent desirability of a house, a “neighbour” on the other side of a river may have the same influence as a “neighbour” on the same side of the river, as long as the two “neighbours” have the same distance to the house in the feature space. Neither can they handle land desirability discontinuities within a same region containing implicit subcommunities, e.g., estate homes vs. regular homes, homes near a coast or on a hill with views vs. other homes, etc.In this paper, we explore the novel application of latest advances in spatial functional learning and finite element analysis to realworld house price modelling problems. We propose the hierarchical spatial functional model (HSFM), which can handle irregularly shaped geographic regions as well as model land desirability discontinuities within heterogenous regions in reality. We have made multiple contributions:
First, similar to kernelbased methods, we assume the house value is composed of a regression from house features and a latent land desirability surface that depends on location. However, in our spatial functional approach, a smooth latent surface is learnt through a finiteelement spatial spline regression approach recently developed in statistics [10, 11, 12], thus overcoming the limitation of uniform damping in kernelbased methods and being able to handle any irregularly shaped nonconvex regions.
Second, the original spatial spline regression still yields large errors in modeling realworld house price data, as it tends to smoothen a spatial field throughout the region of interest, which may actually be heterogeneous in terms of both house prices and features in reality, characterized by implicit patches and hidden land value discontinuities. In our hierarchical spatial functional model, to further handle hidden land value discontinuities, we assume that a home’s value is the combination of the global land desirability in the region of interest, and the local land desirability in the hidden subregion or subcommunity it belongs to. We develop an effective method to learn this hierarchical model through the use of spatial constrained partitioning based on locations and the latent land desirability, to automatically discover hidden subcommunities.
We evaluate the proposed approach on a dataset of over 6000 houses in Edmonton, Alberta, Canada in a typical region along a river, featuring heterogeneous hidden subcommunities and strong nonconvexity. Results suggest that our HSFM approach reduces the root mean square error (RMSE) by 28.22% and reduces the mean relative absolute error (MRAE) by 21.61% over original spatial spline regression, leading to an MRAE of only 6.60% in house price estimation.
Ii Relationship to Prior Work
House market value estimation in a region based on partially known house prices is an interdisciplinary research topic which lies at the intersection of urban economics, statistics and computer science.
Parametric Models. Earlier methods are mainly based on linear regression. The socalled hedonic model says that the price of a house is a linear function of its attributes [3], [4], e.g., lot size, square footage, number of bedrooms, distance to landmarks, etc. However, the method is limited by the linearity assumption, and it is impossible to gather all relevant attributes. One important development in realestate value estimation is to consider the spatial correlation, since many implicit community variables, e.g., employment rate, income, school, land desirability, etc., are hard to measure but will lead to spatial correlation of house prices. [3], [6] incorporate the spatial autocorrelation into the residuals of the linear regression hedonic model. However, such parametric models all assume a fixed number of parameters and cannot grow the number of parameters with the amount of training data. This becomes a serious limitation in the presence of an everincreasing amount of data.
Nonparametric Models and KernelBased Methods
. Semi/nonparametric models have appeared in modern computationalintensive house valuation techniques, and almost all existing methods of this kind fall in the class of nearest neighbour models and adopt a kernelbased interpolation
[13], [14] or the socalled kernelbased locally weighted regression [7], [9]. However, the major limitation of kernelbased methods mentioned above is that the kernel function used to dictate the weights of correlation between neighboring houses is uniformly damping with their distance in the feature space. Due to this reason, kernel functions is not suitable handle nonconvex regions, since they tend to link houses across uninhabited areas (e.g., a river or a hill) where the linkage is weak. Neither can kernel functions model complex phenomena such as sudden land value jumps within the same region due to the existence of subcommunities.Spatial Functional Analysis. Spatial functional analysis [10] has been applied to census information prediction and temperature prediction. It is able to recover a smooth spatial field over nonconvex regions with irregular boundary. To the best of our knowledge, we are the first to apply latest spatial functional analysis techniques to realestate price modeling, assuming the house price consists of a regression part based on house features and a spatial field that models the land desirability at a certain location. In particular, we leverage the idea of spatial spline regression to handle any irregularly shaped geographic regions and thus overcome the limitations of the existing kernelbased methods in house valuation literature. Furthermore, we have developed a novel hierarchical spatial functional model to interpret the inherent house price heterogeneity that exists in most realworld regions of interest, and propose effective training algorithms to automatically capture land desirability jumps between the hidden subcommunities that are common in reality.
Iii Models
The problem of house value estimation through comparative analysis is to estimate the market values of all houses in a certain region, based on information like locations, attributes, and the values of a subset of all houses (e.g., those recently sold). Suppose the region of interest is modelled by an irregularly bounded domain that excludes the uninhabited areas such as parks, rivers, industrial areas, hills and so on. is given as an input and can be generated from the collection of houses being considered according to a process to be outlined in Sec. IV. Let denote the geographic location (e.g., GPS coordinate) of each property in the bounded domain. Let be the value (or log value) of property and
be a vector of
covariates associated with property that represents attribute values (e.g., lot size, net area, number of bedrooms/bathrooms/garages, age, etc.)Iiia The Spatial Functional Model
Modern house value estimation literature [7, 8, 9] all models the value of a house as a combination of its underlying land desirability and attribute value, leading to a semiparametric model:
(1) 
where is a realvalued function or a spatial field that models the underlying land desirability over different positions and contains the regression coefficients for the property’s attributes. The scalars are independent residuals. When is the original house value, (1) is an additive model. On the other hand, when is the log value of house , (1) is essentially a multiplicative model for the original house value and can accommodate multiplicative factors, e.g., lot size, square footage, in an additive form [5, 7, 8, 9].
Furthermore, is assumed to be twicedifferentiable over . The model above enables a key assumption that many spatial data studies hinge upon, that is, the property lots close to each other are likely to have a similar land desirability, while price variations among neighbours are still captured by different attributes .
Both the land desirability and the covariate coefficients can be learned based on training data points in , which contain the following information: 1) the values (or log values) of these points: , 2) their positions , and 3) an covariate matrix , whose th row is given by . Once and are learned from the training data, the value of any property can be estimated by plugging its attribute information and position into (1). The model (1) can be trained using either kernelbased methods [7, 8, 9] or finite element analysis [10, 11, 12].
However, the weakness of model (1) is that it is not capable of capturing the inherent heterogeneity that exists in realworld regions. Fig. 1 shows that the house prices of Riverbend area in Edmonton, AB Canada. The house prices in the middleleft region and the bottomright corner are much higher than other regions. It turns out that the North Saskatchewan River is on the left boundary and the middleleft area is also next to the Terwillegar Park. Therefore, the houses in that region have more privacy and better views. The bottomright corner has similar profile as it is near the Whitemud Creek. Moreover, the houses in this area were built more recently and thus have more expensive interior and exterior features than others.
Such hidden heterogeneity related to house features and/or land values leads to the discontinuity/jump of land desirability surface and varied attribute coefficients. These sudden jumps in desirability and attribute coefficients across local subregions can not be captured by model (1), where a smooth and a constant are assumed over . Furthermore, such local subregions are usually hidden and are not explicitly outlined by the municipal division, as shown in Fig. 1.
IiiB A Hierarchical Spatial Functional Model
We now propose the hierarchical spatial functional model (HSFM) to learn the values (or log values) by the sum of a global value function and a local value function , both depending on attributes and locations . To learn both value functions and , we adopt the semiparametric model in (1). In summary, our hierarchical spatial functional model is of the form
(2)  
where , for , , , which means that is a partition of and the scalars , , and are independent residuals. The global value function is homogeneous in the entire domain . The local value function is homogenous in each local region yet heterogeneous between different local regions . In other words, captures the global contributions of attributes , while capture the local attribute contributions and are different for each local region . Similarly, the spatial field represents the global underlying land desirability, while the spatial fields stand for the local underlying land desirabilities, which are different for each local region.
Our hierarchical spatial functional model has two levels, a global level and a local level. It is more general than (1) where only a global level is learned and becomes the same as (1) without the local value function . At the global level, our model learns the global coefficients for attributes and the global spatial field . At the local level, the local attribute coefficients and local spatial fields are learned in each local homogeneous subregion. Therefore, our model is more powerful than model (1) and is capable of not only characterizing the global trends but also distinguishing the local variabilities to handle land desirability discontinuities. There are various methods to identify the homogeneous local regions . In this work, we use a spatial constrained clustering method based on the global land desirability to partition the whole region into subregions, as shown in Fig. 2.
The model (2) can be learned in a sequential way like other hierarchical models. In particular, we first estimate the global value function as , based on the training data, which can be done using the same methods for learning model (1). Then, using the residuals , we obtain the estimate of the local value function for each local region based on certain partitions of the whole region . Note that the partitions may depend on our estimated global spatial fields in the first step. We describe our learning algorithms in details in Sec. IV.
Iv Learning Algorithms
In this section, we first describe a finiteelement analysis approach called Spatial Spline Regression (SSR) that can approximately solve model (1) for both and over any irregularly shaped domain . We then present the training algorithm of our hierarchical spatial functional model (2) to learn and at the global scale as well as each and in local regions.
Iva Spatial Spline Regression
Spatial spline regression [10] is a recently proposed finiteelement analysis approach to jointly solve for and from model (1) over any irregularly shaped domain . To obtain a smooth estimate of the land desirability spatial field , based on functional analysis, the following penalized sum of square errors is minimized [10], [12]:
(3) 
where denotes the Laplacian of over to smoothen out the roughness of the spatial field , since neighboring houses should share similar land desirability. The tuning parameter is used to control the smoothing of the land value surface and can be selected using some datadriven or ad hoc methods.
However, the challenge to solving problem (3) is that it involves searching for a functional over a nonconvex domain that may have strong concavities, complicated boundaries, and even interior holes. Spatial spline regression [10] can solve the type of problem (3) via finite element analysis: the domain is divided into small disjoint triangles, and a polynomial function is defined on each of these triangles, such that the summation of these polynomial functions defined on different pieces closely approximates the desired . In our data analysis, we use the package Triangle (version 1.6) to generate Delaunay triangulation mesh for finite element analysis, see https://www.cs.cmu.edu/~quake/triangle.html for details.
Specifically, let denote the vertices of all the small triangles, which are called control points and can be adaptively selected by available data points. We define a piecewise linear or quadratic basis function called Lagrangian finite element with , associated with each control point such that evaluates to at and is equal to at all other control points. Therefore, according to the Lagrangian property of the basis we can approximate for any only using the values of on the control points, i.e., . That is, if we let denote the predefined basis functions, each corresponding to a control point, then we have
(4) 
Since are predefined and known a priori, the variational estimation of in problem (3) boils down to the estimation of only scalar values, i.e., . With the piecewise approximation given by (4), solving (3) is simply solving a set of linear equations for . The estimator for can then be derived from (4), as shown in [10].
IvB Learning the Hierarchical Spatial Functional Model
Although spatial spline regression is capable of handling irregularly shaped regions, its main disadvantage is that it tends to smoothen the land desirability throughout an entire region of interest and can not capture price discontinuities between hidden subcommunities that are exist in the heterogeneous region, e.g., the left lower corner region in Fig. 1. Hierarchical Spatial Functional Model (HSFM) solves this issue by performing spatial functional analysis at both the global scale and local scales for each hidden subregion to characterize the price jumps.
We use a cascaded approach to train an HSFM in Algorithm 1. First, a global spatial field and global attribute contributions are fitted for the entire region using SSR. This removes the overall global trend and obtain the global residuals. Second, we generate a geopartition of the entire region containing , , using a spatial constrained clustering algorithm to be described in detail in the next subsection. The key idea and the challenge of our geopartition algorithm are to obtain clusters that are both geographically separated and characterized by price discontinuities between clusters. Third, for each local region , we learn a local spatial field and local attribute contributions based on the global residuals of the points in to capture the local information. Since the local spatial fields are fitted for each disjointly, the hierarchical method allows for the nonsmoothness or discontinuities between the local patches , . As a result, this sequential learning approach significantly increase the prediction power by exploring the intrinsic hierarchical structure of house prices.
IvC GeoPartitioning Algorithm
The implementation of the model (2) requires the spatial partition of the whole region into local regions , such that , for , and . This partitioning is the Step 2 in Algorithm 1. In order to achieve higher prediction power, the partitioning should be based on meaningful criteria and thus the resulted partition is capable of capturing the homogeneous local regions in term of house prices. In other words, in each subregion, the local spatial fields is smooth and the local attribute contributions ’s are constant. In this work, we propose to use a spatial constrained clustering based on the global land desirability to partition the whole region into subregions, namely geopartition; see Fig. 2.
In geopartition, we propose to use a spatial constrained clustering approach based on the estimated global fields . In particular, to identify significant areas as partition patches, we combine the Penalized Spatial Distance (PSD) measure [15] with the Clustering by Fast Search and Find of Density Peaks (CFSFDP) clustering algorithm [16] to partition houses based on both spatial and nonspatial attributes. The PSD distance measure takes both spatial coordinate and nonspatial attributes into account, and is able to yield geographically separated clusters when combined with specific clustering algorithms. The calculation of PSD distance requires the nonspatial attribute’s value surface. We utilize the house price as nonspatial attribute. To get a surface fitting of house prices, we apply the SSR model to the dataset with . Then we calculate the PSD distance between each pair of houses to get a distance matrix .
We then apply the CFSFDP clustering algorithm on the distance matrix to divide houses into geographically separated subregions. CFSFDP clustering algorithm spot clusters based on the idea that cluster centers are characterized by a higher density than their neighbours and by a relatively large distance from points with higher densities. It is able to detect variant shape clusters, and it doesn’t require to specify the number of clusters. We refer to [15] and [16] for more details about PSD and CFSFDP.
The resulted three clusters are shown in Fig. 2. The left two subregions are split by the Riverbend Hill Road. The houses in the left top subregion were built earlier than the other two subregions and there are relatively more condos in that subregion. This means geopartition indeed generates meaningful partition and our model will gain more prediction power by learning local structure from those subregions.
V Performance Evaluation
To evaluate the effectiveness our proposed approach in modeling house values in typical urban regions in the North America, we use the publicly accessible realestate property assessment data from the City of Edmonton, Alberta, Canada, which is the 5th largest city in Canada. The dataset considered contains the assessed market values of 6130 residential houses (including single houses, duplex and townhouses) in the RiverBend area of Edmonton in 2015, together with their corresponding features including effective build year, net area, characteristics of basement, presence of a fire place, presence of air conditioning, lot size, site coverage and north/south location. The data considered is quite heterogeneous: the mean house value is CAD 655,894 and the median is CAD 598,000, while the house values range from only CAD 65,000 all the way up to CAD 2,855,500, with the heat map of house values shown in Fig. 1. The datasets and code are made publicly available ^{1}^{1}1https://github.com/BangLiu/RealEstateModeling/ for future research.
It is worth noting that the collected data points are still sparse and only represent half of the houses in the region being considered in Fig. 1. To test the generalizability of the model, we randomly sample 80% of all houses as the training set, where we know the full information about the houses including their assessed value, features and GPS locations. The other 20% of the houses serve as the test data, where only their features and GPS locations are known. The objective is to estimate the assessed values of houses in the test set.
In particular, we evaluate and compare the following approaches on the same data:

LR: linear regression based on the model without the spatial field.

SSR (spatial field only): spatial spline regression based on the model .

HSFM (geopartition): hierarchical spatial functional model (2) with spatial constrained clustering based on CFSFDP at the local scale.
For each model, we have two versions, where is either the original value or logvalue of house . We always evaluate the performance by the mean squared error (MSE) and mean relative absolute error (MRAE) of the produced estimates for the original (nonlog) house value. For HSFM models, the geopartitioning can actually be precomputed before the global and local spatial functional fitting happens.
Fig. 3 and Fig. 4 compare the performance of baseline models and models proposed in this paper by presenting the MSEs and MRAEs given by different methods respectively. As we can see that our proposed models significantly outperforms baseline models by a great percentage, which proves the effectiveness of our models.
Comparing HSFM with geopartition to the first 3 baseline models, we can see that it improves a lot by introducing the secondary level model fitting on local subregions. In this way, after we fit the global trend of house price surface, the local level model fitting is able to further characterize different regions’ local variance, leading to the improvement of modeling accuracy. Compare the MSEs of log version and nonlog version algorithms, we can see that the log version method gives better MSE, while nonlog version gives better MRAE.
Method  1%  2%  3%  5%  10%  15% 

LR (nonlog)  0.07  0.15  0.21  0.34  0.62  0.80 
SSR (nonlog)  0.08  0.17  0.24  0.40  0.69  0.84 
HSFM (nonlog)  
LR (log)  0.09  0.19  0.28  0.42  0.71  0.85 
SSR (log)  0.11  0.19  0.28  0.45  0.73  0.86 
HSFM (log) 
As a summary, we compare the CDF of relative absolute errors (RAEs) under different methods in Table I. We can see that HSFM (log) yields the best performance with of estimates having an error less than , proving the superior performance of the new model.
Vi Concluding Remarks
We propose the Hierarchical Spatial Functional Model (HSFM) for house price modeling based on the recent advancement in spatial functional data analysis. It models house prices as a combination of a global spatial field, representing the globalscale land desirability surface, and multiple local spatial fields, each characterizing a localscale land desirability surface, as well as a linear regression from house features. We propose effective methods to discover the hidden homogeneous subcommunities that may exist in a heterogeneous region of interest based on spatial constrained clustering. We then solve the proposed HSFM by applying a recently developed finiteelement analysis technique called spatial spline regression on both the global and local levels in sequence. We demonstrate the effectiveness of our model for house price estimation based on partially known house prices based on more than 6000 houses in Edmonton, Alberta, Canada.
References
 [1] Zestimate, “Zestimate,” 2016, [Online; accessed 17June2016]. [Online]. Available: https://www.zillow.com/zestimate/
 [2] Trulia, “Trulia estimates,” 2016, [Online; accessed 17June2016]. [Online]. Available: https://www.trulia.com/
 [3] R. A. Dubin, “Predicting house prices using multiple listings data,” The Journal of Real Estate Finance and Economics, vol. 17, no. 1, pp. 35–59, 1998.
 [4] S. C. Bourassa, F. Hamelink, M. Hoesli, and B. D. MacGregor, “Defining housing submarkets,” Journal of Housing Economics, vol. 8, no. 2, pp. 160–183, 1999.
 [5] K. E. Case and R. J. Shiller, “The efficiency of the market for singlefamily homes,” The American Economic Review, vol. 79, no. 1, pp. 125–137, March 1989.

[6]
R. K. Pace, R. Barry, J. M. Clapp, and M. Rodriquez, “Spatiotemporal autoregressive models of neighborhood effects,”
The Journal of Real Estate Finance and Economics, vol. 17, no. 1, pp. 15–33, 1998.  [7] J. M. Clapp, “A semiparametric method for estimating local house price indices,” Real Estate Economics, vol. 32, no. 1, pp. 127–160, 2004.
 [8] S. Chopra, T. Thampy, J. Leahy, A. Caplin, and Y. LeCun, “Discovering the hidden structure of house prices with a nonparametric latent manifold model,” in Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007, pp. 173–182.
 [9] A. Caplin, S. Chopra, J. V. Leahy, Y. LeCun, and T. Thampy, “Machine learning and the spatial structure of house prices and housing returns,” Available at SSRN 1316046, 2008.
 [10] L. M. Sangalli, J. O. Ramsay, and T. O. Ramsay, “Spatial spline regression models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 75, no. 4, pp. 681–703, 2013.
 [11] S. N. Wood, “Thin plate regression splines,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 65, no. 1, pp. 95–114, 2003.
 [12] T. Ramsay, “Spline smoothing over difficult regions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 64, no. 2, pp. 307–319, 2002.
 [13] R. K. Pace, “Nonparametric methods with applications to hedonic models,” The Journal of Real Estate Finance and Economics, vol. 7, no. 3, pp. 185–204, 1993.
 [14] P. M. Anglin and R. Gencay, “Semiparametric estimation of a hedonic price function,” Journal of Applied Econometrics, vol. 11, no. 6, pp. 633–648, 1996.
 [15] B. Zhang, W. J. Yin, M. Xie, and J. Dong, “Geospatial clustering with nonspatial attributes and geographic nonoverlapping constraint: a penalized spatial distance measure,” in Advances in Knowledge Discovery and Data Mining. Springer, 2007, pp. 1072–1079.
 [16] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.