Refining Coarse-grained Spatial Data using Auxiliary Spatial Data Sets with Various Granularities

09/21/2018 ∙ by Yusuke Tanaka, et al. ∙ Kyoto University

We propose a probabilistic model for refining coarse-grained spatial data by utilizing auxiliary spatial data sets. Existing methods require that the spatial granularities of the auxiliary data sets are the same as the desired granularity of target data. The proposed model can effectively make use of auxiliary data sets with various granularities by hierarchically incorporating Gaussian processes. With the proposed model, a distribution for each auxiliary data set on the continuous space is modeled using a Gaussian process, where the representation of uncertainty considers the levels of granularity. The fine-grained target data are modeled by another Gaussian process that considers both the spatial correlation and the auxiliary data sets with their uncertainty. We integrate the Gaussian process with a spatial aggregation process that transforms the fine-grained target data into the coarse-grained target data, by which we can infer the fine-grained target Gaussian process from the coarse-grained data. Our model is designed such that the inference of model parameters based on the exact marginal likelihood is possible, in which the variables of fine-grained target and auxiliary data are analytically integrated out. Our experiments on real-world spatial data sets demonstrate the effectiveness of the proposed model.


1 Introduction

Many cities around the world are now collecting large amounts of spatial data from a wide range of sources. Governments and other organizations are releasing data on items such as poverty rate, air pollution, traffic flow, energy consumption and crime [Shadbolt et al.2012, Goldstein and Dyson2013, Barlacchi et al.2015]. Analyzing such spatial data is of critical importance in improving the life quality of citizens in many fields such as socio-economics [Rupasinghaa and Goetz2007, Smith, Mashhadi, and Capra2014], public health [Jerrett et al.2013], public security [Bogomolov et al.2014, Wang et al.2016] and urban planning [Yuan, Zheng, and Xie2012]. For example, knowing the spatial distribution of poverty enables us to optimize allocation of resources for remedial action. Likewise, the spatial distribution of air pollution is useful in creating policies that can control air quality and thus protect human health.

(a) Community (b) Borough
Figure 1: The distribution of poverty rates at different spatial granularities.

Naturally, information at fine spatial granularity is preferred because it allows us to identify key regions that require intervention to improve city environments efficiently. As an example, Figures 1(a) and 1(b) visualize the distributions of poverty rates in New York City by community district and by borough, respectively; darker hues represent poorer regions. Clearly, Figure 1(a) is more useful than Figure 1(b) for understanding socio-economic problems. In practice, however, such information is often aggregated into coarse granularities as in Figure 1(b). It is usually thought to be too time-consuming and costly to conduct a census over the whole population of a city, and a sample survey is conducted instead. Accordingly, the number of samples associated with each fine-grained region may not be large enough to provide a statistically significant estimate of the value associated with this region; the typical response is to aggregate samples over larger regions [Smith, Mashhadi, and Capra2014].

With the recent increase in data availability, utilizing auxiliary spatial data sets on the same region is an effective way of refining coarse-grained target data [Bogomolov et al.2014, Park2013, Smith and Capra2016, Smith, Mashhadi, and Capra2014, Wotling et al.2000]. In these works, regression models are used for estimating the relationships between target data (e.g., poverty rate) and auxiliary data sets (e.g., unemployment rate). These existing methods, however, require that the spatial granularities of all the auxiliary data sets are the same as the desired granularity of the target data. This requirement prevents us from making full use of auxiliary data sets with various granularities, because in practice the auxiliary data sets are associated with various geographical partitions. For example, New York City has released various spatial data sets partitioned into boroughs, community districts, zip codes, police precincts and so on.

We propose a probabilistic model for refining coarse-grained target data through the effective utilization of auxiliary data sets with various granularities. An important characteristic is discerning the usefulness of each auxiliary data set which depends on not only the strength of relationship with the target data but also the level of spatial granularity. For example, consider the case of two auxiliary data sets that have the same strength of relationship with the target data, but different spatial granularities. In that case, the finer-grained one is seen as more helpful for refining the coarse-grained target data.

With the proposed model, the fine-grained target data are assumed to follow a Gaussian process (GP) [Rasmussen and Williams2006] whose mean function is modeled by a linear regression of the auxiliary data sets. This GP-based modeling allows us to consider the spatial correlation in the target data and the auxiliary data sets simultaneously. Since the target data are observed not at fine granularity but at coarse granularity, we model a spatial aggregation process to transform the fine-grained target data into the coarse-grained target data. Furthermore, to handle auxiliary data sets with various granularities, we apply GP regression to each auxiliary data set to derive a predictive distribution defined on the continuous space; this conceptually corresponds to spatial interpolation. A key idea is that the model hierarchically incorporates the predictive distributions themselves; that is, it does not use point estimates. This enables us to consider uncertainty in the prediction of auxiliary data sets. The uncertainty is governed by several factors, one of which is sample density, i.e., spatial granularity of the auxiliary data; the finer the granularity, the lower the uncertainty. Incorporating the uncertainty leads to effectively learning the usefulness of the auxiliary data with consideration of the levels of spatial granularity; this allows our model to accurately refine the coarse-grained target data. We predict the fine-grained target data via a Bayesian inference procedure. The proposed model is designed such that the estimation of model parameters based on the exact marginal likelihood is possible: by analytically integrating out the variables of fine-grained target and auxiliary data, we can estimate the parameters without explicitly obtaining these variables. We construct the predictive distribution of the fine-grained target data by using the estimated parameters.

2 Related Work

The problem of refining coarse-grained spatial data has been studied in various fields such as socio-economics [Smith and Capra2016, Smith, Mashhadi, and Capra2014], agricultural economics [Howitt and Reynaud2003, Xavier et al.2016], epidemiology [Sturrock et al.2014], meteorology [Wilby et al.2004, Zorita and von Storch1999] and geographical information systems (GIS) [Boucher and Kyriakidis2006, Goovaerts2010]. This problem is also called statistical downscaling, spatial disaggregation, or areal interpolation. The previous works can be categorized into two cases in terms of target data availability.

In the first case, in which a large amount of coarse- and fine-grained target data are available, we can predict the fine-grained target data by using a mapping function from coarse- to fine-grained data. The mapping function can be learnt by using various machine learning methods including linear regression models [Hessami et al.2008], neural networks [Cannon2011, Misra, Sarkar, and Mitra2017] and support vector machines [Ghosh2010]. Recently, super-resolution techniques based on deep neural networks have been applied for refining coarse-grained spatial data [Vandal et al.2017, Vandal et al.2018]. The super-resolution techniques aim to learn a mapping function from low- to high-resolution images [Dong et al.2014]. The method by [Vandal et al.2017] is based on the analogy between gridded spatial data and images; values at grid cells are regarded as values at pixels. The large amount of fine-grained data needed for training is, however, not available in many cases (e.g., poverty survey), and often only coarse-grained data are available. These methods are not applicable in such situations.

In the second case, in which only coarse-grained target data are available, many regression-based methods have been proposed that use auxiliary spatial data sets to refine coarse-grained target data [Flaxman, Wang, and Smola2015, Smith, Mashhadi, and Capra2014, Wang et al.2016, Zheng, Liu, and Hsieh2013, Zheng et al.2015]. Regression models (linear and non-linear) are used for estimating the relationships between target data and auxiliary data sets. A few methods can construct the regression models under the spatial aggregation constraints [Murakami and Tsutsumi2011, Park2013]. The constraints state that a value associated with a coarse-grained region is a linear average of its constituent values in a fine-grained partition. In order to satisfy the spatial aggregation constraints, the regression residuals at the coarse-grained regions are allocated to the fine-grained regions by using a spatial interpolation method, i.e., kriging [Stein1999]. These methods, however, assume that the auxiliary data sets have spatial granularities equivalent to that of the fine-grained target data to be estimated. This assumption makes it difficult to utilize multiple auxiliary data sets with various granularities.

Several regression methods have been developed for estimating relationships between multi-scale spatial data sets [Miller et al.2015, Diodato et al.2010, Xu2017, Xu et al.2018]. These methods predict the target data with the same granularity as that of the training data by utilizing multi-scale auxiliary data sets. They do not, however, consider the spatial aggregation constraint, which is a critical factor in predicting the fine-scale target data from the coarse-scale target data.

There have been several hierarchical Bayesian models to predict fine-grained target data using fine-grained auxiliary data sets [Taylor, Andrade-Pacheco, and Sturrock2018, Wilson and Wakefield2017, Keil et al.2013]. Although they introduce a fully Bayesian treatment for model parameters, the uncertainty in the prediction of auxiliary data sets is ignored: They cannot discern the usefulness of each auxiliary data set considering their levels of spatial granularity.

Different from prior works, the proposed model can effectively make use of auxiliary data sets with various granularities by hierarchically incorporating Gaussian processes. This hierarchical modeling allows us to effectively learn the usefulness of each auxiliary data set considering the levels of spatial granularity. Our model also considers the spatial aggregation constraints by integrating the Gaussian processes with a spatial aggregation process to transform the fine-grained target data into the coarse-grained target data.

3 Problem Formulation

Symbol Description
set of indices of auxiliary spatial data sets
index of an auxiliary spatial data set
total region of a city
location point represented by latitude and longitude coordinates
coarse-grained partition of the total region, for the target data
region in the coarse-grained partition of the target data
fine-grained partition of the total region, for the target data
region in the fine-grained partition of the target data
partition of the total region for each auxiliary data set
region in the partition of an auxiliary data set
value associated with a region in the coarse-grained target data
value associated with a region in the fine-grained target data
value associated with a region in an auxiliary data set
Table 1: Notation.

In this section, we describe the spatial data this study focuses on, and define our problem of refining coarse-grained spatial data by using auxiliary spatial data sets with various granularities for the same region. Assume that we have a target spatial data set with coarse granularity, and we would like to obtain a fine-grained version, given a collection of auxiliary data sets. The notation used in this paper is listed in Table 1.

Partition: We consider the total region of a city, in which each location point is represented by its coordinates (e.g., latitude and longitude). A partition of the total region is a collection of disjoint subsets, called regions, whose union is equal to the total region. The number of regions may differ from partition to partition. We consider several partitions of the same total region: the coarse-grained partition, i.e., that of the coarse-grained target data; the fine-grained partition, i.e., that of the desired fine-grained target data; and, for each auxiliary data set, the partition on which that data set is defined.

Spatial data: The coarse-grained target data form a vector with one entry per region of the coarse-grained partition, each entry being the value associated with that region. Likewise, each auxiliary data set forms a vector with one entry per region of its own partition, each entry being the value associated with that region.

Problem: Given coarse-grained target data on the coarse-grained partition, auxiliary data sets on their respective partitions, and the desired fine-grained partition, we wish to estimate the vector of fine-grained target values, with one entry per region of the fine-grained partition. Here, the target and auxiliary values are all assumed to be intensive quantities such as ratios; that is, they are independent of the area scale of the respective regions. When the values are extensive quantities such as population, they can be transformed into intensive quantities by dividing them by the areas of the regions.
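The conversion from extensive to intensive quantities mentioned above can be sketched in a few lines. This is only an illustration: the function name `to_intensive` and the numbers are hypothetical and do not come from the data sets used in the paper.

```python
def to_intensive(extensive_values, areas):
    """Divide each extensive value (e.g. a population count) by its region's area."""
    if len(extensive_values) != len(areas):
        raise ValueError("one area is needed per region")
    if any(a <= 0 for a in areas):
        raise ValueError("areas must be positive")
    return [v / a for v, a in zip(extensive_values, areas)]


# Hypothetical example: population counts and areas (km^2) of three regions.
population = [12000.0, 3000.0, 4500.0]
area_km2 = [4.0, 1.5, 3.0]
density = to_intensive(population, area_km2)  # population per km^2
```

After this step the values no longer scale with region size, so regions from partitions of different granularity become comparable.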

4 Proposed Model

Figure 2: Generative process of coarse-grained target data given three auxiliary data sets.

We propose a probabilistic model that allows auxiliary spatial data sets with various granularities to be used in refining coarse-grained spatial data. Our model is based on the Gaussian process (GP) [Rasmussen and Williams2006], which is a flexible non-parametric model for non-linear functions in a continuous domain. We model the generative process for the coarse-grained target data, given the auxiliary data sets with their known partitions, the coarse-grained partition, and the fine-grained partition. In other words, we model the conditional probability of the coarse-grained target data given the auxiliary data sets, instead of the joint probability of the two. This enables us to adopt the two-step inference approach described in Section 5, which reduces the computational cost of learning the model parameters.

The generative process (given three auxiliary data sets) is illustrated schematically in Figure 2, where darker hues represent regions with higher values. This process contains the following three steps: (a) Deriving the predictive distribution over continuous space for each auxiliary data set via GP regression, which corresponds to spatial interpolation; (b) generating the fine-grained target data via a GP whose mean function is modeled as the linear regression of the continuous predictive distributions of the auxiliary data sets; (c) generating the coarse-grained target data by spatially aggregating the constituent values in a fine-grained partition.

In our problem, each value is associated with a region in a partition rather than a single location point in the total region; this prevents us from directly applying a GP. We thus associate each region in a partition with its centroid, and regard each value as being associated with the centroid of that region. This assumption, while significantly simplifying the computations involved, might worsen the fit of the GP to the data set; this is, however, taken into account in the following steps as increased uncertainty of the GPs for both the respective auxiliary data sets (described in (5)) and the target data (described in (6)). For each auxiliary data set, we collect the centroids of the regions in its partition. Similarly, for the fine-grained partition, we collect the centroids of its regions. Thus, our problem is now reformulated as estimating the target values at the centroids of the fine-grained regions, as indicated by the auxiliary spatial data sets.
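The text does not say how the centroids are computed. One common choice, assumed here purely for illustration, is the area centroid of a polygonal region in planar coordinates, obtained with the shoelace formula. The function name `polygon_centroid` and the rectangle below are hypothetical.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon given as [(x, y), ...] in boundary order.

    Assumes planar coordinates; for latitude/longitude over a city-sized area
    this is only an approximation.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # shoelace term for this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0.0:
        raise ValueError("degenerate polygon with zero area")
    # The centroid uses 1 / (6 * area) = 1 / (3 * twice_area).
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


# Hypothetical rectangular region, 4 units wide and 2 units tall.
region = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
centroid = polygon_centroid(region)
```

For strongly elongated or non-convex regions the centroid can be a poor representative point, which is the limitation the paragraph above alludes to.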

(a) Deriving predictive distributions of auxiliary spatial data sets: In order to handle auxiliary spatial data sets with various granularities, we use GP regression to derive a posterior Gaussian process for a latent continuous random function on ; this conceptually corresponds to spatial interpolation of each auxiliary spatial data set. We then evaluate the predictive distribution on the basis of the posterior Gaussian process. Let be a noise-free latent function for the th auxiliary data set at location . We assume that follows a Gaussian process, , with mean zero and a covariance function . Though our model does not depend on any particular choice of the covariance function, for simplicity we consider the well-known covariance function, i.e., squared-exponential kernel, which is widely used for measuring the similarity between function values in spatial coordinates [Rasmussen and Williams2006]. The squared-exponential kernel is defined as

(1)

where is the scale parameter,

is a signal variance that controls the magnitude of the covariance, and

is the Euclidean norm. We assume that the th auxiliary data is generated with an additive Gaussian noise with noise variance . If represents the prediction of the th auxiliary data set for the centroids of the fine-grained partition, the predictive distribution of is as follows:

(2)

where is the predictive means, and is the covariance matrix, whose diagonal elements represent the uncertainties in the prediction at the test points . Incorporation of the predictive distributions (2) is expected to allow the usefulness of auxiliary data to be effectively learnt as it allows consideration of the uncertainty in the prediction. Details are given in (7) in Section 5. Here, is a covariance matrix whose entries are covariances between training points . is a covariance matrix whose entries are covariances between training points and test points . is a covariance matrix whose entries are covariances between test points .
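Step (a) is ordinary GP regression, so it can be sketched with the textbook formulas from [Rasmussen and Williams2006]: the predictive mean is the cross-covariance times the inverse of the noisy training covariance times the observations, and the predictive covariance is the test covariance minus the corresponding quadratic form. The code below is a minimal pure-Python sketch under stated assumptions: the kernel uses the common parameterization with a factor of 2 in the exponent, all function names, parameter names and data are hypothetical, and a small Gauss-Jordan solver stands in for the Cholesky factorization a real implementation would use.

```python
import math


def sq_exp_kernel(x, x_prime, scale, signal_var):
    """Squared-exponential kernel (assumed form: signal_var * exp(-d^2 / (2 scale^2)))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return signal_var * math.exp(-d2 / (2.0 * scale ** 2))


def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gp_predict(train_x, train_y, test_x, scale, signal_var, noise_var):
    """Predictive mean and covariance of the latent function at test_x."""
    n, t = len(train_x), len(test_x)
    # Training covariance with observation noise on the diagonal.
    K = [[sq_exp_kernel(train_x[i], train_x[j], scale, signal_var)
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Cross-covariance between training points (rows) and test points (columns).
    K_star = [[sq_exp_kernel(train_x[i], test_x[j], scale, signal_var)
               for j in range(t)] for i in range(n)]
    # One solve gives K^{-1} y (column 0) and K^{-1} K_star (columns 1..t).
    Z = solve(K, [[train_y[i]] + K_star[i] for i in range(n)])
    mean = [sum(K_star[i][j] * Z[i][0] for i in range(n)) for j in range(t)]
    cov = [[sq_exp_kernel(test_x[j], test_x[k], scale, signal_var)
            - sum(K_star[i][j] * Z[i][1 + k] for i in range(n))
            for k in range(t)] for j in range(t)]
    return mean, cov


# Hypothetical auxiliary data observed at two region centroids.
train_x = [(0.0, 0.0), (1.0, 0.0)]
train_y = [1.0, -1.0]
# One test centroid on top of a training centroid, one far from both.
test_x = [(0.0, 0.0), (5.0, 5.0)]
mean, cov = gp_predict(train_x, train_y, test_x,
                       scale=1.0, signal_var=1.0, noise_var=0.01)
```

In this toy setting the predictive variance `cov[0][0]` at the test point near the data is much smaller than `cov[1][1]` at the distant point, which illustrates why an auxiliary data set with a finer partition (denser centroids) yields lower predictive uncertainty at the fine-grained centroids.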

(b) Generative process of fine-grained target data: We model a generative process for the fine-grained target data . Let be a noise-free latent function for the fine-grained target data at location . We assume that follows a Gaussian process, , with mean function , where and are the regression coefficient of the th auxiliary data set and the bias parameter, respectively. The covariance function is a squared-exponential kernel with the scale parameter and signal variance . Given the predictive values for the auxiliary data sets from (2), the conditional distribution of at the centroids is given by

(3)

where and is a covariance matrix defined by . Here, we let be the number of auxiliary data sets. We define the augmented matrix as the matrix , in which is a column vector of 1’s. This GP-based modeling enables us to consider the spatial correlation in the target data and the auxiliary data sets simultaneously.
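The mean function in step (b) is a linear regression on the auxiliary predictions plus a bias, which the augmented matrix with a column of ones expresses as a single matrix-vector product. A minimal sketch, with hypothetical names and numbers:

```python
def regression_mean(aux_predictions, coefficients, bias):
    """Mean of the fine-grained target at each centroid.

    aux_predictions[s][j] is the predicted value of auxiliary data set s
    at fine-grained centroid j.
    """
    n_regions = len(aux_predictions[0])
    # Augmented matrix: one column per auxiliary data set, plus a column of ones.
    design = [[aux[j] for aux in aux_predictions] + [1.0] for j in range(n_regions)]
    weights = list(coefficients) + [bias]
    return [sum(d * w for d, w in zip(row, weights)) for row in design]


# Two hypothetical auxiliary data sets evaluated at three fine-grained centroids.
aux_predictions = [[1.0, 2.0, 3.0],
                   [0.0, 1.0, 0.0]]
target_mean = regression_mean(aux_predictions, coefficients=[0.5, -1.0], bias=2.0)
```

Appending the bias to the coefficient vector keeps the regression in a single linear form, which is convenient when the coefficients are later estimated from the marginal likelihood.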

(c) Generative process of coarse-grained target data: We design a spatial aggregation process to transform the fine-grained target data into the coarse-grained target data, in order to encourage consistency between the fine-grained target data, which are to be estimated, and the available coarse-grained target data. In the spatial aggregation process, a value associated with one region in the coarse-grained partition is obtained by aggregating the values in the fine-grained regions contained in the coarse-grained region (see the upper part of Figure 2). Then, the coarse-grained target data are generated from the following conditional distribution given the fine-grained target data,

(4)

where the distribution involves the noise variance for the coarse-grained target data and an aggregation matrix with one row per coarse-grained region and one column per fine-grained region, whose entries are nonnegative weighting coefficients; each row of the matrix should sum to 1. We set the coefficients in accordance with the property of the target data. For example, when the target data are incidences of disease, the entry for a given coarse-grained region and fine-grained region would be proportional to the population in the intersection of the two regions. In the following, for simplicity, we consider a simple aggregation matrix in which the entry is the reciprocal of the number of fine-grained regions contained in the coarse-grained region if the fine-grained region is contained in that coarse-grained region, and zero otherwise.
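The simple aggregation matrix can be built directly from a map that says which coarse-grained region contains each fine-grained region. The sketch below assumes every fine-grained region lies entirely inside one coarse-grained region; the names `aggregation_matrix` and `aggregate` and the example values are hypothetical.

```python
def aggregation_matrix(fine_to_coarse, n_coarse):
    """Row i averages the fine-grained regions contained in coarse region i.

    fine_to_coarse[j] is the index of the coarse region containing fine region j.
    """
    n_fine = len(fine_to_coarse)
    counts = [0] * n_coarse
    for i in fine_to_coarse:
        counts[i] += 1
    if 0 in counts:
        raise ValueError("every coarse region must contain a fine region")
    H = [[0.0] * n_fine for _ in range(n_coarse)]
    for j, i in enumerate(fine_to_coarse):
        H[i][j] = 1.0 / counts[i]  # equal weights, so each row sums to 1
    return H


def aggregate(H, fine_values):
    """Noise-free coarse-grained values implied by the fine-grained values."""
    return [sum(h * v for h, v in zip(row, fine_values)) for row in H]


# Six hypothetical fine regions: two in coarse region 0, four in coarse region 1.
fine_to_coarse = [0, 0, 1, 1, 1, 1]
H = aggregation_matrix(fine_to_coarse, n_coarse=2)
coarse_values = aggregate(H, [0.25, 0.75, 0.5, 0.5, 1.0, 0.0])
```

Population-weighted aggregation, as in the disease-incidence example, would replace the equal weights by normalized population shares while keeping the row sums at 1.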

5 Inference

Given the coarse-grained target data, the auxiliary spatial data sets with their centroids, the centroids of the fine-grained partition and the aggregation matrix, we aim to predict the fine-grained target data via a Bayesian inference procedure. In order to calculate the predictive distribution of the fine-grained target data, we need to estimate the model parameters. The problem of estimating the model parameters can be divided into two steps: 1) estimate the hyperparameters for each auxiliary data set and 2) estimate the regression coefficients and the hyperparameters for the target data. Although one could also estimate all the model parameters simultaneously (i.e., one-step inference), doing so would drastically increase the computational cost of inference; we therefore adopt the efficient two-step inference described in the following paragraphs. We finally construct the predictive distribution of the fine-grained target data by using the estimated parameters. Details of the inference procedure are shown in Algorithm 1.

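Input : coarse-grained target data, auxiliary data sets with their centroids, centroids of the fine-grained partition, aggregation matrix
Output : Predictive distribution of the fine-grained target data
1:  Initialize the model parameters
2:  /* first inference step */
3:  for each auxiliary data set do
4:   Estimate its hyperparameters by maximizing the logarithm of (5)
5:  end for
6:  /* second inference step */
7:  Estimate the regression coefficients and the target hyperparameters by maximizing the logarithm of (6)
8:  Construct the predictive distribution of the fine-grained target data by (8) using the estimated model parameters
Algorithm 1 Bayesian inference procedure of the fine-grained target data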

The first inference step: Given the th auxiliary spatial data set with centroids , the marginal likelihood of is given by

(5)

The hyperparameters are estimated by maximizing the logarithm of (5). We solve the optimization problem through the use of the BFGS method [Liu and Nocedal1989]. By solving the optimization problem for each auxiliary data set independently, we obtain the set of the estimated hyperparameters for all auxiliary data sets. The predictive distribution of corresponding to (2) is obtained using the estimated hyperparameters.
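For reference, the objective of the first step has the standard form of a zero-mean GP with Gaussian observation noise. The symbols below are assumptions chosen for illustration: $\mathbf{a}$ is the vector of $n$ observed values of one auxiliary data set, $\mathbf{K}$ is its kernel matrix evaluated at the region centroids, and $\sigma^2$ is its noise variance.

```latex
\log p(\mathbf{a})
  = -\frac{1}{2}\,\mathbf{a}^{\top}\left(\mathbf{K} + \sigma^{2}\mathbf{I}\right)^{-1}\mathbf{a}
    -\frac{1}{2}\log\left|\mathbf{K} + \sigma^{2}\mathbf{I}\right|
    -\frac{n}{2}\log 2\pi
```

The first term rewards fit to the data and the second penalizes model complexity, so maximizing this quantity over the scale parameter, signal variance and noise variance balances the two.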

The second inference step: Given the coarse-grained target data and the centroids of fine-grained partition , the marginal likelihood of is given by

(6)

where is a matrix, and we analytically integrate out the latent variables and with the help of the conjugacy of the distributions (2), (3), and (4). is a covariance matrix represented by , where . The -entry of is shown in (7).

(7)

Here, in (7) represents Kronecker delta; if , and otherwise. The residual variance term in (7) represents the residual variance in the regression of . This term contains the uncertainty in the prediction of , i.e., , which is weighted by . The spatial correlation term in  (7) represents the strength of spatial correlation between and . This term contains the covariance between and , i.e., , which is weighted by . On the basis of the marginal likelihood (6) with this covariance matrix , our model can effectively learn the regression coefficient while taking into consideration the prediction uncertainties and the spatial correlations from the auxiliary data sets with various granularities, simultaneously. The parameter and the hyperparameters , , are estimated by maximizing the logarithm of (6). We solve the optimization problem by using the BFGS method [Liu and Nocedal1989]. The derivatives of the logarithm of (6) with respect to , , , are described in Appendix A.
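The way the prediction uncertainties enter the covariance can be illustrated with a short derivation. This is a sketch under assumptions that are not taken from the displayed equations: the symbols are chosen here for illustration, and the predictive distributions of different auxiliary data sets are treated as mutually independent. Write $\tilde{\mathbf{a}}_s \sim \mathcal{N}(\mathbf{m}_s, \boldsymbol{\Sigma}_s)$ for the prediction of the $s$th auxiliary data set at the fine-grained centroids, $w_s$ and $w_0$ for its regression coefficient and the bias, $\mathbf{K}$ for the target kernel matrix, $\mathbf{H}$ for the aggregation matrix, and $\sigma^2$ for the coarse-grained noise variance.

```latex
\begin{align}
\mathbf{f} \mid \{\tilde{\mathbf{a}}_s\}
  &\sim \mathcal{N}\Big(\textstyle\sum_{s} w_s \tilde{\mathbf{a}}_s + w_0 \mathbf{1},\; \mathbf{K}\Big), \\
\mathbf{f}
  &\sim \mathcal{N}\Big(\boldsymbol{\mu},\; \mathbf{C}\Big),
  \qquad \boldsymbol{\mu} = \textstyle\sum_{s} w_s \mathbf{m}_s + w_0 \mathbf{1},
  \qquad \mathbf{C} = \mathbf{K} + \textstyle\sum_{s} w_s^{2}\, \boldsymbol{\Sigma}_s, \\
\mathbf{y}
  &\sim \mathcal{N}\Big(\mathbf{H}\boldsymbol{\mu},\; \mathbf{H}\mathbf{C}\mathbf{H}^{\top} + \sigma^{2}\mathbf{I}\Big).
\end{align}
```

Under these assumptions each predictive covariance is weighted by the squared coefficient, so assigning a large coefficient to a highly uncertain (coarse) auxiliary data set inflates the covariance of the coarse-grained observations. This is one way to read the statement that the model learns the coefficients while taking the prediction uncertainties into account.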

Predictive distribution of fine-grained target data: Using the estimated model parameters, the predictive distribution of the fine-grained target data is given by

(8)

where is the predictive means, and where is the covariance matrix. We can obtain the refinement results, i.e., the estimated fine-grained target data, by using the predictive means . By analyzing the covariance matrix , we can also evaluate the confidence of the refinement results.
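The predictive distribution follows from standard Gaussian conditioning. As a sketch with assumed symbols (not the paper's own notation): let the fine-grained target values have marginal mean $\boldsymbol{\mu}$ and covariance $\mathbf{C}$ once the auxiliary predictions are integrated out, let $\mathbf{H}$ be the aggregation matrix, $\sigma^2$ the coarse-grained noise variance, and $\mathbf{y}$ the observed coarse-grained target data.

```latex
\begin{align}
\mathbf{f} \mid \mathbf{y} &\sim \mathcal{N}\big(\mathbf{f}^{*},\; \mathbf{C}^{*}\big), \\
\mathbf{f}^{*} &= \boldsymbol{\mu}
  + \mathbf{C}\mathbf{H}^{\top}\big(\mathbf{H}\mathbf{C}\mathbf{H}^{\top} + \sigma^{2}\mathbf{I}\big)^{-1}
    \big(\mathbf{y} - \mathbf{H}\boldsymbol{\mu}\big), \\
\mathbf{C}^{*} &= \mathbf{C}
  - \mathbf{C}\mathbf{H}^{\top}\big(\mathbf{H}\mathbf{C}\mathbf{H}^{\top} + \sigma^{2}\mathbf{I}\big)^{-1}
    \mathbf{H}\mathbf{C}.
\end{align}
```

The mean starts from the regression on the auxiliary data and is corrected by the coarse-grained residuals spread over the fine-grained regions; the diagonal of $\mathbf{C}^{*}$ gives the per-region confidence mentioned above.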

6 Experiments

Data description: We evaluated the proposed model using real-world spatial data sets from NYC Open Data (https://opendata.cityofnewyork.us). There are 44 data sets that cover a variety of categories such as social indicators, land use, air quality and taxi traffic. Each data set is associated with one of six geographical partitions, i.e., school district (32), UHF42 (42), community district (59), police precinct (77), zip code (186) and taxi zone (249), where each number in parentheses denotes the number of regions in the corresponding partition. In our experiments, we try to refine the poverty rate data set and the five air pollution data sets (i.e., PM2.5, ozone, formaldehyde, benzene, elemental carbon). The experimental setting is as follows: 1) given the poverty rate data set with the borough partition, we would like to refine the data into the community district partition, and 2) given each air pollution data set with the borough partition, we aim to refine the data into the UHF42 partition. Appendix B details the data sets and the settings.

Baselines: The existing methods can be applied to auxiliary data sets with various granularities if pre-processing, i.e., spatial interpolation, is applied so that the granularities of the auxiliary data sets match that of the fine-grained target data. Accordingly, we first performed spatial interpolation of each auxiliary data set by using GP regression; we then obtained the predictive values at the centroids of the target fine-grained partition so that the spatial granularities of all auxiliary data sets equaled that of the fine-grained target data. We compared the proposed model with three baselines: GP regression (GPR) [Rasmussen and Williams2006], Linear regression-based method (LR-based method) [Smith, Mashhadi, and Capra2014] and Two-stage statistical downscaling method (2-stage SD) [Park2013]. Here, GPR is a simple spatial interpolation; namely, it predicts the fine-grained target data by using only the coarse-grained target data. Details of these baselines are given in Appendix C.

PM2.5 Ozone Formaldehyde Benzene Elemental carbon Poverty rate
 Proposed model
 2-stage SD
 LR-based method
 GPR
Table 2: MAPE and standard errors for the predictions of the fine-grained target data.

Figure 3: Comparison of the predicted fine-grained target data for PM2.5 data set. Panels: (a) True, (b) Proposed model, (c) 2-stage SD, (d) LR-based method.
Figure 4: Comparison of the predicted fine-grained target data for poverty rate data set. Panels: (a) True, (b) Proposed model, (c) 2-stage SD, (d) LR-based method.

Fine-grained target data prediction: We evaluated our model in terms of its performance in predicting the fine-grained target data. The evaluation metric is the mean absolute percentage error (MAPE) in fine-grained target values, i.e., the absolute difference between the true value associated with a region in the target fine-grained partition and its predicted value, divided by the true value and averaged over the regions. Table 2 shows the MAPE and the standard error of the absolute percentage error for the proposed model, 2-stage SD, LR-based method and GPR. For all data sets, our model performed better than the baselines, and the differences between our model and the baselines are statistically significant (Student's t-test). In Table 2, the single star and the double star indicate significant differences at two different significance levels. We found similar results using other evaluation metrics (e.g., MAE, RMSE, RMSPE). These results show that our model made good use of the auxiliary data sets with various granularities to accurately predict the fine-grained target data.
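The evaluation metric can be written down in a few lines. The sketch below computes the MAPE as a fraction (not multiplied by 100) and the standard error of the absolute percentage errors using the sample standard deviation; both conventions are assumptions, as are the function names and the toy values.

```python
import math


def absolute_percentage_errors(true_values, predicted_values):
    """Per-region |true - predicted| / |true|; true values must be non-zero."""
    return [abs((t - p) / t) for t, p in zip(true_values, predicted_values)]


def mape(true_values, predicted_values):
    errors = absolute_percentage_errors(true_values, predicted_values)
    return sum(errors) / len(errors)


def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


# Hypothetical true and predicted values for three fine-grained regions.
true_values = [2.0, 4.0, 5.0]
predicted_values = [1.0, 5.0, 5.0]
score = mape(true_values, predicted_values)
score_se = standard_error(absolute_percentage_errors(true_values, predicted_values))
```

Because each error is divided by the true value, regions with small true values weigh heavily in the average, which is one reason to cross-check with metrics such as MAE and RMSE.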

Figures 3 and 4 visualize the predicted fine-grained target data for the PM2.5 data set and for the poverty rate data set, respectively. We illustrate the true fine-grained data on the left in Figures 3 and 4, and the predictions made by the proposed model, 2-stage SD and LR-based method on the right. Here, the predictive values of each method were normalized to a common range, and darker hues represent regions with higher values. As shown in these figures, our model refined the coarse-grained data more precisely than the other methods. In particular, in both data sets, our model achieved a marked improvement in the north part of the map (i.e., Manhattan). Such visualization results are useful for finding key regions, e.g., the poorest regions of a city.

Rank  Proposed model: auxiliary data   Coef.    2-stage SD: auxiliary data   Coef.
1.    Fire incident (Zip code)         0.173    1-2 fam. bldg (Comm.)        -0.088
2.    Taxi dropoff (Taxi zone)         0.139    Hospital (Comm.)             0.069
3.    311 call (Zip code)              0.135    Public school (Comm.)        0.069
4.    Public telephone (Zip code)      0.114    Lots of vacant (Comm.)       -0.067
5.    Natural gas (Zip code)           0.109    Crime (Police precinct)      0.064
Table 3: Top-5 relevant auxiliary data as estimated by our model and 2-stage SD for PM2.5 data set.
Figure 5: Top-2 auxiliary data sets ranked by the proposed model for PM2.5 data set. Panels: (a) Fire incidents, (b) Taxi dropoff.
Figure 6: Top-2 auxiliary data sets ranked by the 2-stage SD for PM2.5 data set. Panels: (a) 1-2 fam. bldg, (b) Hospital.

Evaluation of auxiliary spatial data sets: Table 3 shows the top five relevant auxiliary data sets as determined by our model and 2-stage SD for the PM2.5 data set. These auxiliary data sets are arranged in descending order of the absolute values of the estimated regression coefficients, each of which is listed next to the corresponding data set in Table 3. By comparing the sorted list of the auxiliary data sets created by the proposed model with that yielded by 2-stage SD, we can confirm that the proposed model assigned relatively large regression coefficients to the auxiliary data sets with finer-grained partitions (i.e., Zip code and Taxi zone).
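The ranking rule is simply a sort by absolute coefficient value, so negative coefficients can rank above positive ones. The snippet below reproduces the 2-stage SD column of Table 3 from its five listed coefficients, entered in scrambled order; the variable names are illustrative.

```python
# Coefficients of the 2-stage SD column of Table 3, entered out of order.
coefficients = {
    "Crime (Police precinct)": 0.064,
    "1-2 fam. bldg (Comm.)": -0.088,
    "Lots of vacant (Comm.)": -0.067,
    "Hospital (Comm.)": 0.069,
    "Public school (Comm.)": 0.069,
}

# Sort by absolute value, largest first; ties keep their input order.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
ranked_names = [name for name, _ in ranked]
```

The sign of a coefficient indicates the direction of the relationship with the target, while the magnitude is what the ranking treats as relevance.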

Figures 5 and 6 visualize the top two relevant auxiliary data sets as estimated by our model and 2-stage SD for the PM2.5 data set, respectively. Comparing these visualizations with that of the true target data in Figure 3(a) shows that our model emphasized the most useful auxiliary data sets, i.e., those that are both strongly related with the target data and have fine granularities; 2-stage SD evaluated the usefulness of auxiliary data sets only in terms of the strength of relationships with the target data.

Figure 7 shows the relation between the regression coefficient and the uncertainty in the prediction of auxiliary data sets estimated by the proposed model for the PM2.5 data set. In this figure, each auxiliary data set is depicted by a dot whose color indicates its partition. The horizontal axis shows the averages of the variances in the predicted values of each auxiliary data set; for each auxiliary data set, the average of variances was calculated as the mean of the diagonal elements of its predictive covariance matrix from (2), which is the degree of uncertainty in predicting that auxiliary data set. The vertical axis shows the absolute values of the estimated coefficients. As shown, the absolute coefficient values estimated by our model tended to be higher for the auxiliary data sets that had lower degrees of uncertainty. These results indicate that our model can effectively learn the usefulness of each auxiliary data set by considering the uncertainty in the prediction of auxiliary data sets. Consequently, the proposed model can precisely refine the coarse-grained target data by effectively utilizing auxiliary data sets with various granularities.
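The horizontal-axis quantity in Figure 7 is the average of the predictive variances, i.e. the mean of the diagonal of a predictive covariance matrix. A minimal sketch with two hypothetical 2-by-2 covariance matrices (the numbers are made up for illustration):

```python
def mean_predictive_variance(cov):
    """Average of the diagonal entries of a square covariance matrix."""
    n = len(cov)
    if any(len(row) != n for row in cov):
        raise ValueError("covariance matrix must be square")
    return sum(cov[j][j] for j in range(n)) / n


# Hypothetical predictive covariances at two fine-grained centroids.
coarse_aux_cov = [[0.8, 0.1],
                  [0.1, 0.6]]   # auxiliary data on a coarse partition
fine_aux_cov = [[0.2, 0.05],
                [0.05, 0.1]]    # auxiliary data on a fine partition
coarse_uncertainty = mean_predictive_variance(coarse_aux_cov)
fine_uncertainty = mean_predictive_variance(fine_aux_cov)
```

Only the diagonal is used, so this summary ignores correlations between centroids; it is a scalar measure of how well an auxiliary data set is resolved at the target granularity.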

Figure 7: Relation between the coefficients and the uncertainties for PM2.5.

7 Conclusion

This paper has proposed a probabilistic model for refining coarse-grained spatial data by utilizing auxiliary spatial data sets with various granularities on the same region. Our model can effectively make use of auxiliary data sets with various granularities by hierarchically incorporating Gaussian processes. Our model also has the advantage of allowing the inference of model parameters based on the exact marginal likelihood, in which the variables of the fine-grained target and auxiliary data are analytically integrated out. Using multiple real-world spatial data sets from New York City, we confirmed that our model predicts the fine-grained target data more accurately than the baselines.

Future work includes accounting for the shapes of regions, as in a previous study [Rathbun1998]. Representing each region by its centroid allows for GP-based formulations and significantly simplifies the computations involved; however, it might worsen the fit of the GP to irregularly shaped regions (e.g., extremely elongated ones). Another direction is to incorporate a fully Bayesian treatment of the model parameters, which can be expected to provide better results.

References

  • [Barlacchi et al.2015] Barlacchi, G.; Nadai, M. D.; Larcher, R.; Casella, A.; Chitic, C.; Torrisi, G.; et al. 2015. A multi-source dataset of urban life in the city of Milan and the province of Trentino. Scientific Data 2.
  • [Bogomolov et al.2014] Bogomolov, A.; Lepri, B.; Staiano, J.; Oliver, N.; Pianesi, F.; and Pentland, A. 2014. Once upon a crime: Towards crime prediction from demographics and mobile data. In ICMI, 427–434. ACM.
  • [Boucher and Kyriakidis2006] Boucher, A., and Kyriakidis, P. C. 2006. Super-resolution land cover mapping with indicator geostatistics. Remote Sensing of Environment 104:264–282.
  • [Cannon2011] Cannon, A. J. 2011. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences 37(9):1277–1284.
  • [Diodato et al.2010] Diodato, N.; Bellocchi, G.; Bertolin, C.; and Camuffo, D. 2010. Multiscale regression model to infer historical temperatures in a central mediterranean sub-regional area. Climate of the Past Discussions 6:2625–2649.
  • [Dong et al.2014] Dong, C.; Loy, C. C.; He, K.; and Tang, X. 2014. Learning a deep convolutional network for image super-resolution. In ECCV, 184–199. Springer.
  • [Flaxman, Wang, and Smola2015] Flaxman, S. R.; Wang, Y. X.; and Smola, A. J. 2015. Who supported Obama in 2012?: Ecological inference through distribution regression. In KDD, 289–298. ACM.
  • [Ghosh2010] Ghosh, S. 2010. SVM-PGSL coupled approach for statistical downscaling to predict rainfall from GCM output. Journal of Geophysical Research: Atmospheres 115(D22).
  • [Goldstein and Dyson2013] Goldstein, B., and Dyson, L. 2013. Beyond transparency: Open data and the future of civic innovation.
  • [Goovaerts2010] Goovaerts, P. 2010. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences 42(5):535–554.
  • [Hessami et al.2008] Hessami, M.; Gachon, P.; Ouarda, T. B.; and St-Hilaire, A. 2008. Automated regression-based statistical downscaling tool. Environmental Modelling & Software 23(6):813–834.
  • [Howitt and Reynaud2003] Howitt, R., and Reynaud, A. 2003. Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics 30(2):359–387.
  • [Jerrett et al.2013] Jerrett, M.; Burnett, R. T.; Beckerman, B. S.; Turner, M. C.; Krewski, D.; Thurston, G.; et al. 2013. Spatial analysis of air pollution and mortality in California. American Journal of Respiratory and Critical Care Medicine 188(5):593–599.
  • [Keil et al.2013] Keil, P.; Belmaker, J.; Wilson, A. M.; Unitt, P.; and Jetz, W. 2013. Downscaling of species distribution models: a hierarchical approach. Methods in Ecology and Evolution 4(1):82–94.
  • [Kyriakidis2004] Kyriakidis, P. C. 2004. A geostatistical framework for area-to-point spatial interpolation. Geographical Analysis 36(3):259–289.
  • [Liu and Nocedal1989] Liu, D. C., and Nocedal, J. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1–3):503–528.
  • [Miller et al.2015] Miller, B. A.; Koszinski, S.; Wehrhan, M.; and Sommer, M. 2015. Impact of multi-scale predictor selection for modeling soil properties. Geoderma 239–240:97–106.
  • [Misra, Sarkar, and Mitra2017] Misra, S.; Sarkar, S.; and Mitra, P. 2017. Statistical downscaling of precipitation using long short-term memory recurrent neural networks. Theoretical and Applied Climatology.
  • [Murakami and Tsutsumi2011] Murakami, D., and Tsutsumi, M. 2011. A new areal interpolation technique based on spatial econometrics. Procedia-Social and Behavioral Sciences 21:230–239.
  • [Park2013] Park, N. W. 2013. Spatial downscaling of TRMM precipitation using geostatistics and fine scale environmental variables. Advances in Meteorology 2013.
  • [Rasmussen and Williams2006] Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian processes for machine learning.
  • [Rathbun1998] Rathbun, S. L. 1998. Spatial modelling in irregularly shaped regions: Kriging estuaries. Environmetrics 9:109–129.
  • [Rupasinghaa and Goetz2007] Rupasinghaa, A., and Goetz, S. J. 2007. Social and political forces as determinants of poverty: A spatial analysis. The Journal of Socio-Economics 36(4):650–671.
  • [Shadbolt et al.2012] Shadbolt, N.; O’Hara, K.; Berners-Lee, T.; Gibbins, N.; Glaser, H.; Hall, W.; and Schraefel, M. C. 2012. Linked open government data: Lessons from data.gov.uk. IEEE Intelligent Systems 27(3):16–24.
  • [Smith and Capra2016] Smith, C. C., and Capra, L. 2016. Beyond the baseline: Establishing the value in mobile phone based poverty estimates. In WWW, 425–434. ACM.
  • [Smith, Mashhadi, and Capra2014] Smith, C. C.; Mashhadi, A.; and Capra, L. 2014. Poverty on the cheap: Estimating poverty maps using aggregated mobile communication networks. In CHI, 511–520. ACM.
  • [Stein1999] Stein, M. L. 1999. Interpolation of spatial data: Some theory for kriging.
  • [Sturrock et al.2014] Sturrock, H. J. W.; Cohen, J. M.; Keil, P.; Tatem, A. J.; Menach, A. L.; Ntshalintshali, N. E.; Hsiang, M. S.; and Gosling, R. D. 2014. Fine-scale malaria risk mapping from routine aggregated case data. Malaria Journal 13:421.
  • [Taylor, Andrade-Pacheco, and Sturrock2018] Taylor, B. M.; Andrade-Pacheco, R.; and Sturrock, H. J. W. 2018. Continuous inference for aggregated point process data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 12347.
  • [Vandal et al.2017] Vandal, T.; Kodra, E.; Ganguly, S.; Michaelis, A.; Nemani, R.; and Ganguly, A. R. 2017. DeepSD: Generating high resolution climate change projections through single image super-resolution. In KDD, 1663–1672. ACM.
  • [Vandal et al.2018] Vandal, T.; Kodra, E.; Ganguly, S.; Michaelis, A.; Nemani, R.; and Ganguly, A. R. 2018. Generating high resolution climate change projections through single image super-resolution: An abridged version. In IJCAI, 5389–5393.
  • [Wang et al.2016] Wang, H.; Kifer, D.; Graif, C.; and Li, Z. 2016. Crime rate inference with big data. In KDD, 635–644. ACM.
  • [Wilby et al.2004] Wilby, R. L.; Charles, S. P.; Zorita, E.; Timbal, B.; Whetton, P.; and Mearns, L. O. 2004. Guidelines for Use of Climate Scenarios Developed from Statistical Downscaling Methods.
  • [Wilson and Wakefield2017] Wilson, K., and Wakefield, J. 2017. Pointless continuous spatial surface reconstruction. [online]. Available: https://arxiv.org/abs/1709.09659.
  • [Wotling et al.2000] Wotling, G.; Bouvier, C.; Danloux, J.; and Fritsch, J. M. 2000. Regionalization of extreme precipitation distribution using the principal components of the topographical environment. Journal of Hydrology 233(1-4):86–101.
  • [Xavier et al.2016] Xavier, A.; Freitas, M. B. C.; Rosário, M. D. S.; and Fragoso, R. 2016. Disaggregating statistical data at the field level: An entropy approach. Spatial Statistics 23:91–103.
  • [Xu et al.2018] Xu, J.; Liu, X.; Wilson, T.; Tan, P. N.; Hatami, P.; and Luo, L. 2018. Muscat: Multi-scale spatio-temporal learning with application to climate modeling. In IJCAI, 2912–2918.
  • [Xu2017] Xu, J. 2017. Multi-task learning and its application to geospatio-temporal data. ProQuest Dissertations Publishing.
  • [Yuan, Zheng, and Xie2012] Yuan, J.; Zheng, Y.; and Xie, X. 2012. Discovering regions of different functions in a city using human mobility and POIs. In KDD, 186–194. ACM.
  • [Zheng et al.2015] Zheng, Y.; Yi, X.; Li, M.; Li, R.; Shan, Z.; Chang, E.; and Li, T. 2015. Forecasting fine-grained air quality based on big data. In KDD, 2267–2276. ACM.
  • [Zheng, Liu, and Hsieh2013] Zheng, Y.; Liu, F.; and Hsieh, H. P. 2013. U-air: When urban air quality inference meets big data. In KDD, 1436–1444. ACM.
  • [Zorita and von Storch1999] Zorita, E., and von Storch, H. 1999. The analog method as a simple statistical downscaling technique: Comparison with more complicated methods. Journal of Climate 12:2474–2489.

Appendix A Derivatives of model parameters

The log-marginal likelihood of is given by

(9)

We describe the first derivatives of (9) with respect to , , , , which is required for estimating the parameter based on the BFGS method. The derivative of (9) with respect to is given by

(10)

where and is a matrix of elementwise derivatives. The derivative of the element  (7) is obtained by

(11)

Denoting , the derivative of (9) with respect to is given by

(12)

The matrix of elementwise derivatives is trivial. The derivative of the element  (7) with respect to each hyperparameter is as follows:

(13)
(14)
(15)
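
As a generic illustration of the kind of derivative used in this appendix, the following sketch evaluates the log marginal likelihood of a zero-mean Gaussian, log p(y) = -1/2 y^T C^{-1} y - 1/2 log|C| - (n/2) log 2π, together with its gradient, 1/2 y^T C^{-1} (dC/dθ) C^{-1} y - 1/2 tr(C^{-1} dC/dθ), for each hyperparameter θ. It is not a transcription of (9)-(15): the squared-exponential kernel, the parameter names (length_scale, signal_var, noise_var) and the function names are assumptions made only for this example, and the covariance of the proposed model contains additional terms that are not reproduced here.

```python
import numpy as np

def sq_exp_kernel(X, length_scale, signal_var):
    """Squared-exponential kernel matrix and squared distances (assumed kernel)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2), d2

def log_marginal_and_grad(y, X, length_scale, signal_var, noise_var):
    """Log marginal likelihood of y ~ N(0, C) and its gradient w.r.t. each hyperparameter."""
    n = y.shape[0]
    K, d2 = sq_exp_kernel(X, length_scale, signal_var)
    C = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * (y @ alpha) - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

    C_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    # Matrices of elementwise derivatives of C with respect to each hyperparameter.
    dC = {
        "length_scale": K * d2 / length_scale**3,
        "signal_var": K / signal_var,
        "noise_var": np.eye(n),
    }
    grad = {}
    for name, dC_i in dC.items():
        # d lml / d theta = 0.5 * alpha^T dC alpha - 0.5 * tr(C^{-1} dC)
        grad[name] = 0.5 * (alpha @ dC_i @ alpha) - 0.5 * np.trace(C_inv @ dC_i)
    return lml, grad
```

A gradient of this form can be passed, together with the objective value, to a quasi-Newton optimizer such as BFGS; comparing it against a finite-difference approximation is a simple way to check an implementation.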

Appendix B Description of real-world spatial data sets

We used real-world spatial data sets from NYC Open Data (https://opendata.cityofnewyork.us) to evaluate the proposed model. The data sets were collected and released for improving the urban environment in New York City, and cover a variety of categories such as social indicators, land use, air quality and taxi traffic. Details of the data sets are listed in Table 4. There are multiple data sets in each category, and the total number of data sets is 44. Each data set is associated with one of six geographical partitions, i.e., school district, UHF42, community district, police precinct, zip code and taxi zone. These partitions have various spatial granularities; the number of regions in each partition is shown in Table 4. The data sets are gathered once a year over the time ranges shown in Table 4; the values of the data are divided by the number of observation times. When the values are extensive quantities (i.e., proportional to the scale of areas, e.g., population), they are divided by the areas of the respective regions; the resulting values are intensive quantities (i.e., independent of area scale, e.g., population density).
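
The normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the preprocessing code used in the experiments: the function name normalize_region_values and its arguments are hypothetical, and the input values are assumed to be totals accumulated over the observation times.

```python
import numpy as np

def normalize_region_values(values, n_observations, areas=None):
    """Average region values over observation times.

    If `areas` is given, the values are treated as extensive quantities and
    divided by region area, yielding intensive quantities (densities).
    """
    out = np.asarray(values, dtype=float) / n_observations
    if areas is not None:
        out = out / np.asarray(areas, dtype=float)
    return out
```

For an intensive quantity such as a rate, `areas` would be left as `None` so that only the averaging over observation times is applied.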

In our experiments, we refine the poverty rate data set in the social indicator category and the five air pollution data sets in the air quality category. The poverty rate data set contains the poverty rates associated with each region in the community district partition, as visualized in Figure 1(a). The air pollution data sets contain the average concentrations of pollutants (i.e., PM2.5, ozone, formaldehyde, benzene, elemental carbon) associated with each region in the UHF42 partition. To evaluate the performance in refining coarse-grained data, we used data that were aggregated into a coarser-grained partition, i.e., the borough partition, via spatial averaging; the borough partition has five regions, as illustrated in Figure 1(b). The experimental setting is as follows: 1) given the poverty rate data set with the borough partition, we refine the data into the community district partition, and 2) given each air pollution data set with the borough partition, we refine the data into the UHF42 partition. In the setting for the poverty rate data set, we used all data sets other than the target data as auxiliary data sets, so the number of auxiliary data sets was 43. In the setting for the air pollution data sets, we used all data sets not contained in the air quality category, so the number of auxiliary data sets was 36.
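
The aggregation into the borough partition via spatial averaging can be sketched as below. The sketch assumes that every fine-grained region lies entirely inside exactly one coarse-grained region and that every coarse-grained region contains at least one fine-grained region; the function name aggregate_to_coarse, the index array fine_to_coarse and the optional area weighting are assumptions for illustration, since the exact weighting is not restated here.

```python
import numpy as np

def aggregate_to_coarse(fine_values, fine_to_coarse, n_coarse, fine_areas=None):
    """Average fine-grained values within each coarse-grained region.

    fine_to_coarse[i] is the index of the coarse region containing fine region i.
    If fine_areas is given, an area-weighted average is used (an assumption);
    otherwise all fine regions are weighted equally.
    """
    fine_values = np.asarray(fine_values, dtype=float)
    idx = np.asarray(fine_to_coarse)
    if fine_areas is None:
        weights = np.ones_like(fine_values)
    else:
        weights = np.asarray(fine_areas, dtype=float)
    weighted_sum = np.bincount(idx, weights=weights * fine_values, minlength=n_coarse)
    weight_total = np.bincount(idx, weights=weights, minlength=n_coarse)
    return weighted_sum / weight_total
```

In the evaluation setting, the output of such an aggregation plays the role of the coarse-grained input, while the original fine-grained values are held out as the ground truth.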

Appendix C Baselines description

For GPR, we predict the fine-grained target data based only on the coarse-grained target data. For the LR-based method and 2-stage SD, given the coarse-grained target data and the predictive values of all auxiliary data sets, we predict the fine-grained target data. Details of these baselines are given below.

Gaussian process regression (GPR): We compared our proposed model with a simple spatial interpolation (i.e., GPR) of the coarse-grained spatial data. This baseline assumes that the target data are explained by the spatial correlation alone. Given the coarse-grained target data and the set of centroids of the coarse-grained partition, we predicted the fine-grained target data by using the GP predictive distribution. Note that this baseline does not use the auxiliary spatial data sets.
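
A minimal sketch of such a centroid-based GPR interpolation is given below. The squared-exponential kernel, the constant mean set to the average of the coarse-grained values, the fixed hyperparameter defaults and the function names rbf and gpr_predict are all assumptions for illustration; in practice the hyperparameters would be estimated, for example by maximizing the marginal likelihood.

```python
import numpy as np

def rbf(A, B, length_scale, signal_var):
    """Squared-exponential kernel between two sets of 2-D centroids (assumed kernel)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gpr_predict(coarse_centroids, coarse_values, fine_centroids,
                length_scale=1.0, signal_var=1.0, noise_var=1e-2):
    """GP predictive mean and variance at fine-grained centroids."""
    Xc = np.asarray(coarse_centroids, dtype=float)
    Xf = np.asarray(fine_centroids, dtype=float)
    y = np.asarray(coarse_values, dtype=float)
    mean = y.mean()  # constant prior mean (an assumption)
    C = rbf(Xc, Xc, length_scale, signal_var) + noise_var * np.eye(len(y))
    K_star = rbf(Xf, Xc, length_scale, signal_var)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    pred_mean = mean + K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    pred_var = signal_var - np.sum(v**2, axis=0)  # variance of the latent function
    return pred_mean, pred_var
```

With only five coarse-grained regions as training points, such an interpolation yields a smooth surface and cannot recover fine-scale variation, which is consistent with its role as a baseline that ignores auxiliary data.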

Linear regression-based method (LR-based method): We used a linear regression-based method that has been applied in various studies (e.g., [Bogomolov et al.2014, Smith, Mashhadi, and Capra2014]). The linear regression model is used for estimating the relationships between the coarse-grained target data and the auxiliary data sets. The procedure in the training phase is as follows: 1) aggregate all auxiliary data sets into the coarse-grained partition of the target data via spatial averaging; 2) estimate the regression coefficients of the respective auxiliary data sets by using the coarse-grained target data and the aggregated auxiliary data sets. In the prediction phase, the unknown values for the target fine-grained partition are generated by applying the estimated relationships to the predictive values of the auxiliary data sets, i.e., by combining those values linearly with the estimated regression coefficients as weights.
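
The two phases of this baseline can be sketched as follows, assuming that the auxiliary data sets have already been aggregated into the coarse-grained partition for training and predicted on the fine-grained partition for prediction. The function names fit_lr and predict_lr, the intercept term and the use of ordinary least squares are assumptions for illustration.

```python
import numpy as np

def fit_lr(coarse_target, coarse_aux):
    """Training phase: regress coarse-grained targets on aggregated auxiliary data.

    coarse_aux has shape (n_coarse_regions, n_auxiliary_data_sets).
    Returns the intercept followed by one coefficient per auxiliary data set.
    """
    coarse_aux = np.asarray(coarse_aux, dtype=float)
    X = np.column_stack([np.ones(coarse_aux.shape[0]), coarse_aux])  # assumed intercept
    coef, *_ = np.linalg.lstsq(X, np.asarray(coarse_target, dtype=float), rcond=None)
    return coef

def predict_lr(coef, fine_aux):
    """Prediction phase: apply the coefficients to fine-grained auxiliary values."""
    fine_aux = np.asarray(fine_aux, dtype=float)
    X = np.column_stack([np.ones(fine_aux.shape[0]), fine_aux])
    return X @ coef
```

When the number of auxiliary data sets exceeds the number of coarse-grained regions, the least-squares problem is underdetermined and `np.linalg.lstsq` returns the minimum-norm solution; how the baseline handles that case is not specified here, so this behavior should be read as a property of the sketch only.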

Two-stage statistical downscaling method (2-stage SD): We used the statistical downscaling method proposed in [Park2013]. This method assumes that the coarse-grained target data can be decomposed into linear regression terms and residual terms. The downscaling procedure is divided into two stages. In the first stage, we obtain the regression coefficients in a manner similar to the training phase of the LR-based method. In the second stage, given the estimated coefficients, the fine-grained target data are estimated to be those that satisfy the following relation:

(16)

This relation expresses the spatial aggregation constraint, i.e., the assumption that the value associated with a coarse-grained region is the linear average of the constituent values in the fine-grained partition. It involves the residuals in both the coarse-grained and fine-grained partitions. To obtain the fine-grained target data, the residual values in the fine-grained partition must be determined. Since the linear regression terms have already been fixed in the first stage, the residuals in the coarse-grained partition are obtained from (16); the residuals in the fine-grained partition are then predicted by applying a spatial interpolation method, i.e., area-to-point simple kriging [Kyriakidis2004], to the residuals in the coarse-grained partition.
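
The two-stage structure can be sketched as follows. To keep the example short, the area-to-point simple kriging step is replaced by a much cruder stand-in in which each fine-grained region inherits the residual of its parent coarse-grained region; this substitution, the unweighted averaging, the intercept term and the names average_by_parent and two_stage_sd are assumptions for illustration and do not reproduce the method of [Park2013].

```python
import numpy as np

def average_by_parent(fine, parent, n_coarse):
    """Unweighted average of fine-grained values within each coarse-grained region."""
    fine = np.asarray(fine, dtype=float)
    counts = np.bincount(parent, minlength=n_coarse)
    if fine.ndim == 1:
        return np.bincount(parent, weights=fine, minlength=n_coarse) / counts
    cols = [average_by_parent(fine[:, j], parent, n_coarse) for j in range(fine.shape[1])]
    return np.column_stack(cols)

def two_stage_sd(coarse_target, fine_aux, parent):
    """Regression stage plus residual stage, with a piecewise-constant residual."""
    coarse_target = np.asarray(coarse_target, dtype=float)
    fine_aux = np.asarray(fine_aux, dtype=float)
    parent = np.asarray(parent)
    n_coarse = coarse_target.shape[0]

    # Stage 1: estimate coefficients on the coarse-grained partition.
    coarse_aux = average_by_parent(fine_aux, parent, n_coarse)
    Xc = np.column_stack([np.ones(n_coarse), coarse_aux])
    coef, *_ = np.linalg.lstsq(Xc, coarse_target, rcond=None)

    # Stage 2: coarse residuals, then fine residuals (stand-in for kriging).
    coarse_resid = coarse_target - Xc @ coef
    fine_resid = coarse_resid[parent]
    Xf = np.column_stack([np.ones(fine_aux.shape[0]), fine_aux])
    return Xf @ coef + fine_resid
```

Because both the regression term and the stand-in residual are linear in the averaging, averaging the output over each coarse-grained region reproduces the coarse-grained target exactly, which is the aggregation constraint that the second stage is meant to enforce.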

Category/Name | #data sets | Partition | #regions | Time range | Description
Education | 3 | School district | 32 | 2010 | Class size, ratio of #pupils to #teachers, SAT score
Air quality | 8 | UHF42 | 42 | 2009–2010 | Average concentration of pollutants
Social indicator | 13 | Community district | 59 | 2009–2013 | Poverty rate, population, mean commute time, etc.
Land use | 11 | Community district | 59 | 2009–2013 | Area percentage for commercial office, parking, etc.
Crime | 1 | Police precinct | 77 | 2010–2016 | Number of crimes
Incident | 2 | Zip code | 186 | 2010–2016 | #311 calls, #fire incidents
Telecommunication | 2 | Zip code | 186 | 2016 | #public telephones, #free Wi-Fi hotspots
Consumption | 2 | Zip code | 186 | 2010–2014 | Greenhouse gas (GHG) emission, natural gas consumption
Taxi traffic | 2 | Taxi zone | 249 | 2014–2016 | #taxi pick-up and drop-off events
Table 4: Spatial data sets.