1 Introduction
Datasets are the key elements in big data mining, and their quality has an important impact on the results of big data analysis. From a high-quality dataset, hidden rules can often be mined, and these rules can reveal patterns of interest. At present, big data mining technology is widely used in various fields, such as geographic analysis [10, 16], financial analysis, smart cities, and biotechnology. Such research usually needs a high-quality dataset, but in practice datasets almost always contain noisy data or missing values [9, 14, 17]. To improve data quality, various machine learning algorithms [7, 15] are often required to estimate the missing values and clean the noisy data.
Radial basis function (RBF) interpolation is a popular method for estimating missing values [4, 5, 6]. In large-scale computing, the cost can be minimized by using an adaptive scheduling method [1]. An RBF is a distance-based function that is meshless and dimensionless, so it is inherently suitable for processing multidimensional scattered data. Many scholars have worked on RBF interpolation. Skala [13] used CSRBF to analyze big datasets, Cuomo et al. [2, 3] studied the reconstruction of implicit curves and surfaces by RBF interpolation, and Kedward et al. [8] used multiscale RBF interpolation to study mesh deformation. In RBF interpolation, the shape factor is an important factor affecting accuracy, and several empirical formulas for the optimum shape factor have been proposed.
In this paper, our objective is to estimate missing values in geographical data. We propose an adaptive RBF interpolation algorithm, which adaptively determines the shape factor from the density of the local dataset. To evaluate the performance of the adaptive RBF interpolation algorithm in estimating missing values, we use three datasets for verification experiments and compare the accuracy and efficiency of adaptive RBF interpolation with those of NN interpolation and AIDW.
The rest of the paper is organized as follows. Section 2 mainly introduces the implementation process of the adaptive RBF interpolation algorithm, and briefly introduces the method to evaluate the performance of adaptive RBF interpolation. Section 3 introduces the experimental materials. Section 4 presents the estimated results of missing values, then discusses the experimental results. Section 5 draws some conclusions.
2 Methods
In this paper, our objective is to develop an adaptive RBF interpolation algorithm to estimate missing values in geospatial big data, and to compare its results with those of NN and AIDW. In this section, we first introduce the specific implementation process of the adaptive RBF interpolation algorithm, and then briefly introduce the method used to evaluate its performance.
2.1 Adaptive RBF Interpolation Algorithm
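The basic idea behind RBF interpolation is as follows. To construct a high-dimensional function $F(x)$, $x \in \mathbb{R}^d$, suppose there is a set of discrete points $x_i \in \mathbb{R}^d$, $i = 1, 2, \dots, N$, with associated data values $f_i \in \mathbb{R}$. The function $F(x)$ can then be expressed as a linear combination of RBFs in the form (Eq. (1)):
$$F(x) = \sum_{j=1}^{N} a_j \, \phi\left(\lVert x - x_j \rVert_2\right) \tag{1}$$
where $N$ is the number of interpolation points, $a_j$ are the undetermined coefficients, and the function $\phi$ is a type of RBF.
The kernel function selected in this paper is the multiquadric RBF (MQ-RBF), which has the form (Eq. (2)):
$$\phi(r) = \sqrt{r^2 + c^2} \tag{2}$$
where $r$ is the distance between the interpolated point and the data point, and $c$ is the shape factor. Substituting the data points into Eq. (1), the interpolation conditions become (Eq. (3)):
$$f_i = F(x_i) = \sum_{j=1}^{N} a_j \, \phi\left(\lVert x_i - x_j \rVert_2\right), \quad i = 1, 2, \dots, N \tag{3}$$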
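The interpolation conditions amount to a linear system for the undetermined coefficients. The following is a minimal sketch of that step, assuming 2D points and a fixed shape factor; the helper names (`mq_kernel`, `solve_linear_system`, `fit_rbf`, `evaluate_rbf`) and the toy data are hypothetical, and a plain Gaussian elimination is used only to keep the example free of external dependencies.

```python
import math

def mq_kernel(r, c):
    """Multiquadric RBF: phi(r) = sqrt(r^2 + c^2)."""
    return math.sqrt(r * r + c * c)

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_rbf(points, values, c):
    """Return the coefficients that satisfy the interpolation conditions."""
    A = [[mq_kernel(math.dist(p, q), c) for q in points] for p in points]
    return solve_linear_system(A, values)

def evaluate_rbf(x, points, coeffs, c):
    """Evaluate the interpolant: sum of coefficient * kernel(distance to each data point)."""
    return sum(a * mq_kernel(math.dist(x, p), c) for a, p in zip(coeffs, points))

# Toy data (hypothetical): four points with elevation-like values.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [10.0, 12.0, 11.0, 15.0]
coeffs = fit_rbf(points, values, c=0.5)
center_estimate = evaluate_rbf((0.5, 0.5), points, coeffs, c=0.5)
```

By construction, evaluating the interpolant at any of the data points returns the corresponding data value up to rounding error; `center_estimate` is the estimate at a location without data.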
When the RBF interpolation algorithm is applied to a big dataset, it is not practical to use all data points to compute each interpolated point. The closer a data point is to the interpolated point, the greater its influence on the interpolation result; beyond a certain distance, the influence of a data point on the interpolated point is almost negligible. Therefore, we calculate the distances from an interpolated point to all data points and select the 20 points with the smallest distances as the local dataset for that interpolated point.
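The selection of the local dataset can be sketched as a brute-force nearest-neighbor search. This is a minimal illustration, assuming each data point is stored as an `(x, y, value)` tuple; the function name `local_dataset` and the grid data are hypothetical.

```python
import heapq
import math

def local_dataset(query, data_points, k=20):
    """Return the k data points closest to `query` (Euclidean distance in x, y).

    Each data point is an (x, y, value) tuple; results are ordered from
    nearest to farthest.
    """
    return heapq.nsmallest(k, data_points, key=lambda p: math.dist(query, p[:2]))

# Toy data (hypothetical): a 10 x 10 grid whose value is x + y.
data = [(float(i), float(j), float(i + j)) for i in range(10) for j in range(10)]
neighbours = local_dataset((4.2, 4.9), data, k=4)
```

This search visits every data point for every interpolated point. For datasets of the size used here, a spatial index such as a grid or a k-d tree would likely be preferable; that choice is outside the scope of this sketch.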
In Eq. (2), the value of the shape factor in the MQ-RBF has a significant influence on the interpolation result. Following the method proposed by Lu and Wang [11, 12], we adaptively determine the shape factor for each interpolated point from the density of the local dataset. The expected density $D_{exp}$ is calculated by (Eq. (4)):
$$D_{exp} = \frac{N_{dataset}}{(X_{max} - X_{min})(Y_{max} - Y_{min})} \tag{4}$$
where $N_{dataset}$ is the number of data points in the dataset, $X_{max}$ and $X_{min}$ are the maximum and minimum $x$ coordinates of the data points in the dataset, and $Y_{max}$ and $Y_{min}$ are the maximum and minimum $y$ coordinates of the data points in the dataset.
The local density $D_{local}$ is calculated by (Eq. (5)):
$$D_{local} = \frac{N_{local}}{(x_{max} - x_{min})(y_{max} - y_{min})} \tag{5}$$
where $N_{local}$ is the number of data points in the local dataset (in this paper, $N_{local} = 20$), $x_{max}$ and $x_{min}$ are the maximum and minimum $x$ coordinates of the data points in the local dataset, and $y_{max}$ and $y_{min}$ are the maximum and minimum $y$ coordinates of the data points in the local dataset.
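Both densities can be computed with the same helper. The sketch below assumes that each density is the point count divided by the area of the axis-aligned bounding box of the points, which is what the quantities listed for Eq. (4) and Eq. (5) suggest; the function name `bounding_box_density` and the grid data are hypothetical.

```python
def bounding_box_density(points):
    """Number of points divided by the area of their axis-aligned bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if area == 0.0:
        raise ValueError("degenerate bounding box: all points share an x or y coordinate")
    return len(points) / area

# Toy data (hypothetical): 121 points on a 10 x 10 square ...
dataset = [(float(i), float(j)) for i in range(11) for j in range(11)]
# ... and the 9 of them that fall inside a 2 x 2 window.
local = [(x, y) for x, y in dataset if 4 <= x <= 6 and 4 <= y <= 6]

expected_density = bounding_box_density(dataset)  # 121 / 100
local_density = bounding_box_density(local)       # 9 / 4
```

Note that even on this perfectly regular grid the two densities differ, because the bounding box of a small local set encloses proportionally more boundary points than that of the whole dataset.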
With both the local density $D_{local}$ and the expected density $D_{exp}$, the local density statistic can be expressed as (Eq. (6)):
$$R(s_0) = \frac{D_{local}}{D_{exp}} \tag{6}$$
where $s_0$ is the location of an interpolated point. The measure $R(s_0)$ is then normalized to $\mu_R$ by a fuzzy membership function (Eq. (7)):
(7) 
The shape factor is then determined by a triangular membership function (Eq. (8)):
(8) 
where $c_1, c_2, c_3, c_4, c_5$ are the five levels of the shape factor.
After the shape factor $c$ is determined, the remaining steps are the same as in the general RBF calculation method. The specific process of the adaptive RBF interpolation algorithm is illustrated in Fig. 1.
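The mapping from the local density statistic to a shape factor can be sketched as follows. This is an illustration in the style of the adaptive IDW method [11], not the exact functions of this paper: the cosine-shaped ramp, its bounds 0 and 2, the breakpoints 0.1 to 0.9, and the five level values are all assumptions chosen for the example.

```python
import math

def fuzzy_membership(R, r_min=0.0, r_max=2.0):
    """Map the density statistic R to [0, 1] with a cosine-shaped ramp (assumed form)."""
    if R <= r_min:
        return 0.0
    if R >= r_max:
        return 1.0
    return 0.5 - 0.5 * math.cos(math.pi * (R - r_min) / (r_max - r_min))

def shape_factor(mu, levels):
    """Blend five shape-factor levels with triangular membership functions.

    The level centres 0.1, 0.3, 0.5, 0.7, 0.9 are assumed. Between two
    centres the result is a linear interpolation of the two adjacent levels;
    outside the first and last centre the nearest level is returned.
    """
    centres = [0.1, 0.3, 0.5, 0.7, 0.9]
    if mu <= centres[0]:
        return levels[0]
    if mu >= centres[-1]:
        return levels[-1]
    for i in range(4):
        lo, hi = centres[i], centres[i + 1]
        if mu <= hi:
            t = (mu - lo) / (hi - lo)
            return (1.0 - t) * levels[i] + t * levels[i + 1]

levels = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical values for the five levels
mu = fuzzy_membership(1.0)          # R = 1: local density equals expected density
c = shape_factor(mu, levels)
```

With these assumed settings, a local density equal to the expected density gives a membership value of about 0.5 and therefore the middle level. Whether denser neighborhoods should receive larger or smaller shape factors depends on how the five levels are ordered, which this sketch does not settle.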
2.2 Evaluating the Performance of Adaptive RBF Interpolation
To evaluate the computational accuracy of the adaptive RBF interpolation algorithm, we use the Root Mean Square Error (RMSE) as the metric. The RMSE measures the deviation between the estimated values and the true values. We then compare the accuracy and efficiency of the adaptive RBF estimator with those of the NN and AIDW estimators.
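For reference, the RMSE over $n$ held-out samples is the square root of the mean squared difference between estimated and true values. A minimal sketch, with a hypothetical function name and made-up elevation values:

```python
import math

def rmse(estimated, true):
    """Root Mean Square Error between two equally long sequences of values."""
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((e - t) ** 2 for e, t in zip(estimated, true))
    return math.sqrt(sum(squared_errors) / len(true))

# Made-up values for illustration only.
true_elevations = [100.0, 102.0, 98.0, 101.0]
estimated_elevations = [101.0, 100.0, 98.0, 103.0]
error = rmse(estimated_elevations, true_elevations)  # errors 1, -2, 0, 2 -> sqrt(9 / 4) = 1.5
```

A lower RMSE indicates estimates that are closer to the true values, and the result is expressed in the same unit as the data.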
3 Experimental Design
To evaluate the performance of the presented adaptive RBF interpolation algorithm, we test it on three datasets. The details of the experimental environment are listed in Table 1.
In our experiments, we use three datasets taken from the digital elevation models (DEMs) of three cities (Fig. 2(a) to Fig. 2(c)); the three DEM maps cover the same range. We randomly select 10% of the observed samples from each dataset as the samples with missing values, and use the rest as the samples with known values. Note that the samples with missing values do have true elevation values; for testing, we assume that these elevations are missing. Basic information about the datasets is listed in Table 2.
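The hold-out procedure can be sketched as a seeded random split. This is a minimal illustration, assuming samples are stored as `(x, y, elevation)` tuples; the function name `split_known_missing`, the seed, and the synthetic samples are hypothetical.

```python
import random

def split_known_missing(samples, missing_fraction=0.10, seed=0):
    """Randomly hold out a fraction of the samples to play the role of missing values."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_missing = round(len(samples) * missing_fraction)
    missing_idx = set(rng.sample(range(len(samples)), n_missing))
    known = [s for i, s in enumerate(samples) if i not in missing_idx]
    missing = [s for i, s in enumerate(samples) if i in missing_idx]
    return known, missing

# Synthetic samples (hypothetical): 1000 points on a 50-column grid.
samples = [(float(i % 50), float(i // 50), 100.0 + i * 0.1) for i in range(1000)]
known, missing = split_known_missing(samples)
```

The elevations of the held-out samples are kept so that the estimates produced from the known samples can later be compared against them.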
Table 1. Details of the experimental environment.

| Specification | Details |
|---|---|
| OS | Windows 7 Professional |
| CPU | Intel(R) i5-4210U |
| CPU Frequency | 1.70 GHz |
| RAM | 8 GB |
| CPU Cores | 4 |

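Table 2. Basic information of the datasets.

| Dataset | Number of known values | Number of missing values | Illustration |
|---|---|---|---|
| Beijing | 1,111,369 | 123,592 | Fig. 2(a) |
| Chongqing | 1,074,379 | 97,525 | Fig. 2(b) |
| Longyan | 1,040,670 | 119,050 | Fig. 2(c) |
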
4 Results and Discussion
We compare the accuracy and efficiency of the adaptive RBF estimator with those of the NN and AIDW estimators.
In Fig. 3 and Fig. 4, we find that the adaptive RBF estimator has the best accuracy and the NN estimator has the lowest accuracy. As the number of known data points in the datasets decreases, the accuracy of all three estimators decreases significantly. Moreover, the computational efficiency of the adaptive RBF estimator is worse than that of the NN and AIDW estimators, and NN has the best computational efficiency of the three. As the quantity of data increases, the computational efficiency advantage of the NN estimator becomes more and more obvious.
The data points selected from the DEMs are evenly distributed, whereas the shape factor of the adaptive RBF interpolation algorithm is adapted according to the density of the points in the local dataset. Therefore, when missing data are estimated in a dataset with uniformly distributed data points, the advantages of the adaptive RBF interpolation algorithm may not be fully realized. Further research is needed on datasets with unevenly distributed data points.
5 Conclusions
In this paper, we proposed an adaptive RBF interpolation algorithm for estimating missing values in geographical data. We performed three groups of experimental tests to evaluate the computational accuracy and efficiency of the proposed adaptive RBF interpolation by comparing it with NN interpolation and the AIDW method. We found that adaptive RBF interpolation is more accurate than NN interpolation and AIDW on regularly distributed datasets, but its efficiency is worse than that of NN interpolation and AIDW.
Acknowledgments
This research was jointly supported by the National Natural Science Foundation of China (11602235), and the Fundamental Research Funds for China Central Universities (2652018091, 2652018107, and 2652018109).
References
[1] (2016) An approach to forecast queue time in adaptive scheduling: how to mediate system efficiency and users satisfaction. International Journal of Parallel Programming 45 (5), pp. 1–30.
[2] (2016) Reconstruction of implicit curves and surfaces via RBF interpolation. Applied Numerical Mathematics 116, pp. 60–63.
[3] (2013) Surface reconstruction from scattered point via RBF interpolation on GPU.
[4] (2018) Accelerating multidimensional interpolation using moving least-squares on the GPU. Concurrency Computation 30 (24).
[5] (2018) Comparison of estimating missing values in IoT time series data using different interpolation algorithms. International Journal of Parallel Programming, pp. 1–15.
[6] (2017) Performance evaluation of GPU-accelerated spatial interpolation using radial basis functions for building explicit surfaces. International Journal of Parallel Programming (157), pp. 1–29.
[7] (2018) Estimating significant wave height from SAR imagery based on an SVM regression model. Acta Oceanologica Sinica 37 (3), pp. 103–110.
[8] (2017) Efficient and exact mesh deformation using multiscale RBF interpolation. Journal of Computational Physics 345, pp. 732–751.
[9] (2018) Multiple imputation of missing data in nested case-control and case-cohort studies. Biometrics 74 (4).
[10] From big data to big analysis: a perspective of geographical conditions monitoring. pp. 1–15.
[11] (2008) An adaptive inverse-distance weighting spatial interpolation technique. Computers & Geosciences 34 (9), pp. 1044–1055.
[12] (2016) Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search. SpringerPlus 5 (1), pp. 1389.
[13] (2017) RBF interpolation with CSRBF of large data sets. Procedia Computer Science 108, pp. 2433–2437.
[14] (2016) Extreme learning machine for missing data using multiple imputations. Neurocomputing 174 (PA), pp. 220–231.
[15] (2018) Very large-scale data classification based on k-means clustering and multi-kernel SVM. Soft Computing (1), pp. 1–9.
[16] (2016) Big data and urban informatics: innovations and challenges to urban planning and knowledge discovery.
[17] (2018) A bias-corrected estimator in multiple imputation for missing data. Statistics in Medicine 47 (1), pp. 1–16.