Parallel Computation of PDFs on Big Spatial Data Using Spark

05/08/2018
by Ji Liu, et al.

We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g. using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. The main way to analyze it is to compute a Probability Density Function (PDF) for each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a parallel computer). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction, and sampling. We evaluate our solution through extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (on the order of seconds or minutes) compared with a baseline method.
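To make the data grouping idea concrete, the following is a minimal single-machine sketch in plain Python, not the paper's Spark implementation. It assumes that each point carries a small set of simulated values, that points with the same multiset of values can share one fitted PDF, and that a normal distribution is an acceptable model; the function name `fit_pdfs_with_grouping` and the input layout are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from statistics import NormalDist


def fit_pdfs_with_grouping(observations):
    """Fit one normal PDF per group of points that share the same values.

    observations: dict mapping a point id, e.g. (x, y, z), to a tuple of
    at least two simulated values for that point (hypothetical layout).
    Returns (fitted, n_fits): a dict from point id to its NormalDist, and
    the number of fits actually performed.
    """
    groups = defaultdict(list)
    for point, values in observations.items():
        # Sort the values so points with the same multiset share a key.
        groups[tuple(sorted(values))].append(point)

    fitted = {}
    for key, points in groups.items():
        # Assumption: a normal model; the fit runs once per group.
        dist = NormalDist.from_samples(key)
        for point in points:
            fitted[point] = dist
    return fitted, len(groups)


observations = {
    (0, 0, 0): (1.0, 2.0, 3.0),
    (0, 0, 1): (3.0, 1.0, 2.0),  # same values as (0, 0, 0), different order
    (0, 1, 0): (2.0, 4.0, 6.0),
}
fitted, n_fits = fit_pdfs_with_grouping(observations)
print(n_fits, "fits for", len(observations), "points")
```

In this sketch the saving comes from fitting once per distinct group rather than once per point. In a distributed setting the same grouping would presumably map onto a key-based aggregation, but that mapping is an assumption here and is not taken from the abstract.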

research
11/04/2021

Auto Tuning of Hadoop and Spark parameters

Data of the order of terabytes, petabytes, or beyond is known as Big Dat...
research
10/26/2017

Distributed Spatial Data Clustering as a New Approach for Big Data Analysis

In this paper we propose a new approach for Big Data mining and analysis...
research
02/14/2020

Big Data Staging with MPI-IO for Interactive X-ray Science

New techniques in X-ray scattering science experiments produce large dat...
research
06/08/2023

Learned spatial data partitioning

Due to the significant increase in the size of spatial data, it is essen...
research
02/05/2018

Analytical Cost Metrics : Days of Future Past

As we move towards the exascale era, the new architectures must be capab...
research
09/13/2017

Visualization of Big Spatial Data using Coresets for Kernel Density Estimates

The size of large, geo-located datasets has reached scales where visuali...