Big enterprise registration data imputation: Supporting spatiotemporal analysis of industries in China

04/05/2018
by Fa Li, et al.

Big, fine-grained enterprise registration data that includes time and location information enables us to quantitatively analyze, visualize, and understand the patterns of industries at multiple scales across time and space. However, data quality issues such as incompleteness and ambiguity hinder such analysis and application. These issues become more challenging when the volume of data is immense and constantly growing. High Performance Computing (HPC) frameworks can tackle big data computational issues, but few studies have systematically investigated imputation methods for enterprise registration data in this type of computing environment. In this paper, we propose a big data imputation workflow, based on Apache Spark and a bare-metal computing cluster, to impute enterprise registration data. We integrated external data sources, employed Natural Language Processing (NLP), and compared several machine-learning methods to address the incompleteness and ambiguity problems found in enterprise registration data. Experimental results illustrate the feasibility, efficiency, and scalability of the proposed HPC-based imputation framework, which also provides a reference for processing other big georeferenced text data. Using these imputation results, we visualize and briefly discuss the spatiotemporal distribution of industries in China, demonstrating the potential applications of such data once quality issues are resolved.
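To make the idea of text-based imputation concrete, here is a minimal sketch in plain Python. It is not the authors' Spark workflow and not one of the machine-learning methods they compared; it only illustrates filling in a missing industry label from the words in an enterprise name. The record fields (`name`, `industry`), the whitespace tokenizer, the token-voting rule, and the sample records are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def tokenize(name):
    # Hypothetical stand-in for a real NLP word-segmentation step.
    return name.lower().split()


def fit_token_votes(records):
    """Count how often each name token co-occurs with each known industry."""
    votes = defaultdict(Counter)
    for rec in records:
        if rec["industry"] is not None:
            for tok in tokenize(rec["name"]):
                votes[tok][rec["industry"]] += 1
    return votes


def impute_industry(records, votes):
    """Fill missing industry labels by summing token votes.

    Records whose tokens were never seen with a label stay None.
    """
    filled = []
    for rec in records:
        industry = rec["industry"]
        if industry is None:
            tally = Counter()
            for tok in tokenize(rec["name"]):
                tally.update(votes.get(tok, {}))
            if tally:
                industry = tally.most_common(1)[0][0]
        filled.append({**rec, "industry": industry})
    return filled


# Made-up sample records; None marks a missing industry label.
records = [
    {"name": "Wuhan Steel Manufacturing Co", "industry": "manufacturing"},
    {"name": "Hubei Software Technology Co", "industry": "information technology"},
    {"name": "Shenzhen Cloud Software Ltd", "industry": None},
    {"name": "Anshan Steel Works", "industry": None},
    {"name": "Unnamed Holdings", "industry": None},
]

votes = fit_token_votes(records)
imputed = impute_industry(records, votes)
```

In a distributed setting the same two passes (learn from labeled records, then fill unlabeled ones) would be expressed as operations over a partitioned dataset rather than Python lists; the sketch leaves records with no matching evidence unfilled instead of guessing.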

research
07/29/2019

Geospatial Big Data Handling with High Performance Computing: Current Approaches and Future Directions

Geospatial big data plays a major role in the era of big data, as most d...
research
01/19/2018

A hybrid architecture for astronomical computing

With many large science equipment constructing and putting into use, ast...
research
02/14/2019

Theory-plus-code documentation of the DEPAM workflow for soundscape description

In the Big Data era, the community of PAM faces strong challenges, inclu...
research
12/01/2021

A unified framework to improve the interoperability between HPC and Big Data languages and programming models

One of the most important issues in the path to the convergence of HPC a...
research
08/23/2017

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Today's high-performance computing (HPC) systems are heavily instrumente...
research
07/05/2022

Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing

Big data processing at the production scale presents a highly complex en...
research
01/30/2019

NAOMI: Non-Autoregressive Multiresolution Sequence Imputation

Missing value imputation is a fundamental problem in modeling spatiotemp...
