Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

09/16/2021
by   Justus Henneberg, et al.

Performing data-intensive analytics is an essential part of modern Earth science. Research in atmospheric physics and meteorology therefore frequently requires processing very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e., contain various measurements per spatiotemporal point, and (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks performed on these datasets are structurally complex. Over the years, the binary format NetCDF has become established as a de facto standard for distributing and exchanging such multi-dimensional datasets in the Earth science community, along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored to the particularities of these datasets. As a result, Earth science researchers (who are typically not computer scientists) struggle unnecessarily to work efficiently with these datasets on a daily basis. In this work, we aim to resolve these issues. Instead of proposing yet another specialized tool and interface for working with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine, resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored to NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6.
