An Assessment of Data Transfer Performance for Large-Scale Climate Data Analysis and Recommendations for the Data Infrastructure for CMIP6

08/26/2017
by Eli Dart, et al.

We document the data transfer workflow, data transfer performance, and other aspects of staging approximately 56 terabytes of climate model output data from the distributed Coupled Model Intercomparison Project (CMIP5) archive to the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory. The data were required for tracking and characterizing extratropical storms, a phenomenon of importance in the mid-latitudes. We present this analysis to illustrate the current challenges in assembling multi-model data sets at major computing facilities for large-scale studies of CMIP5 data. Because the archive for the upcoming CMIP6 phase of model intercomparison will be larger, we expect such data transfers to become increasingly important, and perhaps routinely necessary. We find that data transfer rates using the Earth System Grid Federation (ESGF) are often slower than what is typically available to US residences, and that there is significant room for improvement in the data transfer capabilities of the ESGF portal and data centers, both in workflow mechanics and in data transfer performance. We believe performance improvements of at least an order of magnitude are within technical reach using current best practices, as illustrated by the performance we achieved in transferring the complete raw data set between two high performance computing facilities. To achieve these improvements, we recommend that current best practices (such as the Science DMZ model) be applied to the data servers and networks at ESGF data centers; that sufficient financial and human resources be devoted at the ESGF data centers to the systems and network engineering tasks that support high performance data movement; and that performance metrics for data transfer between ESGF data centers and major computing facilities used for climate data analysis be established, regularly tested, and published.
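
To give a sense of scale for an order-of-magnitude improvement, the following minimal sketch estimates how long a 56-terabyte transfer would take at a few sustained throughputs. The rates shown are illustrative assumptions, not measurements reported in the paper, and the helper name `transfer_days` is hypothetical. The calculation assumes decimal terabytes and a perfectly sustained rate with no protocol overhead or retries.

```python
# Rough transfer-time arithmetic for a ~56 TB data set.
# All rates below are hypothetical examples, not values from the paper.

DATASET_BYTES = 56e12        # ~56 terabytes (decimal)
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, rate_bits_per_s: float) -> float:
    """Return the days needed to move size_bytes at a sustained bit rate."""
    return size_bytes * 8 / rate_bits_per_s / SECONDS_PER_DAY


# Illustrative sustained throughputs (assumed for the example).
example_rates = {
    "100 Mbps": 100e6,
    "1 Gbps": 1e9,
    "10 Gbps": 10e9,
}

for label, rate in example_rates.items():
    print(f"{label}: {transfer_days(DATASET_BYTES, rate):.1f} days")
```

Under these assumptions, each tenfold increase in sustained throughput cuts the staging time by a factor of ten, which can be the difference between a transfer measured in weeks and one measured in days.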

