The Bearable Lightness of Big Data: Towards Massive Public Datasets in Scientific Machine Learning

07/25/2022
by   Wai Tong Chung, et al.

In general, large datasets enable deep learning models to perform with good accuracy and generalizability. However, massive high-fidelity simulation datasets (from molecular chemistry, astrophysics, computational fluid dynamics (CFD), etc.) can be challenging to curate due to dimensionality and storage constraints. Lossy compression algorithms can help mitigate storage limitations, as long as the overall data fidelity is preserved. To illustrate this point, we demonstrate that deep learning models, trained and tested on data from a petascale CFD simulation, are robust to errors introduced during lossy compression in a semantic segmentation problem. Our results demonstrate that lossy compression algorithms offer a realistic pathway for exposing high-fidelity scientific data to open-source data repositories for building community datasets. In this paper, we outline, construct, and evaluate the requirements for establishing a big data framework, demonstrated at https://blastnet.github.io/, for scientific machine learning.
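
The trade-off described above, less storage in exchange for a bounded loss of fidelity, can be illustrated with a minimal sketch. It uses plain uniform 8-bit quantization as a stand-in for a lossy compressor; this is an assumption made for illustration and is not the compression algorithm evaluated in the paper. The synthetic `field` and the helper names `quantize` and `dequantize` are likewise hypothetical.

```python
import math
import struct


def quantize(values, bits=8):
    """Lossy step: map each float onto one of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant field, where hi == lo.
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * step for c in codes]


# Synthetic smooth signal (assumption: a toy stand-in for simulation data).
field = [
    math.sin(2 * math.pi * i / 1000) + 0.3 * math.cos(2 * math.pi * i / 37)
    for i in range(10_000)
]

codes, lo, step = quantize(field, bits=8)
restored = dequantize(codes, lo, step)

# Storage: 8 bytes per float64 value versus 1 byte per 8-bit code.
raw_bytes = len(struct.pack(f"{len(field)}d", *field))
packed_bytes = len(bytes(codes))
ratio = raw_bytes / packed_bytes

# Fidelity: rounding to the nearest level bounds the error by half a step.
max_err = max(abs(a - b) for a, b in zip(field, restored))

print(f"compression ratio: {ratio:.1f}x")
print(f"max abs error: {max_err:.3e} (bound: {step / 2:.3e})")
```

In this toy setting the error bound follows directly from the quantization step, so fidelity can be checked before a dataset is published. Error-bounded scientific compressors expose a comparable tolerance, although their internals differ from this sketch.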

research
11/26/2022

A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction

Machine learning models are gaining increasing popularity in the domain ...
research
10/05/2020

Using Bayesian deep learning approaches for uncertainty-aware building energy surrogate models

Fast machine learning-based surrogate models are trained to emulate slow...
research
07/28/2023

Does Full Waveform Inversion Benefit from Big Data?

This paper investigates the impact of big data on deep learning models f...
research
03/04/2020

Hybrid modeling: Applications in real-time diagnosis

Reduced-order models that accurately abstract high fidelity models and e...
research
08/20/2020

PicoDomain: A Compact High-Fidelity Cybersecurity Dataset

Analysis of cyber relevant data has become an area of increasing focus. ...
research
06/09/2023

Data-Link: High Fidelity Manufacturing Datasets for Model2Real Transfer under Industrial Settings

High-fidelity datasets play a pivotal role in imbuing simulators with re...
research
05/26/2021

Scalable Multigrid-based Hierarchical Scientific Data Refactoring on GPUs

Rapid growth in scientific data and a widening gap between computational...
