Position Paper on Dataset Engineering to Accelerate Science

03/09/2023
by Emilio Vital Brazil, et al.

Data is a critical element in any discovery process. In recent decades, we have observed exponential growth in the volume of available data and in the technology to manipulate it. However, data is only useful when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we use the term dataset to designate a structured set of data built to perform a well-defined task. Moreover, a dataset will in most cases be treated as a blueprint of an entity that can be stored as a table at any moment. In science specifically, each area has its own ways to organize, gather, and handle its datasets. We believe that datasets must be first-class entities in any knowledge-intensive process, and that all workflows should pay special attention to the dataset lifecycle, from gathering to use and evolution. We argue that science and engineering discovery processes are extreme instances of the need for such organization of datasets, calling for new approaches and tooling. Furthermore, these requirements are more evident when the discovery workflow uses artificial intelligence methods to empower the subject-matter expert. In this work, we discuss an approach to making datasets a critical entity in the scientific discovery process. We illustrate some concepts using material discovery as a use case. We chose this domain because it involves many significant problems that generalize to other scientific fields.
