DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications

05/15/2023
by Cyril Picard, et al.

Exploiting the recent advancements in artificial intelligence, showcased by ChatGPT and DALL-E, in real-world applications necessitates vast, domain-specific, and publicly accessible datasets. Unfortunately, the scarcity of such datasets poses a significant challenge for researchers aiming to apply these breakthroughs in engineering design. Synthetic datasets emerge as a viable alternative. However, practitioners are often uncertain about generating high-quality datasets that accurately represent real-world data and are suitable for the intended downstream applications. This study aims to fill this knowledge gap by proposing comprehensive guidelines for generating, annotating, and validating synthetic datasets. The trade-offs and methods associated with each of these aspects are elaborated upon. Further, the practical implications of these guidelines are illustrated through the creation of a turbo-compressors dataset. The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset. It also highlights that design diversity does not equate to performance diversity or realism. By employing test sets that represent uniform, real, or task-specific samples, the influence of sample size and sampling strategy is scrutinized. Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design, thereby paving the way for more effective applications of AI advancements in the field. The code and data for the dataset and methods are made publicly accessible at https://github.com/cyrilpic/radcomp.
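The claim that design diversity does not equate to performance diversity can be illustrated with a small sketch. This is not the paper's method or its turbo-compressor model: the single design variable, the saturating `toy_performance` function, and the bin-share measure are all hypothetical choices made for illustration. Designs are sampled uniformly on [0, 1], mapped through the toy model, and the spread of each set is summarized by the share of samples in its most crowded histogram bin.

```python
import math
import random


def toy_performance(x: float) -> float:
    """Hypothetical saturating performance model (an assumption, not from the paper)."""
    return 1.0 - math.exp(-10.0 * x)


def bin_shares(values, n_bins=10):
    """Fraction of samples falling in each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v * n_bins), n_bins - 1)  # clamp v == 1.0 into the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


rng = random.Random(0)  # fixed seed so the run is reproducible
designs = [rng.random() for _ in range(2000)]  # uniformly sampled design variable
performances = [toy_performance(x) for x in designs]

design_shares = bin_shares(designs)
performance_shares = bin_shares(performances)

# A large "largest bin share" means the samples pile up in one region.
print("largest design bin share:     ", round(max(design_shares), 3))
print("largest performance bin share:", round(max(performance_shares), 3))
```

Under these assumptions the design samples spread roughly evenly across bins, while most performance values crowd into the top bin because the toy model saturates. A real study would replace the toy model with the actual simulator and would likely use a richer diversity metric.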


