Two step clustering for data reduction combining DBSCAN and k-means clustering

11/22/2021
by Bart J. J. Kremers, et al.

This work proposes a novel combination of two widely used clustering algorithms for detecting and reducing regions of high data density. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm detects the high-density regions, and the k-means algorithm reduces them. The proposed algorithm iterates while successively decrementing the DBSCAN search radius, which yields an adaptive reduction factor based on the effective data density. The algorithm is demonstrated on a physics simulation application in which a neural-network surrogate model for fusion reactor plasma turbulence is generated. The training dataset for the surrogate model is created with a quasilinear gyrokinetics code for turbulent transport calculations in fusion plasmas. Because the training set consists of model inputs derived from a repository of experimental measurements, there is a risk of over-representing specific regions of the input parameter space. Applying the proposed reduction algorithm to this dataset shows that the training set can be reduced by a factor of roughly 20 without a noticeable loss in surrogate model accuracy. This reduction provides a novel way of analyzing existing high-dimensional datasets for biases and subsequently reducing them, which lowers the cost of re-populating that parameter space with higher-quality data.
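The iteration described in the abstract can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation: the function names (`dbscan`, `kmeans`, `reduce_dataset`), the parameters (`eps_start`, `eps_step`, `n_steps`, `min_pts`, `factor`), and the rule of replacing each detected cluster by `len(cluster) // factor` k-means centroids are all assumptions made for illustration. Both clustering steps are written as brute-force, standard-library-only helpers so the example is self-contained.

```python
import math
import random


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Brute-force radius query; the point itself is included.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, keep expanding
    return labels


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations; returns k centroids as tuples."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[c]
                   for c, g in enumerate(groups)]
    return centers


def reduce_dataset(points, eps_start, eps_step, n_steps,
                   min_pts=5, factor=4, seed=0):
    """Hypothetical reduction loop: detect dense regions, replace by centroids."""
    rng = random.Random(seed)
    data = [tuple(p) for p in points]
    eps = eps_start
    for _ in range(n_steps):
        labels = dbscan(data, eps, min_pts)
        kept = [p for p, lab in zip(data, labels) if lab == -1]  # sparse points
        for c in range(max(labels, default=-1) + 1):
            members = [p for p, lab in zip(data, labels) if lab == c]
            k = max(1, len(members) // factor)  # assumed per-step reduction rule
            kept.extend(kmeans(members, k, rng))
        data = kept
        eps -= eps_step  # shrink the search radius for the next pass
        if eps <= 0:
            break
    return data


# Toy data: one over-represented blob plus a sparse, well-separated grid.
rng = random.Random(1)
dense = [(rng.gauss(0, 0.05), rng.gauss(0, 0.05)) for _ in range(200)]
sparse = [(5.0 + 2 * i, 5.0 + 2 * j) for i in range(4) for j in range(4)]
points = dense + sparse
reduced = reduce_dataset(points, eps_start=0.3, eps_step=0.1, n_steps=2)
print(len(points), "->", len(reduced))
```

In this sketch, regions that are still detected as dense at smaller radii are reduced in more passes than regions detected only at the initial radius, which is one way an adaptive reduction factor can arise. Points that DBSCAN labels as noise are kept unchanged. The parameter values are illustrative only, and for real datasets the brute-force helpers would normally be swapped for library implementations such as scikit-learn's `DBSCAN` and `KMeans`.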


