K-Metamodes: frequency- and ensemble-based distributed k-modes clustering for security analytics

09/30/2019
by Andrey Sapegin, et al.

Processing of Big Security Data, such as log messages, is now commonly used for intrusion detection purposes. Its heterogeneous nature, together with its combination of numerical and categorical attributes, does not allow existing data mining methods to be applied directly to the data without feature preprocessing. Therefore, a rather computationally expensive conversion of categorical attributes into vector space has to be utilised for the analysis of such data. However, the well-known k-modes algorithm can cluster categorical data directly and avoids conversion into the vector space. The existing implementations of k-modes for Big Data processing are ensemble-based and utilise two-step clustering: data subsets are first clustered independently, and the resulting cluster modes are then clustered again in order to calculate metamodes valid for all data subsets. In this paper, a novel frequency-based distance function is proposed for the second step of ensemble-based k-modes clustering. In addition, the feature discretisation method from previous work is utilised in order to adapt k-modes to the processing of mixed data sets. The resulting k-metamodes algorithm was tested on two public security data sets and reached higher effectiveness than the previous work.
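The two-step scheme described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions. The abstract does not define the frequency-based distance function, so the sketch uses the plain simple-matching (Hamming) dissimilarity in both steps. Numerical feature discretisation is omitted, so all attributes are assumed to be categorical already. The function names (`hamming`, `column_modes`, `k_modes`, `k_metamodes_sketch`) and the stride-based splitting into subsets are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def hamming(a, b):
    """Simple-matching dissimilarity: number of attributes that differ.
    Assumption: stands in for the paper's frequency-based distance."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, n_iter=10, seed=0):
    """Basic k-modes: assign each record to its nearest mode, then
    recompute every mode attribute-wise; stop when modes are stable."""
    rng = random.Random(seed)
    # Initialise from distinct records so that no two start modes coincide.
    distinct = list(dict.fromkeys(tuple(r) for r in records))
    if len(distinct) < k:
        raise ValueError("need at least k distinct records")
    modes = rng.sample(distinct, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for r in records:
            nearest = min(range(k), key=lambda i: hamming(r, modes[i]))
            clusters[nearest].append(r)
        # An empty cluster keeps its previous mode.
        new_modes = [column_modes(c) if c else modes[i]
                     for i, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


def k_metamodes_sketch(records, n_subsets, k, k_meta, seed=0):
    """Step 1: cluster each data subset independently.
    Step 2: cluster the collected modes again to obtain metamodes."""
    subsets = [records[i::n_subsets] for i in range(n_subsets)]
    all_modes = []
    for j, subset in enumerate(subsets):
        all_modes.extend(k_modes(subset, k, seed=seed + j))
    return k_modes(all_modes, k_meta, seed=seed)


if __name__ == "__main__":
    # Toy log-like records with categorical attributes only (made-up data).
    logs = [("ssh", "fail", "root")] * 20 + [("http", "ok", "guest")] * 20
    print(sorted(k_metamodes_sketch(logs, n_subsets=4, k=2, k_meta=2)))
```

In this sketch the second step treats every mode as an ordinary record. The proposal in the paper differs precisely at that point, where a frequency-based distance replaces the simple-matching one.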

research
11/19/2020

Similarity-based Distance for Categorical Clustering using Space Structure

Clustering is spotting pattern in a group of objects and resultantly gro...
research
08/17/2016

Clustering Mixed Datasets Using Homogeneity Analysis with Applications to Big Data

Datasets with a mixture of numerical and categorical attributes are rout...
research
06/06/2020

An Efficient k-modes Algorithm for Clustering Categorical Datasets

Mining clusters from datasets is an important endeavor in many applicati...
research
10/18/2022

Clustering Categorical Data: Soft Rounding k-modes

Over the last three decades, researchers have intensively explored vario...
research
11/26/2019

Securing Cluster-heads in Wireless Sensor Networks by a Hybrid Intrusion Detection System Based on Data Mining

Cluster-based Wireless Sensor Network (CWSN) is a kind of WSNs that beca...
research
12/22/2022

Co-clustering based exploratory analysis of mixed-type data tables

Co-clustering is a class of unsupervised data analysis techniques that e...
research
10/30/2018

Cluster Size Management in Multi-Stage Agglomerative Hierarchical Clustering of Acoustic Speech Segments

Agglomerative hierarchical clustering (AHC) requires only the similarity...
