Benchmarking the Benchmark – Analysis of Synthetic NIDS Datasets

04/19/2021
by   Siamak Layeghy, et al.

Network Intrusion Detection Systems (NIDSs) are an increasingly important tool for the prevention and mitigation of cyber attacks. A number of labelled synthetic datasets have been generated and made publicly available by researchers, and they have become the benchmarks against which new ML-based NIDS classifiers are evaluated. Recently published results show excellent classification performance on these datasets, increasingly approaching 100 percent across key evaluation metrics such as accuracy and F1 score. Unfortunately, we have not yet seen these excellent academic research results translated into practical NIDSs with such near-perfect performance. This motivated the research presented in this paper, where we analyse the statistical properties of the benign traffic in three of the more recent and relevant NIDS datasets (CIC, UNSW, ...). As a comparison, we consider two datasets obtained from real-world production networks, one from a university network and one from a medium-sized Internet Service Provider (ISP). Our results show that the two real-world datasets are quite similar to each other with regard to most of the considered statistical features. Equally, the three synthetic datasets are relatively similar within their group. However, and most importantly, our results show a distinct difference in most of the considered statistical features between the three synthetic datasets and the two real-world datasets. Since ML relies on the basic assumption that training and test datasets are sampled from the same distribution, this raises the question of how well the performance results of ML classifiers trained on the considered synthetic datasets translate and generalise to real-world networks. We believe this is an interesting and relevant question that motivates further research in this space.
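
To make the idea of comparing a statistical feature across datasets concrete, the following is a minimal sketch and not the method used in the paper. It assumes a single numeric per-flow feature (for example, flow duration) and uses the two-sample Kolmogorov-Smirnov statistic, which is one common way to measure the gap between two empirical distributions. The function name `ks_statistic` and all sample data are hypothetical; the samples are drawn from made-up log-normal distributions purely for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs,
    a value in [0, 1]; 0 means the empirical distributions coincide.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical stand-ins for one benign-traffic feature in three datasets.
# These parameters are assumptions chosen for illustration, not measured values.
rng = random.Random(0)
real_a = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
real_b = [rng.lognormvariate(0.1, 1.0) for _ in range(2000)]
synthetic = [rng.lognormvariate(2.0, 0.5) for _ in range(2000)]

d_real_real = ks_statistic(real_a, real_b)
d_real_synth = ks_statistic(real_a, synthetic)

print(f"real vs real:      {d_real_real:.3f}")
print(f"real vs synthetic: {d_real_synth:.3f}")
```

In this toy setup the statistic is small for the two similar samples and large for the dissimilar pair, which is the kind of contrast the abstract describes between the real-world group and the synthetic group. A classifier trained on data resembling `synthetic` would see a noticeably different feature distribution when deployed on traffic resembling `real_a`.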


