Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

05/02/2022
by   Rui Shu, et al.
Background: Most existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In this paradigm, models need a large amount of labeled data to learn useful relationships between the selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when little labeled training data but plenty of unlabeled training data is available. Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the class distribution of the dataset is highly imbalanced, Dapper also adaptively integrates and optimizes a data oversampling method called SMOTE. We use Bayesian optimization to search the large hyperparameter space of these tuning targets. Result: We evaluate Dapper on three security datasets: the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data and still achieve close or even better classification performance than using 100% of the labeled data in a supervised way. Conclusion: Based on these results, we recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
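As a rough illustration of the two data-side ideas named in the Method (pseudo-labeling by propagation and SMOTE-style oversampling), the following is a minimal sketch in plain Python. It is not Dapper's implementation: the function names `propagate_labels` and `smote_like_oversample`, the k-nearest-neighbor voting rule, the toy 2-D data, and the fixed `k` are all assumptions made for illustration, and the random forest classifier and the Bayesian optimization step are omitted.

```python
import math
import random
from collections import Counter


def propagate_labels(labeled, unlabeled, k=3):
    """Assign pseudo-labels to unlabeled points (illustrative, not Dapper's algorithm).

    labeled:   list of (features, label) pairs, features being a tuple of floats
    unlabeled: list of feature tuples
    Repeatedly takes the unlabeled point closest to the labeled pool, gives it
    the majority label of its k nearest labeled neighbors, and adds it to the
    pool so that labels spread outward from the labeled seeds.
    """
    pool = list(labeled)
    remaining = list(unlabeled)
    pseudo = []
    while remaining:
        # Index of the unlabeled point nearest to any already-labeled point.
        nearest = min(
            range(len(remaining)),
            key=lambda i: min(math.dist(remaining[i], x) for x, _ in pool),
        )
        point = remaining.pop(nearest)
        neighbors = sorted(pool, key=lambda item: math.dist(point, item[0]))[:k]
        label = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        pool.append((point, label))
        pseudo.append((point, label))
    return pseudo


def smote_like_oversample(minority, n_new, rng):
    """Create synthetic minority samples by interpolation.

    Simplified variant of SMOTE (assumption): each synthetic point lies on the
    segment between a random minority point and its single nearest minority
    neighbor, whereas SMOTE proper samples among k nearest neighbors.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = min(others, key=lambda p: math.dist(base, p))
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy data (hypothetical): two well-separated clusters, few labels each.
labeled = [
    ((0.0, 0.0), "ham"), ((0.2, 0.1), "ham"), ((0.1, 0.3), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]
unlabeled = [(0.3, 0.2), (4.9, 5.3), (0.5, 0.4), (5.5, 5.2)]

pseudo_labeled = propagate_labels(labeled, unlabeled, k=3)
spam_points = [x for x, lbl in labeled + pseudo_labeled if lbl == "spam"]
extra_spam = smote_like_oversample(spam_points, n_new=4, rng=random.Random(0))
print(pseudo_labeled)
print(extra_spam)
```

In a framework of the kind the abstract describes, settings such as `k` or the number of synthetic samples would presumably be among the hyperparameters searched by the optimizer rather than fixed by hand as they are here.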

research
06/03/2019

Learning to Self-Train for Semi-Supervised Few-Shot Classification

Few-shot classification (FSC) is challenging due to the scarcity of labe...
research
09/12/2021

DRo: A data-scarce mechanism to revolutionize the performance of Deep Learning based Security Systems

Supervised Deep Learning requires plenty of labeled data to converge, an...
research
10/12/2016

Semi-supervised Discovery of Informative Tweets During the Emerging Disasters

The first objective towards the effective use of microblogging services ...
research
08/18/2013

Reference Distance Estimator

A theoretical study is presented for a simple linear classifier called r...
research
08/25/2023

Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning

Glaucoma is the number one cause of irreversible blindness globally. A m...
research
03/24/2020

A Pitfall of Learning from User-generated Data: In-depth Analysis of Subjective Class Problem

Research in the supervised learning algorithms field implicitly assumes ...
