
Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

05/02/2022
by   Rui Shu, et al.

Background: Most of the existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available. Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset classes are highly imbalanced, Dapper adaptively integrates and optimizes a data oversampling method called SMOTE. We use Bayesian Optimization to search the large hyperparameter space of these tuning targets. Result: We evaluate Dapper with three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data but achieve close or even better classification performance than using 100% labeled data. Conclusion: Based on those results, we would recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.


1. Introduction

When using machine learning to address security tasks, most existing models are built on supervised-learning algorithms. For example, many existing spam detection techniques (Crawford et al., 2015; Wu et al., 2018), malware detection techniques (Souri and Hosseini, 2018) or network intrusion detection systems (Resende and Drummond, 2018) train different classifiers to learn the inherent relationships between selected features and the associated output class (i.e., label), e.g., abnormal or benign. Next, those classifiers are tested on unseen data for classification purposes. Thus, labeled data is necessary for training a helpful model in a supervised paradigm.

However, there are often cases when labeled security data is insufficient and expensive to collect, while a large set of unlabeled security data is available. To make good use of these unlabeled data, practitioners resort to annotating data to enlarge the labeled training set. Such a process is referred to as data annotation or data labeling. However, data annotation is usually time-consuming and costly. For example, Tu et al. (Tu et al., 2020), in another domain (software engineering), reported that manually reading and labeling 22,500+ GitHub commits requires 175 person-hours (approximately nine weeks), including cross-checking among labelers. Moreover, specific domain knowledge (e.g., in security) is also required to ensure the high quality of annotated data.

Publication Year Brief Description
(Le et al., 2021) 2021 Use SSL to maximize the effectiveness of limited labeled training data for insider threat detection.
(Wang et al., 2015) 2015 Model and automate the Android policy refinement process with SSL.
(Zhang et al., 2020) 2020 Use label propagation to detect review spam groups.
(Nunes et al., 2016) 2016 Help with classification task of identifying relevant products in darknet/deepnet marketplaces, etc.
(Wang et al., 2021) 2021 Provide a multi-label propagation based method for fraud detection.
(Alabdulmohsin et al., 2016) 2016 Propose a method to estimate the maliciousness of a given file through a semi-supervised label propagation procedure.
(Taheri et al., 2020) 2020 Introduce a DL-based semi-supervised approach against label flipping attacks in the malware detection system.
(Pallaprolu et al., 2016) 2016 Propose to use label propagation to discover infected Remote Access Trojans packets in large unlabeled data.
(Ni et al., 2015) 2015 Use label propagation for malware detection.
(Kolosnjaji et al., 2016) 2016 Create a semi-supervised malware classification system that unifies views of static and dynamic malware analysis.
Table 1. A list of prior work using semi-supervised learning in security tasks.

Semi-supervised learning (SSL) (Zhu and Goldberg, 2009; Zhu, 2005) can address the challenges mentioned above from the algorithm perspective. SSL correlates the features of unlabeled data with those of labeled data and generates pseudo-labels for the unlabeled data. The newly labeled dataset (i.e., a mixture of both labeled and pseudo-labeled samples) is then used to train a model in a supervised manner. Without involving manual and expensive labor, SSL significantly reduces the cost of obtaining a large labeled training set, since only a small portion of the data needs to be labeled manually. Table 1 lists a sample of prior work that uses semi-supervised learning in multiple security tasks. Among these works, there are two representative semi-supervised learning algorithms, i.e., label propagation (Zhu and Ghahramani, 2002) and label spreading (Zhou et al., 2003). Both algorithms use a graph representation, compute data similarity between labeled and unlabeled data, and further propagate known labels through the edges of the graph to unlabeled data.

However, we observe that applications in Table 1 rarely apply hyperparameter optimization on SSL algorithms. We argue that exploring suitable hyperparameter configurations for semi-supervised learning algorithms would have an impact on the performance. To validate this argument, we propose an adaptive framework called Dapper that adopts hyperparameter optimization to control the configurations of the SSL algorithms. Beyond that, Dapper also explores the hyperparameter space of machine learning classifier (e.g., Random Forest). Furthermore, we also observe that, for some security datasets, the mixed training dataset (including labeled data and pseudo-labeled data) still suffers from class imbalance (which is a common issue in the security dataset). In this case, Dapper adaptively adds a tunable version of SMOTE (Agrawal and Menzies, 2018) to rebalance the ratio between classes in the training datasets.

We evaluate Dapper with three real-world study cases, i.e., the Twitter spam dataset (Chen et al., 2015b), the malware URLs dataset (Mamun et al., 2016) and the CIC-IDS-2017 dataset (Sharafaldin et al., 2018). Our experimental results indicate that Dapper outperforms default SSL learners and optimized SSL learners. We can use as little as 10% of the original labeled dataset and still achieve close or even better classification performance (e.g., g-measure and recall) than using 100% of the original labeled data. Based on those results, we recommend using the Dapper framework when dealing with a shortage of available labeled data for security tasks.

The remainder of this paper is organized as follows. We discuss background and related work in Section 2 and our methodology in Section 3. We then report our experiment details in Section 4, including datasets, evaluation metrics, etc. Section 5 presents our experiment results. We discuss the threats to validity in Section 6 and then we conclude in Section 7.

2. Background and Related Works

2.1. Training Security Models Requires Labeled Data

Diverse machine learning techniques have been widely applied in the cyber-security field to address wide-ranging problems such as spam detection, malware detection, and network intrusion detection. For example, in Twitter spam detection (Chen et al., 2015b), machine learning algorithms use account-based features (e.g., the number of followers or friends) or message-based features (e.g., the length of a tweet) to train useful models which are further used to predict new spamming activities. In malware detection (Mamun et al., 2016), several prior works analyze web URLs, where malicious URLs are intended for purposes such as stealing user privacy information. Security practitioners use machine learning techniques to classify malicious websites with features extracted from URLs such as URL tokens, the length of URLs, etc. Another example is network intrusion detection (Mamun et al., 2016), which endeavors to identify malicious behaviors in network traffic. Machine learning-based techniques have gained enormous popularity in this field. They learn useful features from the network traffic and classify normal and abnormal activities based on the learned patterns.

Security classification algorithms learn models from data, so insufficient training data can lead to low-quality models. But in many cases, only a small number of labeled instances are available (compared to a much larger amount of unlabeled instances). For example, in a network intrusion detection scenario, network traffic of the monitored system is continuously generated at a large volume, but abnormal traffic is scarce and available only under malicious attacks (Gharib et al., 2016).

To make good use of unlabeled data, practitioners propose methods to annotate unlabeled data involving human efforts. However, the process of data annotation faces several challenges:

  • Time-consuming and expensive. Due to the large volume of unlabeled data, the process requires substantial effort, time, and money, and is not always affordable. For example, manual labeling would cost $320K and 39,000 hours to label GitHub issues of 50 projects as buggy or non-buggy (Tu et al., 2020).

  • Require domain knowledge. A lack of professional expertise is commonly the root cause of poor label quality. In the context of security, a person who lacks knowledge of security vulnerabilities, intrusion detection, malware, etc., can hardly guarantee correct decisions when tagging new data.

Moreover, existing data annotation methods can be mainly categorized into the following types:

  • Manual. Manual data annotation works during the initial phase of a project, when the data size is small and not complicated. This method is not scalable to a large collection: annotators can become overwhelmed, which degrades label quality.

  • Crowdsourcing. Crowdsourced data annotation is a better choice than manual labeling, since it can scale with the help of platforms such as Amazon Mechanical Turk (MTurk) (Inc, 2021). However, crowdsourcing can be expensive and cannot guarantee label quality.

  • Outsourcing. This method includes employing data labeling companies in low-cost markets. With extra QA processes and other solutions, label quality can be controlled and improved. However, outsourced data annotation is still a manual process (albeit with cheaper labor).

  • Interactive learning. With methods such as active learning, rather than annotating all the data independently and simultaneously, only a fraction of the data needs to be labeled (Settles and Craven, 2008). With active learning, expert users label only the most informative data.

In summary, a drawback of all the above methods is that they are human-in-the-loop (HITL) methods. Such methods require human expertise (and humans might make mistakes). The rest of this paper explores fully automated techniques to avoid this issue.

2.2. Semi-Supervised Learning

Ground truth (i.e., labeled data) is often limited and costly to acquire when applying machine learning to security tasks. To address this problem, as shown in Table 1, some prior works in security have used semi-supervised learning. Specifically, semi-supervised learning (SSL) (Zhu and Goldberg, 2009; Zhu, 2005) is a branch of machine learning algorithms that lies between supervised learning (Hastie et al., 2009) and unsupervised learning (Celebi and Aydin, 2016). In supervised learning, the training dataset comprises only labeled data, and the goal is to learn a function that generalizes well to unseen data. Unsupervised learning only considers unlabeled data, where data points are grouped into clusters with similar properties. Semi-supervised learning combines both, using a small amount of labeled data and a large amount of unlabeled data. The use of semi-supervised learning avoids searching for labeled data or manually annotating unlabeled data. Major semi-supervised learning algorithms can be categorized into the following two groups (Silva et al., 2016):

Wrapper-Based Methods. Methods in this group use a supervised algorithm in an iterative way. During each iteration, a certain amount of unlabeled data is labeled by the learned decision function and incorporated into the training data. With the labeled data already available as well as its own predictions, the classification model is retrained for the next iteration. Two well-known representative methods in this group are self-training (Scudder, 1965) and co-training (Blum and Mitchell, 1998). In self-training, a classifier is initially trained with the small amount of labeled data and then used to classify unlabeled data. The high-confidence unlabeled samples, together with their predicted labels, are added to the training dataset. The whole process is repeated either for a fixed number of iterations or until there are no high-confidence samples left in the unlabeled data. The co-training algorithm adopts an iterative learning process similar to self-training, but combines labeled and unlabeled data under a two-view setting. Initially, co-training trains two classifiers, one on each feature subset, with the limited labeled dataset. Then unlabeled data on which the two classifiers make confident predictions are added to the labeled dataset for further training. This process repeats until a termination condition is met. The premise of co-training is that the features can be split into two subsets, each of which is conditionally independent of the other given the class and sufficient to train a classifier by itself. Both methods have drawbacks. For example, in self-training, mistakes can reinforce themselves. Co-training makes several assumptions and only works well when conditional independence holds.
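To make the wrapper-based idea concrete, here is a minimal self-training sketch in Python. The 0.9 confidence threshold, the random forest base learner, and the function name are illustrative assumptions, not choices made in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Iteratively pseudo-label high-confidence unlabeled samples."""
    clf = RandomForestClassifier(random_state=0)
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # keep only high-confidence predictions
        if not confident.any():
            break                                    # nothing confident left to add
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        # move the confident samples, with their predicted labels, into the labeled pool
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
    return clf
```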

Graph-Based Methods. Graph-based methods (Van Engelen and Hoos, 2020) create a graph that connects instances in the training dataset and propagate labels from labeled data to unlabeled data through the edges of the graph. This process typically involves computing similarities between data instances. The geometry of the dataset can be represented by an empirical graph $G = (V, E)$, where the nodes denote the training data and the edges represent the similarity or affinity between adjacent nodes. Labels assigned to the nodes in the graph can propagate along the edges to their connected nodes. The assumption behind the approach is that nodes connected by strong edges are more likely to share the same label. Two representative graph-based algorithms are label propagation (Zhu and Ghahramani, 2002) and label spreading (Zhou et al., 2003), which we introduce in detail in Section 3.1 and Section 3.2. Compared with other semi-supervised learning algorithms, graph-based methods are fast and easy to use due to their linear time complexity, and are therefore explored in this study.

2.3. Smote

Security data commonly suffer from class imbalance issues. Many prior works propose methods to rebalance the ratio between classes to address this concern. SMOTE (Chawla et al., 2002) (i.e., Synthetic Minority Oversampling Technique) is a widely used oversampling technique that works by randomly selecting samples from the minority class and choosing nearest neighbors for each chosen sample. A synthetic instance is created at a randomly selected point between each pair of chosen sample and neighbor. The synthetic samples are added to the original dataset to balance the ratio between the majority and minority classes. Agrawal et al. proposed an auto-tuning version of SMOTE, called SMOTUNED (Agrawal and Menzies, 2018). SMOTUNED adjusts several key parameters of SMOTE such as k (the number of neighbors selected), m (the number of synthetic samples to create), and r (the power parameter for the Minkowski distance metric). SMOTUNED applies an evolutionary algorithm called differential evolution (Storn and Price, 1997) as the optimizer to explore SMOTE's parameter space. Our study uses SMOTUNED as our oversampling technique, but with a more novel optimizer introduced in the following subsection.
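A minimal SMOTE-style oversampler is sketched below to make the three tuned parameters concrete. The helper name and loop structure are our own illustration, not the SMOTUNED implementation used in this paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, k=5, m=100, r=2, seed=0):
    """Create m synthetic minority samples by interpolating between a sampled
    minority instance and one of its k nearest neighbours, using a Minkowski
    distance with power parameter r."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="minkowski", p=r).fit(X_min)
    synthetic = []
    for _ in range(m):
        i = rng.integers(len(X_min))
        # neighbours of X_min[i]; index 0 is the point itself, so skip it
        neigh_idx = nn.kneighbors(X_min[i:i + 1], return_distance=False)[0][1:]
        j = rng.choice(neigh_idx)
        gap = rng.random()                         # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```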

Figure 1. An overview of the architecture of Dapper framework.

2.4. Hyperparameter Optimization

A hyperparameter is a type of parameter in machine learning models that cannot be estimated from the data and has to be set before model training (Yang and Shami, 2020). Typically a hyperparameter has a known effect on a model in the general sense, but it is not clear how to best set it for a given dataset. In this sense, a range of possibilities has to be explored. Hyperparameter optimization, or hyperparameter tuning, is a technique that explores a range of hyperparameters and searches for the optimal configuration for a task.

Existing hyperparameter optimization methods can be mainly categorized into the following groups (Yang and Shami, 2020). The first group is decision-theoretic methods, which are based on the concept of defining a search space and selecting combinations within it. The most common methods of this type are grid search and random search. Grid search (GS) (Bergstra et al., 2011) defines a search space as a grid of hyperparameter values and exhaustively evaluates every position in the grid. Random search (RS) (Bergstra and Bengio, 2012) defines a search space as a bounded domain of hyperparameter values and randomly samples points in that domain. Decision-theoretic methods are a good choice when the search space is small and not complicated.
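As a small illustration of these decision-theoretic methods, the sketch below sets up scikit-learn's grid and random search for a random forest; the parameter grid is an arbitrary example rather than the space used in this study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}

# Grid search: exhaustively evaluates every combination in the grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)

# Random search: samples a fixed number of combinations from the same space.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=3, random_state=0)
# grid.fit(X_train, y_train); rand.fit(X_train, y_train)
```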

Metaheuristic algorithms such as genetic algorithms and particle swarm optimization belong to the second group. Genetic algorithms (GA) (Lessmann et al., 2005) detect well-performing hyperparameter combinations in each generation and pass them to the next generation until the optimal combination is found. In particle swarm optimization (PSO) (Lorenzo et al., 2017), each particle communicates with other particles to detect and update the current global optimum in each iteration until the final optimum is found. Metaheuristic algorithms are time-consuming when the search space is large, since they are computationally expensive.

Unlike the previous groups of methods, Bayesian optimization (Snoek et al., 2012; Shahriari et al., 2015) is a hyperparameter optimization technique that keeps track of past evaluation results. The principle of Bayesian optimization is to use those results to build a probability model of the objective function, which maps hyperparameters to a probability of a score on the objective function, and to use this model to select the most promising hyperparameters to evaluate on the true objective function. This method is also called Sequential Model-Based Optimization (SMBO) (Hutter et al., 2011). The probability representation of the objective function is called the surrogate function or response surface because it is a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. The surrogate function is much easier to optimize than the objective function, and Bayesian methods work by selecting the hyperparameters that perform best on the surrogate function as the next set to evaluate on the actual objective function. The surrogate probability model is continually updated after each evaluation of the objective function.
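Since this paper's optimizer comes from the hyperopt library (see Section 6), a minimal sketch of this style of sequential model-based optimization follows; the one-dimensional toy objective is a stand-in for a real training pipeline.

```python
from hyperopt import fmin, tpe, hp, Trials

# Toy objective: hyperopt minimizes the returned loss, so a real pipeline
# would train and validate a model here and return e.g. 1 - g_measure.
def objective(params):
    x = params["x"]
    return (x - 3.0) ** 2

space = {"x": hp.uniform("x", -10, 10)}
trials = Trials()                       # keeps track of past evaluations
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)                             # best hyperparameters found so far
```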

3. Methodology

Based on the above discussion, we were motivated to address our security problems using a combination of semi-supervised learning and hyperparameter optimization (and SMOTE when the data is imbalanced). The basis of our framework about semi-supervised learning is an idea called pseudo-labeling (Lee et al., 2013). Pseudo-labeling works by iteratively propagating labels from labeled data to unlabeled data, i.e., relabeling the unlabeled data with algorithms. Our framework involves two pseudo-labeling approaches, i.e., label propagation and label spreading, which have been the de facto standard method of the inference phase in graph-based semi-supervised learning. We first introduce both algorithms in detail and then present the proposed framework.

3.1. Label Propagation

The label propagation (LP) algorithm (Zhu and Ghahramani, 2002) is analogous to the k-Nearest-Neighbours algorithm and assumes that data points close to each other tend to have similar labels. To be specific, LP is an iterative algorithm that computes soft label assignments by pushing the estimated label at each node to its neighbouring nodes based on the edge weights. In other words, the new estimated label at each node is calculated as the weighted sum of the labels of its neighbours.

More formally, consider a labeled set $D_L = \{(x_1, y_1), \dots, (x_l, y_l)\}$ and an unlabeled set $D_U = \{x_{l+1}, \dots, x_{l+u}\}$, where in a binary classification problem $y_i \in \{0, 1\}$ and typically $l \ll u$. LP constructs a graph $G = (V, E)$, where $V$ is the set of vertices representing the instances in $D_L$ and $D_U$, and each edge in $E$ represents the similarity of two nodes $i$ and $j$ with weight $w_{ij}$. The weight is computed so that two nodes with a smaller distance (i.e., more similar nodes) receive a larger weight. Moreover, a transition matrix $T$ on $G$, defined following (Zhu and Ghahramani, 2002) as

$$T_{ij} = P(j \rightarrow i) = \frac{w_{ij}}{\sum_{k=1}^{l+u} w_{kj}}, \qquad (1)$$

is used to propagate the labels.

Two steps are repeated in this algorithm until the label assignment process converges. LP starts with an initial assignment, which is random for the unlabeled data points and equal to the true labels for the labeled data points. LP then propagates labels from each node to its neighbouring nodes and resets the predictions of the labeled data points to their true labels. LP finally converges to a harmonic function, and the process can also be interpreted as a random walk with the transition matrix $T$ that stops when a labeled node is hit.
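The loop just described can be written in a few lines of NumPy (RBF weights, a normalized transition matrix, then iterate and clamp). This is a didactic sketch of the textbook algorithm rather than the scikit-learn implementation used later; for readability it row-normalizes the weights so each node takes the weighted average of its neighbours' labels.

```python
import numpy as np

def label_propagation(X, y, gamma=20.0, max_iter=1000, tol=1e-6):
    """y uses -1 for unlabeled points and {0, 1} for labeled ones."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-gamma * d2)                                # RBF edge weights
    T = W / W.sum(axis=1, keepdims=True)                   # row-normalized: each row sums to 1
    labeled = y != -1
    Y = np.zeros((n, 2))
    Y[labeled, y[labeled]] = 1.0                           # one-hot labels for labeled nodes
    for _ in range(max_iter):
        Y_new = T @ Y                                      # propagate along the edges
        Y_new[labeled] = 0.0
        Y_new[labeled, y[labeled]] = 1.0                   # clamp labeled nodes to the truth
        if np.abs(Y_new - Y).max() < tol:
            break
        Y = Y_new
    return Y.argmax(axis=1)                                # pseudo-labels for every node
```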

3.2. Label Spreading

The label spreading (LS) algorithm (Zhou et al., 2003) is a variant of the label propagation algorithm. LS aims at minimizing a loss function with regularization properties, which makes it more robust to noisy data. Instead of using the Laplacian transition matrix for propagation, LS uses the normalized Laplacian matrix. A second difference is the clamping effect LS has on the label distribution: clamping allows LS to adjust the weight of the ground-truth labeled data to some degree, rather than using the hard clamping of input labels in LP.
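Both algorithms are available in scikit-learn, which is the implementation adopted later in this paper (Section 4.1). A minimal usage sketch follows; the hyperparameter values are arbitrary picks from the ranges in Table 2, and the random data is a stand-in.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# y mixes true labels {0, 1} for labeled rows with -1 for unlabeled rows,
# which is scikit-learn's convention for "unlabeled".
X = np.random.rand(100, 5)
y = np.full(100, -1)
y[:10] = np.array([0, 1] * 5)

lp = LabelPropagation(kernel="rbf", gamma=20, max_iter=1000).fit(X, y)
ls = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2,
                    max_iter=1000).fit(X, y)

pseudo_labels = ls.transduction_     # labels inferred for every training row
```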

3.3. Dapper Framework

Figure 1 presents the framework of Dapper. We split the original dataset into three parts, the training dataset, the validation dataset and the testing dataset, with a predefined ratio. The training dataset is further split into two subsets with another ratio (the label rate). One subset is treated as the labeled training dataset. We remove all the actual labels of the other subset, reset them to a placeholder value (-1, following scikit-learn's convention for unlabeled samples), and treat this subset as the unlabeled training dataset.

The optimization module in the Dapper framework has two sub-modules (as the blue dashed box shows in Figure 1): the semi-supervised learning algorithm and the machine learning classifier. Our framework is also adaptive, which means that when Dapper detects that the input training dataset is highly imbalanced (i.e., the percentage of minority class samples is lower than a predefined threshold t), Dapper automatically adds an oversampling sub-module. This oversampling sub-module is based on the tuned version of SMOTE discussed in Section 2.3. The optimization process in the framework proceeds with a fixed number of evaluation trials, as shown in Algorithm 1.

Function Dapper
Input : Training dataset D_train,
        Validation dataset D_val,
        SSL ratio r (label rate),
        Imbalance threshold t,
        Hyperparameter space H,
        Target function f
Output : Optimized model
1  Split the training dataset D_train into two subsets with ratio r, treated as the labeled dataset D_L and the unlabeled dataset D_U
2  Reset the labels of the unlabeled dataset D_U to the placeholder value -1
3  for the number of Bayesian Optimization trials do
4      Sample a combined hyperparameter set h from H
5      Run the SSL learner with h and assign pseudo-labels to D_U
6      Concatenate the training subsets into D_mix with actual labels and pseudo-labels
7      if minority class percentage < t then
8          Rebalance D_mix with SMOTE configured by h
9      Train the classifier on D_mix with h
10     Evaluate the trained classifier on D_val
11     Compute the loss towards the target function f
12 Rank all optimization trials by loss, with the smallest on top
13 return the optimized model
Algorithm 1. Pseudocode of Dapper's optimization process.
Item Hyperparameter Range Brief Description
Label Propagation  kernel  'knn', 'rbf'  Kernel function to use.
                   gamma  (10, 30)  Parameter for the rbf kernel.
                   n_neighbors  (5, 15)  Parameter for the knn kernel.
                   max_iter  (500, 1500)  The maximum number of iterations allowed.
Label Spreading    kernel  'knn', 'rbf'  Kernel function to use.
                   gamma  (10, 30)  Parameter for the rbf kernel.
                   n_neighbors  (5, 15)  Parameter for the knn kernel.
                   alpha  (0.1, 0.9)  Clamping factor.
                   max_iter  (500, 1500)  The maximum number of iterations allowed.
SMOTE              k  [1, 20]  Number of neighbours.
                   r  [1, 6]  Power parameter of the Minkowski distance metric.
                   m  [50, 500]  Number of synthetic samples.
Random Forest      n_estimators  [50, 200]  The number of trees in the forest.
                   min_samples_leaf  [1, 25]  The minimum number of samples required to be at a leaf node.
                   min_samples_split  [2, 25]  The minimum number of samples required to split an internal node.
                   max_leaf_nodes  [2, 100]  The maximum number of leaf nodes in a tree.
                   max_depth  [1, 25]  The maximum depth of the tree.
                   max_features  'auto', 'sqrt', 'log2'  The number of features to consider when looking for the best split.
                   bootstrap  'True', 'False'  Whether bootstrap samples are used when building trees.
Imbalance Threshold  t  30%  A threshold to control whether SMOTE is used or not.
Table 2. Hyperparameter space explored in this study. This space covers two semi-supervised learning algorithms, SMOTE, the Random Forest classifier and the imbalance threshold.

During each trial, we repeat the following steps:

  1. With a chosen SSL learner (label propagation or label spreading), as well as a sampled combined set of hyperparameters, we assign pseudo-labels to the unlabeled dataset.

  2. We concatenate the original labeled dataset and pseudo-labeled dataset into a new training dataset.

  3. If the percentage of the minority class samples of training dataset is lower than a threshold, we use SMOTE (with sampled hyperparameters) to balance the class ratio of the new training dataset. Otherwise, we pass this step.

  4. We train a classifier (with sampled hyperparameters) with the new training dataset, and evaluate the trained classifier on the validation dataset.

  5. The loss value of each trial, i.e., the complement of the g-measure, is logged.

After the optimization process, we rank all the evaluated classifiers by their loss value. The one with the smallest loss is selected and further tested on the testing dataset. Note that we sample the hyperparameters of the SSL learner, SMOTE, and the classifier in a combined manner, which lets us address multiple optimization problems simultaneously. Bayesian Optimization directs the whole sampling process and, after each trial, searches for the next promising set of hyperparameters from the space in Table 2.
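To make Algorithm 1 and this combined sampling concrete, here is a compact sketch of one Dapper-style optimization loop built on hyperopt and scikit-learn. The search space below is a truncated subset of Table 2, the helper names are ours, class 1 is assumed to be the minority class, and the rebalancing step uses imbalanced-learn's SMOTE as a stand-in for the from-scratch SMOTUNED used in the paper.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import confusion_matrix

def g_measure(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd_, pf = tp / (tp + fn), fp / (fp + tn)
    return 2 * pd_ * (1 - pf) / (pd_ + (1 - pf))

def make_objective(X_lab, y_lab, X_unlab, X_val, y_val, threshold=0.30):
    def objective(p):
        # 1) assign pseudo-labels with a tuned SSL learner
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, np.full(len(X_unlab), -1)])
        ssl = LabelSpreading(kernel="rbf", gamma=p["gamma"],
                             alpha=p["alpha"]).fit(X, y)
        y_mix = ssl.transduction_                 # actual + pseudo labels
        # 2) rebalance if the (assumed) minority class 1 falls under the threshold
        if np.mean(y_mix == 1) < threshold:
            X, y_mix = SMOTE(k_neighbors=int(p["k"])).fit_resample(X, y_mix)
        # 3) train a tuned classifier and score it on the validation set
        clf = RandomForestClassifier(n_estimators=int(p["n_estimators"]),
                                     max_depth=int(p["max_depth"]),
                                     random_state=0).fit(X, y_mix)
        return 1.0 - g_measure(y_val, clf.predict(X_val))   # loss = 1 - g
    return objective

space = {"gamma": hp.uniform("gamma", 10, 30),
         "alpha": hp.uniform("alpha", 0.1, 0.9),
         "k": hp.quniform("k", 1, 20, 1),
         "n_estimators": hp.quniform("n_estimators", 50, 200, 1),
         "max_depth": hp.quniform("max_depth", 1, 25, 1)}

# best = fmin(make_objective(X_lab, y_lab, X_unlab, X_val, y_val),
#             space, algo=tpe.suggest, max_evals=100, trials=Trials())
```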

Besides, the reasons we design the Dapper framework in an adaptive manner to address the class imbalance issue are twofold. Firstly, data class imbalance is common in most security datasets, and many prior studies hint that oversampling the dataset is more likely to produce better classification performance. Secondly, we hypothesize that standard semi-supervised learning algorithms do not adequately address the class imbalance issue. This means, in the mixed dataset with original labeled data and pseudo-labeled data, the problem still remains. Our experimental results further confirm the hypothesis, which we will discuss in detail in the result section.

In order to endorse the merits of Dapper (i.e., solving multiple optimization problems), we also compare Dapper with two other treatments:

  1. No optimization: with default SSL learner (default LP or default LS);

  2. Single optimization: with optimized SSL learner only (optimized LP or optimized LS).

The first treatment is used to endorse the merit of hyperparameter optimization, while the second treatment is used to demonstrate the advantages of Dapper over tuning SSL learners only. Moreover, our study performs a sensitivity experiment, in which we pick different values of the label rate and explore the performance under each ratio.

4. Experiment

4.1. Datasets and Algorithm

Our proposed Dapper framework is evaluated with three security datasets which cover different security tasks, such as spam detection, malware detection and network intrusion detection.

Twitter Spam (Chen et al., 2015b). As spam on Twitter becomes a growing problem, researchers have adopted different machine learning algorithms to detect Twitter spam. This dataset is generated from over 600 million public tweets, of which around 6.5 million spam tweets were labeled, with 12 features extracted. The ground truth is established with Trend Micro's Web Reputation Service, which identifies malicious tweets through their URLs. We sample a total of 5,000 instances from the prior work (Chen et al., 2015b) (which has a size of about 100k), with 4,758 non-spam tweets and 242 spam tweets. The dataset has 12 features, such as the account age, the number of followers of the Twitter user, the number of tweets the user has sent, etc.

Malware URLs (Mamun et al., 2016). The original dataset initially collects about 114,400 URLs, containing benign and malicious URLs in four categories: spam URLs, phishing URLs, website URLs distributing malware, and defacement URLs whose pages belong to trusted but compromised sites. This work selects the malware URLs as our experimental target. In the selected dataset, more than 11,500 URLs related to malware websites were obtained from DNS-BH, a project that maintains a list of malware sites. There are 7,781 benign URLs and 6,711 malicious URLs in the dataset, and 79 features such as the ratio of arguments to URLs, the token count, the proportion of digits in the URL parts, etc.

CIC-IDS-2017 (Sharafaldin et al., 2018). This dataset consists of labeled network flows. It comprises both normal traffic and simulated abnormal data caused by intentional attacks on a test network. The dataset was constructed using the NetFlowMeter Network Traffic Flow analyzer, which collected multiple network traffic features and supported bi-directional flows. We sample a portion of the original dataset, which includes 11,425 normal and 2,714 abnormal traffic flows. The dataset has 70 features, including the average packet size, mean packet length, total forward packets, etc.

Dataset          Training Set   Validation Set   Testing Set   Imbalance Rate
Twitter Spam     3,200          800              1,000         4.84%
Malicious URLs   9,274          2,319            2,899         46.3%
CIC-IDS-2017     9,048          2,263            2,828         19.2%
Table 3. Details of the studied dataset.

As we show in Table 3, each dataset is split into training set, validation set and testing set with a ratio of 6.4 : 1.6 : 2 in a stratified way. To simulate the case of semi-supervised learning, we further divide the training set into labeled data and unlabeled data with different ratios in our sensitivity experiment. Another observation from Table 3 is the imbalance rate, where two datasets suffer from class imbalance issues.
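A sketch of the split described above (6.4 : 1.6 : 2, followed by the label-rate split inside the training set) using scikit-learn; the 10% label rate is just one of the settings explored in the sensitivity experiment, and the random data is a stand-in.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 12)                # stand-in feature matrix
y = np.random.randint(0, 2, 1000)           # stand-in binary labels

# 64% train, 16% validation, 20% test, stratified on the class label
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=0)

# inside the training set: keep e.g. 10% as labeled, treat the rest as unlabeled
X_lab, X_unlab, y_lab, _ = train_test_split(
    X_train, y_train, train_size=0.10, stratify=y_train, random_state=0)
```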

We select the random forest classifier as our machine learning algorithm throughout the whole experiment. Random forest is an ensemble learning method that consists of multiple decision trees. The 'forest' generated by the algorithm is trained through bagging (bootstrap aggregating). Random forest establishes the outcome by aggregating the predictions of the individual decision trees. There are two reasons why we select random forest. Firstly, random forest is commonly used as a classifier in previous security tasks such as intrusion detection (Resende and Drummond, 2018). Secondly, the implementation of random forest in the Scikit-learn machine learning library (scikit-learn developers, 2021) provides multiple hyperparameters that can be tuned (as can be seen from Table 2).

Furthermore, the implementations of both the label propagation algorithm and the label spreading algorithm we adopt are publicly available in Scikit-learn, and the auto-tuned version of SMOTE is implemented following (Agrawal and Menzies, 2018).

4.2. Evaluation Metrics

If we let TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives (respectively), then recall (pd), false positive rate (pf), g-measure (g-score), precision (prec), and f-measure (f1) are defined as follows:

$$\text{pd} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{pf} = \frac{FP}{FP + TN} \qquad (3)$$

$$\text{g-measure} = \frac{2 \cdot \text{pd} \cdot (1 - \text{pf})}{\text{pd} + (1 - \text{pf})} \qquad (4)$$

$$\text{prec} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{f1} = \frac{2 \cdot \text{prec} \cdot \text{pd}}{\text{prec} + \text{pd}} \qquad (6)$$

where 1) recall represents the ability of an algorithm to identify instances of the positive class in the given dataset; 2) false positive rate measures the instances that are falsely classified as positive by an algorithm but are actually negative; 3) g-measure is the harmonic mean of recall and the complement of the false positive rate, and it is also our optimization goal. We also report the AUC-ROC value (i.e., Area Under the Receiver Operating Characteristic curve) for completeness of the results. This metric tells how well a model can distinguish between classes; the higher, the better. In the worst case, the value is 0.5, which means the model has no ability to discriminate between the positive and negative classes. Note that this study does not focus on other metrics such as precision or accuracy, as these metrics fail to demonstrate the ability of a model under imbalanced classification. We also report f-measure for completeness.
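A small helper showing how these metrics can be computed from a confusion matrix; the function name is ours, and y_score is the predicted probability of the positive class.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def security_metrics(y_true, y_pred, y_score=None):
    """Recall (pd), false positive rate (pf), g-measure, precision, f1, AUC-ROC."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd_ = tp / (tp + fn)                        # recall
    pf = fp / (fp + tn)                         # false positive rate
    g = 2 * pd_ * (1 - pf) / (pd_ + (1 - pf))   # harmonic mean of pd and 1 - pf
    prec = tp / (tp + fp)
    f1 = 2 * prec * pd_ / (prec + pd_)
    auc = roc_auc_score(y_true, y_score) if y_score is not None else None
    return {"pd": pd_, "pf": pf, "g": g, "prec": prec, "f1": f1, "auc": auc}
```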

5. Evaluation Results

(a) Twitter Spam
Metric                  Prior Work (Chen et al., 2015b)   100% labeled data used   Dapper
Recall                  92.9                              58.3                     85.4
False Positive Rate     7.1                               0.1                      9.7
G-measure               N/A                               73.6                     87.8
AUC-ROC                 N/A                               79.1                     87.8
F-measure               56.6                              72.7                     45.1
Size of Training Data   A portion of over 100k            3,200                    320

(b) Malware URLs
Metric                  Prior Work (Mamun et al., 2016)   100% labeled data used   Dapper
Recall                  99.0                              99.2                     94.0
False Positive Rate     N/A                               0.4                      0.6
G-measure               N/A                               99.4                     96.6
AUC-ROC                 N/A                               99.4                     96.7
F-measure               99.0                              99.3                     96.6
Size of Training Data   9,274                             9,274                    927

(c) CIC-IDS-2017
Metric                  Prior Work (Sharafaldin et al., 2018)   100% labeled data used   Dapper
Recall                  97.0                                    98.3                     96.3
False Positive Rate     N/A                                     0.3                      6.3
G-measure               N/A                                     99.0                     95.0
AUC-ROC                 N/A                                     99.0                     95.0
F-measure               97.0                                    98.6                     86.5
Size of Training Data   A portion of over 2.8m                  9,048                    904

Table 4. Summary results of 1) the prior work which published each dataset; 2) 100% of the training data used in a supervised way; and 3) 10% of the training dataset used with Dapper on label spreading. All results are based on the random forest classifier.
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 52.1 50.0 50.0 27.1 10.4 2.1 2.1 0.0 0.0
Optimized LP 52.1 60.4 60.4 68.8 60.4 52.1 33.3 8.3 0.0
Dapper + LP 87.5 83.3 85.4 81.3 85.4 83.3 83.3 83.3 72.9
Default LS 58.3 60.4 56.3 68.8 66.7 68.8 50.0 43.8 29.2
Optimized LS 58.3 62.5 58.3 68.8 68.8 68.8 66.7 52.1 39.6
Dapper + LS 87.5 87.5 85.4 87.5 87.5 85.4 85.4 85.4 85.4
(a) Recall
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 68.4 66.7 66.7 42.6 18.9 4.1 4.1 0.0 0.0
Optimized LP 68.4 75.3 75.3 81.4 75.3 68.4 50.0 15.4 0.0
Dapper + LP 89.1 87.3 88.7 86.7 89.2 89.4 86.0 83.1 81.4
Default LS 73.7 75.3 72.0 81.4 79.9 81.4 66.6 60.8 45.2
Optimized LS 73.7 76.9 73.7 81.4 81.4 81.4 79.9 68.4 56.7
Dapper + LS 89.1 89.1 89.3 88.3 91.1 89.1 88.9 85.3 87.8
(b) G-Measure
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 76.0 75.0 75.0 63.5 55.2 51.0 51.0 50.0 50.0
Optimized LP 76.0 80.2 80.1 84.3 80.1 75.9 66.7 54.2 50.0
Dapper + LP 89.1 87.5 88.8 87.1 89.4 89.9 86.0 83.1 82.5
Default LS 79.2 80.2 78.1 84.3 83.2 84.3 74.9 71.7 64.5
Optimized LS 79.2 81.3 79.2 84.3 84.2 84.3 83.2 75.8 69.7
Dapper + LS 89.1 89.1 89.5 88.3 91.2 89.3 89.1 85.3 87.8
(c) AUC-ROC
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Optimized LP 0.0 0.0 0.2 0.2 0.2 0.2 0.0 0.0 0.0
Dapper + LP 9.2 8.3 7.7 7.0 6.6 3.6 11.2 17.2 8.0
Default LS 0.0 0.0 0.1 0.2 0.3 0.2 0.1 0.3 0.1
Optimized LS 0.0 0.0 0.0 0.2 0.3 0.3 0.3 0.5 0.2
Dapper + LS 9.2 9.2 6.5 10.9 5.0 6.8 7.2 14.8 9.7
(d) False Positive Rate
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 68.5 66.7 66.7 42.6 18.9 4.1 4.1 0.0 0.0
Optimized LP 68.5 75.3 73.4 79.5 73.4 66.7 0.5 15.4 0.0
Dapper + LP 47.2 47.9 50.3 50.6 53.9 65.6 41.0 31.7 44.0
Default LS 73.7 75.3 71.1 79.5 77.1 79.5 65.8 58.3 44.4
Optimized LS 73.7 76.9 73.7 79.5 78.6 78.7 77.1 64.1 55.1
Dapper + LS 47.2 47.2 54.3 43.3 60.9 53.2 51.9 35.7 45.1
(e) F-Measure
Table 5. Results of the Twitter Spam dataset. The best results of each metric from different treatments in each label rate are highlighted in blue color. LP,LS= label propagation and label spreading (described in §3.1 and §3.2).
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 96.3 91.2 84.4 70.3 48.7 26.5 11.5 5.7 1.8
Optimized LP 98.3 98.1 98.1 97.2 97.3 94.5 76.2 66.9 48.1
Dapper + LP 98.4 98.2 98.1 97.5 97.1 97.1 79.2 69.1 49.2
Default LS 98.2 98.3 98.3 98.1 97.8 97.2 96.3 95.2 91.2
Optimized LS 99.0 98.6 98.0 97.9 97.6 97.5 97.7 95.8 91.2
Dapper + LS 99.0 98.8 98.5 98.4 97.8 98.0 97.1 96.3 94.0
(a) Recall
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.9 95.3 91.5 82.6 65.5 41.8 20.6 10.7 3.5
Optimized LP 98.1 98.7 98.4 97.7 97.9 96.6 86.4 80.1 64.9
Dapper + LP 98.8 98.7 98.5 98.3 98.0 97.7 88.2 81.7 65.9
Default LS 98.7 98.7 98.7 98.6 98.2 98.3 97.8 97.2 94.9
Optimized LS 99.2 99.1 98.8 98.7 98.4 98.3 97.9 97.4 94.9
Dapper + LS 99.3 99.2 98.9 98.7 98.6 98.4 98.1 97.7 96.6
(b) G-Measure
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.9 95.5 92.1 85.1 74.4 63.2 55.7 52.8 50.9
Optimized LP 98.1 98.7 98.4 97.7 97.9 96.7 88.0 83.4 74.0
Dapper + LP 98.8 98.7 98.5 98.3 98.0 97.8 89.4 84.5 74.5
Default LS 98.7 98.7 98.7 98.6 98.2 98.3 97.8 97.2 95.0
Optimized LS 99.2 99.1 98.8 98.7 98.4 98.3 97.9 97.5 95.0
Dapper + LS 99.3 99.2 98.9 98.7 98.6 98.4 98.1 97.7 96.7
(c) AUC-ROC
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 0.3 0.2 0.1 0.0 0.0 0.0 0.0 0.1 0.0
Optimized LP 0.6 0.6 1.1 0.7 1.4 1.1 0.2 0.2 0.1
Dapper + LP 0.7 0.8 0.9 0.9 1.1 1.6 0.4 0.1 0.1
Default LS 0.9 0.8 0.9 1.0 1.3 0.6 0.7 0.8 1.1
Optimized LS 0.5 0.4 0.4 0.5 0.7 0.8 1.8 0.8 1.2
Dapper + LS 0.5 0.4 0.6 1.0 0.6 1.2 0.9 0.9 0.6
(d) False Positive Rate
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.8 95.3 91.4 82.6 65.5 41.8 20.6 10.7 3.5
Optimized LP 98.0 98.7 98.4 97.6 97.8 96.5 86.4 80.1 64.9
Dapper + LP 98.7 98.6 98.5 98.2 97.9 97.6 88.1 81.7 65.9
Default LS 98.6 98.7 98.6 98.5 98.1 98.2 97.7 97.1 94.7
Optimized LS 99.2 99.0 98.8 98.6 98.3 98.2 97.8 97.3 94.7
Dapper + LS 99.2 99.2 98.9 98.6 98.5 98.3 98.0 97.6 96.6
(e) F-Measure
Table 6. Results of the Malware URLs dataset. The best results of each metric from different treatments in each label rate are highlighted in blue color. LP,LS= label propagation and label spreading (described in §3.1 and §3.2).
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 95.9 90.1 80.1 67.6 50.3 28.9 17.3 8.1 3.3
Optimized LP 97.6 97.4 97.4 95.7 94.6 90.8 82.1 42.3 19.3
Dapper + LP 99.4 99.8 99.6 99.1 98.9 97.6 97.6 96.8 95.0
Default LS 98.1 97.2 97.2 95.8 95.0 93.0 91.7 90.2 87.3
Optimized LS 98.3 97.2 97.4 96.5 96.5 95.2 94.1 93.2 87.8
Dapper + LS 98.1 99.4 99.2 97.6 99.4 98.5 98.0 97.4 96.3
(a) Recall
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.9 94.7 88.9 80.7 66.9 44.8 29.5 15.0 6.4
Optimized LP 98.7 98.5 98.4 97.6 96.7 94.6 89.5 59.5 32.4
Dapper + LP 98.1 98.4 98.5 97.9 97.9 98.1 98.3 97.6 96.3
Default LS 98.9 98.4 98.3 97.4 96.8 95.8 95.0 93.9 92.0
Optimized LS 99.1 98.4 98.4 97.9 97.8 97.1 96.4 95.7 92.5
Dapper + LS 97.2 96.9 97.2 96.5 97.5 97.4 96.5 96.5 95.0
(b) G-Measure
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.9 95.0 90.0 83.8 75.1 64.4 58.6 54.1 51.6
Optimized LP 98.7 98.5 98.4 97.6 96.7 94.8 90.3 71.0 59.6
Dapper + LP 98.1 98.4 98.5 97.9 97.9 98.1 98.3 97.6 96.3
Default LS 98.9 98.4 98.3 97.5 96.8 95.8 95.1 94.1 92.3
Optimized LS 99.1 98.4 98.4 97.9 97.8 97.1 96.4 95.7 92.8
Dapper + LS 97.2 97.0 97.3 96.6 97.5 97.5 96.5 96.5 95.0
(c) AUC-ROC
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Optimized LP 0.2 0.4 0.5 0.5 1.2 1.2 1.6 0.3 0.2
Dapper + LP 3.2 2.9 2.6 3.3 3.0 1.3 0.9 1.6 2.4
Default LS 0.2 0.4 0.6 0.8 1.3 1.3 1.5 2.1 2.7
Optimized LS 0.2 0.4 0.6 0.5 0.8 1.0 1.2 1.7 2.2
Dapper + LS 3.7 5.4 4.7 4.5 4.4 3.6 4.8 4.5 6.3
(d) False Positive Rate
Label Rate 90% 80% 70% 60% 50% 40% 30% 20% 10%
Default LP 97.7 94.5 88.9 80.7 66.9 44.8 29.5 15.0 6.4
Optimized LP 98.2 97.9 97.6 96.7 94.7 92.7 87.0 59.1 32.2
Dapper + LP 93.4 94.1 94.7 93.0 93.5 96.1 96.8 95.2 92.6
Default LS 98.7 97.8 97.2 96.2 94.8 93.8 92.6 90.6 87.9
Optimized LS 98.8 97.8 97.4 97.1 96.5 95.5 94.5 93.0 89.2
Dapper + LS 91.9 89.5 90.7 90.2 91.3 92.2 89.7 90.1 86.5
(e) F-Measure
Table 7. Results of the CIC-IDS-2017 dataset. The best results of each metric from different treatments in each label rate are highlighted in blue color. LP,LS= label propagation and label spreading (described in §3.1 and §3.2).

Our study is structured around the following research questions:

RQ1. Can we use less labeled training data with default SSL?

This research question explores whether we can use less labeled training data than supervised learning when using default semi-supervised learning algorithms. Specifically, we try to build predictors using only a fraction (from 90% down to 10%) of the labeled data.

To answer this question, we first present results from two baseline treatments: 1) results from the prior works which published and evaluated the original datasets; 2) results from our sampled datasets with 100% of the labeled training data used in training. Table 4 presents the baseline results, and all results come from the same classifier, i.e., random forest. There are several notes about the results: 1) Some prior works did not report results for some metrics, e.g., g-measure, which are denoted as N/A in the table; 2) Some prior works did not present the details of their data split design, so we do not know the exact size of the training dataset used to obtain those results and denote it as a portion of the whole original data size; 3) All the results are from the same machine learning algorithm, i.e., random forest, and the summary results indicate that all the datasets are able to achieve good performance, e.g., about 90% recall, except the sampled dataset from Twitter spam, which can only achieve about 60% recall with 100% labeled training data.

Table 5, Table 6 and Table 7 present the results of the Twitter spam dataset, the malware URLs dataset and the CIC-IDS-2017 dataset, respectively. Each table reports the evaluation metrics defined in Section 4.2. Note that we use different label rates for the training dataset and report the results for each label rate.

There are several observations from these results:

  1. For each dataset, with the default label propagation (Default LP) algorithm, the recall results show a decreasing trend as the label rate decreases. When the label rate is as low as 10%, the recall of label propagation is close to zero.

  2. The results of default label spreading (Default LS) vary across datasets. For example, for Twitter spam in Table 5, the label spreading algorithm achieves as little as half the recall of supervised learning with 100% labeled training data in Table 4. For the other two datasets, the recall results are much better: we can achieve about 90% of the original recall performance even when the label rate is as low as 10%.

RQ1 A default semi-supervised learning algorithm such as label propagation cannot achieve ideal performance when the label rate is low. Label spreading performs much better than label propagation, but still cannot match the supervised learning results on some datasets.

RQ2. Will hyperparameter optimization on SSL help improve the results?

We also present the results of all datasets with optimized semi-supervised learning algorithms. In this treatment, we only apply Bayesian optimization to search the hyperparameter space of label propagation and label spreading; we use the default random forest algorithm and do not use SMOTE.

The results of using optimized semi-supervised learning algorithms lead to several observations:

  1. For the Twitter spam dataset, the optimized label propagation algorithm (Optimized LP) cannot improve the recall when the label rate is as low as 10%, while the optimized label spreading algorithm (Optimized LS) improves the recall from 29.2% to 39.6%, which is still below the 58.3% recall achieved with 100% of the labeled dataset, and far below the 92.9% recall reported in prior work (Chen et al., 2015b).

  2. For the malware URLs dataset and the CIC-IDS-2017 dataset, the improvement over default label propagation (Default LP) is obvious, but still not enough. For the label spreading algorithm, however, the improvement from optimization is slight, since the default label spreading algorithm (Default LS) already achieves high performance.

  3. Furthermore, comparing optimized label propagation and optimized label spreading across all result tables, the optimized label spreading algorithm clearly outperforms optimized label propagation, especially when the label rate is low. For example, for the Twitter spam dataset, the recall from optimized label spreading (Optimized LS) is 39.6%, which is far better than the 0.0% recall from optimized label propagation (Optimized LP).

  4. For other metrics such as AUC-ROC and F-measure, the advantages of optimization over the default SSL settings are similar to those for recall. Optimized label spreading is also better than optimized label propagation on these metrics.

RQ2 Compared with the default settings, hyperparameter optimization on semi-supervised algorithms can alleviate the decreasing trend of metrics such as recall and g-measure for label propagation, but is still not good enough. Besides, the optimized label spreading algorithm shows only slight improvement over default label spreading.

RQ3. Can we endorse the merits of the Dapper framework?

We now present the results of Dapper as introduced in Section 3.3. Recall our hypothesis that standard semi-supervised learning algorithms do not adequately address the class imbalance issue. Figure 2, from the Twitter spam dataset, validates this hypothesis. As we observe from the figure, the percentage of the minority class in the original labeled dataset is about 4.6%. When decreasing the label rate, the imbalance issue gets even worse. For example, with only 10% of labeled data, the ratio drops to about 0.5% with label spreading and 0.3% with label propagation. Several other prior studies report similar findings (Zhang et al., 2017; Iscen et al., 2019). This might result from the pseudo-labeling process, which can assign majority-class labels even to instances of the original minority class. This finding also indicates that the class imbalance issue still remains and should not be ignored, which motivates us to add SMOTE to resample the mixed dataset in the adaptive Dapper framework.

Compared with default semi-supervised learning algorithms and optimized semi-supervised learning algorithms only, we make the following remarks on Dapper:

  1. In recall, g-measure and AUC-ROC, Dapper is more advantageous than the other treatments to varying degrees. Even with label propagation, on the Twitter spam and CIC-IDS-2017 datasets, Dapper greatly increases the recall at a 10% label rate.

  2. Dapper with label spreading (Dapper + LS) is slightly better than with label propagation (Dapper + LP) on Twitter spam and CIC-IDS-2017, while for the malware URLs dataset, Dapper with label spreading is far better than with label propagation.

  3. More importantly, the results of Dapper are almost stable across different label rates.

  4. However, we note that Dapper might introduce an increase in false positive rate (e.g., on Twitter spam and CIC-IDS-2017). Since the minority class ratio is lower than our pre-defined threshold, Dapper adaptively applies the optimized SMOTE in the framework, which might cause this issue. We argue that, considering the improvement in important metrics such as recall, the trade-off of such an increase in false positive rate is still acceptable.

Figure 2. The percentage of minority class under different label rates after using default label propagation and default label spreading in the Twitter spam dataset.
Algorithm      Twitter Spam   Malicious URLs   CIC-IDS-2017
Default LP     < 1            < 2              < 2
Optimized LP   < 3            < 10             < 10
Dapper + LP    < 4            < 12             < 15
Default LS     < 1            < 2              < 2
Optimized LS   < 2            < 6              < 6
Dapper + LS    < 3            < 10             < 10
Table 8. Average runtime (in minutes) of different treatments. We set the number of Bayesian Optimization trials to 100.

Lastly, let us revisit the results in Table 4. The last column of this table comes from Dapper, in which label spreading is selected as the SSL learner and only 10% of the labeled training data is used. Compared with the results in column 3, in which 100% of the labeled data is used in a supervised paradigm, Dapper is close or even better in recall (with an acceptable trade-off in false positive rate). Compared with the prior works which published the datasets, Dapper shows an obvious advantage in the size of labeled data required. In addition, Table 8 shows the average runtime of the different treatments, which indicates that the Dapper framework is also practical to use. These results suggest that Dapper is a promising alternative to reduce the size of labeled data required to train a useful model.

RQ3 The adaptive Dapper framework with label spreading provides close or even better performance than supervised learning with 100% labeled training data, while requiring as little as 10% of the original labeled data.

6. Threats to Validity

Evaluation Bias. In our work, we choose popular evaluation metrics for classification tasks and use the g-measure as the optimization objective. We do not use other metrics because the relevant information is not available to us, or because we consider them unsuitable for this specific task (e.g., precision).

Optimizer Bias. The Dapper framework optimizes the semi-supervised learning algorithm, the machine learning classifier, and (when needed) SMOTE with Bayesian Optimization. We do not claim that Bayesian Optimization is the only or best choice, but argue that it is fast to run and more promising than the other methods discussed in Section 2.4, and we believe it is good enough for our study.

Learner Bias. Research into automatic classifiers is a large and active field. Different machine learning algorithms have been developed to solve various classification tasks. Any data-mining study, such as this paper, can only use a small subset of the known classification algorithms. We select the random forest classifier, commonly used in similar classification tasks, for this work. In the future, we plan to explore more popular classifiers such as support vector machines (SVM), XGBoost (Chen et al., 2015a), and so on.

Implementation Bias. The implementations of the semi-supervised learning algorithms and the random forest classifier are from the Scikit-learn library, the implementation of Bayesian Optimization is from the hyperopt library (Bergstra et al., 2013), and the tunable SMOTE is implemented from scratch following the idea from (Agrawal and Menzies, 2018), without using existing libraries. Different implementations of the above algorithms might have an impact on the performance results and might even change the conclusions of this work.

Input Bias. Our results come from the hyperparameter space listed in Table 2. In theory, other ranges might lead to other results. That said, our goal here is not to offer the best possible optimization but to argue that the optimized algorithms provided by Dapper, by themselves, can help reduce the ratio of labeled data required to a low level while still achieving promising performance. For those purposes, we would argue that our current hyperparameter space suffices.

7. Conclusion

When labeled data is scarce, it can be hard to build adequately good prediction models. Prior works in software security have tried to address this issue with semi-supervised learning, using a small pool of existing labels to infer the labels of unlabeled data. Those works usually do not explore SSL hyperparameter optimization (or data rebalancing with SSL). This paper checks whether that was a deficiency in prior works.

To perform that check, we propose Dapper, which explores the hyperparameter space of existing semi-supervised learning algorithms, i.e., label propagation and label spreading, as well as the machine learning classifier. When the percentage of the minority class is low, Dapper further adaptively integrates an optimized SMOTE oversampler into the framework to address the class imbalance issue. Experimental results with three datasets show that Dapper's combination of hyperparameter optimization and rebalancing can efficiently improve classification performance, even when most labels (90%) are unavailable. In some datasets, we even observe better results with Dapper (using 10% of the labels) than using 100% of the labels. Based on those results, we recommend using hyperparameter optimization when dealing with label shortages for security tasks.

References

  • Agrawal and Menzies (2018) Amritanshu Agrawal and Tim Menzies. 2018. Is "Better Data" Better Than "Better Data Miners"?. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 1050–1061.
  • Alabdulmohsin et al. (2016) Ibrahim Alabdulmohsin, YuFei Han, Yun Shen, and Xiangliang Zhang. 2016. Content-agnostic malware detection in heterogeneous malicious distribution graph. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 2395–2400.
  • Bergstra et al. (2011) James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24 (2011).
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of machine learning research 13, 2 (2012).
  • Bergstra et al. (2013) James Bergstra, Dan Yamins, David D Cox, et al. 2013. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, Vol. 13. Citeseer, 20.

  • Blum and Mitchell (1998) Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. 92–100.
  • Celebi and Aydin (2016) M Emre Celebi and Kemal Aydin. 2016. Unsupervised learning algorithms. Springer.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
  • Chen et al. (2015b) Chao Chen, Jun Zhang, Xiao Chen, Yang Xiang, and Wanlei Zhou. 2015b. 6 million spam tweets: A large ground truth for timely Twitter spam detection. In 2015 IEEE international conference on communications (ICC). IEEE, 7065–7070.
  • Chen et al. (2015a) Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, et al. 2015a. Xgboost: extreme gradient boosting. R package version 0.4-2 1, 4 (2015), 1–4.
  • Crawford et al. (2015) Michael Crawford, Taghi M Khoshgoftaar, Joseph D Prusa, Aaron N Richter, and Hamzah Al Najada. 2015. Survey of review spam detection using machine learning techniques. Journal of Big Data 2, 1 (2015), 1–24.
  • Gharib et al. (2016) Amirhossein Gharib, Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. 2016. An evaluation framework for intrusion detection dataset. In 2016 International Conference on Information Science and Security (ICISS). IEEE, 1–6.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. Overview of supervised learning. In The elements of statistical learning. Springer, 9–41.
  • Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization. Springer, 507–523.
  • Inc (2021) Amazon Mechanical Turk Inc. 2021. Amazon Mechanical Turk. https://www.mturk.com/
  • Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5070–5079.
  • Kolosnjaji et al. (2016) Bojan Kolosnjaji, Apostolis Zarras, Tamas Lengyel, George Webster, and Claudia Eckert. 2016. Adaptive semantics-aware malware classification. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 419–439.
  • Le et al. (2021) Duc C Le, Nur Zincir-Heywood, and Malcolm Heywood. 2021. Training regime influences to semi-supervised learning for insider threat detection. In 2021 IEEE Security and Privacy Workshops (SPW). IEEE, 13–18.
  • Lee et al. (2013) Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3. 896.
  • Lessmann et al. (2005) Stefan Lessmann, Robert Stahlbock, and Sven F Crone. 2005. Optimizing hyperparameters of support vector machines by genetic algorithms. In IC-AI. 74–82.
  • Lorenzo et al. (2017) Pablo Ribalta Lorenzo, Jakub Nalepa, Michal Kawulok, Luciano Sanchez Ramos, and José Ranilla Pastor. 2017. Particle swarm optimization for hyper-parameter selection in deep neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference. 481–488.
  • Mamun et al. (2016) Mohammad Saiful Islam Mamun, Mohammad Ahmad Rathore, Arash Habibi Lashkari, Natalia Stakhanova, and Ali A Ghorbani. 2016. Detecting malicious urls using lexical analysis. In International Conference on Network and System Security. Springer, 467–482.
  • Ni et al. (2015) Ming Ni, Qianmu Li, Hong Zhang, Tao Li, and Jun Hou. 2015. File relation graph based malware detection using label propagation. In International Conference on Web Information Systems Engineering. Springer, 164–176.
  • Nunes et al. (2016) Eric Nunes, Ahmad Diab, Andrew Gunn, Ericsson Marin, Vineet Mishra, Vivin Paliath, John Robertson, Jana Shakarian, Amanda Thart, and Paulo Shakarian. 2016. Darknet and deepnet mining for proactive cybersecurity threat intelligence. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, 7–12.
  • Pallaprolu et al. (2016) Sai C Pallaprolu, Josephine M Namayanja, Vandana P Janeja, and CT Sai Adithya. 2016. Label propagation in big data to detect remote access Trojans. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 3539–3547.
  • Resende and Drummond (2018) Paulo Angelo Alves Resende and André Costa Drummond. 2018. A survey of random forest based methods for intrusion detection systems. ACM Computing Surveys (CSUR) 51, 3 (2018), 1–36.
  • scikit-learn developers (2021) scikit-learn developers. 2021. Scikit-learn. https://scikit-learn.org/0.15/modules/label_propagation.html
  • Scudder (1965) Henry Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory 11, 3 (1965), 363–371.
  • Settles and Craven (2008) Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. 1070–1079.
  • Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2015), 148–175.
  • Sharafaldin et al. (2018) Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. 2018. Toward generating a new intrusion detection dataset and intrusion traffic characterization.. In ICISSP. 108–116.
  • Silva et al. (2016) Nadia Felix F Da Silva, Luiz FS Coletta, and Eduardo R Hruschka. 2016. A survey and comparative study of tweet sentiment analysis via semi-supervised learning. ACM Computing Surveys (CSUR) 49, 1 (2016), 1–26.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems. 2951–2959.
  • Souri and Hosseini (2018) Alireza Souri and Rahil Hosseini. 2018. A state-of-the-art survey of malware detection approaches using data mining techniques. Human-centric Computing and Information Sciences 8, 1 (2018), 1–22.
  • Storn and Price (1997) Rainer Storn and Kenneth Price. 1997. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 4 (1997), 341–359.
  • Taheri et al. (2020) Rahim Taheri, Reza Javidan, Mohammad Shojafar, Zahra Pooranian, Ali Miri, and Mauro Conti. 2020. On defending against label flipping attacks on malware detection systems. Neural Computing and Applications 32, 18 (2020), 14781–14800.
  • Tu et al. (2020) Huy Tu, Zhe Yu, and Tim Menzies. 2020. Better data labelling with emblem (and how that impacts defect prediction). IEEE Transactions on Software Engineering (2020).
  • Van Engelen and Hoos (2020) Jesper E Van Engelen and Holger H Hoos. 2020. A survey on semi-supervised learning. Machine Learning 109, 2 (2020), 373–440.
  • Wang et al. (2021) Haobo Wang, Zhao Li, Jiaming Huang, Pengrui Hui, Weiwei Liu, Tianlei Hu, and Gang Chen. 2021. Collaboration based multi-label propagation for fraud detection. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 2477–2483.
  • Wang et al. (2015) Ruowen Wang, William Enck, Douglas Reeves, Xinwen Zhang, Peng Ning, Dingbang Xu, Wu Zhou, and Ahmed M Azab. 2015. Easeandroid: Automatic policy analysis and refinement for security enhanced android via large-scale semi-supervised learning. In 24th USENIX Security Symposium (USENIX Security 15). 351–366.
  • Wu et al. (2018) Tingmin Wu, Sheng Wen, Yang Xiang, and Wanlei Zhou. 2018. Twitter spam detection: Survey of new approaches and comparative study. Computers & Security 76 (2018), 265–284.
  • Yang and Shami (2020) Li Yang and Abdallah Shami. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.
  • Zhang et al. (2020) Fuzhi Zhang, Xiaoyan Hao, Jinbo Chao, and Shuai Yuan. 2020. Label propagation-based approach for detecting review spammer groups on e-commerce websites. Knowledge-Based Systems 193 (2020), 105520.
  • Zhang et al. (2017) Zhi-Wu Zhang, Xiao-Yuan Jing, and Tie-Jian Wang. 2017. Label propagation based semi-supervised learning for software defect prediction. Automated Software Engineering 24, 1 (2017), 47–69.
  • Zhou et al. (2003) Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with local and global consistency. Advances in neural information processing systems 16 (2003).
  • Zhu and Ghahramani (2002) Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. (2002).
  • Zhu and Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. 2009. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3, 1 (2009), 1–130.
  • Zhu (2005) Xiaojin Jerry Zhu. 2005. Semi-supervised learning literature survey. (2005).