Countering malware is a major concern for most networked systems, since they can cause billions of dollars in loss (SecurityVentures, 2018). The growth of malware infections (CTONetworks, 2017) enables the development of multiple detection strategies, including machine learning (ML) classifiers tailored for malware, which have been adopted by the most popular AntiViruses (AVs) (Gandotra et al., 2014; Kantchelian et al., 2013; Jordaney et al., 2017). However, malware samples are very dynamic pieces of code—usually distributed over the Web (Chang et al., 2013) and constantly evolving to survive—quickly turning AVs into outdated mechanisms that present lower detection rates over time. This phenomenon is known as concept drift (Gama et al., 2014), and requires AVs to periodically update their ML classifiers (Ceschin et al., 2018). Malware concept drift is an understudied problem in the literature, with the few works that address it usually focusing on achieving high accuracy rates for a temporally localized dataset, instead of aiming for long-term detection due to malware evolution. Moreover, the community considers ML a powerful ally for malware detection, given its ability to respond faster to new threats (Gibert et al., 2020).
In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (Arp et al., 2014) (130K apps) and a subset of AndroZoo (Allix et al., 2016) (350K apps). Android has more than 2 billion monthly active devices, being the most used operating system worldwide (Fergus Halliday, 2018), with almost 40% of prevalence in the operating system market, surpassing Microsoft Windows in 2018 (StatCounter, 2018)
. As a widespread platform, Android is more affected by malware evolution, rendering its AVs vulnerable to concept drift effects. For our longitudinal evaluation, we collected malware samples over nine years (2009-2018) and used them to train an Adaptive Random Forest (ARF) classifier(Gomes et al., 2017), as well as
a Stochastic Gradient Descent (SGD) classifier(Pedregosa et al., 2011). Our goal is to show that concept drift is a generalized phenomenon, and not an issue of a particular dataset. To do so, we ordered all datasets samples using their VirusTotal (VirusTotal, 2018) submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec (Mikolov et al., 2013) and TF-IDF (Salton et al., 1975)). After that, we conducted experiments comparing both feature extractors, classifiers, as well as four drift detectors (Drift Detection Method (Gama et al., 2014), Early Drift Detection Method (Baena-Garćıa et al., 2006), ADaptive WINdowing (Bifet and Gavaldà, 2007), and Kolmogorov-Smirnov WINdowing (Raab et al., 2020)) to determine the best approach for real environments. We also compare some possible approaches to mitigate concept drift,and propose a novel method based on the data stream pipeline that outperformed current solutions by updating either the classifier or the feature extractor. We highlight the need for also updating the feature extractor, given that it is as essential as the classifier itself to achieve increased detection rates due to new features appearing over time. Therefore, we propose a realistic data stream learning pipeline, including the feature extractor in the loop. Finally, we discuss the impact of changes on the Android ecosystem in our classifiers by comparing feature changes detected over time, and the implications of our findings. It is worth notice that all code and datasets used will be publicly available111https://github.com/fabriciojoc/Fast-Furious-Malware-Data-Streams, as well as the implementation of our data stream learning pipeline using scikit-multiflow (Montiel et al., 2018)222https://github.com/fabriciojoc/scikit-multiflow
, an open-source ML framework for stream data.
This article is organized as follows: we compare our work with the literature in Section 2; we introduce our methodology in Section 3; we describe the machine learning background in Section 4; we present the experiments results in Section 5; we discuss our findings in Section 6, and draw our conclusions in Section 7.
2 Related Work
The literature on malware detection using machine learning is extensive (Gandotra et al., 2014). Usually, the primary concern of most works is to achieve 100% of accuracy using different representations and models, ignoring the fact that malware samples evolve as time goes by. Few papers consider concept drift in this context, such as Masud et al., which were (at the best of our knowledge) the first to treat malware detection as a data stream classification problem, proposing an ensemble of classifiers trained from consecutive chunks of data using -fold partitioning of them and reducing classification error compared to other ensembles (Masud et al., 2008)
. They also presented a feature extraction and selection technique for data streams that do not have any fixed feature set based on information gain. Bach et al. combined two models: a stable one, based on all data, and a reactive one, based on a short window of recent data, to determine when to replace the current stable model by computing the difference in their accuracy, assuming that the stable one performs worse than the reactive when the concept changes(Bach and Maloof, 2008).
Singh et al. proposed two measures to track concept drift in static features of malware families: relative temporal similarity (based on the similarity score between two time-ordered pairs of samples and are used to infer the direction of concept drift) and meta-features (based on summarization of information from a large number of features)(Singh et al., 2012), claiming to provide paths to further exploration of drift in malware detection models. Narayanan et al. presented an online machine learning based framework, named DroidOL to handle concept drift and detect malware (Narayanan et al., 2016) using control-flow sub-graph features in an online classifier, which adapts to the malware drift by updating the model more aggressively when the error is large and less aggressively, otherwise. Deo et al. proposed the use of Venn-Abers predictors to measure the quality of classification tasks and identify concept drift (Deo et al., 2016).
Jordaney et al. presented Transcend, a framework to identify concept drift in classification models which compares the samples used to train the models with those seen during deployment, computing algorithm credibility and confidence to measure the quality of the produced results and detect concept drift (Jordaney et al., 2017)
. Anderson et al. have shown that, by using reinforcement learning to generate adversarial samples, it is possible to retrain a model and make these attacks less effective, also protecting it against possible drifts, given that this technique hardens a model against worst-case inputs(Anderson et al., 2018). Pendlebury et al. reported that some results are inflated by spatial bias, caused by the wrong distribution of training and testing sets, and temporal bias, caused by incorrect time splits into these same sets (Pendlebury et al., 2018). They introduced an evaluation framework called Tesseract to compare malware classifiers in a realistic setting and confirmed that earlier published results are biased. Ceschin et al. compared a set of experiments that use batch machine-learning models with ones that take into account the change of concept (data streams), emphasizing the need to update the decision model immediately after a concept drift occurs (Ceschin et al., 2018). The authors also show that the malware concept drift is related to their concept evolution due to the appearance of new malware families.
Mariconti et al. created MaMaDroid , an Android malware
detection system that uses a behavioral model, in the form of a Markov
chain, to extract features from the API calls performed by an
, an Android malware detection system that uses a behavioral model, in the form of a Markov chain, to extract features from the API calls performed by an app(Onwuzurike et al., 2019). The solution proposed was tested with a dataset containing 44K samples, collected over six years, and presented a good detection rate and the ability to keep its performance for long periods (at least five years according to their experiments). Cai et al. compared their approach (Cai, 2018; Cai and Jenkins, 2018), which uses 52 selected metrics concerning sensitive data accesses via APIs (using dynamic analysis) as features, with MaMaDroid. According to their experiments, their approach managed to remain stable for five years, while MaMaDroid only kept the same performance for two years. A very similar approach was compared with other four methods in literature and all of them presented an overall f1score bellow 72% (Fu and Cai, 2019).
Xu et al. proposed DroidEvolver, an Android malware detection system that can be automatically updated without any human involvement, requiring neither retraining nor true labels to update itself (Xu et al., 2019). The authors use online learning techniques with evolving feature sets and pseudo labels, keeping a pool of different detection models and calculating a juvenilization indicator that determines when to update its feature set and each detection model. Finally, Gibert et al. presented research challenges of state-of-the-art techniques for malware classification, exemplifying the concept drift as one of them (Gibert et al., 2020).
Zhang et al. designed APIGraph, a framework to detect evolved Android malware that groups similar API calls into clusters based on the Android API official documentation (Zhang et al., 2020). Applying the aforementioned technique to other Android malware classifiers, the authors reduced the labeling efforts when combined with Tesseract (Pendlebury et al., 2018).
Our main contribution in this work is to apply data stream based machine learning algorithms to malware classification, proposing an important improvement in the pipeline that makes feature vectors correspond to the actual concept. The proposed solution, at the best of our knowledge, was not considered before and is as important as updating the classifier itself. To test and validate our proposal, we use two datasets containing almost 480K android apps, showing that it outperforms traditional data stream solutions. We also include an analysis of how certain features change over time, correlating it with a cybersecurity background.
3.1 Threat Model and Assumptions
Our threat model considers an antivirus engine (AV) for the Android platform since it is the market share leader. Thus, successful malware infections affect a large number of users. For scientific purposes, we considered an AV entirely based on Machine Learning (ML) as implemented by many of the malware detectors cited in Section 2. In practice, however, it does not imply that an AV must use only this detection method: it can be complemented with any other detection approach envisioned by the AV vendor. Our detection model is completely static (features are retrieved directly from the APK files) because static malware detection is the most popular and fastest way to triage malware samples.Similarly to the above discussion, the use of static detectors do not imply that an AV should not use dynamic components, but that our research focuses on improving the static detection component. In our proposed AV model, the APK files are inspected as soon as they are placed in the Android filesystem, i.e., it does not rely on any other system information except the APK files themselves. It is worth emphasizing that our goal is not to implement an actual AV but to highlight the need for updating ML models based on classifiers. Therefore, we simulated the behavior of an online AV using offline experiments that use data streams, simplifying our solution implementation. The details of the simulated ML model are presented below.
3.2 Data Stream
Since our goal is to evaluate the occurrence of concept drift in malware classifiers, we analyzed malware detection evolution over time using a data stream abstraction for the input data. In a traditional data stream learning problem that includes concept drift, the classifier is updated with new samples when a change occurs—usually the ones that caused the drift (Gama et al., 2014). Our data stream pipeline also considers the feature extractor under changes, according to the following five steps shown in Figure 1:
Obtain a new sample from the raw data stream;
Extract features from using a feature extractor , trained with previous data;
Predict the class of the new sample using the classifier , trained with previous data;
With the prediction from , update the drift detector to check the drift level (defined by authors (Montiel et al., 2018));
According to the drift level, three paths can be followed, all of them making the pipeline restarts at Step 1:
Normal: incrementally update with ;
Warning: incrementally update with and add to a buffer;
Drift: retrain both and using only the data collected during the warning level (from the buffer build during this level), creating a new extractor and classifier.
A challenge to detect concept drift in malware classifiers is to properly identify the sample’s date to allow temporal evaluations. Since malware samples are collected in the wild and they may be spreading for some time, no actual creation date is available. As an approximation for it, we considered each sample’s first appearance in VirusTotal (VirusTotal, 2018), a website that analyzes thousands of samples every day. Malware samples were ordered by their “first seen” dates, which allows us to create a data stream representing a real-world scenario, where new samples are released daily, thus requiring malware classifiers to be updated (Pendlebury et al., 2018).
In our experiments, we considered attributes vectors provided by the authors of DREBIN (Arp et al., 2014), composed of ten textual attributes (API calls, permissions, URLs, etc), which are publicly available to download and contain benign and malicious Android applications. We show DREBIN’s distribution in Figure 1(a). We also considered a subset of Android applications reports provided by AndroZoo API (Allix et al., 2016), composed of eight textual attributes (resources names, source code classes and methods, manifest permissions etc.) and contains benign and malicious applications. We show the distribution of our AndroZoo subset in Figure 1(b): it keeps the same goodware and malware distribution as the original dataset, which originally is composed by most of 10 million apps.
It is important to notice that we are using the attributes that were already extracted statically from them (Arp et al., 2014; Allix et al., 2016), since we do not have access to applications binaries, packages, or source codes. In addition, the timing information provided by the authors of the DREBIN (Arp et al., 2014) and VirusTotal differs. According to Arp et al. (Arp et al., 2014)
, their samples were collected from August 2010 to October 2012. However, our version is based on VirusTotal appearance date, which showed us that very few samples were already analyzed by VirusTotal (48 in 2009) and some of them were just analyzed after the establishment of the collection (37 in 2013 and 2014), probably following the dataset publicly release. We do not show these samples in Figure2 for better visualization.
Furthermore, both datasets reflect two characteristics of the real world that challenge the development of efficient ML malware detectors: (i) long-term variations (2009-2014, for DREBIN and 2016-2018 for AndroZoo); and (ii) class imbalance. For example, in DREBIN, whereas more than 40K goodware were collected in Sep/2009,only 1,750 malware samples were collected in the same period. Whereas class imbalance is out of this work’s scope, we considered both datasets as suitable for our evaluations due to the long-term characteristic, which challenges ML classifiers to learn multiple concepts over time. Finally, to evidence the difference of both datasets, we created heat maps containing the prevalence of a subset of malware families created using the intersection of families from them (54 families).
4 Machine Learning Algorithms
A typical way to represent malware for ML is to create a vector using the sample textual attributes, such as API calls, permissions, URLs, providers, intents activities, service receivers, and others. In our work, we represented them using Word2Vec (Mikolov et al., 2013) and TF-IDF (Salton et al., 1975), since they are widely used representations for text classification. Besides, both use the training data to generate the representation of all samples, which allows us to test our hypothesis that the drift affects not only the classifier but also the feature extractor.
4.2 Concept Drift Detectors
We used four concept drift detection algorithms in our work: DDM (Drift Detection Method (Gama et al., 2014)), EDDM (Early Drift Detection Method (Baena-Garćıa et al., 2006)), ADWIN (ADaptive WINdowing (Bifet and Gavaldà, 2007)), and KSWIN (Kolmogorov-Smirnov WINdowing) (Raab et al., 2020)
. Both DDM and EDDM are online supervised methods based on sequential error monitoring, i.e., each incoming example is processed separately estimating the sequential error rate. Therefore, they assume that the increase of consecutive error rate suggests the occurrence of concept drift. DDM directly uses the error rate, while EDDM uses the distance-error rate, which measures the number of examples between two errors(Baena-Garćıa et al., 2006). These errors trigger two levels: warning and drift. The warning level suggests that the concept starts to drift, then an alternative classifier is updated using the samples which rely on this level. The drift level suggests that the concept drift occurred, and the alternative classifier build during the warning level replaces the current classifier. ADWIN keeps statistics from sliding windows of variable size, used to compute the average of the change observed by cutting them in different points. If the difference between two windows is greater than a pre-defined threshold, a concept drift is detected, and the data from the first window is discarded (Bifet and Gavaldà, 2007). KSWIN uses a sliding window of fixed size to compare the most recent samples of the window with the remaining ones by using Kolmogorov-Smirnov (KS) statistical test (Raab et al., 2020). Different from the other two methods, ADWIN and KSWIN has no warning level, which made it necessary to adapt our data stream ML pipeline. For instance, when a change occurs in ADWIN, out of the window data is discarded and the remaining samples are used to retrain the both the classifier and feature extractor. Finally, when a change occurs in KSWIN, only the most recent samples of the window are kept and used to retrain the classifier and feature extractor.
In all evaluations, we developed classification models leveraging Word2Vec and TF-IDF representations and normalized using a MinMax technique (Pedregosa et al., 2011). For all cases, we used as classifiers Adaptive Random Forest (Gomes et al., 2017) without its internal drift detectors (working as a Random Forest for data streams), since its widespread use in the malware detection literature and has the best overall performance (Ceschin et al., 2018), and Stochastic Gradient Descent (SGD) classifier (Pedregosa et al., 2011), which is one of the fastest online classifiers in scikit-learn (Pedregosa et al., 2011). Both the classifier and drift detectors were configured using the same hyper-parameters proposed by the authors (Gomes et al., 2017; Montiel et al., 2018).
5.1 The Best-Case Scenario for AVs (ML Cross-Validation)
In the first experiment, we classify all samples together to compare which feature extraction algorithm is the best and report baseline results. We tested several parameters for both algorithms and fixed the vocabulary size in for TF-IDF (top-100 features ordered by term frequency) and created projections with dimensions for Word2Vec, resulting in and features for each app in both cases, for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation procedures, a method commonly used in ML to evaluate models because its results are less prone to biases (note that we are training new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
|Classifier||Algorithm||DREBIN Dataset||AndroZoo Dataset|
Table 1 presents the results obtained in this experiment for both DREBIN and AndroZoo datasets using Adaptive Random Forest (ARF) and Stochastic Gradient Descent (SGD), highlighting the performance of TF-IDF, which was better than Word2Vec in all metrics, except in precision when classifying the DREBIN dataset. It means that Word2Vec is slightly better to detect goodware (particularly in DREBIN dataset) since its precision is higher (i.e., fewer FPs) and TF-IDF is better to detect malware due to its higher recall (i.e., less FNs). In general, we conclude that TF-IDF is better than Word2Vec since its accuracy and f1score are higher. Moreover, we notice that its f1score is not as high as the accuracy, which is near 100%, indicating that one of the classes (malware) is more difficult to predict than the other (goodware). Regardless of small differences, we observe that ML classifiers perform significantly well when samples of all periods are mixed, even in a more complex dataset such as AndroZoo, since they can learn features from all periods.
5.2 On Classification Failure (Temporal Classification)
Although currently used classification methodology helps reducing dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting to present the same characteristics as past ones. However, malware samples are very dynamic, thus this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results of Table 2 indicate a drop in all metrics when compared to the 10-fold experiment in both DREBIN and AndroZoo datasets and also sugest the occurrence of concept drift on malware samples, given that the recall is much smaller than the previous experiment.
|Classifier||Algorithm||DREBIN Dataset||AndroZoo Dataset|
Due to dataset imbalance in both datasets, we notice a bias toward goodware detection (very small accuracy decrease) and significant qualitative differences for f1score and recall (much worse than before). The precision score was very similar to the cross-validation experiment, and better in the case of TF-IDF when classifying DREBIN, showing that goodware samples remain similar in the “future” while malware samples evolved. Word2Vec and TF-IDF lost, in average, about 16 percentage points of their f1score in DREBIN and about 19 percentage points for this same metric in AndroZoo. In addition, the model’s recall rates significantly dropped in both Word2Vec and TF-IDF when classifying DREBIN and AndroZoo, regardless of the models used. This indicates that detecting unforeseen malware only from a single set of past data is a hard task. These results highlight the need of developing better continuous learning approaches for effective malware detection.
5.3 Real-World Scenario (Windowed Classifier)
Since static classifiers are a bad strategy, AV companies adopt continuous updating procedures as samples are collected and identified. From a ML perspective, they adopt an incremental stream learning method (Pinage et al., 2016), which we call Incremental Windowed Classifier (IWC). Notice that this is the same approach used by other authors in the literature, but they propose the use of distinct attributes and features only, instead of a new stream pipeline (Zhang et al., 2020; Cai, 2020). To evaluate the impact of this approach, we divided the datasets into two groups, one containing samples released until a certain month of a year for training, and the other with samples released one month after that said month for testing. For example, considering Jan/2012, the training set contained samples that were created until Jan/2012 (cumulative) and the validation set contained samples created only in Feb/2012 (a month later). We tested both Adaptive Random Forest (ARF) (Gomes et al., 2017) and Stochastic Gradient Descent (SGD) classifier (Pedregosa et al., 2011) with TF-IDF (100 features for each textual attribute), given its better performance in previous experiments, and excluded months with no samples. Every month, both classifier and feature extractor were retrained, generating new classifiers and feature extractors.
The results for the DREBIN dataset shown in Figures 3(a) and 3(c) indicate a drop in both precision and recall rates. For Adaptive Random Forest, precision drops to about 85% in April/2012 and increases in the following months to almost 100% in August/2013. The Recall starts with the worst result (about 40%), in January/2012 and reaches almost 100% in August/2013. The average recall was 70.3% and average precision, 95.36%. For SGD, the worst precision is reported in October/2012 (about 30%), increasing to almost 100% in August/2013. The same happen with the recall, which drops to almost 20% in October/2012 and increases to almost 100% in August/2013. Finally, the average recall was 66.78% and average precision, 85.22%.
On the one hand, the growth of precision and recall in some periods indicate that continuously updating the classifier might increase classification performance in comparison to using a single classifier tested with samples from a past period because the classifier learns new patterns for the existing concepts (features) that may have been introduced over time. On the other hand, the drop in the precision and recall rates in some periods indicate the presence of new concepts (e.g., new malware features) that were not fully handled by the classifiers. When looking at the AndroZoo dataset results, as shown inFigures 3(b) and 3(d), it is possible to note a drastic fall in recall as time goes by in both cases, dropping from almost 70% (ARF) and 60% (SGD) in January/2017 to less than 5% in March/2019, not exceeding 10% by the end of the stream. This indicates that AndroZoo is a much more complex dataset with very different malware in distinct periods, given that the recall did not behave as this same experiment using the DREBIN dataset. In contrast, the precision remained unstable. For Adaptive Random Forest, it reaches 100% in some periods, such as May/2017, and drops to almost 50%, in June/2017, with quite similar behavior from the DREBIN dataset. For SGD, it also reaches 100% in some periods, such as April/2017, and drops to almost 30% in April/2018. These results indicate that more than continuous retraining, AV companies need to develop classifiers fully able to learn new concepts.
5.4 Concept Drift Detection using Data Stream Pipeline (Fast & Furious – F&F)
Although the previous results indicate that continuously learning from the past is the best strategy for detecting forthcoming threats, concept drift remains challenging, even more when considering results from the AndroZoo dataset. Therefore, we present an evaluation of a drift detector approach that could be deployed by AV companies to automatically update their classifier models. In our experiments, we used both Adaptive Random Forest (Gomes et al., 2017) and Stochastic Gradient Descent as the classifiers, DDM, EDDM, ADWIN, and KSWIN as drift detectors (Montiel et al., 2018) (with the same parameters as the authors), and TF-IDF (using 100 features for each textual attribute) as a feature extractor. In the DREBIN dataset, we initialized the base classifier with data from the first year, given that in the first months we have just a few malware showing up. In the AndroZoo dataset, we initialized it with data from the first month. We tested all drift detection algorithms in two circumstances when creating a new classifier, according to our data stream pipeline: (i) Update: we just update the classifier with new samples (collected in warning level or ADWIN window), which reflects the fastest approach for AV companies to react to new threats; (ii) Retrain: we extract all features from the raw data (also collected in warning level or ADWIN window) and all models are built from scratch again, i.e., both classifier and feature extractor are retrained and a new vocabulary is built based on the words in the training set for each textual attribute (more complete approach, but time-consuming for AV companies). To compare our solution with another similar in the literature, we implemented our version of DroidEvolver (Xu et al., 2019), replicating their approach using the same feature representation as ours. It is important to report here that we tried to use the authors’ source code, but they were not working properly due to dependencies that could not be installed. Thus, we implemented an approximation of their method by analyzing their paper and code, using and with three compatible online learning methods from scikit-learn (Pedregosa et al., 2011): SGDClassifier, PassiveAggressiveClassifier, and Perceptron. Finally, during the execution of DroidEvolver, we noticed that it was detecting at least one drift in one of its classifiers each iteration, making it necessary to create a new parameter that checks for drift in an interval of steps (iterations), in our case steps was selected according to our analysis.
Table 3 presents the results for DroidEvolver and our methods for both datasets. For DREBIN, when using Adaptive Random Forest, EDDM outperforms DDM methods’ classification performance in all scenarios, which was the opposite when using SGD classifier. For both classifiers, ADWIN outperforms EDDM, DDM, and KSWIN, providing the best overall performance for retraining as well as for learning new features. However, only when using Adaptive Random Forest, ADWIN with the update is better for precision, making it slightly better in reducing false positives, and KSWIN is better for recall, making it better in reducing false negatives. Moreover, retraining both the feature extractor and classifier makes it detect fewer drift points than the updating approach (18 vs. 20 when using ARF, 13 vs 14 when using SGD). Overall, SGD outperformed Adaptive Random Forest when classifying DREBIN, presenting an improvement in f1score of almost 16 percentage points. In comparison to DroidEvolver, our approach outperformed it in all metrics, with a relatively large margin. In Figure 5, it is possible to observe the prequential error of Adaptive Random Forest with ADWIN using Update (Figure 4(a)) and Retrain (Figure 4(b)) strategies. Despite the Retrain strategy is similar to the update approach in some points, it has less drift points, a lower prequential error and, consequently, a better classification performance.
When classifying AndroZoo, DDM outperforms EDDM, which was not a good drift detector for this specific dataset. ADWIN with retraining was the best method again, detecting 50 and 33 drift points versus 78 and 30 when using the update approach with Adaptive Random Forest and SGD classifier, respectively, outperforming all other methods, including DroidEvolver again. Finally, these results suggest that AV companies should keep investigating specific samples to increase overall detection.
|Classifier||Method||DREBIN Dataset||AndroZoo Dataset|
5.5 Multiple Time Spans
In the previous experiment, we showed that the use of drift detectors with the Retrain strategy improved the classification performance. Although the experiment simulates a real-world stream, there might be potential biases from the use of the very same training and test sets. By proposing a new experiment, we mitigated the risk of biasing evaluations, since we have tested our solutions under different circumstances by using multiple time spans and reporting the average result.
To evaluate the effectiveness of our approach in multiple conditions,
we applied our data stream pipeline to different training and test sets of our dataset: we splitted the two datasets (sorted by the samples’ timestamps) into eleven folds (thus 10 consecutive epochs), with every fold containing the same amount of data, similar to a
To evaluate the effectiveness of our approach in multiple conditions, we applied our data stream pipeline to different training and test sets of our dataset: we splitted the two datasets (sorted by the samples’ timestamps) into eleven folds (thus 10 consecutive epochs), with every fold containing the same amount of data, similar to ak-fold cross-validation scheme. However, instead of using a single set for training as done in each iteration of k-fold cross-validation, we increment the training set i with the fold i+1, and remove it from the test set at every iteration. This way, we create a cumulative training set and simulate the same scenario as the previous experiment, but starting the stream in different parts. In the end, we produced ten distinct results that present the effectiveness of our method under varied conditions, which worked as a k-fold cross-validation. However, we focused on collecting the F1Score of each iteration).
In the current experiment, we used all the methods presented in the previous section: both classifiers (Adaptive Random Forest and SGD), all drift detectors (DDM, EDDM, ADWIN, and KSWIN), and both methods ((U)pdate and (R)etrain). To accomplish a better presentation of the results, we chose a boxplot visualization containing the distribution of all F1Scores (black dots) and their average for each method (white dots), as shown in Figure 6.
In Figures 5(a) and 5(c) , we present the results for the DREBIN dataset using Adaptive Random Forest and SGD as classifiers, respectively. The IWC method performed much better than the others with Random Forest; in the case of using ten folds to predict the last one, it reaches a better performance than the other methods, which indicates that the last fold may not be affected by concept drift. However, when we look at the other methods, KSWIN with retrain performed best, achieving a higher average F1Score and a low standard deviation. Also, as we saw in the previous section, SGD performed better than the Adaptive Random Forest applied to this dataset.
, we present the results for the DREBIN dataset using Adaptive Random Forest and SGD as classifiers, respectively. The IWC method performed much better than the others with Random Forest; in the case of using ten folds to predict the last one, it reaches a better performance than the other methods, which indicates that the last fold may not be affected by concept drift. However, when we look at the other methods, KSWIN with retrain performed best, achieving a higher average F1Score and a low standard deviation. Also, as we saw in the previous section, SGD performed better than the Adaptive Random Forest applied to this dataset.
In Figures 5(b) and 5(d), we present the results for the AndroZoo dataset using Adaptive Random Forest and SGD as classifiers, respectively. In this scenario, IWC performance is much worse than any other method, which we believe is evidenced due to the complexity of this dataset (almost 350K samples). Again, in both classifiers, KSWIN with retrain method performed best, achieving the higher average F1Score, despite Adaptive Random Forest presenting better overall results and being almost 4 percentage points higher than SGD.
Finally, the analysis of the results allows us to observe that (i) the more data we have in the data stream, the most difficult it becomes for IWC to keep a good performance; (ii) using KSWIN drift detector with retrain is the most recommended method for static android malware detection data streams; and (iii) we ensure that our results do not have potential biases.
5.6 Understanding Malware Evolution
After we confirmed that concept drift is prevalent in these malware datasets, we delved into evolution details to understand the reasons behind it and the lessons that mining a malware dataset might teach us. To do so, we analyzed vocabulary changes over time for both datasets and correlated our findings of them.
DREBIN Dataset Evolution. Figure 6(a) shows API calls’ vocabulary change for the first six detected drift points that actually presented changes (green and red words mean introduced and removed from the vocabulary, respectively). We identified periodic trends and system evolution as the two main reasons for the constant change of malware samples. The first occurs because malware creators compromise systems according to the available infection vectors (e.g., periodic events exploitable via social engineering). Also, attackers usually shift their operations to distinct targets when their infection strategies become so popular that AVs detect them. Therefore, some features periodically enter and leave the vocabulary (e.g., setcontentview). The second occurs due to changes in the Android platform, causing reactions from malware creators to handle the new scenario, either by supporting newly introduced APIs (e.g., getauthtoken (Android, 2018a)), or by unsupporting deprecated and modified APIs (e.g., deleted keyguard (Android, 2018b) or the deprecated fbreader (FBReader, 2018) intent). Handling platform evolution is required to keep malware samples working on newer systems, whereas ensuring maliciousness. In this sense, the removal of a feature like DELETE_PACKAGE permission from the vocabulary can be explained by the fact that Android changed permission’s transparency behavior to an explicit confirmation request (Android, 2016), which makes malware abusing those features less effective. The most noticeable case of malware evolution involves SMS sending, a feature known to be abused by plenty of samples (Sarma et al., 2012; Hamandi et al., 2012; Luo et al., 2013). We identified that APIs were periodically added and removed from the vocabulary (e.g., sendmultiparttextmessage, part of the Android SMS subsystem (Android, 2018c), had its use restricted by Android over time, until being finally blocked (Cimpanu, 2018)). This will certainly cause modifications in newer malware and probably incur in classifiers’ drifting once again. Therefore, considering concept drift is essential to define any realistic threat model and malware detector.
AndroZoo Dataset Evolution. We conducted the same experiments performed on the Drebin dataset on the AndroZoo dataset to evaluate dataset influence on concept drift occurrence, as shown in Figure 6(b). We highlight that developing a malware classifier for the AndroZoo dataset is a distinct challenge than developing a malware classifier for the Drebin dataset because the AndroZoo dataset presents more malware families (1.166 distinct families according to Euphony (Hurier et al., 2017) labels) than the Drebin dataset (178 distinct families according to Euphony labels). The difference in the dataset complexity is reflected in the number of drift points identified for each experiment. Whereas we identified 18 drift points for the Drebin dataset, the AndroZoo dataset presented 50 drift points. Despite the difference in the number of drifting points, the reasons behind such drifts have been revealed very similar when we delve into vocabulary change details. When we look to the manifest_action vocabulary, we notice that the sms_received feature leaves and returns to the vocabulary many times. This behavior reflects the arms race between Google and attackers for SMS sending permission (Cimpanu, 2018). This same occurrence has been observed in the DREBIN dataset. Similarly, when we look to the manifest_category vocabulary, we notice that the facebook feature also leaves and returns to the vocabulary many times. This happens because the attackers often rely on Facebook name for distributing “Trojanized” applications that resemble that original Facebook app. Although Google often removes these apps from the official store, attackers periodically come up with new ways of bypassing AppStore’s verification routines, thus resulting in the presented infection waves regarding the Facebook name. Finally, this same behavior is observed in the resource_entry vocabulary regarding the amazonaws feature. Attackers often store their malicious payloads on popular cloud servers to avoid detection (Rossow et al., 2013). Even though Amazon sinkhole malicious domains when detected, attackers periodically find new ways to survive security scans, thus also resulting in the observed vocabulary waves.
The results of our experiments show that despite presenting distinct complexities, both datasets have undergone drift occurrences. This shows that concept drift is not a dataset-exclusive issue, but a characteristic inherently associated with the malware classification problem. Therefore, overall malware classification research work (and primarily the ones leveraging these two popular datasets) should consider concept drift occurrence in their evaluations to gather more accurate samples information.
We here discuss our findings and their implications.
What the metrics say. While the accuracy of distinct classifiers is very similar in some experiments, we highlight that malware detectors must be evaluated using the correct metrics, as any binary classification. Therefore, we rely on f1score, recall, and precision to gain further knowledge on malware detection. AV companies should adopt the same reasoning when evaluating their classifiers. Experiments 5.2 and 5.3 indicate Android malware evolution, which imposes difficulties for established classifiers and corroborates the fact that samples changed. However, goodware generally kept the same concept over time, suggesting that creating benign profiles may be better than modeling everchanging malicious activities.
Feature drift, concept drift and evolution. Using a fixed ML model is not enough to predict future threats. In experiment 5.2, we notice a significant drop in f1score and recall (a metric that indicates the ability to correctly classify malware). Hence, a fixed ML model is prone to misclassify novel threats. Besides, continuous learning helps increasing detection but is still subject to drift, as shown by AndroZoo in experiment 5.3, when recall remains below 10% for a long period. If we compare experiments 5.3 and 5.4 when classifying DREBIN, it is possible to observe that the Incremental Windowed Classifier update still outperforms DDM with the update when using Adaptive Random Forest, but is not better than DDM with retraining (except in precision). However, when comparing it with EDDM and ADWIN, we notice that both are significantly better, and it is still highly prone to drift (Figure 4). In response to these results, concept drift detectors help to improve classification performance. When comparing DDM with retraining and EDDM and ADWIN (with both update and retrain) to IWC (Table 3), we can see that the former outperformed the latter, advocating the need of using drift detectors. This is evident when using SGD, once IWC was outperformced by all other methods, and is even more evident when classifying AndroZoo, given that IWC lost the ability to detect new malware as time goes by. In this case, ADWIN with retraining was able to outperform all the other methods again, even a closely related work (DroidEvolver (Xu et al., 2019)), giving us insight that this is a valid method to be used in practice. In addition, even more important than just using a drift detector, reconsidering the classifier’s feature set is required to completely overcome concept drift. According to experiment 5.4, we can infer that retraining the classifier and feature extraction models every time a drift occurs (using the data stream pipeline we proposed) is better than just updating only the classifier. This implies that not only the representation of malware becomes obsolete as time goes by, but also the vocabulary used to build them. It means that every textual attribute used by applications can change, i.e., new API calls, permissions, URLs, for example, emerge, requiring a vocabulary update, i.e., it indicates that discovering new specific features might help in increasing detection rates. Thus, we conclude that malware detection is not just a concept drift problem, but also in essence a feature drift detection problem (1).
Our solution in practice. To implement our solution in practice, we need to model the installation, update, and removal of applications as a stream. This can be done by considering the initial set of applications installed in a stock phone as ground truth and thus subsequent deployment of applications as the stream. To reduce the delay between the identification of a drift point and the classifier update, ideally, multiple classifiers (the current one and the candidates to replace it) should be running in parallel. However, this parallel execution is too much expensive to be performed on endpoints. Therefore, we consider that the best usage scenario for our approach is its deployment on the App’s Stores distributing the applications. Therefore, each time a new application is submitted to an App Store, it is verified according to our cycle. To cover threats distributed by alternative markets, this concept can be extended to any centralized processing entity. For instance, an AV can upload the suspicious application to a cloud server and perform its checks according to our cycle. To speed up the upload process, the AV can make the current feature extractor available to the endpoint so as the endpoint does not need to upload the entire application but only its feature. In this scenario, the AV update is not composed of new signatures, but of new feature extractors.
Drift and evolution are common problems in malware detection. Despite being more complex, (more apps and malware families) all characteristics present in DREBIN are drastically shown in AndroZoo, evidencing that feature drift, concept drift, and evolution are present in malware detection problems in practice and not only in a single dataset. This suggests that AV companies should keep enhancing their models and knowledge database constantly.
Limitations and future work. One of the limitations of our approach is that it relies on ground truth labels that may be available with a certain delay, given that known goodware and malware are needed in order to train and update the classifier (as any AV). Thus, future researches to reduce these delays are an important step to improve any ML solution that does not rely on their own labels, such as DroidEvolver (Xu et al., 2019) (outperformed by our approach).
It is important to note that our work considers only the attributes vectors provided by DREBIN (Arp et al., 2014) and AndroZoo (Allix et al., 2016) datasets’ authors. Due to this fact, we were unable to consider other types of attributes, such as API semantics (Zhang et al., 2020) or behavioral profiling (dynamic analysis) (Cai, 2020). Therefore, we cannot directly compare our work with theirs, due to different threat models. To compare different attributes using our data stream pipeline under the same circumstances, it is necessary to download all the APKs from both datasets. This is left as future work.
Finally, we make our data stream learning pipeline available as an extension of scikit-multiflow (Montiel et al., 2018), aiming at encouraging other researchers to contribute with new strategies that address feature and concept drift.
In this article, we evaluated the impact of concept drift on malware classifiers for Android malware samples to understand how fast classifiers expire. We analyzed 480K sample Android apps from two datasets (DREBIN and AndroZoo) collected over nine years (2009-2018) using two representations (Word2Vec and TF-IDF), two classifiers (Adaptive Random Forest and Stochastic Gradient Descent classifier) and four drift detectors (DDM, EDDM, ADWIN, and KSWIN). Our results show that resetting the classifier only after changes are detected is better than periodically resetting it based on a fixed window length. We also point the need to update the feature extractor (besides the classifier) to achieve increased detection rates, due to new features that may appear over time. This strategy was the best in all scenarios presented, even using a complex dataset, such as AndroZoo. The implementation is available as an extension to scikit-multiflow (Montiel et al., 2018)333https://github.com/fabriciojoc/scikit-multiflow. Our results highlight the need for developing new strategies to update the classifiers inherent to AVs.
-  (2017) A survey on feature drift adaptation: definition, benchmark, challenges and future directions. Journal of Systems and Software. Cited by: §6.
- AndroZoo: collecting millions of android apps for the research community. In Int. Conf. on Min. Soft. Rep., External Links: Cited by: §1, §3.3, §3.3, §6.
- Learning to evade static pe machine learning malware models via reinforcement learning. External Links: Cited by: §2.
- Android 7.0 behavior changes. Note: https://tinyurl.com/yxlpc4gb Cited by: §5.6.
- AccountManager. Note: https://tinyurl.com/ybsdz76e Cited by: §5.6.
- Device admin deprecation. Note: https://tinyurl.com/yygfkc3m Cited by: §5.6.
- SmsManager. Note: https://tinyurl.com/y3sz4zzw Cited by: §5.6.
- DREBIN: effective and explainable detection of android malware in your pocket. In NDSS, Cited by: §1, §3.3, §3.3, §6.
- Paired learners for concept drift. In IEEE Int. Conf. on Data Mining, Cited by: §2.
- Early drift detection method. Cited by: §1, §4.2.
- Learning from time-changing data with adaptive windowing. In SIAM Int. Conf. on Data Mining, Cited by: §1, §4.2.
- Towards sustainable android malware detection. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, ICSE ’18, New York, NY, USA, pp. 350–351. External Links: Cited by: §2.
- A preliminary study on the sustainability of android malware detection. External Links: Cited by: §2.
- Assessing and improving malware detection sustainability through app evolution studies. 29 (2). External Links: Cited by: §5.3, §6.
- The need for speed: an analysis of brazilian malware classifers. IEEE Sec.&Priv.. External Links: Cited by: §1, §2, §4.3.
- Analyzing and defending against web-based malware. ACM Comput. Surv. 45 (4). External Links: Cited by: §1.
- Google restricts which android apps can request call log and sms permissions. Zdnet. Note: https://tinyurl.com/y3y5gfhx Cited by: §5.6, §5.6.
- Malware infections grow 4x in just one quarter. Note: https://tinyurl.com/yyyugke7 Cited by: §1.
Prescience: probabilistic guidance on the retraining conundrum for malware detection.
Proceedings of the ACM Workshop on Artificial Intelligence and Security, Cited by: §2.
- For old android devices. Note: https://tinyurl.com/y2odlk9w Cited by: §5.6.
- What are the most common kinds of Android malware?. Note: https://tinyurl.com/wcmr9m2 Cited by: §1.
- On the deterioration of learning-based malware detectors for android. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Vol. , pp. 272–273. External Links: Cited by: §2.
- A survey on concept drift adaptation. ACM Comput. Surv.. Cited by: §1, §1, §3.2, §4.2.
- Malware analysis and classification: a survey. Journal of Information Security. Cited by: §1, §2.
- The rise of machine learning for detection and classification of malware: research developments, trends and challenges. Journal of Network and Computer Applications. Cited by: §1, §2.
- Adaptive random forests for evolving data stream classification. Machine Learning. Cited by: §1, §4.3, §5.3, §5.4.
- Android sms botnet: a new perspective. In Int. Symp. on Mobility Management and Wireless Access, Cited by: §5.6.
- Euphony: harmonious unification of cacophonous anti-virus vendor labels for android malware. In Int. Conf. on Min. Soft. Rep., Cited by: §5.6.
- Transcend: detecting concept drift in malware classification models. In USENIX Security Symposium, Cited by: §1, §2.
- Approaches to adversarial drift. In ACM Workshop on A.I. and Security, External Links: Cited by: §1.
- Real-time detection and prevention of android sms permission abuses. In Int. W. on Sec. in Embedded Sys. and Smartphones, External Links: Cited by: §5.6.
- Cloud-based malware detection for evolving data streams. ACM Trans. Manage. Inf. Syst.. Cited by: §2.
- Distributed representations of words and phrases and their compositionality. CoRR. External Links: Cited by: §1, §4.1.
- Scikit-multiflow: a multi-output streaming framework. Journal of Machine Learning Research. Cited by: §1, item 4, §4.3, §5.4, §6, §7.
- Adaptive and scalable android malware detection through online learning. In IJCNN, Vol. . Cited by: §2.
- MaMaDroid: detecting android malware by building markov chains of behavioral models (extended version). ACM Trans. Priv. Secur. 22 (2). External Links: Cited by: §2.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research. Cited by: §1, §4.3, §5.3, §5.4.
- TESSERACT: eliminating experimental bias in malware classification across space and time. CoRR. External Links: Cited by: §2, §2, §3.3.
- Classification systems in dynamic environments. WIREs: Data Mining and Knowledge Discovery. Cited by: §5.3.
- Reactive soft prototype computing for concept drift streams. Neurocomputing 416, pp. 340–351. External Links: Cited by: §1, §4.2.
- Large-scale analysis of malware downloaders. In Detection of Intrusions and Malware, and Vulnerability Assessment, U. Flegel, E. Markatos, and W. Robertson (Eds.), Berlin, Heidelberg. Cited by: §5.6.
- A vector space model for automatic indexing. Commun. ACM. Cited by: §1, §4.1.
- Android permissions: a perspective combining risks and benefits. In Symp. on Access Control Models and Technologies, Cited by: §5.6.
- Global ransomware damage costs predicted to exceed $8 billion in 2018. Note: https://tinyurl.com/y499nvsh Cited by: §1.
- Tracking concept drift in malware families. In Proceedings of the ACM Workshop on Security and Artificial Intelligence, Cited by: §2.
- Operating System Market. Note: https://tinyurl.com/tykvtqm Cited by: §1.
- Online malware and url scanner.. Note: https://virustotal.com/ Cited by: §1, §3.3.
- DroidEvolver: self-evolving android malware detection system. In 2019 IEEE EuroS&P, Cited by: §2, §5.4, Table 3, §6, §6.
- Enhancing state-of-the-art classifiers with api semantics to detect evolved android malware. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS ’20, New York, NY, USA, pp. 757–770. External Links: Cited by: §2, §5.3, §6.