Portable, Data-Driven Malware Detection using Language Processing and Machine Learning Techniques on Behavioral Analysis Reports

12/26/2018 · by ElMouatez Billah Karbab, et al. · Concordia University

In response to the volume and sophistication of malicious software, or malware, security investigators rely on dynamic analysis for malware detection to thwart obfuscation and packing issues. Dynamic analysis is the process of executing binary samples to produce reports that summarize their runtime behaviors. The investigator uses these reports to detect malware and attribute the threat type, leveraging manually chosen features. However, the diversity of malware and execution environments makes manual approaches unscalable because the investigator needs to manually engineer fingerprinting features for each new environment. In this paper, we propose MalDy (mal die), a portable (plug and play) malware detection and family threat attribution framework using supervised machine learning techniques. The key idea behind MalDy's portability is modeling the behavioral reports as sequences of words, along with advanced natural language processing (NLP) and machine learning (ML) techniques, to automatically engineer relevant security features for detecting and attributing malware without the investigator's intervention. More precisely, we propose to use the bag-of-words (BoW) NLP model to formulate the behavioral reports. Afterward, we build ML ensembles on top of the BoW features. We extensively evaluate MalDy on various datasets from different platforms (Android and Win32) and execution environments. The evaluation shows the effectiveness and portability of MalDy across the spectrum of analyses and settings.


1 Introduction

Malware investigation is an important and time-consuming task for security investigators. The daily volume of malware makes automating the detection and threat attribution tasks a necessity. The diversity of platforms and architectures makes malware investigation even more challenging: the investigator has to deal with a variety of malware scenarios from Win32 to Android, and nowadays malware targets all CPU architectures, from x86 to ARM and MIPS, which heavily influence the binary structure. This diversity drives the need for portable tools, methods, and techniques in the security investigator's toolbox for malware detection and threat attribution.

Binary code static analysis is a valuable tool for investigating malware in general. It has been used effectively and efficiently in many solutions arp2014drebin, Hu13MutantX, karbab2018maldozer, DBLP:journals/corr/abs-1712-08996, and Mariconti2017MaMaDroid in the PC and Android realms. However, static analysis can be problematic on heavily obfuscated and custom-packed malware. Solutions such as Hu13MutantX address these issues partially, but they are platform/architecture dependent and cover only simple evasion techniques. On the other hand, dynamic analysis solutions Toward2017WillemsHF, Bayer09Scalable, Wong2016Intellidroid, DynaLog2016Alzaylaee, malrec2018Giorgio, DBLP:journals/corr/abs-1806-08893, DBLP:conf/IEEEares/KarbabD18, Kernel2011Isohara, Automatic2011Rieck, Irolla2016Glassbox show more robustness to evasion techniques such as obfuscation and packing. Dynamic analysis's main drawback is its hunger for computational resources Wang2015DROIT, Graziano2015Needles. It may also be blocked by anti-emulation techniques, but this is less common than binary obfuscation and packing. For these reasons, dynamic (also called behavioral) analysis remains the default choice and the first analysis applied to malware by security companies.

The static and behavioral analyses are sources of security features, which the security investigator uses to decide whether a binary is malicious. Manual inspection of these features is a tedious task and can be automated using machine learning techniques. For this reason, the majority of state-of-the-art malware detection solutions use machine learning Mariconti2017MaMaDroid, arp2014drebin. We can classify these solutions' methodologies into supervised and unsupervised approaches. The supervised approach, such as Nataraj2011comparative for Win32 and wu2014droiddolphin, Sen2016StormDroid for Android, is the most used in malware investigation Martinelli2016find, Sen2016StormDroid. The supervised machine learning process starts by training a classification model on a train set; afterward, this model is used on new samples in a production environment. Second, in the unsupervised approach, such as Automatic2011Rieck, karbab2016cypider, Scalable2009Bayer, DBLP:journals/corr/KarbabDAM17, karbab2017dysign, the authors cluster the malware samples into groups based on their similarity. Unsupervised learning is more common in malware family clustering Automatic2011Rieck and less common in malware detection karbab2016cypider.

In this paper, we focus on supervised machine learning techniques along with behavioral (dynamic or runtime) analyses to investigate malicious software. Dynamic and runtime analyses execute binary samples to collect their behavioral reports. Dynamic analysis performs the execution in a sandbox environment (emulation), where malware detection is an offline task. Runtime analysis is the process of collecting behavioral reports from production machines; the security practitioner uses these reports to perform online malware checking without disturbing the running system.

1.1 Problem Statement

State-of-the-art solutions, such as Sen2016StormDroid, kharraz2016unveil, sgandurra2016automated, rely on manually investigated security features in the detection process. For example, StormDroid Sen2016StormDroid used Sendsms and Recvnet dynamic features, chosen based on a statistical analysis, for Android malware detection. As another example, the authors in Kolbitsch2009Effective used explicit features to build behavior graphs for Win32 malware detection. The security features may change with the execution environment even for the same target platform. For instance, the authors of Sen2016StormDroid and DynaLog2016Alzaylaee used different security features due to differences between their execution environments. In the context of security investigation, we are looking for a portable framework for malware detection based on behavioral reports across a variety of platforms, architectures, and execution environments. The security investigator should be able to rely on this plug and play framework with minimal effort: we plug in the behavioral analysis reports for training and apply (play) the produced classification model on new reports of the same type, without explicit security feature engineering as in Sen2016StormDroid, Kolbitsch2009Effective, Chen2017Semi; this process works on virtually any behavioral reports.

1.2 MalDy

We propose MalDy, a portable and generic framework for malware detection and family threat investigation based on behavioral reports. MalDy aims to be a utility in the security investigator's toolbox that leverages existing behavioral reports to build a malware investigation tool without prior knowledge of the behavior model, malware platform, architecture, or execution environment. More precisely, MalDy is portable because it automatically mines relevant security features, which allows moving MalDy to new environments' behavioral reports without a security expert's intervention. Formally, the MalDy framework is built on top of natural language processing (NLP) modeling and supervised machine learning techniques. The main idea is to formalize the behavioral report, agnostically to the execution environment, into a bag of words (BoW) where the features are the reports' words. Afterward, we leverage machine learning techniques to automatically discover relevant security features that help differentiate and attribute malware. The result is MalDy, a portable (Section 8.2), effective (Section 8.1), and efficient (Section 8.3) framework for malware investigation.

1.3 Result Summary

We extensively evaluate MalDy on different datasets, from various platforms, under multiple settings to show the framework's portability, effectiveness, efficiency, and suitability for general purpose malware investigation. First, we experiment on Android malware behavioral reports from the MalGenome zhou2012dissecting, Drebin arp2014drebin, and Maldozer karbab2018maldozer datasets along with benign samples from the AndroZoo Allix2016AndroZoo and PlayDrone (https://archive.org/details/android_apps) repositories. The reports were generated using the Droidbox droidbox_github sandbox. MalDy achieved 99.61%, 99.62%, and 93.39% f1-score on the detection task on these datasets respectively. Second, we apply MalDy on behavioral reports (20k samples from 15 Win32 malware families) provided by ThreatTrack Security (https://www.threattrack.com), generated by the ThreatAnalyzer sandbox. Again, MalDy shows high accuracy on the family attribution task, 94.86% f1-score, under different evaluation settings. Despite the differences between the evaluation datasets, MalDy shows high effectiveness under the same hyper-parameters with minimal overhead during production: only 0.03 seconds runtime per behavioral report on modern machines.

1.4 Contributions

  • New Framework: We propose and explore a data-driven approach to behavioral reports for malware investigation (Section 5). We leverage word-based security feature engineering (Section 6) instead of manually engineered, environment-specific security features to achieve high portability across different malware platforms and settings.

  • BoW and ML: We design and implement the proposed framework using the bag of words (BoW) model (Section 5.2) and machine learning (ML) techniques (Section 5.3). The design is inspired by NLP solutions where word frequency is the key to feature engineering.

  • Application and Evaluation: We utilize the proposed framework for Android malware detection using behavioral reports from the DroidBox droidbox_github sandbox (Section 8). We extensively evaluate the framework on large reference datasets, namely Malgenome zhou2012dissecting, Drebin arp2014drebin, and Maldozer karbab2018maldozer (Section 7). To evaluate portability, we conduct a further evaluation on Win32 malware reports (Section 8.2) provided by a third-party security company. MalDy shows high accuracy in all the evaluation tasks.

2 Threat Model

We position MalDy as a generic malware investigation tool. MalDy's current design considers only behavioral reports. Therefore, MalDy is by design resilient to binary code static analysis issues like packing, compression, and dynamic loading. MalDy's performance depends on the quality of the collected reports: the more security information and features the reports provide about the malware samples, the better MalDy can differentiate malware from benign samples and attribute them to known families. The execution time and the random event generator may have a considerable impact on MalDy because they affect the quality of the behavioral reports. First, the execution time affects the amount of information in the reports; a short execution time may yield too little information to fingerprint the malware. Second, the random event generator may not produce the right events to trigger certain malware behaviors, which leads to false negatives. Anti-emulation techniques, used to evade dynamic analysis, could be problematic for the MalDy framework. However, this issue is tied to the choice of the underlying execution environment. First, the problem is less critical for a runtime execution environment because the behavioral reports are collected from real machines (no emulation); this scenario presumes that all processes are treated as benign and we check for malicious behaviors. Second, the security practitioner can replace the sandbox tool with a resilient alternative, since MalDy is agnostic to the underlying execution environment.

3 Overview

The execution of a binary sample (or app) produces textual logs, whether in a controlled environment (software sandbox) or a production one. The logs are a sequence of statements that result from the app's events, depending on the granularity of the logging. Furthermore, each statement is a sequence of words that gives a more granular description of the actual app event. From a security investigation perspective, the app's behaviors are summarized in an execution report, which is a sequence of statements, each statement being a sequence of words. We argue that malicious apps have behaviors distinguishable from those of benign apps, and that this difference is translated into words in the behavioral report. We also argue that similar malicious apps (same malware family) have behaviors that are translated into similar words.

<open_key key="HKEY_LOCAL_MACHINE\Software\Microsoft\Windows
NT\CurrentVersion\AppCompatFlags\Layers"/> <open_key
key="HKEY_CURRENT_USER\Software\Microsoft\Windows
NT\CurrentVersion\AppCompatFlags\Layers"/> <open_key
key="HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\
LanmanWorkstation\NetworkProvider"/>
</registry_section> <process_section> <enum_processes
apifunction="Process32First" quantity="84"/> <open_process targetpid="308"
desiredaccess="PROCESS_ALL_ACCESS PROCESS_CREATE_PROCESS PROCESS_CREATE_THREAD
PROCESS_DUP_HANDLE PROCESS_QUERY_INFORMATION PROCESS_SET_INFORMATION
PROCESS_TERMINATE PROCESS_VM_OPERATION PROCESS_VM_READ PROCESS_VM_WRITE
PROCESS_SET_SESSIONID PROCESS_SET_QUOTA SYNCHRONIZE"
apifunction="NtOpenProcess" successful="1"/>
Figure 1: Win32 Malware Behavioral Report Snippet (ThreatAnalyzer, www.threattrack.com)

Nowadays, there are many software sandbox solutions for malware investigation. CWSandbox (2006-2011) was one of the first sandbox solutions for production use. Later, CWSandbox became ThreatAnalyzer (https://www.threattrack.com/malware-analysis.aspx), owned by ThreatTrack Security. ThreatAnalyzer is a sandbox system for Win32 malware, and it produces behavioral reports that cover most aspects of malware behavior, such as file, network, and registry access records. Figure 1 shows a snippet from a behavioral report generated by ThreatAnalyzer. For Android malware, we use DroidBox droidbox_github, a well-established sandbox environment based on the Android software emulator android_emulator provided by the Google Android SDK android_sdk. Running an app alone may not yield sufficient execution coverage. As such, to simulate user interaction with the apps, we leverage MonkeyRunner monkeyrunner, which produces random UI actions aiming for broader execution coverage. However, this makes the app execution non-deterministic since MonkeyRunner generates random actions. Figure 2 shows a snippet from a behavioral report generated using DroidBox.

"accessedfiles": { "1546331488": "/proc/1006/cmdline","2044518634":
"/data/com.macte.JigsawPuzzle.Romantic/shared_prefs/com.apperhand.global.xml",
"296117026":
"/data/com.macte.JigsawPuzzle.Romantic/shared_prefs/com.apperhand.global.xml",
"592194838": "/data/data/com.km.installer/shared_prefs/TimeInfo.xml",
"956474991": "/proc/992/cmdline"},"apkName": "fe3a6f2d4c","closenet":
{},"cryptousage": {},"dataleaks": {},"dexclass": { "0.2725639343261719": {
    "path": "/data/app/com.km.installer-1.apk", "type": "dexload"}
Figure 2: Android Malware Behavioral Report Snippet (DroidBox droidbox_github )
Figure 3: MalDy Methodology Overview

4 Notation

  • X is the global dataset used to build MalDy and report its performance on the various tasks. We use the build set X_build to train and tune the hyper-parameters of MalDy models. The test set X_test is used to measure the final performance of MalDy, which is reported in the evaluation section. X is divided randomly and equally into X_build (50%) and X_test (50%). To build the sub-datasets, we employ a stratified random split on the main dataset (a minimal split sketch follows this list).

  • The build set, X_build, is composed of the train set and the validation set and is used to build MalDy ensembles.

  • The build size, |X_build|, is the total number of reports used to build MalDy. The train set takes 90% of the build set and the rest is used as the validation set X_valid.

  • The train set, X_train, is the training dataset of MalDy's machine learning models.

  • |X_train| is the number of reports in the train set.

  • The validation set, X_valid, is the dataset used to tune the trained models. We choose the hyper-parameters that achieve the best scores on the validation set.

  • |X_valid| is the number of reports in the validation set.

  • A single record in X is composed of a single report r and its label y. The meaning of the label depends on the investigation task. In the detection task, positive means malware and negative means benign. In the family attribution task, positive means the sample is part of the current model's malware family and negative means it is not.

  • We use X_test to compute and report the final performance results presented in the evaluation section (Section 8).

  • |X_test| is the size of X_test and it represents 50% of the global dataset X.
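The split described above can be reproduced with standard tooling. The following is a minimal sketch, assuming the reports and labels are held as Python lists and using scikit-learn's stratified splitting; the variable names (X, X_build, X_train, etc.) mirror the notation above and are illustrative, not taken from the authors' implementation.

from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=0):
    # 50/50 stratified split into build and test sets.
    X_build, X_test, y_build, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    # 90/10 split of the build set into train and validation sets.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_build, y_build, test_size=0.1, stratify=y_build, random_state=seed)
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)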

5 Methodology

In this section, we present the general approach of MalDy as illustrated in Figure 3. The section describes the approach based on the chronological order of the building steps.

5.1 Behavioral Reports Generation

The MalDy framework starts from a dataset of behavioral reports with known labels. We consider two primary sources for such reports based on the collection environment. First, we collect reports from a software sandbox environment Toward2017WillemsHF, in which we execute the binary program, malware or benign, in a controlled system (mostly virtual machines). The main usage of sandboxing in security investigation is to check and analyze the maliciousness of programs. Second, we could collect the behavioral reports from a production system in the form of system logs of the running apps; the goal is to verify the sanity of the apps during execution, i.e., that there is no malicious activity. As presented in Section 3, MalDy employs a word-based approach to model the behavioral reports, but it is not yet clear how the report's sequence of words is used in the MalDy framework.

5.2 Report Vectorization

In this section, we answer the question: how can we model the words in the behavioral report to fit our classification component? Previous solutions Sen2016StormDroid, Automatic2011Rieck select specific features from the behavioral reports by (i) extracting relevant security features and (ii) manually inspecting and selecting from these features Sen2016StormDroid. This process involves manual work by the security investigator. It is also not scalable, since the investigator needs to redo this process manually for each new type of behavioral report. In other words, we are looking for a feature (word, in our case) representation that enables automatic feature engineering without the intervention of a security expert. For this purpose, MalDy employs the bag of words (BoW) NLP model. Specifically, we leverage term frequency-inverse document frequency (TFIDF) itidf_wiki or feature hashing (the hashing trick, FH) qinfeng09hashk. MalDy thus has two variants depending on whether TFIDF or FH is chosen as the BoW technique. These techniques generate fixed-length vectors for the behavioral reports using word frequencies; TFIDF and FH are presented in more detail in Section 6. At this point, we formulate the reports as feature vectors, and we turn to building classification models; a minimal sketch of both vectorization variants follows.
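As a concrete illustration of the two variants, the following minimal sketch vectorizes raw report strings with scikit-learn, assuming each behavioral report has been flattened into a single string of space-separated words; the n-gram range and vector length are illustrative choices, not the paper's exact settings.

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

reports = ["open_key HKEY_LOCAL_MACHINE AppCompatFlags Layers NtOpenProcess",
           "accessedfiles proc 1006 cmdline dexload com.km.installer"]  # toy examples

# Variant 1: feature hashing (FH) over word n-grams, L2-normalized, fixed width.
fh = HashingVectorizer(analyzer="word", ngram_range=(1, 3),
                       n_features=2**16, norm="l2", alternate_sign=False)
X_fh = fh.transform(reports)            # sparse matrix, width 2**16

# Variant 2: TF-IDF over word n-grams, L2-normalized, vocabulary learned from corpus.
tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), norm="l2")
X_tfidf = tfidf.fit_transform(reports)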

5.3 Build Models

The MalDy framework utilizes supervised machine learning to build its malware investigation models. MalDy is composed of a set of models, each with a specific purpose. First, the threat detection model estimates the maliciousness likelihood of a given app from its behavioral report. The remaining machine learning models investigate individual family threats separately: MalDy uses one model for each possible threat the investigator is checking for. In our case, we have a malware detection model along with a set of malware family attribution models. In this phase, we build each model separately using the build set X_build. All the models perform binary classification to provide the likelihood of a specific threat. In the process of building MalDy models, we evaluate different classification algorithms to compare their performance. Furthermore, we tune each ML algorithm's classification performance over an array of hyper-parameters (different for each ML algorithm). This is a completely automatic process; the investigator only needs to provide X_build. We train each investigation model on X_train and tune its performance on X_valid by finding the optimum algorithm hyper-parameters, as presented in Algorithm 1. Afterward, we determine the optimum decision threshold for each model using its performance on X_valid. At the end of this stage, we have a list of optimum model tuples; the cardinality of the list is the number of explored classification algorithms. A tuple (c, params, th) defines the optimum hyper-parameters and decision threshold for ML classification algorithm c.

Input : X_build: build set
Output : OptModels: optimum models' tuples
for c in MLAlgorithms do
      best_score = 0
      for params in c.params_array do
            model = train(c, params, X_train) ;
            s, th = validate(model, X_valid) ;
            if s > best_score then
                  best_score = s ;
                  ct = (c, params, th) ;
            end if
      end for
      OptModels.add(ct)
end for
return OptModels
Algorithm 1 Build Models Algorithm
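The following minimal Python sketch mirrors Algorithm 1, assuming scikit-learn estimators, an F1-based validation score, and a simple grid over decision thresholds; the candidate grids and helper names are illustrative, not the authors' exact procedure.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score

def build_models(candidates, X_train, y_train, X_valid, y_valid):
    """candidates: list of (estimator, params_array) pairs."""
    optimum = []
    for estimator, params_array in candidates:
        best = (0.0, None)                        # (validation score, (model, threshold))
        for params in params_array:
            model = clone(estimator).set_params(**params)
            model.fit(X_train, y_train)
            proba = model.predict_proba(X_valid)[:, 1]
            # pick the decision threshold that maximizes F1 on the validation set
            thresholds = np.linspace(0.1, 0.9, 17)
            scores = [f1_score(y_valid, proba >= t) for t in thresholds]
            s, th = max(zip(scores, thresholds))
            if s > best[0]:
                best = (s, (model, th))
        optimum.append(best[1])
    return optimum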

5.4 Ensemble Composition

Previously, we discussed the process of building and tuning individual classification models for specific investigation tasks (malware detection, family-one threat attribution, family-two threat attribution, etc.). In this phase, we construct an ensemble model (which outperforms single models) from the set of models generated using the optimum parameters computed previously (Section 5.3). We take each set of optimally trained models for a specific threat investigation task and unify them into an ensemble E. The ensemble uses majority voting among the individual models' outcomes for that investigation task. Equation 1 shows the computation of the final outcome of one ensemble E, where w_i is the weight given to the individual model M_i. The current implementation gives equal weights to the ensemble's models; we consider exploring weight variations in future work. This phase produces the MalDy ensembles, one ensemble per threat, where the outcome of each ensemble is the likelihood that its threat is positive.

$E(x) = \frac{1}{k}\sum_{i=1}^{k} w_i \, M_i(x)$   (1)

5.4.1 Ensemble Prediction Process

The MalDy prediction process is divided into two phases, as depicted in Algorithm 2. First, given a behavioral report, we generate the feature vector x using the TFIDF or FH vectorization technique. Afterward, the detection ensemble checks the maliciousness likelihood of the feature vector x. If the maliciousness detection is positive, we proceed to the family threat attribution. Since the family threat ensembles are independent, we compute the outcome of each family ensemble separately. MalDy flags a malware family threat if and only if the majority vote is above a given voting threshold (computed using X_valid). If no family threat is flagged by the family ensembles, MalDy tags the current sample as an unknown threat. If multiple families are flagged, MalDy selects the family with the highest probability and provides the security investigator with the flagged families sorted by likelihood. The separation between the family attribution models makes MalDy more flexible to update: adding a new family threat only requires training, tuning, and calibrating that family's model, without affecting the rest of the framework's ensembles.

Input : r: Report
Output : decision
E_detect = detection ensemble ;
E_family = set of family attribution ensembles ;
x = Vectorize(r) ;
detection_result = E_detect(x) ;
if detection_result <= 0 then
      return detection_result ;
end if
for E_f in E_family do
      family_result[f] = E_f(x) ;
end for
return detection_result, family_result ;
Algorithm 2 Prediction Algorithm
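A minimal Python sketch of this two-phase prediction is given below, assuming an ensemble is represented as a list of (model, threshold) pairs and that the vectorizer and ensembles were built beforehand; the names (detector, family_ensembles, vote_threshold) are illustrative.

import numpy as np

def ensemble_vote(ensemble, x):
    # fraction of models that flag the vectorized report x as positive
    votes = [m.predict_proba(x)[:, 1][0] >= th for m, th in ensemble]
    return float(np.mean(votes))

def predict(report, vectorize, detector, family_ensembles, vote_threshold=0.5):
    x = vectorize(report)                         # TFIDF or FH feature vector
    if ensemble_vote(detector, x) < vote_threshold:
        return "benign", None
    scores = {fam: ensemble_vote(ens, x) for fam, ens in family_ensembles.items()}
    flagged = {f: s for f, s in scores.items() if s >= vote_threshold}
    if not flagged:
        return "malware", "unknown family"
    # report the flagged families sorted by likelihood, highest first
    return "malware", sorted(flagged, key=flagged.get, reverse=True)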

6 Framework

In this section, we present in more detail the key techniques used in the MalDy framework, namely n-grams ngram2004AbouAssaleh, feature hashing (FH), and term frequency-inverse document frequency (TFIDF). Furthermore, we present the machine learning algorithms explored and tuned during the model building phase (Section 6.2).

6.1 Feature Engineering

In this section, we describe the components of MalDy related to the automatic security feature engineering process.

6.1.1 Common N-Gram Analysis (CNG)

A key tool in MalDy's feature engineering process is common n-gram analysis (CNG) ngram2004AbouAssaleh, or simply n-grams. N-grams have been extensively used in text analysis and natural language processing in general, and in applications such as automatic text classification and authorship attribution ngram2004AbouAssaleh. Simply put, the n-gram technique computes the contiguous sequences of n items from a larger sequence. In the context of MalDy, we compute word n-grams on behavioral reports by counting the word sequences of size n. The n-grams are extracted using a window of size n that moves forward by one step, incrementing the counter of the found feature (the word sequence in the window) by one. The window size n is a hyper-parameter of the MalDy framework. N-gram computation happens simultaneously with the vectorization using FH or TFIDF, in the form of a pipeline, to prevent the computation and memory issues caused by the high dimensionality of the n-grams. From a security investigation perspective, n-grams can produce features that distinguish different variations of an event log better than single-word (1-gram) features. The performance of the malware investigation is highly affected by the features generated using n-grams (where n > 1). Based on the BoW model, MalDy takes the counts of unique n-grams as features, which are fed through a pipeline to FH or TFIDF; a small n-gram extraction sketch follows.
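The following minimal sketch illustrates the forward-moving window described above; it is a plain generator that could feed the FH or TFIDF pipeline, not the authors' implementation.

def word_ngrams(report_words, n):
    """Yield contiguous word sequences of length n from a report."""
    for i in range(len(report_words) - n + 1):
        yield " ".join(report_words[i:i + n])

# Example: 2-grams of a tiny report
print(list(word_ngrams(["open_key", "HKEY_LOCAL_MACHINE", "Layers"], 2)))
# ['open_key HKEY_LOCAL_MACHINE', 'HKEY_LOCAL_MACHINE Layers']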

6.1.2 Feature Hashing

The first approach to vectorize the behavioral reports employs feature hashing (FH) qinfeng09hashk (also called the hashing trick) along with n-grams. Feature hashing is a machine learning preprocessing technique for compacting an arbitrary number of features into a fixed-length feature vector. The feature hashing algorithm, described in Algorithm 3, takes as input the report's n-gram generator and the target length L of the feature vector. The output is a feature vector of fixed size L. We normalize the vector using the Euclidean norm (also called the L2 norm). As shown in Formula 2, the Euclidean norm is the square root of the sum of the squared vector values.

$\|x\|_2 = \sqrt{\sum_{i=1}^{L} x_i^2}$   (2)
Input : X_seq: Report Word Sequence,
L: Feature Vector Length
Output : FH: Feature Hashing Vector
ngrams = Ngram_Generator(X_seq) ;
FH = new feature_vector[L] ;
for Item in ngrams do
      H = hash(Item) ;
      feature_index = H mod L ;
      FH[feature_index] += 1 ;
end for
// normalization
FH = FH / ||FH||_2 ;
Algorithm 3 Feature Vector Computation
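For concreteness, the following is a minimal Python counterpart of Algorithm 3, assuming a generic string hash; the vector length L and the choice of hash function are illustrative, not the paper's exact settings.

import hashlib
import numpy as np

def hash_item(item):
    # stable integer hash of an n-gram string
    return int(hashlib.md5(item.encode("utf-8")).hexdigest(), 16)

def feature_hashing(ngrams, L=2**16):
    fh = np.zeros(L)
    for item in ngrams:
        fh[hash_item(item) % L] += 1.0            # increment the hashed bucket
    norm = np.linalg.norm(fh)                     # Euclidean (L2) norm, Formula 2
    return fh / norm if norm > 0 else fh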

Previous research Weinbergeretal09, qinfeng09hashk has shown that the hash kernel approximately preserves vector distances. Moreover, the computational cost incurred by using the hashing technique for dimensionality reduction grows logarithmically with the number of samples and groups, and it helps control the length of the compacted vector in the associated feature space. Algorithm 3 illustrates the overall process of computing the compacted feature vector.

6.1.3 Term Frequency-Inverse Document Frequency

TFIDF itidf_wiki is the second possible approach for behavioral report vectorization and also leverages n-grams. It is a well-known technique adopted in information retrieval (IR) and natural language processing (NLP). It computes feature vectors of the input behavioral reports by considering the relative frequency of the n-grams in an individual report compared to the whole report dataset. Let R be the set of behavioral reports, where N = |R| is the number of reports, and let r be a report composed of m n-grams. The TFIDF of an n-gram w in a report r is the product of the term frequency of w in r and the inverse document frequency of w, as shown in Formula 3. The term frequency (Formula 4) is the number of occurrences of w in r. The inverse document frequency of w (Formula 5) is the logarithm of the number of reports divided by the number of reports that contain w. Similarly to feature hashing (Section 6.1.2), we normalize the produced vector using the L2 norm (see Formula 2). The computation of TFIDF is very scalable, which enhances MalDy's efficiency.

$\mathit{tfidf}(w, r) = \mathit{tf}(w, r) \times \mathit{idf}(w)$   (3)
$\mathit{tf}(w, r) = |\{w' \in r : w' = w\}|$   (4)
$\mathit{idf}(w) = \log \frac{N}{|\{r \in R : w \in r\}|}$   (5)
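As a small worked example of Formulas 3-5, the following sketch computes TFIDF by hand on a toy corpus of three "reports"; it follows the definitions above rather than any particular library's smoothing conventions.

import math

reports = [["open_key", "open_key", "sendsms"],
           ["open_key", "recvnet"],
           ["sendsms", "recvnet"]]

def tf(word, report):
    return report.count(word)

def idf(word, reports):
    containing = sum(1 for r in reports if word in r)
    return math.log(len(reports) / containing)

def tfidf(word, report, reports):
    return tf(word, report) * idf(word, reports)

print(tfidf("open_key", reports[0], reports))   # 2 * log(3/2) ~ 0.81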

6.2 Machine Learning Algorithms

Table 1 shows the candidate machine learning classification algorithms for the MalDy framework. The candidates are among the most widely used classification algorithms and come from different learning categories, such as tree-based methods. All of these algorithms have efficient public implementations. We chose to exclude logistic regression from the candidate list due to the superiority of SVM in almost all cases. KNN may consume a lot of memory resources in production because it needs the entire training dataset to be deployed in the production environment. However, we keep KNN in MalDy's candidate list because of its uniquely fast update: updating KNN in a production environment requires only updating the train set, with no model retraining. This option can be very helpful in certain malware investigation cases. We consider other ML classifiers for future work.

Classifier Category    Classifier Algorithm            Chosen
Tree                   CART                            Yes
Tree                   Random Forest                   Yes
Tree                   Extremely Randomized Trees      Yes
General                K-Nearest Neighbor (KNN)        Yes
General                Support Vector Machine (SVM)    Yes
General                Logistic Regression             No
General                XGBoost                         Yes
Table 1: Explored Machine Learning Classifiers

7 Evaluation Datasets

Table 2 presents the different datasets used to evaluate the MalDy framework. We focus on the Android and Win32 platforms to prove MalDy's portability; other platforms are left for future research. All the used datasets are publicly available except the Win32 malware dataset, which is provided by a third-party security vendor. The behavioral reports are generated using DroidBox droidbox_github and ThreatAnalyzer (https://www.threattrack.com) for Android and Win32 respectively.

Platform Dataset Sandbox Tag #Sample/#Family
MalGenome zhou2012dissecting D Malware 1k/10
Android Drebin arp2014drebin D Malware 5k/10
Maldozer karbab2018maldozer D Malware 20k/20
AndroZoo Allix2016AndroZoo D Benign 15k/-
PlayDrone (https://archive.org/details/android_apps) D Benign 15k/-
Win32 Malware (https://threattrack.com) T Malware 20k/15
Table 2: Evaluation Datasets. D: DroidBox, T: ThreatAnalyzer

8 MalDy Evaluation

Figure 4: MalDy Effectiveness Performance (panels: (a) General, (b) Malgenome, (c) Drebin, (d) Maldozer)

In this section, we evaluate the MalDy framework on different datasets and under various settings. Specifically, we question the effectiveness of the word-based approach for malware detection and family attribution on Android behavioral reports (Section 8.1). We verify the portability of the MalDy concept on other platforms' (Win32 malware) behavioral reports (Section 8.2). Finally, we measure the efficiency of MalDy under different machine learning classifiers and vectorization techniques (Section 8.3). During the evaluation, we also answer questions related to the comparison between the vectorization techniques (Section 8.1.2) and between the used classifiers in terms of effectiveness and efficiency (Section 8.1.1). We also show the effect of the train set size (Section 8.2.2) and of the machine learning ensemble technique (Section 8.1.3) on the framework's performance.

8.1 Effectiveness

The most important question in this research is: can the MalDy framework detect malware and perform family attribution using a word-based model on behavioral reports? In other words, how effective is this approach? Figure 4 shows the detection and attribution performance under various settings and datasets. The settings are the classifiers used in the ML ensembles and their hyper-parameters, as shown in Table 4. Figure 4(a) depicts the overall performance of MalDy. In detection, MalDy achieves a 90% f1-score (100% maximum and about 80% minimum) in most cases. In the attribution task, MalDy shows over 80% f1-score across the various settings. More granular results for each dataset are shown in Figures 4(b), 4(c), and 4(d) for the Malgenome zhou2012dissecting, Drebin arp2014drebin, and Maldozer karbab2018maldozer datasets respectively. Notice that Figure 4(a) combines the performance of the base (worst), tuned, and ensemble models; Table 3 summarizes these results.

Detection (F1 %) Attribution (F1 %)
Base Tuned Ens Base Tuned Ens
General
mean 86.06 90.47 94.21 63.42 67.91 73.82
std 6.67 6.71 6.53 15.94 15.92 14.68
min 69.56 73.63 77.48 30.14 34.76 40.75
25% 83.58 88.14 90.97 50.90 55.58 69.07
50% 85.29 89.62 96.63 68.81 73.31 78.21
75% 91.94 96.50 99.58 73.60 78.07 84.52
max 92.81 97.63 100.0 86.09 90.41 93.78
Genome
mean 88.78 93.23 97.06 71.19 75.67 79.92
std 5.26 5.46 4.80 16.66 16.76 16.81
min 77.46 81.69 85.23 36.10 40.10 44.09
25% 85.21 89.48 97.43 72.36 77.03 81.47
50% 91.82 96.29 99.04 76.66 81.46 86.16
75% 92.13 96.68 99.71 80.72 84.82 88.61
max 92.81 97.63 100.0 86.09 90.41 93.78
Drebin
mean 88.92 93.34 97.18 65.97 70.37 76.47
std 4.93 4.83 4.65 9.23 9.14 9.82
min 78.36 83.35 85.37 47.75 52.40 55.10
25% 84.95 89.34 96.56 61.67 65.88 75.05
50% 91.60 95.86 99.47 69.62 74.30 80.16
75% 92.25 96.53 100.0 72.68 76.91 81.61
max 92.78 97.55 100.0 76.28 80.54 87.71
Maldozer
mean 80.48 84.85 88.38 53.11 57.68 65.06
std 6.22 6.20 5.95 16.03 15.99 13.22
min 69.56 73.63 77.48 30.14 34.76 40.75
25% 75.69 80.13 84.56 39.27 43.43 53.65
50% 84.20 88.68 91.58 56.62 61.03 71.65
75% 84.88 89.01 92.72 67.34 71.89 74.78
max 85.68 89.97 93.39 71.17 76.04 78.30
Table 3: Effect of Tuning on MalDy Performance

8.1.1 Classifier Effect

The results in Figure 5, Table 3, and the detailed Table 4 confirm the effectiveness of the MalDy framework and its word-based approach. Figure 5 presents the effectiveness of MalDy using the different classifiers for the final ensemble models. Figure 5(a) shows the combined detection and attribution performance in f1-score. All the ensembles achieve a good f1-score, and the XGBoost ensemble shows the highest scores. Figure 5(b) confirms the previous observations for the detection task. Figure 5(c) presents the malware family attribution scores per ML classifier. More details on the classifiers' performance are given in Table 4.

Figure 5: MalDy Effectiveness per Machine Learning Classifier (panels: (a) General, (b) Detection, (c) Attribution)

8.1.2 Vectorization Effect

Figure 6 shows the effect of the vectorization technique on detection and attribution performance. Figure 6(a) depicts the overall combined performance under the various settings: feature hashing and TFIDF show very similar performance. In the detection task, the vectorization techniques' f1-scores are almost identical, as presented in Figure 6(b). We notice a higher overall attribution score using TFIDF compared to FH, as shown in Figure 6(c). However, there are cases where FH outperforms TFIDF; for instance, XGBoost achieved a higher attribution score under feature hashing vectorization, as shown in Table 4.

Figure 6: MalDy Effectiveness per Vectorization Technique (panels: (a) General, (b) Detection, (c) Attribution)
Settings Attribution F1-Score (%) Detection F1-Score (%)
Model Dataset Vector Base Tuned Ensemble Base Tuned Ensemble FPR(%)
CART Drebin Hashing 64.93 68.94 72.92 91.55 95.70 99.40 00.64
Drebin TFIDF 68.12 72.48 75.76 92.48 96.97 100.0 00.00
Genome Hashing 82.59 87.28 89.90 91.79 96.70 98.88 00.68
Genome TFIDF 86.09 90.41 93.78 92.25 96.50 100.0 00.00
Maldozer Hashing 33.65 38.56 40.75 82.59 87.18 90.00 06.92
Maldozer TFIDF 40.14 44.21 48.07 83.92 88.67 91.16 04.91
ETrees Drebin Hashing 72.84 77.27 80.41 91.65 95.77 99.54 00.23
Drebin TFIDF 71.12 76.12 78.13 92.78 97.55 100.0 00.00
Genome Hashing 74.41 79.20 81.63 91.91 96.68 99.14 00.16
Genome TFIDF 73.83 78.65 81.02 92.09 96.61 99.57 00.03
Maldozer Hashing 65.23 69.34 73.13 84.56 88.70 92.42 06.53
Maldozer TFIDF 67.14 71.85 74.42 84.84 88.94 92.74 06.41
KNN Drebin Hashing 47.75 52.40 55.10 78.36 83.35 85.37 12.86
Drebin TFIDF 51.87 56.53 59.20 82.48 86.57 90.40 05.83
Genome Hashing 36.10 40.10 44.09 77.46 81.69 85.23 07.01
Genome TFIDF 37.66 42.01 45.31 81.22 85.30 89.13 02.10
Maldozer Hashing 41.68 46.67 48.69 69.56 73.63 77.48 26.21
Maldozer TFIDF 48.02 52.73 55.31 70.94 75.36 78.51 03.86
RForest Drebin Hashing 72.63 76.80 80.46 91.54 95.95 99.12 00.99
Drebin TFIDF 72.15 76.40 79.91 92.31 96.62 100.0 00.00
Genome Hashing 78.92 83.73 86.12 91.37 95.79 98.95 00.68
Genome TFIDF 79.45 83.90 87.00 92.75 97.49 100.0 00.00
Maldozer Hashing 66.06 70.72 73.41 84.49 88.96 92.01 07.37
Maldozer TFIDF 67.96 72.04 75.89 85.07 89.41 92.72 06.10
SVM Drebin Hashing 57.35 61.95 82.92 84.50 89.33 96.08 00.86
Drebin TFIDF 63.11 67.19 87.71 85.11 89.35 96.73 01.15
Genome Hashing 69.99 74.68 86.08 85.47 89.83 96.54 00.19
Genome TFIDF 73.16 77.82 86.20 84.46 88.46 97.73 00.39
Maldozer Hashing 30.14 34.76 65.76 72.32 77.12 81.88 15.82
Maldozer TFIDF 36.69 41.09 70.18 76.82 81.14 85.46 08.56
XGBoost Drebin Hashing 76.28 80.54 84.01 92.05 96.50 99.61 00.29
Drebin TFIDF 73.53 77.88 81.18 92.23 96.45 100.0 00.00
Genome Hashing 81.80 85.84 89.75 91.86 96.09 99.62 00.32
Genome TFIDF 80.36 84.48 88.24 92.81 97.63 100.0 00.00
Maldozer Hashing 71.17 76.04 78.30 85.68 89.97 93.39 05.86
Maldozer TFIDF 69.51 74.15 76.87 85.01 89.16 92.86 06.05
Table 4: Android Malware Detection

8.1.3 Tuning Effect

Figure 7 illustrates the effect of the tuning and ensemble phases on the overall performance of MalDy. In the detection task, as shown in Figure 7(a), the ensemble improves the f1-score by about 10% over the base model. The ensemble is composed of a set of tuned models that already outperform the base model. In the attribution task, the ensemble improves the f1-score by about 9%, as shown in Figure 7(b).

Figure 7: Effect of MalDy Ensemble and Tuning on the Performance (panels: (a) Detection, (b) Attribution)

8.2 Portability

In this section, we examine the portability of MalDy by applying the framework to a new type of behavioral reports (Section 8.2.1). We also investigate the train set size MalDy needs to achieve good results (Section 8.2.2). We report only the results of the attribution task on Win32 malware because we lack a dataset of benign Win32 behavioral reports.

8.2.1 MalDy on Win32 Malware

Table 5 presents MalDy's attribution performance in terms of f1-score. In contrast with the previous results, we train the MalDy models on only 2k (10%) of the 20k reports in this dataset (Table 2); the remaining reports are used for testing (18k reports, or 90%). Despite that, MalDy achieves high scores, reaching 95%. The results in Table 5 illustrate the portability of MalDy, which increases the utility of our framework across different platforms and environments.

Model Ensemble F1-Score(%)
Hashing TFIDF
CART 82.35 82.74
ETrees 92.62 92.67
KNN 76.48 80.90
RForest 91.90 92.74
SVM 91.97 91.26
XGBoost 94.86 95.43
Table 5: MalDy Performance on Win32 Malware Behavioral Reports

8.2.2 MalDy Train Dataset Size

Using the Win32 malware dataset (Table 5), we investigate the train set size MalDy needs to achieve good results. Figure 8 exhibits the outcome of this analysis for both vectorization techniques and the different classifiers. We notice MalDy's high scores even with relatively small train sets. This is clearest with the SVM ensemble, which achieves an 87% f1-score with only 200 training samples.

8.3 Efficiency

Figure 9 illustrates the efficiency of MalDy by showing the average runtime required to investigate a behavioral report. The runtime is composed of the preprocessing time and the prediction time. As depicted in Figure 9, MalDy needs only about 0.03 seconds per report for all the ensembles and preprocessing settings except the SVM ensemble, which requires 0.2 to 0.5 seconds (depending on the preprocessing technique) to decide about a given report. Although the SVM ensemble needs a small train set to achieve good results (see Section 8.2.2), it is expensive in production in terms of runtime. Therefore, the security investigator can customize the MalDy framework to suit the priorities of particular cases. The efficiency experiments were conducted on an Intel(R) Xeon(R) CPU E52630 machine (128 GB RAM), using only one CPU core; a minimal timing sketch follows.
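The per-report runtime could be measured as in the following minimal sketch, which times vectorization (preprocessing) and ensemble prediction separately; vectorizer and ensemble stand for previously built objects and are illustrative, not the authors' benchmarking harness.

import time

def time_per_report(reports, vectorizer, ensemble):
    start = time.perf_counter()
    X = vectorizer.transform(reports)             # preprocessing time
    mid = time.perf_counter()
    for model, threshold in ensemble:             # prediction time
        model.predict_proba(X)
    end = time.perf_counter()
    n = len(reports)
    return (mid - start) / n, (end - mid) / n     # seconds per report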

Figure 8: MalDy on Win32 Malware and the Effect of the Training Size on the Performance (panels: (a) Hashing (F1-Score %), (b) TFIDF (F1-Score %))
Figure 9: MalDy Efficiency

9 Conclusion, Limitation, and Future Work

The daily volume of malware that targets the well-being of the cyberspace is increasing exponentially, overwhelming security investigators. Furthermore, the diversity of the targeted platforms and architectures compounds the problem by opening new dimensions to the investigation. Behavioral analysis is an important investigation tool that analyzes binary samples and produces behavioral reports. In this work, we propose a portable, effective, and efficient investigation framework for malware detection and family attribution. The key concept is to model the behavioral reports using the bag of words model. Afterwards, we leverage advanced NLP and ML techniques to build discriminative machine learning ensembles. MalDy achieves over 94% f1-score in the Android detection task on the Malgenome, Drebin, and MalDozer datasets and more than 90% in the attribution task. We demonstrate MalDy's portability by applying the framework to Win32 malware reports, where it achieves 94% f1-score on the attribution task. MalDy's performance depends on the execution environment's reporting system, and the quality of the reports affects its results. In the current design, MalDy is not able to measure this quality to help the investigator choose the optimum execution environment; we leave this issue for future research.

References

  • [1] Android Emulator - https://tinyurl.com/zlngucb, 2016.
  • [2] Android SDK - https://tinyurl.com/hn8qo9o, 2016.
  • [3] DroidBox - https://tinyurl.com/jaruzgr, 2016.
  • [4] MonkeyRunner - https://tinyurl.com/j6ruqkj, 2016.
  • [5] tf-idf - https://tinyurl.com/mcdf46g, 2016.
  • [6] T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan. N-gram-based detection of new malicious code. International Computer Software and Applications Conference (COMPSAC), 2004.
  • [7] Kevin Allix, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. AndroZoo: collecting millions of Android apps for the research community. In International Conference on Mining Software Repositories (MSR), 2016.
  • [8] Mohammed K Alzaylaee, Suleiman Y Yerima, and Sakir Sezer. DynaLog: An automated dynamic analysis framework for characterizing Android applications. CoRR, 2016.
  • [9] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, and Konrad Rieck. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. In Symposium on Network and Distributed System Security (NDSS), 2014.
  • [10] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Scalable , Behavior-Based Malware Clustering. In Symposium on Network and Distributed System Security (NDSS), 2009.
  • [11] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Krügel, Engin Kirda, Christopher Kruegel, and Engin Kirda. Scalable, Behavior-Based Malware Clustering. In Symposium on Network and Distributed System Security (NDSS), 2009.
  • [12] Li Chen, Mingwei Zhang, Chih-yuan Yang, and Ravi Sahita. Semi-supervised Classification for Dynamic Android Malware Detection. ACM Conference on Computer and Communications Security (CCS), 2017.
  • [13] Sen Chen, Minhui Xue, Zhushou Tang, Lihua Xu, and Haojin Zhu. StormDroid: A Streaminglized Machine Learning-Based System for Detecting Android Malware. In ACM Symposium on Information, Computer and Communications Security (ASIACCS), 2016.
  • [14] Mariano Graziano, Davide Canali, Leyla Bilge, Andrea Lanzi, and Davide Balzarotti. Needles in a Haystack: Mining Information from Public Dynamic Analysis Sandboxes for Malware Intelligence. USENIX Security Symposium, 2015.
  • [15] Xin Hu, Sandeep Bhatkar, Kent Griffin, and Kang G Shin. MutantX-S: Scalable Malware Clustering Based on Static Features. USENIX Annual Technical Conference, 2013.
  • [16] Paul Irolla and Eric Filiol. Glassbox: Dynamic Analysis Platform for Malware Android Applications on Real Devices. CoRR, 2016.
  • [17] Takamasa Isohara, Keisuke Takemori, and Ayumu Kubota. Kernel-based behavior analysis for android malware detection. International Conference on Computational Intelligence and Security (CIS), 2011.
  • [18] ElMouatez Billah Karbab and Mourad Debbabi. Automatic investigation framework for android malware cyber-infrastructures. CoRR, 2018.
  • [19] ElMouatez Billah Karbab and Mourad Debbabi. Togather: Automatic investigation of android malware cyber-infrastructures. In International Conference on Availability, Reliability and Security, (ARES), 2018.
  • [20] ElMouatez Billah Karbab, Mourad Debbabi, Saed Alrabaee, and Djedjiga Mouheb. Dysign: Dynamic fingerprinting for the automatic detection of android malware. CoRR, 2017.
  • [21] ElMouatez Billah. Karbab, Mourad Debbabi, Saed Alrabaee, and Djedjiga Mouheb. DySign: Dynamic fingerprinting for the automatic detection of android malware. International Conference on Malicious and Unwanted Software (MALWARE), 2017.
  • [22] ElMouatez Billah. Karbab, Mourad Debbabi, Abdelouahid Derhab, and Djedjiga Mouheb. Cypider: Building Community-Based Cyber-Defense Infrastructure for Android Malware Detection. In ACM Computer Security Applications Conference (ACSAC), 2016.
  • [23] ElMouatez Billah Karbab, Mourad Debbabi, Abdelouahid Derhab, and Djedjiga Mouheb. Android malware detection using deep learning on API method sequences. CoRR, 2017.
  • [24] ElMouatez Billah. Karbab, Mourad Debbabi, Abdelouahid Derhab, and Djedjiga Mouheb. MalDozer: Automatic framework for android malware detection using deep learning. Digital Investigation, 2018.
  • [25] Amin Kharraz, Sajjad Arshad, Collin Mulliner, William K Robertson, and Engin Kirda. UNVEIL: A Large-Scale, Automated Approach to Detecting Ransomware. In USENIX Security Symposium, 2016.
  • [26] Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiaoyong Zhou, and Xiaofeng Wang. Effective and Efficient Malware Detection at the End Host. In USENIX Security Symposium, 2009.
  • [27] Enrico Mariconti, Lucky Onwuzurike, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. MaMaDroid: Detecting Android Malware by Building Markov Chains of Behavioral Models. In Symposium on Network and Distributed System Security (NDSS), 2017.
  • [28] Fabio Martinelli, Francesco Mercaldo, Andrea Saracino, and Corrado Aaron Visaggio. I find your behavior disturbing: Static and dynamic app behavioral analysis for detection of Android malware. Conference on Privacy, Security and Trust (PST), 2016.
  • [29] Lakshmanan Nataraj, Vinod Yegneswaran, Phillip Porras, and Jian Zhang. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In ACM Workshop on Security and Artificial Intelligence (AISec), 2011.
  • [30] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 2011.
  • [31] Giorgio Severi, Tim Leek, and Brendan Dolan-gavitt. Malrec: Compact Full-Trace Malware Recording for Retrospective Deep Analysis. Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2018.
  • [32] Daniele Sgandurra, Luis Muñoz-González, Rabih Mohsen, and Emil C. Lupu. Automated Dynamic Analysis of Ransomware: Benefits, Limitations and use for Detection. CoRR, 2016.
  • [33] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alexander J Smola, Alexander L Strehl, and Vishy Vishwanathan. Hash Kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
  • [34] Chi-Wei Wang and Shiuhpyng Winston Shieh. DROIT: Dynamic Alternation of Dual-Level Tainting for Malware Analysis. J. Inf. Sci. Eng., 31(1):111–129, 2015.
  • [35] Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. Feature Hashing for Large Scale Multitask Learning. Annual International Conference on Machine Learning (ICML), 2009.
  • [36] Carsten Willems, Thorsten Holz, and Felix C Freiling. Toward automated dynamic malware analysis using CWSandbox. IEEE Symposium on Security and Privacy (SP), 5(2):32–39, 2007.
  • [37] Michelle Y Wong and David Lie. Intellidroid: A targeted input generator for the dynamic analysis of android malware. In Symposium on Network and Distributed System Security (NDSS), 2016.
  • [38] Wen-Chieh Wu and Shih-Hao Hung. DroidDolphin: a dynamic android malware detection framework using big data and machine learning. In Conference on Research in Adaptive and Convergent Systems (RACS). ACM, 2014.
  • [39] Yajin Zhou and Xuxian Jiang. Dissecting android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy (SP), 2012.