In recent years, smartphone usage has grown explosively since global smartphone sales surpassed the sales of basic mobile phones (or feature phones) in early 2013 (Newshub, 2013). Smartphones are now used throughout daily life, e.g., for online shopping, mobile games, online banking, personal health care, and even as remote controllers. According to a survey on global mobile OS market shares (statista, 2017), Android is the visibly dominant mobile operating system with a market share as of the second quarter of 2017. Android now powers not only smartphones but also tablets, TVs, wearable devices and even IoT devices through Android Things. The rapid growth of smartphone usage and the large market share of the Android OS have brought not only opportunities for mobile application development, but also the challenge of defending devices against Android-targeting malware. According to Kaspersky’s Mobile Malware Evolution 2016 Report (Unuchek, 2017), the number of malicious installation packages amounted to in 2016, almost three times more than that in 2015. In addition, the distribution of malware through Google Play and other online app stores is growing.
To win the battle and protect mobile phone users, a number of anti-virus companies (e.g., Norton, McAfee, Symantec, Kingsoft) provide software products as a major defence against these kinds of threats. These products typically use a signature-based method to recognize threats. For example, (Venugopal and Hu, 2008) proposed a signature-based malware detection method that is well suited for mobile devices. In this method, a unique signature is generated for each known type of malware, and detection involves searching for a signature match for a given application. However, this can be easily evaded by attackers, for example by changing signatures through code obfuscation or repackaging. Given the increasing number of malware installations and the rapidly changing malware patterns, more advanced detection methods are needed to protect end users. To address this issue, the heuristic-based method was introduced in the late 1990s. This method is based on explicit expert rules that distinguish malware. However, these rules are prone to expert bias and struggle to keep pace with the speed of malware evolution.
In recent years, there has been an increasing trend of developing automatic and intelligent malware detection methods that use machine learning to overcome the challenges mentioned above. These techniques are capable of discovering patterns that reveal previously unseen malware samples and of identifying the malware families of malicious samples. These systems can be classified into two categories: dynamic analysis and static analysis. Dynamic analysis (Wu and Hung, 2014; Enck et al., 2014; Tam et al., 2015) involves accumulating information regarding API calls, environmental variables and data transmission during the execution of an application. For example, DroidDolphin (Wu and Hung, 2014) uses DroidBox and APE to record 13 activity features. Another example is CopperDroid (Tam et al., 2015), a dynamic analysis system based on Virtual Machine Introspection (VMI) that extracts operating system interactions and process communication as features, considering both intra-process and inter-process communications. Dynamic analysis is known to give precise predictions with low false positive rates. However, one drawback of dynamic analysis is that it can only detect malicious behavior if the corresponding code is executed. Code coverage is therefore important: if the execution does not cover all the paths that may trigger malicious behavior, the analysis will result in a high false negative rate. Furthermore, dynamic analysis may take a long time to run if paths that are unrelated to malicious behavior are also executed.
Static analysis, on the other hand, has two phases: feature extraction and classification. In the first phase, various features such as API calls and binary strings are extracted from the original file sample. In the second phase, machine learning is used to automatically categorize the file sample into one of several classes based on a vectorized feature representation of the file. Machine-learning-based malware detection methods can differ in both phases. For example, DroidMat (Wu et al., 2012) performs static analysis on the manifest file and source code of an Android app to extract multiple features, including permissions, hardware resources and API calls. It then uses $k$-means clustering and $k$-NN classification to detect malware. DREBIN (Arp et al., 2014) also relies on static analysis and uses a support vector machine (SVM) for classification.
Some recent work has also tried to combine dynamic and static analysis (Wong and Lie, 2016; Fu et al., 2017; Jiang and Xuxian, 2013; Ge et al., 2011). For example, (Jiang and Xuxian, 2013; Ge et al., 2011) use dynamic analysis to prune the false positives from the static analysis. However, doing so may increase the number of false negatives if a particular path is not executed during the dynamic analysis phase (Wong and Lie, 2016). On the other hand, IntelliDroid (Wong and Lie, 2016) and LeakSemantic (Fu et al., 2017) use static analysis to guide dynamic analysis. IntelliDroid uses static analysis to extract execution paths that have a high chance of triggering malicious behavior, e.g., location leakage through networks, and then injects the extracted paths into the dynamic analysis stage, so that the application executes only those chosen paths. It then monitors the application’s behavior during execution. As a result, IntelliDroid can improve time efficiency without sacrificing code coverage. LeakSemantic (Fu et al., 2017) is similar to IntelliDroid but focuses only on monitoring sensitive network transmissions.
In this paper, we argue that a successful Android malware detection solution critically depends on an effective yet concise feature representation, as well as a classifier that is able to model the interactions among features for malicious behavior discovery. We present a carefully crafted feature extraction procedure that covers all critical features. After feature encoding, we will obtain a long and highly sparse vector representation for each app. In order to handle feature sparsity as well as to model feature interactions in an efficient and effective manner, we propose to apply factorization machine (FM) as the final classifier for malware detection. Specifically, the highlights of our contributions are summarized as follows:
Effective feature representation. We extract seven types of features from the Android manifest file and source code, including app components, hardware features, permissions, intent filters, restricted APIs, suspicious APIs and used permissions. Compared to existing works (e.g., DREBIN (Arp et al., 2014)), our extracted features are simpler yet still effective, and avoid redundant ones (e.g., URLs).
A factorization-machine-based model. Based on two observations of the feature representation of a given application, we propose a factorization-machine-based approach for malware detection. Unlike the classifiers used in previous work, e.g., SVM and Naive Bayes, a factorization machine can not only model the interactions across features, but also handle the sparsity of the vector representations. To the best of our knowledge, this is the first time that a factorization machine is applied to Android malware detection after a careful feature extraction procedure.
Boosting malware detection performance. We evaluated our model on two typical malware datasets: the DREBIN dataset (Arp et al., 2014), involving malware samples and the Android Malware Dataset (AMD) (Wei et al., 2017), involving malware samples. On the DREBIN dataset, we achieved a test result of detection rate with a false positives. For the AMD dataset, a detection rate of was achieved with a false positive rate. These results outperform all state-of-the-art methods for Android malware detection using machine learning as well as most of the existing Anti-Virus engines on VirusTotal.
Malware family identification. We also evaluated our model’s performance on malware family identification, which is an important task for malware attribution. For this task, our model is trained to identify a certain malware family among samples from other families as well as clean files. We achieved an average detection accuracy of with an average false positive rate of for malware families from the DREBIN dataset. On the AMD dataset, our method managed to achieve an average detection rate of with a false positive rate.
The remainder of this paper is organized as follows. We first introduce the background of the Android system and describe our malware detection system and feature sets in detail in Sec. 2. Details about the proposed machine learning models and the motivations to use factorization machine are given in Sec. 3. Experimental results regarding malware detection performance, running time evaluation, as well as malware family detection performance are presented in Sec. 4. Sec. 5 discusses the related works, while the limitations of our current system are discussed in Sec. 6. Finally, we conclude the paper in Sec. 7.
2. Background and System Overview
In this section we will first introduce some background information for the Android system and Android application files (apk files). In the second half we briefly introduce the architecture of our malware detection system.
Android applications are written in Java and executed within a custom Java Virtual Machine (JVM), and each application package is contained in a jar file with the extension of apk. Android applications consist of many components of differing types, which are the essential building blocks of the application. Each component has an entry point through which the system or a user can enter the application and applications interact via components. Therefore, it is critical to analyze the component APIs for security concerns. There are four fundamental building blocks of applications on the Android platform.
Activities serve as the entry point for a user’s interaction with an app, and are also central to how a user navigates within an app or between apps.
Services are components that can perform long-running operations in the background without providing a user interface. A service can be started by other application components and will continue to run in the background even if the user switches to another application. In addition, components can be bound to services to interact with them, and even perform inter-process communication (IPC). For example, services can handle network transactions, play music, perform file I/O, or interact with content providers, all of which can occur in the background.
Broadcast receivers: Android apps can send and receive broadcast messages from the Android system and other Android apps. These broadcasts are sent when an event of interest occurs. For example, the Android system sends broadcasts when various system events occur, such as when the system boots up or the device starts charging. Apps can also send custom broadcasts, for example, to notify other apps of something that they might be interested in, such as the completion of a download.
Apps can subscribe to receive specific broadcasts. When a broadcast is sent, the system automatically routes it to the apps that have subscribed to receive that particular type of broadcast. Generally speaking, broadcasts can be used as a messaging system across apps and outside of the normal user flow.
Content providers are components that are used to manage access to structured data sets, encapsulate data and provide mechanisms for defining data security. A content provider is a standard interface for connecting data in one process to code running in another process.
All components must be declared in the application manifest file before they can be used. Communication between different components takes place through intents and intent filters. Intents are messaging objects that can be used to request actions from other application components. An intent filter is an expression declared in the application manifest file that specifies the types of intents that the component will receive.
2.2. System overview
Our malware detection system consists of four parts: Unpacking and Decompiling, Feature Extraction, Encoding, and Prediction, all shown in Fig 1. The rest of this section introduces each part in detail.
2.2.1. Unpacking and Decompiling
The original data we receive for each application is an Android apk file. Each apk file is actually a zipped file that consists of the application source code, resources, assets, and manifest file. The application source code is encoded as dex files (i.e., Dalvik Executable Files) that can be interpreted by the Dalvik VM. The manifest file consists of a number of declarations and specifications. Finally, other resources may contain images, HTML files, etc. Unfortunately, the dex files, as executable code, are hard to understand and therefore need to be converted into readable formats such as smali code or even Java code. Smali code is an intermediate form decompiled from the dex files. The takeaway is that, after unzipping the apk file, we still need to decompile the dex code before we can continue to feature extraction.
There are some popular tools available for decompiling dex code, such as APKTool (Connor Tumbleson, Ryszard Wiśniewski, 2018) or baksmali (Ben Gruver et al., 2018), which can unpack the apk file and decompile the dex files to smali code. In our system, we use aapt, which is an Android SDK tool, to extract the manifest file information into a readable text file, and use APKTool to convert the classes.dex file into smali code. After this step, we obtain readable source code and the manifest file AndroidManifest.txt for each Android app, from which representative features will be extracted.
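The unpacking step can be sketched roughly as follows. This is a minimal illustration rather than the system's actual pipeline: the helper names and the toy archive are hypothetical, and the two external commands (`aapt dump xmltree` and `apktool d`) are only assembled here, not executed.

```python
import io
import zipfile
from typing import List


def list_apk_entries(apk_bytes: bytes) -> List[str]:
    """Return the member names of an apk, which is an ordinary zip container."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.namelist()


def decompile_commands(apk_path: str, out_dir: str) -> List[List[str]]:
    """Build (but do not run) the external tool invocations for one apk."""
    return [
        # aapt prints the binary manifest as a readable tree on stdout.
        ["aapt", "dump", "xmltree", apk_path, "AndroidManifest.xml"],
        # apktool decodes resources and disassembles the dex code to smali.
        ["apktool", "d", apk_path, "-o", out_dir],
    ]


def make_toy_apk() -> bytes:
    """Create an in-memory archive laid out like a minimal apk (toy data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"<binary manifest placeholder>")
        apk.writestr("classes.dex", b"dex\n035\x00")
        apk.writestr("res/drawable/icon.png", b"")
    return buffer.getvalue()


entries = list_apk_entries(make_toy_apk())
dex_files = [name for name in entries if name.endswith(".dex")]
commands = decompile_commands("sample.apk", "sample_out")
```

In practice each command list would be passed to `subprocess.run`, with the aapt output redirected to a text file per app.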
2.2.2. Feature extraction
Feature engineering is the most important part of training a machine learning model, since the upper bound of the model’s performance depends directly on the features used. Based on a study of the Android system and of previous work, we decided to extract seven kinds of features from both the source code and the manifest file. From the manifest file we extract the following four types of features:
App components: An app contains several components of four types: activities, services, content providers and broadcast receivers. These components, declared in the manifest file, define different user interfaces and interfaces to the Android system. The names of these components are collected to help identify variants of well-known malware; for example, members of the DroidKungFu family share the names of particular services (Arp et al., 2014).
Hardware features: If an application wants to request access to the hardware components of the device, such as its camera, GPS or sensors, then those features must be declared in the manifest file. Requesting certain hardware components may have security implications. For example, requesting both GPS and network modules may be a sign of location leakage.
Permissions: The Android system uses a permission mechanism to protect the privacy of users. An app must request permission to access sensitive data (e.g., SMS), system features (e.g., camera) and restricted APIs. The permission system is one of the most important security mechanisms in Android. Many operations need specific permissions to be executed, and these permissions are granted by users upon installation. Malware tends to request a characteristic set of permissions. Similar ideas also apply to hardware resources.
Intent filter: Intent filters, declared within the component declarations in the manifest file, are important tools for inter-component and inter-application communication. Intent filters define a special entry point for a component as well as for the application, and can be used for eavesdropping on specific intents. Because malware tends to listen for a particular set of system events, intent filters can serve as hints of malicious behavior.
Furthermore, we also extract another three types of features from the decompiled application source code (e.g., smali code):
Restricted APIs: In the Android system, some special APIs related to sensitive data access are protected by permissions. If an app calls these APIs without requesting corresponding permissions, it may be a sign of root exploits.
Suspicious APIs: We should be aware of a special set of APIs that can lead to malicious behavior without requesting permissions. For example, cryptography functions in the Java library and some math functions need no permission to be used. However, these functions can be used by malware for code obfuscation. Thus, attention should be paid to the unusual usage of these functions. We will mark these types of functions as suspicious APIs, following in the footsteps of Drebin (Arp et al., 2014).
Used permissions: We first extract all API calls from the app source code, and use this to build a set of permissions that are actually used in the app by looking up a predefined dictionary that links an API to its required permission(s).
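The feature types above can be sketched as follows. This is a simplified illustration under several assumptions: the manifest is taken to be already decoded to plain XML (the system described here works from aapt text output instead), the `API_PERMISSIONS` dictionary is a tiny hypothetical stand-in for the predefined API-to-permission dictionary, and the feature prefixes, toy manifest and toy smali snippet are invented for the example. Hardware features, suspicious APIs and the permission cross-check for restricted APIs are simplified or omitted.

```python
import re
import xml.etree.ElementTree as ET
from typing import Set

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "service", "provider", "receiver")

# Hypothetical, tiny API-to-permission dictionary; a real one is far larger.
API_PERMISSIONS = {
    "Landroid/telephony/SmsManager;->sendTextMessage": "android.permission.SEND_SMS",
    "Landroid/telephony/TelephonyManager;->getDeviceId": "android.permission.READ_PHONE_STATE",
}

# Matches smali call instructions such as "invoke-virtual {v0}, Lpkg/Cls;->method(".
INVOKE_RE = re.compile(r"invoke-[\w/]+ \{[^}]*\}, (L[^;]+;->[^(\s]+)")


def manifest_features(manifest_xml: str) -> Set[str]:
    """Collect components, permissions, hardware features and intent actions."""
    root = ET.fromstring(manifest_xml.strip())
    features = set()
    for tag in COMPONENT_TAGS:
        for node in root.iter(tag):
            features.add("component::" + str(node.get(ANDROID_NS + "name")))
    for node in root.iter("uses-permission"):
        features.add("permission::" + str(node.get(ANDROID_NS + "name")))
    for node in root.iter("uses-feature"):
        features.add("hardware::" + str(node.get(ANDROID_NS + "name")))
    for node in root.iter("intent-filter"):
        for action in node.iter("action"):
            features.add("intent::" + str(action.get(ANDROID_NS + "name")))
    return features


def code_features(smali: str) -> Set[str]:
    """Collect permission-protected API calls and the permissions they imply."""
    features = set()
    for api in INVOKE_RE.findall(smali):
        permission = API_PERMISSIONS.get(api)
        if permission is not None:
            features.add("restricted_api::" + api)
            features.add("used_permission::" + permission)
    return features


TOY_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.toy">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-feature android:name="android.hardware.location.gps"/>
  <application>
    <service android:name=".SyncService"/>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>
"""

TOY_SMALI = (
    "invoke-virtual {v0, v1, v2, v3, v4, v5}, Landroid/telephony/SmsManager;->"
    "sendTextMessage(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;"
    "Landroid/app/PendingIntent;Landroid/app/PendingIntent;)V\n"
    "invoke-static {v1}, Ljava/lang/Math;->abs(I)I\n"
)

features = manifest_features(TOY_MANIFEST) | code_features(TOY_SMALI)
```

Prefixing each string with its feature type keeps, for instance, a requested permission and a used permission with the same name as two distinct features.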
As seen in the previous section, all of the features have string values, so we need to encode them before they can be fed into a classifier. Here we use an $n$-dimensional indicator to encode each application into a feature representation, where $n$ is the feature dimension. To be specific, suppose all the extracted features form a feature set of size $n$; then each app is represented as an $n$-dimensional indicator, where each dimension is either $0$ or $1$, indicating whether the corresponding feature appears in the app. Two things should be noted. First, the feature set size is often very large and grows as the dataset becomes larger. Second, the number of features extracted for each app is very small compared with the feature dimension, so we often get a large, highly sparse vector representation for each app. We will further discuss this in the next section.
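A minimal sketch of this indicator encoding is given below. The function names and the two toy apps are hypothetical; storing only the indices of the nonzero entries is one common way to keep such sparse vectors compact, not necessarily the representation used in the system.

```python
from typing import Dict, List, Set


def build_vocabulary(apps: List[Set[str]]) -> Dict[str, int]:
    """Map every feature string seen in the corpus to a column index."""
    all_features = sorted(set().union(*apps))
    return {feature: index for index, feature in enumerate(all_features)}


def encode_sparse(app: Set[str], vocabulary: Dict[str, int]) -> List[int]:
    """Return the sorted indices of the entries that equal 1 for this app."""
    return sorted(vocabulary[f] for f in app if f in vocabulary)


def to_dense(indices: List[int], dimension: int) -> List[int]:
    """Expand a sparse index list into the full 0/1 indicator vector."""
    active = set(indices)
    return [1 if i in active else 0 for i in range(dimension)]


# Two toy apps with three features each and five distinct features overall.
app_a = {"permission::A", "permission::B", "permission::C"}
app_b = {"permission::C", "permission::D", "permission::E"}

vocabulary = build_vocabulary([app_a, app_b])
sparse_a = encode_sparse(app_a, vocabulary)
sparse_b = encode_sparse(app_b, vocabulary)
dense_a = to_dense(sparse_a, len(vocabulary))
dense_b = to_dense(sparse_b, len(vocabulary))
```

Features that never appeared when the vocabulary was built are silently dropped in `encode_sparse`, which is how unseen feature strings in test samples would typically be handled.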
To show the effectiveness of our feature representation in distinguishing malware from clean files, we further apply the t-SNE algorithm (Maaten and Hinton, 2008) to the encoded samples from the DREBIN dataset () and the clean dataset () for visualization. The result is shown in Fig 2: the samples are clearly separated and grouped together according to their respective labels.
After encoding, we obtain a vector representation of each application, to which we can then apply machine learning algorithms for automatic malware detection. Several learning algorithms can be used for classification. For example, DREBIN (Arp et al., 2014) uses a support vector machine (SVM), and (Sahs and Khan, 2012) uses a one-class SVM with kernels as a general classifier. Deep neural networks are also widely used in malware detection (Yuan et al., 2014; Yuan et al., 2016). In this paper, instead of picking a general classifier and tuning its parameters to get a good prediction model, we first make observations on the vector representations of Android applications and then choose the factorization machine model, which fits our problem best. Note that the FM model can also be used for malware family identification. Details are presented in the next section.
3. Factorization machine for malware detection
At the core of our malware detection scheme is classification. Generally, a classification problem in machine learning is to infer a function $f(x)$ that predicts, for every possible input $x$, how strongly it belongs to a class, e.g., the malware class in this paper. To find such a function, we are given a set of samples, each of which has been marked as “malicious” or “benign”. This initial dataset is used to train the model.
After proper pre-processing has been performed on each Android application file, it is converted into a feature vector in accordance with Sec. 2. In this section, we introduce the modeling of malware detection based on the factorization machine (Rendle, 2010). This model has demonstrated high efficiency in learning high-order interaction representations between sparse features, with numerous applications in various fields.
3.1. Feature representation and first-order classifiers
We begin our modeling with feature representation in Android malware detection. Suppose we have two applications A and B, each of which requests three permissions, as illustrated in Fig. 3. As there are five unique permissions requested by A and B, we can create a vector $x \in \{0,1\}^5$ such that each entry represents exactly one permission, e.g., the first entry (the blue block) represents the permission SEND_MSG and the second entry represents the permission BIND_ADMIN. As a result, A and B can each be written as a five-dimensional binary vector with exactly three nonzero entries. It is straightforward to extend this idea to all kinds of extracted features discussed in Sec. 2. The formal name for this scheme in the literature is one-hot encoding.
Several popular models offer scalable and stable solutions for classification. One popular choice is the support vector machine (SVM), which attempts to find a hyperplane that separates malware samples from benign ones with a maximal margin. A maximal-margin solution usually performs well in many machine learning tasks, so SVM has been widely applied in many fields in addition to Android malware detection (Arp et al., 2014). More specifically, an SVM model attempts to find a linear function
$$f(x) = w^{\top} x + b = \sum_{i=1}^{n} w_i x_i + b,$$
where $w \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$ are trainable parameters chosen to maximize the margin. To predict the probability that a given $x$ is malware, we can further apply a sigmoid function to this value:
$$p(x) = \frac{1}{1 + e^{-f(x)}}.$$
Here we denote $p(x)$ (or $p$ for short) as the estimated probability of $x$ being malware. Given a set of $m$ samples, the optimal coefficients $w$ and $b$ can be obtained by solving the following optimization problem:
$$\min_{w,\,b}\ \sum_{j=1}^{m}\Big[-y_j \log p(x_j) - (1 - y_j)\log\big(1 - p(x_j)\big)\Big] + \|w\|^{2},$$
where $\|\cdot\|$ denotes the vector norm and $y_j$ is the label of the $j$-th sample: if the sample is malware then $y_j = 1$, otherwise $y_j = 0$.
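As a small illustration of such a first-order model, the sketch below scores a sparse binary sample with a linear function, maps the score to a probability with a sigmoid, and evaluates a regularized logistic loss. The exact objective is an assumption made for this example (cross-entropy plus a squared-norm penalty with a hypothetical weight `reg`), as are all names and toy values.

```python
import math
from typing import List, Sequence


def linear_score(active: Sequence[int], weights: Sequence[float], bias: float) -> float:
    """Score of a binary sample given the indices of its nonzero entries."""
    return bias + sum(weights[i] for i in active)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def regularized_log_loss(
    samples: List[Sequence[int]],
    labels: List[int],
    weights: Sequence[float],
    bias: float,
    reg: float = 1.0,
) -> float:
    """Cross-entropy over all samples plus reg * ||w||^2 (assumed objective)."""
    loss = reg * sum(w * w for w in weights)
    for active, y in zip(samples, labels):
        p = sigmoid(linear_score(active, weights, bias))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss


weights = [1.0, 5.0, -3.0]
score = linear_score([0, 2], weights, bias=0.5)  # 0.5 + 1.0 - 3.0
probability = sigmoid(score)
baseline_loss = regularized_log_loss([[0], [1, 2]], [1, 0], [0.0, 0.0, 0.0], 0.0)
```

Because the input is binary and sparse, the score only touches the weights of the active features, which keeps prediction cheap even when the feature dimension is very large.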
However, these models are not well suited to Android malware detection, for two reasons. Firstly, the feature vectors from one-hot encoding are highly sparse. For example, samples in the benchmark dataset DREBIN (Arp et al., 2014) will be encoded into vectors with entries, in which only nonzero elements are found on average. Secondly, these models only exploit first-order features and do not take interactions among entries into account. To make matters worse, these problems are amplified in Android malware detection, because the high sparsity of the features implies that each feature vector provides little information for classification if a model only exploits the information from nonzero values.
3.2. Second-order feature crossing and factorization machine
To overcome these issues, we incorporate feature interactions. Let us take some toy examples to see how the relationship between two features can facilitate prediction. If an application requests the GPS hardware feature as well as the network modules permission, it may attempt to send geo-location information to a command & control server, and is therefore more likely to behave maliciously. As another example, some malware samples such as BaseBridge collect personal or device information and send it elsewhere via SMS messages. They request two permissions, READ_PHONE_STATE and SEND_SMS, a combination that is rarely seen in benign samples.
A natural method for learning interactions between different features is basis expansion, or feature crossing:
$$f(x) = b + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j.$$
Assigning a weight $w_{ij}$ to each pair of $x_i$ and $x_j$ is the easiest way to capture pairwise interactions. However, it is not efficient here due to the large number of parameters: this model has $O(n^2)$ free parameters. In the DREBIN dataset, for example, the input vector has a length of but the number of nonzero entries is about on average. In this case, full feature crossing as in the equation above would necessitate four billion weights. This places a heavy burden on the training process, since the model becomes too complex and requires a large amount of training data, and training becomes very time-consuming. In addition, the situation is much worse for sparse data, as in Android malware detection, because each sample only activates an extremely small portion of the entries in the weight matrix $W = (w_{ij})$ when using popular algorithms like stochastic gradient descent (SGD).
Popular techniques to overcome these issues are low-rank or dimension reduction methods, such as the factorization machine (FM) (Rendle, 2010). More specifically, FM assumes that $W$ has rank at most $k$, so that it can be decomposed as $W = VV^{\top}$ with $V \in \mathbb{R}^{n \times k}$. If we denote $v_i$ as the $i$-th row of $V$, FM trains a hidden vector $v_i$ for each $x_i$ and models the pairwise interaction weight $w_{ij}$ as the inner product of the hidden vectors of entries $i$ and $j$:
$$f(x) = b + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j,$$
where $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of length $k$:
$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{if}\, v_{jf}.$$
In practice, the hyperparameter $k$ is much smaller than the feature dimension ($k \ll n$). Thus, the number of parameters to be estimated reduces from $O(n^2)$ to $O(nk)$.
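To make the factorized interaction term concrete, the sketch below evaluates a second-order FM score for a binary sample in two ways: directly over all pairs of active features, and with the linear-time reformulation from (Rendle, 2010). The sizes, random parameters and function names are hypothetical and chosen only for illustration; no training is shown.

```python
import random
from itertools import combinations
from typing import List, Sequence


def fm_score_naive(active: Sequence[int], bias: float,
                   weights: Sequence[float], factors: List[List[float]]) -> float:
    """Sum <v_i, v_j> over every pair of active (nonzero) features."""
    score = bias + sum(weights[i] for i in active)
    for i, j in combinations(active, 2):
        score += sum(a * b for a, b in zip(factors[i], factors[j]))
    return score


def fm_score_fast(active: Sequence[int], bias: float,
                  weights: Sequence[float], factors: List[List[float]]) -> float:
    """Same score via 0.5 * sum_f [(sum_i v_if)^2 - sum_i v_if^2]."""
    score = bias + sum(weights[i] for i in active)
    for f in range(len(factors[0])):
        total = sum(factors[i][f] for i in active)
        squares = sum(factors[i][f] ** 2 for i in active)
        score += 0.5 * (total * total - squares)
    return score


rng = random.Random(0)
n, k = 50, 4  # toy feature dimension and latent dimension
weights = [rng.uniform(-1, 1) for _ in range(n)]
factors = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
active = [3, 17, 42]  # distinct indices of the nonzero entries of one sample

naive = fm_score_naive(active, 0.1, weights, factors)
fast = fm_score_fast(active, 0.1, weights, factors)

# Parameter counts for the interaction part: full crossing versus FM.
pairwise_params = n * (n - 1) // 2
fm_params = n * k
```

The fast form costs time proportional to $k$ times the number of active features, which is what makes FM practical for long, sparse indicator vectors.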
We can further improve the performance of FM by using more sophisticated feature engineering schemes for the cross terms. For example, a “partial FM” only involves interactions between selected feature groups, e.g., between Used permissions and Permissions, and thus ignores crossed terms that are not relevant to malicious behavior discovery.
In this section, we evaluate the performance of our factorization-machine-based Android malware detection system. We apply our system to malware detection tasks and malware family identification tasks, based on two public benchmark datasets: DREBIN (Arp et al., 2014) and AMD (Wei et al., 2017). In addition to detection performance evaluation, we further evaluate efficiency in terms of processing time and detection time for all tasks.
Table 1. Measures used for performance evaluation.
|Measure|Description|
|TP|# of malicious apps correctly detected|
|TN|# of benign apps correctly classified|
|FP|# of false prediction as malicious|
|FN|# of false prediction as clean|
4.1. Experiment Setup
We will start with a brief description for each malware dataset:
DREBIN: a dataset with malware files collected from August 2010 to October 2012. All malware samples are labeled with one of 179 malware families. This is one of the most popular benchmark datasets for Android malware detection.
AMD: the Android Malware Dataset contains samples that are categorized into 135 varieties among 71 malware families. The samples were collected from 2010 to 2016. It was one of the largest and newest datasets at the time this paper was prepared, and it reflects more recent trends in Android malware evolution.
Along with these malware datasets, we also collected a number of real-world Android applications from the Internet. Sources of these files include Apkpure (Apkpure.com, 2018) with 5400 samples, 700 samples from 360.com and 13K commercial applications from the HKUST Wake Lock Misuse Detection Project (Liu et al., 2016). In summary, we have collected real-world applications.
Although these Android applications are mostly collected from well-known Android markets and research projects, we still need to verify that they are clean. To do so, we uploaded all collected files to the VirusTotal service, a public anti-virus service with 78 popular engines, and inspected the scanning report for each file. Each engine in VirusTotal reports one of three detection results: True for “malicious”, False for “clean”, and NK for “not known”. If an application has more than one True result, we label it as malware; otherwise, we label it as clean. As a result, only out of K collected samples passed all scanners on the VirusTotal service, and we use only these samples in further experiments.
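The labeling rule can be sketched as below. The report format (a mapping from engine name to one of the strings "True", "False" or "NK") and the `max_positives` parameter are hypothetical simplifications introduced for this example; they are not the VirusTotal API. With the default of 1 the function follows the "more than one True result" rule, while `max_positives=0` corresponds to keeping only samples that pass every scanner.

```python
from typing import Dict


def count_positives(report: Dict[str, str]) -> int:
    """Number of engines that flagged the sample as malicious."""
    return sum(1 for verdict in report.values() if verdict == "True")


def label_sample(report: Dict[str, str], max_positives: int = 1) -> str:
    """Label a sample as malware if more than max_positives engines flag it."""
    return "malware" if count_positives(report) > max_positives else "clean"


# Toy reports with invented engine names.
report_clean = {"EngineA": "False", "EngineB": "NK", "EngineC": "False"}
report_single = {"EngineA": "True", "EngineB": "False", "EngineC": "NK"}
report_flagged = {"EngineA": "True", "EngineB": "True", "EngineC": "False"}

labels = [label_sample(r) for r in (report_clean, report_single, report_flagged)]
strict_labels = [
    label_sample(r, max_positives=0)
    for r in (report_clean, report_single, report_flagged)
]
```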
Details of these two datasets are shown in Table 3. When doing experiments on the AMD dataset, we evaluated on all these clean files. When evaluating on the DREBIN dataset, we randomly sampled clean files to match the number of malware samples in this dataset. To simplify our terminology, in the remainder of this section the DREBIN dataset (or the AMD dataset) refers to both the clean samples and the malware samples.
Here we compare the DREBIN and AMD datasets as used in the following experiments. Table 4 shows a detailed breakdown of these two datasets. As we can see in this table, the overall feature set size grows from to as the dataset size grows from to . Note that app components, intent filters and permissions are the three sets that grow the most. The former two are defined manually by the developer, so they tend to take many different values. For permissions, even though there is a fixed number of system permissions that a developer can request in the manifest file, developers can still declare self-defined permissions, e.g., com.zing.znews.permission.C2D_MESSAGE. So the permission set can also grow as the dataset grows. The remaining feature sets have a relatively fixed size because they are linked to the Android system APIs or smartphone hardware, which are limited in number.
4.1.2. Evaluation tasks
We evaluate the detection performance and run-time performance of our proposed system on two separate tasks, and compare it with several baseline algorithms as well as existing signature-based commercial anti-virus engines that are available in the VirusTotal service. Specifically, we focus on the following three aspects:
Malware detection: in these experiments, we compare our trained factorization-machine-based system with some baseline machine-learning-based detection algorithms. In addition, we also send all samples, including malware samples, to the VirusTotal service to compare with commercial anti-virus engines.
Malware family identification: in these tasks, each sample is sent to our system, which responds whether the input sample belongs to a specific malware family. Here we regard clean files as a special family named “clean”.
Run-time: to further evaluate the efficiency of our proposed system, we analyze the run-time in terms of processing time and detection time. The processing time covers everything from the beginning up to and including feature encoding, while the detection time covers the classification phase.
For performance evaluation tasks, we evaluate the detection performance and family identification performance using the measures shown in Table 1, and we focus on the following four metrics: precision, recall, F1 and False Positive Rate (FPR). Note that in the literature, recall and false positive rate correspond to malware detection rate and false alarm rate for the detection system.
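For reference, the four metrics follow from the confusion counts in the usual way. The sketch below uses the standard definitions; the counts in the usage example are invented for illustration and are not results from this paper.

```python
from typing import Dict


def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall (detection rate), F1 and false positive rate."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Invented counts: 100 malware samples and 100 clean samples.
metrics = detection_metrics(tp=90, tn=95, fp=5, fn=10)
```

Recall is computed over malware samples only and FPR over clean samples only, so the two can be traded against each other by moving the decision threshold.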
Moreover, the dataset is split into training () and testing () sets in both experiments. All models are trained with 4-fold cross validation for hyper parameter tuning and then tested on the testing set for performance evaluation. We repeated this procedure 5 times and then averaged the results. The baseline algorithm and our proposed method are trained and tested in the same manner.
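The evaluation protocol can be sketched as follows. The `test_fraction` value, the seeds and the sample count are hypothetical placeholders, and the model fitting and scoring steps are left out; only the index bookkeeping for the repeated split and the 4-fold cross validation is shown.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split(n_samples: int, test_fraction: float,
                     seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices: List[int], k: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds) if f != held_out
                    for idx in fold]
        yield training, validation


runs = []
for repeat in range(5):  # the whole procedure is repeated five times
    train_idx, test_idx = train_test_split(20, test_fraction=0.25, seed=repeat)
    folds = list(k_fold(train_idx, k=4))
    # Hyperparameters would be tuned on the folds, then the tuned model
    # would be scored once on test_idx and the scores averaged over runs.
    runs.append((train_idx, test_idx, folds))
```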
|Dataset|# samples|# families|
|Dataset|# malware|# clean files|Total|
4.2. Detection Performance
In this subsection, we conduct two sets of experiments to show the performance of our proposed FM-based malware detection model as well as that of the baseline algorithms. Comparisons with existing anti-virus engines are also included.
4.2.1. Comparison with baseline Algorithms
We first evaluated our proposed FM-based method and compared it with existing baseline algorithms, including SVM, which is used in DREBIN (Arp et al., 2014), classical machine learning algorithms such as Naive Bayes (Wu et al., 2012), and neural networks, e.g., the multi-layer perceptron (Ruck et al., 1990).
Table 5 shows the test results of the different algorithms on the DREBIN set. As we can see, FM achieves the best performance for precision with a score of and false positive rate when the threshold is set to . The multilayer perceptron (MLP) classifier gives the same precision and false positive rate scores, but it gives a better recall and the best $F_1$ score. The SVM algorithm gives a recall score or detection rate of with a false positive rate close to the result given in DREBIN (Arp et al., 2014), which is and respectively. However, it is still not comparable with the results given by FM and MLP. The three Naive Bayes variants, Gaussian, Multinomial and Bernoulli, all have very high recall scores, but at the cost of high false positive rates and low precision, resulting in poor overall performance and low $F_1$ scores.
Here, the algorithm name “NB-Gaussian” refers to the naive Bayes classifier using a Gaussian kernel, and likewise for “NB-Bernoulli” and “NB-Multinomial”. The algorithm “MLP” refers to the multilayer perceptron.
ROC curves on the DREBIN test set are also shown in Fig 6. FM and MLP give the best performance with an area under the curve (AUC) score of under the accuracy of . Naive Bayes with multinomial and Bernoulli kernels follow with AUC values of and , respectively. Naive Bayes with Gaussian kernel and SVM are the worst, with AUC scores of and . For these two curves, the true positive rate grows slowly as the false positive rate grows. That is to say, a high true positive rate comes at the cost of a high false positive rate. Notice that in Table 5 SVM is better than Naive Bayes with Bernoulli kernel, with a similar score and a much better false positive rate. With the ROC curve, however, we can now see that, by adjusting its threshold, Naive Bayes with Bernoulli kernel always outperforms SVM under the same false positive rate limit.
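To make the threshold argument concrete, an ROC curve is obtained by sweeping the decision threshold over the classifier's output scores and recording the (FPR, TPR) pair at each step. The sketch below is a generic illustration with hypothetical scores and labels, not the code or data behind Fig 6.

```python
def roc_curve_points(scores, labels):
    """Return ROC points (FPR, TPR) obtained by sweeping the threshold.

    labels are 1 (malware) or 0 (clean); both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by decreasing score: lowering the threshold admits samples in this order.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier outputs, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_curve_points(scores, labels)))
```

A single row of a results table corresponds to one point on such a curve, which is why two classifiers can rank differently in a table and in the ROC plot.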
In summary, on the DREBIN data set FM and MLP achieve the best performance, with MLP showing a slightly better overall performance owing to its higher F1 score. We will further discuss this later.
The same experimental procedure is then conducted on the larger AMD data set. The result is shown in Table 6, from which we can see that on the large data set FM gives the best performance under all metrics with a score and FPR of and . MLP follows with a and score and FPR. Compared with the results on the DREBIN dataset, where MLP is slightly better than FM with a higher score, FM’s advantage in dealing with highly sparse vectors becomes more obvious as the feature space size or the sparsity of the vector representation grows, leading to better performance than the other algorithms, including MLP. This trend suggests that our FM method may outperform the other algorithms by a larger margin on an even larger data set. The SVM method has results similar to those on the DREBIN set, with a FPR of and score of . The same holds for Naive Bayes with the three different kernels: they still give high recall scores at the cost of high false positive rates.
ROC curves on the AMD set are shown in Fig 7, and are similar to the result on the DREBIN set in Fig 6. FM and MLP have the best performance with a AUC value, and Naive Bayes with Multinomial and Bernoulli kernels follow with AUC values of and respectively. Then come SVM with and Naive Bayes with Gaussian kernel . One can also notice that the ROC curves of FM, MLP, and Naive Bayes with Bernoulli and Multinomial kernels are always above the ROC curve of SVM, which is the chosen classifier in DREBIN (Arp et al., 2014) and in (Dimjašević et al., 2016; Peiravian and Zhu, 2013).
In this set of experiments, we compared the performance of our FM based malware detection method with other baseline algorithms used in previous works on two datasets, DREBIN and AMD. We can see the efficiency and advantage of using the FM algorithm for malware detection, especially when the dataset is large. Our method on the DREBIN dataset reported a detection rate of with a false positive rate. It significantly outperforms DREBIN (Arp et al., 2014), which has a detection rate of and false positive rate of .
The experimental results also support our observation that interaction terms are important for revealing malicious behavior patterns. SVM and Naive Bayes directly use the vector representation to learn the classifiers. FM, on the other hand, achieved a much better performance by adding interaction terms. MLP, as a universal approximator (Hornik et al., 1989), can also produce excellent results, but more data is needed to train the model, especially when the input vector is highly sparse.
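The role of the interaction terms can be illustrated with the second-order factorization machine model of Rendle (2010), in which the weight of each feature pair (i, j) is the inner product of two latent vectors. The sketch below is a plain-Python illustration rather than our implementation; the function name and all parameter values are hypothetical.

```python
def fm_score(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: list of n feature values; w0: bias; w: n linear weights;
    V: n x k latent factors, so the weight of pair (i, j) is <V[i], V[j]>.
    """
    k = len(V[0])
    active = [i for i, xi in enumerate(x) if xi != 0]  # exploit sparsity
    linear = sum(w[i] * x[i] for i in active)
    pairwise = 0.0
    for f in range(k):
        # Identity from Rendle (2010): the sum over all pairs i < j equals
        # 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2), per factor f.
        s = sum(V[i][f] * x[i] for i in active)
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in active)
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical one-hot style input with three active features out of four.
x = [1, 0, 1, 1]
V = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]
print(fm_score(x, w0=0.1, w=[0.2, -0.3, 0.5, 0.0], V=V))
```

Dropping the `pairwise` term reduces the score to a linear model of the kind a linear SVM learns on the same vector. The cost grows with the number of active features rather than with the full feature space size, which suits highly sparse inputs.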
4.2.2. Comparison with AV engines
We also compared the performance of our malware detection algorithm with existing Anti-Virus engines on VirusTotal (Vir, [n. d.]). A critical point is that all of the “truly” clean files used in our experiments are actually labeled by these AV engines using the rule described in subsection 4.1. The AV engines are therefore expected to show a better false positive rate here than in normal operation. In addition, even though we obtained the scan results of all AV engines from VirusTotal, here we list only those with the best performance or those that are already popular and widely used in security programs, such as Kaspersky, Cylance and McAfee.
Table 7 lists part of the AV engines’ scanning results on the testing split of the DREBIN set. Compared with these results, our FM method outperforms most of the AV engines, with a precision of and FPR. For F1 score, our method outperforms out of AV engines, which is remarkable. The engines that have a better recall or F1 score than our method often have either much worse precision or FPR, e.g. , and . Only a few AV engines have comparable overall results, e.g. , .
Table 8 shows the scan results on the test part of the AMD dataset. This time, our FM model outperforms all of the AV engines with a score of . For FPR, our model still outperforms out of AV engines, which is also remarkable because all AV engines are supposed to have at least a good FPR. We also found that on this data set most of the AV engines (including those not shown here) give a worse recall score than on the DREBIN set, but still have a good FPR. This is reasonable. On one hand, those AV engines all use signature-based malware detection, so they cannot flag a piece of malware unless it has been seen before and recorded in the AV’s database; that is, they cannot detect newly emerged malware without a database update. On the other hand, as mentioned before, all malware in the DREBIN dataset was collected before 2014, whereas the AMD set was released last year and contains a fresher stock of malware. The detection rate of some AV engines could therefore fall if their databases did not include the most recent malware. As for FPR, a signature-based AV does not have to keep records of clean files: if no match is found in the database for a given signature, the scanner does not report the file as malicious. That is why most of the AV engines have a worse detection performance on the AMD set than on the DREBIN set while maintaining good false positive rates.
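The asymmetry between detection rate and FPR for signature-based scanners can be seen in a deliberately simplified sketch. Here a signature is assumed to be a SHA-256 hash of the whole file, which is far cruder than what commercial engines use; the class and sample contents are hypothetical.

```python
import hashlib


def signature(data: bytes) -> str:
    """Simplified signature: SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


class SignatureScanner:
    """Flags a file only if its signature is already in the database."""

    def __init__(self, known_malware):
        self.db = {signature(sample) for sample in known_malware}

    def is_malicious(self, data: bytes) -> bool:
        return signature(data) in self.db


scanner = SignatureScanner(known_malware=[b"old-malware"])
print(scanner.is_malicious(b"old-malware"))  # known sample: detected
print(scanner.is_malicious(b"new-malware"))  # unseen sample: missed
print(scanner.is_malicious(b"clean-app"))    # clean file: never flagged
```

Under this model, unseen malware lowers recall until the database is updated, while clean files cannot be flagged unless their signature was wrongly added, so the FPR stays low.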
4.3. Malware Family Detection
Another important task for Android malware detection is malware family classification. To evaluate our model on this task, we built two more data sets. Details about those two data sets are shown in the first column of Table 9 and Table 10 respectively. All samples from the largest malware families in the DREBIN set and another clean files form the first data set. The second dataset contains all samples from the largest malware families in the AMD dataset and also clean applications. To show our model’s capacity to distinguish one malware family from other families as well as from clean files, each time we label all the samples from one family as “True” and samples from all the other families as “False”, then shuffle and randomly split the dataset into a training set (), used to train a new model, and a test set (), used to evaluate the model. Notice that if the “clean” family is labeled as “True”, this is actually a malware detection task.
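The one-family-versus-rest labeling described above can be sketched as follows. The function name is hypothetical, and the training fraction is left as a parameter because the split ratio is not restated here; the toy samples and the family name “OtherFamily” are placeholders.

```python
import random


def one_vs_rest_split(samples, target_family, train_fraction, seed=0):
    """Relabel samples for one target family, shuffle, and split.

    samples: list of (features, family_name) pairs, where clean files
    carry the family name "clean". Returns (train, test) lists of
    (features, is_target) pairs.
    """
    labeled = [(x, family == target_family) for x, family in samples]
    random.Random(seed).shuffle(labeled)
    cut = int(len(labeled) * train_fraction)
    return labeled[:cut], labeled[cut:]


# Toy data, for illustration only.
samples = [("a", "Plankton"), ("b", "clean"),
           ("c", "OtherFamily"), ("d", "Plankton")]
train, test = one_vs_rest_split(samples, "Plankton", train_fraction=0.5)
print(train, test)
```

Calling the same function with `target_family="clean"` yields the binary malware detection task with the label polarity reversed.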
Table 9 gives the experimental results of malware family detection on the DREBIN set. As shown, our model achieves a weighted average detection accuracy of with an average false positive rate of . In particular, all families show a recall score above , precision score above , F1 score above and false-positive rate below . For family Plankton our model produces a perfect result with an F1 score of and FPR of . This is much better than the results presented in DREBIN (Arp et al., 2014), where of the largest malware families are involved without clean samples: it obtained an average detection rate of , a false-positive rate of and detected all families at rates above .
The results on the AMD set are shown in Table 10. On this dataset, our method achieved an average detection rate of with a false positive rate. All families show a recall score above , precision above , F1 score above and false positive rate below . The overall performance on the AMD set is slightly better than on the DREBIN set. This may be because the number of samples per family in the AMD set is much larger than in the DREBIN set.
In summary, the FM method achieves great performance for malware family classification and can be used to predict the family of a piece of malware if enough training samples are provided.
4.4. Processing Time Evaluation
In this section, we evaluate the processing time efficiency of our malware detection model. As shown in previous sections, our system consists of four parts. Normally, the last two phases, encoding and classification, take much less time than feature extraction and decompiling. Moreover, these two phases take a roughly fixed processing time across applications because the feature space size is fixed. Therefore, we focus on evaluating the processing time for unpacking, decompiling and feature extraction, and then report an average processing time over all applications for the encoding and prediction phases.
The evaluation was done on a virtual machine hosted on ESXi. The VM runs Ubuntu 16.04 with a memory of G and 2 CPUs. We randomly sampled AMD samples, clean files and all the DREBIN samples for this experiment. The results are shown in Fig 11: the three figures in the first row show the relation between dex source code size and processing time, and the figures in the second row show the relation between apk file size and processing time. We can tell from the figure that the processing time and dex code size have an almost linear relation, and for samples in those three datasets the slopes are approximately similar to . There is no fixed relation between apk file size and processing time, because apart from the dex code and manifest file an apk also contains other resource files such as HTML and images. In some applications, for example games, these files may take a lot of space, while in others they do not. The histograms of processing time and dex code size of all samples are shown in Fig 9 and Fig 10 respectively. We can see that over of samples have a dex code size of less than and over samples have a processing time of less than seconds. On the same samples we also measured the mean time for encoding and prediction: the former took our system an average of ms and the latter ms.
Compared with DREBIN (Arp et al., 2014), it may seem that our system does not have much of an advantage in processing time. However, this is not the case. To begin with, the test was done on a system that is not fully integrated: the output of Smalisca is first written into a JSON file and then reloaded into RAM for further processing, and the I/O between RAM and flash storage often takes a long time. Secondly, the feature sets used in our system are simpler and smaller than those used in DREBIN, so under the same conditions our system should take less processing time than DREBIN.
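A slope of this kind can be estimated with an ordinary least-squares line fit of processing time against dex code size. The sketch below shows the calculation on made-up measurements; the function name, the units and the values are hypothetical and unrelated to the data in Fig 11.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (dex size in MB, processing time in s) pairs, for illustration only.
dex_sizes = [1.0, 2.0, 3.0, 4.0]
times = [2.5, 4.5, 6.5, 8.5]
slope, intercept = fit_line(dex_sizes, times)
print(slope, intercept)
```

The intercept captures the size-independent overhead such as unpacking and process start-up, while the slope gives the marginal cost per unit of dex code.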
5. Related Works
Static analysis of Android applications focuses on analyzing the internal components of an application. It is able to explore all possible execution paths in malware samples and has long been used for the detection of malicious behaviors and application vulnerabilities. This analysis is typically based on source code or binary analysis to search for malicious patterns.
Some works focus on the detection of specific malicious behaviors such as privacy breaches and over-privilege. For example, (Kim et al., 2012; Gibler et al., 2012) go through source code with predefined sources and sinks to find potential privacy breaches. (Fu et al., 2017) further examines all the URL addresses to see whether the app is trying to steal users’ private information. Stowaway (Felt et al., 2011) detects over-privilege in Android applications by comparing the maximum set of permissions needed by an app with the permissions it actually requests. (Fuchs et al., 2009) uses data flow analysis for security certification. However, static taint analysis and over-privilege detection are prone to false positives.
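The over-privilege comparison amounts to a set difference between requested and needed permissions. The sketch below illustrates the idea only and is not Stowaway's implementation; the API-to-permission map is a hypothetical stand-in for the much larger map such a tool would derive.

```python
# Hypothetical API-to-permission map, for illustration only.
API_PERMISSIONS = {
    "SmsManager.sendTextMessage": {"android.permission.SEND_SMS"},
    "URL.openConnection": {"android.permission.INTERNET"},
}


def needed_permissions(api_calls):
    """Union of permissions required by the API calls found in the code."""
    needed = set()
    for call in api_calls:
        needed |= API_PERMISSIONS.get(call, set())
    return needed


def overprivileged(requested, api_calls):
    """Permissions requested in the manifest but not needed by the code."""
    return sorted(set(requested) - needed_permissions(api_calls))


requested = ["android.permission.INTERNET",
             "android.permission.READ_CONTACTS",
             "android.permission.SEND_SMS"]
extra = overprivileged(requested, ["URL.openConnection"])
print(extra)
```

An incomplete map makes needed permissions look unnecessary, which is one way such an analysis produces false positives.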
Other works (Arp et al., 2014; Wu et al., 2012; Hou et al., 2017; Dimjašević et al., 2016; Peiravian and Zhu, 2013) try to directly classify an application as malicious or benign through analysis of the permissions requested at installation (e.g. (Aafer et al., 2013; Felt et al., 2012; Gorla et al., 2014)), control flow (e.g. (Liang et al., 2013; Liang et al., 2014)), or signature-based detection (e.g. (Feng et al., 2014; Grace et al., 2012)). These works take different approaches in both the feature extraction and the classification phase. Peiravian and Zhu (Peiravian and Zhu, 2013) used permissions and API calls as features and SVM, decision trees and ensembles as classifiers. (Hou et al., 2017) built a structured heterogeneous information network (HIN) with Android applications and related system APIs as nodes and their rich relationships as links, and then used meta-paths for malware detection. DREBIN (Arp et al., 2014) extracted features from manifest files and source code, including permissions, hardware, system API calls and even all the URLs, and then used SVM as the final classifier for malware detection. In contrast to existing works, we do not focus only on feature engineering while ignoring the importance of choosing a suitable algorithm. After acquiring the feature representations of apps, we first make two observations. According to these observations, we then choose the machine learning algorithm best suited to the problem for malware detection.
Many recent works try to find malicious behavior patterns through control flow graphs or call graphs. AppContext (Yang et al., 2015) classifies applications using machine learning based on the contexts that trigger security-sensitive behaviors. It builds a call graph from an application’s source code and extracts the context factors through information flow analysis. It then obtains the features for the machine learning algorithms from the extracted context. In that work, 633 benign applications from the Google Play store and 202 malicious samples were analyzed. AppContext correctly identifies 192 of the malware applications with an accuracy. Gascon et al. (Gascon et al., 2013) also utilized call graphs to detect malware. After extraction of call graphs from Android applications, a linear-time graph kernel is applied in order to map call graphs to features. These features are given as input to SVMs to distinguish between benign and malicious applications. They conducted experiments on 135,792 benign and 12,158 malware applications, detecting of the malware with of false positives. This kind of method relies heavily on the accuracy of call graph extraction. However, current works like FlowDroid (Arzt et al., 2014) and IC3 (Octeau et al., 2015, 2016) cannot fully solve the construction of inter-component control flow graphs (ICFG), especially the inter-component links with intents and intent filters.
6. Limitations and future works
Our system takes advantage of machine learning to recognize malicious behavior patterns for malware detection. While machine learning techniques provide a powerful tool for automatically inferring models, they require a representative dataset for training. That is, the quality of the detection model depends on the availability and quality of both malware and benign applications. While it is straightforward to collect benign applications, gathering recent malware samples is not that easy and requires some technical effort. Fortunately, offline analysis methods, such as DroidRanger (Zhou et al., 2012) and RiskRanker (Grace et al., 2012), can help to acquire malware and provide the samples for updating and maintaining a representative dataset in order to continuously update our model.
Another limitation of our system is processing time. We plan to integrate our system into Wedge Networks’ in-line, real-time security solution, which only allows millisecond-scale processing time. As mentioned before, encoding and prediction take our system about ms; however, decompiling and feature extraction are on the order of seconds. Fortunately, there is still room to improve our system’s time efficiency, such as reducing I/O and finishing all work at once in main memory (RAM), or even using dedicated hardware such as FPGAs or application-specific integrated circuits (ASICs) for speed-up. In addition, we note that decompiling apk files can fail when using some existing tools. In our experiments, we observed such failures for some files, and we found that malware samples are more likely to fail in decompiling. This is expected, as malware samples may use additional techniques such as code obfuscation that can cause decompiling to fail, so the ability to decompile also limits the ability to detect Android malware.
Dynamic analysis, on its own, is insufficient to protect Android end-users. However, in some cases only static analysis is necessary, for example in real-time protection systems. In this paper, instead of focusing only on feature engineering, we further establish the importance of including interaction terms across features for the discovery of malicious behavior patterns. The features used to represent an application are app components, hardware features, permissions and intent filters from the manifest file, and restricted APIs, suspicious APIs and used permissions from the source code. Based on the extracted features, a highly sparse vector representation was constructed for each application using one-hot encoding. We then propose a factorization-machine-based malware detection system to handle the high sparsity of the vector representation and model interaction terms at the same time. To the best of our knowledge, this is the first work to use FM models for malware detection. A comprehensive experimental study on real malware sample collections, the DREBIN and AMD datasets, and clean applications collected from online app stores was performed to show the effectiveness of our system on malware detection and malware family identification tasks. Promising experimental results demonstrate that our method outperforms existing state-of-the-art Android malware detection techniques as well as commercial antivirus engines. Furthermore, we also evaluated the processing time efficiency of our model.
- Vir ([n. d.]) [n. d.]. VirusTotal. ([n. d.]). https://www.virustotal.com/#/home/upload [Online; accessed 9-May-2018].
- Aafer et al. (2013) Yousra Aafer, Wenliang Du, and Heng Yin. 2013. Droidapiminer: Mining api-level features for robust malware detection in android. In International conference on security and privacy in communication systems. Springer, 86–103.
- Apkpure.com (2018) Apkpure.com. 2018. apkpure. (2018). https://apkpure.com/ [Online; accessed 9-May-2018].
- Arp et al. (2014) Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket.. In Ndss, Vol. 14. 23–26.
- Arzt et al. (2014) Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. 2014. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. Acm Sigplan Notices 49, 6 (2014), 259–269.
- Ben Gruver et al. (2018) Ben Gruver et al. 2018. Baksmali. (2018). https://github.com/JesusFreke/smali [Online; accessed 9-May-2018].
- Connor Tumbleson, Ryszard Wiśniewski (2018) Connor Tumbleson, Ryszard Wiśniewski. 2018. APKtool. (2018). https://ibotpeaches.github.io/Apktool/ [Online; accessed 9-May-2018].
- Dimjašević et al. (2016) Marko Dimjašević, Simone Atzeni, Ivo Ugrina, and Zvonimir Rakamaric. 2016. Evaluation of android malware detection based on system calls. In Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. ACM, 1–8.
- Enck et al. (2014) William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N Sheth. 2014. TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS) 32, 2 (2014), 5.
- Felt et al. (2011) Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, and David Wagner. 2011. Android permissions demystified. In Proceedings of the 18th ACM conference on Computer and communications security. ACM, 627–638.
- Felt et al. (2012) Adrienne Porter Felt, Elizabeth Ha, Serge Egelman, Ariel Haney, Erika Chin, and David Wagner. 2012. Android permissions: User attention, comprehension, and behavior. In Proceedings of the eighth symposium on usable privacy and security. ACM, 3.
- Feng et al. (2014) Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken. 2014. Apposcopy: Semantics-based detection of android malware through static analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 576–587.
- Fu et al. (2017) Hao Fu, Zizhan Zheng, Somdutta Bose, Matt Bishop, and Prasant Mohapatra. 2017. Leaksemantic: Identifying abnormal sensitive network transmissions in mobile applications. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE. IEEE, 1–9.
- Fuchs et al. (2009) Adam P Fuchs, Avik Chaudhuri, and Jeffrey S Foster. 2009. Scandroid: Automated security certification of android. Technical Report.
- Gascon et al. (2013) Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. 2013. Structural detection of android malware using embedded call graphs. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security. ACM, 45–54.
- Ge et al. (2011) Xi Ge, Kunal Taneja, Tao Xie, and Nikolai Tillmann. 2011. DyTa: dynamic symbolic execution guided with static verification results. In Software Engineering (ICSE), 2011 33rd International Conference on. IEEE, 992–994.
- Gibler et al. (2012) Clint Gibler, Jonathan Crussell, Jeremy Erickson, and Hao Chen. 2012. AndroidLeaks: automatically detecting potential privacy leaks in android applications on a large scale. In International Conference on Trust and Trustworthy Computing. Springer, 291–307.
- Gorla et al. (2014) Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. 2014. Checking app behavior against app descriptions. In Proceedings of the 36th International Conference on Software Engineering. ACM, 1025–1035.
- Grace et al. (2012) Michael Grace, Yajin Zhou, Qiang Zhang, Shihong Zou, and Xuxian Jiang. 2012. Riskranker: scalable and accurate zero-day android malware detection. In Proceedings of the 10th international conference on Mobile systems, applications, and services. ACM, 281–294.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2, 5 (1989), 359–366.
- Hou et al. (2017) Shifu Hou, Yanfang Ye, Yangqiu Song, and Melih Abdulhayoglu. 2017. Hindroid: An intelligent android malware detection system based on structured heterogeneous information network. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1507–1515.
- Jiang and Xuxian (2013) Yajin Zhou Xuxian Jiang and Z Xuxian. 2013. Detecting passive content leaks and pollution in android applications. In Proceedings of the 20th Network and Distributed System Security Symposium (NDSS).
- Kim et al. (2012) Jinyung Kim, Yongho Yoon, Kwangkeun Yi, Junbum Shin, and SWRD Center. 2012. ScanDal: Static analyzer for detecting privacy leaks in android applications. MoST 12 (2012).
- Liang et al. (2013) Shuying Liang, Andrew W Keep, Matthew Might, Steven Lyde, Thomas Gilray, Petey Aldous, and David Van Horn. 2013. Sound and precise malware analysis for android via pushdown reachability and entry-point saturation. In Proceedings of the Third ACM workshop on Security and privacy in smartphones & mobile devices. ACM, 21–32.
- Liang et al. (2014) Shuying Liang, Weibin Sun, and Matthew Might. 2014. Fast flow analysis with godel hashes. In Source Code Analysis and Manipulation (SCAM), 2014 IEEE 14th International Working Conference on. IEEE, 225–234.
- Liu et al. (2016) Yepang Liu, Chang Xu, Shing-Chi Cheung, and Valerio Terragni. 2016. Understanding and detecting wake lock misuses for android applications. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 396–409.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
- Newshub (2013) Newshub. April, 2013. Smartphones now outsell ’dumb’ phones. (April, 2013). Retrieved April 10, 2018 from http://www.newshub.co.nz/technology/smartphones-now-outsell-dumb-phones-2013042912
- Octeau et al. (2016) Damien Octeau, Somesh Jha, Matthew Dering, Patrick McDaniel, Alexandre Bartel, Li Li, Jacques Klein, and Yves Le Traon. 2016. Combining static analysis with probabilistic models to enable market-scale android inter-component analysis. In ACM SIGPLAN Notices, Vol. 51. ACM, 469–484.
- Octeau et al. (2015) Damien Octeau, Daniel Luchaup, Matthew Dering, Somesh Jha, and Patrick McDaniel. 2015. Composite constant propagation: Application to android inter-component communication analysis. In Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 77–88.
- Peiravian and Zhu (2013) Naser Peiravian and Xingquan Zhu. 2013. Machine learning for android malware detection using permission and api calls. In Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on. IEEE, 300–305.
- Rendle (2010) Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000.
- Ruck et al. (1990) Dennis W Ruck, Steven K Rogers, Matthew Kabrisky, Mark E Oxley, and Bruce W Suter. 1990. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1, 4 (1990), 296–298.
- Sahs and Khan (2012) Justin Sahs and Latifur Khan. 2012. A machine learning approach to android malware detection. In Intelligence and security informatics conference (eisic), 2012 european. IEEE, 141–147.
- statista (2017) statista. 2017. Global mobile OS market share in sales to end users from 1st quarter 2009 to 2nd quarter 2017. (2017). Retrieved April 10, 2018 from https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems/
- Tam et al. (2015) Kimberly Tam, Salahuddin J Khan, Aristide Fattori, and Lorenzo Cavallaro. 2015. CopperDroid: Automatic Reconstruction of Android Malware Behaviors.. In NDSS.
- Unuchek (2017) Roman Unuchek. 2017. Mobile malware evolution 2016. (2017). Retrieved May 2, 2018 from https://securelist.com/mobile-malware-evolution-2016/77681/
- Venugopal and Hu (2008) Deepak Venugopal and Guoning Hu. 2008. Efficient signature based malware detection on mobile devices. Mobile Information Systems 4, 1 (2008), 33–49.
- Wei et al. (2017) Fengguo Wei, Yuping Li, Sankardas Roy, Xinming Ou, and Wu Zhou. 2017. Deep Ground Truth Analysis of Current Android Malware. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA’17). Springer, Bonn, Germany, 252–276.
- Wong and Lie (2016) Michelle Y Wong and David Lie. 2016. IntelliDroid: A Targeted Input Generator for the Dynamic Analysis of Android Malware.. In NDSS, Vol. 16. 21–24.
- Wu et al. (2012) Dong-Jie Wu, Ching-Hao Mao, Te-En Wei, Hahn-Ming Lee, and Kuo-Ping Wu. 2012. Droidmat: Android malware detection through manifest and api calls tracing. In Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on. IEEE, 62–69.
- Wu and Hung (2014) Wen-Chieh Wu and Shih-Hao Hung. 2014. DroidDolphin: a dynamic Android malware detection framework using big data and machine learning. In Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems. ACM, 247–252.
- Yang et al. (2015) Wei Yang, Xusheng Xiao, Benjamin Andow, Sihan Li, Tao Xie, and William Enck. 2015. Appcontext: Differentiating malicious and benign mobile app behaviors using context. In Software engineering (ICSE), 2015 IEEE/ACM 37th IEEE international conference on, Vol. 1. IEEE, 303–313.
- Yuan et al. (2014) Zhenlong Yuan, Yongqiang Lu, Zhaoguo Wang, and Yibo Xue. 2014. Droid-sec: deep learning in android malware detection. In ACM SIGCOMM Computer Communication Review, Vol. 44. ACM, 371–372.
- Yuan et al. (2016) Zhenlong Yuan, Yongqiang Lu, and Yibo Xue. 2016. Droiddetector: android malware characterization and detection using deep learning. Tsinghua Science and Technology 21, 1 (2016), 114–123.
- Zhou et al. (2012) Yajin Zhou, Zhi Wang, Wu Zhou, and Xuxian Jiang. 2012. Hey, you, get off of my market: detecting malicious apps in official and alternative android markets.. In NDSS, Vol. 25. 50–52.