Feature importance in mobile malware detection

07/27/2020 ∙ by Vasileios Kouliaridis, et al.

The topic of mobile malware detection on the Android platform has attracted significant attention over the last several years. However, while much research has been conducted toward mobile malware detection techniques, little attention has been devoted to feature selection and feature importance. That is, which app feature matters more when it comes to machine learning classification. After succinctly surveying all major, dated from 2012 to 2020, datasets used by state-of-the-art malware detection works in the literature, we analyse a critical mass of apps from the most contemporary and prevailing datasets, namely Drebin, VirusShare, and AndroZoo. Next, we rank the importance of app classification features pertaining to permissions and intents using the Information Gain algorithm for all the three above-mentioned datasets.

1 Abstract

The topic of mobile malware detection has received a lot of attention over the last years. However, while significant research has been conducted towards mobile malware detection techniques, little work has been focused on feature selection and feature importance. In this work we provide a short survey of all major datasets used by state-of-the-art malware detection works, dated from 2012 to 2020. Furthermore, we analyze applications from the most common datasets, namely Drebin, VirusShare and AndroZoo, to get permissions and intents. Lastly, we rank features using the Information Gain algorithm to compare the importance of permissions and intents, for all three datasets.

2 Introduction

Mobile devices are part of our everyday activity. From social networks to mobile banking transactions, mobile devices are trusted by millions of people. According to statcounter [1], Android is the most popular mobile OS, with a market share of more than 74%. Moreover, while the topic of mobile malware detection has received a lot of attention recently, the importance of each feature category in mobile malware classification is not clear. Current mobile malware detection approaches lean towards static anomaly-based detection [2]. Anomaly-based detection comprises two phases: the training phase and the detection (or testing) phase. It usually employs machine learning to detect malicious behavior, i.e., deviation from the behavior learned during the training phase. Lastly, static analysis techniques have recently become popular because static analysis does not require the app to be running; thus, it is usually faster and easier to implement.
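To illustrate why static analysis does not need a running app, the following minimal sketch reads permissions and intents directly from a manifest. It assumes the app's AndroidManifest.xml has already been decoded to plain XML (inside an APK it is stored in a binary form, so a decoding tool is needed first). The function name, the sample manifest, and the package name are hypothetical and are not taken from the tooling used in this work.

```python
import xml.etree.ElementTree as ET

# Namespace prefix of android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_static_features(manifest_xml: str) -> dict:
    """Return the requested permissions and the declared intent actions/categories."""
    root = ET.fromstring(manifest_xml)
    permissions = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    intents = {
        el.get(ANDROID_NS + "name")
        for intent_filter in root.iter("intent-filter")
        for el in intent_filter
        if el.tag in ("action", "category")
    }
    # Drop entries whose android:name attribute is missing.
    return {
        "permissions": sorted(permissions - {None}),
        "intents": sorted(intents - {None}),
    }


# Hypothetical, hand-written manifest used only for illustration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <application>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
        <category android:name="android.intent.category.HOME"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>
""".strip()

FEATURES = extract_static_features(SAMPLE_MANIFEST)
print(FEATURES)
```

Each app is thus reduced to two sets of strings, which is the kind of input a feature ranking step can work with.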

In this work we provide a short survey of all major datasets, as well as a detailed feature importance comparison between the three most common datasets, i.e., Drebin, VirusShare, and AndroZoo, used by recent malware detection works [5]. Precisely, by using the average coefficients of permissions and intents for each dataset we show the most important feature category for each dataset. Lastly, we report the top features for each dataset and discuss similarities between corpora.

The remainder of this paper is organized as follows. The next section discusses the related work. Section 4 details the different datasets used to evaluate mobile malware detection approaches. Section 5 provides our results on feature importance. The last section concludes and provides pointers to future work.

3 Related work

This section presents previous work on feature importance and feature selection, a topic that has not received much attention so far.

Feizollah et al. [3] categorized available features into four groups, namely, static features, dynamic features, hybrid features and application metadata. Furthermore, the authors compared these features with regard to the difficulty of extraction and their popularity in the relevant literature. Finally, the authors provided a survey of the available datasets. On the downside, the only available datasets at the time of that research were Contagio, MalGenome and Drebin.

Zhao et al. [4] proposed a feature selection algorithm called FrequenSel. According to the authors, FrequenSel selects features which are frequently used in malware and rarely used in benign apps, thus they can distinguish malware from benign apps. During their experiments, the authors evaluated their approach with 7972 apps, and their results reported an accuracy of up to 98%.
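The intuition behind frequency-based selection can be sketched as follows. This is only a rough illustration of the idea of keeping features that are common in malware and rare in benign apps; it is not the FrequenSel algorithm itself, and the function name, the `min_gap` threshold, and the toy data are assumptions made for this example.

```python
def frequency_gap_selection(malware_apps, benign_apps, min_gap=0.5):
    """Keep features whose share among malware exceeds their share among benign apps by min_gap.

    Each app is represented as a set of feature names.
    """
    def share(apps, feature):
        return sum(feature in app for app in apps) / len(apps)

    all_features = set().union(*malware_apps, *benign_apps)
    selected = []
    for feature in sorted(all_features):
        gap = share(malware_apps, feature) - share(benign_apps, feature)
        if gap >= min_gap:
            selected.append((feature, gap))
    # Largest gap first.
    return sorted(selected, key=lambda item: -item[1])


# Toy data for illustration only.
MALWARE = [
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS", "READ_SMS", "INTERNET"},
    {"SEND_SMS"},
    {"INTERNET", "READ_SMS"},
]
BENIGN = [
    {"INTERNET"},
    {"INTERNET", "WAKE_LOCK"},
    {"INTERNET"},
    {"WAKE_LOCK"},
]

SELECTED = frequency_gap_selection(MALWARE, BENIGN)
print(SELECTED)
```

In this toy example INTERNET is dropped because it is equally frequent in both classes, which matches the intuition that a feature used everywhere cannot separate malware from benign apps.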

Kouliaridis et al. [5] proposed an online tool called Androtomist, which performs hybrid analysis on Android apps. The authors focused on the importance of dynamic instrumentation, as well as the improvement of detection when hybrid analysis is used in contrast to static analysis. During their experiments, the authors compared feature importance between 3 datasets, namely Drebin, VirusShare and AndroZoo. However, the authors used a small subset of each dataset during their experiments.

To the best of our knowledge, none of the above works compare feature importance across multiple datasets with a large number of samples.

4 Datasets

In the context of mobile malware detection, several corpora have been analyzed by researchers to evaluate mobile malware detection approaches. This section provides a survey, in chronological order, of the major mobile malware datasets used in the literature. Moreover, Table 1 compares all datasets with regard to their size, access, and updates. As shown in Table 1, AndroZoo and VirusShare are the only datasets still being updated today.

  • Contagio [6] Contagio mini dump is a publicly available repository of mobile malware samples. The samples were collected in 2010 and the dataset contains over 189 malware samples.

  • MalGenome [7] In 2012, the MalGenome dataset was released, which contains 1260 malware samples categorized into 49 different malware families. The malware samples are dated from August 2010 to October 2011. Unfortunately, the MalGenome project stopped sharing its dataset in December 2015.

  • Drebin [8] Drebin comprises 5560 malware samples across 179 different families. The samples were collected between August 2010 and October 2012. Drebin is one of the most popular datasets and it is referenced in many works in the literature.

  • AMD [9] AMD is a publicly shared dataset which contains 24,553 samples, categorized in 135 varieties among 71 malware families. The samples are dated from 2010 to 2016.

  • VirusShare [10] VirusShare is an online repository of malware samples. Access to the site is granted via invitation only. The dataset contains not only mobile malware samples but also samples for various other platforms. Furthermore, it is updated regularly and contains samples from various years.

  • AndroZoo [11] AndroZoo is a growing collection of Android apps collected from several sources, including the official Google Play store [12]. The dataset is updated regularly and it currently contains over 12 million samples. Access to the dataset is granted by application.

Dataset     Last updated  Size         Access
Contagio    2010          189          Public
MalGenome   2011          1,260        Unavailable
Drebin      2012          5,560        Public
AMD         2016          24,553       Public
VirusShare  2020          Unknown      Invitation
AndroZoo    2020          12,498,250*  Application

Table 1: Outline of major datasets (* not all samples are malicious).

5 Feature importance

A key factor that affects the performance of malware detection methods is the importance of features contained in malware samples [5]. To this end, this section presents our results on feature importance of apps collected from the most common datasets. As already pointed out in Section 4, VirusShare and AndroZoo seem to be the only datasets still being updated today. Furthermore, the Drebin dataset has been used by most research works on the topic of mobile malware detection so far, thus making it ideal when comparing new detection methods with the previous state of the art. Precisely, we collected 1000 random malware samples from each of these three datasets, as well as 1000 random benign apps from Google Play, to create three balanced datasets of both malware and benign apps. The samples are dated from 2010 to 2012, 2014 to 2017, and 2017 to 2020 for the Drebin, VirusShare and AndroZoo corpora, respectively. Moreover, we used the Androtomist tool [5] to extract permissions and intents for each of these datasets. Figure 1 illustrates the average feature importance scores across all three datasets. Feature importance scores are assigned by coefficients calculated as part of an Information Gain (IG) model per set of features. Coefficients and feature ranking were calculated using the Orange data mining tool [13]. Finally, Tables 2, 3, and 4 include the top 10 features for Drebin, VirusShare and AndroZoo, respectively.
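As a rough illustration of how an IG ranking can be obtained, the sketch below scores each binary feature (present or absent in an app) by the reduction in label entropy it yields, and then averages the scores per feature category. This is a textbook formulation written for this example; the exact computation performed by the Orange tool may differ in detail. The function names, the rule that maps a feature name to a category, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_features(apps, labels):
    """apps: list of feature-name sets; returns (feature, IG) pairs, highest IG first."""
    features = sorted(set().union(*apps))
    scores = {f: information_gain([f in app for app in apps], labels) for f in features}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def average_by_category(ranking):
    """Average IG per category (assumed rule: names containing '.permission.' are permissions)."""
    grouped = defaultdict(list)
    for feature, score in ranking:
        category = "Permissions" if ".permission." in feature else "Intents"
        grouped[category].append(score)
    return {category: sum(scores) / len(scores) for category, scores in grouped.items()}


# Toy balanced dataset for illustration only: 1 = malware, 0 = benign.
APPS = [
    {"android.permission.SEND_SMS", "android.permission.INTERNET"},
    {"android.permission.SEND_SMS", "android.permission.INTERNET",
     "android.intent.action.BOOT_COMPLETED"},
    {"android.permission.INTERNET"},
    {"android.permission.INTERNET", "android.intent.action.BOOT_COMPLETED"},
]
LABELS = [1, 1, 0, 0]

RANKING = rank_features(APPS, LABELS)
AVERAGES = average_by_category(RANKING)
print(RANKING)
print(AVERAGES)
```

In the toy data, SEND_SMS separates the two classes perfectly and therefore receives the maximum score of 1 bit, whereas INTERNET, which appears in every app, carries no information about the label.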

Figure 1: Average Feature Importance scores on all three datasets for a varying set of feature categories.

As shown in Figure 1, in the AndroZoo corpus, intents produced a much higher score than permissions. On the contrary, permissions in the Drebin and VirusShare corpora produced a slightly higher score than intents. Moreover, by looking at the top features of Drebin and VirusShare in Tables 2 and 3, it can be deduced that the top features of the two corpora are similar. More specifically, the first three features are the same in both Drebin's and VirusShare's top 10. In total, 7 out of 10 features are common to both tables. On the other hand, AndroZoo has only 1 out of 10 features in common with Drebin's top 10, and none in common with VirusShare's top 10. Lastly, all of AndroZoo's top 10 features are intents. This further demonstrates the difference in feature importance among datasets.

IG Score     Feature                                    Category
0.229442180  android.permission.INTERNET                Permissions
0.213080404  android.permission.READ_PHONE_STATE        Permissions
0.133550416  android.permission.SEND_SMS                Permissions
0.099434066  android.permission.WRITE_EXTERNAL_STORAGE  Permissions
0.096589693  android.permission.RECEIVE_BOOT_COMPLETED  Permissions
0.093909562  android.permission.RECEIVE_SMS             Permissions
0.085747208  android.permission.READ_SMS                Permissions
0.081024699  android.intent.action.BOOT_COMPLETED       Intents
0.070652567  com.google.android.c1dm.intent.RECEIVE     Intents
0.068308352  android.permission.ACCESS_COARSE_LOCATION  Permissions

Table 2: Top 10 features in the Drebin dataset.

IG Score     Feature                                    Category
0.230578242  android.permission.INTERNET                Permissions
0.227663851  android.permission.READ_PHONE_STATE        Permissions
0.171320574  android.permission.SEND_SMS                Permissions
0.147704371  android.permission.RECEIVE_SMS             Permissions
0.132844113  android.permission.WRITE_EXTERNAL_STORAGE  Permissions
0.106708485  android.permission.READ_SMS                Permissions
0.095844608  android.intent.category.HOME               Intents
0.092664451  android.intent.action.DATA_SMS_RECEIVED    Intents
0.064810723  android.intent.action.BOOT_COMPLETED       Intents
0.061030589  android.permission.WAKE_LOCK               Permissions

Table 3: Top 10 features in the VirusShare dataset.

IG Score     Feature                                              Category
0.155045813  android.intent.action.USER_PRESENT                   Intents
0.140128587  android.intent.action.PACKAGE_REMOVED                Intents
0.120851497  android.intent.category.DEFAULT                      Intents
0.076919815  android.intent.action.PACKAGE_ADDED                  Intents
0.067279182  android.intent.category.BROWSABLE                    Intents
0.065269457  android.intent.action.VIEW                           Intents
0.058272668  com.google.android.c1dm.intent.RECEIVE               Intents
0.0530116    cn.jpush.android.intent.NOTIFICATION_RECEIVED_PROXY  Intents
0.05217079   android.intent.action.ACTION_POWER_CONNECTED         Intents
0.051841923  org.agoo.android.intent.action.RECEIVE               Intents

Table 4: Top 10 features in the AndroZoo dataset.
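The overlap figures discussed for Tables 2 to 4 can be re-derived with simple set operations. The sketch below copies the feature names from the three tables, in ranked order and exactly as they are listed there; the variable names are chosen for this example.

```python
# Top 10 features per dataset, in ranked order, copied from Tables 2, 3, and 4.
DREBIN = [
    "android.permission.INTERNET",
    "android.permission.READ_PHONE_STATE",
    "android.permission.SEND_SMS",
    "android.permission.WRITE_EXTERNAL_STORAGE",
    "android.permission.RECEIVE_BOOT_COMPLETED",
    "android.permission.RECEIVE_SMS",
    "android.permission.READ_SMS",
    "android.intent.action.BOOT_COMPLETED",
    "com.google.android.c1dm.intent.RECEIVE",
    "android.permission.ACCESS_COARSE_LOCATION",
]
VIRUSSHARE = [
    "android.permission.INTERNET",
    "android.permission.READ_PHONE_STATE",
    "android.permission.SEND_SMS",
    "android.permission.RECEIVE_SMS",
    "android.permission.WRITE_EXTERNAL_STORAGE",
    "android.permission.READ_SMS",
    "android.intent.category.HOME",
    "android.intent.action.DATA_SMS_RECEIVED",
    "android.intent.action.BOOT_COMPLETED",
    "android.permission.WAKE_LOCK",
]
ANDROZOO = [
    "android.intent.action.USER_PRESENT",
    "android.intent.action.PACKAGE_REMOVED",
    "android.intent.category.DEFAULT",
    "android.intent.action.PACKAGE_ADDED",
    "android.intent.category.BROWSABLE",
    "android.intent.action.VIEW",
    "com.google.android.c1dm.intent.RECEIVE",
    "cn.jpush.android.intent.NOTIFICATION_RECEIVED_PROXY",
    "android.intent.action.ACTION_POWER_CONNECTED",
    "org.agoo.android.intent.action.RECEIVE",
]

# Pairwise overlaps between the top-10 lists.
COMMON_DREBIN_VIRUSSHARE = set(DREBIN) & set(VIRUSSHARE)
COMMON_DREBIN_ANDROZOO = set(DREBIN) & set(ANDROZOO)
COMMON_VIRUSSHARE_ANDROZOO = set(VIRUSSHARE) & set(ANDROZOO)

print(len(COMMON_DREBIN_VIRUSSHARE), len(COMMON_DREBIN_ANDROZOO), len(COMMON_VIRUSSHARE_ANDROZOO))
print(DREBIN[:3] == VIRUSSHARE[:3])
```

Running the sketch gives overlaps of 7, 1, and 0 features, and confirms that the first three entries of the Drebin and VirusShare rankings coincide.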

6 Conclusions

In this work we presented a short survey of all major datasets, dated from 2012 to 2020. Moreover, we compared the feature importance of permissions and intents across the most common datasets, namely Drebin, VirusShare and AndroZoo. Lastly, we reported the most important features of each of these three datasets, as well as similarities and differences between the top features of each dataset. Our results reveal a noteworthy difference in feature importance when inspecting our most recent dataset, i.e., AndroZoo. As future work, we aim to enhance this research by also adding features stemming from dynamic analysis.

References

  • [1] statcounter. Available: https://gs.statcounter.com/os-market-share/mobile/worldwide, Accessed: 2020-07-26
  • [2] V. Kouliaridis, K. Barmpatsalou, G. Kambourakis, and S. Chen, A Survey on Mobile Malware Detection Techniques, IEICE Transactions on Information and Systems, 2, pp. 204-211, 2020.
  • [3] A. Feizollah, N. B. Anuar, R. Salleh, and A. W. A. Wahab, A review on feature selection in mobile malware detection, Digital Investigation, 13, pp. 22-37, 2015.
  • [4] K. Zhao, D. Zhang, X. Su, and W. Li, Fest: A feature extraction and selection tool for Android malware detection, In 2015 IEEE Symposium on Computers and Communication (ISCC), pp. 714-720, 2015.
  • [5] V. Kouliaridis, G. Kambourakis, D. Geneiatakis, and N. Potha, Two Anatomists Are Better than One-Dual-Level Android Malware Detection, Symmetry, 12(7), pp. 1128, 2020.
  • [6] Contagio. Available: http://contagiodump.blogspot.com/, Accessed: 2020-07-26
  • [7] Y. Zhou and X. Jiang, Dissecting Android Malware: Characterization and Evolution, Proceedings of the 33rd IEEE Symposium on Security and Privacy, 2012.
  • [8] D. Arp, M. Spreitzenbarth, M. Huebner, H. Gascon, and K. Rieck, Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket, 21st Annual Network and Distributed System Security Symposium (NDSS), 2014.
  • [9] AMD Malware Dataset. Available: http://amd.arguslab.org/, Accessed: 2020-07-26
  • [10] virusshare. Available: https://virusshare.com/, Accessed: 2020-07-26
  • [11] K. Allix, T. Bissyandé, J. Klein, and Y. Le Traon, AndroZoo: Collecting Millions of Android Apps for the Research Community, In Proceedings of the 13th International Conference on Mining Software Repositories, ACM, pp. 468-471, 2016
  • [12] Google Play. Available: https://play.google.com/, Accessed: 2020-07-26
  • [13] Orange data mining tool. Available: https://orange.biolab.si/, Accessed: 2020-07-26