Mobile malware detection has received considerable attention in recent years. However, while significant research has been devoted to mobile malware detection techniques, little work has focused on feature selection and feature importance. In this work we provide a short survey of all major datasets used by state-of-the-art malware detection works, dated from 2012 to 2020. Furthermore, we analyze applications from the most common datasets, namely Drebin, VirusShare and AndroZoo, to extract permissions and intents. Lastly, we rank features using the Information Gain algorithm to compare the importance of permissions and intents across all three datasets.
Mobile devices are a part of our everyday activity. From social networks to mobile banking transactions, mobile devices are trusted by millions of people. According to statcounter, Android is the most popular mobile OS, with a market share of more than 74%. Moreover, while the topic of mobile malware detection has received a lot of attention recently, the importance of each feature category in mobile malware classification is not clear. Current mobile malware detection approaches lean towards static anomaly-based detection. Anomaly-based detection comprises two phases: the training phase and the detection (or testing) phase. It usually employs machine learning to detect malicious behavior, i.e., deviation from what was learned during the training phase. Lastly, the recent popularity of static analysis techniques arises from the fact that static analysis does not require the app to be running, and is thus usually faster and easier to implement.
In this work we provide a short survey of all major datasets, as well as a detailed feature importance comparison between the three most common datasets used by recent malware detection works, i.e., Drebin, VirusShare, and AndroZoo. Specifically, using the average coefficients of permissions and intents, we show the most important feature category for each dataset. Lastly, we report the top features for each dataset and discuss similarities between corpora.
The remainder of this paper is organized as follows. The next section discusses the related work. Section 4 details the different datasets used to evaluate mobile malware detection approaches. Section 5 provides our results on feature importance. The last section concludes and provides pointers to future work.
3 Related work
This section presents previous work on feature importance and feature selection, a topic that has not received much attention.
Feizollah et al.  categorized available features into four groups, namely, static features, dynamic features, hybrid features and application metadata. Furthermore, the authors compared these features with regard to the difficulty of extraction and their popularity in the relevant literature. Finally, the authors provided a survey of the available datasets. On the downside, the only available datasets at the time of this research were Contagio, MalGenome and Drebin.
Zhao et al.  proposed a feature selection algorithm called FrequenSel. According to the authors, FrequenSel selects features which are frequently used in malware and rarely used in benign apps, thus they can distinguish malware from benign apps. During their experiments, the authors evaluated their approach with 7972 apps, and their results reported an accuracy of up to 98%.
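The selection idea described above (keep features that are frequent in malware and rare in benign apps) can be illustrated with a minimal sketch. This is not the FrequenSel algorithm itself, whose exact scoring is not given here; the function name `frequency_gap_select`, the `min_gap` threshold, and the toy feature sets are all assumptions made for illustration.

```python
from collections import Counter


def frequency_gap_select(malware_apps, benign_apps, min_gap=0.3):
    """Return (feature, gap) pairs whose malware usage rate exceeds the
    benign usage rate by at least min_gap, largest gap first.

    Hypothetical helper: a simple frequency-difference filter, not the
    published FrequenSel scoring.
    """
    if not malware_apps or not benign_apps:
        raise ValueError("both app collections must be non-empty")
    # Count in how many apps of each class a feature appears.
    mal_counts = Counter(f for app in malware_apps for f in set(app))
    ben_counts = Counter(f for app in benign_apps for f in set(app))
    selected = []
    for feature, count in mal_counts.items():
        mal_rate = count / len(malware_apps)
        ben_rate = ben_counts[feature] / len(benign_apps)
        gap = mal_rate - ben_rate
        if gap >= min_gap:
            selected.append((feature, gap))
    # Sort by descending gap, then by name for a deterministic order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))


# Toy data (invented): each app is the set of features it declares.
malware = [
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS", "READ_PHONE_STATE"},
    {"SEND_SMS", "INTERNET"},
]
benign = [{"INTERNET"}, {"INTERNET", "CAMERA"}, {"INTERNET"}]

selected = frequency_gap_select(malware, benign)
for feature, gap in selected:
    print(f"{feature}: gap={gap:.2f}")
```

In this toy example, INTERNET is dropped because it is at least as common in benign apps as in malware, which is the behavior the description above suggests.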
Kouliaridis et al.  proposed an online tool called Androtomist, which performs hybrid analysis on Android apps. The authors focused on the importance of dynamic instrumentation, as well as the improvement of detection when hybrid analysis is used in contrast to static analysis. During their experiments, the authors compared feature importance between 3 datasets, namely Drebin, VirusShare and AndroZoo. However, the authors used a small subset of each dataset during their experiments.
To the best of our knowledge, none of the above works compare feature importance across multiple datasets with a large number of samples.
In the context of mobile malware detection, several corpora have been analyzed by researchers to evaluate mobile malware detection approaches. This section provides a survey, in chronological order, of the major mobile malware datasets used in the literature. Moreover, Table 1 compares all datasets with regard to their size, access, and updates. As shown in Table 1, AndroZoo and VirusShare are the only datasets still being updated today.
Contagio  Contagio mini dump is a publicly available repository of mobile malware samples. The samples were collected in 2010 and the dataset contains over 189 malware samples.
MalGenome  The MalGenome dataset, released in 2012, contains 1260 malware samples categorized into 49 different malware families. The malware samples are dated from August 2010 to October 2011. Unfortunately, the MalGenome project stopped sharing its dataset in December 2015.
Drebin  Drebin comprises 5560 malware samples across 179 different families. The samples were collected between August 2010 and October 2012. Drebin is one of the most popular datasets and is referenced in many works in the literature.
AMD  The AMD is a publicly shared dataset which contains 24,553 samples, categorized in 135 varieties among 71 malware families. The samples are dated from 2010 to 2016.
VirusShare  VirusShare is an online repository of malware samples. Access to the site is granted by invitation only. The dataset is not limited to mobile malware but also contains samples for various other platforms. Furthermore, it is updated regularly and contains samples from various years.
5 Feature importance
A key factor that affects the performance of malware detection methods is the importance of features contained in malware samples. To this end, this section presents our results on feature importance of apps collected from the most common datasets. As already pointed out in Section 4, VirusShare and AndroZoo seem to be the only datasets still being updated today. Furthermore, the Drebin dataset has been used by most research works on the topic of mobile malware detection so far, thus making it ideal when comparing new detection methods with previous state-of-the-art. Specifically, we collected 1000 random malware samples from each of these three datasets, as well as 1000 random benign apps from Google Play, to create three balanced datasets of both malware and benign apps. The samples are dated from 2010 to 2012, 2014 to 2017, and 2017 to 2020 for the Drebin, VirusShare and AndroZoo corpora respectively. Moreover, we used the Androtomist tool to extract permissions and intents for each of these datasets. Figure 1 illustrates the average feature importance scores across all three datasets. Feature importance scores are assigned by coefficients calculated as part of an Information Gain (IG) model per set of features. Coefficients and feature ranking were calculated using the Orange data mining tool. Finally, Tables 2, 3 and 4 include the top 10 features for Drebin, VirusShare and AndroZoo respectively.
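To make the ranking step concrete, the following is a minimal sketch of how an Information Gain score can be computed for binary features (present or absent in an app) and then averaged per feature category. It uses only the Python standard library and is not the Orange implementation used for the reported results; the feature names, the `category:name` naming scheme, and the toy labels are assumptions made for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h


def information_gain(feature_values, labels):
    """IG = H(labels) - sum_v P(feature = v) * H(labels | feature = v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy balanced dataset (invented): 1 = malware, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# One column per feature; 1 means the app declares that feature.
features = {
    "permission:SEND_SMS": [1, 1, 1, 1, 0, 0, 0, 0],
    "permission:INTERNET": [1, 1, 1, 1, 1, 1, 1, 1],
    "intent:BOOT_COMPLETED": [1, 1, 1, 0, 0, 0, 0, 0],
}

scores = {name: information_gain(col, labels) for name, col in features.items()}
ranking = sorted(scores, key=lambda name: -scores[name])

# Average score per category (the prefix before the colon).
by_category = {}
for name, score in scores.items():
    by_category.setdefault(name.split(":")[0], []).append(score)
category_avg = {cat: sum(v) / len(v) for cat, v in by_category.items()}

for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
print(category_avg)
```

A feature that is present in every app, such as the hypothetical INTERNET permission here, receives a score of zero because it does not help separate the two classes.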
As shown in Figure 1, in the AndroZoo corpus, intents produced a much higher score than permissions. In contrast, permissions in the Drebin and VirusShare corpora produced a slightly higher score than intents. Moreover, by looking at the top features of Drebin and VirusShare in Tables 2 and 3, it can be deduced that the top features of the two corpora are similar. More specifically, the first three features are the same in both Drebin's and VirusShare's top 10. In total, 7 out of 10 features are common to both tables. On the other hand, AndroZoo has 1 out of 10 features in common with Drebin's top 10, and no features in common with VirusShare's top 10. Lastly, all of AndroZoo's top 10 features are intents. This further demonstrates the difference in feature importance among datasets.
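The kind of comparison made above (how many top-ranked features two corpora share, and how long their identical leading run is) can be sketched as follows. The helper name `top_k_overlap` and the two short rankings are hypothetical and do not reproduce Tables 2 to 4.

```python
def top_k_overlap(ranking_a, ranking_b, k=10):
    """Compare the top-k entries of two ranked feature lists.

    Returns the shared features (in ranking_a order) and the number of
    leading positions at which both rankings hold the same feature.
    """
    top_a, top_b = ranking_a[:k], ranking_b[:k]
    in_b = set(top_b)
    shared = [feature for feature in top_a if feature in in_b]
    same_prefix = 0
    for a, b in zip(top_a, top_b):
        if a != b:
            break
        same_prefix += 1
    return shared, same_prefix


# Invented rankings, best feature first.
corpus_x = ["SEND_SMS", "READ_PHONE_STATE", "RECEIVE_SMS", "INTERNET", "CAMERA"]
corpus_y = ["SEND_SMS", "READ_PHONE_STATE", "BOOT_COMPLETED", "RECEIVE_SMS", "VIBRATE"]

shared, same_prefix = top_k_overlap(corpus_x, corpus_y, k=5)
print(f"{len(shared)} of 5 shared, first {same_prefix} identical")
```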
In this work we presented a short survey of all major datasets, dated from 2012 to 2020. Moreover, we compared the feature importance of permissions and intents across the most common datasets, namely Drebin, VirusShare and AndroZoo. Lastly, we reported the most important features of each of these three datasets, as well as similarities and differences between the top features of each dataset. Our results reveal a noteworthy difference in feature importance when inspecting our most recent dataset, i.e., AndroZoo. As future work, the authors aim to enhance this research by also adding features stemming from dynamic analysis.
-  statcounter. Available: https://gs.statcounter.com/os-market-share/mobile/worldwide, Accessed: 2020-07-26
-  V. Kouliaridis, K. Barmpatsalou, G. Kambourakis, and S. Chen, A Survey on Mobile Malware Detection Techniques, IEICE Transactions on Information and Systems, 2, pp. 204-211, 2020.
-  A. Feizollah, N. B. Anuar, R. Salleh, and A. W. A. Wahab, A review on feature selection in mobile malware detection, Digital Investigation, 13, pp. 22-37, 2015.
-  K. Zhao, D. Zhang, X. Su, and W. Li, Fest: A feature extraction and selection tool for Android malware detection, In 2015 IEEE Symposium on Computers and Communication (ISCC), pp. 714-720, 2015.
-  V. Kouliaridis, G. Kambourakis, D. Geneiatakis, and N. Potha, Two Anatomists Are Better than One-Dual-Level Android Malware Detection, Symmetry, 12(7), pp. 1128, 2020.
-  Contagio. Available: http://contagiodump.blogspot.com/, Accessed: 2020-07-26
-  Y. Zhou, X. Jiang, Dissecting Android Malware: Characterization and Evolution, Proceedings of the 33rd IEEE Symposium on Security and Privacy, 2012.
-  D. Arp, M. Spreitzenbarth, M. Huebner, H. Gascon, and K. Rieck, Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket, 21st Annual Network and Distributed System Security Symposium (NDSS), 2014.
-  AMD Malware Dataset. Available: http://amd.arguslab.org/, Accessed: 2020-07-26
-  virusshare. Available: https://virusshare.com/, Accessed: 2020-07-26
-  K. Allix, T. Bissyandé, J. Klein, and Y. Le Traon, AndroZoo: Collecting Millions of Android Apps for the Research Community, In Proceedings of the 13th International Conference on Mining Software Repositories, ACM, pp. 468-471, 2016
-  Google Play. Available: https://play.google.com/, Accessed: 2020-07-26
-  Orange data mining tool. Available: https://orange.biolab.si/, Accessed: 2020-07-26