An investigation of the classifiers to detect android malicious apps

02/23/2018
by   Ashu Sharma, et al.
BITS Pilani

The number of Android devices is growing exponentially, and these devices are connected through the internet to billions of online websites. Their popularity encourages malware developers to penetrate the market with malicious apps that annoy and disrupt the victim. Although different approaches have been discussed for the detection of malicious apps, the proposed approaches do not suffice to detect advanced malware and limit or prevent the damage. Very few of these approaches use opcode occurrence to classify malicious apps. Therefore, this paper investigates five classifiers that use opcode occurrence as the prominent features for the detection of malicious apps. For the analysis we use the WEKA tool and find that the detection accuracy of FT (79.27%) is the best among the investigated classifiers. However, the true positive rate, i.e. the malware detection rate, is highest (99.91%) for RF and fluctuates least with the number of prominent features compared to the other studied classifiers. The analysis shows that overall accuracy is majorly affected by the false positives of the classifier.


1 Introduction

Android is one of the most popular operating systems for smart devices, which are connected through the internet to billions of online websites. The exponential increase in Android apps is basically due to the open source platform, third-party distribution, a free and rich SDK, and the well-suited Java language. In this growing Android app market, it is very hard to know which apps are spam or contain malware. Statista [29] tracks the number of Android apps available on the Google Play store. In addition, many third-party Android apps are available to users [3], and these may be malicious. Hence the potential for malicious apps or malware to enter these systems is now at never-before-seen levels.

Due to their ease of use, these devices hold sensitive information such as personal data, browsing history, shopping history, financial details, etc. [1]. As users access the internet ever more frequently, these devices become vulnerable to cyber threats and attacks. Quick Heal Threat Research Labs reported, for the 3rd quarter of 2015, the daily rate at which they received file samples for the Android and Windows platforms, and the G Data security experts expected a rapid increase in the number of new malware samples in 2016 compared to previous years [6].

The traditional approach to detecting advanced malicious Android apps, i.e. signature-based techniques, is no longer effective, because such malware uses code obfuscation techniques. A number of methods based on static and dynamic analysis have been proposed for analysing and detecting Android malware prior to installation [7] [8] [12] [19] [33]. However, it appears that the approaches proposed so far do not suffice to detect advanced malware and limit or prevent the damage [28]. Therefore, we investigated five classifiers (FT, Random forest, J48, LMT and NBT) and present a novel approach to combat the malware threat by analysing the opcode occurrence in the apps. The remainder of the paper is organised as follows. In the next section, we discuss the related work. Section 3 describes our approach to detecting malicious apps based on static analysis. The results of our approach are discussed in section 4. Finally, section 5 contains the conclusion and directions for future work.

2 Related work

Static and dynamic analysis are the two main approaches applied to the detection of Android malware [28]. In static analysis, the code is analysed without executing the apps, and malicious patterns are found by extracting features such as permissions, APIs used, control flow, data flow, broadcast receivers, intents, hardware components, etc. In dynamic analysis, the apps are examined in a run-time environment by monitoring their dynamic behaviour (network connections, system calls, resource usage, etc.) and the system response. In both approaches, the selected classifiers are trained with a known dataset to differentiate benign and malicious apps. Seo et al. detected potentially malicious scripts in Android apps by analysing the permissions, dangerous APIs and keywords associated with malicious behaviours [25]. Arp et al. discussed a lightweight framework that uses the AndroidManifest.xml file and the disassembled code to generate a joint vector space [4]. The approach of Wu et al. detects malware by analysing AndroidManifest.xml and tracing the system calls [32]. Sanz et al. analysed five machine learning classifiers (DT, KNN, BN, RF and SVM) for automatic malware detection using different sets of Android market permissions, ratings and the number of ratings. They found that among the five classifiers BN performs best, RF second, and DT worst [22]. Vidas et al. developed a tool that automatically analyses apps to find the least permissions/privileges required to run them [30]. The method of Fuchs et al. analyses the data flow across Android app components [9]. Arp et al. did a broad static analysis by embedding the features in a joint vector space, such that the typical patterns of malware can be identified automatically [4]. In the DREBIN project, a study was done with 123,453 benign and 5,560 malware apps. Based on a set of characteristics derived from binary and metadata, Gonzalez et al. proposed a method named DroidKin, which can detect the similarity among apps under various levels of obfuscation (code reordering, register reassignment, etc. [28] [27]) [11]. The SVM-based malware detection scheme given by Gugian et al. integrates both risky permission combinations and vulnerable API calls and uses them as features for the classification [24]. Saracino et al. [23] proposed a novel host-based malware detection system called MADAM, which simultaneously analyses and correlates features at four levels (kernel, application, user and package) to detect and stop malicious behaviours. Jerome et al. use opcode sequences to detect malicious apps; however, the approach will not detect completely different malware [14]. Later, using N-opcode analysis, Kang et al. classified malware and reported an F-measure of 98% [15].

3 Our approach

The novel approach to classifying unknown Android malware is shown in figure 1. It involves finding the promising features (algorithm 1), training the classifiers, and detecting the malware.

Figure 1: Flow chart of the proposed approach for detection of android malicious apps.

3.1 Data Preprocessing and Feature Selection

For the classification of unknown Android malware apps, we downloaded 5531 Android malware apps from DREBIN [4] and 2691 benign apps from the Google Play store. The benign apps were cross-verified with virustotal.com [2].

To understand the logic of Android malware apps, we use the freely available apktool [31] to decompress the Android files. After decompressing, we kept only the disassembled code files and discarded the other files and folders that were created. Each retained file holds the information of exactly one class. To find the prominent features for classifying Android apps as malware or benign, we extracted the opcodes of the apps from the retained files (a list of the Android opcodes is available at http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html). We analysed the opcode occurrence of all the Android apps and found that the occurrence of many opcodes differs considerably between malware and benign apps. The normalized opcode occurrence of both kinds of apps is shown in figure 2. The mapping of the opcodes to their hexadecimal representation has been kept the same as given by the Android developers [18]. The prominent opcodes (features), which are supposed to distinguish malicious from benign Android apps, are obtained as described in algorithm 1. For the classification, we used the Waikato Environment for Knowledge Analysis (WEKA) tool, a collection of visualisation tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality [13], in which many inbuilt classifiers are available. On the basis of the studies done by Sharma and Sahay [26] [21], we selected the best classifiers (Random forest [20], LMT (Logistic model trees) [17], NBT (Naive-Bayes tree) [16], J48 [5] and FT (Functional Tree) [10]) for in-depth analysis using the K-fold cross-validation technique.
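The opcode counting step can be illustrated with a minimal sketch. It assumes the disassembled files are smali text as produced by apktool, which the text above does not state explicitly; the function name `count_opcodes` and the regular expression are hypothetical, and the parsing is deliberately simplified (it only looks at the first token of each line inside a method body).

```python
import re
from collections import Counter

# Tokens shaped like Dalvik mnemonics, e.g. "const/4" or "invoke-virtual/range".
# This shape check is an assumption of the sketch, not a full opcode table.
OPCODE_SHAPE = re.compile(r"^[a-z][a-z0-9]*(?:[-/][a-z0-9]+)*$")


def count_opcodes(smali_text):
    """Count opcode occurrences inside the method bodies of one smali file."""
    counts = Counter()
    in_method = False
    annotation_depth = 0
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(".method"):
            in_method = True
            continue
        if line.startswith(".end method"):
            in_method = False
            continue
        if line.startswith(".annotation"):
            annotation_depth += 1
            continue
        if line.startswith(".end annotation"):
            annotation_depth -= 1
            continue
        if not in_method or annotation_depth > 0:
            continue  # only instructions inside method bodies are counted
        token = line.split()[0]
        # Directives (".locals"), labels (":cond_0") and literals ("0x1")
        # do not match the mnemonic shape and are ignored.
        if OPCODE_SHAPE.match(token):
            counts[token] += 1
    return counts
```

Summing the returned counters over all files of an app gives that app's opcode occurrence vector.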

Figure 2: Dominant opcodes of malicious and benign android apps.

INPUT: Pre-processed data,
the number of benign android apps, the number of malware android apps,
and the total number of prominent features required.
OUTPUT: List of prominent features

  BEGIN
  for all benign apps  do
     Compute the sum of the frequencies of each opcode and normalize it.
  end for
  for all malware apps  do
     Compute the sum of the frequencies of each opcode and normalize it.
  end for
  for all opcodes  do
     Find the difference between the normalized benign and malware frequencies of the opcode.
  end for
  return   the required number of prominent opcodes, i.e. those with the highest difference, as features.
Algorithm 1: Feature Selection
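
The feature selection steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the normalization formula is not recoverable from the text, so the sketch assumes each class's opcode counts are divided by that class's total opcode count, and it ranks opcodes by the absolute difference. The function names are hypothetical.

```python
from collections import Counter


def normalized_frequencies(per_app_counts):
    """Sum opcode counts over all apps of one class and scale them to sum to 1."""
    totals = Counter()
    for counts in per_app_counts:
        totals.update(counts)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Assumed normalization: share of each opcode in the class total.
    return {opcode: n / grand_total for opcode, n in totals.items()}


def select_prominent_opcodes(benign_counts, malware_counts, n_features):
    """Return the n_features opcodes whose normalized frequencies differ most."""
    benign = normalized_frequencies(benign_counts)
    malware = normalized_frequencies(malware_counts)
    opcodes = set(benign) | set(malware)
    # Assumed measure: absolute difference of the normalized frequencies.
    difference = {
        op: abs(benign.get(op, 0.0) - malware.get(op, 0.0)) for op in opcodes
    }
    # Sort by descending difference; break ties by name for a stable result.
    ranked = sorted(opcodes, key=lambda op: (-difference[op], op))
    return ranked[:n_features]
```

Each element of `benign_counts` and `malware_counts` is a mapping from opcode to its count in one app; calling the function with `n_features=200` would correspond to the top 200 features used in the result analysis.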
Figure 3: Detection accuracy obtained by the selected five classifiers with different number of prominent features.
Figure 4: Best accuracy obtained by the selected five classifiers.
Figure 5: True positives obtained by selected five classifiers with different number of prominent features.
Figure 6: True negatives obtained by selected five classifiers with different number of prominent features.
Figure 7: False negatives obtained by selected five classifiers with different number of prominent features.
Figure 8: False positives obtained by selected five classifiers with different number of prominent features.

4 Result analysis

The five selected classifiers are analysed by applying a supervised machine learning technique with K-fold cross-validation for k = 10. For the analysis, we first obtained the top 200 promising features (algorithm 1). The accuracy of the classifiers is obtained by varying the number of promising features and is measured by the equation

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100 \qquad (1)$$

where,
TP: true positives, the number of malware apps correctly classified.
FN: false negatives, the number of malware apps incorrectly classified.
TN: true negatives, the number of benign apps correctly classified.
FP: false positives, the number of benign apps incorrectly classified.
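
As a small worked sketch, the accuracy measure and the per-class rates discussed in this section can be computed from the four counts as below. It assumes the standard definition of accuracy as a percentage; the function names are hypothetical, and the counts in the usage note are made up for illustration and are not results from this study.

```python
def accuracy_percent(tp, fn, tn, fp):
    """Percentage of all apps (malware and benign) that are classified correctly."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("at least one sample is required")
    return 100.0 * (tp + tn) / total


def true_positive_rate(tp, fn):
    """Malware detection rate in percent: correctly flagged malware over all malware."""
    return 100.0 * tp / (tp + fn)


def true_negative_rate(tn, fp):
    """Benign detection rate in percent: correctly passed benign apps over all benign apps."""
    return 100.0 * tn / (tn + fp)
```

With hypothetical counts of 90 true positives, 10 false negatives, 40 true negatives and 60 false positives, the true positive rate is 90.0 while the accuracy is only 65.0, which shows how a large number of false positives can pull the overall accuracy down even when most malware is detected.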

The performance of the classifiers has been studied by taking 20% of the available data (not used for training) with the 20-200 best features, incrementing by 20 features at each step; the results obtained are shown in figure 3. The best accuracy obtained by each of FT, Random forest, J48, LMT and NBT is shown in figure 4, with FT achieving the best accuracy (79.27%). Among these classifiers, the least fluctuation in accuracy when varying the features is observed for Random forest (RF). Figure 5 shows the TPR (malware detection rate) of all five classifiers with a different number of features. We found that RF gives the maximum TPR with the least fluctuation compared to the other classifiers.
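
The evaluation loop over folds and feature counts can be outlined with a minimal sketch. It assumes plain (non-stratified) 10-fold splitting with a fixed shuffle seed; the study itself used WEKA, whose exact splitting behaviour is not described here, and the names `k_fold_indices` and `FEATURE_COUNTS` are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Feature counts examined in the text: 20 to 200 in steps of 20.
FEATURE_COUNTS = list(range(20, 201, 20))
```

For each value in `FEATURE_COUNTS`, a classifier would be trained on the `train` indices and scored on the `test` indices of every fold, and the per-fold counts combined into the reported measures.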

Figure 6 shows the TNR (benign detection rate) of all five classifiers with a different number of features. Here, with some exceptions, we observed that FT detects benign apps better than the other classifiers across the different numbers of features. Figure 7 shows the false negatives of all selected classifiers; here RF performs well compared to the other classifiers and also fluctuates least with the number of features. Figure 8 shows the false positives of the analysed classifiers. Here we observed that none of the five classifiers gives a good result, which strongly affects the final accuracy. However, although the false negatives of RF are not up to par, its fluctuation with the number of features is the least compared to the other classifiers.

5 Conclusion

The threat from malicious apps on Android devices is now at never-before-seen levels, as millions of Android apps are available officially (Google Play store) and unofficially. Some of these apps may be malicious, hence these devices are very vulnerable to cyber threats and attacks. The consequences will be devastating if counter-measures are not developed in time. Therefore, in this paper, we investigated five classifiers (FT, Random forest, J48, LMT and NBT) for the detection of malicious apps. We found that among the studied classifiers, FT is the best classifier and detects malware with 79.27% accuracy. However, the true positive rate, i.e. the malware detection rate, is highest (99.91%) for RF and fluctuates least with the number of prominent features compared to the other studied classifiers, which is better than the F-measure (98%) of Kang et al. [15]. The analysis shows that overall accuracy is majorly affected by the false positives of the classifier. Hence, more detailed studies are required in future to decrease the false positive and false negative ratios for good overall accuracy. Work in this direction is in progress and is showing impressive results.

Acknowledgments

Mr. Ashu Sharma is thankful to BITS, Pilani, K.K. Birla Goa Campus for the support to carry out his work through Ph.D. scholarship No. Ph603226/Jul. 2012/01.

References

  • [1] Threat report 3rd quarter, 2015 (2015), http://www.quickheal.co.in/resources/threat-reports
  • [2] Virustotal - free online virus, malware and url scanner (june 2016), https://www.virustotal.com/
  • [3] 9apps: Free android apps download (August 2016), http://www.9apps.com/
  • [4] Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: Drebin: Effective and explainable detection of android malware in your pocket. In: NDSS (2014)
  • [5] Bhargava, N., Sharma, G., Bhargava, R., Mathuria, M.: Decision tree analysis on j48 algorithm for data mining. Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering 3(6) (2013)
  • [6] Data, G.: Mobile malware report. Tech. rep., G DATA (2015)
  • [7] Enck, W., Gilbert, P., Han, S., Tendulkar, V., Chun, B.G., Cox, L.P., Jung, J., McDaniel, P., Sheth, A.N.: Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS) 32(2),  5 (2014)
  • [8] Felt, A.P., Chin, E., Hanna, S., Song, D., Wagner, D.: Android permissions demystified. In: Proceedings of the 18th ACM conference on Computer and communications security. pp. 627–638. ACM (2011)
  • [9] Fuchs, A.P., Chaudhuri, A., Foster, J.S.: Scandroid: Automated security certification of android. Tech. rep., University of Maryland Department of Computer Science (2009)
  • [10] Gama, J.: Functional trees. Machine Learning 55(3), 219–250 (2004)
  • [11] Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: Droidkin: Lightweight detection of android apps similarity. In: International Conference on Security and Privacy in Communication Systems. pp. 436–453. Springer (2014)
  • [12] Grace, M., Zhou, Y., Zhang, Q., Zou, S., Jiang, X.: Riskranker: scalable and accurate zero-day android malware detection. In: Proceedings of the 10th international conference on Mobile systems, applications, and services. pp. 281–294. ACM (2012)
  • [13] Holmes, G., Donkin, A., Witten, I.H.: Weka: A machine learning workbench. In: Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on. pp. 357–361. IEEE (1994)
  • [14] Jerome, Q., Allix, K., State, R., Engel, T.: Using opcode-sequences to detect malicious android applications. In: 2014 IEEE International Conference on Communications (ICC). pp. 914–919. IEEE (2014)
  • [15] Kang, B., Yerima, S.Y., McLaughlin, K., Sezer, S.: N-opcode analysis for android malware classification and categorization. In: Cyber Security And Protection Of Digital Services (Cyber Security), 2016 International Conference On. pp. 1–7. IEEE (2016)
  • [16] Kohavi, R.: Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In: KDD. vol. 96, pp. 202–207. Citeseer (1996)
  • [17] Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine Learning 59(1-2), 161–205 (2005)
  • [18] Paller, G.: Dalvik opcodes, http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html
  • [19] Reina, A., Fattori, A., Cavallaro, L.: A system call-centric analysis and stimulation technique to automatically reconstruct android malware behaviors. EuroSec, April (2013)
  • [20] Rodriguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: A new classifier ensemble method. IEEE transactions on pattern analysis and machine intelligence 28(10), 1619–1630 (2006)
  • [21] Sahay, S.K., Sharma, A.: Grouping the executables to detect malwares with high accuracy. Procedia Computer Science 78, 667–674 (2016)
  • [22] Sanz, B., Santos, I., Laorden, C., Ugarte-Pedrero, X., Bringas, P.G.: On the automatic categorisation of android applications. In: 2012 IEEE Consumer communications and networking conference (CCNC). pp. 149–153. IEEE (2012)
  • [23] Saracino, A., Sgandurra, D., Dini, G., Martinelli, F.: Madam: Effective and efficient behavior-based android malware detection and prevention (2016)
  • [24] Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural computation 13(7), 1443–1471 (2001)
  • [25] Seo, S.H., Gupta, A., Sallam, A.M., Bertino, E., Yim, K.: Detecting mobile malware threats to homeland security through static analysis. Journal of Network and Computer Applications 38, 43–53 (2014)
  • [26] Sharma, A., Sahay, S.K.: An effective approach for classification of advanced malware with high accuracy. International Journal of Security and Its Applications 10(4), 249–266 (2016)
  • [27] Sharma, A., Sahay, S.K., Kumar, A.: Improving the detection accuracy of unknown malware by partitioning the executables in groups. In: Advanced Computing and Communication Technologies, pp. 421–431. Springer (2016)
  • [28] Sharma, A., Sahay, S.K.: Evolution and detection of polymorphic and metamorphic malwares: a survey. International Journal of Computer Applications 90(2), 7–11 (March 2014)
  • [29] Statista: Number of available applications in the google play store from december 2009 to february 2016 (August 2016), https://developer.android.com/guide/topics/security/permissions.html
  • [30] Vidas, T., Christin, N., Cranor, L.: Curbing android permission creep. In: Proceedings of the Web. vol. 2, pp. 91–96 (2011)
  • [31] Winsniewski, R.: Android–apktool: A tool for reverse engineering android apk files (2012)
  • [32] Wu, D.J., Mao, C.H., Wei, T.E., Lee, H.M., Wu, K.P.: Droidmat: Android malware detection through manifest and api calls tracing. In: Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on. pp. 62–69. IEEE (2012)
  • [33] Yan, L.K., Yin, H.: Droidscope: seamlessly reconstructing the os and dalvik semantic views for dynamic android malware analysis. In: Presented as part of the 21st USENIX Security Symposium (USENIX Security 12). pp. 569–584 (2012)