1 Introduction

Currently, Android has more than 1.6 billion active users, accounting for more than 70% of the global market share of mobile operating systems. As a result, the Android application market is flooded with apps. We define malicious apps, or malware, as Android applications that present themselves to the user as benign but secretly steal user information in the background. Although the Android application store (Google Play) verifies apps for malicious intent upon release, it does not rigorously track updates from these verified apps and cannot account for third-party apps downloaded independently by the user. A 2020 report by McAfee Advanced Threat Research and Mobile Malware Research suggests that malware developers roll out malware through verified apps in Google Play as updates, shielding themselves from preliminary verification. Undetected malware attacks can steal sensitive and organization-crippling information from users such as photos, documents and browsing data. Data breaches are extremely disastrous for small and midsize firms and businesses. A report by the U.S. Securities and Exchange Commission states that 60% of small firms cannot recuperate from data breaches and go out of business within 6 months. The IBM "Cost of a Data Breach Report 2020" advises companies to establish an incident response (IR) plan to determine the damage done by a breach and contain it as soon as possible. It goes on to state that companies with an IR plan save an average of $2 million in the event of a data breach. Furthermore, the report projects an increase in the costs of data breaches due to the COVID-19 pandemic and the increase in digital reliance. This calls for a need not only to detect malicious attacks but also to identify the stolen data in order to assess the damage, recover strategically and prevent future attacks. Doing so can help in understanding malware trends and aid malware prevention research.
We propose a novel two-stage machine learning approach to detect malicious attacks for any app under supervision and identify the data stolen by the attack to aid in assessment and recovery.
The course of this paper is as follows: Section 2 discusses research works relevant to malware detection. In Section 3, we describe the dataset used in our study extensively. We elucidate the steps taken to make the data computationally feasible in Sections 4 and 5. Later, in Sections 6 and 7 we outline our model architecture and describe the parameters of its evaluation. Finally, in Section 8 we report and discuss our findings.
2 Related Work
Mobile malware detection has been an active and broad area of research for the past several years. Static analysis was one of the first major mobile malware detection approaches proposed [4, 5]. Here, the source code of the target malware is analyzed to identify semantic signatures. Although static analysis can detect malware even before the app is run, static analysis systems fail when the malware uses obfuscation techniques such as code encryption and repackaging. Dynamic analysis techniques [6, 7] address code obfuscation and encryption by executing the source code of the application in an isolated environment and analyzing its runtime characteristics. However, this proves to be a bottleneck in dynamic analysis systems, as isolated, lab-like noiseless data is hard to achieve in a real-world setting. Static and dynamic methods additionally require super-user (root) access, since they need access to the application's source code. Furthermore, Moser et al. [9] suggest that the rate of developing rule-based solutions cannot match the fast rate at which new malware is released; thus, being rule-based and specific, these solutions fail to perform on new malware.
Subsequent approaches outperform static and dynamic methods by modelling network usage for detection. Shabtai et al. [8] used various anomaly detection methods to detect malware using collected system and network data. Ronen et al. [10] and related works go further, detecting and classifying the family of the detected malware by analysing Dalvik bytecode from Android devices. However, these works fail to address the security risk an end user takes on to obtain bytecode, since the required root access exposes the phone to further vulnerabilities. There is a need for non-intrusive malware detection systems based on low-privilege information such as usage statistics, which would allow easier user applicability and ensure better security than super-user-dependent approaches.
We propose modeling malware on usage statistics data, and we consider one of the largest and most granular datasets for mobile sensor and software sampling: the SherLock dataset [15]. As a result of the dataset's versatility, it is suitable for a multitude of use cases, and since probing its data does not require root access, it is safe and reproducible for malware detection. Zheng et al. [16] explored usage patterns and the relationship between mobile usage and the state (benign/malicious) of an application on this data. Wassermann et al. [17] used low-level system features from this dataset with sampling techniques to deal with its inherent class imbalance and detect malicious actions performed on a smartphone.
Although current research tackles malware detection extensively, it fails to address data theft classification to aid damage assessment and recovery from data breaches.
We consider using the SherLock dataset to develop a machine learning malware detection pipeline that would detect if any given app is malicious and identify the data it attempts to steal if detected. We aim to utilize network traffic data, local and global system features to implement this solution.
3 Dataset

For our experiments we used the SherLock dataset. Spanning over 10 billion records and involving over 50 volunteers, the SherLock dataset is the result of a real-world data collection experiment to obtain low-level Android usage data alongside emulated malware. Such statistics do not require root access, making any solution developed on the dataset more secure under real-world circumstances, since rooting exposes a mobile phone to further vulnerabilities.
The experiment introduces two data collection agents to the mobile phones provided to the volunteers: SherLock and Moriarty. Moriarty emulates malicious actions on the volunteers' mobile phones at random points through the course of the experiment, generating distinct labels for malicious and benign actions. Meanwhile, SherLock logs usage attributes and statistics in the background.
Table 1: Malware service types emulated by Moriarty and the information they target.

| Malware Service Type | Target Information |
|----------------------|--------------------|
| GPS | User coordinates (latitude and longitude) |
| URL | Web address of every page visited by the user recently |
| Audio Records | Audio records collected during the session |
| Contacts | Names and phone numbers |
| BrowserInfo | Account details, bookmarks and browser history |
| Photos | Images from gallery |
3.1 Sherlock Data Collection Agent
One of the ways SherLock logs phone attributes is through Pull Probes, which extract data periodically at a constant sampling rate. For our experiments we consider the most frequently sampled pull probe in SherLock, named T4, which samples every 5 seconds. T4 probes Global System Features as well as Local Application Features.
Global System Features (GSF) These features pertain to attributes with a global scope in the Android system such as network traffic, CPU and memory utilization, IO interrupts and WiFi related data. There are a total of 128 Global System Features.
Local Application Features (LAF) Alongside Global System Features, Linux-level data [19] for every running application is sampled. This includes process-specific features such as the scheduling priority, number of bytes transferred, number of threads and kernel-level features used by an application at that time instant. There are a total of 56 Local Application Features.
Local Application Features used in context with Global System Features provide a rich feature set to determine whether a given app exhibits malicious behaviour.
3.2 Moriarty Malicious Agent
Moriarty presents itself to the user as a benign application, such as a game or a browser depending on the version of the app, but covertly performs malicious actions. The malware emulated by each version is dissimilar to its precursor and targets different vulnerabilities, as illustrated in Table 1. The malware used by Moriarty are behavioural copies of malware found in the real world.
The app contains labels indicating whether an action executed is benign or malicious. Furthermore, the details of the malicious actions such as the type of data stolen, number of bytes transmitted and time taken to transfer the stolen information are logged along with the labels. To collect sufficient information for the experiment the volunteers were reminded to use the Moriarty app if they have not used it for a couple of days.
For the experiment we have considered a computationally feasible subset of the SherLock dataset: data collected during the first quarter of 2016, with over 300 million records, spanning 5 users.
4 Data Pre-processing
We aim to enable efficient data merging between Local Application Features (LAF) and Global System Features (GSF). Let $g$ and $n$ denote the number of GSF and LAF respectively. Assuming there are $m$ apps running at the same time, each Global System Feature record would correspond to multiple LAF records at that instant of time. The application data (the LAF of all running apps) at any time instant $t$, denoted by $A_t$, is represented in Equation (1):

$$A_t = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix} \qquad (1)$$

where $a_{i,j}$ denotes the $j$-th local feature of the $i$-th running application. Consequently, if a relational join operation between GSF and LAF were performed, it would generate GSF duplicates for every running application, yielding data of shape $(m,\, g + n)$ whose size per time instant, denoted by $S_d$, is $m(g + n)$ memory units. With the dataset spanning over 300 million records, it becomes essential to reduce memory consumption to expedite the data handling and modeling process. Therefore, to eliminate the duplicates, $A_t$ is transformed into a row vector $P_t$ of shape $(1,\, mn)$ by a PIVOT operation, represented in Equation (2), thus obtaining a functional dependency on time:

$$P_t = [\, a_{1,1},\, \ldots,\, a_{1,n},\, a_{2,1},\, \ldots,\, a_{m,n} \,] \qquad (2)$$

where the columns of $P_t$ are indexed by the pairs $(i, j) \in \mathcal{A} \times \mathcal{F}$, with
$\mathcal{A}$ = set of all applications on the device,
$\mathcal{F}$ = set of local application features.

As a result of using $P_t$ to merge with GSF, as opposed to using $A_t$, we obtain a shape of $(1,\, g + mn)$, and the size of this data per time instant, denoted by $S_p$, is $g + mn$ memory units. The size comparison of the data obtained from merging GSF with and without the pivot operation is illustrated in Equation (3):

$$\frac{S_d}{S_p} = \frac{m(g + n)}{g + mn} \qquad (3)$$

This reduces the overall throughput as the number of applications increases. For the first quarter of 2016 in SherLock, $g = 128$ and $n = 56$, with an average of $\bar{m}$ applications running at any given time. We observed that $S_d / S_p$ was 3.2, indicating that the pivot operation was effective in reducing the size of the merged data. On merging this data with the Moriarty labels, we obtain a dataset with 14,234 features and 5.81 million records.
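The pivot-then-merge strategy can be sketched with pandas on a toy frame. The column names and values here are illustrative, not the SherLock schema; the point is that the naive join duplicates global features per running app, while the pivot yields one row per timestamp.

```python
import pandas as pd

# Toy LAF table: one row per (timestamp, app), n = 2 local features.
laf = pd.DataFrame({
    "t":   [0, 0, 1, 1],
    "app": ["browser", "game", "browser", "game"],
    "cpu": [0.1, 0.5, 0.2, 0.6],
    "mem": [10, 20, 11, 21],
})

# Toy GSF table: one row per timestamp, g = 1 global feature.
gsf = pd.DataFrame({"t": [0, 1], "total_rx_bytes": [100, 150]})

# Naive join duplicates each GSF row once per running app: m rows per t.
naive = gsf.merge(laf, on="t")

# Pivot LAF so each timestamp becomes a single row of m*n app-feature
# columns, then merge with GSF: no duplicated global features.
wide = laf.pivot(index="t", columns="app", values=["cpu", "mem"])
wide.columns = [f"{app}_{feat}" for feat, app in wide.columns]
merged = gsf.merge(wide, left_on="t", right_index=True)  # one row per t
```

With 2 apps and 2 local features, the naive join holds 4 rows while the pivoted merge holds one row per timestamp, mirroring the $S_d / S_p$ saving above.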
5 Feature Selection
We strive to reduce the dataset to its most informative features for smooth and efficient processing. On closer inspection of the 14,234 features, we discovered that 12,726 features have more than 70% null values; removing them leaves 1508 features. However, this remains significantly large to process, considering that we have 5.8 million records.
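The 70%-null threshold amounts to a one-line column filter in pandas; a small sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame: "mostly_null" is 80% null, the others are dense.
df = pd.DataFrame({
    "dense_a":     range(10),
    "dense_b":     np.linspace(0.0, 1.0, 10),
    "mostly_null": [1.0, 2.0] + [np.nan] * 8,
})

# Keep only columns with at most 70% null values (the threshold used above).
null_frac = df.isna().mean()          # per-column fraction of nulls
kept = df.loc[:, null_frac <= 0.70]
```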
To further reduce the feature set, we pursue a feature selection method that ensures relevance towards our objectives: malware detection and target classification. We considered LightGBM, as it has proven to be fast and scalable, especially on high-dimensional datasets [20]. Using this technique we reduce our feature space to the 150 and 100 most important features for malware detection and target classification respectively.
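Importance-based selection can be sketched as follows. To keep the example dependency-light we substitute scikit-learn's `ExtraTreesClassifier` for LightGBM; LightGBM's scikit-learn wrapper (`LGBMClassifier`) exposes the same `feature_importances_` attribute, so it drops in unchanged. The dataset and the choice of k are illustrative (the paper keeps the top 150 and 100 features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the high-dimensional SherLock feature matrix.
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Fit a tree ensemble and rank features by its impurity-based importances.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

k = 10                                               # illustrative cut-off
top_k = np.argsort(model.feature_importances_)[::-1][:k]
X_reduced = X[:, top_k]                              # keep top-k columns only
```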
With the features reduced to less than 15% of the 1508 features, we can now implement stepwise forward selection [21], an iterative method to determine the least number of features required to obtain a given model's best performance. Using stepwise forward selection, we reduce the features required to detect malware to 10 and the features required to determine the data targeted by malware to 16.
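A minimal sketch of stepwise forward selection: greedily add the feature that most improves cross-validated accuracy and stop when no candidate helps. The estimator and synthetic data are illustrative stand-ins for the paper's models and LightGBM-reduced feature sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Score every candidate feature when added to the current subset.
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=3).mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # no candidate improves the model: stop
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)
```

The length of `selected` at termination is the "least number of features" reported above (10 for detection, 16 for target classification in the paper).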
As a result of our feature selection approach the feature set is reduced to approximately 0.1% of the original feature set. Table 2 lists the most important features that were considered for modeling.
6 Modeling and Experimentation
Knowing the kind of data the malware steals could be of more use during data breach assessment compared to just detecting the presence of a malicious action. We propose a two-stage architecture illustrated in Fig. 1 to classify data targeted by a positively detected malware. Our approach detects if a malicious action occurs in the first stage and if positively detected, classifies the data targeted during the malicious action in the second stage.
6.1 Malware Detection
Malicious records are sparse compared to benign ones in the data (1:90), which suggests that anomaly detection methods could be effective for this task. To test this prior assumption, we consider a tree-based anomaly and outlier detection method, Isolation Forest, alongside standard supervised classifiers.
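A minimal sketch of this baseline using scikit-learn's `IsolationForest` on synthetic data with a similar 1:90 imbalance. The clean separation here is an assumption made for the demo; the results section examines what happens when malicious and benign distributions overlap.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(900, 4))      # majority class
malicious = rng.normal(6.0, 1.0, size=(10, 4))    # rare, well-separated (toy)
X = np.vstack([benign, malicious])

# contamination encodes the expected outlier fraction (~1:90 here).
iso = IsolationForest(contamination=10 / 910, random_state=0).fit(X)
pred = iso.predict(X)                             # -1 = outlier, +1 = inlier
n_flagged = int((pred == -1).sum())
```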
6.2 Target classification
The records detected as malicious in the first stage are passed to this stage to classify the data they target. This is a multi-class classification problem: determining the type of data targeted by the malware, as seen in Table 1. We consider three models for this task: Extra Trees, XGBoost and k-nearest neighbours.
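The cascade can be sketched on synthetic data as follows. For simplicity both stages use Extra Trees here; in our pipeline the second stage is XGBoost, which exposes the same fit/predict interface. The labels and features are toy stand-ins, not SherLock fields.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
is_malicious = (X[:, 0] > 1.2).astype(int)    # ~1:8 imbalance (toy labels)
target_type = (X[:, 1] > 0).astype(int)       # 0 = GPS, 1 = Contacts (toy)

# Stage one: detect malicious records.
stage1 = ExtraTreesClassifier(n_estimators=50, random_state=0)
stage1.fit(X, is_malicious)
flagged = stage1.predict(X) == 1

# Stage two: trained on known-malicious records, applied only to the
# records the first stage flags.
stage2 = ExtraTreesClassifier(n_estimators=50, random_state=0)
stage2.fit(X[is_malicious == 1], target_type[is_malicious == 1])
stolen = stage2.predict(X[flagged])
```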
7 Evaluation Metrics
7.1 Malware Detection
In this paper, we propose using the False Omission Rate (FOR) and False Positive Rate (FPR) to evaluate the performance of a malware detector. Accuracy and true positive rate, as considered by [8, 11, 13, 16], are not ideal metrics: on highly imbalanced data such as SherLock they are dominated by the generally high number of correctly classified majority-class records, which masks imprecision on the minority class. Our aim is to reduce the number of instances where malware is misclassified as benign, and FOR and FPR capture this directly.
False Omission Rate (FOR). Illustrated in Equation (4), this metric indicates the fraction of records predicted benign that are actually malicious, i.e., malicious actions that go undetected by the classifier:

$$\mathrm{FOR} = \frac{FN}{FN + TN} \qquad (4)$$
False Positive Rate (FPR). Illustrated in Equation (5), this metric indicates the fraction of benign records that are misclassified as malicious:

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (5)$$
Although each metric can be used individually, we propose using FOR and FPR in conjunction to discover a detector with an overall good fit for detecting the presence of malware. A lower FOR signifies the success of the first stage of our architecture (malware detection), while a lower FPR signifies a smaller error cascading to the next stage. Ideally, both FOR and FPR need to be minimised to improve data classification performance in the second stage of our proposed two-stage architecture.
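Both metrics follow directly from the confusion matrix; a small worked example with scikit-learn and toy labels (1 = malicious, 0 = benign):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# sklearn's 2x2 confusion matrix ravels to (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_omission_rate = fn / (fn + tn)  # malware hiding among predicted-benign
false_positive_rate = fp / (fp + tn)  # benign records wrongly flagged
```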
7.2 Target Classification
Target classification is a multi-class classification task: predicting what kind of data has been stolen by the malware. The different types of stolen data were given equal importance, and hence equal weights were assigned to all classes. Therefore, the macro-averaged F1-score is the metric used to evaluate the model in this stage.
Table 3: False Omission Rate and False Positive Rate of each classifier considered for malware detection.
8 Results and Discussions
To evaluate the performance of our proposed architecture, we train and test on all users combined. Each user's data was proportionally (stratified) sampled into a 75% training and 25% testing split. Since the proposed architecture consists of two stages, performance cascades from each stage to the next; we therefore report results at each stage for a deeper understanding of our performance.
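The per-user stratified 75/25 split can be sketched with scikit-learn; stratifying on the user column makes each user contribute proportionally to both splits. Column names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: two users with 40 and 60 records respectively.
df = pd.DataFrame({
    "user":  ["u1"] * 40 + ["u2"] * 60,
    "value": range(100),
})

# Stratify on the user column so each user is split 75/25.
train, test = train_test_split(df, test_size=0.25, stratify=df["user"],
                               random_state=0)
```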
8.1 Malware Detection
Tree-based classifiers display superior performance on this task, as illustrated in Table 3, owing to their ability to capture discrete and categorical information more accurately.
However, contrary to our prior assumption, the tree-based outlier detection method, Isolation Forest, fails to detect malware, with a FOR of 0.79. Observing the density distributions of some of the most important features in Fig. 2, we discover an overlap between the malicious and benign distributions. Anomaly detection methods are effective at identifying outliers that lie away from the bulk of a distribution. Since unsupervised and anomaly detection methods rely on malware lying outside the benign distribution, they may fail to detect malicious activity in this data.
With the lowest FOR of all the models considered (illustrated in Table 3), Decision Tree and Extra Trees are the best malware detectors, with 6.3% and 8.7% FOR respectively. However, on closer inspection, the Decision Tree detector only achieves this at the cost of a 22.2% FPR. Since this is undesirable for a performance-cascading architecture, as discussed in Section 7.1, we use Extra Trees to determine whether an action is malicious before classifying its target in the next stage of our two-stage model.
8.2 Progressive learning
Before a newly installed app can be monitored, a detector must be trained on data collected from the user through the SherLock framework. The time each user spends training the detector should desirably be reduced, which can be done by minimising the training data required for the detection task. Our detector tackles this problem through the combined stratified training and testing over all users: as the number of users it has learned from increases, it reaches the same accuracy with less training data. This is visualized in Fig. 3, where we consider a threshold of 0.15 FOR to analyse the change in required training data for an Extra Trees detector trained on 1-5 users. When trained on a single user, the detector required at least 76% of the training data to achieve the threshold accuracy. As it learns from more users, the fraction of training data required per user drops; when trained on 5 users, it required only 52.5% of the training data to achieve the threshold FOR.
8.3 Target Classification
Table 4 illustrates the results for classifying the target of the malicious actions predicted by the first stage. Due to the non-linearity of the data stream, we consider tree-based algorithms such as Extra Trees and XGBoost. Although XGBoost and Extra Trees display comparable performance, we prefer XGBoost for our final pipeline, since it has proven to be more scalable and displays the highest average performance of the models considered for the second stage.
With less than 9% inaccuracy in detecting malware from the first stage, we can predict with 83% certainty on what kind of data is being stolen when we use an Extra trees detector (Table 3) coupled with an XGBoost classifier (Table 4).
Furthermore, our feature selection approach maintains the aforementioned model performance with the feature set reduced to approximately 0.1% of the original set. Stepwise forward selection for malware detection (illustrated in Fig. 4) reveals that only 10 features are required to determine whether an action is malicious, achieving a minimum FOR and FPR of 0.087 and 0.019 respectively. Fig. 5 illustrates stepwise forward selection for target classification and suggests that only 16 features are required to categorize the type of data stolen. Using such a small feature set drastically reduces our data throughput and processing time.
9 Conclusion

In this paper, we propose and successfully test a two-stage machine learning model on the SherLock dataset to detect malicious actions on a smartphone and identify the type of data stolen. Using our data preprocessing and feature selection techniques, we reduce one of the largest datasets for malware classification (SherLock) to 0.1% of its initial feature set. Furthermore, we propose using the False Omission Rate and False Positive Rate in conjunction to evaluate malware detectors. With 8.7% inaccuracy in detecting malware in the first stage, our model can predict with 83% certainty what kind of data is being stolen when an Extra Trees detector is coupled with an XGBoost classifier. We exhibit our detector's robustness through the gradual decrease in the training data required from a single user to achieve the aforementioned performance as it is trained on more users and data. Finally, anomaly/outlier detection techniques fail for this task, since malicious actions do not lie outside the benign distributions as conventionally expected.
10 Future Works
Although the proposed model reduces the percentage of train data required by a user to the minimum, malware detection is still dependent on user behaviour to work. There exists a need for a truly user-independent machine learning solution for malware detection to enhance user experience and ergonomics.
Acknowledgements

We would like to thank all the talented members at Solarillion Foundation without whom this work would have taken much longer to carry out.
References

-  Raj Samani, "McAfee Mobile Threat Report Q1", 2020, URL: https://www.mcafee.com/content/dam/consumer/en-us/docs/2020-Mobile-Threat-Report.pdf
-  U.S. Securities and Exchange Commission, ”The Need for Greater Focus on the Cybersecurity Challenges Facing Small and Midsize Businesses”, 2015, URL: https://www.sec.gov/news/statement/cybersecurity-challenges-for-small-midsize-businesses.html
-  IBM, Cost of a Data Breach Report 2020, URL: https://www.ibm.com/security/digital-assets/cost-data-breach-report/
-  Schmidt, A.-D., Bye, R., Schmidt, H.-G., Clausen, J., Kiraz, O., Yuksel, K. A., … Albayrak, S. (2009). Static Analysis of Executables for Collaborative Malware Detection on Android. 2009 IEEE International Conference on Communications.
-  A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner, "Android Permissions Demystified," in Proceedings of the 18th ACM Conference on Computer and Communications Security, 2011.
-  Enck, W., Gilbert, P., Chun, B. G., Cox, L. P., Jung, J., McDaniel, P., & Sheth, A. N. (2010). TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010.
-  T. Bläsing, L. Batyuk, A. Schmidt, S. A. Camtepe and S. Albayrak, ”An Android Application Sandbox system for suspicious software detection,” 2010 5th International Conference on Malicious and Unwanted Software, Nancy, Lorraine, 2010.
-  A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, Y. Weiss, “Andromaly: a behavioral malware detection framework for Android devices,” Journal of Intelligent Information Systems, vol. 38, no. 1, pp. 161-190, 2012.
-  A. Moser, C. Kruegel and E. Kirda, ”Limits of Static Analysis for Malware Detection,” Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).
-  Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi, ”Microsoft Malware Classification Challenge”, 2018, https://arxiv.org/abs/1802.10135
-  Shanshan Wang, Chen, Z., Zhang, L., Yan, Q., Yang, B., Peng, L., & Zhongtian Jia. (2016). TrafficAV: An effective and explainable detection of mobile malware behavior using network traffic. 2016 IEEE/ACM 24th International Symposium on Quality of Service (IWQoS).
-  Chen, Zhenxiang & Yan, Qiben & Han, Hongbo & Wang, Shanshan & Peng, Lizhi & Wang, Lin & Yang, Bo. (2017). Machine Learning Based Mobile Malware Detection Using Highly Imbalanced Network Traffic. Information Sciences. 433-434.
-  A. Arora, S. Garg and S. K. Peddoju, ”Malware Detection Using Network Traffic Analysis in Android Based Mobile Devices,” 2014 Eighth International Conference on Next Generation Mobile Apps, Services and Technologies, Oxford, 2014, pp. 66-71.
-  D. Bekerman, B. Shapira, L. Rokach and A. Bar, ”Unknown malware detection using network traffic classification,” 2015 IEEE Conference on Communications and Network Security (CNS), Florence, 2015, pp. 134-142.
-  Y. Mirsky, A. Shabtai, L. Rokach, B. Shapira, and Y. Elovici, "SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research," in Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security.
-  Zheng, Yong & Srinivasan, Sridhar. (2020). Mobile App and Malware Classifications by Mobile Usage with Time Dynamics.
-  Sarah Wassermann and Pedro Casas. 2018. BIGMOMAL: Big Data Analytics for Mobile Malware Detection. In Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity (WTMC ’18). Association for Computing Machinery, New York, NY, USA, 33–39.
-  Memon, Laraib & Bawany, Narmeen & Shamsi, Jawwad. (2019). A COMPARISON OF MACHINE LEARNING TECHNIQUES FOR ANDROID MALWARE DETECTION USING APACHE SPARK. Journal of Engineering Science and Technology.
-  Linux Manual : https://man7.org/linux/man-pages/man5/proc.5.html
-  Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.
-  Han, J., Kamber, M., & Pei, J. (2012). Data Preprocessing. Data Mining, 83–124. doi:10.1016/b978-0-12-381479-1.00003-4
-  Cheng Chen, Qingmei Zhang, Qin Ma, Bin Yu, "LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion", Chemometrics and Intelligent Laboratory Systems, Volume 191, 2019, Pages 54-64, ISSN 0169-7439.