Don't Pick the Cherry: An Evaluation Methodology for Android Malware Detection Methods
In evaluating detection methods, the malware research community relies on scan results obtained from online platforms such as VirusTotal. Nevertheless, given the lack of standards on how to interpret the obtained data to label apps, researchers hinge on their intuitions and adopt different labeling schemes. The dynamicity of VirusTotal's results along with adoption of different labeling schemes significantly affect the accuracies achieved by any given detection method even on the same dataset, which gives subjective views on the method's performance and hinders the comparison of different malware detection techniques. In this paper, we demonstrate the effect of varying (1) time, (2) labeling schemes, and (3) attack scenarios on the performance of an ensemble of Android repackaged malware detection methods, called dejavu, using over 30,000 real-world Android apps. Our results vividly show the impact of varying the aforementioned 3 dimensions on dejavu's performance. With such results, we encourage the adoption of a standard methodology that takes into account those 3 dimensions in evaluating newly-devised methods to detect Android (repackaged) malware.
READ FULL TEXT