Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

05/16/2019
by Adarsh Kyadige, et al.

Machine learning (ML) for static portable executable (PE) malware detection typically takes a per-file numerical feature vector as input, with one or more target labels during training. However, much orthogonal information can be gleaned from the context in which the file was seen. In this paper, we propose using a static source of contextual information, the path of the PE file, as an auxiliary input to the classifier. While file paths are not malicious or benign in and of themselves, they provide valuable context for a malicious/benign determination. Unlike dynamic contextual information, file paths are available with little overhead and can be integrated seamlessly into a multi-view static ML detector, yielding higher detection rates at very high throughput with minimal infrastructural changes. We propose a multi-view neural network that takes feature vectors from PE file content and the corresponding file paths as inputs and outputs a detection score. To ensure a realistic evaluation, we use a dataset of approximately 10 million samples: files and file paths from user endpoints of an actual security vendor network. We then conduct an interpretability analysis via LIME modeling to check that our classifier has learned a sensible representation and to see which parts of the file path contributed most to changes in the classifier's score. We find that our model learns useful aspects of the file path for classification, while also learning artifacts from customers testing the vendor's product, e.g., by downloading a directory of malware samples each named by its hash. We prune these artifacts from our test dataset and demonstrate reductions in false negative rate of 32.3% [...] over a similar-topology, single-input model that uses only PE file content.
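
To make the multi-view idea concrete, here is a minimal, untrained sketch in plain Python. It is not the authors' architecture, which the abstract does not specify. Every name and constant in it (`encode_path`, `init_params`, `multi_view_score`, `MAX_PATH_LEN`, `VOCAB_SIZE`, the layer sizes) is a hypothetical chosen for illustration. It only shows the general shape: one branch consumes a PE feature vector, another consumes a character-encoded file path, and the two representations are concatenated before a single detection score is produced.

```python
import math
import random

MAX_PATH_LEN = 100  # assumed fixed path length; not taken from the abstract
VOCAB_SIZE = 128    # assumed ASCII vocabulary; code 0 doubles as padding/unknown


def encode_path(path: str, max_len: int = MAX_PATH_LEN) -> list[int]:
    """Map a file path to a fixed-length list of integer character codes.

    Lowercases the path, keeps its last `max_len` characters, maps
    non-ASCII characters to 0, and left-pads with 0.
    """
    tail = path.lower()[-max_len:]
    codes = [ord(c) if ord(c) < VOCAB_SIZE else 0 for c in tail]
    return [0] * (max_len - len(codes)) + codes


def init_params(pe_dim: int, hidden: int = 4, embed_dim: int = 4, seed: int = 0) -> dict:
    """Create small random weights; a real model would learn these."""
    rng = random.Random(seed)

    def small() -> float:
        return rng.uniform(-0.1, 0.1)

    return {
        "pe_w": [[small() for _ in range(pe_dim)] for _ in range(hidden)],
        "embed": [[small() for _ in range(embed_dim)] for _ in range(VOCAB_SIZE)],
        "out_w": [small() for _ in range(hidden + embed_dim)],
        "out_b": 0.0,
    }


def multi_view_score(pe_features: list[float], path_codes: list[int], params: dict) -> float:
    """Toy two-branch forward pass returning a score in (0, 1)."""
    # View 1: one ReLU layer over the PE content feature vector.
    pe_hidden = [
        max(0.0, sum(w * x for w, x in zip(row, pe_features)))
        for row in params["pe_w"]
    ]
    # View 2: mean of per-character embeddings of the encoded path.
    embed = params["embed"]
    embed_dim = len(embed[0])
    path_hidden = [
        sum(embed[c][d] for c in path_codes) / len(path_codes)
        for d in range(embed_dim)
    ]
    # Join both views and apply a logistic output unit.
    joint = pe_hidden + path_hidden
    logit = sum(w * h for w, h in zip(params["out_w"], joint)) + params["out_b"]
    return 1.0 / (1.0 + math.exp(-logit))
```

With random weights the score carries no meaning. The point is only that the path enters the model as a second input alongside the content features, so the same PE file seen at two different paths can receive two different scores.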


