KiloGrams: Very Large N-Grams for Malware Classification

08/01/2019
by Edward Raff, et al.

N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of n are tested, with n > 6 being exceedingly rare. Larger values of n are not tested due to the computational burden or the fear of overfitting. In this work, we present a method to find the top-k most frequent n-grams that is 60× faster for small n and can tackle large n ≥ 1024. Despite the unprecedented size of n considered, we show that these features still have predictive ability for malware classification tasks. More importantly, large n-grams produce features that are interpretable by malware analysts and can be used to create general-purpose signatures compatible with industry-standard tools like Yara. Furthermore, the counts of common n-grams in a file may be added as features to publicly available human-engineered features, rivaling the efficacy of professionally developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
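
To make the task concrete, the following is a minimal sketch of exact top-k byte n-gram counting and of turning the selected n-grams into per-file count features. It is a naive baseline, not the method presented in this work: it stores every distinct n-gram in memory, which is the kind of computational burden the abstract refers to for large n. The function names `top_k_ngrams` and `ngram_count_features` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Return the k most frequent byte n-grams across all inputs (exact, naive)."""
    counts: Counter = Counter()
    for data in files:
        # Slide a window of width n over the bytes; no n-gram spans two files.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


def ngram_count_features(data: bytes, vocab: List[bytes]) -> List[int]:
    """Count (overlapping) occurrences of each vocabulary n-gram in one file.

    Assumes every entry of vocab has the same length n.
    """
    if not vocab:
        return []
    n = len(vocab[0])
    local = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [local[gram] for gram in vocab]


if __name__ == "__main__":
    corpus = [b"abcabc", b"bcab"]  # toy stand-ins for file contents
    top = top_k_ngrams(corpus, n=2, k=2)
    vocab = [gram for gram, _ in top]
    print(top)
    print(ngram_count_features(b"abab", vocab))
```

The resulting count vector could be appended to an existing feature vector before training a classifier. Memory use in this sketch grows with the number of distinct n-grams, so it is practical only for small inputs or small n.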


research
08/09/2023

A Feature Set of Small Size for the PDF Malware Detection

Machine learning (ML)-based malware detection systems are becoming incre...
research
07/22/2018

Deep learning at the shallow end: Malware classification for non-domain experts

Current malware detection and classification approaches generally rely o...
research
12/14/2020

SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection

In this paper we describe the SOREL-20M (Sophos/ReversingLabs-20 Million...
research
12/10/2017

Improving Malware Detection Accuracy by Extracting Icon Information

Detecting PE malware files is now commonly approached using statistical ...
research
09/26/2018

Classification of malware based on file content and characteristics

In general, the industry of malware has come to be a market which brings...
research
01/04/2019

Network-based Analysis and Classification of Malware using Behavioral Artifacts Ordering

Using runtime execution artifacts to identify malware and its associated...
