MALIGN: Adversarially Robust Malware Family Detection using Sequence Alignment

We propose MALIGN, a novel malware family detection approach inspired by genome sequence alignment. MALIGN encodes malware using four nucleotides and then uses genome sequence alignment approaches to create a signature of a malware family based on the code fragments conserved in the family making it robust to evasion by modification and addition of content. Moreover, unlike previous approaches based on sequence alignment, our method uses a multiple whole-genome alignment tool that protects against adversarial attacks such as code insertion, deletion or modification. Our approach outperforms state-of-the-art machine learning based malware detectors and demonstrates robustness against trivial adversarial attacks. MALIGN also helps identify the techniques malware authors use to evade detection.




I Introduction

To detect the rising number of malware at scale, machine learning is necessary. Indeed, currently all commercial malware detectors use machine learning. However, one shortcoming of current static malware detectors is that they can be easily evaded by changing a malware trivially without changing the core of the malware [kolosnjaji2018adversarial, grosse2016adversarial, pe11, pe13, pe14, pe15, pe16, pe18, pe17, pe19]. Fundamentally, the adversarial attacks use one simple technique: add or modify selected content to the malware.

One simple way to design an adversarially robust malware detector is to make it rely on the core functionalities of a malware and ignore the added insignificant content. This turns out to be surprisingly hard, especially for static detectors, without increasing the false positive rate. This happens because malware authors often take a similar approach to evade detection—hide malicious content in the overlay and data section instead of the code section without changing the malware semantics.

Static detectors are one of the first and cheapest lines of defense against malware. Commercial Anti-virus systems contain more sophisticated detectors, based on dynamic and behavior analysis, which are unlikely to be fooled by adversarial malware samples evading static-only detection. However, having a robust static detector is still crucial for efficient and timely malware detection. Static detectors require fewer resources, can be easily deployed in many different environments including multi-architecture platforms and can detect a vast majority of malware samples faster than dynamic detectors.

Taking inspiration from bioinformatics, we model malware as DNA sequences or genomes, which can be used for static detection of malware. Just as DNA sequences are made of only four types of nucleotides, malware are sequences of bits, and modifications of malware mirror the accumulation of mutations in genomes during evolution. Genomes contain regions critical for the survival of the organism, such as protein-coding genes, where mutations may be lethal. Similarly, malware contain code blocks that are difficult to modify without altering their functionality and semantics. If we can express a malware in terms of these basic building blocks, our detector will be robust by design: it cannot be evaded without substantially changing the malware.

We propose MAlign, a novel malware family detection approach inspired by genome sequence alignment. Our approach first converts a family of malware files into malware nucleotide sequence files, i.e. sequences of A, C, G and T. Then we use a multiple whole-genome alignment tool to identify common alignment blocks per family. These alignment blocks work as a signature of the malware family, and a score is assigned to each of them depending on their importance. We train a classifier using the features that represent how well a block identifies with a particular family. To classify whether a new malware belongs to the family, we first compute the alignment of the new malware with the sequences representing the blocks, i.e. the signature of the family, and use it to classify the malware.

Our robustness properties come from the use of a recent multiple whole-genome alignment method that can find conserved blocks of sequences even in the presence of sequence re-ordering and minor modifications, and from estimating the degree of conservation at each location by processing the generated alignment. This prevents certain types of adversarial manipulation, such as adding extra content, changing code order, and making minor changes to the code. To evade detection, an attacker is likely to need to make substantial modifications to the code.

We evaluate MAlign on two datasets: Kaggle Microsoft Malware Classification Challenge (Big 2015) and Microsoft Machine Learning Security Evasion Competition (2020) (MLSec). In comparison with MalConv, feature fusion and CNN-based malware classifiers, our approach has higher accuracy and robustness. Moreover, sequence alignment helps to reveal the common practices of malware families, such as hiding code in non-code sections.

In summary, our main contributions are:

  • Scalable and explainable: Our approach is simple, scalable and easily explainable. The use of a recently developed multiple whole-genome alignment tool makes it applicable to large datasets such as Big 2015, with running time comparable to existing machine learning based methods, while classification using a logistic regression model on conserved-sequence-based features provides explainability.

  • High accuracy: MAlign outperforms current state-of-the-art methods on both the Big 2015 and MLSec datasets. Moreover, it has high accuracy even without a large amount of training data, unlike some deep learning based methods.

  • Robustness to adversarial attacks: MAlign finds conserved code blocks considering mismatches and assigns high scores to those. So, critical code blocks may need to be modified drastically to evade detection.

II Background

Sequence alignment is a widely studied problem in bioinformatics to find similarity among DNA, RNA or protein sequences, and to study evolutionary relationships among diverse species. It is the process of arranging sequences in such a way that regions of similarity are aligned, with gaps (denoted by ‘-’) inserted to represent insertions and deletions in sequences. An alignment of the sequences ATTGACCTGA and ATCGTGTA is shown below where the regions denoted in black, characterized by matched characters, are conserved whereas the red and blue regions denote substitutions i.e. point mutations, and insertions or deletions during the evolutionary process respectively.




In sequence alignment, matches, mismatches and insertions-deletions (in-dels) are assigned scores based on their frequencies during evolution and the goal is to find an alignment with the maximum score. The problem of finding an optimal alignment of the entire sequences (global alignment) and that of finding an optimal alignment of their sub-sequences (local alignment) can be solved by dynamic programming using the Needleman-Wunsch [needleman-wunsch] and Smith-Waterman [smith-waterman] algorithms respectively. While the algorithms can be used to align more than two sequences, the running time is exponential in the number of sequences. To address the tractability issue, a number of tools have been developed [clustal, mafft, muscle] that use heuristics to solve the multiple sequence alignment (MSA) problem.
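As an illustration of the dynamic-programming formulation, the following minimal Python sketch computes a global alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; they are not taken from the paper or from any of the cited tools.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring parameters are illustrative assumptions, not the paper's values.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# The two example sequences from the text above.
score = global_alignment_score("ATTGACCTGA", "ATCGTGTA")
```

The table has one row per character of the first sequence and one column per character of the second, which is why extending this pairwise recurrence to many sequences becomes intractable and heuristic MSA tools are used instead.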

However, in addition to point mutations and short insertions-deletions, large scale genome rearrangement events take place during evolution. Such genome rearrangement events include reversal of a genomic segment (inversion), shuffling of order of genomic segments (transposition or translocation), duplication and deletion of segments, etc. Although the aforementioned tools are unable to deal with genome rearrangements, methods such as MUMmer [mummer] can perform alignment of two sequences in presence of rearrangements whereas Mauve [mauve], Cactus [cactus], etc. can handle multiple sequences.

Recently, Armstrong et al. have developed Progressive Cactus [progressivecactus], and Minkin & Medvedev have developed SibeliaZ [sibeliaz], both of which can align hundreds to thousands of whole-genome sequences in the presence of rearrangements. The tools identify similar sub-sequences in the sequences from different species to create blocks of rearrangement-free sequences, and then perform a multiple sequence alignment of the sequences in each block.

Since adversaries can modify malware relatively easily by changing the order of blocks of code without altering the functionality of the malware, it is important that the tool used to align malware sequences is robust to such rearrangements in code. Here, we use SibeliaZ to align malware sequences to identify conserved blocks of code and calculate a conservation score of the blocks for malware detection and classification. It is worth noting that the blocks of code identified need not be fully conserved, i.e. there can be modifications, insertions and deletions of a small number of instructions within the blocks, making the approach robust to adversarial attacks.

III Related Work

To counter and detect the increasing amount of malware, several methods and techniques have been developed over the years. In the early days, Wressnegger et al. [wressnegger2017automatically] and Zakeri et al. [zakeri2015static] proposed a signature based approach using static analysis. Later, a dynamic approach, malware detection by analysing malware behavior, was proposed by Martignoni et al. [martignoni2008layered] and Willems et al. [willems2007toward]. In recent times, machine learning based techniques are mostly being used to classify malware. Schultz et al. [schultz2000data] first proposed a data mining technique for malware detection using three different types of static features. Subsequently, Nataraj et al. [nataraj2011malware] proposed a malware classification approach based on image processing techniques by converting the bytes files to image files. Later, Kalash et al. [kalash2018malware] improved on [nataraj2011malware] by developing M-CNN using malware images. Besides CNN, RNN has also been used for malware analysis. [shahzad2011accurate] and [lu2019malware] proposed techniques with LSTM using opcode sequences of malware. Santos et al. [santos2013opem] proposed a hybrid technique by integrating both static and dynamic analysis. Yan et al. [malnet] developed MalNet using an ensemble method on CNN, LSTM and extracted metadata features, while Ahmadi et al. [ahmadi2016novel] extracted and selected features of malware depending on their importance and applied feature fusion on them. Recently, Raff et al. [malconv] developed a state-of-the-art technique, MalConv, using only the raw byte sequence as the input to a neural network.

Prior work proposed two main ways to improve the adversarial robustness of malware detectors: adversarial training, and robustness by design. Adversarial training, where a malware detector is trained with adversarial examples, is one of the most widely used approaches to improve adversarial robustness [bai2021recent]. In the malware domain, several works demonstrated that adversarial training can improve the robustness significantly without reducing the accuracy on the original samples [grosse2017adversarial, zhang2021enhanced, al2018adversarial]. Robustness by design approaches build classifiers to eliminate certain classes of adversarial attacks. Certified or provable robustness is a robustness by design approach that trains classifiers with local robustness properties that can provably eliminate classes of evasion attacks [cohen2019certified]. In the malware domain, Chen et al. [chen2020training] proposed learning PDF malware detectors with verifiable robustness properties. Incer et al. [incer2018adversarially] trained an XGBoost based malware detector with the monotonicity property, which ensures that an adversary cannot decrease the classification score by adding extra content. The first method, adversarial training, relies on the availability of enough adversarial samples, which may not always be the case in the fast-changing malware world. Our approach falls under the second category, robustness by design. Although our approach does not provide any provable robustness guarantees, it increases the cost for an adversary by eliminating the possibility of trivial attacks.

In the past, sequence alignment based approaches have been used for malware analysis by a number of researchers [chen2012multiple, narayanan2012effects, naidu2014further, kirat2015malgene, naidu2016needleman, cho2016malware, kim2019improvement]. Chen et al. used multiple sequence alignment to align computer viral and worm codes of variable lengths to identify invariant regions [chen2012multiple]. This approach was subsequently enhanced in [narayanan2012effects, naidu2014further, naidu2016needleman]. However, none of these methods address the issue that blocks of code can be shuffled without affecting malware behaviour.

Sequence alignment has also been applied on system call sequences of malware to extract evasion signatures and cluster samples [kirat2015malgene], classify malware families [cho2016malware], and for malware detection, classification and visualization [kim2019improvement]. While it is more difficult for malware developers to shuffle API calls without changing the behaviour of malware, these approaches require access to API call sequences of malware and are not suitable in all circumstances.

Drew et al. [drew2016polymorphic] utilized another approach developed by the bioinformatics and computational biology community for malware classification, namely that for gene or sequence classification. The method is based on extracting short words, i.e. $k$-mers, from sequences and calculating similarity between sequences based on the set of words present in them. Although the method is efficient, it does not fully utilize the information provided by long stretches of conserved regions in malware, and is not suitable for identifying critical code blocks in malware.
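To make the word-based idea concrete, the sketch below extracts the set of $k$-mers from two sequences and compares the sets. The use of Jaccard similarity, the default $k$, and the function names are assumptions for illustration only; the text above does not specify which similarity measure the cited method uses.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(s1, s2, k=3):
    """Jaccard similarity between the k-mer sets of two sequences.

    Jaccard is an illustrative choice, not necessarily the measure
    used in the cited work.
    """
    a, b = kmers(s1, k), kmers(s2, k)
    if not a and not b:
        return 1.0  # both sequences are shorter than k; treat as identical
    return len(a & b) / len(a | b)


similarity = kmer_jaccard("ACGTACGT", "ACGTTCGT", k=3)
```

Because only the set of words is kept, the position and extent of long conserved stretches are discarded, which is the limitation noted above.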

IV Methods

Fig. 1: Overview of MAlign. (1) The malware bytes files (executables) from malwares of a particular family are first converted to nucleotide sequence files. (2) Then the malware nucleotide sequences are aligned using a multiple sequence alignment tool SibeliaZ. It first identifies similar sequences in different malwares to form blocks. Highly similar sequences (colored sequences) can be in different order in different files. The sequences in each block are then aligned. (3) The aligned sequences in each block are used to construct consensus sequences and conservation scores are calculated for each conserved block. (4) Then two sets of sequences - one corresponding to the malware family of interest and the other corresponding to non-malwares or malwares from other families are aligned to the consensus sequences and the degrees of conservation of each conserved block in the training sequences are estimated. (5) Finally a machine learning model is learnt to classify sequences based on the alignment scores of the sequences with the blocks. To classify new instances, sequences are aligned to the consensus sequences of the blocks and alignments are scored. The scores are then used as features for the class prediction.


To classify or detect known/unknown malwares and their variants, in this paper, we propose a malware classification or detection system based on multiple whole-genome alignment. The basic building block of the method is a binary classification system that can predict whether an instance belongs to a particular malware family or not. The input to this binary classifier is a training set consisting of positive examples, i.e. malwares from a particular family, and negative examples, which may be non-malwares or malwares from other families. The system can be extended to malware detection by creating a binary classifier for each malware family. The instances that are predicted to be negative by all these classifiers can then be treated as benign.

Input: Training set $X = M \cup N$, where $M$: byte files from malwares of a family, $N$: byte files from non-malwares or malwares from other families, and Test set $X'$
Output: Labels for $X'$
Convert $X$ to nucleotide sequence files
Perform multiple sequence alignment and identify conserved blocks ($B$)
forall blocks $b \in B$ do
        Get consensus sequence ($C_b$)
        for $i = 1, \dots, L_b$ do
               Calculate $S_b(c, i)$ where $c \in \{A, C, G, T\}$
forall sequences $x \in X$ do
        $F_x$ := Feature vector of $x$
        forall blocks $b \in B$ do
               Get alignments ($x$, $C_b$)
               Calculate alignment score
               Get alignment count
Learn classification model from $(F, y)$ where $F$: feature matrix, $y$: labels
Convert $X'$ to nucleotide sequence files
forall sequences $x' \in X'$ do
        $F_{x'}$ := Feature vector of $x'$
        forall blocks $b \in B$ do
               Get alignments ($x'$, $C_b$)
               Calculate alignment score
               Get alignment count
Algorithm 1 MAlign

The main steps of our proposed method are shown in Algorithm 1, while an illustration is provided in Figure 1. We start with the given malware bytes files, i.e. executable files, and convert them to malware nucleotide sequence files, i.e. sequences of A, C, G and T. Then these nucleotide sequence files are aligned using a multiple whole-genome alignment tool (SibeliaZ), which outputs alignment blocks that are common among a number of these files. The sequences in each alignment block are merged to construct a consensus sequence. In this step, a conservation score for each coordinate of the consensus sequence is also generated. The consensus sequences are aligned with each sample from a balanced train set with positive and negative samples with respect to the malware family of interest, and an alignment score is calculated for every sample for each conserved block. These scores are then used as input to a machine learning model which learns a classifier to distinguish between malwares belonging to the family, and malwares from other families as well as non-malwares. To classify a new sample, the sequence is aligned with the consensus sequences and alignment scores for the new instance are generated similarly, and the scores are passed to our classifier to classify the new sample. Each of these steps is described in more detail below.

Bytes file to nucleotide sequence file conversion

First the binary executable or bytes files are converted to nucleotide sequence files containing sequences of A, C, G and T. The conversion is performed so that existing whole-genome alignment tools can be used. The conversion from the byte code to nucleotide sequence is done by converting each pair of bits to a nucleotide according to Table I.

Bit pair   Nucleotide
00         A
01         C
10         G
11         T
TABLE I: Bytes to Nucleotide Mapping

In some malware datasets such as the Kaggle Microsoft Malware Classification Challenge (Big 2015) dataset [microsoftdataset2015], the provided bytes files contain “??” and long stretches of “00” in some cases which do not preserve any significant value or meaning. These are removed before the conversion to nucleotide sequences.
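The conversion can be sketched in Python as follows. The mapping follows Table I; everything else is an assumption made for illustration: the function names, the choice to emit the most significant bit pair of each byte first, and the `max_zero_run` threshold that decides what counts as a "long stretch" of "00" (the text does not give a threshold).

```python
NUCLEOTIDES = "ACGT"  # index is the 2-bit value: 00->A, 01->C, 10->G, 11->T (Table I)


def byte_to_nucleotides(value):
    """Encode one byte as four nucleotides, most significant bit pair first (assumed order)."""
    return "".join(NUCLEOTIDES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def hex_tokens_to_nucleotides(tokens, max_zero_run=16):
    """Convert hex byte tokens (e.g. "4D", "00", "??") to a nucleotide string.

    "??" tokens are dropped, and runs of "00" longer than max_zero_run
    (a hypothetical threshold) are removed before conversion.
    """
    kept, run = [], []

    def flush():
        # Keep short zero runs, discard long ones.
        if len(run) <= max_zero_run:
            kept.extend(run)
        run.clear()

    for tok in tokens:
        if tok == "??":
            continue
        if tok == "00":
            run.append(tok)
        else:
            flush()
            kept.append(tok)
    flush()
    return "".join(byte_to_nucleotides(int(t, 16)) for t in kept)


sequence = hex_tokens_to_nucleotides(["1B", "??", "FF"])
```

Each byte yields four nucleotides, so a file of n retained bytes becomes a sequence of length 4n.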

Multiple alignment of malware nucleotide sequences

The next step is to align the malware nucleotide sequences. In this paper, we use the multiple whole-genome alignment tool SibeliaZ [sibeliaz]. SibeliaZ performs whole-genome alignment of multiple sequences and constructs locally co-linear blocks. Figure 2 illustrates the alignment of three different malware nucleotide sequence files from the same family. The sequences share blocks of similar sequences, shown as dashed lines of the same color. They may also contain sequences unique to each file, indicated by lines with different colors.

During the block construction process:

  • The order of the shared blocks may differ in different sequences and the blocks may not be shared across all sequences. This helps MAlign to be robust to the evasion attempts, such as shuffling blocks of code.

  • The shared blocks may not be fully conserved i.e. there may be mismatches of characters to some extent which means minor alteration, modification to the code will not prevent detection of blocks.

These properties of multiple whole-genome alignment bolster the robustness of MAlign against many obfuscation techniques.

SibeliaZ first identifies the shared linear blocks and then performs multiple sequence alignment of locally co-linear blocks. The block coordinates are output in GFF format and the alignment is in MAF format. The GFF format is a file format used for describing genes and other features of DNA, RNA and protein sequences. The multiple alignment format (MAF) is a format for describing multiple alignments in a way that is easy to parse and read. In our case, this format stores multiple alignment blocks at the byte code level among malwares. We generate such an MAF file for each malware family using the training samples and identify the blocks of codes that are highly conserved across the malware family.
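As a rough illustration of how alignment blocks can be read from MAF text, the sketch below parses the standard layout in which an `a` line opens a block and each `s` line carries the fields source, start, size, strand, source size and aligned text. The parser, its function name and the toy input are assumptions for illustration; they are not part of SibeliaZ or of the paper's pipeline.

```python
def parse_maf(text):
    """Parse MAF text into blocks.

    Returns a list of blocks; each block is a list of
    (source, start, strand, aligned_text) tuples, one per "s" line.
    """
    blocks, current = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if fields[0] == "a":
            current = []  # an "a" line starts a new alignment block
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            _, src, start, size, strand, src_size, aligned = fields[:7]
            current.append((src, int(start), strand, aligned))
    return blocks


# Toy input with hypothetical sequence names.
example = """##maf version=1
a
s m1.seq 10 6 + 100 ACGT-CG
s m2.seq 40 7 + 90 ACGTTCG

a
s m1.seq 50 4 + 100 TTGA
s m3.seq 5 4 - 80 TTGA
"""
blocks = parse_maf(example)
```

Each parsed block lists the aligned rows, with `-` marking gaps, that the consensus and conservation-score steps operate on.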

Consensus sequence and score generation

We process all sequences of the alignment blocks in the MAF file from the previous step and generate a new sequence for each block, which is known as the consensus sequence [consensus]. First, we scan the lengths of all sequences in a block and take the maximum as the length of the consensus sequence. Then we traverse the coordinates of every sequence and find the nucleotide with the highest occurrence at each coordinate. We put the most frequently occurring nucleotide at the corresponding index of the consensus sequence, i.e. the characters of the consensus sequence $C_b$ of block $b$ are given by

$$C_b[i] = \arg\max_{c \in \{A, C, G, T\}} n_b(c, i), \quad 1 \le i \le L_b,$$

where $n_b(c, i)$ is the count of the character $c$ at the $i$-th position in block $b$ and $L_b$ is the length of block $b$.

In Figure 2, consensus sequence generation of alignment block for Block-1 is shown in detail. Below the alignment block, the corresponding sequence logo is shown. The height of the individual letters in sequence logo represents how common the corresponding letter is at that particular coordinate of the alignment.

Fig. 2: Details of consensus sequence, conservation score, and alignment score generation from alignment blocks.

We thus construct the consensus sequence by taking the letter (nucleotide) with highest frequency for each coordinate. Similarly, the consensus sequences for all blocks are generated and are stored in a file in FASTA format with a unique id. These consensus sequences are the conserved part of the malware family which can be considered as the signature or common pattern of that family. The files are used in subsequent steps to classify malwares.
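The column-wise majority vote described above can be sketched as follows. The handling of gap characters (ignored when counting, with gap-only columns skipped) and the tie-breaking rule are assumptions for illustration, since the text does not specify them.

```python
from collections import Counter


def consensus(block):
    """Build a consensus string from a block of aligned rows.

    block: list of strings over A, C, G, T and '-' (gap).
    Gaps are ignored when counting (an assumption).
    """
    length = max(len(row) for row in block)  # consensus length = longest row
    out = []
    for i in range(length):
        counts = Counter(
            row[i] for row in block if i < len(row) and row[i] in "ACGT"
        )
        if counts:
            # most_common(1) returns the most frequent nucleotide;
            # ties go to the nucleotide seen first in the column.
            out.append(counts.most_common(1)[0][0])
    return "".join(out)


cons = consensus(["ACGT", "ACGA", "TCGA"])
```

Writing each resulting string under a unique id in a FASTA file gives the family signature that later samples are aligned against.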

In addition to the consensus sequence, we calculate conservation scores for the blocks. In bioinformatics, a conservation score is used when evaluating sites in a multiple sequence alignment, in order to identify residues critical for structure or function. It is calculated per base, indicating how many species in a given multiple alignment match at each locus. In the malware world, the conservation score can indicate the significance or importance of a code segment in a malware family. The code segments responsible for a malware's functionality will have high conservation scores compared to segments that are not frequent or conserved across malware files.

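In Figure 2, the height of the bars of conservation score indicates the degree of conservation at the corresponding position. For each coordinate, we store the score for each of the four nucleotides, which is given by the occurrence ratio of that nucleotide at that coordinate. So, the conservation score at index $i$ of alignment block $b$ for nucleotide $c$ is given by

$$S_b(c, i) = \frac{n_b(c, i)}{N_b},$$

where $n_b(c, i)$ is the count of nucleotide $c$ at the $i$-th position in block $b$ and $N_b$ is the number of sequences in block $b$.

For example, in Figure 2, Block-1 has 3 sequences in total, so $N_b = 3$. At an index where all 3 sequences contain the same nucleotide, the score of that nucleotide is $3/3 = 1$.

At an index where the block contains 2 copies of one nucleotide and 1 copy of another, their scores are $2/3$ and $1/3$ respectively.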
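The per-coordinate scores can be computed with a few lines of Python. This is a minimal sketch under the assumption that a block is given as a list of equal-role aligned rows; the function name and the toy block are hypothetical.

```python
def conservation_scores(block):
    """Return, for each coordinate, the occurrence ratio of A, C, G and T.

    block: list of aligned rows (strings). The ratio divides by the
    number of sequences in the block, as described in the text.
    """
    n_sequences = len(block)
    length = max(len(row) for row in block)
    scores = []
    for i in range(length):
        column = [row[i] for row in block if i < len(row)]
        scores.append({c: column.count(c) / n_sequences for c in "ACGT"})
    return scores


# Toy block with 3 sequences (not the block from Figure 2).
scores = conservation_scores(["ACG", "ACT", "ACT"])
```

In this toy block the first two coordinates are fully conserved (ratio 1 for one nucleotide), while the last coordinate splits between two nucleotides with ratios 2/3 and 1/3.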

Alignment with consensus sequences

Once the consensus sequence and the conservation scores are generated, we take a training set for each malware family. In the training set, the positive examples are samples from that malware family and the negative examples are non-malwares or malwares from other families. All samples from the training set are aligned to the consensus sequences of the corresponding family to get the aligned blocks for each sample. Using the previously generated conservation score, we calculate new scores called alignment scores for each block for all samples which will be used as features.

In Figure 1, the green and red lines indicate the positive and negative samples in the training set respectively. These samples are then aligned with the consensus sequences using the alignment tool SibeliaZ, which outputs alignments for each sample. An example of alignment score calculation is shown in Figure 2. Malware X-1 and X-2 are positive and negative samples respectively. X-1 has three sequences aligned with the consensus sequence (shown in purple) whereas X-2 has only one. The sum of the scores of all aligned sequences is the score for the corresponding block of that sample. As an example, for the total score of sample X-1, we sum the scores of its 3 aligned sequences. Each aligned sequence's score is the sum of the scores of all its coordinates.

The aligned sequence score is then multiplied by the number of sequences that constructed the corresponding block, since the higher the number of sequences that generated the block, the more conserved the sequence is across the instances from that family. In Figure 2, the score of sample X-1's first aligned sequence is obtained by adding the scores of all its coordinates. Since the corresponding consensus sequence was generated from 3 sequences, this sum is multiplied by 3 to give the final score for the first aligned sequence. Finally, the total alignment score for the block is calculated by adding the scores of all 3 aligned sequences.

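In general, the alignment score of a sample $x$ for consensus sequence $C_b$ of the alignment block $b$ is given by

$$AS(x, C_b) = \sum_{a \in A_{x,b}} N_b \sum_{j=1}^{|a|} S_b(a_j, p_a(j)),$$

where $A_{x,b}$ is the set of sequences from sample $x$ that got aligned with $C_b$, $a_j$ is the $j$-th nucleotide of the sequence $a$, $p_a(j)$ is the index of $C_b$ where $a_j$ was aligned, $N_b$ is the number of sequences in block $b$, and $S_b(c, i)$ is the conservation score of nucleotide $c$ at index $i$ of block $b$.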

Along with this score, we also store the total number of times the consensus sequence of a block gets aligned with the sample, i.e. the alignment count. In Figure 2, the consensus sequence was aligned with sample X-1 three times. Both the number of occurrences and the total alignment score for the consensus sequence of each block for a malware family are used as features for the subsequent classification, resulting in $2m$ features if the multiple sequence alignment of a malware family has $m$ aligned blocks.
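The two per-block features can be sketched as below. The representation of an alignment as a `(sequence, start)` pair with no gaps is a simplifying assumption for illustration, as are the function and parameter names; real alignments may contain gaps that require an explicit position mapping.

```python
def alignment_features(alignments, scores, n_block_sequences):
    """Return (alignment score, alignment count) for one sample and one block.

    alignments: list of (sequence, start) pairs, where `start` is the
        consensus index aligned to the first nucleotide of `sequence`
        (ungapped alignment, an assumption).
    scores: per-coordinate conservation scores, a list of dicts mapping
        nucleotide -> occurrence ratio.
    n_block_sequences: number of sequences that formed the block.
    """
    total = 0.0
    for seq, start in alignments:
        # Sum the conservation score of each aligned nucleotide ...
        per_sequence = sum(
            scores[start + j].get(nuc, 0.0) for j, nuc in enumerate(seq)
        )
        # ... and weight by the number of sequences that built the block.
        total += n_block_sequences * per_sequence
    return total, len(alignments)


# Toy conservation scores for a block of length 3 built from 3 sequences.
toy_scores = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.5},
]
score, count = alignment_features([("ACG", 0), ("CT", 1)], toy_scores, 3)
```

Concatenating the score and count over all blocks of a family gives the feature vector of a sample, two entries per block.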


Finally, we learn machine learning models for each malware family to classify malwares. The scores and number of alignments calculated as mentioned above are used as the features in our classifiers.

We experimented with a number of machine learning models including logistic regression, support vector machines (SVM), decision tree and deep learning. Since the results did not vary significantly across models (see Table III in Results), we use logistic regression as our primary model because of its simplicity and interpretability.

After the training phase, we get classifiers that can be used to classify or detect new instances as shown in Figure 1. The scores of the train and the test examples are calculated in the same way. The scores for new instances are passed to the classifiers to classify them into positive and negative instances. If a new sample is classified as negative by classifiers for all families, it can be considered as a benign sample.
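The decision rule for combining the per-family classifiers can be sketched as follows. The interface is hypothetical: each classifier is modelled as a callable that takes the sample's feature vector for that family and returns True for a positive prediction.

```python
def detect(features_by_family, classifiers):
    """Combine per-family binary classifiers into a detection decision.

    features_by_family: dict family -> feature vector of the sample,
        computed against that family's consensus sequences.
    classifiers: dict family -> callable(feature_vector) -> bool
        (hypothetical interface).
    Returns ("malware", matched_families) or ("benign", []).
    """
    matched = [
        family
        for family, clf in classifiers.items()
        if clf(features_by_family[family])
    ]
    label = "malware" if matched else "benign"
    return label, matched


# Toy classifiers that threshold the first feature (illustrative only).
toy_classifiers = {
    "Ramnit": lambda f: f[0] > 1.0,
    "Vundo": lambda f: f[0] > 1.0,
}
result = detect({"Ramnit": [2.0], "Vundo": [0.0]}, toy_classifiers)
```

A sample is reported as benign only when every family classifier returns a negative prediction, matching the rule stated above.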

V Results

In the following sections, we first discuss the datasets used in this paper and subsequently present the results on these datasets.


The Kaggle Microsoft Malware Classification Challenge (Big 2015)

The Kaggle Microsoft Malware Classification Challenge (Big 2015) [microsoftdataset2015] aimed to organize polymorphic malwares into 9 separate classes of malicious programs at a high level (see Table II). This challenge simulates the file input data processed on over 160 million computers by Microsoft’s real-time anti-malware detection products inspecting over 700 million computers per month.

Family Name      No. of Train Samples   Type
Ramnit           1541                   Worm
Lollipop         2478                   Adware
Kelihos_ver3     2942                   Backdoor
Vundo            475                    Trojan
Simda            42                     Backdoor
Tracur           751                    TrojanDownloader
Kelihos_ver1     398                    Backdoor
Obfuscator.ACY   1228                   Any kind of obfuscated malware
Gatak            1013                   Backdoor
TABLE II: Malware Families in the Kaggle Dataset

Microsoft provided almost half a terabyte of training and classification input data when uncompressed. The data included:

  1. Binary Files: 10,868 training files containing the raw hexadecimal representation of the file’s binary content.

  2. Assembly Files: 10,868 training files containing data extracted by the Interactive Disassembler (IDA) tool. This information includes assembly command sequences, function calls and more.

  3. Training Labels: Each training file name is a MD5 hash of the actual program. Each MD5 hash and the malware class it maps to are stored in the training label file.

From this, we constructed 9 balanced datasets for binary classification consisting of equal number of positive and negative samples for each of the 9 malware families. In each dataset, the positive examples are all the samples from the corresponding family and the negative examples were chosen by randomly sampling from the 8 other families in the dataset. Then 20% of each dataset was set aside as the test sets while the remaining 80% was used as the training sets. For the machine learning approaches that require hyper-parameter selection, the 80% was further split into training (60%) and validation sets (20%).
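The dataset construction for one family can be sketched as below. This is a simplified illustration: the function name, the fixed seed and the unstratified shuffle-then-cut split are assumptions, and the further 60%/20% training/validation split is omitted.

```python
import random


def balanced_split(samples_by_family, family, test_fraction=0.2, seed=0):
    """Build a balanced binary dataset for `family` and split it.

    Positives are all samples of `family`; an equal number of negatives
    is drawn at random from the other families. Returns (train, test)
    as lists of (sample, label) pairs with label 1 for positives.
    """
    rng = random.Random(seed)
    positives = list(samples_by_family[family])
    others = [
        s
        for fam, items in samples_by_family.items()
        if fam != family
        for s in items
    ]
    negatives = rng.sample(others, min(len(positives), len(others)))
    data = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(data)
    n_test = round(test_fraction * len(data))
    return data[n_test:], data[:n_test]


# Toy sample ids (hypothetical).
toy = {
    "A": [f"a{i}" for i in range(10)],
    "B": [f"b{i}" for i in range(20)],
    "C": [f"c{i}" for i in range(20)],
}
train, test = balanced_split(toy, "A")
```

Repeating this for each family yields one balanced binary dataset per family, as described above.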

                 Train Accuracy                               Test Accuracy
Family Name      Logistic regression  Decision tree  SVM      Logistic regression  Decision tree  SVM
Ramnit           99.91                99.91          99.91    99.64                99.82          99.64
Kelihos_ver3     99.83                99.94          99.81    99.27                99.27          99.27
Vundo            100                  100            100      97.4                 97.4           97.4
Simda            100                  100            97.92    84.62                84.62          84.62
Tracur           100                  100            99.5     97.3                 94.6           96.3
Kelihos_ver1     100                  100            99.5     96.7                 98.9           98.9
Obfuscator.ACY   100                  100            100      95.9                 92.7           96
Gatak            99.17                99.17          99.17    96.2                 96.2           96.2
Overall          99.82                99.86          99.74    97.99                97.42          98.02
TABLE III: Classification Accuracy on Kaggle Microsoft Malware (BIG 2015) Dataset for Different Machine Learning Models

Microsoft Machine Learning Security Evasion Competition (2020) (MLSec) Dataset

While the Kaggle Microsoft Malware Classification Challenge (Big 2015) dataset is large and widely studied, malware families often have only a few samples, especially when they first emerge. It is therefore important to assess the performance of the methods on datasets with a small number of instances per family. So, we also applied our method to the Microsoft Machine Learning Security Evasion Competition (2020) [mlsec] (MLSec) dataset. Here we used the dataset from the ‘Defender Challenge’, which consists of malware byte code and their variants. The defenders’ challenge was to create a model that can defend against evasive variants created by the attackers. The dataset contains 49 original malwares, with unique ids from ‘01’ to ‘49’, along with their evasive variants submitted by the attackers. Each malware has between 5 and 20 evasive variants, 12 on average. As with the Kaggle Microsoft Malware dataset, 49 datasets were created, which were then split into training, validation and test sets. During the split, we always kept the original sample in the train set and the variants in the test set for each family, so that the test set can be considered an evolution of the train set.

Evaluation of machine learning algorithms

First, we assess the performance of various machine learning approaches on the Kaggle Microsoft Malware (Big 2015) dataset. The binary files of this dataset were converted to nucleotide sequence files and labelled using the ‘Training Labels’ data. We then generated common alignment blocks using SibeliaZ and constructed the consensus sequences as discussed in Methods. For each consensus sequence, we generated conservation scores from the nucleotide frequencies, which were then used as features in the machine learning models.
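To make the two preprocessing ideas concrete, the sketch below encodes bytes as nucleotides and scores each column of an alignment block by nucleotide frequency. The exact encoding and scoring are defined in Methods and may differ; the 2-bit mapping (00 to A, 01 to C, 10 to G, 11 to T) and the "share of the most frequent nucleotide" score used here are assumptions for illustration only.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # assumed 2-bit code: 00->A, 01->C, 10->G, 11->T

def bytes_to_nucleotides(data: bytes) -> str:
    """Encode every byte as four nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)

def conservation_scores(aligned_rows):
    """Per-column share of the most frequent nucleotide; gaps ('-') never count as matches."""
    scores = []
    for column in zip(*aligned_rows):
        counts = Counter(c for c in column if c != "-")
        top = max(counts.values()) if counts else 0
        scores.append(top / len(column))
    return scores

encoded = bytes_to_nucleotides(b"\x1b\xff")
scores = conservation_scores(["ACGT", "ACGA", "AC-T"])
```

In this toy block the first two columns are fully conserved, while the last two are conserved in two of three rows.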

We experimented with logistic regression, decision trees and support vector machines (SVM). Table III shows the train and test accuracy for an 80%-20% train-test split on the Kaggle Microsoft Malware Classification Challenge (Big 2015) dataset. We were unable to align instances of the ‘Lollipop’ family with SibeliaZ, possibly due to limited computational resources; hence, the family was removed from our analysis. The algorithms show similar accuracy, so we selected logistic regression for subsequent experiments because of its simplicity and interpretability. We experimented with the hyper-parameters of logistic regression and found that it gave the best results with the ‘elasticnet’ penalty, C=0.05 (regularization factor), the ‘saga’ solver and l1_ratio=0.5.
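The reported hyper-parameter names match scikit-learn's `LogisticRegression`, so a configuration along the following lines is plausible. This is a sketch under that assumption: the feature matrix and labels are random placeholders rather than real conservation scores, and `max_iter` is not stated in the paper. Note that in scikit-learn `C` is the inverse of the regularization strength, so C=0.05 corresponds to fairly strong regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder features: one score per consensus block (not real data).
X = rng.random((200, 10))
# Placeholder labels: 1 = sample belongs to the family, 0 = it does not.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = LogisticRegression(
    penalty="elasticnet",  # mix of L1 and L2 penalties
    C=0.05,                # inverse regularization strength
    solver="saga",         # solver that supports the elastic-net penalty
    l1_ratio=0.5,          # equal weight on the L1 and L2 terms
    max_iter=5000,         # assumption: not given in the paper
)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```

The L1 component drives the weights of uninformative blocks towards zero, which suits the later step of reading off the most heavily weighted blocks.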

Comparison with existing approaches

Next, we compare the performance of MAlign with that of state-of-the-art approaches on the Kaggle Microsoft Malware (Big 2015) dataset: MalConv [malconv] (a neural network based approach using the raw byte sequence), Ahmadi et al. [ahmadi2016novel] (a feature fusion based approach using byte and assembly files), and M-CNN [kalash2018malware] (a convolutional neural network (CNN) based approach relying on conversion to images). It is worth noting that models with multiclass loss as low as 0.00283 have been reported for this specific dataset. However, we compare with MalConv and M-CNN, as they have been successfully applied to many different datasets. We compare MAlign with Ahmadi et al.’s Feature-Fusion method because, to our knowledge, it came closest to the accuracy of the winning team of the Kaggle competition.

We also implement a deep learning based approach that classifies malware using the alignment scores calculated by MAlign. The architecture of this model, together with the architectures of MalConv and M-CNN, is shown in Figure 3.

Family Name       MAlign (Logistic Regression)  MAlign (Deep Learning)  Feature-Fusion  MalConv  M-CNN
Ramnit            99.91                         99.58                   100             98.39    97.64
Kelihos_ver3      99.83                         99.86                   100             99.89    99.91
Vundo             100                           98.26                   100             99.4     99.26
Simda             100                           94.44                   100             100      100
Tracur            100                           98.18                   100             98.74    98.43
Kelihos_ver1      100                           98.92                   100             98.54    99.52
Obfuscator.ACY    100                           98.66                   100             97.33    99.70
Gatak             99.17                         99.2                    100             99.15    92.69
Overall           99.82                         99.24                   100             98.96    98.4
TABLE IV: Performance of different models on Train-set of Kaggle Microsoft Malware Classification Challenge Dataset (BIG 2015)
Family Name       MAlign (Logistic Regression)  MAlign (Deep Learning)  Feature-Fusion  MalConv  M-CNN
Ramnit            99.64                         99.46                   98.7            95.66    88.44
Kelihos_ver3      99.27                         99.82                   99.18           100      99.72
Vundo             97.4                          98.70                   95.79           94.89    97.21
Simda             84.62                         84.62                   76.47           52.94    62.5
Tracur            97.3                          98.2                    98.34           93.91    94.55
Kelihos_ver1      96.7                          95.7                    98.75           96.08    94.67
Obfuscator.ACY    95.9                          96.39                   98.98           94.42    91.74
Gatak             96.2                          98.37                   98.03           98.67    88.68
Overall           97.99                         98.59                   98.52           96.95    94.10
TABLE V: Performance of different models on Test-set of Kaggle Microsoft Malware Classification Challenge Dataset (BIG 2015)
Fig. 3: Architectures of (a) deep learning model on alignment scores, (b) the MalConv model [malconv], and (c) the M-CNN model [kalash2018malware].

The training and test accuracy of MAlign with logistic regression and deep learning, along with those of Feature-Fusion, MalConv and M-CNN, are shown in Tables IV and V. Table V shows that MAlign (Deep Learning) has the best overall test accuracy, 98.59%, narrowly ahead of Feature-Fusion (98.52%). Table IV shows that, on the train set, MAlign has better accuracy than MalConv and M-CNN, and the difference from Feature-Fusion is negligible.

Applicability with limited amount of data and features

Although deep learning based approaches have been widely applied for malware classification and detection, they require extensive amounts of data for training and tend to overfit in its absence. Tables IV and V show that Feature-Fusion, MalConv and M-CNN perform well for most malware families. However, we observe that for Type 5 (Simda), which has only 42 samples, the test accuracies of Feature-Fusion, MalConv and M-CNN are only 76.47%, 52.94% and 62.5% respectively, whereas MAlign has 84.62% test accuracy. The other methods reach a training accuracy of 100% on this family, which indicates overfitting. A similar observation can be made for Type 4 (Vundo), which has the second smallest number of training samples.

Moreover, MAlign needs only the raw byte sequence, whereas the Feature-Fusion method needs the byte sequence, the assembly file, addresses for the byte sequence, section information from the PE, etc. So, even when only the binary executable files are available, MAlign can be applied directly, whereas approaches like Feature-Fusion [ahmadi2016novel] that need various features must rely on other tools (such as IDA) to work properly.

Model                          Train Accuracy  Test Accuracy
MAlign (Logistic Regression)   98.18           80.42
MAlign (Deep Learning)         97.72           80.00
MalConv                        98.24           79.22
M-CNN                          96.99           71.09
TABLE VI: Performance of models on MLSec dataset for train-test split
Model                    Train Accuracy  Validation Accuracy  Test Accuracy
MAlign (Deep Learning)   97.53           81.67                79.58
MalConv                  95.39           81.22                64.45
M-CNN                    98.4            74.29                73.83
TABLE VII: Performance of models on MLSec dataset for train-validation-test split

We also compare the performance of the methods on the MLSec dataset (Microsoft Machine Learning Security Evasion Competition (2020) [mlsec]). Since this dataset contains a limited number of variants created in almost real time, it can be used to assess how our method works on zero-day malware when only a limited number of samples are available.

Because of the limited number of instances in this dataset, the validation set is very small for some types. We therefore run the deep learning models on an 80%-20% train-test split as well as a 60%-20%-20% train-validation-test split of the data. Tables VI and VII summarize the performance of the models on all 49 types for the train-test and train-validation-test splits respectively, whereas Figures 4(a) and 4(b) provide radar charts showing training and test accuracy for each of the 49 types individually.

We observe that MAlign outperforms the other deep learning based models overall on the MLSec dataset regardless of the split. This highlights the advantage of explicit identification of critical code blocks when data is limited. The Feature-Fusion method of Ahmadi et al. [ahmadi2016novel] has not been included in this analysis because it requires both the byte and the assembly files, which made running it on the MLSec dataset impossible.

(a) Radar chart showing accuracy for MLSec-Train Dataset
(b) Radar chart showing accuracy for MLSec-Test Dataset
Fig. 4: Radar charts showing accuracy of different models on each of the 49 malware types (along the perimeter) on the MLSec train and test sets

From Figures 4(a) and 4(b), we observe that on the train set all approaches perform consistently, but on the test set M-CNN and MalConv are relatively inconsistent. For example, types 29 and 43 have 10 and 5 available variants respectively, and M-CNN’s test accuracy is 0 on both.

Robustness to adversarial attacks

A major issue with deep learning based malware detection approaches is that they can often be evaded by adding selected content to the malware. For deep learning based methods, a gradient attack can be used to find such content. Since MAlign relies on finding conserved blocks critical to the malware families for classification through sequence alignment, score calculation, and an interpretable logistic regression model, it should in principle be inherently robust to such attacks.

We investigated the robustness of our method compared to conventional malware detection techniques. We used the evasion technique against the MalConv model proposed by Kolosnjaji et al. [kolosnjaji2018adversarial]. It creates adversarial samples by modifying (or padding) only approximately 1.25% of the total size of a malware sample, and these samples evade the MalConv model at a high rate. The evasion rate increases with the percentage of modification of the malware samples. We experimented with their implementation [gradient_attack_github] on some of the types in the MLSec dataset, and the results are shown in Table VIII.

Type  MalConv Evaded/Total (Train set)  MalConv Evaded/Total (Test set)  MalConv Evasion Rate  MAlign Evasion Rate
1     5/15                              2/4                              36.84%                0.00%
11    1/9                               1/2                              18.18%                0.00%
12    5/13                              2/4                              41.18%                0.00%
28    5/8                               2/2                              70%                   0.00%
45    4/5                               2/2                              85.71%                0.00%
TABLE VIII: Gradient attack results on MalConv on some types in the MLSec Dataset
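The MalConv evasion rates in Table VIII appear to pool the train and test counts for each type. The short check below recomputes them under that assumption; the function name `evasion_rate` is introduced here for illustration only.

```python
# (train evaded, train total, test evaded, test total) per MLSec type, from Table VIII
counts = {
    1: (5, 15, 2, 4),
    11: (1, 9, 1, 2),
    12: (5, 13, 2, 4),
    28: (5, 8, 2, 2),
    45: (4, 5, 2, 2),
}

def evasion_rate(train_evaded, train_total, test_evaded, test_total):
    """Percentage of evasive samples over the pooled train and test samples."""
    return 100.0 * (train_evaded + test_evaded) / (train_total + test_total)

rates = {t: round(evasion_rate(*c), 2) for t, c in counts.items()}
```

For Type 1, for instance, 7 of the 19 pooled samples evade MalConv, which gives the reported 36.84%.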

We then applied MAlign to these evasive samples, and all of them were successfully detected with almost 100% prediction probability. It is worth mentioning that, since the evasion technique did not have access to the MAlign model, the experiment is not rigorous. However, since MAlign finds the blocks critical for function in a malware family and classifies samples based on those, to evade MAlign attackers would have to go through the cumbersome process of changing those blocks without changing the malware’s intended features and semantics. Moreover, gradient attack based evasion techniques cannot be applied directly to MAlign because, unlike many other techniques, we do not feed the malware file directly to our model. Besides gradient attacks, other typical obfuscation techniques are also unlikely to evade MAlign. Some of these are discussed below:

  • Including pieces of other malware to confuse family detection: The included pieces from other malware will be detected, but code pieces of the original family will be detected at the same time. Since the sample contains more code pieces from the original family than from the other family, the score for the original family should be higher, and MAlign should still identify the sample correctly.

  • Interspersing instructions into other benign programs to dilute the signal: Interspersed instructions will still be detected, as sequence alignment works in the presence of substitutions, insertions, and deletions of nucleotides (instructions). The prediction of MAlign will then depend on the score and number of occurrences of these interspersed instructions.

  • Using indirect addressing: Use of indirect addressing may cause some mismatches, but there will still be some matches because the source and destination registers will have to be the same. Moreover, instructions other than addresses will still have to be preserved to protect the malware semantics, and these will be captured by MAlign.

There are plausible approaches to attack the present implementation, which can be addressed in the future using classifiers with only non-negative weights and other techniques.

Running Time

We ran the experiments for the different methods on different platforms. However, to give an idea, the total running time (from data processing to classification) on the MLSec dataset is given in Table IX. Although inference time grows linearly with the number of families, once the signatures are found in MAlign, they can be trimmed if needed (based on how conserved they are), and the alignment of a new instance's sequence against the signature can then be sped up. Moreover, for malware detection, the classification models for different families can be run in parallel, thereby speeding up the process.

Method Running Time
MAlign 18hr 4min
MalConv 18hr 13min
M-CNN 14hr 26min
TABLE IX: Model Running Time for Different Methods on MLSec Dataset

The Feature-Fusion method of Ahmadi et al. [ahmadi2016novel] has not been included in Table IX because it was not run on the MLSec dataset. However, its running time on the Kaggle Microsoft dataset can be estimated from the paper [ahmadi2016novel]. For example, it takes almost 17 hours 15 minutes to extract the ‘REG’ feature from all samples, and it extracts 14 features in total.

VI Case Studies

(a) Code in .text segment
(b) Code in .rdata segment
(c) Code in .data segment
Fig. 5: Code obfuscation in data segment (a) A code fragment in a malware sample, (b) and (c) Same code obfuscated in .rdata and .data segments indicated by the hex-codes

An important advantage of MAlign is that it is interpretable, and insights about malware families can be derived through a simple backtracking process. The process is as follows:

  1. Find the blocks that are assigned high weights by the logistic regression model

  2. Select the blocks that are highly conserved from the above list

  3. Process the alignment file (MAF) to determine the sequences and their indices that constructed the blocks

  4. Locate the code fragments corresponding to the sequences found in Step 3
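Steps 3 and 4 above can be sketched as follows. This is a minimal illustration rather than the authors' tooling: it parses the `a` and `s` lines of a MAF file and maps each aligned sequence back to a byte range in its source file, assuming four nucleotides per byte. The sample names and coordinates in `MAF_TEXT` are made up.

```python
MAF_TEXT = """##maf version=1
a
s sample1 8 12 + 400 ACGTACGTACGT
s sample2 20 11 + 520 ACGTAC-TACGT
"""

def parse_maf(text):
    """Return a list of alignment blocks; each block is a list of sequence records."""
    blocks, current = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            current = None  # a blank line ends the current block
        elif fields[0] == "a":
            current = []
            blocks.append(current)
        elif fields[0] == "s" and current is not None and len(fields) == 7:
            _, src, start, size, strand, src_size, aligned = fields
            current.append({"src": src, "start": int(start), "size": int(size),
                            "strand": strand, "src_size": int(src_size),
                            "text": aligned})
    return blocks

def byte_range(record, nucleotides_per_byte=4):
    """First and last byte offsets covered by an aligned sequence (inclusive)."""
    start, size = record["start"], record["size"]
    if record["strand"] == "-":
        # MAF gives minus-strand starts relative to the reverse-complemented source.
        start = record["src_size"] - start - size
    return start // nucleotides_per_byte, (start + size - 1) // nucleotides_per_byte

blocks = parse_maf(MAF_TEXT)
ranges = {r["src"]: byte_range(r) for r in blocks[0]}
```

The resulting byte offsets can then be looked up in a disassembly of each sample to find the corresponding code fragments.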

This process can be used to uncover potential malicious code as discussed next.

Detection of obfuscated malicious code

Attackers and malware creators use different techniques to obfuscate malicious code segments to evade anti-malware tools. MAlign gives us the ability to find the responsible malicious code segments in malware files through the backtracking process sketched above and to analyze the blocks of code that are highly conserved.

We backtracked from the aligned blocks to the assembly code on some randomly selected samples from the Kaggle Microsoft Malware (Big 2015) dataset. In some cases, we found evidence of malware obfuscation. Figure 5 is an example of such a case (here, the data-transformation obfuscation technique). Figures 5(a), 5(b) and 5(c) are snapshots of three different malware files from the ‘Vundo’ malware family of the ‘Trojan’ type. All samples contain the same hex-code but in different segments. The samples in Figures 5(b) and 5(c) conceal the code string of Figure 5(a) in their .rdata and .data sections respectively, using db. In short, the same machine code was transformed to its hex form and placed into a data section, possibly to evade anti-malware tools.

VII Conclusions

In this paper, we presented a malware detection tool MAlign based on a recently developed multiple whole-genome alignment tool, SibeliaZ [sibeliaz]. Sequence alignment based approaches have been used for malware analysis in the past, but the use of a whole-genome alignment tool makes MAlign scalable to long malware sequences and protects against trivial adversarial attacks such as code obfuscation. The method is also interpretable and can be used to derive insights on malware such as identification of critical code blocks and code obfuscation.

We have applied MAlign on the Kaggle Microsoft Malware Classification Challenge (Big 2015) and the Microsoft Machine Learning Security Evasion Competition (2020) (MLSec) datasets, and observed that it outperforms widely used deep learning based methods such as MalConv, M-CNN, and the feature fusion method (Ahmadi et al.).

Preliminary experiments show that MAlign is robust to common adversarial attacks such as padding and modification of sequences. In the future, the method may be tested against other possible types of evasion techniques, and the machine learning algorithms can be adjusted accordingly. In addition, other whole-genome alignment tools such as Progressive Cactus [progressivecactus] may be experimented with.