ML-based IoT Malware Detection Under Adversarial Settings: A Systematic Evaluation

08/30/2021 ∙ by Ahmed Abusnaina, et al. ∙ Wayne State University ∙ University of Central Florida ∙ Ewha Womans University

The rapid growth of Internet of Things (IoT) devices is paralleled by their being on the front line of malicious attacks. This has led to an explosion in the number of IoT malware, with continued mutations, evolution, and sophistication. Such malicious software is detected using machine learning (ML) algorithms alongside the traditional signature-based methods. Although ML-based detectors improve the detection performance, they are susceptible to malware evolution and sophistication, which limits them to the patterns that they have been trained upon. This continuous trend motivates the large body of literature on malware analysis and detection research, with many systems emerging constantly and outperforming their predecessors. In this work, we systematically examine state-of-the-art malware detection approaches that utilize various representation and learning techniques under a range of adversarial settings. Our analyses highlight the instability of the proposed detectors in learning patterns that distinguish benign from malicious software. The results show that software mutations with functionality-preserving operations, such as stripping and padding, significantly deteriorate the accuracy of such detectors. Additionally, our analysis of industry-standard malware detectors shows their instability under malware mutations.




I Introduction

IoT malware has been the focus of the security research community and the industry alike. These efforts have resulted in various malware detection approaches, intended for safeguarding the IoT infrastructure against increasing targeted attacks. These proposed detectors leverage the traditional signature-based approach or the capabilities of learning algorithms to build Artificial Intelligence (AI)-based detectors. These detection systems leverage modalities generated through static and dynamic software analysis techniques, along with deep learning and natural language processing, for generalizing detection to previously unseen IoT malware.


Considering that these techniques are heavily dependent on the specific data used for their training and testing, it is plausible that their performance would be reduced when tested in an uncontrolled environment due to various practical settings. For example, the constant evolution of malware that employs obfuscation may impact the performance of these detectors over time, especially the static-based techniques. While packing is widely used among malicious software, it is not exclusive to malware. This limits the usage of packing as a detection modality, since it may result in a large number of false positives. Even in the absence of packing, malware detection systems have been shown to be susceptible to adversarial attacks. An adversary can manipulate the features of any software, directly or indirectly, to force the detector to output the adversary’s desired decision [severi2020exploring, AbusnainaKAPAM19, KreukBABPK18].

A common practice for inspecting software is using online scan engines, such as VirusTotal [VirusTotal], which embody the aforementioned techniques for malware detection and provide reports that contain the detection results of a pool of anti-virus engines. Additionally, these online scanners are utilized by the malware developers to check if their malicious payloads can evade detection from the anti-virus engines before starting a malware campaign [GrazianoCBL15]. Altogether, before deploying such malware detection systems in practice, it is essential to understand the shortcomings of state-of-the-art IoT malware detection systems under adversarial settings that can be abused by the adversaries towards future malware campaigns.

In this work, we examine state-of-the-art malware detection approaches, including those that rely on different representation and learning algorithms. We consider techniques that represent the software as a binary sequence, static disassembly features, and graphs. These representations yield a promising detection performance, with higher than 99% detection accuracy [mercaldo2020deep, LiSYLSY18, AhmadiUSTG16, xu2018cdgdroid, ChenYHWPWY18]. However, our findings highlight the instability of the learning algorithms in learning useful fundamental patterns that represent the difference between benign and malicious software (more details can be found in section V).

By systematically evaluating the robustness of various malware detectors, we demonstrate that manipulating the malicious software with functionality-preserving operations, such as stripping and binary padding, significantly reduces the detectors’ performance. To this end, we generate four equivalent binaries for each software sample by packing (with different compression levels), stripping, and padding. We evaluate each of the resultant binaries against various IoT malware detection approaches, along with the industry-standard malware detection engines. The results show a concerning behavior, where one or more detectors fail to maintain reasonable performance (detection rate below 50%) in detecting malware mutations. Figure 1 shows the different phases of the analysis strategy: feature representation, software manipulation, and evaluation of ML-based malware detectors.

Figure 1: The system pipeline. The software binaries are (a) represented using different state-of-the-art approaches, and (b) manipulated using functionality preserving operations, such as packing, stripping, and padding. The corresponding representations of the original samples and manipulated ones are then (c) tested against pre-trained ML-based malware detectors and industry-standard detection engines.

Contributions. This work highlights the discrepancies between the capabilities of the adversary and the assumed adversarial capabilities by the research community. Particularly, we make the following contributions:

  1. Validity of the baseline: We examine nine state-of-the-art malware detection representations and three learning algorithms and evaluate their performance using a total of 5,295 IoT software binaries. The evaluation shows the effectiveness of each representation in detecting malicious IoT software with high accuracy in a level playing field.

  2. Model instability: We investigate the stability of the baseline malware detectors. Our results demonstrate the inconsistency of the learning process, i.e., with the introduction of a small random perturbation to the input space, the detector is rendered useless (outputs random label).

  3. Vulnerability to adversarial settings: We examine the robustness of the IoT malware detectors under white-box and black-box adversarial settings, resulting in an accuracy reduction of up to 70%.

  4. Vulnerability to binary manipulation: We evaluate detectors against three manipulation techniques: packing, stripping, and padding. These techniques are practical and functionality-preserving, in that the generated software is identical in functionality to the original software. Our evaluation shows that such software is capable of misleading the state-of-the-art malware detectors.

  5. Vulnerability of industry-standard malware detectors: The evaluation of industry-standard malware detection engines shows that most of the engines are rendered useless upon slight modification of the software.

II Background

The increasing security concern for IoT devices has been paralleled by an increasing body of work in the area of IoT security, particularly addressing malware analysis and detection. Building towards our work, it is important to outline the efforts that propose IoT malware detection systems and the methods of evasion that will elucidate the susceptibility of the malware detection systems to various adversaries.

II-A Malware Detection

Prior works have shown the potential and feasibility of ML to detect malware with more than 99% accuracy [AlasmaryAPCNM19, WangCYJPYC20, AnwarAPW20, YanCZPYHZY21, vasan2020image, mercaldo2020deep, LiSYST19]. The performance of these detection systems depends on the choice of software representations, which are a result of two common analysis techniques. In dynamic analysis, a malware is executed in a monitored sandbox environment. The behavioral patterns are then used as feature representation. However, dynamic analysis is time and space-consuming, thereby limiting its scalability [willems2007toward].

Static analysis involves analyzing the binary executable without executing it. The fast and scalable extraction of representations makes static analysis the primary analysis technique for malware detection. Malware binaries have multiple features that can be statically extracted and used as modalities for malware representation.

Selected Representations. We focus on representations that are (1) extensively used in the prior works, (2) fast to generate, and (3) can be extracted for malware detection on the fly. We summarize the used representations in the following.

  1. Image. A common strategy is to transform the malware into a grayscale image. Particularly, the byte-code is visualized as a grayscale image of a fixed size, where every byte is a pixel in the image.

  2. CFG adjacency. Another strategy is to extract the assembly instructions by disassembling malware and further transforming them into a Control Flow Graph (CFG) by dissecting them into basic blocks depending on the instruction branching or jumps. The CFG is then represented as a square matrix representing edges between nodes.

  3. CFG algorithm. Graph algorithms have been used to extract graph attributes that represent the connectivity patterns in the CFG.

  4. Strings are sequences of printable characters in the binary codebase. The strings of a program are analyzed to understand the possible behavioral patterns of the malware and can also be used to prepare a sandbox environment for the dynamic analysis [CozziGFB2018].

  5. Segments are necessary for program execution. They describe the memory layout of an executable and are interpreted by the kernel during load [ONeill16]. Within every segment, there may be code or data divided among sections, such as .text. Binaries contain symbol tables which are used as references for linking and debugging [ONeill16].

  6. Symbols are symbolic references to code or data and include global variables or functions. Every executable generally has two symbol tables: the symbol table that contains all symbol references and the dynamic symbol table which only contains references for dynamic symbols [ONeill16].

  7. Hexdump represents a malware as a sequence of hexadecimal values, where each value represents one byte (two hexadecimal digits, in the 0-255 range), the frequency of which is then recorded as a frequency vector.

  8. Feature fusion represents a unified (combined) representation of all of the aforementioned representations.
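To make two of the byte-level representations above concrete, the following is a minimal sketch rather than the pipeline used in this work. The function names, the default side length of 64, and the truncate-or-zero-pad policy are all assumptions chosen for illustration; the 256 histogram bins simply follow from the 0-255 value range.

```python
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[int]:
    """Count occurrences of each byte value; index i holds the count of value i."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def to_grayscale_image(data: bytes, side: int = 64) -> List[List[int]]:
    """Lay the bytes out row by row as a side x side grid of pixel intensities.

    Assumption: inputs longer than side*side are truncated and shorter ones
    are zero-padded; the original work may resize differently.
    """
    flat = list(data[: side * side])
    flat += [0] * (side * side - len(flat))
    return [flat[r * side:(r + 1) * side] for r in range(side)]
```

Both functions take the raw file content, for example the result of reading a binary in `"rb"` mode, and need no disassembly, which is what makes these representations cheap to extract.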

For the completeness of the study, we include malware representations proposed by works that are not strictly IoT malware-specific.  Table I summarizes the malware representations that have been proposed for malware detection, and utilized in this work.

II-B Representation Evasion

Several software evasion and manipulation techniques were proposed for malware mutation and misclassification. In the following, we briefly discuss the commonly used techniques.

Binary Packing. Packing is used by malware authors to thwart detection or analysis by detectors and analysts. A packer compresses or encrypts an executable so that its code and data are hidden from analysts. Because portions of the executable are compressed, they need to be decompressed in memory before execution [ONeill16].

Typical packing software consists of two programs, the packer program and the stub program, where the first packs the software while the second deobfuscates it. While there are many packing programs, such as DacryFile by Grugq, Burneye by Scut, Shiva by Neil and Shawn, and Maya’s Veil by Ryan, the Ultimate Packer for eXecutables (UPX) [upx] is the one most commonly used [CozziGFB2018]. UPX utilizes the UCL data compression library algorithm [uclCompression], which uses in-place decompression and does not introduce memory overheads.

Binary Stripping. Stripping is utilized to hide information that may leak the functional strategies of the software. A codebase can be compiled with no standard library linking (gcc -nostdlib). Alternatively, parts of the ELF file can be removed or obfuscated so that the different constituents of the binary format can no longer be interpreted. The resultant binaries would be void of information such as debug and relocation information, section headers, and symbols [stripping].


| Type | Feature | Work |
|---|---|---|
| Binary | Image | [kancherla2013image, vasan2020image, yajamanam2018deep, mercaldo2020deep] |
| CFG | Adjacency | [xu2018cdgdroid, jalote2012integrated, bruschi2006detecting] |
| CFG | Algorithmic | [AlasmaryAPCNM19, bruschi2006detecting, AnwarAPW20] |
| CODE | String | [AhmadiUSTG16, AnwarAPW20] |
| CODE | Symbols | [AhmadiUSTG16, AnwarAPW20] |
| CODE | Sections | [AhmadiUSTG16, AnwarAPW20] |
| CODE | Segments | [AhmadiUSTG16] |
| CODE | Hexdumps | [AhmadiUSTG16] |
| CODE | Combined | [AhmadiUSTG16, AnwarAPW20] |

Table I: The state-of-the-art static analysis representations used in this work. Most of the representations require reverse-engineering (R.E.), while the image-based representation directly uses the raw binaries (Bin.). CODE: features extracted from the disassembled binaries.

Adversarial Evasion. With the growth in ML adoption, it is essential to understand and assess the robustness of ML techniques to several adversarial settings. These settings include adversarial examples, in which an adversary crafts perturbation to misguide the model output to its desired label by applying a minimal perturbation to the original sample [PapernotMJFCS16].

Given a model objective function $f$ and a sample represented by a vector $x$, the adversary aims to introduce a perturbation ($\delta$) in the feature space such that $f(x + \delta) \neq f(x)$. Crafting the perturbation can be approached from two perspectives: targeted and untargeted attacks.

Targeted attacks. The adversary in this attack generates an adversarial example $x' = x + \delta$ that forces the classifier to misclassify it into a specific target class $t$. For instance, the adversary generates a set of malicious IoT software samples, which are classified as benign. That is: $f(x') = t$.

Untargeted attacks. The adversary’s goal is to have the model output any class other than the original label. That is: $f(x') \neq f(x)$. In this work, we only consider the two-class classification task, where targeted and untargeted attacks behave the same.
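The two adversarial goals can be written as small predicates. The sketch below is purely illustrative: `toy_detector`, its threshold, the 0/1 label encoding, and the sample vectors are hypothetical stand-ins and not a detector evaluated in this work.

```python
from typing import Callable, Sequence

Label = int  # assumed encoding: 0 = benign, 1 = malicious
Model = Callable[[Sequence[float]], Label]


def untargeted_success(f: Model, x: Sequence[float], x_adv: Sequence[float]) -> bool:
    """Untargeted goal: the adversarial sample receives a different label than x."""
    return f(x_adv) != f(x)


def targeted_success(f: Model, x_adv: Sequence[float], target: Label) -> bool:
    """Targeted goal: the adversarial sample receives the adversary's chosen label."""
    return f(x_adv) == target


def toy_detector(x: Sequence[float]) -> Label:
    """Hypothetical detector that flags a sample when its feature sum exceeds 1."""
    return 1 if sum(x) > 1.0 else 0


x = [0.9, 0.4]                              # flagged as malicious
delta = [-0.3, -0.2]                        # perturbation in the feature space
x_adv = [a + b for a, b in zip(x, delta)]   # perturbed sample
```

With only two classes, flipping a malicious sample away from its label is the same as pushing it to the benign class, so both predicates agree on `x_adv`.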

Adversarial attacks can be launched under different adversarial capabilities that allow for either black-box or white-box attacks. In a white-box attack, the adversary has full knowledge of the inner networking paradigm of the model. In a black-box attack, the adversary has only access to the model via an oracle and can only observe the model’s output.

Several methods have been proposed to generate adversarial examples by directly perturbing the feature space in both black-box and white-box settings [GoodfellowSS15, HuT17, Moosavi-Dezfooli16, KurakinGB17a]. For example, Carlini and Wagner [CarliniW17] proposed generic adversarial attacks against distilled Neural Networks (NN), which proved effective against several “robust” deep learning NNs.

While initially designed to exploit image-based classifiers, where perturbation can be directly applied to the image pixels [PapernotMJFCS16, PapernotMGJCS17, wangYVZZ18], adversarial attacks showed high success in malware detection while preserving the software functionality and executability [grosse2017adversarial, AbusnainaKAPAM19]. At the binary-level, several studies [KolosnjajiDBMGE18, KreukBABPK18] generated practical adversarial examples by appending binaries to the original file. While it is effective against signature- and binary-based classifiers, it can be countered by reverse-engineering the software to extract the corresponding representations.

Other studies [AbusnainaAASNM19, AbusnainaKAPAM19] introduced adversarial attacks on the execution flow of the code, by injecting benign functionalities within the malware and vice versa. However, such a perturbation should be applied to the source code, and is only possible by the malware author, unlike the binary padding approach.

To investigate the effectiveness of different malware representation and learning approaches, we examine a wide set of adversarial settings, including direct generic and modified adversarial attacks, as well as the black-box adversarial settings.

III Threat Model

Learning algorithms are widely used to obtain state-of-the-art performance in several fields, including malware detection. However, the usage of ML in critical domains is subject to adversarial attacks. In the following, we discuss the threat models used for systematically evaluating the robustness of the malware detectors.

III-A Gaussian Noise

A stable learning model is argued to be largely immune to misclassification under the introduction of Gaussian noise in the feature space, as unguided perturbation is unlikely to disrupt the existing patterns [hu2019new, guo2018countering, SzegedyZSBEGF13].

A correctly trained model that can distinguish benign and malicious samples with high confidence is constrained by three factors. (1) Data representation: A robust software representation should contain meaningful patterns that can distinguish the malicious from the benign software, (2) Learning algorithm: The learning algorithm should be able to capture such patterns even at a higher dimensionality without over-fitting or under-fitting, and (3) Training data: The trained model should be generalizable to unseen new samples, and samples that are not fundamentally different from the ones in the training dataset. This requires the training data to be cohesive and the samples of each class to be an accurate representation of that class. While the first two factors are considered, the third is an open challenge, and we consider it out-of-scope of this work.

In this work, we use the Gaussian noise as a metric to measure the stability of the representations. Given the model objective function , data points (samples) with feature space of features, the output of the model is defined as . The Gaussian noise is then calculated as follows:

where is a list of the features of all . A stable model is then defined as:

In this work, we do not introduce a cut-off threshold for a stable model. However, we observe the model’s behavior when a perturbation in the range of [1%, 100%] is introduced. Ideally, the relationship between the accuracy and perturbation should be linear: with an increasing perturbation, the accuracy should linearly decrease, e.g., to reach random (50%) at 100% perturbation given the two-class classification task. We note that this attack will not generate practical adversarial examples, as it applies the perturbation to the feature space directly. Rather, it is used to measure the detectors’ stability.
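A stability sweep of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: the noise for each feature is drawn with zero mean and a standard deviation equal to the perturbation rate times that feature's standard deviation over all samples, which is one plausible reading of a percentage perturbation. The names `add_gaussian_noise`, `accuracy`, and `stability_curve` are hypothetical.

```python
import random
from statistics import pstdev
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Model = Callable[[Sample], int]


def add_gaussian_noise(X: List[Sample], rate: float, rng: random.Random) -> List[List[float]]:
    """Add zero-mean Gaussian noise per feature, scaled by rate * feature std (assumed)."""
    sigmas = [pstdev(column) for column in zip(*X)]
    return [[v + rng.gauss(0.0, rate * s) for v, s in zip(row, sigmas)] for row in X]


def accuracy(model: Model, X: List[Sample], y: List[int]) -> float:
    """Fraction of samples whose predicted label matches the ground truth."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def stability_curve(model: Model, X: List[Sample], y: List[int],
                    rates: Sequence[float], seed: int = 0) -> List[Tuple[float, float]]:
    """Accuracy of the model at each perturbation rate (1.0 corresponds to 100%)."""
    rng = random.Random(seed)
    return [(rate, accuracy(model, add_gaussian_noise(X, rate, rng), y)) for rate in rates]
```

Plotting the returned pairs for rates between 0.01 and 1.0 gives a curve that can be compared against the ideal linear decay described above.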

Figure 2: Graph manipulation. The software is reverse-engineered and (a) represented as CFG and adjacency matrix, (b) using the pre-trained neural network, (c) white-box C&W-based perturbation is crafted/applied to the CFG.

III-B Graph Manipulation

This configuration targets the graph-based representations, including the adjacency- and algorithmic-based representations extracted from the software’s corresponding CFGs. Given a CFG $G = (V, E)$, where $V$ is the set of nodes in the graph and $E$ is the set of edges, the adversary’s goal is to introduce a carefully crafted perturbation that causes the system to misclassify the sample as the desired output. To introduce such a perturbation, we used the adjacency matrix representation as a baseline to craft the perturbation. Then, the Carlini & Wagner (C&W) attack [Carlini017] is used to craft the perturbation under the white-box settings. C&W is a gradient-based attack that optimizes a penalty term and norm-based distance metrics in the process of generating adversarial examples. This method ensures that the added perturbation will be minimal while causing misclassification.

Using the adjacency matrix representation, the adversary aims to craft a perturbation $\delta$ within a domain-specific range of possible features that can be observed in ordinary samples. This perturbation achieves the adversarial goal if $f(x + \delta) \neq f(x)$, where $f(x + \delta)$ is the classifier’s prediction after applying the perturbation to the original feature space $x$. Figure 2 shows the outline of the attack. To keep the generated CFG realistic, we limit the actions done by the C&W attack to only adding nodes and edges. This is done by modifying the original attack to prevent deleting existing edges, limiting the process to adding edges.
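The add-only constraint can be illustrated with a small projection step. This is a sketch under assumptions, not the modified C&W attack itself: the optimization that produces the real-valued perturbed matrix is omitted, node addition is not modeled, and the function names and the 0.5 binarization threshold are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def adjacency_matrix(num_nodes: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Build a square 0/1 matrix where entry [src][dst] marks a CFG edge."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def project_add_only(original: Matrix, perturbed: List[List[float]],
                     threshold: float = 0.5) -> Matrix:
    """Binarize a perturbed matrix while never dropping an edge of the original."""
    return [
        [1 if (o == 1 or p >= threshold) else 0 for o, p in zip(o_row, p_row)]
        for o_row, p_row in zip(original, perturbed)
    ]
```

For example, if an attack step lowers the score of an existing edge below the threshold, `project_add_only` keeps that edge, so the result is always a supergraph of the original CFG.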

While CFG manipulation preserves the original functionality [AbusnainaAASNM19, AbusnainaKAPAM19], we do not have access to the source code of the samples. Therefore, we cannot generate practical adversarial binaries using CFG manipulation. Given that, we used this attack to evade the graph-based detectors using direct white-box attacks on NN-based adjacency matrix-based classifier, while transferring the attack to remaining CFG-based classifiers.

III-C Static String Manipulation

Another white-box attack is the string manipulation attack. In this representation, the software is represented as a bag-of-words vector whose size equals the number of words considered in the representation. Similar to the graph manipulation attack, we used the C&W attack to craft a minimal perturbation that causes the model to misclassify. Given that the crafted perturbation cannot be applied directly to the binaries, we consider it a practical attack under the assumption that the source code is available. We evaluate this attack by crafting the perturbation using the NN baseline and transferring the attack to the remaining baseline models.
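The bag-of-words representation targeted by this attack can be sketched as below. The sketch is an assumption-laden illustration: it extracts printable ASCII runs with a regular expression, similar in spirit to the Unix `strings` tool, rather than through the reverse-engineering framework used in this work, and the minimum length of 4 and the vocabulary are hypothetical.

```python
import re
from collections import Counter
from typing import List, Sequence


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII characters found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def bag_of_words(strings: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary word occurs among the extracted strings."""
    counts = Counter(strings)
    return [counts[word] for word in vocabulary]
```

A perturbation in this feature space corresponds to adding occurrences of vocabulary words, which is why applying it in practice requires changing the program rather than editing the vector directly.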

III-D Binary Packing

Recall that a binary executable can be packed using packer software, such as UPX (see subsection II-B). The ML-based detectors utilize features, such as raw binaries, strings, and segments, extracted from the malware. These features are, however, suppressed by packing. In this attack, we pack the malware and probe the performance of the representations used in the literature. Moreover, UPX supports different degrees of packing. For this study, we utilized the default settings and the best compression method of UPX.

III-E Binary Stripping

Recall that a binary can be stripped of information without affecting its executability (see subsection II-B). In this attack, we probe the impact of a stripped binary on an ML-based detector’s performance. Particularly, we strip the binaries of their debug information and the symbol information that are not needed for relocation.
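The packing and stripping manipulations map naturally onto command-line invocations. The exact commands used in this work are not stated, so the sketch below is an assumption: it builds argument lists for UPX (default settings, or `--best` for the best compression method) and for GNU `strip` with `--strip-unneeded`, which removes debugging symbols and symbols not needed for relocation. The helper names and file paths are hypothetical.

```python
from typing import List


def upx_command(src: str, dst: str, best: bool = False) -> List[str]:
    """Argument list that packs src into dst with UPX (default or best compression)."""
    command = ["upx"]
    if best:
        command.append("--best")
    return command + ["-o", dst, src]


def strip_command(src: str, dst: str) -> List[str]:
    """Argument list that writes a stripped copy of src to dst with GNU strip."""
    return ["strip", "--strip-unneeded", "-o", dst, src]
```

Each list can be passed to `subprocess.run(command, check=True)` on a machine where the tools are installed; for IoT binaries, a `strip` build that understands the target architecture is needed.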

Figure 3: Binary padding attack overview. (a) The software is represented as an image. (b) The content of the image is then compressed into half of the image. (c) Using the C&W attack, we generate a perturbation on the remaining half of the image. (d) The generated image perturbation is then rescaled to the original size of the software, and then (e) reshaped to a 1-D vector representing the binaries to be appended.

III-F Binary Padding

In this attack, the adversary aims to craft a practical (executable) white-box adversarial example by appending binaries to the end of the software binaries. Figure 3 shows the process of generating the perturbation in the white-box settings for image-based representation baselines. For a software sample represented as an image, we first compress the content of the image into its upper half. Afterward, we craft a minimal perturbation using the C&W attack. To prevent the attack from applying a perturbation to the upper half of the image, the attack is modified to allow changes only in the lower half of the image. After the evasion, we convert the generated lower half of the image back to the actual size of the software, and then convert it to a 1-D vector by concatenating the rows. We note that this attack introduces a perturbation size of 100%, as the perturbation has the same size as the original file, so the generated software is twice the size of the original. This attack generates adversarial software that is executable. We evaluate the generated software on the image-based baseline models, in addition to the other representations by re-extracting the features from the manipulated software.
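The last two steps of the padding attack, turning the perturbed half-image back into bytes and appending them, can be sketched as follows. The perturbation itself is taken as given, and the nearest-neighbour resampling used to rescale it is an assumption, since the rescaling method is not specified; both function names are hypothetical.

```python
from typing import List


def perturbation_to_padding(lower_half: List[List[int]], size: int) -> bytes:
    """Concatenate the perturbed rows into a 1-D vector and resample it to size bytes.

    Assumption: nearest-neighbour resampling, with pixel values clipped to 0-255.
    """
    flat = [min(255, max(0, int(pixel))) for row in lower_half for pixel in row]
    return bytes(flat[i * len(flat) // size] for i in range(size))


def pad_binary(original: bytes, padding: bytes) -> bytes:
    """Append the padding after the original content, which is left untouched."""
    return original + padding
```

Because the original bytes form an unchanged prefix of the output, the file's headers and code are preserved, and a padding of the same length as the original yields an output of twice the original size.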

IV Dataset Overview

To analyze the robustness of state-of-the-art malware detectors, we start by collecting a dataset of malicious and benign IoT binaries. The dataset was collected between November 2018 and December 2020, where 3,000 malware samples of three families—Gafgyt, Mirai, and Tsunami—were retrieved from CyberIOCs [cyberiocs19], VirusTotal [VirusTotal], and VirusShare [VirusShare], in addition to 2,295 benign samples, compiled from source files on GitHub [github19] with different optimization levels.

Ground Truth Class. We used VirusTotal [VirusTotal] to validate the malicious and benign samples in our dataset. The samples were first uploaded to VirusTotal. After 24 hours, the scan results corresponding to each sample were retrieved.

Data Augmentation. As mentioned in section II, the dataset samples are transformed to different representations: (1) The samples are represented as images to be fed into an image-based classifier. (2) Using Radare2 [radare2], an open-source reverse-engineering framework for analyzing binaries, the samples were reverse-engineered to obtain various features, such as strings, symbols, sections, and segments. (3) The Hexdump representation is used to represent the “.text” section of the binaries. (4) The software CFG is extracted using Radare2, and is then used to generate the software adjacency matrix and different graph-theoretic features.

V Robustness Analysis

V-A Experimental Setup

Towards evaluating the robustness of the state-of-the-art IoT malware detectors, the dataset is transformed using the nine representations. Then, four learning algorithms are used to establish the baseline classifiers.

Learning Algorithms. Several classification algorithms have been adopted and used in various domains in IoT malware detection and classification [AlasmaryAPCNM19, ShenHZYFC18].

In this study, we evaluate the robustness of four ML algorithms, namely, Logistic Regression (LR), Random Forest (RF), Convolutional Neural Networks (CNN), and Deep Neural Networks (DNN). These algorithms were selected for multiple reasons: they (1) are commonly used in this domain, (2) are fundamentally different in the learning process, and (3) span both highly sophisticated approaches, such as DNN and CNN, and simpler ML algorithms, such as LR and RF. For instance, the LR-based classifier is selected to extract the relationships between variables in the feature space, with no deep representations. CNN, on the other hand, was selected to extract deep patterns in higher dimensionality. The nature of the selected models will help in investigating the robustness and stability of the feature representations and the learning algorithms more accurately and on a larger scale.

The CNN-based architecture performs well in extracting patterns in higher dimensionality when the pattern location is irrelevant. Therefore, we use the CNN model with image-, CFG adjacency-, and CFG algorithmic-based feature representations. On the other hand, the DNN-based architecture is used with the static-based vector representations, including Strings-, Symbols-, and Hexdump-based feature representations.


| Type | Feature | LR | RF | NN |
|---|---|---|---|---|
| Binary | Image | 99.90 | 99.81 | 100 |
| CFG | Adjacency | 91.67 | 89.90 | 92.25 |
| CFG | Algorithmic | 90.20 | 99.22 | 92.09 |
| CODE | String | 98.48 | 99.43 | 98.48 |
| CODE | Symbols | 98.77 | 99.43 | 97.82 |
| CODE | Sections | 100 | 100 | 58.16 |
| CODE | Segments | 98.39 | 100 | 58.16 |
| CODE | Hexdumps | 98.96 | 99.24 | 98.48 |
| CODE | Combined | 100 | 99.90 | 57.79 |

Table II: Accuracy (%) of the baseline models. Each representation is evaluated using LR, RF, and NN-based classifiers. Note that almost all representations hold high performance (99% or higher with at least one classifier) in detecting IoT malware.

Training Stage. The dataset is split into 80% training and 20% testing. The Neural Network (NN) classifiers were trained for ten epochs with a learning rate of 0.01.
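The training setup can be illustrated with a deliberately small stand-in. The sketch below is not the architecture used in this work: it replaces the neural networks with a single logistic unit trained by stochastic gradient descent, and only the 80/20 split, the ten epochs, and the 0.01 learning rate are taken from the text. All function names and the seed are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {0, 1})


def split_80_20(samples: List[Example], seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle the samples and split them into 80% training and 20% testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(train: List[Example], epochs: int = 10,
                   lr: float = 0.01) -> Tuple[List[float], float]:
    """Fit weights and bias of a logistic unit with per-sample gradient steps."""
    weights, bias = [0.0] * len(train[0][0]), 0.0
    for _ in range(epochs):
        for features, label in train:
            error = sigmoid(sum(w * v for w, v in zip(weights, features)) + bias) - label
            weights = [w - lr * error * v for w, v in zip(weights, features)]
            bias -= lr * error
    return weights, bias
```

Applied to the 5,295 samples of the dataset, `split_80_20` would yield 4,236 training and 1,059 testing samples.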

V-B Evaluation & Results

To better understand the robustness of the IoT malware detection systems, we evaluate each of the settings separately.

V-B1 Baseline Evaluation

We implemented the baseline classifiers on our dataset (see section IV). Table II shows the performance of the classifiers. Eight out of the nine representations achieve a high detection accuracy of at least 99% with at least one learning algorithm. The only exception is the CFG-based adjacency matrix representation, with an accuracy of 92.25%. We note that high accuracy reflects neither accurate learning nor the quality of the learned patterns.

Figure 4: Baseline classifiers evaluation under various Gaussian noise perturbation rates (1%-100%). Panels: (a) image representation, (b) adjacency matrix representation, (c) graph algorithmic features, (d) string representation, (e) symbols table representation, (f) sections table representation, (g) segments table representation, (h) Hexdump-based representation, (i) combined static representations.

V-B2 Model Stability

RQ1. “Are the baseline models correctly trained with no over-fitting and under-fitting?”

A stable model’s performance should ideally decrease linearly with the increase of the perturbation size, to eventually reach random (50% given the two-class classification). Figure 4 shows the evaluation of the baseline classifiers under the Gaussian noise with 1%-100% perturbation. Except for the Hexdump representation, with the introduction of a perturbation size of 1%, the classifiers fail to perform better than a random guess. This highlights that the used representations are not stable and may fail due to temporal changes in the data. A likely reason for this is the frequent appearance of different versions of the same malware, or of identical malware, thereby forcing the model to over-fit on exact matches instead of extracting useful patterns.

Key Finding: Except for Hexdump-based representation, the baseline classifiers demonstrate high instability in their performance under small perturbation (1% Gaussian noise).


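| Type | Feature | Attack Type | Model | Accuracy (%) |
|---|---|---|---|---|
| Binary | Image | Transferred | LR | 63.73 |
| Binary | Image | Transferred | RF | 72.71 |
| Binary | Image | Direct | CNN | 63.73 |
| CFG | Adjacency | Transferred | LR | 81.77 |
| CFG | Adjacency | Transferred | RF | 79.60 |
| CFG | Adjacency | Direct | CNN | 81.30 |
| CFG | Algorithmic | Transferred | LR | 59.95 |
| CFG | Algorithmic | Transferred | RF | 60.70 |
| CFG | Algorithmic | Transferred | CNN | 59.95 |
| CODE | String | Transferred | LR | 29.08 |
| CODE | String | Transferred | RF | 30.02 |
| CODE | String | Direct | DNN | 30.59 |

Table III: Baseline classifiers evaluation under white-box settings. Only realistic and practical adversarial attacks are considered. All attacks are done on the NN and transferred to the LR- and RF-based classifiers.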

V-B3 White-box Attacks

RQ2. “Are the classifiers prone to practical white-box adversarial attacks?”

Evaluating the classifiers against white-box settings is essential to understand their point-of-failure. In this context, we evaluate the white-box attacks that can be implemented directly on the binaries, or on the source code by the malware author. Table III shows the evaluation of the baseline models under white-box attacks, including binary padding and graph and string manipulation. While the binary padding can be also applied to the remaining representations (as shown later), it is considered as a white-box attack on the image-based representation only, and therefore reported here. We note that all considered attacks are implemented on the NN-based classifier, and transferred to other learning algorithms. The CFG-based algorithmic representation was evaluated using the perturbation generated on the adjacency-based representation (i.e., transferred) due to their feature dependencies.

Key Finding: For several representations, practical white-box attacks are possible, and can be transferred to related learning algorithms and representations.


                          Benign                                            Malware
Type    Feature     L.A.  Original  Packed    Packed*   Stripped  Padded    Original  Packed    Packed*   Stripped  Padded
Binary  Image       LR    100       3.92      4.35      6.31      63.73     99.83     98.00     98.00     98.00     98.33
Binary  Image       RF    99.56     2.39      2.17      2.39      72.71     100       96.66     96.66     92.00     85.00
Binary  Image       NN    100       6.31      6.31      2.17      63.73     100       100       100       100       100
CFG     Adjacency   LR    87.36     33.11     33.55     87.36     87.36     95.50     77.33     77.50     95.50     95.50
CFG     Adjacency   RF    88.01     98.91     99.12     88.01     88.01     91.50     73.16     73.16     91.50     91.50
CFG     Adjacency   NN    86.92     1.74      1.74      86.92     86.92     96.33     79.16     79.16     96.33     96.33
CFG     Algorithmic LR    91.54     1.96      1.96      91.54     91.54     89.04     89.86     89.64     89.04     89.04
CFG     Algorithmic RF    99.51     99.56     99.78     99.51     99.51     98.96     88.76     88.76     98.96     98.96
CFG     Algorithmic NN    93.23     2.17      2.17      93.23     93.23     91.11     91.85     91.62     91.11     91.11
CODE    String      LR    96.51     3.48      3.48      96.51     96.51     100       100       100       100       100
CODE    String      RF    98.69     2.39      2.39      98.69     98.69     100       100       100       100       100
CODE    String      NN    96.51     0.00      0.00      96.51     96.51     100       100       100       100       100
CODE    Symbols     LR    97.16     1.08      1.08      97.16     97.16     100       100       100       100       100
CODE    Symbols     RF    98.69     2.17      2.17      98.69     98.69     100       100       100       100       100
CODE    Symbols     NN    94.98     3.26      3.26      94.98     94.98     100       100       100       100       100
CODE    Sections    LR    100       100       100       3.48      100       100       34.66     34.66     100       100
CODE    Sections    RF    100       3.48      3.48      100       100       100       100       100       100       100
CODE    Sections    NN    0.00      0.00      0.00      0.00      0.00      100       100       100       100       100
CODE    Segments    LR    96.51     0.00      0.00      96.51     96.51     99.83     99.83     99.83     99.83     99.83
CODE    Segments    RF    100       3.48      3.48      100       100       100       100       100       100       100
CODE    Segments    NN    3.48      3.48      3.48      3.48      3.48      100       100       100       100       100
CODE    Hexdumps    LR    98.03     97.60     97.60     98.03     98.03     99.66     86.16     86.16     99.66     99.66
CODE    Hexdumps    RF    98.25     1.74      1.74      98.25     98.25     100       92.83     92.83     100       100
CODE    Hexdumps    NN    96.51     0.00      0.00      96.51     96.51     100       100       100       100       100
CODE    Combined    LR    100       3.48      3.48      3.48      100       100       100       100       100       100
CODE    Combined    RF    99.78     3.26      3.26      99.56     99.78     100       100       100       100       100
CODE    Combined    NN    0.00      0.00      0.00      0.00      0.00      100       100       100       100       100


Table IV: Baseline evaluation under binary manipulation (%). Packed*: optimized packing, L.A.: learning algorithm.

V-B4 Binary Manipulation Attacks

In this setting, we evaluate the classifiers under manipulation attacks on the software. Here, we consider binary packing under default and optimized (Packed*) conditions, stripping, and padding. Table IV shows the evaluation results under these manipulation strategies. In the following, we interpret these results, posed as research questions.

RQ3. “Does binary packing affect the performance of the baseline classifiers?”

The evaluation results show that most of the classifiers identify packed software as malicious. This indicates that they treat packing itself as a malicious pattern. This observation is in line with Aghakhani et al. [aghakhani2020malware], who demonstrate that industry-standard Windows malware detection systems identify packed software as malicious. However, our results bring forward an exception: the Hexdump-based LR classifier maintains its performance under both levels of packing.

Key Finding: Baseline classifiers, in general, identify packing as malicious behavior.

RQ4. “Does stripping affect the baseline classifiers?”

Recall that stripping removes information, such as the debug information, from the software binaries. Nevertheless, the results show that the performance of most of the representations, such as the CFG, strings, and Hexdump, is intact.

Key Finding: Generally, existing approaches maintain high accuracy under binary stripping.

RQ5. “Does padding affect the baseline classifiers?”

Given that binary padding does not remove any existing functional code, it does not affect the analysis of the software. Therefore, it only affects the binary/image-based representation.

Key Finding: Binary padding only reduces the performance of binary/image-based classifiers and can be countered by reverse-engineering the software samples.
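
To make the padding argument concrete, the following sketch appends bytes to the end of a byte string and reshapes it into image rows. It is an illustration under stated assumptions: the helper names, the fixed row width of 16, and the zero-filled last row are hypothetical choices, and the image pipeline used in the paper (for example, any resizing step) may differ.

```python
def pad_binary(data: bytes, n_bytes: int, fill: int = 0) -> bytes:
    """Append `n_bytes` copies of the byte `fill` to the end of the contents."""
    return data + bytes([fill]) * n_bytes

def to_image_rows(data: bytes, width: int = 16):
    """Reshape raw bytes into rows of `width` pixel values (0-255).

    Assumption: fixed row width, with the last row zero-filled.
    """
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))
        rows.append(row)
    return rows

original = bytes(range(64))                    # stand-in for a real binary
padded = pad_binary(original, 32, fill=0xFF)   # 32 bytes appended at the end

orig_img = to_image_rows(original)
pad_img = to_image_rows(padded)
```

The original bytes are untouched, so any analysis of the existing code sees the same content, while the image gains extra rows. If the image were then resized to a fixed shape, which is an assumption about the pipeline, every pixel could shift.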

RQ6. “Which of the representations and learning algorithms are best suited for malicious IoT software detection?”

To answer this question, we considered the following metrics: (1) Baseline accuracy. A detector should have a minimal detection error (i.e., false positive and negative rates). (2) Performance consistency. The performance of the classifiers should be robust to various binary manipulation techniques. (3) Model stability. The robustness of the classifier should encompass Gaussian noise, to some extent. Altogether, the classifier that performed best is the Hexdump-based LR classifier, followed by the CFG algorithmic-based RF classifier.

Key Finding: The Hexdump-based LR classifier is the most robust classifier, providing a stable 98.96% baseline accuracy.
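
One possible way to quantify the performance-consistency criterion is the worst-case accuracy drop across manipulations. The sketch below applies it to the Hexdump rows of Table IV. The metric is an assumption made for illustration and is not the scoring used by the authors; only the accuracy values are taken from the table.

```python
CONDITIONS = ["Original", "Packed", "Packed*", "Stripped", "Padded"]

# Accuracy (%) copied from the Hexdump rows of Table IV, in CONDITIONS order.
TABLE_IV_HEXDUMP = {
    "LR": {"benign": [98.03, 97.60, 97.60, 98.03, 98.03],
           "malware": [99.66, 86.16, 86.16, 99.66, 99.66]},
    "RF": {"benign": [98.25, 1.74, 1.74, 98.25, 98.25],
           "malware": [100, 92.83, 92.83, 100, 100]},
    "NN": {"benign": [96.51, 0.00, 0.00, 96.51, 96.51],
           "malware": [100, 100, 100, 100, 100]},
}

def worst_case_drop(rows):
    """Largest drop (percentage points) from Original to any manipulation,
    taken over both the benign and the malware class."""
    drops = []
    for accs in rows.values():
        original = accs[0]
        drops.extend(original - a for a in accs[1:])
    return max(drops)

# Smaller worst-case drop means more consistent performance.
ranking = sorted(TABLE_IV_HEXDUMP,
                 key=lambda model: worst_case_drop(TABLE_IV_HEXDUMP[model]))
```

Under this metric, the LR classifier ranks first among the Hexdump classifiers, with a worst-case drop of about 13.5 points (malware under packing), while RF and NN lose more than 96 points on packed benign software.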

Figure 5: The online engines’ detection rate of the original and binary manipulated IoT malware samples. Panels: (a) Original, (b) Binary Packed, (c) Binary Packed*, (d) Binary Stripped, (e) Binary Padded.

VI Industry-Standard Detection Engines Robustness

Malware authors check their software against online detection engines to ensure that it evades them. Given that these scan engines provide results for a pool of anti-virus engines, evading detection by these engines serves as a prototype for malware evolution. These mutations are then used in future malware campaigns. We argue that a practical malware detector should detect such mutations, or at least cover low-effort mutations.

VI-A Experimental Setup

Online scan engines, such as VirusTotal, are commonly used by researchers to inspect software. VirusTotal reports contain the detection results of a pool of state-of-the-art anti-virus engines that can be considered as the up-to-date capability of industry-standard malware detectors. Overall, it contains reports from 66 IoT malware detection engines. Therefore, to have a comprehensive evaluation of the existing IoT malware detectors, we also evaluate the industry-standard malware detection systems.

VirusTotal Reporting. The original and manipulated software were uploaded to VirusTotal using their Large File Scan API. To account for the time the AI engines take to properly scan the uploaded files, we wait for 24 hours before gathering the reports. Each report contains details about the uploaded file, including the date, size, header information, and the scan results of each available detection engine. Each report contains the results of multiple engines (45-66), each indicating whether it detects the file as malicious. Additionally, we found two engines that report on fewer than ten samples, which we removed from our list. Ultimately, we scan the malicious and benign software through 64 detection engines.
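
The aggregation and filtering step described above can be sketched as follows. The report structure used here (one dict per sample mapping an engine name to a boolean verdict) is a hypothetical simplification and not the actual VirusTotal report schema. No VirusTotal API calls are made.

```python
from collections import defaultdict

def engine_detection_rates(reports, min_samples=10):
    """Per-engine detection rate (%) over a list of per-sample reports.

    Each report is a hypothetical dict {engine_name: bool}; True means the
    engine flagged the sample as malicious. An engine absent from a report
    returned no verdict for that sample.
    """
    flagged = defaultdict(int)
    seen = defaultdict(int)
    for report in reports:
        for engine, is_malicious in report.items():
            seen[engine] += 1
            flagged[engine] += int(is_malicious)
    # Drop engines that reported on fewer than `min_samples` samples.
    return {engine: 100.0 * flagged[engine] / seen[engine]
            for engine in seen if seen[engine] >= min_samples}

# Toy data: two engines report on all 12 samples, a third on only one.
reports = [{"E-1": True, "E-2": i % 2 == 0} for i in range(12)]
reports[0]["E-3"] = True
rates = engine_detection_rates(reports)
```

Here `E-3` is excluded because it reports on a single sample, mirroring the removal of engines with fewer than ten reported samples.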

AI-based Engines. The next step is to separate the AI-based engines from the other engines. This step is challenging, as the detection engines are unlikely to share their detection approaches with the public. We manually inspect each detection engine’s website, searching for the approaches used. Engines that explicitly mention AI or ML are labeled as AI (✓), while others are labeled as uncertain (✗).

Ethical Considerations. As stated by VirusTotal, the API is not meant to be used to compare the engines, nor to draw conclusions on whether engine X is better than engine Y. To this end, we take the following considerations: (1) All engines are renamed as “E — i”, where i is a given index for the engine. (2) The API is used to assert that state-of-the-art scan engines are vulnerable and behave similarly to the research-based detection approaches discussed in section V. We do not intend to compare the engines, nor raise concerns against any specific service provider.


              Benign                                            Malware
Engine   AI   Original  Packed    Packed*   Stripped  Padded    Original  Packed    Packed*   Stripped  Padded
E — 1         100       86.41     89.68     100       100       100       82.79     82.94     100       100
E — 2         100       100       100       100       100       98.33     33.83     34.67     97.33     23.5
E — 3         100       100       100       100       100       99.5      34.67     35.5      98.5      37.0
E — 4         100       100       100       100       100       99.33     94.5      96.33     99.33     95.29
E — 5         100       -         -         100       100       100       100       100       100       100
E — 6         100       100       100       100       100       99.67     99.67     99.67     99.66     99.67
E — 7         100       100       100       100       100       0.0       0.0       0.0       0.0       0.0
E — 22        100       100       100       100       100       80.61     29.15     29.34     79.16     4.04
E — 23        100       100       100       100       100       99.67     99.67     99.5      99.5      97.33
E — 24        100       100       100       100       100       50.34     29.36     29.88     85.21     59.97
E — 25        100       100       100       100       100       84.8      28.42     28.52     81.27     4.65
E — 26        100       100       100       100       100       100       58.29     58.66     98.99     40.37
E — 27        100       85.84     90.07     100       100       100       82.78     82.8      100       100
E — 28        100       100       100       100       100       99.83     99.83     99.83     99.66     95.41
E — 29        100       100       100       100       100       0.0       0.0       0.0       0.0       0.0


Table V: The online IoT malware detection engines evaluation (%). Packed*: optimized packing.

VI-B Evaluation & Results

In the following, we interpret the results of the industry-standard malware detectors, shown in Table V, to understand their behavior, and present them as research questions. Due to space constraints, we only report detailed results for 15 engines. The major insights are illustrated in Figure 6.

RQ7. “Does manipulation affect malware detection rate?”

To answer this question, we recorded the number of engines that identify malware as malicious. We begin by probing the original malware samples: Figure 5(a) shows the distribution of their detection rate by the engines. Notice that malware, on average, is detected by 40 engines, with the majority of samples being detected by 35-45 engines. For the manipulated samples, however, the detection rate varies highly. Figure 5 shows the distribution of malicious samples by the number of engines for each of the manipulation strategies. We notice that stripping (Figure 5(d)) does not affect the distribution of the samples. However, packing (Figure 5(b) and (c)) highly affects the detection rate. Moreover, while binary padding had minimal effects on the baseline classifiers’ performance (section V), it highly affects detection among the online engines. This indicates that several engines use binary-based representations (e.g., binary sequence and image) to detect malicious software.

Key Finding: Except for binary stripping, binary manipulation highly decreases the detection confidence.
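
The per-sample view used for this question can be sketched as a histogram of how many engines flag each sample. As before, the report format (a dict of engine name to boolean verdict) and the bin width of 5 are hypothetical choices for illustration only.

```python
from collections import Counter

def detections_per_sample(reports):
    """Number of engines flagging each sample as malicious."""
    return [sum(1 for flagged in report.values() if flagged)
            for report in reports]

def bin_counts(counts, bin_width=5):
    """Histogram of detection counts, keyed by the lower edge of each bin."""
    bins = Counter((c // bin_width) * bin_width for c in counts)
    return dict(sorted(bins.items()))

# Toy reports: sample k is flagged by the first k of 12 engines.
reports = [{f"E-{i}": i < k for i in range(12)} for k in (3, 4, 11, 12)]
counts = detections_per_sample(reports)
hist = bin_counts(counts)
```

Comparing such histograms for the original samples and for each manipulation gives the kind of distribution shift discussed above.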

RQ8. “How do individual engines generally perform?”

To answer this question, we evaluate each individual detection engine using the original and manipulated benign and malicious software, as shown in Table V. We observe that multiple engines perform poorly, with 36% of the engines (23 out of 64) failing to identify malware (0% accuracy), such as “E — 7” and “E — 29”. Additionally, except for “E — 1” and “E — 27”, the benign detection accuracy is 100%. Similar trends were observed for packed, stripped, and padded benign software.

Key Finding: Several engines (36%) exhibit reduced performance for detecting original and binary manipulated malicious software.

RQ9. “Does packing affect the engines’ performance?”

The evaluations show that packing does not affect the performance of the engines in accurately detecting benign software (except for “E — 1” and “E — 27”). This observation is in contrast to previous observations [aghakhani2020malware] (refer to section V). However, packing generally reduces the accuracy of malware being detected as malware. For instance, the performance of “E — 3” declined from 99.5% to 34.67% when tested with packed malware. We also observed that optimized packing does not decrease the detection rate further. In fact, it slightly increases the chance of malicious software being detected, as compared to the standard packing. Additionally, for some engines, such as “E — 5”, we observe that no results were reported for benign packed binaries, while they achieve 100% in the other categories. This can be attributed to the low confidence of the engine in labeling benign packed samples.

Key Finding: Although packing reduces the detection rate of malicious software, it has no effect on the benign software detection rate. Optimized packing has a higher detection rate in comparison with default packing.

Figure 6: Industry-standard detection engines robustness highlight. Binary packing significantly reduces the detection rate of malware (“E — 2”). Binary stripping does not result in noticeable performance degradation, and may increase the malware detection rate (“E — 22”). Simple binary padding to the end of the file may cause significant degradation in the performance (“E — 3” and “E — 22”).

RQ10. “Does stripping affect the engines’ performance?”

There is no noticeable decrease (1%) in the detection accuracy of stripped software in the case of online engines. In fact, for some engines (e.g., “E — 24”), the malware detection performance increased from 50.34% to 85.21% after stripping.

Key Finding: Stripping has no negative effect on the performance of the engines, and even increases the accuracy in some instances.

RQ11. “Does padding affect the engines’ performance?”

Binary padding significantly decreases the performance of several online engines, such as “E — 2”, “E — 3”, and “E — 22”. This may be attributed to the fact that appending bytes to the binaries disrupts the existing signatures. The online engines’ reports show that several of them are affected negatively, with some exhibiting a drastic decrease in performance. Although padding does not affect the reverse-engineered features, the decrease in performance indicates that the engines use raw binary representations (e.g., binary sequence- and image-based) for classification, which apparently can be easily disrupted.

Key Finding: Binary padding highly reduces the performance of several engines, while leaving others intact.
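
As a toy illustration of how appended bytes could disrupt one kind of signature, the sketch below contrasts a whole-file hash with a byte-pattern match. The signature schemes actually used by the online engines are not public, so both functions and the sample bytes are assumptions made purely for illustration.

```python
import hashlib

def file_hash_signature(data: bytes) -> str:
    """Whole-file signature: SHA-256 digest of the entire contents."""
    return hashlib.sha256(data).hexdigest()

def contains_pattern(data: bytes, pattern: bytes) -> bool:
    """Content signature: does a known byte pattern occur anywhere?"""
    return pattern in data

# Hypothetical sample: header, filler, and a marker standing in for known code.
sample = b"\x7fELF" + b"\x90" * 32 + b"malicious-routine" + b"\x00" * 16
padded = sample + b"\x00" * 128   # padding appended to the end of the file

known_hash = file_hash_signature(sample)
hash_match_after_padding = file_hash_signature(padded) == known_hash
pattern_match_after_padding = contains_pattern(padded, b"malicious-routine")
```

The whole-file hash no longer matches after padding, while the pattern match still succeeds. This is consistent with padding leaving some engines intact while degrading others, although it does not establish which technique any given engine uses.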

VII Concluding Remarks

Malware analysis and detection have been the focus of the research community and the industry alike, with many advances in defenses through the use of AI-backed systems. Despite those advances, these systems have been shown to be vulnerable to several simple-yet-effective adversarial attacks, such as binary stripping and packing. With this work, we systematically evaluate the state of a range of malware detectors, both those proposed by the research community and the industry-standard ones.

Our efforts show that malware detectors proposed in the literature are vulnerable to adversarial perturbation and binary manipulation attacks. Similarly, industry-standard malware detectors are prone to such attacks. Our efforts also unveil the status quo of the existing detectors, and bring forward various insights to consider when proposing detection systems. In particular, in addition to optimizing baseline malware detection accuracy, researchers should take into account the robustness of the proposed systems under adversarial capabilities. This calls for a deep understanding of the underlying learning algorithms and data representations, alongside the learned patterns and their characteristics.