Leveraging Uncertainty for Effective Malware Mitigation

A promising avenue for improving the effectiveness of behavioral-based malware detectors would be to combine fast traditional machine learning detectors with high-accuracy, but time-consuming deep learning models. The main idea would be to place software receiving borderline classifications by traditional machine learning methods in an environment where uncertainty is added, while the software is analyzed by more time-consuming deep learning models. The goal of uncertainty would be to rate-limit the actions of potential malware during the time-consuming deep analysis. In this paper, we present a detailed description of the analysis and implementation of CHAMELEON, a framework for realizing this uncertain environment for Linux. CHAMELEON offers two environments for software: (i) standard - for any software identified as benign by conventional machine learning methods and (ii) uncertain - for software receiving borderline classifications when analyzed by these conventional machine learning methods. The uncertain environment adds obstacles to software execution through random perturbations applied probabilistically on selected system calls. We evaluated CHAMELEON with 113 applications and 100 malware samples for Linux. Our results showed that at a threshold of 10%, non-intrusive strategies caused approximately 65% of malware to fail in accomplishing their tasks, while approximately 30% of the analyzed benign software met with various levels of disruption. With a dynamic, per-system call threshold, CHAMELEON caused 92% of the malware to fail, and only 10% of the benign software to be disrupted. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software. Further, we analyzed the logs of software that crashed under non-intrusive strategies, and found that some crashes were due to software bugs.




1 Introduction

Attacks are continuously evolving and existing protection mechanisms have not been coping well with the increased sophistication of attacks, especially advanced persistent threats (APTs), which target organizations. Malware used in APTs attempts to blend in with approved corporate software and traffic, and act slowly, thus evading detection. As a result, by the time an APT attack is discovered, sensitive information has already been exfiltrated and many computers have been compromised, making recovery difficult [1, 2].

Real-time malware detection is challenging. The industry still relies on antivirus technology for threat detection [3, 4], which is effective for malware with known signatures, but not sustainable given the massive amount of new malware samples released daily. Additionally, since zero-day malware has no known signature, and polymorphic and metamorphic attacks constantly change their patterns, signature scanning operates at a practical detection rate of only 25% to 50% [5]. Alternative approaches identify behavioral properties, such as unusual sequences of system calls, and use behavioral patterns to characterize malware. However, research has shown that behavior-based detectors suffer from a high false-positive rate [6, 7], because of the increasing complexity and diversity of current software. Aggressive heuristics, such as erring on the side of blocking suspicious software, can interfere with employee productivity, resulting in employees overriding or circumventing security policies.

Recently, deep learning has achieved state-of-the-art results in a broad spectrum of applications, and has been considered a promising direction for behavior-based approaches with high detection rates. However, the direct application of pure deep learning methods in practical, on-the-fly malware detection is challenging, because deep learning algorithms require more computation time for classification (about 100 times more than conventional machine learning algorithms), and a considerable amount of time and memory for re-training as new malware samples become available. Incremental retraining is a common requirement for malware detection, as new variants and samples are regularly discovered and added to the training set.

Thus, it seems that there is a trade-off for behavioral-based malware detection solutions. If the solution prioritizes speed with fast conventional machine learning algorithms, it might lose accuracy and risk generating high false positive rates. If the solution prioritizes accuracy, it might not be entirely practical for on-the-fly application given the computational overhead of emerging deep learning methods: long re-training time and memory resources and higher response time for classification.

A promising solution would be to have the best of these two worlds: combining both conventional machine learning methods and emerging deep learning methods and applying them where their advantages are leveraged and their limitations are downplayed, via a two-phase detection stage. The main idea is as follows. All software in the system starts running in a standard OS environment and is continuously monitored through a behavioral detector based on conventional machine learning algorithms, which provide fast classification and retraining. If a piece of software receives a borderline classification (i.e., reaches a threshold set by the system administrator), it is moved to an uncertain environment. In this environment the software will experience probabilistic and random perturbations, whose severity will depend on whether the software is whitelisted. The goal of these perturbations is to thwart the actions of potential malware or compromised benign software while deep learning analysis is underway. If the deep analysis finds the software benign, it is placed back in the standard environment, where it is again continuously monitored.

In this paper, we present a detailed description of the design and implementation, as well as new extensions of Chameleon, a framework realizing this spectrum OS behavior for Linux. Chameleon has the potential to allow the successful combination of conventional machine learning detection methods with the power of emerging deep learning models for real-time, on-the-fly malware detection with the goal to protect organization computers against sophisticated and stealthy malware.

Today, it is standard practice for organizations to restrict or whitelist mission-critical software used by their employees [6]. Employees are supposed to only use approved software tied to their primary task and are allowed to use some personal software. Chameleon applies non-intrusive strategies (e.g., delay a system call execution) to whitelisted software, and intrusive strategies (e.g., increase or decrease the bytes in a buffer passed as a parameter to a system call) to non-whitelisted software.

Chameleon introduces perturbations to the execution of software running in the uncertain environment, with the goal to “buy time” for deep learning-based detectors to provide a definitive and accurate classification of a piece of software. Beyond enabling this combination of detectors, Chameleon has the potential to advance systems security, as it can (i) make systems diverse by design because of the unpredictable execution in the uncertain environment, (ii) increase attackers’ workload, and (iii) decrease the speed of attacks and their chance of success.

We evaluated Chameleon [8] with 100 samples of Linux malware and 113 common applications from several categories. Our results show that at a threshold of 10%, intrusive strategies thwarted 62% of malware, while non-intrusive strategies caused a failure rate of 68%. At a threshold of 50%, the percentage of adversely affected malware increased to 81% and 76%, respectively. With a 10% threshold, the perturbations also caused various levels of disruption (crash or hampered execution) to approximately 30% of the analyzed benign software. With a 50% threshold, the percentage of benign software adversely affected rose to 50%. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software.

In this paper we introduce an optional dynamic, per-system call perturbation threshold. Our analysis shows that the application of such a personalized threshold caused 92% of the malware to fail to accomplish their tasks, and impacted only 10% of the benign software. Compared with a static threshold, this personalized threshold left 20% more benign software unaffected by the perturbations and caused 24% more malware to crash or be hampered in the uncertain environment. We also analyzed the crash logs from benign software undergoing non-intrusive perturbations, and found that software bugs actually caused the crashes.

In this paper, we extend our earlier work on Chameleon [8] and present the following new contributions.

  • We designed and implemented a dynamic, per-system call perturbation threshold based on the behavior of software execution. We showed that such threshold is more effective by bringing more overall adverse effects to malware execution and less impact to benign software execution, compared with a static threshold.

  • We designed and implemented a fully automated testbed for collecting system call traces (at kernel level) from malware and benign software while such software is under perturbation. This testbed can be leveraged to analyze benign software behavior under OS misbehavior and help developers pinpoint portions of their software that are sensitive to misbehavior, thus leading to more resilient software.

  • We explored why benign software crashes in the uncertain environment, and found that 4 out of the 5 crashes were actually caused by software bugs whose behavior was exposed by the perturbations we added. We argue that resilient benign software will be less affected by the uncertain environment. Moreover, Chameleon can be used as a framework to locate software bugs by analyzing the logged system calls and their parameters invoked before software crashes.

This paper is organized as follows. Section 2 describes our threat model and assumptions. Section 3 describes in detail Chameleon’s design and implementation, including the newly proposed per-system call perturbation threshold. Section 4 describes Chameleon’s security and performance evaluation, including our analysis of causes of crashes for benign software in the uncertain environment. Section 5 discusses and summarizes Chameleon’s results and limitations. Section 6 summarizes related work on malware detection, software diversity, and attempts on unpredictability as a security mechanism. Section 7 concludes the paper.

2 Threat Model and Assumptions

Chameleon’s goal is to provide an environment that rate-limits the effects of potential malware, while more time-consuming deep analysis is underway. Chameleon’s protection is designed for corporations and similar organizations, which have already adopted a standard practice of controlling software running at their perimeter [7]. Organizations face the challenge of enforcing perimeter security, while also requiring minimum interference to employees’ primary tasks. The combination of fast, preliminary classification by traditional machine learning methods and more time-consuming and more accurate deep analysis for borderline cases can help address this challenge.

We assume that if an organization is a target of a well-motivated attacker, malware will eventually get in. A classic scenario is when a C-level executive of a targeted organization falls victim to a spear-phishing email attack, thereby causing an APT backdoor to be installed in one of the computers of the victim’s company. The malware is zero-day and is not detected by any antivirus (signature-based or behavior-based). In a standard OS, this APT would infiltrate and compromise the organization. With Chameleon, applied in a hybrid behavioral-based detection solution as described in Section 1, this APT might receive a borderline classification at some point by a conventional machine learning detector and would then be placed in the uncertain environment. In this environment the APT backdoor would encounter obstacles and delays to operate, while more time- and resource-consuming deep analysis is underway.

We assume that whitelisted software receiving a borderline classification by a conventional machine learning detector can be an indication of a software compromise. Of note, Chameleon does not compete with standard lines of defenses, such as antiviruses and traditional behavioral-based detectors, but actually complements them by equipping these solutions with a safety net in the case of misdiagnosis.

3 Design and Implementation

We designed and implemented Chameleon for the Linux OS. Chameleon offers two environments to its processes: (i) a standard environment, which works predictably as any OS, and (ii) an uncertain environment, where a subset of the OS system calls undergo unpredictable interferences.

The key insight is that interference in the uncertain environment will hamper the malware’s chances of success, as some system calls might return errors in accessing system resources, such as network connections or files. Moreover, random unavailability and some delays will make gaining CPU time difficult for malware.

3.1 The Interference Set

Our first step was deciding which system calls were good candidates for interference. We relied on Tsai et al.’s study [9], which ranked Linux system calls by their likelihood of use by applications. Based on these insights, we selected 37 system calls for the interference set to represent various OS functionalities relevant for malware (file, network, and process-related). Most of these system calls (summarized in Table I) are I/O-bound, since I/O is essential to most malware, regardless of its sophistication level.

We introduced new versions for all system calls in the interference set. When Chameleon’s uncertainty module is loaded, it records the pointer to each system call in the interference set as orig_<syscall_name> and alters the corresponding table entry to point to my_<syscall_name>().
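The table-swap step can be illustrated with a user-space analogy. The Python sketch below is not kernel code: a dictionary stands in for the system call table, and `syscall_table`, `install_hooks`, `remove_hooks`, and the toy `sys_write` are hypothetical names introduced only for illustration.

```python
# User-space analogy of system call table hooking (not kernel code).
# All names here are hypothetical stand-ins for illustration.

def sys_write(fd, buf, size):
    """Toy 'original' system call: pretends to write, returns bytes written."""
    return min(size, len(buf))

# A dictionary plays the role of the kernel's system call table.
syscall_table = {"sys_write": sys_write}
orig = {}  # saved entries, analogous to orig_<syscall_name>

def make_wrapper(name):
    """Build my_<syscall_name>: a hook that sees the call before the original."""
    def wrapper(*args):
        wrapper.calls += 1          # the hook observes every invocation
        return orig[name](*args)    # here it simply forwards to the original
    wrapper.calls = 0
    wrapper.__name__ = "my_" + name
    return wrapper

def install_hooks(interference_set):
    """Save each original entry and point the table at the wrapper instead."""
    for name in interference_set:
        orig[name] = syscall_table[name]
        syscall_table[name] = make_wrapper(name)

def remove_hooks():
    """Restore the saved entries, as unloading the module would."""
    for name, fn in orig.items():
        syscall_table[name] = fn
    orig.clear()

install_hooks(["sys_write"])
result = syscall_table["sys_write"](1, b"hello", 5)
```

Callers keep using the table entry and never notice the swap. The wrapper is the single place where an interference decision could be made before deferring to the saved original.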

Category          System calls
File related      sys_open, sys_openat, sys_creat, sys_read, sys_readv, sys_write, sys_writev, sys_lseek, sys_close, sys_stat, sys_lstat, sys_fstat, sys_stat64, sys_lstat64, sys_fstat64, sys_dup, sys_dup2, sys_dup3, sys_unlink, sys_rename
Network related   sys_bind, sys_listen, sys_connect, sys_accept, sys_accept4, sys_sendto, sys_recvfrom, sys_sendmsg, sys_recvmsg, sys_socketcall
Process related   sys_preadv, sys_pread64, sys_pwritev, sys_pwrite64, sys_fork, sys_clone, sys_nanosleep
Table I: System call Interference Set.

3.2 Interference Strategies

We introduced two sets of interference strategies. The first set, non-intrusive, perturbs software execution within the OS specification, and applies to whitelisted software running in the uncertain environment. The second set, intrusive, might cause corruptive perturbations, and applies to non-whitelisted software running in the uncertain environment.

3.2.1 Non-intrusive Strategies

System call silencing with error return: The system call immediately returns an error value randomly selected from the range [-255, -1]. This strategy can create difficulties for the execution of the process, especially if the process does not handle errors well. Further, this strategy can cause transient unavailability to resources, such as files and network connections, creating difficulties for malware such as a fork bomb or a network flooder to operate. Of note, all error returns are within the OS specification; most system calls in Linux have an expected set of return values, and software might fail to check for return errors.

Process delay: The system call injects a random delay within the range [0, 0.1s] during its execution, with the goal of slowing down potential malware. It can hinder timely malware communication with a C&C server for file exfiltration, and it can prevent flooders from sending enough packets in a very short time, rate-limiting DoS attacks on a victim server.

Process priority decrease: The system call decreases the dynamic process priority to the lowest possible value, delaying its scheduling to one of the system’s CPUs. This strategy can hamper malware execution, buying time for a definitive detection by a deep learning analyzer.
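The three non-intrusive strategies can be sketched in user space as follows. This is a simplified Python model, not the kernel implementation: the helper names, the dictionary representing a process, and the use of nice value 19 for "lowest priority" are assumptions made for illustration. The error range and the 0.1 s delay bound come from the descriptions above.

```python
import random
import time

MAX_DELAY = 0.1  # seconds; upper bound of the delay range given in the text

def silence_with_error(rng):
    """Skip the call and return a random error value in [-255, -1]."""
    return rng.randint(-255, -1)

def delay_then_call(rng, call, *args, sleep=time.sleep):
    """Inject a random delay in [0, MAX_DELAY], then run the real call."""
    sleep(rng.uniform(0.0, MAX_DELAY))
    return call(*args)

def deprioritize_then_call(process, call, *args):
    """Drop the process to the lowest priority, then run the real call.
    The 'nice' field and the value 19 are modelling assumptions."""
    process["nice"] = 19
    return call(*args)

def fake_write(fd, buf, size):
    """Stand-in for the original system call."""
    return size

rng = random.Random(0)
proc = {"pid": 1234, "nice": 0}
err = silence_with_error(rng)
delays = []  # record the delay instead of actually sleeping
written = delay_then_call(rng, fake_write, 1, b"abc", 3, sleep=delays.append)
written_low = deprioritize_then_call(proc, fake_write, 1, b"abc", 3)
```

Note that only the first strategy changes what the caller sees. The other two still execute the original call and affect timing alone, which is why they stay within the OS specification.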

3.2.2 Intrusive Strategies

System call silencing: The system call immediately returns a value (without being executed) indicating a successful execution.

Buffer bytes change: The system call decreases the number of bytes in a buffer passed as a parameter to a system call. It can be applied to all system calls with a buffer parameter, such as sys_read, sys_write, sys_sendto and sys_recvfrom. This strategy can corrupt the execution of malicious scripts, thus making the exfiltration of sensitive data more difficult. This strategy also targets viruses: disrupting a buffer that carries a malicious payload destined for a victim’s ELF header can corrupt the victim, which then loses its ability to infect other files.

Connection restriction: The strategy changes the IP address in sys_bind, or limits the queue length for established sockets waiting to be accepted in sys_listen. The IP address can be randomly changed, which will likely cause an error, or it can be set to the IP address of a honeypot, allowing backdoors to be traced.

File offset change: The strategy changes a file pointer in the sys_lseek system call so that subsequent invocations of sys_write and sys_read will access unpredictable file contents within a specified, configurable range.
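A minimal Python sketch of the intrusive strategies, under stated assumptions, is given below. The function names, the `max_shift` parameter standing for the configurable range, and the backlog cap of 1 are hypothetical; the sketch only models the parameter manipulation, not the kernel-side system calls.

```python
import random

def silence(size):
    """System call silencing: skip the call but report success."""
    return size

def shrink_length(rng, size):
    """Buffer bytes change: pass on fewer bytes than the caller asked for."""
    return rng.randint(0, size - 1) if size > 0 else 0

def shift_offset(rng, offset, max_shift):
    """File offset change: move the offset by a random amount within
    [-max_shift, max_shift], never below the start of the file."""
    return max(0, offset + rng.randint(-max_shift, max_shift))

def restrict_backlog(backlog, cap=1):
    """Connection restriction for sys_listen: cap the pending-connection
    queue length. The cap value is an assumption."""
    return min(backlog, cap)

rng = random.Random(1)
payload = b"sensitive-data"
new_len = shrink_length(rng, len(payload))
sent = payload[:new_len]          # what a truncated write would carry
new_off = shift_offset(rng, 100, 16)
```

Unlike the non-intrusive strategies, each of these reports success to the caller while changing what actually happens, so a program cannot detect the interference from return values alone.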

3.3 System Architecture

To implement the uncertain environment, the following fields were added to the Linux task_struct.

process_env: It informs if the process should run in the standard or uncertain environment.

fd_list: It keeps a list of critical file descriptors during runtime execution. Interference on system files, such as libraries or devices, will likely crash the program execution. Thus, interference is not applied to system calls manipulating those file descriptors (see Section 3.5 for more details).

strategy_set: It informs if the process should be perturbed with non-intrusive strategies or intrusive strategies.

threshold: It represents the probability that a system call from the interference set invoked by a process in the uncertain environment will undergo interference. The higher the threshold, the higher the probability that an interference strategy will be applied.

Figure 1: Chameleon’s architecture. When a process running in the uncertain environment invokes a system call in the interference set (Step 1), the Uncertainty Module checks if the process is running in the uncertain environment (Step 2), and depending on the execution of the corruption protection mechanism (Step 3), randomly selects an interference strategy to apply to the system call. The corruption protection mechanism prevents interferences during accesses to critical files, such as libraries.

Figure 1 illustrates Chameleon’s architecture and operation. A key component of Chameleon is a loadable kernel module, the Uncertainty Module, which monitors the execution of all system calls in the interference set, and applies a randomly chosen interference strategy to the system call, depending on the process environment and the interference threshold.

For example, consider Process 2 in Figure 1, loaded in the uncertain environment invoking sys_write (Step 1). Because sys_write is in the interference set, it can introduce uncertainty in its own execution. First the system call inspects Process 2’s environment and finds that it runs in the uncertain environment (Step 2). Next, sys_write runs the corruption protection mechanism (see Section 3.5) to make sure that no interference will occur if the system call is accessing a critical file (Step 3). If sys_write is not accessing a critical file, Chameleon decides based on the threshold whether or not a strategy should be applied. If a strategy is to be applied, sys_write randomly selects one of the strategies that can be applied to its execution.

3.4 Per-System Call Interference Threshold

As an extension of Chameleon, representing a new contribution of this paper, we introduced an optional per-system call interference threshold. As described above, the newly added field threshold in Linux represents the probability that any system call from the interference set invoked by a process in the uncertain environment will undergo interference. Once configured, Chameleon applies the same threshold to all system calls in the set without considering execution context.

In real-time malware detection, the machine learning analyzer will dynamically produce the probability of a piece of software being malware. We aim to adjust the threshold in proportion to this probability.

Because we did not include the machine learning detector in this paper, we simulated a simplified situation with a dynamically changed probability of software being malware. The probability was considered higher when we observed the following behaviors.

  1. Frequent invocation of one type of system call or a pattern of several system calls, such as sys_open(), sys_fork() and sys_sendto();

  2. Writing to ELF executable headers;

  3. Redirection of the system standard input, output or error;

  4. Renaming or unlinking of system binaries.

Behavior (1) generalizes many types of malware operations. This behavior can represent a flooder sending millions of packets to block a server, a botnet trying to scan victim IPs and report back to the C&C server, a password cracker attempting to brute-force an SSH session key, or a fork bomb trying to use up system resources. Behavior (2) is common for viruses trying to inject themselves into other benign executables or source code files. Behavior (3) is common for malware opening a backdoor or a reverse shell, a crucial step for the operations of C&C servers. Behavior (4) is common for malware replacing system files with Trojans. All in all, I/O operations are the lifeblood of malware, which eventually depends on one or a combination of the behaviors described above to perform its primary malicious tasks.

Benign software, unless under debugging or configuration modes, is less likely to show the above behaviors, especially clusters of them. Therefore, interference on system calls relevant to these types of behaviors should disrupt malware more than benign software. (Developers performing debugging tasks or system administrators could have a different customization for the uncertain environment.)

This per-system call threshold is dynamically adjusted as follows. All processes loaded in the uncertain environment start with a default threshold. During a process’s execution, we dynamically adjust the threshold according to its system call invocation pattern. For Behavior (1), given that different programs invoke system calls at different frequencies, we define “frequent invocation” through the ratio of the number of invocations of a system call pattern to the total number of system calls invoked by the process. This ratio is a per-process, per-system call variable stored in the uncertainty module.

If the ratio is larger than a configured minimum ratio, the system call pattern is considered a “frequent invocation”. As stated in Behavior (1), a higher frequency of invocations indicates a higher probability of the software being malware. Therefore, the threshold and the ratio should be in a proportional relationship [10]: a configured proportion relates the ratio to the adjusted threshold. In any case, the threshold will be smaller than a configured maximum threshold.

Since Behaviors 2-4 indicate a strong likelihood of the software being malware, the threshold is adjusted to the maximum threshold if a process exhibits any of them. Examples of a process’ actions that can be classified as Behaviors 2-4 are: sys_write("\177ELF") (Behavior 2), sys_dup(0), sys_dup(1) (Behavior 3), and sys_unlink("bin") or sys_rename("bin") (Behavior 4).

In our study we configured four parameters: the default threshold, the minimum ratio, the proportion, and the maximum threshold. The maximum threshold was not set to 100%, because we do not expect processes receiving a borderline classification and transferred to the uncertain environment to completely stop running. To avoid early termination of the program, the dynamic threshold only starts to update after a defined number of system calls has been monitored (100 in our case).
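The adjustment rule can be sketched as a small Python function. This is one possible reading rather than the authors' implementation: the parameter names and all numeric values except the 100-call warm-up are placeholders, and computing the adjusted threshold as the proportion times the ratio, capped at the maximum, is an assumption about what "proportional relationship" means here.

```python
# Placeholder parameters: the concrete values used in the study are not
# reproduced here, so these numbers are illustrative assumptions only.
THR_DEFAULT = 0.10   # starting threshold for every process
RATIO_MIN = 0.30     # ratio above which a pattern counts as "frequent"
K = 1.0              # proportion relating the ratio to the threshold
THR_MAX = 0.90       # upper bound, deliberately below 1.0
WARMUP_CALLS = 100   # from the text: no updates before 100 monitored calls

def adjusted_threshold(pattern_cnt, total_cnt, suspicious=False):
    """Return the interference probability for one system call pattern.

    pattern_cnt: invocations of this system call (pattern) by the process
    total_cnt:   all system calls invoked by the process so far
    suspicious:  True if Behavior 2, 3 or 4 was observed
    """
    if total_cnt < WARMUP_CALLS:
        return THR_DEFAULT              # too early to judge the process
    if suspicious:
        return THR_MAX                  # strong indicator: use the cap
    ratio = pattern_cnt / total_cnt
    if ratio <= RATIO_MIN:
        return THR_DEFAULT              # not a "frequent invocation"
    return min(K * ratio, THR_MAX)      # proportional, but capped
```

With these placeholder values, a process whose sys_sendto calls make up half of its first 100 system calls would see its threshold rise from 0.10 to 0.50, while a process that writes an ELF header would jump straight to the cap.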

3.5 Corruption Protection Mechanism

The uncertainty module employs a corruption protection mechanism to prevent interference while a process in the uncertain environment is accessing critical system files, which might cause early termination of the process. The files are identified through file descriptors, created by sys_open, sys_openat and sys_creat, and are deleted by sys_close. System calls whose parameters are file descriptors, such as sys_lseek, sys_read and sys_write, are under this protection mechanism. These protected files are determined by an administrator and tracked by setting an extended attribute in the file’s inode in the .security namespace (a similar strategy is employed by SELinux [11]).

When a process running in the uncertain environment opens a file with a pathname beginning with critical directories or containing keywords, the file descriptor (fd) is added to a new per-process data structure, fd_list. Later, when this process invokes sys_read or sys_write referring to an fd in fd_list, the protection mechanism will prevent interference strategies from being applied to these system calls.
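The bookkeeping behind this mechanism can be sketched in Python as follows. The directory prefixes, the keyword, and the helper names `on_open`, `on_close`, and `is_critical` are hypothetical; in the described design an administrator decides which files are protected, and the check runs inside the kernel hooks.

```python
# Hypothetical critical path prefixes and keywords, for illustration only.
CRITICAL_PREFIXES = ("/lib", "/usr/lib", "/dev")
CRITICAL_KEYWORDS = (".so",)

class Process:
    def __init__(self):
        self.fd_list = set()   # descriptors of critical files currently open
        self.next_fd = 3       # 0-2 are stdin, stdout, stderr

def is_critical(path):
    """A path is critical if it starts with a protected directory
    or contains a protected keyword."""
    return path.startswith(CRITICAL_PREFIXES) or any(
        keyword in path for keyword in CRITICAL_KEYWORDS)

def on_open(proc, path):
    """Called from the open/openat/creat hooks: record critical descriptors."""
    fd = proc.next_fd
    proc.next_fd += 1
    if is_critical(path):
        proc.fd_list.add(fd)
    return fd

def on_close(proc, fd):
    """Called from the close hook: forget the descriptor."""
    proc.fd_list.discard(fd)

def corruption_protection(proc, fd):
    """True means: do not interfere with a call on this descriptor."""
    return fd in proc.fd_list

proc = Process()
lib_fd = on_open(proc, "/lib/x86_64-linux-gnu/libc.so.6")
tmp_fd = on_open(proc, "/tmp/out.txt")
```

Removing the descriptor on close matters because file descriptor numbers are reused: without it, a later non-critical file that receives the same number would inherit the protection.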

Algorithm 1 shows how the OS applies the interference strategies to sys_write. First, the following conditions are checked: (i) the process is running in the standard environment (process_env == 0), and (ii) the targeted file descriptor refers to a critical system file (as described above). If either of the two conditions is true, the system call runs normally. Otherwise, the system call updates the execution counters of the current process (i.e., the total number of system calls invoked, total_syscall_cnt, and the total number of sys_write invocations, write_cnt) and checks whether sys_write is within frequent system call sequences. Then the algorithm generates a random number in the range [0,1], and if the number is smaller than the threshold, the system call undergoes interference.

The algorithm will randomly select one of the interference strategies based on the strategy type. If non-intrusive strategies are selected, one of the following strategies will be randomly selected for execution: System call silencing with error return, Process delay, or Process priority decrease. If sys_write is silenced, a random error code is returned, so that the process knows that an error occurred. If Process delay is chosen, the algorithm randomly selects a delay for the system call execution in the range [0, MAX_DELAY]. If Process priority decrease is selected, the algorithm decreases the process priority to the minimum.

Function long my_sys_write(fd, buf, size)
        if process_env == 0 or corruption_protection(sys_write, pid, fd_list) then
               return orig_sys_write(fd, buf, size);
        else
               /* freq drives the per-system call threshold adjustment (Section 3.4) */
               freq = isFrequentCalls(total_syscall_cnt++, write_cnt++);
               if random(0, 1) >= threshold then
                      return orig_sys_write(fd, buf, size);
               end if
               strategy = random(1,3);
               if strategy_set is non-intrusive then
                      if strategy == 1 then
                             /* Silence system call with error */
                             return random(-255, -1);
                      else if strategy == 2 then
                             /* Delay process */
                             delay(random(0, MAX_DELAY));
                             return orig_sys_write(fd, buf, size);
                      else
                             /* Process priority reduction */
                             decrease the priority of the current process to the minimum;
                             return orig_sys_write(fd, buf, size);
                      end if
               else
                      if strategy == 1 then
                             /* Silence system call */
                             return size;
                      else if strategy == 2 then
                             /* Change buffer bytes */
                             newbuf = injectRandomBytes(buf);
                             return orig_sys_write(fd, newbuf, size);
                      else
                             /* Change buffer length */
                             red_len = reduceLength(size);
                             return orig_sys_write(fd, buf, red_len);
                      end if
               end if
        end if
Algorithm 1 Applying interferences to sys_write()
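
To make the control flow of Algorithm 1 concrete, the following Python simulation runs the same decision sequence in user space. It is a sketch under stated assumptions, not the kernel module: a dictionary models the per-process fields, the delay is accumulated rather than slept, nice value 19 stands for the minimum priority, and the outcome labels and the `simulate` helper are hypothetical. The dynamic threshold adjustment is left out, so the threshold is static here.

```python
import random

STANDARD, UNCERTAIN = 0, 1

def orig_sys_write(fd, buf, size):
    """Toy original call: reports how many bytes were 'written'."""
    return min(size, len(buf))

def my_sys_write(proc, fd, buf, size, rng):
    """Return (outcome label, return value) for one hooked sys_write."""
    # Standard environment or critical file: run normally.
    if proc["process_env"] == STANDARD or fd in proc["fd_list"]:
        return "normal", orig_sys_write(fd, buf, size)
    # Interfere only with probability `threshold`.
    if rng.random() >= proc["threshold"]:
        return "normal", orig_sys_write(fd, buf, size)
    strategy = rng.randint(1, 3)
    if proc["strategy_set"] == "non-intrusive":
        if strategy == 1:
            return "error", rng.randint(-255, -1)
        if strategy == 2:
            proc["delay"] += rng.uniform(0.0, 0.1)   # recorded, not slept
            return "delayed", orig_sys_write(fd, buf, size)
        proc["nice"] = 19                            # assumed minimum priority
        return "deprioritized", orig_sys_write(fd, buf, size)
    if strategy == 1:
        return "silenced", size                      # pretend success
    if strategy == 2:
        noisy = bytes(rng.randrange(256) for _ in buf)
        return "corrupted", orig_sys_write(fd, noisy, size)
    shorter = rng.randint(0, max(size - 1, 0))
    return "truncated", orig_sys_write(fd, buf, shorter)

def simulate(strategy_set, threshold, calls=10_000, seed=0):
    """Count outcomes over many writes by one process in the uncertain environment."""
    rng = random.Random(seed)
    proc = {"process_env": UNCERTAIN, "fd_list": {3}, "threshold": threshold,
            "strategy_set": strategy_set, "delay": 0.0, "nice": 0}
    counts = {}
    for _ in range(calls):
        label, _ret = my_sys_write(proc, 1, b"x" * 64, 64, rng)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `simulate("non-intrusive", 0.1)` returns a dictionary of outcome counts in which roughly one call in ten is perturbed, split about evenly among the three strategies, which is the behavior a static 10% threshold is meant to produce.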

4 Evaluation

The goal of our evaluation is to discover the impact of Chameleon’s uncertain environment on malware and benign software behavior. We considered security, performance, and software behavior to answer the following research questions: (i) how does the uncertain environment with interference strategies affect software execution? (ii) is the per-system call interference threshold more effective than a static threshold? (iii) how do different strategies impact malware in the uncertain environment? and (iv) how can benign software be made more resilient in the uncertain environment?

Our evaluation leveraged a collection of 113 applications, including common software from GNU projects [12], SPEC CPU2006 [13] and Phoronix-test-suite [14] (47 I/O-bound and 66 CPU-bound). Our 100 malware samples were randomly selected from THC [15] and VirusShare [16] in different categories (22 flooders, 14 worms, 15 spyware, 24 Trojans and 25 viruses). The samples contained executables built on both x86 and x86_64 systems. All the malware and benign software used in our experiments are detailed in the Appendix.

We deployed and evaluated Chameleon on four virtual machines (VMs) running Ubuntu 12.04 with 1GB RAM, 30GB Hard Disk, and 1 processor, one with x86 architecture, and the other with x86_64 architecture. The host machine working as the testbed runs Ubuntu 14.04 with 16GB RAM, 160GB Hard Disk, x86_64 architecture, and 8 processors.

4.1 Testbed and Data Collection

In this subsection, we detail the architecture of the testbed and the process of automating scalable experiments. Figure 2 illustrates the architecture of the execution testbed. It has four components: a central Controller, a Resource Scheduler, a Task Scheduler, and a Data Collector.

Figure 2: The architecture of our evaluation testbed. The Controller starts the Resource Scheduler, the Task Scheduler and the Data Collector (Step 1). The Resource Scheduler reverts the Test VM to a clean Snapshot, loads the uncertainty module (Step 2), copies the malware or benign software resources (e.g. files and parameters needed during the execution) to the Test VM (Step 3). The Task Scheduler starts the honeypot service in the Honeypot VMs and the execution of malware or benign software in the Test VM (Step 4). The Data Collector reads system call traces and execution results (Step 5).

The Controller works on the host and is responsible for managing the other components. The Controller starts the Resource Scheduler to prepare the files and parameters for all the experiments, launches the Task Scheduler to run malware and benign software in the test VM, and starts the Data Collector to record system call traces and execution results after each experiment (Step 1).

The Resource Scheduler is responsible for preparing the environment for the test VM to start each experiment. First, it reverts the test VM to the Snapshot storing the state of a fresh installed and booted system. Then, it loads the uncertainty module to the test VM system (Step 2). Finally, it copies the software and the corresponding files and parameters needed during the execution from the malware/benign software resource pool to the test VM (Step 3).

The Task Scheduler is responsible for starting the Dionaea [17] honeypot service in the Honeypot VMs, and executing malware and benign software in the test VM. Since malware in the test VM may attack other computers in the same LAN, the Honeypot VMs are provided so that the malware can fully exhibit its malicious behavior. For security purposes, the Gateway is configured to block traffic between the test VM (running malware) and external networks.

The Data Collector is responsible for collecting system call traces logged in dmesg and the software execution results (adversely affected or not). For system call monitoring, we chose to hook the system call table through a loadable kernel module rather than using strace for two reasons: (1) our system considers the behavior of multiple processes rather than just one; (2) malware with anti-analysis techniques may stop executing when strace is detected.
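As an illustration of the collection step, the sketch below groups logged system calls by process. The log line format (`chameleon: pid=... syscall=... action=...`) and the function name `parse_trace` are hypothetical; the format that the kernel module actually writes to dmesg is not specified here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log format: the real dmesg output of the module is not given in the text.
LINE_RE = re.compile(
    r"chameleon: pid=(?P<pid>\d+) syscall=(?P<syscall>\w+) action=(?P<action>\w+)"
)

def parse_trace(lines):
    """Group logged system calls by process ID, skipping unrelated kernel messages."""
    per_pid = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a line written by the (assumed) monitoring module
        key = (match["syscall"], match["action"])
        per_pid[int(match["pid"])][key] += 1
    return per_pid

sample = [
    "[  101.000001] chameleon: pid=4321 syscall=sys_write action=pass",
    "[  101.000050] chameleon: pid=4321 syscall=sys_write action=silence",
    "[  101.000090] usb 1-1: new high-speed USB device",
    "[  101.000120] chameleon: pid=4400 syscall=sys_connect action=delay",
]
trace = parse_trace(sample)
```

Keying the result by process ID reflects the first reason given above for hooking the system call table: the traces cover several processes at once.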

4.2 Security

The goal of this security evaluation is to analyze the effect of the uncertain environment on the execution of malware and benign software. We considered malware to be adversely affected by the uncertain environment if it crashed or executed in a hampered fashion. An execution is considered Crashed if the malware terminates before performing its malicious actions. An execution is considered Succeeded if the malware accomplished its intended tasks, such as injecting a malicious payload into an executable. The following outcomes are examples of hampered malware execution in the uncertain environment: (1) a virus that injects only part of the malicious code into an executable or source code file; (2) a botnet that loses commands sent to the bot herder; (3) a cracker that retrieves wrong or partial user credentials; (4) spyware that redirects incomplete stdin, stdout or stderr of the victim; (5) a flooder that sends only a percentage of the total number of packets it attempted.

We evaluated the effects of the uncertain environment on 100 Linux malware samples using intrusive and non-intrusive strategies at static and dynamic per-system call thresholds. As Figure 3 shows, intrusive strategies generally produced approximately 10% more Crashed and 8% fewer Hampered executions than non-intrusive strategies. For both intrusive and non-intrusive strategies, the ratios of Succeeded malware executions (infections) were almost the same. When intrusive strategies were applied, 81% of the malware samples failed to accomplish their tasks at threshold 50%, 62% failed at threshold 10%, and 92% failed with a dynamic per-system call threshold. Non-intrusive strategies yielded similar results at threshold 50%, threshold 10% and the per-system call threshold, with 76%, 68% and 93% of malware adversely affected, respectively. In general, threshold 50% caused more Crashed and fewer Succeeded malware executions than threshold 10%. Compared with the static thresholds, the per-system call threshold resulted in about 25% fewer malware samples succeeding in their infection and 30% more crashing during execution. This supports our assumption that a per-system call threshold is more effective in targeting malware, thus better protecting the system.

Figure 3: Execution results for malware running in the uncertain environment using intrusive and non-intrusive strategies with static (10% and 50%) and per-system call thresholds.

We also ran our samples of benign software in the uncertain environment and observed their execution outcomes. We considered the following cases as Hampered executions: (1) a text editor temporarily losing some functionality; (2) a scientific tool producing partial results; (3) a network tool missing packets. The execution outcome was considered Crashed if the software hung for longer than twice its standard runtime and needed to be manually killed. A Succeeded execution generates outputs that are exactly the same as those produced with the same test case in the standard environment, with a runtime that does not exceed twice that in the standard environment.
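The outcome rules for benign software can be written down as a small decision function. This is a minimal sketch under stated assumptions: the names `Outcome` and `classify_benign_run` are illustrative, a run that exceeds twice the standard runtime is treated as Crashed (standing in for a manual kill), and any run that finishes in time with different output is treated as Hampered.

```python
from enum import Enum

class Outcome(Enum):
    SUCCEEDED = "Succeeded"
    HAMPERED = "Hampered"
    CRASHED = "Crashed"

def classify_benign_run(runtime_s, baseline_runtime_s, output, baseline_output):
    """Apply the two-times-runtime and identical-output rules described above."""
    limit = 2 * baseline_runtime_s
    if runtime_s > limit:
        # Assumption: exceeding the limit stands in for "hung and was manually killed".
        return Outcome.CRASHED
    if output == baseline_output:
        return Outcome.SUCCEEDED
    # Finished in time but produced different (e.g. partial) output.
    return Outcome.HAMPERED

example = classify_benign_run(3.1, 2.0, b"partial", b"partial and complete")
```

In this example the run finishes within the limit but its output differs from the baseline, so it is classified as Hampered.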

As shown in Figure 4, compared with non-intrusive strategies, intrusive strategies caused more adverse effects on benign software, with approximately 10% more Crashed, 7% more Hampered and 15% fewer Succeeded executions. At static threshold 10% with intrusive strategies, on average 37% of the tasks experienced some form of Crashed or Hampered execution. With non-intrusive strategies, this percentage was 30%. For a 50% static threshold and intrusive strategies, 59% of the software was adversely affected. With non-intrusive strategies, this number was 10% smaller. With non-intrusive strategies, the per-system call threshold improved on the static thresholds for benign software, with 25% more Succeeded, 15% fewer Hampered and 20% fewer Crashed executions. With intrusive strategies, the effects of the per-system call threshold were similar to those of static threshold 10%.

Taken together, these results indicate that, compared with static thresholds, an uncertain environment with a per-system call threshold better preserves the successful execution of benign software while affecting malware disproportionately. The effect was stronger with non-intrusive strategies.

Figure 4: Execution results for benign software running in the uncertain environment using intrusive and non-intrusive strategies with static (10% and 50%) and per-system call thresholds.

4.3 Software Behavior and Performance

We compared the execution of malware and benign software at the system call level in the uncertain environment. We explored the effects of different software types, software workloads and strategies.

In our experiment, benign software invoked more than twice as many monitored system calls as malware, even though the malware set included flooders, which usually raise the average number of system calls invoked considerably. For benign software, the number of system calls perturbed or silenced was only half of that for malware, mainly because of the effectiveness of the corruption protection mechanism introduced in Section 3.5. Benign software also had a larger number of connection attempts and read/write operations monitored than malware.

Table II and Table III show the results of system calls perturbed with different types of malware using non-intrusive and intrusive strategies. The impact of intrusive and non-intrusive strategies is similar. Generally, threshold 50% caused a higher percentage of perturbations than threshold 10% and the per-system call threshold, especially for connection-related system calls (50% increase). A per-system call threshold caused a higher percentage of perturbations than static threshold 10%, because a per-system call threshold changes during execution while the static threshold remains constant at 10%. For all types of system calls invoked, flooders had the highest percentage perturbed and worms had the lowest percentage perturbed, with both static and per-system call thresholds. Based on previous work, this can be explained by the fact that flooders invoked the most system calls in the interference set, while worms invoked the fewest. For the connection-related system calls perturbed, spyware had the lowest percentage with both static and per-system call interference thresholds. With a per-system call threshold, spyware had 0% perturbed. From our observation, spyware usually invoked a small number of network-related system calls with a large buffer holding the contents it spied on. Moreover, most of the packets were transmitted after the sys_dup() system call; therefore, a per-system call threshold that perturbed sys_dup() prevented further connection perturbations. For buffer-related system calls, spyware and Trojans received a very small percentage of perturbations, indicating the effectiveness of a per-system call threshold in perturbing spyware and Trojans at an early stage, before they can send and receive more buffers.

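| Malware type | All syscalls perturbed (10%) | All syscalls perturbed (50%) | All syscalls perturbed (Dynamic) | Connection-related calls perturbed (10%) | Connection-related calls perturbed (50%) | Connection-related calls perturbed (Dynamic) | Buffer-related calls perturbed (10%) | Buffer-related calls perturbed (50%) | Buffer-related calls perturbed (Dynamic) |
|---|---|---|---|---|---|---|---|---|---|
| Flooders | 9.74% | 39.31% | 37.91% | 10.13% | 59.29% | 36.68% | 6.58% | 23.35% | 24.92% |
| Spyware | 2.89% | 25.79% | 14.33% | 7.14% | 48.15% | 0.00% | 3.06% | 31.06% | 0.41% |
| Trojan | 8.09% | 27.07% | 21.17% | 9.52% | 62.14% | 17.00% | 7.14% | 15.22% | 1.81% |
| Viruses | 5.02% | 28.62% | 23.47% | 9.56% | 47.78% | 12.69% | 4.96% | 21.87% | 17.27% |
| Worms | 0.05% | 15.67% | 11.04% | 9.86% | 60.97% | 8.11% | 8.97% | 14.37% | 16.27% |
| All | 0.41% | 28.39% | 19.80% | 9.87% | 60.97% | 15.69% | 6.83% | 21.10% | 13.81% |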
Table II: Percentage of system calls perturbed running malware with different thresholds in the uncertain environment using non-intrusive strategies.
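
| Malware type | All syscalls perturbed (10%) | All syscalls perturbed (50%) | All syscalls perturbed (Dynamic) | Connection-related calls perturbed (10%) | Connection-related calls perturbed (50%) | Connection-related calls perturbed (Dynamic) | Buffer-related calls perturbed (10%) | Buffer-related calls perturbed (50%) | Buffer-related calls perturbed (Dynamic) |
|---|---|---|---|---|---|---|---|---|---|
| Flooders | 8.22% | 39.28% | 39.11% | 9.56% | 49.57% | 38.37% | 3.50% | 23.11% | 30.69% |
| Spyware | 4.38% | 26.39% | 15.02% | 16.62% | 51.25% | 37.96% | 0.64% | 27.14% | 0.07% |
| Trojan | 6.90% | 35.16% | 25.08% | 12.49% | 56.62% | 19.45% | 3.58% | 27.47% | 6.08% |
| Viruses | 6.49% | 23.03% | 28.14% | 12.96% | 52.34% | 14.30% | 8.94% | 22.32% | 24.19% |
| Worms | 3.92% | 22.93% | 13.26% | 6.53% | 60.59% | 8.15% | 3.52% | 19.86% | 33.24% |
| All | 6.26% | 29.90% | 26.27% | 11.26% | 53.55% | 26.79% | 4.47% | 24.04% | 19.36% |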
Table III: Percentage of system calls perturbed running malware with different thresholds in the uncertain environment using intrusive strategies.

Table IV and Table V show the results of system calls perturbed with I/O-bound and CPU-bound software using non-intrusive and intrusive strategies. In general, threshold 50% caused higher percentages of system calls perturbed than threshold 10%. With static thresholds of 10% and 50%, compared with I/O-bound software, CPU-bound software had a higher percentage of all system calls perturbed, and lower percentages of connection-related and buffer-related system calls perturbed. A possible explanation is that CPU-bound software invoked more buffer-related system calls covered by the corruption protection mechanism. With the per-system call threshold, I/O-bound software had a higher percentage of system calls perturbed than CPU-bound software, mainly because the per-system call threshold was more likely to increase on I/O-related system calls (see Behaviors in Section 3.4).

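| Software type | All syscalls perturbed (10%) | All syscalls perturbed (50%) | All syscalls perturbed (Dynamic) | Connection-related calls perturbed (10%) | Connection-related calls perturbed (50%) | Connection-related calls perturbed (Dynamic) | Buffer-related calls perturbed (10%) | Buffer-related calls perturbed (50%) | Buffer-related calls perturbed (Dynamic) |
|---|---|---|---|---|---|---|---|---|---|
| I/O-bound | 1.24% | 7.65% | 2.60% | 5.94% | 23.40% | 2.50% | 0.94% | 6.34% | 34.16% |
| CPU-bound | 3.42% | 10.72% | 0.10% | 0.00% | 1.97% | 0.00% | 3.39% | 10.16% | 0.02% |
| All | 2.40% | 9.28% | 1.28% | 2.79% | 12.04% | 1.18% | 2.24% | 8.36% | 15.99% |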
Table IV: Percentage of system calls perturbed running benign software with different thresholds in the uncertain environment using non-intrusive strategies.
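
| Software type | All syscalls perturbed (10%) | All syscalls perturbed (50%) | All syscalls perturbed (Dynamic) | Connection-related calls perturbed (10%) | Connection-related calls perturbed (50%) | Connection-related calls perturbed (Dynamic) | Buffer-related calls perturbed (10%) | Buffer-related calls perturbed (50%) | Buffer-related calls perturbed (Dynamic) |
|---|---|---|---|---|---|---|---|---|---|
| I/O-bound | 1.21% | 5.62% | 8.20% | 3.57% | 20.14% | 2.50% | 0.95% | 3.88% | 50.93% |
| CPU-bound | 3.34% | 11.82% | 2.40% | 0.03% | 2.01% | 1.00% | 3.49% | 12.27% | 8.10% |
| All | 2.34% | 8.91% | 5.13% | 1.69% | 10.53% | 1.71% | 2.30% | 8.33% | 28.21% |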
Table V: Percentage of system calls perturbed running benign software with different thresholds in the uncertain environment using intrusive strategies.
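
The percentages in Tables II to V are ratios of perturbed calls to invoked calls within each category. The following minimal sketch shows one way to compute them from a list of trace events. The membership of the two category sets and the function name `perturbed_percentages` are assumptions made for illustration; the full interference set is defined elsewhere in the paper.

```python
# Assumed category membership, for illustration only.
CONNECTION_CALLS = {"sys_connect", "sys_sendto"}
BUFFER_CALLS = {"sys_read", "sys_write"}

def perturbed_percentages(events):
    """events: iterable of (syscall_name, was_perturbed) pairs for one run."""
    totals = {"all": [0, 0], "connection": [0, 0], "buffer": [0, 0]}
    for name, was_perturbed in events:
        groups = ["all"]
        if name in CONNECTION_CALLS:
            groups.append("connection")
        if name in BUFFER_CALLS:
            groups.append("buffer")
        for group in groups:
            totals[group][0] += 1                   # calls invoked
            totals[group][1] += int(was_perturbed)  # calls perturbed
    return {
        group: (100.0 * perturbed / invoked if invoked else 0.0)
        for group, (invoked, perturbed) in totals.items()
    }

events = [
    ("sys_write", True),
    ("sys_write", False),
    ("sys_connect", True),
    ("sys_open", False),
]
percentages = perturbed_percentages(events)
```

For the four sample events, half of all calls, the single connection-related call, and one of the two buffer-related calls are perturbed.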

One of the greatest differences between malware and benign software is the diversity of functionality of the latter. To ensure the fairness of analysis on benign software, we measured the test coverage (percentage of software instructions executed) by compiling their source code with gcov [18], EMMA [19] and Coverage.py [20] based on the software’s programming language. The average coverage was 69.49%.

We analyzed the performance penalty caused by the interference strategies, such as process delay and process priority decrease, on the 23 benchmark programs whose execution could be scripted. Highly interactive software was tested manually and showed negligible overhead. Figure 5 shows the average runtime overhead for the scripted software running in the uncertain environment. For runtimes ranging from 0 to 0.01 seconds, the average penalty is 8%; for runtimes ranging from 0.1 to 1 seconds, the average penalty is 4%; for runtimes longer than 10 seconds, the average penalty is 1.8%. This shows that the longer the runtime, the smaller the overhead. One hypothesis is that software with longer execution times is usually CPU-bound, performing time-consuming calculations. Because most of the system calls in the interference set are I/O related, CPU-bound programs are perturbed less and thus incur a smaller overhead.

Figure 5: Performance penalty for 23 benchmark software whose execution could be scripted. We categorized the software according to their average runtime.
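
The runtime penalty discussed above is a relative overhead, averaged within runtime categories. A minimal sketch of that computation follows; the bucket boundaries and the sample runtimes are illustrative values chosen for the example, not measurements from our experiments, and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean

def overhead_percent(standard_s, uncertain_s):
    """Relative runtime penalty of the uncertain environment, in percent."""
    return 100.0 * (uncertain_s - standard_s) / standard_s

def mean_overhead_by_bucket(runs, edges):
    """Average the overhead per runtime bucket.

    runs:  iterable of (standard_s, uncertain_s) pairs
    edges: sorted bucket boundaries in seconds; bucket i holds runs whose
           standard runtime falls between edges[i-1] and edges[i]
    """
    buckets = {}
    for standard_s, uncertain_s in runs:
        index = bisect_right(edges, standard_s)
        buckets.setdefault(index, []).append(overhead_percent(standard_s, uncertain_s))
    return {index: mean(values) for index, values in buckets.items()}

# Illustrative runtimes only.
runs = [(0.5, 0.625), (0.5, 0.5625), (16.0, 16.5)]
summary = mean_overhead_by_bucket(runs, edges=[1.0, 10.0])
```

Here bucket 0 holds the two sub-second runs and bucket 2 holds the run longer than 10 seconds.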

We also tested 26 benign applications with different workloads running in the standard and the uncertain environments. The workloads had three levels: light, medium and heavy, which corresponded to the test, train and ref levels for SPEC CPU2006, and the first, middle-most and last levels in the Phoronix Test Suite. Across all three workloads, our results showed that 2 benign applications were adversely affected by non-intrusive strategies and 9 were affected by intrusive strategies (see Table VI). Further, there were no significant changes in the percentages of total system calls perturbed, connection-related system calls perturbed and buffer-related system calls perturbed as the workload changed, for both types of interference strategies. The results indicate that the workload of the tested software does not impact the program outcome in the uncertain environment for the two sets of interference strategies we used.

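| Workload | All syscalls perturbed (Non-intrusive) | All syscalls perturbed (Intrusive) | Connection-related syscalls perturbed (Non-intrusive) | Connection-related syscalls perturbed (Intrusive) | Buffer-related syscalls perturbed (Non-intrusive) | Buffer-related syscalls perturbed (Intrusive) | Number of software adversely affected (Non-intrusive) | Number of software adversely affected (Intrusive) |
|---|---|---|---|---|---|---|---|---|
| Light | 4.3% | 5.1% | 0.0% | 0.1% | 2.9% | 4.2% | 2 | 9 |
| Medium | 5.8% | 6.3% | 0.2% | 0.3% | 3.1% | 3.7% | 2 | 9 |
| Heavy | 5.2% | 5.9% | 0.2% | 0.2% | 3.5% | 3.0% | 2 | 9 |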
Table VI: Impact of non-intrusive and intrusive strategies on 26 benign software from Phoronix Test Suite and SPEC CPU for different workloads in the uncertain environment (static threshold 10%).

To understand the effects of different strategies, we carried out experiments running a flooder in the victim system. In each experiment (running for 1 minute), the flooder ran either in the standard environment (no strategy), or perturbed with one strategy or with a combination of all strategies. To simulate a real-world test case, a piece of software ran some processes as background workload in the system. The software was configured to run with either a normal workload or a heavy workload. We measured the number of system calls invoked by the flooder and summarized the results.

As Figure 6 shows, at normal background workload, the system call silencing with error return strategy increased the total number of system calls invoked by 4% (from 6,913,041 to 7,187,094) compared with no strategy. The reason is that the flooder retried immediately when a packet failed, saving the time it would otherwise spend waiting for responses. With the delay strategy alone, the total number of system calls invoked decreased by one third. Randomly choosing a strategy each time gave the best result, with the smallest number of system calls successfully sent. The priority decrease strategy made no difference compared with applying no strategy, because the flooder had already been scheduled with the lowest priority in the server. We therefore increased the background workload to heavy (more processes running), so that the flooder would have a relatively higher priority. With the heavy background workload, the number of system calls the flooder could send in the standard environment decreased by more than half. Priority decrease was highly effective here, sharply reducing the total number of system calls invoked. This can be explained by the heavy workload running more background processes than the normal workload, some of them with a priority lower than the flooder’s. Therefore, a random decrease of the flooder’s priority left more system resources to the other benign but low-priority processes. Regardless of the background workload, running mixed strategies yielded better results, with fewer packets sent, than running any single strategy.

Figure 6: Comparison among different types of strategies in the number of system calls invoked by a flooder, with normal and heavy background workloads.
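
The mixed-strategy configuration can be pictured as a two-step random choice for every monitored system call: first decide whether to interfere at all, then pick which strategy to apply. The sketch below simulates only that selection logic in user space. The strategy labels are shorthand for the strategies named above, and `choose_interference` and `simulate` are hypothetical helper names, not part of Chameleon.

```python
import random
from collections import Counter

# Shorthand labels for the three strategies compared in this experiment.
STRATEGIES = ("silence_with_error", "process_delay", "priority_decrease")

def choose_interference(threshold, rng, strategies=STRATEGIES):
    """Return a strategy name with probability `threshold`, otherwise None."""
    if rng.random() < threshold:
        return rng.choice(strategies)
    return None  # the call proceeds unperturbed

def simulate(num_calls, threshold, seed=0):
    """Count how often each strategy would be selected over `num_calls` calls."""
    rng = random.Random(seed)
    return Counter(choose_interference(threshold, rng) for _ in range(num_calls))

counts = simulate(10_000, threshold=0.10)
```

With a threshold of 0.10, roughly nine calls out of ten map to `None`, and the remainder are spread across the three strategies.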

4.3.1 Case Study: Advanced Persistent Threat (APT)

In this section we evaluate the interference strategies against an APT attack. We simulated a watering hole attack similar to the Black Vine APT reported by Symantec [21]. This attack has three main components: a Trojan, a backdoor and a keylogger. First, the attacker sends a spear-phishing e-mail to a user with a link for downloading the Trojan encryption tool. If the user clicks on the link and later uses the Trojan tool to encrypt a file, the tool downloads and executes a backdoor from a C&C server while encrypting the requested file. Then, the backdoor copies the directory structure and the ssh host key from the user’s machine into a file and sends it to the C&C server. After the backdoor executes, the attacker deletes any traces of the infection without affecting the Trojan’s encryption/decryption functionality. The attacker also installs a keylogger to obtain root privileges. Next, the backdoor runs a script that uploads sensitive data to the C&C server.

The Trojan is written in C using libgcrypt for encryption and decryption. It uses the curl library for downloading the backdoor from the Internet. In our simulation we used the logkeys keylogger [22]. The backdoor script uses scp for sending the data to the C&C server.

APT in the Uncertain Environment: From the system call traces we collected, the first malicious behavior occurred when the backdoor was being configured, with a sys_write() invoked with a buffer parameter starting with "\177ELF" (the ELF magic bytes). This behavior caused the threshold on the sys_write() system call to increase. Later, three pairs of sys_dup2() calls with file descriptors 0 and 1 were invoked to execute the backdoor, and the threshold on these sys_dup2() calls was increased again. Then, when sys_read() was invoked on the ssh host key files, the threshold decreased. Finally, the keylogger started, sys_write() was invoked to write to a log file, and sys_connect() and sys_sendto() were invoked for the backdoor to communicate with the C&C server. In our 15 experiments, 9 crashed before setting up the backdoor, and another 4 crashed before starting the keylogger. None successfully completed the attack by communicating with the C&C server. The probability for the simulated APT to gain privileges and exfiltrate data is under 0.14%.
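
The threshold changes in this trace can be illustrated with a small per-system call table. This is a sketch only: the base value, step size and bounds are placeholder assumptions, the class and function names are hypothetical, and the actual adjustment rules are those of the Behaviors described in Section 3.4.

```python
ELF_MAGIC = b"\177ELF"  # octal escape for byte 0x7f, followed by "ELF"

class PerSyscallThreshold:
    """Interference probability (in percent) tracked separately per system call."""

    def __init__(self, base=10, step=10, floor=0, ceiling=90):
        # All four numbers are illustrative assumptions, not values from the paper.
        self.base, self.step, self.floor, self.ceiling = base, step, floor, ceiling
        self._current = {}

    def get(self, syscall):
        return self._current.get(syscall, self.base)

    def increase(self, syscall):
        self._current[syscall] = min(self.ceiling, self.get(syscall) + self.step)

    def decrease(self, syscall):
        self._current[syscall] = max(self.floor, self.get(syscall) - self.step)

def looks_like_executable_write(buffer):
    """True when a write buffer starts with the ELF magic bytes."""
    return buffer.startswith(ELF_MAGIC)

# Replay the sequence of events described in the trace above.
thresholds = PerSyscallThreshold()
if looks_like_executable_write(b"\177ELF\x02\x01\x01"):
    thresholds.increase("sys_write")
for _ in range(3):
    thresholds.increase("sys_dup2")
thresholds.decrease("sys_read")
```

Keeping one entry per system call means that suspicious activity on sys_write() does not change the interference probability applied to unrelated calls.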

4.3.2 Chameleon towards OS uncertainties

| Software | Bugs |
|---|---|
| Vim | viminfo: Illegal starting char [23] |
| tar | Fail using ’-C’ option extracting archive with empty directories [24] |
| tar | “Operation not permitted” when extracting [25] |
| Thunderbird | Unable to locate mail spool file [26] |
| Thunderbird | segmentation fault (core dumped) [27] |
| Firefox | Bus error (core dumped) [28] |
| Firefox | Fatal IO error (Operation not permitted) on X server [29] |
Table VII: Software bugs found by Chameleon

Since non-intrusive strategies consider only perturbations within the OS specification, software that still experiences crashed or hampered execution should be analyzed further. We explored the reasons behind the crashes and hampered executions, so that benign software could be improved to better tolerate the interference strategies. We analyzed the execution logs manually, examining the system calls and their parameters in reverse order, starting from the last one. Usually it is the failure of one system call request with a specific parameter that leads to the early termination of a program. Therefore, locating the corresponding system call and parameter reveals the reason for the software crash and helps find bugs in the software. A software bug may have many causes. From our observation, some bugs may emerge again after being ‘fixed’ for a while. Chameleon is capable of interfering with every system call with a given probability, and of logging the execution details of the crash.
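
The manual procedure above amounts to a backward scan for the most recent failing call. A minimal sketch is given below; the `Call` record layout and the helper `last_failed_call` are hypothetical, with negative return values modeling the Linux convention of returning a negated errno.

```python
from typing import NamedTuple, Optional, Sequence

class Call(NamedTuple):
    name: str
    arg: str
    ret: int  # negative values model a negated errno, as Linux system calls return

def last_failed_call(trace: Sequence[Call]) -> Optional[Call]:
    """Walk the trace backwards and return the most recent failing call, if any."""
    for call in reversed(trace):
        if call.ret < 0:
            return call
    return None

# Illustrative trace; paths and return values are made up for the example.
trace = [
    Call("sys_open", "/etc/passwd", 3),
    Call("sys_read", "fd=3", 512),
    Call("sys_open", ".viminfo", -1),
    Call("sys_close", "fd=3", 0),
]
culprit = last_failed_call(trace)
```

In this example the scan stops at the failed sys_open on .viminfo, which is the call and parameter to inspect first.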

During our analysis, we found that the crashes in Vim, tar, Mozilla Firefox and Thunderbird were in fact software bugs previously reported on Launchpad and Bugzilla [30, 31]. Because each system call was perturbed with a probability, the perturbations causing the crash varied across tests of the same software (we ran each program fifteen times and averaged the results). Therefore, more than one bug could be found for one piece of software. Table VII lists the bugs in detail. Besides general bugs, e.g. Segmentation Fault, Fatal I/O error and Bus error, we found several bugs of particular interest.

.viminfo [23]: This bug causes Vim to fail to launch because of an erroneous .viminfo file. The .viminfo file is used to remember information about the user’s last editing session. If the user exits Vim and later starts it again, the .viminfo file enables the user to continue where they left off [32]. In our experiment, the .viminfo error was caused by silencing a sys_write on the .viminfo file. In the reported bug, the error was caused by an operation using a special character not recognized by Vim before exiting. With Chameleon, we identified from our log file that Vim failed to launch because of a failure in sys_open on .viminfo.

tar -C empty directory [24]: This bug occurs when one extracts the empty directories inside an archive using the ‘-C’ option to change directories. The cause of the bug is tar using mkdir(file_name, mode) instead of mkdirat(chdir_fd, file_name, mode) to extract a directory. With Chameleon, we identified in our log file the failure to create a new file descriptor with sys_openat.
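
The difference between the two calls can be seen from user space. The sketch below is a Python illustration, not tar’s code: on Linux, `os.mkdir` with the `dir_fd` argument resolves the relative name against an open directory descriptor in the manner of mkdirat, instead of against the current working directory. The helper name `make_dir_relative_to` is hypothetical.

```python
import os
import tempfile

def make_dir_relative_to(target_dir, name, mode=0o755):
    """Create `name` inside `target_dir` without changing the working directory."""
    dir_fd = os.open(target_dir, os.O_RDONLY)
    try:
        # With dir_fd set, the relative path is resolved against target_dir
        # (mkdirat-style) rather than against the process's current directory.
        os.mkdir(name, mode, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as target:
    make_dir_relative_to(target, "empty_dir")
    created = os.path.isdir(os.path.join(target, "empty_dir"))
```

A plain `os.mkdir("empty_dir")` would instead create the directory relative to wherever the process happens to be running, which is the kind of mismatch described in the bug report.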

Thunderbird mail spool file [26]: This bug causes Thunderbird to hang when linking an existing email account. Thunderbird uses the spool file to “help” the user set up an email account, on the assumption that email providers properly handle SMTP, port and security configuration issues. Unfortunately, few of them are correctly configured [33]. From the Chameleon log file, we identified that linking an account failed because of a failure in sys_read on the spool file.

To sum up, our results show that the crashes and adverse effects in the analyzed software were due more to bugs than to the perturbations applied by Chameleon. It appears that the perturbations accelerated the exposure of such bugs.

5 Discussion

As we discussed in Section 2, a resourceful and motivated adversary can bypass any protection mechanism. Even though the uncertain environment is designed to rate-limit stealthy malware, malware can still accomplish its goals while running in such an environment. For example, highly fault-tolerant malware will be resilient to the uncertain environment.

There are some trade-offs in selecting an interference strategy. Intrusive strategies are more aggressive, and will affect software running in the uncertain environment more. For an organization with high security demands and less tolerance for non-approved software, intrusive strategies will offer more protection.

The Process Delay strategy is different from simply suspending software execution. A suspended execution stops suspicious software from running and will not generate data for deep learning analysis. Process Delay, on the other hand, slows down software execution, thus potentially buying time for deep analysis and allowing for a more accurate classification of software that received a borderline confidence level from a fast conventional machine learning detector. Moreover, suspension of execution can be detected by malware just by checking wall clock time.

Although Chameleon was implemented for Linux (to be freely distributed to the public), it can also be implemented in any modern OS, such as Windows, which is a popular target of malware attacks. We are also aware that the degree of uncertainty is not a one-size-fits-all solution: we expect an administrator to tune the level of uncertainty to the needs of the organization and its applications.

Finally, we believe that Chameleon can also be used as a framework to test software resilience under OS misbehavior.

6 Related Work

Our work intersects the areas of malware detection, software diversity and deception, and fuzz testing. This section summarizes how they have been used in software design and highlights under-studied areas.

Malware Detection: There is an extensive literature dating to the 1990s on the detection of intrusions and malware. Malware detection techniques can be signature-based [3, 4] or behavior-based [34, 35, 36].

Signature-based approaches match bytes and instructions from known malware to the unknown program under analysis. These techniques are accurate, but they can be evaded when attackers use polymorphism and metamorphism to create malware variants; these variants have the same behavior but have different byte signatures. Further, these approaches cannot detect zero-day malware and have a practical detection rate ranging from 25% to 50% [5].

Behavior-based techniques, which can be static or dynamic, analyze program behavior and attempt to detect events, instructions or resource accesses that are indicative of malware. Behavioral solutions based on static analysis [35] analyze the source code of malware and benign applications in an attempt to extract their unique behavior as high-level specifications. Most of the work on dynamic behavior-based malware detection [34, 36] is based on the seminal work by Forrest et al. [34]. System call-based malware detectors suffer, however, from high false positive rates due to the diverse nature of system calls invoked by applications. This challenge has worsened as programs are becoming increasingly diverse [36].

Some approaches analyze the data flow of a program to extract malware behavior. Panorama [37], for example, performs system-level taint-tracking to discover how malware leaks sensitive data. Martignoni et al. [38] leveraged hierarchical behavioral graphs to infer high-level behavior of low-level events. The approach traces the execution of a program, performing data-flow analysis to discover relevant actions such as proxying, data leaking and key stroke logging. Ether [39] improved on tracing granularity on single instructions and system calls via hardware virtualization extensions. Ye et al. [40] proposed a semi-parametric classification model for combining file content and file relation information to improve the performance of file sample classification.

More recently, Bromium [5] proposed the use of virtualization on a per-process basis to isolate every process from the system and from each other. While this certainly advances the level of granularity offered by traditional sandboxes, it has some inconveniences for the user (e.g., it creates obstacles to inter-process communication) and cannot guarantee complete perimeter protection (e.g., a keylogger still can record credentials).

Chameleon’s goal is to provide an environment where possible malware can be rate-limited, while time-consuming deep analysis is underway.

Diversity and Deception: The ability to diversify behavior within a system is an essential building block for unpredictability. Diversifying components within the software stack can improve overall robustness. Researchers have studied building diverse computer systems. Forrest et al. [41] proposed guidelines and advocated the use of randomized compilation techniques, which motivated later work in this area [42]. Forrest and her colleagues [43] also showed that code exhibits evolutionary characteristics similar to those seen in the biological world. A program, like a biological organism, has the potential to mutate but can still function normally [43].

Several projects mitigate buffer overflows and other memory errors by randomizing system call mappings, global library entry points, stack placement, stack direction, and heap placement—often in conjunction with running multiple versions in parallel to detect divergence [44].

To a limited extent, deception has been an implicit technique for cyber warfare and defense, but is under-studied as a fundamental abstraction for secure systems. Honeypots and honeynets [45] are systems designed to look like production systems in order to deceive intruders into attacking the systems or networks so that the defenders can learn new techniques.

Several technologies for providing deception have been studied. Software decoys are agents that protect objects from unauthorized access [46]. The goal is to create a belief in the attacker’s mind that the defended systems are not worth attacking or that the attack was successful. The researchers considered tactics such as responding with common system errors and inducing delays to frustrate attackers. Red-teaming experiments at Sandia tested the effectiveness of network deception on attackers working in groups. The deception mechanisms at the network level successfully delayed attackers for a few hours. Almeshekah and Spafford [47] further investigated the adversaries’ biases and proposed a model to integrate deception-based mechanisms in computer systems. In all these cases, the fictional systems are predictable to some degree; they act as real systems given the attacker’s inputs.

True unpredictability requires randomness at a level that would cause the attacker to collect inconsistent results. This observation leads to the notion of inconsistent deception [48], a model of deception that challenges the cornerstone of projecting false reality with internal consistency. Sun et al. [49, 50] also argued for the value of unpredictability and deception as OS features. In this paper we explored non-intrusive unpredictable interferences to create an uncertain environment for software undergoing deep analysis after an initial borderline classification.

Fuzz Testing: Fault injection is an important method for generating test cases in fuzz testing. Through fault injection, researchers are able to study fault propagation [51] and develop flexible and robust software and systems [52, 53, 54]. Kanawati and Abraham provide a methodology and guidelines for the design of flexible software, based on their experience with the fault injection tool FERRARI [52]. Fault injection has been applied to a number of abstractions. DOCTOR [53], for example, supports memory faults, CPU faults, and communication faults. FINE [51] traces execution flow and key variables through the UNIX kernel via injection of hardware-induced software errors and kernel software faults. A recent survey on assessing dependability with software fault injection [55] provides a comprehensive overview of state-of-the-art fault injection approaches to fit the goals of researchers and practitioners. The LFI tool [56] injects errors in library calls in order to identify error handling faults that arise from misunderstanding of library APIs and from poor portability across different OSes. Other possible forms of fault injection are code mutations and data interface corruptions [57]. Chameleon is similar in that it injects faults (interferences) into the execution of software at the kernel system call level.

Fuzz testing is an effective way to discover coding errors and security loopholes in software, operating systems, and networks by testing applications against invalid, unexpected, or random data inputs. Miller et al. [58] first proposed fuzz testing as an inexpensive mechanism to generate additional software tests. The authors later extended the work [59] to identify missed return code checks from crucial calls, such as memory allocation. Many additional fuzz testing approaches have been proposed [60, 61, 62]. Trinity [63], for example, randomizes system call parameters to test the validation of file descriptors, and has found real bugs [64], including bugs in the Linux kernel [65, 66, 67]. White-box fuzz testers [68, 69, 70] were also proposed to increase the coverage of test inputs by leveraging symbolic execution and dynamic test generation. For instance, KLEE [69] uses symbolic execution and a user-provided model of system call behaviors to generate high-coverage test cases. BALLISTA [71] tests the data type robustness of the POSIX system call interface in a scalable way, by defining 20 data types for testing 233 system calls of the POSIX standard. Chameleon can also be viewed as a fuzz tester at the OS system call API that reveals how resilient an application is to a particular type of misbehavior.
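To make the idea of probabilistic fault injection at a call boundary concrete, the following is a minimal user-space sketch in Python. It is not Chameleon's kernel implementation: the class `UncertainCall`, the helper `run_with_retries`, and the choice of `EAGAIN` as the injected error are all hypothetical names and assumptions made for illustration. The `threshold` parameter plays the role of the probability with which a call is perturbed.

```python
import errno
import random


class UncertainCall:
    """Wrap a callable and fail it with probability `threshold`.

    Hypothetical user-space stand-in for a system-call interference;
    this is an illustration, not Chameleon's actual mechanism.
    """

    def __init__(self, func, threshold, seed=None):
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        self.func = func
        self.threshold = threshold
        self.rng = random.Random(seed)  # private RNG so runs are repeatable
        self.calls = 0
        self.perturbed = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        # random() lies in [0.0, 1.0): threshold 0 never fires, 1 always fires.
        if self.rng.random() < self.threshold:
            self.perturbed += 1
            # Assumed perturbation: report a transient error to the caller.
            raise OSError(errno.EAGAIN, "injected fault")
        return self.func(*args, **kwargs)


def run_with_retries(call, attempts, *args):
    """A resilient caller: retry on EAGAIN, give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise
    return None  # every attempt was perturbed
```

A caller that checks return codes and retries, as `run_with_retries` does, tolerates a low threshold with little visible effect, whereas a caller that ignores the error fails on the first perturbed call. This is the kind of difference in resilience that fault injection at the call interface is meant to expose.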

7 Conclusion

In this work, we presented a detailed description of the design and implementation of Chameleon, together with new extensions. Chameleon is a novel Linux framework that introduces uncertainty as a built-in OS feature to rate-limit the execution of possible malware that received a borderline classification from traditional machine learning detectors, while a second, computationally expensive deep learning analysis is underway. Chameleon targets organizations, where it is common practice to restrict the software running inside the organization's perimeter. Chameleon offers two environments for software running in the system: (i) standard, which works according to the OS specification, and (ii) uncertain, for any software that receives a borderline classification from traditional machine learning based detectors. In the uncertain environment, software experiences a set of perturbations that create obstacles to its execution while the deep learning analysis is underway.
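The two-environment design can be summarized with a small Python sketch. The numeric score, the cut-offs `low` and `high`, and the function name `choose_environment` are assumptions introduced here for illustration only; the text above specifies just that benign software runs in the standard environment and borderline software runs in the uncertain one.

```python
from enum import Enum


class Environment(Enum):
    STANDARD = "standard"    # behaves according to the OS specification
    UNCERTAIN = "uncertain"  # system calls are subject to perturbations


def choose_environment(malware_score, low=0.3, high=0.7):
    """Map a first-stage detector score in [0, 1] to an environment.

    `low` and `high` are hypothetical cut-offs that delimit the borderline
    band. Scores above `high` return None: how software flagged as
    malicious is handled is outside the scope of this sketch.
    """
    if not 0.0 <= malware_score <= 1.0:
        raise ValueError("malware_score must be in [0, 1]")
    if malware_score < low:
        return Environment.STANDARD
    if malware_score <= high:
        return Environment.UNCERTAIN
    return None
```

Under these assumed cut-offs, a score of 0.1 maps to the standard environment and a score of 0.5 maps to the uncertain environment, where the software would remain while the deep learning analysis runs.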

We evaluated Chameleon on Linux with 113 common applications and 100 malware samples from various categories. We define success of software execution in the uncertain environment as benign software tolerating uncertainty and users obtaining useful results from benign software in the system. Our results showed that at a threshold of 10%, intrusive strategies thwarted 62% of malware, while non-intrusive strategies caused a failure rate of 68%. At a threshold of 50%, the percentage of adversely affected malware increased to 81% and 76%, respectively. With a 10% threshold, the perturbations also caused various levels of disruption (crash or hampered execution) to approximately 30% of the analyzed benign software. With a 50% threshold, the percentage of benign software adversely affected rose to 50%. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software.

A dynamic, per-system call threshold caused various levels of disruption to only 10% of the analyzed benign software. The effects of the uncertain environment on malware were more pronounced, with 92% of our studied malware samples failing to accomplish their tasks. Compared to the results obtained for a static threshold, 20% more benign software succeeded and 24% more malware crashed or was hampered in the uncertain environment.

We also analyzed the behavior of crashed benign software, and found that many of the crashes were caused by software bugs. Several bugs were reproduced for Vim, tar, Mozilla Firefox and Thunderbird.

Besides effectively enabling the combination of the best of traditional machine learning and emerging deep learning methods and providing a “safety net” for failures of standard intrusion detection systems, Chameleon improves system security by (i) making systems diverse by design, (ii) increasing attackers’ work factor, and (iii) decreasing the success probability and speed of attacks.

The idea of making systems less predictable is audacious. Nonetheless, our results indicate that an uncertain system can be a feasible and effective barrier against sophisticated and stealthy malware. The degree of uncertainty is not one-size-fits-all; we expect an administrator to dial in the level of uncertainty to the needs of the organization and its applications.


We thank the reviewers for their insightful comments. This research is supported by NSF grants CNS-1464801, CNS-1228839, CNS-1161541, DGE-1303211, ACI-1229576, and CNS-1624782, and by VMware and Florida Cyber Security grants.


  • [1] T. Wrightson, Advanced Persistent Threat Hacking: The Art and Science of Hacking Any Organization, 1st ed.   McGraw-Hill Education, 2014.
  • [2] “Email Attacks: This Time It’s Personal http://itknowledgeexchange.techtarget.com/security-detail/cisco-report-email-attacks-this-time-its-personal/ .”
  • [3] S. Kumar and E. H. Spafford, “An application of pattern matching in intrusion detection,” 1994.

  • [4] G. Vigna and R. A. Kemmerer, “NetSTAT: A Network-Based Intrusion Detection Approach,” ser. ACSAC ’98, 1998.
  • [5] “Bromium end point protection https://www.bromium.com/.”
  • [6] “Modern malware exposed http://www.nle.com/literature/FireEye_modern_malware_exposed.pdf.”
  • [7] “The modern malware review http://media.paloaltonetworks.com/documents/The-Modern-Malware-Review-March-2013.pdf.”
  • [8] R. Sun, X. Yuan, A. Lee, M. Bishop, D. E. Porter, X. Li, A. Gregio, and D. Oliveira, “The dose makes the poison-leveraging uncertainty for effective malware detection,” in 2017 IEEE Conference on Dependable and Secure Computing, 2017, pp. 123–130.
  • [9] C.-C. Tsai, B. Jain, N. A. Abdul, and D. E. Porter, “A study of modern linux api usage and compatibility: What to support when you’re supporting,” in Eurosys, 2016.
  • [10] “Proportional relationship http://intermath.coe.uga.edu/dictnary/descript.asp?termID=487.”
  • [11] N. S. Agency, “Security-enhanced linux.” [Online]. Available: http://www.nsa.gov/research/selinux/
  • [12] “Gnu project http://www.gnu.org/software/software.html.”
  • [13] “Spec cpu 2006 https://www.spec.org/cpu2006/.”
  • [14] “The phoronix test suite http://www.phoronix-test-suite.com/.”
  • [15] “Thc: the hacker’s choice https://www.thc.org/.”
  • [16] “Virusshare https://virusshare.com/.”
  • [17] “Dionaea - a malware analysis honeypot http://www.edgis-security.org/honeypot/dionaea/.”
  • [18] “Gcov https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.”
  • [19] “Emma: a free java code coverage tool http://emma.sourceforge.net/.”
  • [20] “Coverage.py https://github.com/msabramo/coverage.py.”
  • [21] “The black vine cyberespionage group http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-black-vine-cyberespionage-group.pdf.”
  • [22] “Logkeys ubuntu http://packages.ubuntu.com/precise/admin/logkeys.”
  • [23] “viminfo:illegal starting char in line.” [Online]. Available: http://www.explorelinux.com/fix-e575-viminfo-illegal-starting-char-line/
  • [24] “tar:fails using ’-c’ option extracting archive with empty directories.” [Online]. Available: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602209
  • [25] “tar:operation not permitted.” [Online]. Available: https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/1238266
  • [26] “Thunderbird:unable to locate mail spool file.” [Online]. Available: https://support.mozilla.org/en-US/questions/1157285
  • [27] “Thunderbird crashes with segmentation fault.” [Online]. Available: https://bugs.launchpad.net/ubuntu/+source/thunderbird/+bug/571308
  • [28] “Firefox:bus error (core dumped).”
  • [29] “Firefox:fatal io error 11 (resource temporarily unavailable) on x server :0.” [Online]. Available: https://bugzilla.mozilla.org/show_bug.cgi?id=895947
  • [30] “Launchpad.net.” [Online]. Available: https://bugs.launchpad.net/
  • [31] “Bugzilla.” [Online]. Available: https://www.bugzilla.org/
  • [32] “Viminfo documentation http://vimdoc.sourceforge.net/htmldoc/starting.html#viminfo-file.”
  • [33] “Thunderbird spool file https://support.mozilla.org/en-US/questions/1021171.”
  • [34] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A sense of self for Unix processes,” in Proceedings of the IEEE Symposium on Security and Privacy, 1996, pp. 120–128.
  • [35] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, “Semantics-aware malware detection,” in Proceedings of the 2005 IEEE Symposium on Security and Privacy, ser. SP ’05, 2005.
  • [36] A. Lanzi, D. Balzarotti, C. Kruegel, M. Christodorescu, and E. Kirda, “Accessminer: Using system-centric models for malware protection,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, ser. CCS ’10, 2010, pp. 399–412.
  • [37] H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda, “Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis,” ACM CCS 07, pp. 116–127, November 2007.
  • [38] L. Martignoni, E. Stinson, M. Fredrikson, S. Jha, and J. C. Mitchell, “A layered architecture for detecting malicious behaviors,” in Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, ser. RAID ’08, 2008, pp. 78–97.
  • [39] A. Dinaburg, P. Royal, M. Sharif, and W. Lee, “Ether: Malware analysis via hardware virtualization extensions,” in Proceedings of the 15th ACM Conference on Computer and Communications Security, ser. CCS ’08.   New York, NY, USA: ACM, 2008, pp. 51–62. [Online]. Available: http://doi.acm.org/10.1145/1455770.1455779
  • [40] Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, and M. Abdulhayoglu, “Combining file content and file relations for cloud based malware detection,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11.   New York, NY, USA: ACM, 2011, pp. 222–230. [Online]. Available: http://doi.acm.org/10.1145/2020408.2020448
  • [41] S. Forrest, A. Somayaji, and D. Ackley, “Building diverse computer systems,” in Proceedings of the 6th Workshop on Hot Topics in Operating Systems (HotOS-VI), 1997.
  • [42] P. Larsen, A. Homescu, S. Brunthaler, and M. Franz, “Sok: Automated software diversity,” in IEEE Security and Privacy Symposium, 2014, pp. 276–291.
  • [43] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest, “Software mutational robustness,” Genetic Programming and Evolvable Machines, vol. 15, no. 3, 2014.
  • [44] M. Chew and D. Song, “Mitigating buffer overflows by operating system randomization,” UC, Berkeley, Tech. Rep., 2002.
  • [45] L. Spitzner, Honeypots: Tracking Hackers.   Addison Wesley Reading.
  • [46] N. R. J. Michael, M. Auguston, D. Drusinsky, H. Rothstein, and T. Wingfield, “Phase II Report on Intelligent Software Decoys: Counterintelligence and Security Countermeasures,” Technical Report, Naval Postgraduate School, Monterey, CA, 2004.
  • [47] M. H. Almeshekah and E. H. Spafford, “Planning and integrating deception into computer security defenses,” in New Security Paradigms Workshop (NSPW), 2014.
  • [48] V. Neagoe and M. Bishop, “Inconsistency in deception for defense,” in New Security Paradigms Workshop (NSPW), 2007, pp. 31–38.
  • [49] R. Sun, D. E. Porter, D. Oliveira, and M. Bishop, “The case for less predictable operating system behavior,” in Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2015.
  • [50] R. Sun, A. Lee, A. Chen, D. E. Porter, M. Bishop, and D. Oliveira, “Bear: A framework for understanding application sensitivity to os (mis)behavior,” in ISSRE, 2016.
  • [51] W.-L. Kao, R. K. Iyer, and D. Tang, “Fine: A fault injection and monitoring environment for tracing the unix system behavior under faults,” Software Engineering, IEEE Transactions on, vol. 19, no. 11, pp. 1105–1118, 1993.
  • [52] G. Kanawati, N. Kanawati, and J. Abraham, “Ferrari: A flexible software-based fault and error injection system,” Computers, IEEE Transactions on, vol. 44, no. 2, pp. 248–260, 1995.
  • [53] S. Han, K. G. Shin, and H. Rosenberg, “Doctor: An integrated software fault injection environment for distributed real-time systems,” in Computer Performance and Dependability Symposium, 1995. Proceedings., International.   IEEE, 1995, pp. 204–213.
  • [54] J. Carreira, H. Madeira, and J. G. Silva, “Xception: Software fault injection and monitoring in processor functional units,” Dependable Computing and Fault Tolerant Systems, vol. 10, pp. 245–266, 1998.
  • [55] R. Natella, D. Cotroneo, and H. S. Madeira, “Assessing dependability with software fault injection: A survey,” ACM Comput. Surv., vol. 48, no. 3, pp. 44:1–44:55, Feb. 2016.
  • [56] P. D. Marinescu and G. Candea, “Efficient testing of recovery code using fault injection,” ACM Trans. Comput. Syst., vol. 29, no. 4, pp. 11:1–11:38, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2063509.2063511
  • [57] A. Lanzaro, R. Natella, S. Winter, D. Cotroneo, and N. Suri, “An empirical study of injected versus actual interface errors,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014.   ACM, pp. 397–408.
  • [58] B. Miller, D. Koski, C. P. Lee, V. Maganty, R. Murthy, A. Natarajan, and J. Steidl, “Fuzz revisited: A re-examination of the reliability of UNIX utilities and services,” Tech. Rep., 1995.
  • [59] B. P. Miller, L. Fredriksen, and B. So, “An empirical study of the reliability of UNIX utilities,” Communications of the Association for Computing Machinery, vol. 33, no. 12, pp. 32–44, 1990.
  • [60] C. Miller and Z. N. J. Peterson, “Analysis of Mutation and Generation-Based Fuzzing,” Independent Security Evaluators, Tech. Rep., Mar. 2007.
  • [61] U. Kargén and N. Shahmehri, “Turning programs against each other: high coverage fuzz-testing using binary-code mutation and dynamic slicing,” Target, vol. 101, no. 1101, p. 1011, 2015.
  • [62] Z. Zhang, Q.-Y. Wen, and W. Tang, “An efficient mutation-based fuzz testing approach for detecting flaws of network protocol,” in Computer Science & Service System (CSSS), 2012 International Conference on.   IEEE, 2012, pp. 814–817.
  • [63] “Lca: The trinity fuzz tester.” [Online]. Available: https://lwn.net/Articles/536173/
  • [64] “Bugs found by trinity.” [Online]. Available: http://codemonkey.org.uk/projects/trinity/bugs-fixed.php
  • [65] B. Garn and D. E. Simos, “Eris: A tool for combinatorial testing of the linux system call interface,” in Software Testing, Verification and Validation Workshops (ICSTW), 2014 IEEE Seventh International Conference on.   IEEE, 2014, pp. 58–67.
  • [66] A. Kurmus, “Kernel self-protection through quantified attack surface reduction.”
  • [67] V. M. Weaver and D. Jones, “perf_fuzzer: Targeted fuzzing of the perf_event_open() system call.”
  • [68] P. Godefroid, M. Y. Levin, and D. A. Molnar, “Automated whitebox fuzz testing,” in NDSS, vol. 8, 2008, pp. 151–166.
  • [69] C. Cadar, D. Dunbar, and D. Engler, “Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’08.   Berkeley, CA, USA: USENIX Association, 2008, pp. 209–224.
  • [70] N. Tillmann and J. De Halleux, “Pex–white box test generation for .NET,” in Tests and Proofs.   Springer, 2008, pp. 134–153.
  • [71] P. Koopman, J. Sung, C. Dingman, D. Siewiorek, and T. Marz, “Comparing operating systems using robustness benchmarks,” in Reliable Distributed Systems, 1997. Proceedings., The Sixteenth Symposium on.   IEEE, 1997, pp. 72–79.