Maintainable Log Datasets for Evaluation of Intrusion Detection Systems

Intrusion detection systems (IDS) monitor system logs and network traffic to recognize malicious activities in computer networks. Evaluating and comparing IDSs with respect to their detection accuracy is therefore essential for selecting them for specific use-cases. Despite a great need, hardly any labeled intrusion detection datasets are publicly available. As a consequence, evaluations are often carried out on datasets from real infrastructures, where analysts cannot control system parameters or generate a reliable ground truth, or on private datasets that prevent reproducibility of results. As a solution, we present a collection of maintainable log datasets collected in a testbed representing a small enterprise. We employ extensive state machines to simulate normal user behavior and inject a multi-step attack. For scalable testbed deployment, we use concepts from model-driven engineering that enable automatic generation and labeling of an arbitrary number of datasets comprising repetitions of attack executions with varied parameters. In total, we provide 8 datasets containing 20 distinct types of log files, of which we label 8 files for 10 unique attack steps. We publish the labeled log datasets and the code for testbed setup and simulation as open source to enable others to reproduce and extend our results.


1 Introduction

Cyber attacks pose a threat to network and system security at any scale. To achieve their goals, which typically include intrusion, espionage, sabotage, and system takeover, adversaries utilize a wide range of tools and attack techniques to discover previously unknown vulnerabilities and find new attack vectors. While system operators seek to keep their network components patched, the ever-changing threat landscape implies that ultimate security is impossible to guarantee, especially as networks continue to grow and change over time.

To counteract these problems, manual security-related tasks of system operators have long been supported by automatic tools that continuously monitor networks and systems for both known and unknown threats. These so-called intrusion detection systems (IDS) usually ingest network traffic or system log data and analyze their contents for malicious activities. Many IDSs also carry out file integrity checks or scan registry keys and system memory; however, in the context of this paper we solely focus on intrusion detection techniques that leverage log data, i.e., sequentially generated and chronologically ordered events that usually comprise a timestamp and a message containing parameters. IDSs that analyze such log data are most often differentiated into signature-based detection systems, which search for predefined indicators such as hash sums known to correspond to malware, and anomaly-based detection systems, which employ self-learning techniques to capture the baseline system behavior and report any deviation from this learned model as a potential threat [17, 3].
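To illustrate this distinction on log data, consider the following minimal sketch; the indicator list, event format, and learned baseline are hypothetical examples and do not correspond to any particular IDS.

```python
# Minimal sketch contrasting signature-based and anomaly-based detection on log events.
# The indicator set, event fields, and baseline are hypothetical illustrations.

KNOWN_BAD_HASHES = {"44d88612fea8a8f36de82e1278abb02f"}  # signature: a known malware hash

def signature_detect(event: dict) -> bool:
    """Flag an event if it contains a predefined indicator of compromise."""
    return event.get("file_hash") in KNOWN_BAD_HASHES

class AnomalyDetector:
    """Learn the set of event types seen during an attack-free training phase
    and report any previously unseen event type as an anomaly."""
    def __init__(self):
        self.baseline = set()

    def train(self, events):
        for event in events:
            self.baseline.add(event["type"])

    def detect(self, event: dict) -> bool:
        return event["type"] not in self.baseline

detector = AnomalyDetector()
detector.train([{"type": "auth.login"}, {"type": "cron.run"}])
print(signature_detect({"file_hash": "44d88612fea8a8f36de82e1278abb02f"}))  # True
print(detector.detect({"type": "webshell.upload"}))                         # True (unseen type)
```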

Independent of their type, evaluating IDSs for their ability to detect attacks is crucial to compare different approaches and objectively select appropriate detection techniques for specific system environments. Publicly available benchmark log datasets are an indispensable prerequisite for such evaluations. Unfortunately, these log datasets are scarce and usually do not fulfill the requirements set by security researchers. In particular, one of the most crucial aspects of evaluations is to compute detection accuracies, which requires a ground truth that specifies all malicious log events. However, datasets collected from real infrastructures generally lack a reliable ground truth, as it is not possible to ensure that only normal and benign activities are carried out on the network apart from purposefully injected attacks [32]. Moreover, adjusting configurations of components in productive environments or launching attack cases is often only possible to a limited extent, since the security and availability of these systems are of utmost importance to the organizations hosting the infrastructures [38]. In addition, datasets collected in real environments most often cannot be published due to privacy concerns, as log data frequently contains user data or parts of sensitive file contents.

To avoid the problems of real infrastructures altogether, security analysts recreate networks and systems in testbeds and use simulations to generate a base load of normal system operation. However, even datasets created in such controlled environments have been criticized for several reasons, for example, missing documentation of installed services [25, 1], limited generalizability [1], outdated or overly simple attack cases [31, 37], heavy preprocessing such as removal of event parameters [2], involvement of closed-source software [37, 38], lack of periodic behavior [25], missing reproducibility [38], insufficient duration [4], focus on single hosts rather than the whole network [25], or lack of variations of attack parameters [32].

Figure 1: Procedure for generating labeled log datasets.

In addition, testbeds generally require a high effort to set up, configure, update, and adjust. In our earlier work [21], we therefore proposed introducing concepts from model-driven engineering into testbed deployment processes. Figure 1 visualizes our procedure for generating labeled log datasets from model-driven testbeds. Contrary to common testbed generation approaches that result in single static test environments, our approach generates models for infrastructure setup, normal behavior simulation, and attack execution that act as templates by leaving several parameters open as variables, and defines transformation rules that dynamically fill out these parameters when a testbed is launched. The main advantage of this methodology is that it is simple to generate an arbitrary number of datasets that stem from different testbeds with variations, i.e., normal and malicious traces differ slightly across datasets and thus enable more robust evaluations. We recognized some shortcomings of our implementation, including a fairly simple network structure and an unreliable labeling strategy. To overcome these problems, we largely extend the scope of our simulation and integrate an automatic labeling mechanism [18].

Alongside this paper, we publish a collection of log datasets generated with the presented approach as well as all code that is necessary to run our testbed and the simulations within it, so that other researchers are able to replay or augment the simulation runs. Our datasets are therefore maintainable and allow for continuous improvements such as extensions of the labeling coverage as well as additions of datasets from new testbeds. We summarize our contributions as follows:

The remainder of this paper is structured as follows. Section 2 reviews existing log datasets. In Sect. 3 we outline our methodology for generating log datasets and explain our modeled scenario. We analyze the generated datasets in Sect. 4 and discuss the results in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Background and Related Work

Due to the large need for datasets in cyber security research, several attempts to generate benchmark datasets have been made in the past. However, most of these datasets were created with specific use-cases in mind and are thus not generally applicable. To compare these datasets on a common basis, we first describe a set of requirements that are relevant for intrusion detection datasets and then discuss the fulfillment of these aspects for several state-of-the-art datasets.

2.1 Requirements

Recording log datasets in testbeds or real environments is not straightforward; it is a task that requires careful planning, since the quality and usefulness of the resulting data strongly rely on several decisions made by the analyst. We gathered a list of requirements by reviewing the design principles followed by authors of existing datasets. In the following, we summarize our findings.

  1. Use-case. To ensure relevance and authenticity of the dataset, it is necessary to design the overall network layout and technical infrastructure of the system where log data is recorded in the context of a specific scenario. This also includes services available on the involved machines [21]. Clearly specifying the scope of the simulation also helps to define the limitations of the dataset.

  2. Synthetic data generation. Datasets collected from real-world system environments are sometimes considered superior to synthetically generated data because they are by definition realistic, while simulations only try to replicate their characteristics [10]. However, real datasets have the strong disadvantage that it is infeasible to differentiate normal from anomalous or malicious logs with complete certainty, since the root causes of some actions are unknown to the analysts [32]. Synthetic dataset generation, in turn, implies that scripts replicating normal behavior at an appropriate level of detail are prepared beforehand. This particularly concerns models for user activities that normally occur on the system, which can be very diverse and thus non-trivial to formalize. On the plus side, modeling the normal behavior makes it possible to steer the parameters of the simulation and generate data that is representative of different levels of detection complexity [32]. Therefore, we argue that synthetically generated log datasets are the best option for IDS evaluations.

  3. Attacks. As part of a realistic evaluation of IDSs, it is necessary to select recent and relevant attack scenarios that are suitable for the system environment at hand [31, 21]. Otherwise, outdated attack cases may not yield intrusion detection evaluation results that are representative and comparable to those of more modern attacks.

  4. System logs. When IDSs are applied in productive systems, they are usually able to analyze logs in raw and unaltered form. Accordingly, log datasets for the evaluation of IDSs should also provide logs that are not processed in any way [25]. Fortunately, synthetic datasets recorded in simulations are usually less critical when it comes to privacy, since no humans are involved and thus anonymization of personal user data that possibly occurs in the logs is not required. The same applies to sensitive contents of files that may appear in the logs, which should thus be simulated with collections of predefined dummy files [38]. Another important aspect is to configure the logging framework in a realistic way that fits the use-case. For this, analysts must decide where to log and what to log [41]. In particular, anomaly-based IDSs require logs corresponding to normal system behavior to learn a baseline for detection, meaning that logging levels should be set to info or even debug rather than error or warning. Moreover, it is beneficial to log performance metrics such as CPU or memory data, because they are also suitable inputs for IDSs [16].

  5. Network traffic. Besides system logs, which are the main input of host-based IDSs, network traffic is a widely used data source for network-based IDSs. Accordingly, datasets should also include packet captures to enable the evaluation of network-based IDSs and hybrid IDSs that make use of both system logs and network traffic [24].

  6. Periodicity. Productive system environments naturally exhibit periodic behavior; for example, cron jobs are scheduled for execution at fixed intervals, and events originating from human activities follow the daily and weekly patterns of work shifts. Self-learning IDSs are able to integrate these cycles in their models to detect contextual anomalies, i.e., events that are considered anomalous due to their time of occurrence [3]. It is therefore essential to extend the duration of the simulation to cover several of these cycles [25].

  7. Labels. Ground truth tables that unambiguously assign labels to all events are needed to compute evaluation metrics such as detection accuracy or false alarm rates [31]. Accordingly, it is essential to provide a comprehensible methodology for creating correct ground truth tables for IDS evaluation.

  8. Documentation. Datasets should be published with detailed descriptions of all relevant aspects of the data creation. Otherwise, it is not possible for others to fully understand all artifacts present in the data, which could possibly lead to incorrect assumptions and invalidate evaluation results [25].

  9. Repetitions. For anomaly-based IDSs that only learn from normal behavior and then classify test data as either normal or anomalous, it is sufficient to have artifacts of a single attack execution in the data. For attack classification, however, it is necessary that attacks are present at least in the training and test datasets, and possibly in validation datasets. Accordingly, attacks should be launched multiple times by repeating the simulation. In addition, research on alert aggregation urgently requires suitable datasets, especially for system logs analyzed by host-based IDSs [27]. Clustering-based aggregation methods in particular require that the same attacks are carried out multiple times to form groups [22].

  10. Variations. Approaches for both attack classification and alert aggregation should be challenged by introducing variations in attack executions [32]. Moreover, evaluation results have a higher robustness when they are based on multiple attack executions that cover a spectrum of possible attack variations [21]. This can be realized by dynamically changing attack parameters in each simulation run.

  11. Reproducibility. The technologies that constitute the simulation are continuously updated. To prevent datasets from becoming outdated, it should be possible to repeat simulations at any given time [37, 38]. This also makes it possible to reuse existing assets and change only certain parts of the simulation, e.g., keep the infrastructure and user simulation but include another attack vector. It is therefore beneficial to publish all code used to carry out the simulation alongside the resulting datasets.

2.2 Literature Analysis

The previous section outlined a set of requirements that should be fulfilled by datasets to enable the evaluation of intrusion detection systems. We gathered several datasets that are commonly used in scientific evaluations and analyzed whether they fulfill our requirements. Table 1 shows a complete list of all datasets and our findings, where ✓ indicates that a dataset fulfills the respective requirement, (✓) indicates partial fulfillment, and no symbol means that the requirement is not fulfilled. In the following, we discuss our findings and relevant properties of the datasets in detail.

Requirement
Dataset (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
ADFA-LD [4] Linux OS
ADFA-WD [5] Windows OS
IoT-DDoS [1] Internet of Things
AWSCTD [2] Windows OS
CIDD [15] Cloud Systems
CIDDS [31] Enterprise IT
LID-DS [8] Linux OS
VAST Challenge 2011 [9] Enterprise IT
KDD Cup 1999 [36] Military IT
Loghub [12] Supercomputer and OS
NGIDS DS [10] Enterprise IT
CICIDS 2017 [33] Enterprise IT
Skopik et al. [34] Enterprise IT
SOCBED dataset [38] Enterprise IT
UGR’16 [25] Enterprise IT
HDFS [40] Supercomputer
AIT-LDSv1.1 [21] Enterprise IT
AIT-LDSv2.0 (this paper) Enterprise IT
Table 1: Fulfillment of requirements for existing datasets

One of the earliest log datasets that became widely used in intrusion detection is the KDD Cup 1999 dataset [36]. The logs were collected during a simulation of several intrusions in a military network. Unlike many modern datasets, the authors made sure to label all events with the respective attack types and furthermore repeated and varied the attacks to yield different probability distributions in the training, validation, and test datasets. These properties make this dataset especially attractive for evaluating machine learning techniques. Even today it is still widely used in scientific publications, although it has been repeatedly criticized for being outdated, too simple, and not reproducible due to the fact that closed-source tools were used for traffic generation [37].

As a consequence of these criticisms, Creech et al. generated ADFA-LD [4] and ADFA-WD [5], two datasets containing sequences of system calls on a Linux and a Windows host, respectively. For the generation of the datasets, the authors simulated normal activities such as web browsing and file editing and launched several attacks, such as brute-force logins and exploits for webshell uploads. Unfortunately, the system calls are stripped of all contextual variables such as timestamps, parameters, and return values, and are thus not representative of real data [2]. Moreover, the datasets have been criticized for only including a single host, not generalizing well to other systems, and lacking documentation that details how the data was collected and what services are installed [1, 25]. The AWSCTD [2] aims to resolve at least one of these issues by recording Windows system calls without removing any parameters and further extends the set of launched attacks. However, the authors also consider only a single host and not a full network.

Another dataset based on Linux system calls is LID-DS [8]. While the authors explain the attack scenarios in great detail, there is little information on the simulation of normal system behavior. They carry out all attacks multiple times and collect the logs from hundreds of runs that last around 30 seconds each. CIDD [15] provides logs specifically for masquerade attacks. One of the noteworthy aspects of this data is that the authors manage to label all events by correlating network and system logs and mapping them to attack tables specifying the expected times, IP addresses, and user names related to attacks. Moreover, the users generating normal activity in the dataset are categorized into normal, advanced, administrator, programmer, and secretary users.

One of the few datasets that also include system logs other than system calls is from the VAST Challenge 2011 [9]. In particular, the dataset comprises firewall logs, IDS alerts, syslogs, and network packet captures. Among the attacks launched against the simulated system are security scans, denial-of-service attacks, and remote desktop connections resulting from a social-engineering attack. The authors also provide a document describing the solutions to the challenge, which depicts a ground truth of malicious events. The dataset presented alongside the open-source testbed SOCBED [38] contains system logs from a network of Windows and Linux hosts. While the authors did not collect network traffic for this dataset, they state that it is simple to extend their testbed accordingly and repeat the experiments. In addition, the authors discuss variations in log data, however only with respect to circumstantial factors such as system performance, not purposefully incorporated variations as accomplished by our model-driven approach. Skopik et al. [34] also collect network traffic as well as access and application logs on a testbed where simulated users click around on a mail platform. Contrary to most existing papers that present new datasets, they configure their user simulations based on the behavior of real users and also validate their data by comparing accessed resources. Other datasets comprising system and application logs from various services are provided in Loghub [12]. The main problem with these datasets is that they mostly involve anomalous traces related to failures rather than cyber attacks.

While system log datasets are most often collected from single hosts, whole networks comprising several hosts are usually deployed to generate network traffic datasets. For generating CIDDS [31], the authors recreated a virtual company with network components that are commonly used in enterprise IT, e.g., Windows and Linux hosts as well as file shares and web servers, and placed them in separate subnets for management, office, and developers. Their user simulations are based on state machines that generate complex behavior patterns instead of repeated sequences, and their models also respect working hours and breaks. Moreover, their network is connected to the Internet to mix the simulated traffic with real connections and possibly attacks. To generate UGR'16 [25], the authors also use a combination of real user traffic and simulated attack traffic. In doing so, they specifically pay attention to the cyclic behavior of communication logs that originates from daily or weekly usage patterns. Moreover, their attacks are generated with random starting times.

The authors of CICIDS 2017 [33] follow a different approach, as they make use of a profiler that analyzes real communication in a network and then arbitrarily generates data following these patterns. They recorded the network traffic while launching several attacks, among them denial-of-service attacks, vulnerability exploits, and a botnet. Similarly, a network traffic generation appliance was used to generate NGIDS DS [10]. Unlike these datasets, IoT-DDoS [1] specifically focuses on a scenario that simulates the Internet of Things in a network.

In our earlier work we presented AIT-LDSv1.1 [21], a system log dataset collected from a web server hosting a content management system and groupware. Unlike most existing approaches for dataset generation, the paper [21] describes a model-driven strategy for automatic testbed deployment to generate multiple datasets with variations of attack executions. We recognize several shortcomings of that dataset: First, aside from some machines running user simulations, the network is relatively simple, as it consists of only a single web server. Second, the simulation focuses on system log data and thus no network traffic is captured. Third, the labeling of malicious events is not reliable, since it relies on similarity-based matching, which may leave lines incorrectly unlabeled in cases where variations lead to new or dissimilar events [18]. Finally, only the resulting data is publicly available, but the scripts for deploying the testbed and running the simulation are not accessible. As a consequence of these shortcomings, we propose AIT-LDSv2.0 in this paper. In comparison to our previous testbed for log data generation, we increased the network complexity, collected logs from all components of the network (e.g., the firewall), extended the simulation of normal behavior, improved the strategy for event labeling, and published all code for deploying the testbed along with the generated datasets. As visible in Table 1, our new dataset meets all requirements stated in Sect. 2.1. We discuss the fulfillment of these requirements in detail in Sect. 5.1.

3 Log Dataset Generation Methodology

This section outlines the overall methodology for the generation of our dataset. We first describe a procedure for automatic testbed deployment that leverages concepts from model-driven engineering to enable the generation of multiple datasets with variations. Subsequently, we explain the application scenario modeled by our testbed and state relevant design criteria for the monitored network, simulations of normal behavior, and injected attacks.

3.1 Testbed Generation

In our earlier work [21] we presented a model-driven methodology for testbed generation. We also published a follow-up paper that proposes a strategy for log event labeling [18]. In this paper, we combine both methods to generate multiple labeled datasets with variations. In the following, we briefly summarize the main aspects of model-driven testbeds and the integration of our labeling procedure.

As described in Sect. 2, synthetic log datasets are commonly collected on testbeds, i.e., one or more virtual machines deployed in isolated networks. Setting up such testbeds involves time- and resource-consuming tasks that often require a high amount of domain knowledge. The resulting testbeds are often relatively static, i.e., difficult to modify in hindsight when updates of certain components are required or changes of the scenario become necessary [21]. Model-driven testbed generation alleviates these problems by allowing analysts to design testbeds on a higher level of abstraction and to use these models to automatically instantiate arbitrary numbers of testbeds. On top of that, it is possible to specify parameter spaces rather than specific values for all kinds of testbed properties, including network size, frequencies of user interactions, or attack attributes. Datasets generated with such approaches include variations of the system environment as well as of user and attacker behavior that manifest themselves in the logs, which is beneficial for IDS evaluation since it increases the robustness of the results and makes the datasets applicable for the evaluation of alert aggregation.

Figure 2 depicts an overview of the layers involved in such a model-driven approach for dataset generation. Layer (L4) represents the highest level of abstraction and comprises three different types of models: (i) state machines and scripts that simulate normal user behavior, (ii) provisioning scripts for deploying and configuring the technical infrastructure, and (iii) scripts that launch attacks and rules that assign labels to the generated events. All of these models are designed as templates, i.e., they leave out several parameters that are dynamically filled out when instantiating a specific testbed based on predefined ranges and lists. For example, IP addresses of all components are randomly chosen from pools, user names are selected from databases, and transition probabilities are calculated from predefined distributions. Accordingly, we refer to the templated scripts on layer (L4) as testbed-independent models (TIM).

Figure 2: Concept for model-driven testbed generation and dataset labeling.

Layer (L3) contains so-called testbed-specific models (TSM) that are instantiated from the TIMs. This is accomplished by running a transformation engine that processes all templates provided by the TIMs and fills out all parameters according to their predefined spaces. Note that this process is fully automatic and can be repeated as often as needed to generate any desired number of testbeds. The resulting TSMs are runnable scripts that are ready to be executed in order to deploy the virtual machines, configure all services, start the user simulations, and launch the attacks at a given point in time. The simulation then runs in real-time to ensure that all generated log artifacts, e.g., timestamps and latency times, resemble execution in real-world scenarios.
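As a rough illustration of this transformation step, the following sketch samples parameters from predefined spaces and fills them into a simplified provisioning template; the template syntax, parameter names, and address pool are assumptions made for this example and do not mirror our actual implementation.

```python
import random
import ipaddress
from string import Template

# Testbed-independent model (TIM): a provisioning template with open parameters.
# Template syntax and parameter names are illustrative only.
TIM_TEMPLATE = Template(
    "network: $zone_name.$tld\n"
    "intranet_server_ip: $server_ip\n"
    "internal_employees: $num_internal\n"
    "mail_servers: $num_mail\n"
)

def instantiate_tsm(seed: int) -> str:
    """Fill the template with values sampled from predefined parameter spaces,
    yielding one testbed-specific model (TSM)."""
    rng = random.Random(seed)
    zone = ipaddress.ip_network("10.35.0.0/24")           # example zone address pool
    hosts = list(zone.hosts())
    return TIM_TEMPLATE.substitute(
        zone_name=rng.choice(["fox", "santos", "shaw"]),  # e.g., drawn from a name generator
        tld=rng.choice(["org", "com", "info"]),
        server_ip=str(rng.choice(hosts)),
        num_internal=rng.randint(3, 9),                   # 3-9 internal employees (cf. Sect. 3.2.1)
        num_mail=rng.randint(2, 4),                       # 2-4 mail servers
    )

# Repeating the transformation with different seeds yields testbeds with variations.
print(instantiate_tsm(1))
print(instantiate_tsm(2))
```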

Once the simulation is completed, i.e., the analyst determines that logs from a sufficiently long time period have been collected or a predefined end time of the simulation is reached, layer (L2) handles the collection of log data from all machines. This mainly involves logs that are typically analyzed by IDSs, e.g., access logs, authentication logs, monitoring logs, and audit logs, but also custom logs generated by our state machines that simulate normal user and attacker behavior. In addition, the collection script gathers so-called facts from all machines, including their IP addresses, OS information, network configurations, etc. These data are necessary for the automatic generation of a ground truth, which is carried out on layer (L1). Labeling consists of a sequence of steps [18]. First, a pre-processor prepares all logs for the following tasks. This includes unzipping archived log files or transforming logs from binary format into text. Second, a parser runs over all log lines, transforms them into tokens, and loads them into a database so that it is possible to query single or multiple logs based on their event parameters. Third, a post-processor trims all stored logs according to the predefined start and stop times of the simulation. Moreover, all labeling rules that are defined as templates within the attacker TIMs are filled out using the facts collected on layer (L2). For example, a rule that labels all DNS log events involving the domain address of the attacker may be automatically augmented with this information by extracting the address as a fact from the attacker's host machine. Similarly, start and stop times that are retrieved as facts from the attack execution logs may be used to limit the search scope of the queries. In the final step of the labeling procedure, the completed labeling rules are used to query logs from the database and assign labels to the results. The main advantage of leveraging facts in rules is that no manual adjustments need to be made when executing the same rules on other testbeds, since all relevant information is automatically extracted as facts from the respective hosts. Finally, after all labeling rules are processed, the resulting dataset consisting of the raw logs and their assigned labels is ready to be shared or used for IDS evaluation.
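The following sketch illustrates the idea behind fact-based labeling rules: a rule template is completed with facts gathered from a specific testbed (here, a hypothetical attacker domain and attack time window) and then used to query parsed log events; the data structures and field names are simplified assumptions and do not correspond to our actual rule format.

```python
from datetime import datetime

# Facts gathered from the machines of one specific testbed (illustrative values).
facts = {
    "attacker_domain": "exfil.example-attacker.org",
    "attack_start": datetime(2022, 1, 18, 11, 20),
    "attack_end": datetime(2022, 1, 18, 11, 45),
}

# A labeling rule template whose placeholders are filled with facts at labeling time.
rule = {
    "label": "attacker_dns_exfiltration",
    "file": "dns.log",
    "query": {"queried_domain_endswith": facts["attacker_domain"]},
    "time_window": (facts["attack_start"], facts["attack_end"]),
}

def apply_rule(events, rule):
    """Assign the rule's label to all parsed events matching the completed query."""
    start, end = rule["time_window"]
    labeled = []
    for event in events:
        in_window = start <= event["timestamp"] <= end
        matches = event["query"].endswith(rule["query"]["queried_domain_endswith"])
        if in_window and matches:
            labeled.append((event, rule["label"]))
    return labeled

# Example: two parsed DNS events, one of which falls within the attack window.
events = [
    {"timestamp": datetime(2022, 1, 18, 11, 30), "query": "aGVsbG8.exfil.example-attacker.org"},
    {"timestamp": datetime(2022, 1, 18, 9, 0), "query": "intranet.fox.org"},
]
print(apply_rule(events, rule))  # only the first event receives the label
```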

3.2 Scenario

The previous section gave a general overview of our methodology for dataset generation. In this section, we describe our targeted use-case and explain specific design decisions regarding variations in the dataset.

3.2.1 Use-case

The purpose of our collection of log datasets is to enable the evaluation of IDSs in the context of a widespread application scenario that is frequently the target of cyber attacks. Small- and medium-sized organizations in particular are frequent targets of cyber attacks, often because they do not have the resources required for extensive protection [13]. We therefore design our testbed to resemble a small enterprise network that follows well-known security guidelines, such as the segmentation of networks into zones [14].

Figure 3 displays an overview of the network realized by our testbed. The network comprises three zones: (i) the intranet that contains a number of Linux hosts (Ubuntu 20.04, https://ubuntu.com/) for each employee as well as an intranet server running WordPress (5.8.2, https://wordpress.com/) and a Samba file share (Samba 4.5.9, https://samba.org/), (ii) the demilitarized zone (DMZ) that contains servers for VPN (OpenVPN 2.4.4, https://openvpn.net/), proxy, mail (Horde Groupware 5.2.17, https://horde.org/apps/webmail), and cloud share (OwnCloud 10.5.0, https://owncloud.com/), and (iii) the Internet with global DNS (MaraDNS 2.0.13, https://maradns.samiam.org/, and Dnsmasq 2.79, https://thekelleys.org.uk/dnsmasq/doc.html), hosts for remote employees that connect to the intranet via VPN, external employees that use external mail servers, and an attacker host. The zones are connected via a firewall (Shorewall 5.1.12.2, https://shorewall.org/) that also acts as an internal DNS server for all domains owned by the organization. All employed technologies are publicly available and commonly used in real networks [25].

Figure 3: Overview of the testbed network. Steps (1)-(3) mark the attacker’s path to compromise the intranet server and steps (a)-(c) represent connections related to the data exfiltration attack vector.

As outlined in Sect. 3.1, TIMs result in different TSMs due to the fact that several parameters are set dynamically during instantiation of the testbeds. With respect to the system environment, this mainly concerns the network size and the allocation of IP addresses. In particular, we generate between 3 and 9 hosts for internal, remote, and external employees respectively, meaning that the final testbed may consist of at least 9 and at most 27 user simulations running in parallel. Similarly, we generate between 2 and 4 external mail servers. We also assign each network zone a random class and randomly choose IP addresses from these zones for each host. Finally, we configure the domain names of all network zones as random names using the Faker library (https://github.com/joke2k/faker). Table 2 provides a summary of all variations of the technical infrastructure.

Parameter Range
Number of user hosts 9-27
Number of mail servers 2-4
Network zone classes [ a, b, c ]
Host IPs Random IP within respective zones
Network and zone names Random names
Table 2: Variations of the system environment

3.2.2 User simulation

Real networks in small- or medium-sized organizations are actively used by humans who carry out their daily routines in their workplace. The simulation of normal behavior is therefore an essential aspect of synthetic dataset generation for IDS evaluation. Simulated normal system behavior that is not sufficiently complex may result in non-representative datasets that yield unrealistically low false positive rates during IDS evaluation, as human interactions with machines are often erratic and may lead to unexpected system states that are incorrectly detected as malicious. We therefore decided to create state machines for all services in our testbed that are normally accessed by real users. For this purpose, we make use of web automation software (Selenium, https://www.selenium.dev/) that allows scripts to navigate websites and click on specific links.

Figure 4 visualizes the state machine for a user accessing the cloud share platform. Note that states describe the current view of the users and that activities such as clicking buttons are carried out when traversing from one state to another. As visible in the figure, the user first logs into the OwnCloud platform (possibly with incorrect credentials, in which case login is retried) and then enters pages showing either all their files, files marked as favorites, files shared with other users, or files other users shared with them. Depending on their selection, the users are then able to view files, upload and share new files, change or remove existing shares, accept or decline invitations to share files, and manage their favorites. Furthermore, there is the possibility that a user leaves the cloud sharing application and switches to another website, or enters the idle state in which case no action is carried out for a certain amount of time. We argue that the total number of possible transitions and interweaving of states visible in Fig. 4 is sufficiently complex to represent real user interaction. Section 4.2 will compare log data generated by simulated and real users to verify this claim.

Figure 4: User state machine for simulating normal behavior on the cloud share platform.

We do not provide figures for all state machines for brevity, but briefly discuss their main features. (i) The web mail state machine allows users to view, compose, and respond to mails from other users, attach files to mails, change their preferences, and manage their calendar entries, contacts, notes, and tasks. In addition, privileged users may access the administrator panel to view and change settings of the platform. (ii) The WordPress state machine allows users to read existing posts on the WordPress instance, publish new posts, comment on existing posts, and view available media. (iii) The Internet state machine allows users to browse the Internet by randomly clicking on links on one of the websites from a predefined list. (iv) The SSH state machine allows users to connect to a host in the network via SSH and execute commands from a predefined list. All state machines are connected with each other, i.e., users are able to switch between them, which further increases the complexity of the simulation.

Parameter Range
User name Random name
Password Random string
Wordpress role [ editor, admin, none ]
SSH admin [ yes, no ]
Samba role [ employee, mgmt., acc., admin, none ]
OwnCloud role [ employee, mgmt., acc., admin, none ]
Working hours (5:00-9:00) - (17:00-22:00)
User mail provider Random selection from all mail servers
User mail contacts Random selection from all users
State transition probabilities 0.0-1.0
Web browser [ firefox, chromium ]
Idle times Tiny: 0.4-2.5 seconds Small: 3-60 seconds Medium: 40-360 seconds Large: 400-3600 seconds
Table 3: Variations of simulated user behavior

Whether a user accesses specific states within the state machines depends on their roles, which are subject to variation. In particular, we define an SSH administrator role and furthermore differentiate between editor and administrator roles on the WordPress page and employee, management, accounting, and administrator roles on the Samba and OwnCloud platforms. When no role is assigned to a user, the respective state machine is not entered at all. The names of all users are randomly generated from databases and their passwords are random strings. We also vary their working hours, assign their preferred web browser, generate their mail addresses from one of the external mail servers, and select random samples for their usual contacts and available files. To ensure that all files involved in the simulation appear realistic and do not merely contain randomized contents, we make use of a collection of predefined dummy files with non-sensitive contents. Table 3 provides an overview of the varied parameters and their parameter spaces. Note that we use idle times to temporarily pause the state machines not only in idle states that are specifically created for this task, but also when entering or leaving certain states. This makes it possible to simulate delays between single clicks (tiny), pauses for reading and reacting to website contents (small and medium), or longer breaks of inactivity (large). The table leaves out several minor parameters, such as limits for maximum daily accesses or factors that make repeated executions of the same activities less likely, for which we refer to our open-source implementation.
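The following sketch shows, in strongly simplified form, how a probabilistic state machine with weighted transitions and idle times of different magnitudes can drive such a user simulation; the state names, weights, and the sped-up idle times are illustrative only, and the deployed state machines additionally drive a real browser via Selenium.

```python
import random
import time

# Illustrative transition table: state -> list of (next_state, weight).
# In the deployed TSMs, weights are sampled per testbed from predefined distributions.
TRANSITIONS = {
    "login":      [("all_files", 0.7), ("favorites", 0.2), ("login", 0.1)],   # failed login retries
    "all_files":  [("view_file", 0.4), ("share_file", 0.2), ("idle", 0.3), ("logout", 0.1)],
    "favorites":  [("view_file", 0.6), ("idle", 0.3), ("logout", 0.1)],
    "view_file":  [("all_files", 0.7), ("idle", 0.3)],
    "share_file": [("all_files", 0.8), ("idle", 0.2)],
    "idle":       [("all_files", 0.6), ("logout", 0.4)],
}

IDLE_RANGES = {"tiny": (0.4, 2.5), "small": (3, 60), "medium": (40, 360), "large": (400, 3600)}

def pause(size: str, speedup: float = 1000.0):
    """Sleep for a randomly chosen duration of the given magnitude (sped up for the demo)."""
    low, high = IDLE_RANGES[size]
    time.sleep(random.uniform(low, high) / speedup)

def simulate_session(max_steps: int = 20):
    state = "login"
    for _ in range(max_steps):
        pause("tiny")                       # delay between single clicks
        states, weights = zip(*TRANSITIONS.get(state, [("logout", 1.0)]))
        state = random.choices(states, weights=weights)[0]
        print("entering state:", state)
        if state == "idle":
            pause("small")                  # reading / reacting to page contents
        if state == "logout":
            break

simulate_session()
```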

3.2.3 Attack scenario

While the simulation of normal user activity is necessary to ensure the authenticity of the underlying conditions, injected attacks are required to provide the artifacts to be detected or classified by IDSs. Accordingly, it is essential to design relevant attack cases that fit the overall use-case and are suitable to generate the desired consequences in the dataset. For our use-case, we decided to model a multi-step attack that involves several stages of a typical cyber kill chain [7] and makes use of common penetration testing tools [29]. Figure 3 shows the connections and affected hosts of this attack scenario. In particular, steps (1)-(3) show how the attacker first accesses the intranet over VPN to gather information and eventually take over the intranet server, and steps (a)-(c) indicate how data is exfiltrated from the file share in the intranet zone over a public DNS server to the attacker. In the following, we explain all attack steps in detail.

As part of our attack scenario, we assume that the attacker illegitimately obtained VPN credentials that allow them to access the network. In real-world attack cases, obtaining such credentials could be achieved through phishing attacks or by compromising a personal computer of an employee. Note that we do not simulate this part of the multi-step attack, since it occurs outside of the enterprise’s network and thus does not leave any traces in the logs.

Once the attack execution starts, the attacker makes use of the VPN credentials to remotely establish a connection to the network over the VPN server. The first step of the attack chain then consists of several scans of the network. In particular, the attacker employs the well-known tool Nmap (https://nmap.org/) to carry out DNS and port scans in the DMZ network where the VPN server is located. This allows the attacker to discover the CIDR of the intranet network and thus extend their scans to the hosts located in the intranet zone. Eventually, a web service scan shows a WordPress instance running on the intranet server, which leads to the attacker selecting this server as a possible target for intrusion. The attacker thus launches a brute-force directory scan using the tool dirb (https://tools.kali.org/web-applications/dirb) in order to find potentially interesting files. Since this scan yields no results that allow the attacker to progress any further, they carry out a WordPress security scan using the tool WPScan (https://wpscan.com/wordpress-security-scanner) in order to discover vulnerable versions or misconfigurations of plugins or themes installed on the server. Unlike the directory scan, this security scan reveals that a vulnerable version of the plugin wpDiscuz is present on the server. At this point, the attacker stops scanning and instead focuses on exploiting the vulnerability, which marks the end of the reconnaissance phase.
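To illustrate how such reconnaissance steps can be scripted with varied parameters, the following sketch assembles command lines for the scanning tools from ranges similar to those in Table 5; the target addresses are placeholders, the exact flags should be checked against the respective tool documentation, and the actual attacker scripts in our testbed are structured differently.

```python
import random
import subprocess

TARGET = "http://intranet.fox.org"   # illustrative target
DMZ_NET = "10.35.0.0/24"             # illustrative DMZ range

def build_recon_commands(rng: random.Random):
    """Assemble scanning commands with parameters drawn from ranges as in Table 5.
    Flag names should be verified against the respective tool documentation."""
    top_ports = rng.randint(100, 2000)
    wp_mode = rng.choice(["passive", "mixed"])
    return [
        ["nmap", "--top-ports", str(top_ports), "-sV", DMZ_NET],
        ["dirb", TARGET] + (["-r"] if rng.random() < 0.5 else []),   # optionally non-recursive
        ["wpscan", "--url", TARGET, "--enumerate", "p,u", "--plugins-detection", wp_mode],
    ]

def run_recon(seed: int, dry_run: bool = True):
    rng = random.Random(seed)
    for cmd in build_recon_commands(rng):
        print("would run:" if dry_run else "running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)   # execute the scan and continue on errors

run_recon(seed=42)   # dry run: only prints the varied command lines
```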

Kill chain phases Attack steps Tools MITRE ATT&CK Tactics and Techniques Data sources
Reconnaissance Traceroute Network scan DNS scan Service scan Nmap Reconnaissance - Active Scanning - Gather Victim Network Information DNS logs Network traffic
Reconnaissance WordPress scan Directory scan WPScan Dirb Reconnaissance - Active Scanning - Gather Victim Host Information Access logs Error logs Network traffic
Initial Intrusion Establish a Backdoor Webshell upload Webshell command execution Shell Execution - Exploitation for Client Execution Persistence - Server Software Component Discovery Access logs
Obtain User Credentials Wordpress database dump Shell Credential Access - OS Credential Dumping Access logs
Obtain User Credentials Install Various Utilities Password cracking John the Ripper Credential Access - Brute Force: Password Cracking Monitoring logs
Privilege Escalation Login as system user Shell Privilege Escalation - Valid Accounts Auth logs Audit logs
Lateral Movement Reverse shell setup Root command execution Shell Execution - Command and Scripting Interpreter Auth logs Audit logs
Data Exfiltration Exfiltration over DNS DNSteal Exfiltration - Exfiltration Over Alternative Protocol DNS logs Audit logs
Table 4: Overview of the attack scenario
Attack Parameter Range
General Start times 00:00 - 24:00
Attacker name Random name
Network scans Ports 100-2000 top ports
Hosts Random selection of servers
Wordpress scan Scan mode [ passive, mixed ]
Enumeration Random selection of plugins, themes, configs., database exports, users, and media
Directory scan Recursive [ yes, no ]
Case-sensitive [ yes, no ]
Webshell Shell name Random string
Commands Random commands
Password hash Mode [ online, offline ]
cracking Duration 30-90 minutes
Reverse shell Port 1100-65000
Commands Random commands
Exfiltration DNS domain Random string
Forced IP [ yes, no ]
Compression [ yes, no ]
Verbosity [ yes, no ]
Block size 32-63
Sub domains integer of (200 / block size)
Table 5: Variations of the attack scenario

By exploiting the vulnerable plugin, the attacker is able to perform unrestricted file uploads (CVE-2020-24186). This allows the attacker to upload a PHP webshell as a backdoor that in turn allows them to execute arbitrary commands with the privileges of the www-data user of the web server. The attacker proceeds to execute several commands to gather information about the host, e.g., reading out processes, command histories, OS information, connections, or file names. Eventually the attacker finds the password to the user database in the WordPress configuration file and is thus able to access all user names and their hashed passwords.

The attacker then attempts to crack one of the hashed passwords using a list of common passwords. For this, our attacker state machine branches into two paths. In one path, we assume that the attacker transfers the password hashes to their own system and manages to crack one of the passwords there. Since this activity takes place outside of the monitored network, no logs are created and thus detection is not possible. Accordingly, we simulate this case by simply pausing the state machine for a specific amount of time. The other path simulates that cracking takes place on the compromised server. For this, the attacker installs the tool John the Ripper (https://www.openwall.com/john/) and uses a common password list for cracking. Since the purpose of our datasets is to provide detectable traces of anomalous behavior, we opt for the latter case when running our simulations. Note that as part of our attack scenario, we assume that the password of at least one system user is always present in the password list and is thus successfully cracked after a certain amount of time. After obtaining the password, the attacker uploads a fully interactive reverse shell and misuses the compromised user account to escalate their privileges to root level. The attacker then executes several commands, some of which require root privileges, such as reading out the shadow file.

As the final step of the attack kill chain, the attacker runs the DNSteal tool (https://github.com/m57/dnsteal) to exfiltrate sensitive data from the file share located in the intranet zone. The tool starts a process that encodes files from certain directories in base64 to conform to the requirements of DNS queries, splits them into chunks, and sends them as DNS requests through the firewall to a specific attacker-controlled domain in the global DNS. Eventually the data is transferred from the malicious domain to the attacker's host, where it is decoded and stored. While we could have modeled the attack chain so that the attacker sets up this exfiltration tool once they gained system privileges, we decided to separate this step from the remaining attack vectors and instead start the exfiltration tool already at the beginning of the simulation. The reason for this is that we designed the exfiltration attack as a challenge for anomaly-based IDSs that usually rely on a training phase that is free of attacks. By running the tool from the beginning of the simulation, we purposefully poison the training phase so that the malicious DNS communication is learned as part of the normal system behavior. However, the attack may still be detected by anomaly-based IDSs, since the exfiltration stops after a few days when all files have been extracted. This is especially challenging, since it is usually more difficult for an IDS to recognize that a service suddenly stopped than to detect a newly started service.
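To make the exfiltration mechanism more concrete, the following sketch shows how a file could be encoded and split into DNS query names in the style of DNSteal; the block size corresponds to the 32-63 range in Table 5, but the exact framing and encoding used by the real tool differ.

```python
import base64

def encode_for_dns(data: bytes, domain: str, block_size: int = 48, labels_per_query: int = 4):
    """Encode data as hostname-safe base64 and pack it into DNS query names
    below the attacker-controlled domain. Each label carries at most block_size
    characters (DNS labels allow up to 63 bytes)."""
    encoded = base64.urlsafe_b64encode(data).decode().rstrip("=")
    chunks = [encoded[i:i + block_size] for i in range(0, len(encoded), block_size)]
    queries = []
    for i in range(0, len(chunks), labels_per_query):
        labels = chunks[i:i + labels_per_query]
        queries.append(".".join(labels) + "." + domain)
    return queries

# Example: the resulting names would be resolved against the attacker-controlled domain,
# which reassembles and decodes the payload on the receiving side.
for q in encode_for_dns(b"quarterly-report: confidential numbers", "files.attacker.example"):
    print(q)
```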

Table 4 summarizes the attack scenario. The first column maps each of the attack steps stated in the second column to phases of the cyber kill chain [7]. As stated before, the Data Exfiltration step does not chronologically follow the other attack steps. The third column lists related tactics and techniques from the well-known MITRE ATT&CK matrix version 10 [26] for each attack step. The matrix classifies and describes a wide range of common attack techniques and also provides information on detection. As visible in the table, our multi-step attack involves a diverse set of attack techniques that are part of several tactics. Finally, the last column states the most relevant log files that contain attack traces for each attack step. Since many different log files are affected, it is necessary to configure IDSs to monitor several hosts of the network in order to obtain a full picture of the multi-step intrusion.

Similar to the infrastructure and user behavior, we vary the attack parameters as part of the transformation from TIM to TSM. Table 5 provides an overview of the main variations used to generate the dataset. Note that while the time of day at which attack execution is initiated is varied, we manually set the day for each simulation run in advance. The reason for this is to avoid launching the attack too early, which would leave the dataset without a sufficiently long training phase of at least 3 days. To select and implement variations of the parameters of the utilized attack tools, we looked up the allowed values and ranges for each parameter in the respective documentation. Since tools such as WPScan and DNSteal have multiple parameters that support ranges of allowed values, many possible combinations of values exist and thus the attack traces resulting in the logs differ considerably. To realize random command executions, we assembled a list of common commands and randomly sampled from it. We also injected the user password to be cracked at specific positions of the password file used by John the Ripper so that the duration to complete cracking varies in each run.
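As an example of the last point, the following sketch illustrates how the cracking duration can be varied by inserting the target password at a different position of the wordlist in each run; the file paths and the password are placeholders.

```python
import random

def inject_password(wordlist_path: str, out_path: str, password: str, seed: int) -> int:
    """Copy a common-password wordlist and insert the user's real password at a random
    position, so that John the Ripper needs a different amount of time in each run."""
    with open(wordlist_path, encoding="utf-8", errors="ignore") as f:
        words = f.read().splitlines()
    position = random.Random(seed).randint(0, len(words))
    words.insert(position, password)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
    return position

# Example (placeholder paths): the returned position determines how long cracking takes.
# pos = inject_password("rockyou.txt", "wordlist_run1.txt", "s3cretPassw0rd", seed=7)
```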

4 Analysis of Log Datasets

The previous section outlined our methodology and scenario for generating testbeds using a model-driven approach. Following this methodology, we generated eight testbeds and collected log data from them. This section provides some insights into these datasets by analyzing and comparing the logs.

4.1 Testbed Infrastructures

Over the course of around four weeks we instantiated a total of eight testbeds that we used to collect log datasets. The duration of the simulation for each dataset is between 4 and 6 days, where the exfiltration attack that is already running at the beginning of the simulation usually stops after 1-3 days and the multi-step server takeover attack takes place on one of the last two days.

Table 6 provides an overview of the technical infrastructure used to generate each of the datasets. Note that we refer to each dataset by the randomly selected name of the overall testbed network that contains all zones. As visible in the table, the randomly selected numbers of mail servers and user host machines present in the testbeds correspond to the parameter variations stated in Sect. 3.2.1. We point out that the size of the datasets mostly depends on the number of active users and the length of the simulation.

Dataset Network Mail servers Internal employees Remote employees External users Start End Duration
fox fox.org 4 5 4 7 2022-01-15 00:00 2022-01-20 00:00 5 days
harrison harrison.com 2 3 6 6 2022-02-04 00:00 2022-02-09 00:00 5 days
russellmitchell russellmitchell.com 2 4 3 3 2022-01-21 00:00 2022-01-25 00:00 4 days
santos santos.com 2 9 3 6 2022-01-14 00:00 2022-01-18 00:00 4 days
shaw shaw.info 3 5 5 3 2022-01-25 00:00 2022-01-31 00:00 6 days
wardbeck wardbeck.info 3 6 7 4 2022-01-19 00:00 2022-01-24 00:00 5 days
wheeler wheeler.biz 4 8 6 8 2022-01-26 00:00 2022-01-31 00:00 5 days
wilson wilson.com 2 7 8 9 2022-02-03 00:00 2022-02-09 00:00 6 days
Table 6: Technical infrastructure of testbeds

Table 7 shows which log files are collected from which hosts, where ✓ indicates that the respective log file is collected from the host, (✓) indicates that the respective log file is collected and labels also exist for that file, and no symbol indicates that the respective files are not collected or not present on the hosts. The table also shows that we collect network traffic as well as system logs from diverse sources, for example, access logs, low-level logs of the operating system (audit logs), application logs (Horde and VPN logs), monitoring logs, custom logs for state machine executions, etc. Note that files not marked as labeled do not necessarily lack a ground truth, since several files are not affected by any of the attacks and thus all occurring events correspond to normal behavior. We therefore only mark files as labeled in cases where attack traces are known to occur in these files and labeling rules for the respective attack manifestations exist.

As visible in the table, we mainly focused on log files from the intranet server when developing our labeling rules. The reason for this is that the majority of attack steps are launched against that server, and the diversity of these attack vectors causes several different files to be affected. In Sect. 4.4 we provide a more detailed overview of the assigned labels.

The collected log sources comprise state machine logs, network traffic, Apache access and error logs, authentication logs, journal logs, DNS logs, VPN logs, syslog, audit logs, Suricata event, fast, and stats logs, kernel logs, Exim logs, Horde access and error logs, mail (info) logs, mail warning logs, messages, user logs, and monitoring logs. They are gathered from the attacker host, the employee hosts, the intranet server, the file share, the internal and external mail servers, the firewall, the DNS server, the VPN server, the web server, and the cloud share, depending on which services run on the respective machine. Labels are available for five log files collected on the intranet server and for one log file each on the file share, the firewall, and the VPN server.
Table 7: Log files collected from hosts

4.2 Normal Behavior

As pointed out in Sect. 2.1, it is essential for synthetic log data generation to simulate normal user behavior that corresponds to real humans interacting with the system in terms of click frequency as well as complexity and diversity of actions. However, we noticed in our literature review (cf. Sect. 2.2) that comparisons of the presented datasets with real user behavior are rarely carried out. We therefore validate our log datasets by comparing them with real-world log data generated by humans performing tasks in a similar network environment. The real log data was collected during a cyber security exercise (Austrian Press Agency, https://www.ots.at/presseaussendung/OTS_20210922_OTS0036) that took place in September 2021. As part of the exercise, eight teams of four people each were tasked with investigating traces of existing malware that infected their networks, monitoring their systems for incoming cyber attacks, and responding to incidents by contacting authorities. During this one-day exercise, several attacks were scheduled for automatic execution at specific points in time, keeping the participants busy at all times. The teams worked in isolation from each other and could not access the technical infrastructure of other teams.

Most of the provisioning scripts used to set up the system environment for each team were reused as TIMs for setting up our testbed as outlined in Sect. 3.2.1. This allows us to compare the contents of the log files generated in the environments used by real humans with those of our dataset. We select the DNS logs as a basis for comparison, since they contain queries on a level of abstraction that allows us to determine whether users accessed the cloud server, mail server, file share, etc. Figure 5 visualizes the events produced by the real users (left) and the simulated users (right). Note that we only use logs from the first day of each dataset, since only one day of logs from real users is available.

The plots show that there are some discrepancies between real and simulated users; however, these are mostly the result of conscious design decisions. First, it is apparent that logs generated by simulated users are more spread out across the day, with logs occurring between 5:00 and 22:00, while real users only produced logs between 7:00 and 17:00. This is caused by the fact that the cyber security exercise had a fixed start and end time and participants were not able to carry out their tasks at any time they desired. Accordingly, we argue that the user behavior in our datasets, which simulates employees rather than participants of an exercise, adequately represents the active times of employees with flexible working hours. Similarly, the real logs show that users hardly ever accessed the file share, which is mostly due to the fact that none of their tasks involved sharing files with each other. Overall, the relative frequencies of accesses per service from real users largely resemble those of simulated users, with mail servers being the most actively accessed services. Considering the absolute event frequencies, the simulation appears to correctly depict the access frequencies of real users in terms of average accesses per person and hour as well as their fluctuations across the day. In particular, we computed that real users generate 306.2 DNS events per day across all services on average with a standard deviation of 62.2, while simulated users generate 307.2 DNS events per day across all services on average with a standard deviation of 56.7.

Figure 5: Event counts in DNS logs for different services: (a) access frequencies of real users; (b) access frequencies of simulated users.
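To illustrate how such a comparison can be reproduced, the following Python sketch counts DNS query events per service and hour of day. It assumes dnsmasq-style query lines and uses hypothetical service keywords, so both the pattern and the keyword map would need to be adapted to the actual log format and host names in the dataset.

import re
from collections import Counter

# Hypothetical mapping from domain keywords to services; the actual host
# names in the dataset differ.
SERVICES = {"mail": "mail server", "cloud": "cloud server",
            "intranet": "intranet server", "share": "file share"}

# Matches dnsmasq-style query lines such as
# "Jan 18 09:12:01 dns dnsmasq[873]: query[A] mail.example.com from 10.35.35.2"
QUERY = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}.*query\[\w+\]\s+(\S+)\s+from")

def dns_accesses_per_service_and_hour(path):
    """Count DNS query events per (service, hour of day)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            match = QUERY.match(line)
            if not match:
                continue
            hour, domain = int(match.group(1)), match.group(2)
            for keyword, service in SERVICES.items():
                if keyword in domain:
                    counts[(service, hour)] += 1
    return counts

Summing these counts per day and user would then yield the averages and standard deviations reported above.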

4.3 Attacks

Manifestations of attack executions in log data and labels thereof are crucial for log datasets. As discussed in Sect. 3.2.3, we designed our attack scenario to involve a wide variety of attack types that affect several different files. In the following, we show examples of how some of these attack steps manifest themselves in the generated datasets.

One of the most recognizable attack steps is the directory scan carried out as part of the reconnaissance phase. This attack issues several thousand requests to the targeted web server in a short amount of time, all of which are recorded in the Apache access logs. Since this log file usually contains events that relate to users requesting resources by clicking around on web pages, the scan causes a drastic increase over the average load during normal system operation. Figure 6 shows the number of events per hour in the Apache access logs on the cloud, intranet, and mail servers of the santos dataset. As visible in the plot, the accesses on the intranet server during the directory scan (the relevant time interval is shaded red) increase from several hundred to more than 5000.

Figure 6: Apache access logs with attack consequences of scans.
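A simple way to reproduce such a per-hour view is sketched below. The parsing assumes the Apache common/combined log format, and the threshold factor is an arbitrary illustrative choice rather than part of our methodology.

import statistics
from collections import Counter
from datetime import datetime

def requests_per_hour(access_log_path):
    """Count Apache access log events per hour, assuming lines in the
    common/combined log format, e.g.
    10.35.35.2 - - [18/Jan/2022:13:14:31 +0000] "GET /index.php HTTP/1.1" 200 1234
    """
    counts = Counter()
    with open(access_log_path) as f:
        for line in f:
            try:
                stamp = line.split("[", 1)[1].split("]", 1)[0]  # 18/Jan/2022:13:14:31 +0000
                dt = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
            except (IndexError, ValueError):
                continue  # skip lines that do not match the expected format
            counts[dt.replace(minute=0, second=0)] += 1
    return counts

def suspicious_hours(counts, factor=10):
    """Flag hours whose request count exceeds a multiple of the median hourly load."""
    baseline = statistics.median(counts.values())
    return [hour for hour, n in counts.items() if n > factor * baseline]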

Monitoring logs contain numeric values of system measurements that are an interesting input for anomaly detection [16]. This includes measurements of the utilization of CPU, memory, disk, file system, network communication, processes, etc. For our datasets, we collect such monitoring logs from the file share and the intranet server, which are both located in the intranet zone and are thus reasonable targets for monitoring in real-world scenarios. Figure 7 shows several metrics derived from CPU and memory utilization collected in the santos dataset. As visible in the top plot, both system and total CPU are significantly increased as a consequence of the password cracking attack step (the relevant time interval is shaded red). The memory metrics do not show such a strong indication of an ongoing attack, even though a large file containing passwords is loaded into memory during cracking. Nonetheless, these and other metrics or combinations thereof could also contribute to the detection of certain attack steps.

Figure 7: Monitoring logs of CPU (top) and memory (bottom) showing attack consequences of password cracking.
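As an illustration of how such metrics could feed into detection, the following sketch flags monitoring samples whose CPU utilization deviates strongly from a rolling baseline. The metric name, window size, and threshold are assumptions chosen for the example, not parameters of our datasets.

import statistics

def flag_cpu_anomalies(samples, window=60, threshold=3.0):
    """Flag samples whose total CPU utilization deviates strongly from the
    preceding baseline window (simple rolling z-score).
    `samples` is a list of (timestamp, cpu_total_percent) tuples, e.g.
    parsed from the monitoring logs."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = [value for _, value in samples[i - window:i]]
        mean = statistics.mean(baseline)
        std = statistics.pstdev(baseline) or 1e-9  # avoid division by zero
        ts, value = samples[i]
        if (value - mean) / std > threshold:
            flagged.append(ts)
    return flagged

During the shaded interval in Fig. 7, such a detector would flag the CPU samples affected by password cracking, while the memory metrics would likely remain below the threshold.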

Variations of the system environment, normal behavior simulation, and attack parameters cause the aforementioned attack consequences to differ across datasets. For example, peaks in event frequencies have different magnitudes relative to the baseline of event occurrences that is considered normal for that dataset, and the time intervals where system metrics are affected change in length. In addition, event sequences that are generated as a consequence of commands executed by the attacker take different forms or parameters. Consider the log events shown in Fig. 8 as an example. In the fox dataset (top), seven events are generated when the attacker logs into the compromised user account phopkins. The same attack step appears differently in the harrison dataset: the affected user changes to jward, terminal /dev/pts/0 is used rather than /dev/pts/1, and different commands are executed. We argue that these variations are useful to achieve higher robustness of results when evaluating IDSs, since detection accuracy should be similar across all datasets even though the events to be detected vary. In Sect. 5.2, we discuss the benefits of our datasets in more detail.

Jan 18 13:14:31 intranet-server su[28816]: Successful su for phopkins by www-data
Jan 18 13:14:31 intranet-server su[28816]: + /dev/pts/1 www-data:phopkins
Jan 18 13:14:31 intranet-server su[28816]: pam_unix(su:session): session opened for user phopkins by (uid=33)
Jan 18 13:14:31 intranet-server systemd-logind[1011]: New session c1 of user phopkins.
Jan 18 13:14:31 intranet-server systemd: pam_unix(systemd-user:session): session opened for user phopkins by (uid=0)
Jan 18 13:14:41 intranet-server sudo: phopkins : TTY=pts/1 ; USER=root ; COMMAND=list
Jan 18 13:14:43 intranet-server sudo: phopkins : TTY=pts/1 ; USER=root ; COMMAND=/bin/ls -laR /root/

Feb 8 08:36:38 intranet-server su[28321]: Successful su for jward by www-data
Feb 8 08:36:38 intranet-server su[28321]: + /dev/pts/0 www-data:jward
Feb 8 08:36:38 intranet-server su[28321]: pam_unix(su:session): session opened for user jward by (uid=33)
Feb 8 08:36:38 intranet-server systemd-logind[935]: New session c1 of user jward.
Feb 8 08:36:38 intranet-server systemd: pam_unix(systemd-user:session): session opened for user jward by (uid=0)
Feb 8 08:36:54 intranet-server sudo: jward : TTY=pts/0 ; USER=root ; COMMAND=list
Feb 8 08:36:57 intranet-server sudo: jward : TTY=pts/0 ; USER=root ; COMMAND=/bin/cat /etc/shadow

Figure 8: Different log events caused by the attacker escalating to system privileges in the fox (top) and harrison (bottom) datasets.
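For readers who want to process such traces automatically, the sketch below extracts the compromised account, terminal, and executed commands from su/sudo messages like those in Fig. 8. The regular expressions cover only these message types and would need to be extended for other audit events.

import re

SU_PATTERN = re.compile(r"su\[\d+\]: Successful su for (?P<user>\S+) by (?P<by>\S+)")
SUDO_PATTERN = re.compile(
    r"sudo:\s+(?P<user>\S+)\s+: TTY=(?P<tty>\S+)\s+; USER=(?P<target>\S+)\s+; COMMAND=(?P<cmd>.*)")

def extract_attacker_actions(lines):
    """Collect the account obtained via su and all sudo commands with their
    TTY from auth log lines such as those shown in Fig. 8."""
    account, actions = None, []
    for line in lines:
        m = SU_PATTERN.search(line)
        if m:
            account = m.group("user")
        m = SUDO_PATTERN.search(line)
        if m:
            actions.append((m.group("user"), m.group("tty"), m.group("cmd").strip()))
    return account, actions

Applied to the fox excerpt, this yields the account phopkins together with the sudo commands on pts/1; on the harrison excerpt, it yields jward and the commands on pts/0, illustrating the parameter variations discussed above.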

4.4 Labels

As explained in Sect. 3.1, our labeling procedure does not just use attack time windows to mark events as malicious based on their timestamps, but instead involves query rules that enable labeling based on event attributes. We created such rules for eight files as outlined in Table 7 and assign distinct labels to malicious events depending on their attack step. Note that we specifically selected files and attack steps that involve distinct manifestations of attack consequences after manually checking all files; however, we also point out that there are traces of attack steps in other files that are not labeled in AIT-LDSv2.0. Since our collection of log datasets is maintainable and the labeling procedure is repeatable, labeling rules for these files can be added in future versions of the dataset.

As an example, Fig. 9 gives an overview of the labeled events related to the multi-step attack in the santos dataset. The figure visualizes the chronological occurrence of labeled events, where the distinct labels are depicted on the vertical axis and affected files are marked with different symbols. As visible in the plot, some attack steps cause singular events or short sequences (e.g., uploading the webshell), while others affect groups of events that span a longer duration (e.g., password cracking). Note that, for clarity, we assign multiple labels to the same events. For example, we introduce the label foothold that subsumes all attack steps involved in the initial intrusion, including the VPN connection, scans, and webshell upload. This implies that our labels follow a hierarchical order, which makes it easy to select specific types of events for evaluation and furthermore allows computing detection accuracies separately for different attack steps [18].

Figure 9: Occurrences of events labeled as part of the multi-step attack.
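To illustrate the principle of attribute-based, hierarchical labeling, the following sketch applies a hypothetical rule to parsed events. The rule format and field names are invented for this example and do not correspond to the actual syntax of our labeling framework [18].

# Hypothetical rule: events matching the webshell upload also receive the
# broader 'foothold' label, reflecting the hierarchical label structure.
RULES = [
    {
        "file": "apache2/access.log",
        "match": lambda event: "POST" in event.get("request", "")
                 and "webshell" in event.get("request", ""),
        "labels": ["upload_webshell", "foothold"],
    },
]

def label_events(events, rules=RULES):
    """Attach all matching labels to each parsed log event (a dict of attributes)."""
    labeled = []
    for event in events:
        labels = [label for rule in rules if rule["match"](event) for label in rule["labels"]]
        labeled.append({**event, "labels": labels})
    return labeled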

5 Discussion

In this section we discuss whether our generated datasets fulfill the requirements for IDS evaluation. In addition, we explain possible application scenarios for our datasets in detail and outline their limitations.

5.1 Fulfillment of Requirements

We stated requirements for log dataset generation in Sect. 2.1 and used them as a basis for our methodology. Based on the generated datasets and the results of our analysis provided in Sect. 4, we check whether all requirements are fulfilled. Requirement (1) is fulfilled as our datasets address enterprise IT, a widespread and relevant use-case for intrusion detection. We followed common guidelines for network design and selected open-source components that are popular choices in such infrastructures [25]. Requirement (2) addresses the simulation of normal system behavior. We argue that the state machines and randomized user role assignments used for simulating employees as outlined in Sect. 3.2.2 are sufficiently extensive to generate complex patterns. Moreover, we show in Sect. 4.2 that page visit frequencies of our simulated employees largely resemble those of real users. Similarly, our attack scenario involves diverse steps and recent exploits to fulfill requirement (3). We collect both system log data as demanded by requirement (4) and network traffic as demanded by requirement (5). Our user simulations follow daily activity cycles as visible in the analysis results presented in Sect. 4.3. Since multiple days of such normal behavior are recorded, we consider requirement (6) on periodic patterns fulfilled. Requirement (7) is fulfilled as we generate a ground truth for events using our labeling framework as described in Sect. 3.1. Beside the description of the overall scenario available in this paper and in the dataset repository, all scripts and configurations of our testbeds are published together with the log data, so that requirement (8) on the availability of documentation is also fulfilled. Requirements (9) and (10) are fulfilled because we generate multiple datasets that contain repeated executions of the same attack steps with variations. Finally, requirement (11) is fulfilled as we publish all scripts for deploying and running the simulation as open-source code.

5.2 Application Scenarios for AIT-LDSv2.0

Due to the characteristics of our dataset, we foresee several different application scenarios. In the following, we discuss (federated) intrusion detection, alert aggregation, and user profiling as interesting research areas that benefit from our data.

5.2.1 Evaluation of Intrusion Detection Systems

Foremost, the purpose of our collection of datasets is to enable the evaluation of host- and network-based IDSs. We injected attacks that employ diverse techniques so that their consequences manifesting in log files challenge a wide range of detection mechanisms [35]. For example, we anticipate the following non-exhaustive list of detection techniques to be applied to our dataset.

  • New log artifacts. As part of many attack steps, new log events such as the sample logs from Fig. 8 appear in some log files. Alternatively, known event types may appear with different parameters or combinations of parameter values. Despite being relatively simple, this detection technique is highly powerful, because its low runtime requirements allow it to be applied to most events and categorical values; a minimal sketch of such a detector follows this list.

  • Structure of parameter values. The DNSteal attack makes use of randomly generated domain names for data exfiltration, which could be useful to evaluate detectors for domain-generation algorithms [39]. The same applies to the Apache access logs, where commands sent to the webshell appear in URLs.

  • Sequence mining. Log events usually occur in specific sequences that represent the inherent program flows of the monitored services. Workflow mining extracts these patterns and allows detecting unusual sequences as anomalies [6, 11]. Consequences of exploits and other malicious attacker behavior often manifest in such sequences, for example, in audit events generated when the attacker executes commands via the remote shell.

  • Event frequencies. As pointed out in Sect. 4.3, attacks such as scans are recognizable by high numbers of log occurrences in short time intervals. Anomaly detection techniques therefore create event count matrices and detect time windows with unusually high or low event frequencies with the aid of various machine learning methods, including time-series analysis [23] and principal component analysis [11].

  • Missing events. We deliberately designed our attack scenario to include a data exfiltration attack that is already ongoing at the beginning of the simulation and stops after some days. We expect that detectors based on machine learning add these malicious events to the model of normal behavior generated during their training phase and thereby poison their models. Accordingly, detectors need to raise anomalies when the event occurrences stop, which we consider a more challenging detection scenario than recognizing the start of the exfiltration process.

  • Statistical tests. System performance metrics and numeric features of network traffic are suitable for statistical analyses such as testing for certain distributions. Alternatively, hypothesis testing is also applicable for detecting changes of correlating behavior of categorical variables in log data [19].
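As an example of the first technique in the list above, the following sketch learns the set of event types and parameter combinations observed during training and flags anything unseen afterwards. The representation of events as (type, parameter dict) pairs is an assumption made for illustration; it would depend on the log parser used.

class NewArtifactDetector:
    """Flag log events whose type or parameter combination was never seen
    during the training phase (minimal sketch of 'new log artifacts' detection)."""

    def __init__(self):
        self.known = set()

    def train(self, events):
        # events: iterable of (event_type, params_dict) pairs from the training period
        for event_type, params in events:
            self.known.add((event_type, tuple(sorted(params.items()))))

    def detect(self, events):
        # return all events not observed during training
        return [
            (event_type, params)
            for event_type, params in events
            if (event_type, tuple(sorted(params.items()))) not in self.known
        ]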

We argue that our data has a major benefit over most existing datasets for IDS evaluation, as it contains data from multiple separate testbeds targeted by the same attack scenario. Due to the variations in the log traces caused by changes of the system environment, simulated normal behavior, and attack parameters, we expect detection accuracies to vary when the same detectors are applied to different datasets. However, by averaging the detection metrics achieved across all datasets, the aggregated results gain robustness, as they are representative of the general case rather than fine-tuned to a single execution. In addition, simulating many similar infrastructures allows evaluating approaches that leverage federated learning for intrusion detection [30].

Moreover, the ground truth tables of our datasets do not just provide binary labels that state whether an event is part of an attack, but precisely state the type of attack. This means that it is also possible to evaluate attack classification accuracy in case the detectors are capable of determining attack types, e.g., by matching them with a list of known and labeled meta-alerts.

5.2.2 Evaluation of Alert Aggregation Techniques

Intrusion detection techniques as stated in the previous section often raise large numbers of alerts for some attack steps, where the vast majority of these alerts are duplicates and of little value to the operators who monitor IDSs. Alert aggregation therefore attempts to merge these alerts to reduce the workload of operators and ease the identification of urgent alerts that require immediate action. On top of that, advanced aggregation techniques are capable of recognizing patterns of alert occurrences and of connecting attack steps to attack scenarios [22].

In order to merge alerts and attack steps, it is obviously necessary to have datasets at hand that contain repetitions of the same or similar attacks. Unfortunately, such datasets are rare even though they are urgently needed in research [27]. We therefore propose to forensically analyze our datasets with a selection of IDSs to obtain sequences of alerts that are then used for aggregation. As for the evaluation of IDSs, the variations of our attack scenarios come in handy, as they yield different alert patterns for each dataset, e.g., varying numbers of alerts for scans of varying duration, or optional alerts caused by commands that the attacker only carries out with certain probabilities. This allows evaluating whether alerts belonging to the same attack type are indeed aggregated correctly despite the slight variations that occur in real-world environments.
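As a minimal illustration of the aggregation step, the sketch below merges consecutive alerts with the same detector signature that occur within a short time window. Published aggregation approaches [22] use far richer similarity measures, so this is only a conceptual example with an assumed alert representation.

from datetime import timedelta

def aggregate_alerts(alerts, window=timedelta(minutes=5)):
    """Merge consecutive alerts with the same detector signature that occur
    within `window` of each other into one meta-alert.
    `alerts` is a list of (timestamp, signature) tuples sorted by timestamp."""
    meta_alerts = []
    for ts, signature in alerts:
        last = meta_alerts[-1] if meta_alerts else None
        if last and last["signature"] == signature and ts - last["last"] <= window:
            last["last"] = ts
            last["count"] += 1
        else:
            meta_alerts.append({"signature": signature, "first": ts, "last": ts, "count": 1})
    return meta_alerts

Applied to alerts raised for a directory scan, such a procedure would collapse thousands of near-identical alerts into a single meta-alert spanning the scan interval.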

5.2.3 Evaluation of User Profiling Approaches

User profiling is a trending research topic that aims to create a profile for each user and then use these profiles to group users by their behavior or role. For this, algorithms based on pattern mining analyze access logs that detail all page visits of each user [28]. Note that this application scenario is not related to cyber attacks, because only the simulation of normal user behavior is relevant. Since our simulated users have specific roles (e.g., WordPress editor or administrator) and visit pages based on transition probabilities, they clearly follow their own behavior profiles. The main advantage over real data is that it is easy to adjust these profiles to the respective use case and to quantitatively compare their similarities, which is useful for evaluations and cannot be replicated with humans.
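As a minimal illustration, the sketch below builds per-user page-visit frequency profiles from (user, page) pairs and compares them with cosine similarity. The input representation is an assumption; the access logs would first have to be parsed into such pairs.

import math
from collections import Counter, defaultdict

def build_profiles(access_events):
    """Build a page-visit frequency profile per user from (user, page) pairs."""
    profiles = defaultdict(Counter)
    for user, page in access_events:
        profiles[user][page] += 1
    return profiles

def cosine_similarity(p, q):
    """Quantify how similar two profiles are (1.0 = identical visit distribution)."""
    pages = set(p) | set(q)
    dot = sum(p[x] * q[x] for x in pages)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0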

5.3 Limitations

Despite all aforementioned benefits of our log datasets, we recognize some limitations. Most importantly, the user simulation that generates the baseline of normal behavior for our collection of log datasets is limited by the extent of our state machines. Real datasets that contain traces of humans interacting with the monitored environments, on the other hand, may always involve artifacts caused by deliberate or accidental misuse of the systems that could trigger incorrect alerts by IDSs. Despite our efforts to generate complex user behavior, we therefore cannot ensure that false positive rates achieved on our datasets are representative of real-world systems. Nonetheless, we are convinced that our synthetic datasets have significant advantages over real ones, as they can be freely published without the need to anonymize artifacts due to privacy concerns and may be arbitrarily recreated in modified use-cases if necessary.

We also point out that we aimed to generate the log data in the most realistic way possible, meaning that we did not configure the logging frameworks to collect data at the highest level of granularity, but instead used standard or default configurations wherever applicable. In case logging levels need to be adapted, it is always possible to replay the attack scenarios on our open-source testbeds.

When varying the parameters of our testbed during the generation of TSMs from TIMs, we decided to leave the configurations of the logging services unchanged in order to ensure that our labeling rules do not accidentally leave events unlabeled. We leave the task of extending our labeling rules to cover such variations for future work.

6 Conclusion

In this paper we present a collection of eight synthetic log datasets for the evaluation of intrusion detection systems. We collect our datasets from testbeds generated with a model-driven methodology for testbed setup and labeling. This enables repeating the data collection procedure arbitrarily many times while varying several parameters of the simulation with low manual effort. In addition, it is simple to scale the network and extend it with additional components or services. Our datasets are openly accessible and maintainable, as all code required to deploy the testbeds, run the simulations, and assign labels to log events is available as open-source. Our datasets thus solve several problems that are prevalent in existing datasets, including control over the simulation parameters, presence of repeated attack executions in similar environments, generation of ground truth tables, complexity of the network, preprocessing of logs to protect sensitive information, and more.

Our log datasets address the common use-case of an attack on the infrastructure of an enterprise IT network. In particular, the attack scenario involves reconnaissance scans, brute-force password cracking, data exfiltration, as well as the use of various tools and exploits to eventually obtain system access. To generate a realistic baseline of normal behavior, we simulate user activity with extensive state machines that are specifically designed to utilize services such as mail platforms and file shares. We primarily created our dataset to provide diverse attack vectors that challenge many different detection techniques; however, we also foresee applications that go beyond IDS evaluation, in particular alert aggregation and user profiling. We see our dataset as the first in a series and plan to extend the labeling rules to more files and attack steps in upcoming versions. For future work, we plan to extend the user simulations to run on Windows hosts and mobile devices, and to create testbeds for new use-cases such as the Internet-of-Things.

Acknowledgments

This work was partly funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

References

  • [1] Y. Al-Hadhrami and F. K. Hussain (2020) Real time dataset generation framework for intrusion detection systems in IoT. Future Generation Computer Systems 108, pp. 414–423. Cited by: §1, §2.2, §2.2, Table 1.
  • [2] D. Čeponis and N. Goranin (2018) Towards a robust method of dataset generation of malicious activity for anomaly-based HIDS training and presentation of AWSCTD dataset. Baltic Journal of Modern Computing 6 (3), pp. 217–234. Cited by: §1, §2.2, Table 1.
  • [3] V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §1, item 6.
  • [4] G. Creech and J. Hu (2013) Generation of a new ids test dataset: time to retire the KDD collection. In 2013 IEEE Wireless Communications and Networking Conference (WCNC), pp. 4487–4492. Cited by: §1, §2.2, Table 1.
  • [5] G. Creech (2014) Developing a high-accuracy cross platform host-based intrusion detection system capable of reliably detecting zero-day attacks.. Ph.D. Thesis, University of New South Wales, Canberra, Australia. Cited by: §2.2, Table 1.
  • [6] M. Du, F. Li, G. Zheng, and V. Srikumar (2017) Deeplog: anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1285–1298. Cited by: 3rd item.
  • [7] FireEye m-trends 2010: the advanced persistent threat. Note: https://www.fireeye.com/current-threats/annual-threat-report/mtrends/rpt-2010-mtrends.html, accessed 2021-10-18. Cited by: §3.2.3.
  • [8] M. Grimmer, M. M. Röhling, D. Kreusel, and S. Ganz (2019) A modern and sophisticated host based intrusion detection data set. IT-Sicherheit als Voraussetzung für eine erfolgreiche Digitalisierung, pp. 135–145. Cited by: §2.2, Table 1.
  • [9] Cited by: §2.2, Table 1.
  • [10] W. Haider, J. Hu, J. Slay, B. P. Turnbull, and Y. Xie (2017) Generating realistic intrusion detection system dataset based on fuzzy qualitative modeling. Journal of Network and Computer Applications 87, pp. 185–192. Cited by: item 2, §2.2, Table 1.
  • [11] S. He, J. Zhu, P. He, and M. R. Lyu (2016) Experience report: system log analysis for anomaly detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 207–218. Cited by: 3rd item, 4th item.
  • [12] S. He, J. Zhu, P. He, and M. R. Lyu (2020) Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448. Cited by: §2.2, Table 1.
  • [13] (2019-02) Internet Security Threat Report. Technical report Symantec. Cited by: §3.2.1.
  • [14] (2010-12) Information technology - Security techniques - Network security - Part 3: Reference networking scenarios - Threats, design techniques and control issues. Standard International Organization for Standardization. Cited by: §3.2.1.
  • [15] H. A. Kholidy and F. Baiardi (2012) CIDD: a cloud intrusion detection dataset for cloud computing and masquerade attacks. In 2012 Ninth International Conference on Information Technology-New Generations, pp. 397–402. Cited by: §2.2, Table 1.
  • [16] M. T. Khorshed, A. S. Ali, and S. A. Wasimi (2011) Monitoring insiders activities in cloud computing using rule based learning. In IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 757–764. Cited by: item 4, §4.3.
  • [17] A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman (2019) Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity 2 (1), pp. 1–22. Cited by: §1.
  • [18] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber (2022) A framework for automatic labeling of log datasets from model-driven testbeds for HIDS evaluation. In Proceedings of the 2022 ACM Workshop on Secure and Trustworthy Cyber-Physical Systems, Note: Accepted for publication Cited by: §1, §2.2, §3.1, §3.1, §4.4.
  • [19] M. Landauer, G. Höld, M. Wurzenberger, F. Skopik, and A. Rauber (2021) Iterative selection of categorical variables for log data anomaly detection. In European Symposium on Research in Computer Security, pp. 757–777. Cited by: 6th item.
  • [20] Cited by: footnote 1.
  • [21] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber (2020) Have it your way: generating customized log datasets with a model-driven simulation testbed. IEEE Transactions on Reliability 70 (1), pp. 402–415. Cited by: §1, item 1, item 10, item 3, §2.2, Table 1, §3.1, §3.1.
  • [22] M. Landauer, F. Skopik, M. Wurzenberger, and A. Rauber (2022) Dealing with security alert flooding: using machine learning for domain-independent alert aggregation. Transactions on Privacy and Security. Note: Accepted for publication Cited by: item 9, §5.2.2.
  • [23] M. Landauer, M. Wurzenberger, F. Skopik, G. Settanni, and P. Filzmoser (2018) Dynamic log file analysis: an unsupervised cluster evolution approach for anomaly detection. computers & security 79, pp. 94–116. Cited by: 4th item.
  • [24] H. Liao, C. R. Lin, Y. Lin, and K. Tung (2013) Intrusion detection system: a comprehensive review. Journal of Network and Computer Applications 36 (1), pp. 16–24. Cited by: item 5.
  • [25] G. Maciá-Fernández, J. Camacho, R. Magán-Carrión, P. García-Teodoro, and R. Therón (2018) UGR ‘16: a new dataset for the evaluation of cyclostationarity-based network IDSs. Computers & Security 73, pp. 411–424. Cited by: §1, item 4, item 6, item 8, §2.2, §2.2, Table 1, §3.2.1, §5.1.
  • [26] MITRE ATT&CK. Note: https://attack.mitre.org/, accessed 2021-10-18. Cited by: §3.2.3.
  • [27] J. Navarro, A. Deruyver, and P. Parrend (2018) A systematic survey on multi-step attack detection. Computers & Security 76, pp. 214–249. Cited by: item 9, §5.2.2.
  • [28] S. Park, A. Matic, K. Garg, and N. Oliver (2018) When simpler data does not imply less information: a study of user profiling scenarios with constrained view of mobile HTTP(S) traffic. ACM Transactions on the Web (TWEB) 12 (2), pp. 1–23. Cited by: §5.2.3.
  • [29] Penetration testing complete tools list. Note: https://en.kali.tools/all/, accessed 2021-10-18. Cited by: §3.2.3.
  • [30] D. Preuveneers, V. Rimmer, I. Tsingenopoulos, J. Spooren, W. Joosen, and E. Ilie-Zudor (2018) Chained anomaly detection models for federated learning: an intrusion detection case study. Applied Sciences 8 (12), pp. 2663. Cited by: §5.2.1.
  • [31] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho (2017) Flow-based benchmark data sets for intrusion detection. In Proceedings of the 16th European Conference on Cyber Warfare and Security, pp. 361–369. Cited by: §1, item 3, item 7, §2.2, Table 1.
  • [32] P. D. Scott and E. Wilkins (1999) Evaluating data mining procedures: techniques for generating artificial data sets. Information and software technology 41 (9), pp. 579–587. Cited by: §1, §1, item 10, item 2.
  • [33] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani (2018) Toward generating a new intrusion detection dataset and intrusion traffic characterization.. ICISSP 1, pp. 108–116. Cited by: §2.2, Table 1.
  • [34] F. Skopik, G. Settanni, R. Fiedler, and I. Friedberg (2014) Semi-synthetic data set generation for security software evaluation. In 12th Annual International Conference on Privacy, Security and Trust, pp. 156–163. Cited by: §2.2, Table 1.
  • [35] F. Skopik, M. Wurzenberger, and M. Landauer (2021) Smart log data analytics: techniques for advanced security analysis. Springer. Cited by: §5.2.1.
  • [36] Cited by: §2.2, Table 1.
  • [37] C. Thomas, V. Sharma, and N. Balakrishnan (2008) Usefulness of DARPA dataset for intrusion detection system evaluation. In Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2008, Vol. 6973, pp. 69730G. Cited by: §1, item 11, §2.2.
  • [38] R. Uetz, C. Hemminghaus, L. Hackländer, P. Schlipper, and M. Henze (2021) Reproducible and adaptable log data generation for sound cybersecurity experiments. In Annual Computer Security Applications Conference, pp. 690–705. Cited by: §1, §1, item 11, item 4, §2.2, Table 1.
  • [39] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant (2016) Predicting domain generation algorithms with long short-term memory networks. arXiv preprint arXiv:1611.00791. Cited by: 2nd item.
  • [40] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan (2009) Detecting large-scale system problems by mining console logs. In Proceedings of the ACM Symposium on Operating Systems Principles, pp. 117–132. Cited by: Table 1.
  • [41] J. Zhu, P. He, Q. Fu, H. Zhang, M. R. Lyu, and D. Zhang (2015) Learning to log: helping developers make informed logging decisions. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1, pp. 415–425. Cited by: item 4.