Automating Privilege Escalation with Deep Reinforcement Learning

10/04/2021
by Kalle Kujanpää, et al. (Aalto University)

AI-based defensive solutions are necessary to defend networks and information assets against intelligent automated attacks. Gathering enough realistic data for training machine learning-based defenses is a significant practical challenge. An intelligent red teaming agent capable of performing realistic attacks can alleviate this problem. However, there is little scientific evidence demonstrating the feasibility of fully automated attacks using machine learning. In this work, we exemplify the potential threat of malicious actors using deep reinforcement learning to train automated agents. We present an agent that uses a state-of-the-art reinforcement learning algorithm to perform local privilege escalation. Our results show that the autonomous agent can escalate privileges in a Windows 7 environment using a wide variety of different techniques depending on the environment configuration it encounters. Hence, our agent is usable for generating realistic attack sensor data for training and evaluating intrusion detection systems.



1. Introduction

Defending networks and information assets from attack in a constantly evolving threat landscape remains a substantial challenge in our modern and connected world. As better detection and response methods are developed, attackers invariably adapt their tools and techniques to remain competitive. One adaptation is the use of advanced automation to perform attack sequences so quickly that defenders cannot respond in time. An example of this is the NotPetya malware from 2017, which spread very rapidly using credential dumping and lateral movement techniques usually associated with human-on-keyboard attacks (see e.g. (Greenberg, 2018)).

Partial or even full automation of attacks is nothing new. Many exploits and techniques consist of discrete steps, and they can be easily scripted. Frameworks such as Metasploit, OpenVAS, Cobalt Strike, and PowerShell Empire support and automate red teaming activities. However, this type of automation relies on several assumptions about the target environment and often requires a human to configure it for a specific scenario. Moreover, the highly predictable sequences of observable events generated by automated attacks give a further advantage to defenders. Hence, these kinds of attacks can often be detected and responded to efficiently. In contrast, a human expert might be capable of determining the optimal approach for a given target almost immediately and avoid detection, thanks to the experience that they have gathered.

Can the human approach be emulated by intelligent machine learning agents that learn from experience, instead of the user having to enumerate all possibilities and create logic trees to account for all the alternatives? There would be many potential use cases for an intelligent machine learning red teaming agent. Training comprehensive defensive systems using machine learning may require enormous amounts of realistic attack data. An intelligent agent enables the generation of large amounts of attack data on demand, which could be used to train or refine detection models. Moreover, in order to understand how to defend against attacks and perform risk assessment, the cyber security community must understand the potential behavior and threats posed by these machine learning-based agents.

Despite a fear of malicious actors using reinforcement learning for offensive purposes and the significant advances in deep reinforcement learning during the past few years, few studies on red teaming with reinforcement learning have been published. The most likely reason for this is that the problem of learning to perform an attack is extremely hard:

  • a complete attack typically consists of a long sequence of interdependent steps;

  • the action space is practically infinite if the actions are the commands that the agent can execute;

  • even formalizing red teaming activities as machine learning problems can be extremely challenging.

Therefore, existing research has focused on automating smaller sub-tasks such as initial access (Takaesu, 2018) or lateral movement during post-exploitation (Maeda and Mimura, 2021).

DeepExploit (Takaesu, 2018) is a deep reinforcement learning agent that is trained to automate gaining initial access using known vulnerabilities and exploits. It is built on the Metasploit framework (Rapid7, 2021). After a successful penetration, it tries to recursively gain access to other hosts in the local network of the given input IP address. DeepExploit is primarily a framework for penetration testing, and its support for post-exploitation activities is very limited: the agent treats lateral movement as a second initial access task.

The deep RL agent of Maeda and Mimura (Maeda and Mimura, 2021) is a step forward in emulating adversarial behavior in real environments: it is trained to perform lateral movement in Windows domains. The authors train the agent using the modules of PowerShell Empire as the action space. The state of the proposed agent consists of ten entries, such as the number of discovered computers in the network, the number of compromised computers, and whether the agent has local administrative privileges. The authors demonstrate that the reinforcement learning agent can learn to perform lateral movement and obtain domain controller privileges.

In this work, we present one potential use case for an intelligent machine learning red teaming agent: we use deep RL to automate the task of local privilege escalation. Privilege escalation is the typical first step performed by an attacker after gaining initial access, and it is often followed by lateral movement to other hosts in the penetrated network. We consider privilege escalation in Windows 7 environments, which may have an arbitrary number of system components (such as services, DLLs, and tasks). We propose a formalization of the privilege escalation task as a reinforcement learning problem and present a novel architecture of an actor-critic RL agent. We experimentally show that the proposed agent can learn to perform the privilege escalation task.

Although we focus on one sub-task performed by a malicious actor, the learning problem that we consider is significantly harder compared to previous works (Takaesu, 2018; Maeda and Mimura, 2021):

  • Privilege escalation needs a sequence of actions, the selection of which depends on the changing system state. The scenario of (Takaesu, 2018), for example, has one-step solutions without any changes in the system state.

  • Privilege escalation can be accomplished by multiple different strategies.

  • The attacked system can have a varying number of system components (services, DLLs, tasks), and the agent should generalize to any number of those.

  • Our training setup is more realistic and diverse compared to (Maeda and Mimura, 2021). Instead of attacking a system whose variability is implemented by noise in system parameters, we use different system configurations in each training episode.

  • The state of the system in our experiments is described by thousands of variables, which is much larger than the states of the agents in (Maeda and Mimura, 2021; Takaesu, 2018).

  • The actions that we design for our learning environment are more atomic compared to (Takaesu, 2018; Maeda and Mimura, 2021). Most of our actions can be implemented with standard OS commands instead of using modules of existing exploitation frameworks.

Thus, our study takes a step towards solving more complex red teaming tasks with artificial intelligence.

2. Related Work

Applying reinforcement learning in cyber security has been a subject of much recent research (Nguyen and Reddi, 2019). Examples of application areas include, among others, anti-jamming communication systems (Han et al., 2017), spoofing detection in wireless networks (Xiao et al., 2015), phishing detection (Chatterjee and Namin, 2019), autonomous cyber defense in software-defined networking (Han et al., 2018), mobile cloud offloading for malware detection (Wan et al., 2017), botnet detection (Alauthman et al., 2020), security in mobile edge caching (Xiao et al., 2018), and security in autonomous vehicle systems (Ferdowsi et al., 2018). In addition, reinforcement learning has been applied to research of physical security, such as grid security (Ni and Paul, 2019), and to green security games (Wang et al., 2019).

Previously, multi-agent reinforcement learning has been applied to cyber security simulations with competing adversarial and defensive agents (Bland et al., 2020; Elderman et al., 2017; He et al., 2016). It has been shown that both the attacking and the defending reinforcement learning agents can learn to improve their performance. The success of multi-agent reinforcement learning might have wider implications for information security research even though these simulation-based studies are not directly applicable to real environments.

There have also been attempts to apply reinforcement learning to penetration testing (Takaesu, 2018; Ghanem and Chen, 2018; Caturano et al., 2021; Chowdhary et al., 2020; Zennaro and Erdodi, 2021; Ghanem and Chen, 2020). The results of these efforts suggest that reinforcement learning can support the human in charge of the penetration testing process (Ghanem and Chen, 2018; Caturano et al., 2021). Reinforcement learning has also been applied to planning the steps following the initial penetration by learning a policy in a simulated version of the environment (Chowdhary et al., 2020). Penetration testing (Zennaro and Erdodi, 2021) and web hacking (Erdődi and Zennaro, 2021) have also been converted to simulated capture-the-flag challenges that can be solved with reinforcement learning. Finally, reinforcement learning has been applied to attacking static Portable Executable (PE) malware detectors (Anderson et al., 2018) and even supervised learning-based anti-malware engines (Fang et al., 2019b). The trained agents are capable of modifying the malware to evade detection.

Reinforcement learning has also been applied to blue teaming. Microsoft has developed a research toolkit called CyberBattleSim, which enables modeling the behavior of autonomous agents in a high-level abstraction of a computer network (Team, 2021). Reinforcement learning agents that operate in the abstracted network can be trained using the framework. The objective of the platform is to create an understanding of how malicious reinforcement learning agents could behave in a network and how reinforcement learning can be used for threat detection. Deep reinforcement learning can also be applied to improving feature selection for malware detection (Fang et al., 2019a).

Non-learning-based approaches to automating adversary emulation have been developed as well. Caldera is a framework capable of automatically planning adversarial actions against Windows enterprise networks. Caldera uses an inbuilt model of the structure of enterprise domains and knowledge about the objectives and potential actions of an attacker. Then, an intelligent heuristics-based planner decides which adversarial actions to perform (Applebaum et al., 2016). Moreover, several different non-RL-based AI approaches to penetration testing and vulnerability analysis have been proposed (McKinnel et al., 2019).

Supervised and unsupervised learning, with and without neural networks, have been applied for blue teaming purposes. For instance, malicious PowerShell commands can be detected with novel deep learning methods (Hendler et al., 2018). An ensemble detector combining an NLP-based classifier with a CNN-based classifier was the best at detecting malicious commands, and its detection performance was high enough to be useful in practice. The detector was evaluated using a large dataset consisting of legitimate commands executed by standard users, malicious commands executed by malware, and malicious commands designed by security experts. The suitability of machine learning for intrusion detection, malicious code detection, malware analysis, and spam detection has been discussed (Apruzzese et al., 2018; Cui et al., 2018; Kim et al., 2018). These methods often rely on extensive feature engineering (Kim et al., 2018; Çavuşoğlu, 2019). In defensive tasks, good performance can often be achieved without deep neural networks, with solutions like logistic regression, support vector machines, and random forests (Milosevic et al., 2017). Machine learning-based systems are vulnerable to adversarial attacks (Chen et al., 2019), and as different ML-based techniques have dissimilar weaknesses, a combination of machine learning techniques is often necessary (Apruzzese et al., 2018; Çavuşoğlu, 2019).

3. Reinforcement learning

Reinforcement learning is one of the three main paradigms of machine learning alongside supervised and unsupervised learning (Sutton and Barto, 2018). In reinforcement learning, an agent interacts with an environment over discrete time steps to maximize its long-run reward. At a given time step $t$, the environment has state $s_t$, and the agent is given an observation $o_t$ and a reward signal $r_t$. If the environment is fully observable, the observation is equal to the environment state, $o_t = s_t$. In a more general scenario, the agent receives only a partial observation $o_t$ which does not represent the full environment state. In this case, the agent has its own state $\hat{s}_t$ which might differ from the environment state $s_t$. The agent selects an action $a_t$ from the set of possible actions $\mathcal{A}$ and acts in the environment. The environment transitions to a new state $s_{t+1}$ and the agent receives a new observation $o_{t+1}$ and a new reward $r_{t+1}$. The goal of the agent is to maximize the sum of the collected rewards $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma \in (0, 1]$ is a discount factor used to discount future rewards (Mnih et al., 2016).

In this paper, we use a model-free approach to reinforcement learning in which the agent does not build an explicit model of the environment. The agent selects an action according to a policy function $\pi(a \mid \hat{s})$ which depends on the agent state $\hat{s}$. We use an algorithm called the advantage actor-critic (A2C) (Mnih et al., 2016) in which the policy is parameterized as $\pi_\theta(a \mid \hat{s})$. The parameters $\theta$ of the policy are updated in the direction of

(1)   $\nabla_\theta \log \pi_\theta(a_t \mid \hat{s}_t) \, A(\hat{s}_t, a_t),$

where $A(\hat{s}_t, a_t)$ is an advantage function which estimates the (relative) benefit of taking action $a_t$ in state $\hat{s}_t$ in terms of the expected total reward. In A2C, the advantage function is computed as $A(\hat{s}_t, a_t) = R_t - V_\phi(\hat{s}_t)$, where $V_\phi(\hat{s})$ is the state-value function which estimates the expected total reward when the agent starts at state $\hat{s}$ and follows policy $\pi$: $V_\phi(\hat{s}) = \mathbb{E}_\pi[R_t \mid \hat{s}_t = \hat{s}]$. We update the parameters $\phi$ of the value function using Monte Carlo estimates of the total discounted rewards $R_t$ as the targets, using the Huber loss (Huber, 1992). In practice, most of the parameters $\theta$ and $\phi$ are shared (see Section 5.2 and Figure 1).
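To make the update concrete, the following is a minimal sketch (not the authors' implementation) of how the A2C losses described above could be computed in PyTorch for one episode; the function name, tensor layout, and discount value are assumptions.

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, rewards, gamma=0.99):
    """Compute the A2C losses for one episode.

    log_probs: log pi(a_t | s_t) for the taken actions, shape (T,)
    values:    V(s_t) predicted by the critic, shape (T,)
    rewards:   list of rewards received after each action
    """
    # Monte Carlo estimates of the discounted returns R_t (the critic targets)
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=values.dtype)

    advantages = returns - values.detach()           # A(s_t, a_t) = R_t - V(s_t)
    policy_loss = -(log_probs * advantages).sum()    # actor term, cf. Eq. (1)
    value_loss = F.huber_loss(values, returns)       # critic term (Huber loss)
    return policy_loss + value_loss
```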

4. Privilege escalation as a reinforcement learning task

4.1. Problem Definition

In this work, we focus on automating one particular step often performed by red teaming actors: local privilege escalation in Windows. For our reinforcement learning agent, there will be three possible paths to success:

  • Add the current user as a local administrator

  • Obtain administrative credentials

  • Overwrite a program that is executed with elevated privileges when a user or an administrator logs on

The first alternative is hardly how a true red teaming actor would approach the problem, as changes in the local administrators of a workstation are easily detectable by any advanced detection and response system. However, if the agent succeeds at it, it demonstrates that the agent can, with some exceptions, execute arbitrary code with elevated privileges on the victim host. The second alternative is a more realistic way of performing local privilege escalation. The third method is arguably inferior to the other two, as it requires the attacker to wait for the system to be rebooted or for some other event that triggers the scheduled task or the AutoRun.

4.2. Learning Environment

The learning environment is a simulated Windows 7 environment with a random non-zero number of services, tasks, and AutoRuns. In each training episode, we introduce one vulnerability in the simulated system by selecting randomly from the following 12 alternatives:

  1. hijackable DLL

    • missing DLL

    • writable DLL

  2. re-configurable service

  3. unquoted service path

  4. modifiable ImagePath in the service registry

  5. writable executable pointed to by a service

  6. missing service binary and a writable service folder

  7. writable binary pointed to by an AutoRun

  8. AlwaysInstallElevated bits set to one

  9. credentials of a user with elevated access in the WinLogon registry

  10. credentials of a user with elevated access in an Unattend file

  11. writable binary pointed to by a scheduled task running with elevated privileges

  12. writable Startup folder

To increase the variability of the environment states, we also randomly add services, tasks, and AutoRuns that might initially seem vulnerable to the agent. For instance, a service with one of the service-specific vulnerabilities above but without elevated privileges or a service with a writable parent folder but without an unquoted path can be added. Moreover, standard user credentials might be added to the registry, or a folder on the Windows path might be made writable, among others.
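As a rough illustration of this setup (not the authors' code), each episode's configuration could be sampled along the following lines; the configuration fields, component-count ranges, and decoy flag are hypothetical.

```python
import random

# The 12 vulnerability classes listed above (the hijackable-DLL class covers
# both the missing-DLL and writable-DLL sub-variants).
VULNERABILITIES = [
    "hijackable_dll", "reconfigurable_service", "unquoted_service_path",
    "modifiable_imagepath_registry", "writable_service_executable",
    "missing_service_binary_writable_folder", "writable_autorun_binary",
    "always_install_elevated", "winlogon_credentials", "unattend_credentials",
    "writable_task_binary", "writable_startup_folder",
]

def sample_episode_config(rng=random):
    """Sample one simulated Windows 7 configuration with exactly one vulnerability."""
    return {
        "vulnerability": rng.choice(VULNERABILITIES),
        "n_services": rng.randint(5, 50),    # assumed ranges, not from the paper
        "n_tasks": rng.randint(1, 20),
        "n_autoruns": rng.randint(1, 20),
        "add_decoys": True,                  # seemingly vulnerable but safe components
    }
```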

To train an autonomous reinforcement learning agent to perform local privilege escalation, we need to formalize the learning problem, that is, we need to define the reward function $r$, the action space $\mathcal{A}$, and the space of the agent states $\hat{s}$.

Defining the reward function is perhaps the easiest task. We selected the simplest possible reward structure without any reward shaping. The agent is given a reward of $r = 1$ for the final action of the episode if the privilege escalation has been performed successfully. Otherwise, a zero reward is given. Based on our experiments, this simple sparse reward signal is sufficient for teaching the agent to perform privilege escalation with as few actions as possible because the reward is progressively discounted as more steps are taken by the agent. We also experimented with giving the agent only half a reward ($r = 0.5$) for performing privilege escalation by the third, arguably inferior method, but the agent had trouble learning the desired behavior.
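A tiny numerical illustration of why this sparse, discounted reward already favors short episodes (the discount factor here is an assumed value for illustration, not the one in Appendix A):

```python
gamma = 0.99  # assumed discount factor for illustration

def episode_return(n_steps: int, success: bool) -> float:
    """Discounted return seen from the first step: gamma^(n_steps - 1) if the
    final action completes the escalation, otherwise 0."""
    return gamma ** (n_steps - 1) if success else 0.0

# Solving the task in ~11 actions yields a noticeably larger return than
# solving it in 200 actions, so no extra reward shaping is needed.
print(episode_return(11, True), episode_return(200, True))   # ~0.904 vs ~0.135
```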

The state of the environment and its dynamics are determined by the Windows 7 environment (or its simulator) that the agent interacts with. The environment is only partially observable: the observations are the outputs of the commands that the agent executes. Working with such a rich observation space is difficult, and therefore, we have designed a custom procedure that converts the observations into the agent state $\hat{s}_t$. It is the agent state that is used as the input of the policy and value functions. We also manually designed a set of high-level actions that the agent needs to choose from. We describe the agent state and the action space in the following sections.

Trinary variables:
(1) Are there credentials in files?
(2) Do the credentials in the files belong to users with elevated privileges?
(3) Are there credentials in the registry?
(4) Do the credentials in the registry belong to users with elevated privileges?
(5) Is there a writable folder on the Windows path?
(6) Are the AlwaysInstallElevated bits set?
(7) Can the AutoRuns be enumerated using an external PowerShell module?
Binary variables:
Has a malicious executable been created in Kali Linux?
Has a malicious service executable been created in Kali Linux?
Has a malicious DLL been created in Kali Linux?
Has a malicious MSI file been created in Kali Linux?
Has a malicious executable been downloaded?
Has a malicious service executable been downloaded?
Has a malicious DLL been downloaded?
Has a malicious MSI file been downloaded?
Does the agent know the list of local users?
Are there users whose privileges need to be checked?
Does the agent know the services running on the OS?
Does the agent know the scheduled tasks running on the OS?
Does the agent know the AutoRuns of the OS?
Has the agent performed a static analysis of the service binaries to detect DLLs?
Have the DLLs loaded by the service binaries been searched?
Are there folders whose permissions must be checked?
Are there executables whose permissions must be checked?
Does the agent know the current username?
Does the agent know the Windows path?
Are there base64-credentials to decode?
Table 1. General information about the system stored in the agent state

4.3. State of the Agent

We update the state of the agent by keeping the information that is relevant for the task of privilege escalation. The agent state includes variables that contain general information about the system, information about discovered services, dynamic-link libraries (DLLs), AutoRun registry, and scheduled tasks.

The general information is represented by the 27 variables listed in Table 1. Seven of these variables are trinary (true/unknown/false) and contain information useful for the task of privilege escalation. The remaining 20 variables are binary (true/false), and they also contain information about the previous actions of the agent. The previous actions are included in the state to make the agent state as close to Markov as possible, which makes the training easier.

Service:
Is the service running?
Is the service run with elevated privileges?
Is the service path unquoted?
Is there a writable parent folder?
Is there whitespace in the service path?
Is the service binary in C:\Windows?
Can the service executable be written?
Can the service be re-configured?
Can the service registry be modified?
Does the service load a vulnerable DLL?
Has the service been exploited?
DLL:
Is the DLL missing?
Is the DLL writable?
Has the DLL been replaced with a malicious DLL?
AutoRun:
Is the AutoRun file writable?
Is the AutoRun file in C:\Windows?
Task:
Is the task run with elevated privileges?
Is the executable writable?
Is the executable in C:\Windows?
Table 2. Part of the agent state with information about services, DLLs, AutoRuns, and scheduled tasks (all are trinary variables)
Service: name, executable path, user
Executable: path, linked DLLs
DLL: name, calling executable, path
AutoRun: executable path, trigger
Task: name, executable path, trigger, user
Credentials: username, password, plaintext
File system: folders, executables, permissions
Table 3. Examples of auxiliary information used to fill the command arguments

At the beginning of each training episode, the agent has no knowledge of the services running on the host. The agent has to collect a list of services by taking the action A31 Get a list of services. Once a service is detected, it is described by its name, full path, the owning user and the 11 trinary attributes listed in Table 2. Each of these attributes can have three possible values: true (+1), unknown (0), and false (-1). Then, the agent needs to perform actions such as A25 Check service permissions with accesschk64 to fill the values of the unknown attributes.
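For illustration, the trinary service attributes could be encoded as a numeric vector roughly as follows; this is a sketch with assumed attribute names, not the paper's actual feature layout.

```python
TRUE, UNKNOWN, FALSE = 1.0, 0.0, -1.0

# The 11 trinary service attributes of Table 2, with assumed short names.
SERVICE_ATTRIBUTES = [
    "running", "elevated", "unquoted_path", "writable_parent_folder",
    "whitespace_in_path", "binary_in_windows_dir", "writable_executable",
    "reconfigurable", "modifiable_registry", "loads_vulnerable_dll", "exploited",
]

def encode_service(service: dict) -> list:
    """Map a discovered service to the 11-dimensional trinary feature vector;
    attributes that have not yet been checked default to UNKNOWN (0)."""
    return [service.get(attr, UNKNOWN) for attr in SERVICE_ATTRIBUTES]

# Example: a running, elevated service whose other attributes are still unknown.
print(encode_service({"running": TRUE, "elevated": TRUE}))
```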

A1. Create a malicious executable in Kali Linux
A2. Create a malicious service executable in Kali Linux
A3. Compile a custom malicious DLL in Kali Linux
A4. Create a malicious MSI in Kali Linux
A5. Download a malicious executable in Windows
A6. Download a malicious service executable in Windows
A7. Download a malicious DLL in Windows
A8. Download a malicious MSI in Windows
A9. Start an exploited service
A10. Stop an exploited service
A11. Overwrite the executable of an autorun
A12. Overwrite the executable of a scheduled task
A13. Overwrite a service binary
A14. Move a malicious executable so that it is executed by an unquoted service path
A15. Overwrite a DLL
A16. Move a malicious DLL to a folder on Windows path to replace a missing DLL
A17. Re-configure service to use a malicious executable
A18. Re-configure service to add the user to local administrators
A19. Change service registry to point to a malicious executable
A20. Change service registry to add the user to local administrators
A21. Install a malicious MSI file
A22. Search for unattend* sysprep* unattended* files
A23. Decode base64 credentials
A24. Test credentials
A25. Check service permissions with accesschk64
A26. Check the ACLs of the service registry with Get-ACL
A27. Check executable permissions with icacls
A28. Check directory permissions with icacls
A29. Analyze service executables for DLLs
A30. Search for DLLs
A31. Get a list of services
A32. Get a list of AutoRuns
A33. Get a list of scheduled tasks
A34. Check AlwaysInstallElevated bits
A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A37. Get the current user
A38. Get the Windows path
Table 4. Actions

Since local privilege escalation can be performed by DLL hijacking, we also include the information about the DLLs used by the services in the state. Each DLL is described using a set of attributes listed in Table 2. This information is added to the state after taking action A29 Analyze service executables for DLLs.

Privileges can be elevated in Windows by using vulnerable executables in the AutoRun registry and misconfigured scheduled tasks. Therefore, we add information about the AutoRun files and the scheduled tasks to the agent state. Each AutoRun file and each scheduled task is described using the trinary attributes defined in Table 2.

In addition to the variables defined in Table 1 and Table 2, the agent maintains a collection of auxiliary information in its memory. The information is needed to fill the arguments of the commands executed by the agent. Examples of the auxiliary information are given in Table 3. This information is gathered and updated based on the observations, that is, the outputs of the commands performed by the agent. The auxiliary information is not given as input to the neural network, and hence, it affects neither the policy nor the value directly.

4.4. Action Space

We designed the action space of the agent by including actions needed for gathering information about the victim Windows host and performing the privilege escalation techniques. The action space consists of 38 actions listed in Table 4. Although the action space is crafted for known privilege escalation vulnerabilities (which we consider unavoidable within the constraints of the current RL), there is no one-to-one relationship between the actions and vulnerabilities. Some actions are only relevant for specific vulnerabilities, whereas many others are more general and can be used in multiple scenarios (see Appendix B). Our general principle in constructing the action space has been to make the actions as atomic as possible while keeping the problem potentially solvable by the current RL.

The actions are defined on a high level, which means that their exact implementation can vary, for example, depending on the platform. For instance, action A29 Analyze service executables for DLLs can be implemented by static analysis of the Portable Executable files with an open-source analyzer to detect the loaded DLLs. The same action can be implemented using a custom analyzer or a script to download the executable and analyze it with Process Monitor. Our high-level action definition enables modifying the low-level implementations of the actions, such as changing the frameworks used, without affecting the trained agent. To create the necessary malicious executables, we use Kali Linux with Metasploit. The malicious DLLs needed for performing DLL hijacking are compiled manually. However, the low-level implementation of these commands can easily be changed if desired.

Each of the high-level actions is well-specified and can be performed using only a handful of standard Windows (cmd.exe and PowerShell) and Linux (zsh) commands. Many of the commands need arguments. For instance, to take the action A9 Start an exploited service, the name of the service must be specified. In this work, we automatically fill the arguments using the auxiliary information collected as discussed in Section 4.3. For example, one of the actions defined is A28 Check directory permissions with icacls. The agent maintains an internal list of directories that are of interest, and when the action to analyze the permissions of directories is selected, every directory on the list is scanned.
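As an illustration of how such a high-level action can expand into concrete commands using the auxiliary information, a hedged sketch for A28 Check directory permissions with icacls might look like this; the memory structure and function name are hypothetical.

```python
def a28_check_directory_permissions(agent_memory: dict) -> list:
    """Build one icacls command per directory of interest tracked by the agent."""
    return [f'icacls "{path}"' for path in agent_memory.get("interesting_dirs", [])]

# Example usage with a hypothetical directory gathered from earlier observations:
commands = a28_check_directory_permissions(
    {"interesting_dirs": [r"C:\Program Files\Vulnerable Service"]}
)
# -> ['icacls "C:\\Program Files\\Vulnerable Service"']
```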

5. Experiments

5.1. Simulator of a Windows 7 Virtual Machine

A key practical challenge for training a reinforcement learning agent to perform red teaming tasks is the slow simulation speed when performing actions on a real virtual machine. For example, running commands necessary for privilege escalation can take longer than a minute on a full-featured Windows 7 virtual machine, even if the agent acts optimally. At the beginning of training, when the agent selects actions very close to randomly, one episode of training on a real VM can last significantly longer. Moreover, each training episode requires a new virtual machine that has been configured with one of the available vulnerabilities. Provisioning and configuring a virtual machine in such a manner will further add to the time it would take to train the agent. Training a successful agent may require thousands of training episodes, which can take a prohibitively large amount of time when training on a real operating system. Developing an infrastructure to tackle the long simulation times on a real system is a significant challenge, and it is left outside the scope of this study.

To alleviate this issue, we implemented a simulated Python environment that emulates the behavior of a genuine Windows 7 operating system relevant to the privilege escalation task. The simulation consists of, among others, a file system with access controls, Windows registry, AutoRuns, scheduled tasks, users, executables, and services. Using this environment, the actions taken by the agent can be simulated in a highly efficient manner. Moreover, creating simulated machines with random vulnerabilities for training requires little programming and computing power and can be done very fast. However, to determine whether training the agent in a simulated environment instead of a real operating system is feasible, the trained agent will be evaluated by testing it on a vulnerable Windows 7 virtual machine.
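The simulator's interface is not specified in detail in the paper, but a minimal gym-style skeleton consistent with the description above could look like this; the class and method names are assumptions, and the internal logic is stubbed out.

```python
class SimulatedWindows7Env:
    """Skeleton of a simulated host with services, tasks, AutoRuns, and one vulnerability."""

    def __init__(self, config, max_steps=1000):
        self.config = config        # e.g. output of a configuration sampler
        self.max_steps = max_steps

    def reset(self):
        self.steps = 0
        self.escalated = False
        # A full simulator would build the file system, registry, services,
        # scheduled tasks, AutoRuns, and users from self.config here.
        return {}                   # the agent starts with no knowledge of the system

    def step(self, action):
        self.steps += 1
        observation = {}            # simulated output of the executed command(s)
        # A full simulator would set self.escalated when an exploitation path
        # matching the configured vulnerability succeeds.
        reward = 1.0 if self.escalated else 0.0        # sparse reward, no shaping
        done = self.escalated or self.steps >= self.max_steps
        return observation, reward, done
```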

Figure 1. The architecture of the A2C agent. Colored boxes represent multilayer perceptrons (same colors denote shared parameters). The black circles represent the concatenation of input signals. The network takes the AutoRuns, services, and tasks as input and outputs the policy and the value.

5.2. The Architecture of the Agent

We use an A2C reinforcement learning agent (described in Section 3) in which the policy and value functions are modeled with neural networks. In practice, we use a single neural network with two heads: one produces the estimated value $V(\hat{s})$ of the state, and the other the probabilities $\pi(a \mid \hat{s})$ of taking one of the 38 actions. The network gets as inputs the state variables described in Tables 1 and 2. The complete model has less than 27,000 parameters.

The main challenge in designing the neural network is that the number of AutoRuns, services, tasks, and DLLs can vary across training episodes. A Windows host might have anything from dozens to thousands of services, and the number of tasks and AutoRuns might also vary significantly depending on the host. We want our agent to be able to generalize to any number of those. Therefore, we process the information about each service, AutoRun, and scheduled task separately and aggregate the outputs of these computational blocks using the maximum operation. The architecture of the neural network is presented in Figure 1. Computational blocks with shared parameters are shown with the same colors.

In the proposed architecture, we concatenate the max-aggregated outputs of the blocks that process the information about AutoRuns, tasks, and DLLs with the outputs of the blocks that process the information about the individual services. Intuitively, this corresponds to augmenting the service data with the information about the most vulnerable AutoRun and task. Note that we also augment the service data with the information about the DLLs used by the corresponding service. Then, we pass the concatenated information through multilayer perceptrons that output value estimates and policies for all services. Finally, we regard the service with the highest value estimate as the most vulnerable one and select the policy corresponding to that service as the final policy.
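A hedged PyTorch sketch of the architecture in Figure 1 is given below; the input dimensionalities, hidden width, and layer depths are assumptions made for illustration and do not reproduce the paper's exact parameter count.

```python
import torch
import torch.nn as nn

class A2CNetwork(nn.Module):
    """Per-component MLP blocks with max-aggregation over variable-size inputs."""

    def __init__(self, d_general=27, d_service=11, d_autorun=2, d_task=3, d_dll=3,
                 hidden=32, n_actions=38):
        super().__init__()
        def mlp(d_in):
            return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.autorun_block = mlp(d_autorun)
        self.task_block = mlp(d_task)
        self.dll_block = mlp(d_dll)
        self.service_block = mlp(d_service)
        d_cat = d_general + 4 * hidden
        self.value_head = nn.Sequential(nn.Linear(d_cat, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))
        self.policy_head = nn.Sequential(nn.Linear(d_cat, hidden), nn.ReLU(),
                                         nn.Linear(hidden, n_actions))

    def forward(self, general, services, autoruns, tasks, dlls_per_service):
        # general: (d_general,), services: (n_services, d_service),
        # autoruns: (n_autoruns, d_autorun), tasks: (n_tasks, d_task),
        # dlls_per_service: list of (n_dlls_i, d_dll) tensors, one per service.
        a = self.autorun_block(autoruns).max(dim=0).values      # (hidden,)
        t = self.task_block(tasks).max(dim=0).values            # (hidden,)
        s = self.service_block(services)                         # (n_services, hidden)
        d = torch.stack([self.dll_block(dll).max(dim=0).values
                         for dll in dlls_per_service])           # (n_services, hidden)
        n = s.shape[0]
        x = torch.cat([general.expand(n, -1), s, d,
                       a.expand(n, -1), t.expand(n, -1)], dim=1)
        values = self.value_head(x).squeeze(-1)    # one value estimate per service
        logits = self.policy_head(x)               # one policy per service
        best = values.argmax()                     # treat this service as most vulnerable
        return values[best], torch.softmax(logits[best], dim=-1)
```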

5.3. Training

Training consists of episodes in which the agent interacts with one instance of a sampled environment. At the beginning of each episode, the agent has no knowledge of the environment. The empty agent state is fed into the neural network that produces the value and the policy outputs. The action is sampled from the probabilities given by the policy output. Thus, we do not use explicit exploration strategies such as epsilon-greedy. The selected action is performed in the simulated environment, and the reward and the observations received as a result of the action are passed to the agent. The observations are parsed to update the agent state as described in Section 4.3. Then, a new action is selected based on the updated state. The iteration continues until the maximum number of steps for one episode is reached, or the agent has successfully performed privilege escalation.
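A sketch of this episode loop, under the assumed interfaces of the environment and network sketches given earlier, could look as follows; the state_tracker object stands in for the custom observation-parsing procedure of Section 4.3 and is hypothetical.

```python
import torch

def run_episode(env, network, state_tracker, max_steps=1000):
    """Collect one episode by sampling actions directly from the policy output."""
    env.reset()
    state_tracker.reset()                        # empty agent state at episode start
    log_probs, values, rewards = [], [], []
    for _ in range(max_steps):
        value, probs = network(*state_tracker.as_network_inputs())
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                   # no epsilon-greedy: sample from pi
        observation, reward, done = env.step(int(action))
        state_tracker.update(int(action), observation)   # parse command outputs
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)
        if done:
            break
    return torch.stack(log_probs), torch.stack(values), rewards

# At the end of the episode the parameters would be updated with, for example,
# loss = a2c_loss(*run_episode(env, network, state_tracker)); loss.backward().
```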

The parameters of the neural network are updated at the end of each episode. The gradient (1) is computed automatically by backpropagation with PyTorch (Paszke et al., 2019), and the parameters are updated using the Adam optimizer (Kingma and Ba, 2014). The agent is trained for as long as the average reward per episode continues to increase. The hyperparameters used for training the agent are given in Appendix A.

Figure 2 presents the evolution of the episode length (averaged over 100 episodes) during one training run. The average episode length starts from around 200, and it gradually decreases reaching a level slightly above 11 after approximately 30,000 training episodes. We use 1,000 as the maximum number of steps per episode (see Appendix A), which implies that the agent manages to solve the problem and gets rewards from the very beginning of the optimization procedure when it takes close-to-random actions. We estimated that the average episode length is approximately 10.7 actions if the agent acts according to the optimal policy. Thus, the results indicate that the agent has learned to master the task of privilege escalation.

Figure 2. The average episode length during training on a logarithmic scale. The red line illustrates the episode length of an optimal policy that is approximately 10.7 actions per episode.


Training the agent for 50,000 episodes in the simulated environment (see Section 5.1) takes less than two hours without any significant code optimizations using a single NVIDIA GeForce GTX 1080 Ti GPU, which is a high-end consumer GPU from 2017. Note that performing the same training on a real Windows 7 virtual machine could take weeks.

5.4. Testing the Agent

Next, we test whether the agent trained in our simulated environment can transfer to a real Windows 7 operating system without any adaptation. We also compare the performance of our agent to two baselines: a ruleset crafted by an expert that can be interpreted as the optimal policy and a random policy.

We create a Windows 7 virtual machine using Hyper-V provided by the Windows 10 operating system. We assume that the offensive actor has gained low-level access with code execution rights by performing, for example, a successful penetration test or a phishing campaign. This is simulated by installing an SSH server on the victim host. We use the Paramiko SSH library in Python to connect to the virtual machine and execute commands with user-level credentials (Forcier, 2021). We use Hyper-V to create a Kali Linux virtual machine with Metasploit for generating malicious executables. However, instead of using Metasploit for creating malicious DLLs, the agent has to modify and compile a custom DLL code by taking action A3 Compile a custom malicious DLL in Kali Linux. Paramiko is also used to connect to the Kali machine.
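A minimal sketch of driving the victim host over SSH with Paramiko, as in the test setup described above; the host address, credentials, and command are placeholders, and error handling is omitted.

```python
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("192.0.2.10", username="lowpriv", password="password")  # user-level creds

# One low-level command behind a high-level action, e.g. part of
# "A31 Get a list of services".
stdin, stdout, stderr = client.exec_command("sc query state= all")
observation = stdout.read().decode(errors="replace")
client.close()
```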

Using SSH to simulate low-level code execution rights on the victim Windows 7 host has some limitations. Some of the Windows command-line utilities such as wmic and accesschk64 are blocked for non-privileged users over SSH. To overcome this limitation of the test scenario, we open a second SSH session using elevated credentials and run the blocked commands in that session. In practice, a malicious actor would be able to execute these utilities while accessing the victim's environment via a reverse shell or a Meterpreter session. Care was taken to prevent the test infrastructure from affecting the target environment. For example, due to the selection of an SSH tunnel as an off-the-shelf communication channel for testing purposes, the agent does not target any SSH-related vulnerabilities. Engineering effort was not prioritized for creating a production-ready attack agent, as this was considered beyond the scope of the research.

In order to take some of the actions listed in Table 4, the victim host has to have Windows Sysinternals with accesschk64. Moreover, we need an executable for scanning the DLLs loaded by the PE files. We used an open-source solution for that, but it failed to detect a DLL loaded by a handcrafted service executable. To work around this issue, we hard-coded the result of the scan in the agent. To properly address this issue, the high-level action of performing PE scanning could be mapped to a script that uploads the service executable on a Windows machine and uses ProcMon from Sysinternals to analyze the DLLs loaded by the service executable. Alternatively, a superior PE analyzer could be used.

First, we tested our agent on a virtual machine without external antivirus (AV) software or an intrusion detection system, but which had an up-to-date and active Windows Defender (which is essentially only an anti-spyware program in Windows 7). We kept the number of services similar to the number of services during training by excluding all services in C:\Windows from the list of services gathered by the agent. We made the agent deterministically select the action with the highest probability. Our agent was successful in exploiting all twelve vulnerabilities. Examples of the sequences of actions taken by the agent during evaluation can be found in Tables 5-8. The agent took very few unnecessary actions. The performance (measured in terms of the number of actions) could be improved by gathering more information before scanning for directory permissions. Currently, the agent prefers scanning the directory permissions immediately after finding interesting directories. However, the amount of noise generated by the agent would have been similar, as the agent would have performed more Windows commands per high-level action.

A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A36. Get a list of local users and administrators
A26. Check the ACLs of service registries with Get-ACL
A25. Check service permissions with accesschk64
A34. Check AlwaysInstallElevated bits
A32. Get a list of AutoRuns
A28. Check directory permissions with icacls
A27. Check executable permissions with icacls
A22. Search for unattend* sysprep* unattended* files
A33. Get a list of scheduled tasks
A28. Check directory permissions with icacls
A27. Check executable permissions with icacls
A29. Analyze service executables for DLLs
A30. Search for DLLs
A28. Check directory permissions with icacls
A38. Get the Windows path
A28. Check directory permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A16. Move a malicious DLL to a folder on Windows path to replace a missing DLL
A9. Start an exploited service
Table 5. Actions to exploit a missing DLL file
A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service
Table 6. Actions to exploit a service with a missing binary
A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A24. Test credentials
Table 7. Actions to exploit elevated credentials in the WinLogon registry
A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A36. Get a list of local users and administrators
A26. Check the ACLs of service registries with Get-ACL
A25. Check service permissions with accesschk64
A34. Check AlwaysInstallElevated bits
A4. Create a malicious MSI file in Kali Linux
A8. Download a malicious MSI file in Windows
A21. Install a malicious MSI file
Table 8. Actions to exploit AlwaysInstallElevated

After that, we did not limit the number of services (by excluding services in C:\Windows) and let the agent perform privilege escalation. The increased number of services had no negative effect on the agent, and the agent was successful at the task. An example sequence of commands is given in Appendix C. However, because the agent performs each selected action on every applicable service, the agent generates some noise by scanning through the permissions of all services in C:\Windows. That could have caused an alert in an advanced detection and response system.

The number of actions used by the agent to escalate privileges during the testing phase is given in Table 9. We compare the following agents:

  • the oracle agent, which assumes complete knowledge of the system, including the vulnerability;

  • the optimal policy, which is approximated using a fixed ruleset crafted by an expert;

  • the deterministic RL agent, which selects the action with the highest probability;

  • the stochastic RL agent, which samples the action from the probabilities produced by the policy network;

  • an agent taking random actions.

For all random trials, we used 1,000 samples and computed the average number of actions used by the agent. Because of the computational cost of running thousands of episodes, all tests involving randomness were run in a simulated environment similar to the testing VM. The results suggest that the policy of the deterministic agent is close to optimal. The addition of stochasticity to action selection has a slightly negative effect on performance, but it increases the variability of the agent's actions, making the agent potentially more difficult to detect.
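The difference between the deterministic and stochastic agents reduces to how an action is drawn from the 38-dimensional policy output, as in this small sketch:

```python
import torch

def select_action(probs: torch.Tensor, deterministic: bool) -> int:
    """probs: policy output of the network over the 38 actions."""
    if deterministic:
        return int(probs.argmax())                                   # deterministic RL agent
    return int(torch.distributions.Categorical(probs).sample())      # stochastic RL agent
```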

Vulnerability   Oracle (full knowledge)   Expert (≈ optimal policy)   Deterministic RL   Stochastic RL   Random
1 10 20 24 25.3 231.2
2 5 10 9 8.0 152.2
3 7 7 8 8.2 206.0
4 5 9 8 9.0 147.7
5 7 10 15 11.5 212.2
6 7 7 8 8.3 208.1
7 6 9 14 14.1 171.7
8 5 15 11 11.0 156.0
9 3 11 3 10.9 96.0
10 4 13 14 13.5 162.8
11 6 9 17 17.6 166.9
12 6 8 13 13.3 165.0
AVG 5.9 10.7 12.0 12.6 173.0
Table 9. Number of actions used to escalate privileges

We additionally tested the ability of the agent to generalize to multiple vulnerabilities which might be simultaneously present in the system. This was done in three ways. First, the agent was evaluated in an environment with six different types of vulnerable services present. Second, the agent was evaluated in an environment with all twelve vulnerabilities present. Finally, random combinations of any two vulnerabilities were tested. The agent had little trouble performing privilege escalation in any of these scenarios.

As a matter of interest, we finally evaluated the agent’s performance against a host running an up-to-date version of a standard endpoint protection software, Microsoft Security Essentials, with real-time protection enabled. As expected, the AV software managed to recognize the default malicious executables created by msfvenom in Kali Linux. However, the AV software failed to recognize the custom DLL compiled by the agent, and hence, privilege escalation using DLL hijacking was possible. Moreover, the AV software failed to detect any methods that did not involve a downloaded malicious payload, such as re-configuring the vulnerable service to execute a command that added the user as a local administrator. Hence, privilege escalation was possible in many of the scenarios, even with up-to-date AV software present. It should be noted that these techniques fall beyond the scope of file-based threat detection used by standard antivirus software and would require more advanced protection strategies to counter, such as behavioral- or heuristics-based detection. The agent’s performance against such detection engines was considered to be beyond the scope of the project and was not assessed.

6. Discussion

Our work demonstrates that it is possible to train a deep reinforcement learning agent to perform local privilege escalation in Windows 7 using known vulnerabilities. Our method is the first reinforcement learning agent, to the best of our knowledge, that performs privilege escalation with an extensive state space and a high-level action space with easily customizable low-level implementations. Despite being trained in simulated environments, the test results demonstrate that our agent can solve the formalized privilege escalation problem in a close-to-optimal fashion on full-featured Windows machines with realistic vulnerabilities.

The efficacy of our implementation is limited if up-to-date antivirus software is running on the victim host because only a handcrafted DLL is used, whereas the malicious executables are created using Metasploit with default settings. However, if the mapping from the high-level actions to the low-level commands (see Table 4) was improved so that more sophisticated payloads were used or the action space was expanded with actions for defense evasion, a reinforcement learning agent could be capable of privilege escalation in hosts with up-to-date antivirus software but without an advanced detection and response system.

While simple attacks are likely to be detected by advanced breach detection solutions, not all companies employ those for various reasons. The constant stream of breaches seen in the news reflects that reality. Moreover, if adversaries develop RL-based tools for learning and automating adversarial actions, they might prefer to target networks that are less likely to be running breach detection software.

The current threat level presented by reinforcement learning agents is most likely limited to agents capable of exploiting existing well-known vulnerabilities. The same could be achieved by a scripted agent with a suitable hard-coded logic. However, the RL approach offers a number of benefits compared to a scripted agent:

  • Scripting an attacking agent can be difficult when the number of potentially exploitable vulnerabilities grows and if the attacked system contains an IDS.

  • The probabilistic approach of our RL agent will produce more varied attacks (and attempted attacks) than a scripted robot that follows hard-coded rules, which makes our agent more usable for training ML-based defenses and testing and evaluating intrusion detection systems.

  • An RL agent may be quickly adapted to changes in the environment. For example, if certain sequences of actions cause an alarm raised by an intrusion detection system, the agent might learn to take a different route, which is not detectable by the IDS. This would produce invaluable information for strengthening the defense system.

In the long run, RL agents could have the potential to discover and exploit novel unseen vulnerabilities, which would have an enormous impact on the field. To implement this idea, agents would most likely need to interact with an authentic environment, which would require a great deal of engineering effort and a huge amount of computational resources. Crafting the action space would nevertheless be most likely unavoidable within the constraints of the current RL methods. However, the development should go in the direction of making the actions more atomic and minimizing the amount of prior knowledge used in designing the action space. This could allow the agent to encompass more vulnerabilities and could be a way to get closer to the ultimate goal of discovering new vulnerabilities.

Another research direction is to increase the complexity of the learning task. In this first step, we wanted to understand how RL-powered attacks could work in a constrained, varied setup, and our key result is showing that the RL approach works for such a complex learning task. Defeating defensive measures or expanding to a wider range of target environments would be a research topic with a significantly larger scope. It would be interesting to see whether a reinforcement learning agent can perform more steps in the cyber security kill chain, such as defense evasion. It would also be interesting to train the agent in an environment with an intrusion detection system or a defensive RL agent and perform multi-agent reinforcement learning, which has been done in previous research on post-exploitation in simulated environments (Elderman et al., 2017).

Ethical Considerations

The primary goal of this work is to contribute to building resilient defense systems that can detect and prevent various types of potential attacks. Rule-based defense systems can be effective, but as the number of attack scenarios grows, they become increasingly difficult to build and maintain. Data-driven defensive systems trained with machine learning offer a promising alternative, but implementing this idea in practice is hampered by the scarcity of available training data in this domain. In this work, we present a possible solution to this problem by training a reinforcement learning agent to perform malicious activities and thereby generate invaluable training data for improving defense systems. The presented approach can also support the red teaming activities performed by cyber security experts by automating some steps in the kill chain. However, the system developed in this project can potentially be dangerous in the wrong hands. Hence, the code created in this project will not be open-sourced or released to the public.

Acknowledgements.
We thank the Security and Software Engineering Research Center SERC for funding our work. In addition, we thank David Karpuk, Andrew Patel, Paolo Palumbo, Alexey Kirichenko, and Matti Aksela from F-Secure for their help in running this project and their domain expertise, Tuomas Aura for giving us highly valuable feedback, and the Academy of Finland for the support within the Flagship Programme Finnish Center for Artificial Intelligence (FCAI).

References

  • M. Alauthman, N. Aslam, M. Al-Kasassbeh, S. Khan, A. Al-Qerem, and K. R. Choo (2020) An efficient reinforcement learning-based botnet detection approach. Journal of Network and Computer Applications 150, pp. 102479. Cited by: §2.
  • H. S. Anderson, A. Kharkar, B. Filar, D. Evans, and P. Roth (2018) Learning to evade static PE machine learning malware models via reinforcement learning. arXiv preprint arXiv:1801.08917. Cited by: §2.
  • A. Applebaum, D. Miller, B. Strom, C. Korban, and R. Wolf (2016) Intelligent, automated red team emulation. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pp. 363–373. Cited by: §2.
  • G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido, and M. Marchetti (2018) On the effectiveness of machine and deep learning for cyber security. In 2018 10th International Conference on Cyber Conflict (CyCon), pp. 371–390. Cited by: §2.
  • J. A. Bland, M. D. Petty, T. S. Whitaker, K. P. Maxwell, and W. A. Cantrell (2020) Machine learning cyberattack and defense strategies. Computers & Security 92, pp. 101738. Cited by: §2.
  • F. Caturano, G. Perrone, and S. P. Romano (2021) Discovering reflected cross-site scripting vulnerabilities using a multiobjective reinforcement learning environment. Computers & Security 103, pp. 102204. Cited by: §2.
  • Ü. Çavuşoğlu (2019) A new hybrid approach for intrusion detection using machine learning methods. Applied Intelligence 49 (7), pp. 2735–2761. Cited by: §2.
  • M. Chatterjee and A. Namin (2019) Detecting phishing websites through deep reinforcement learning. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 227–232. Cited by: §2.
  • T. Chen, J. Liu, Y. Xiang, W. Niu, E. Tong, and Z. Han (2019) Adversarial attack and defense in reinforcement learning-from ai security view. Cybersecurity 2 (11), pp. 1–22. Cited by: §2.
  • A. Chowdhary, D. Huang, J. S. Mahendran, D. Romo, Y. Deng, and A. Sabur (2020) Autonomous security analysis and penetration testing. In 2020 16th International Conference on Mobility, Sensing and Networking (MSN), pp. 508–515. Cited by: §2.
  • Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen (2018) Detection of malicious code variants based on deep learning. IEEE Transactions on Industrial Informatics 14 (7), pp. 3187–3196. Cited by: §2.
  • R. Elderman, L. J. Pater, A. S. Thie, M. M. Drugan, and M. Wiering (2017) Adversarial reinforcement learning in a cyber security simulation.. In ICAART (2), pp. 559–566. Cited by: §2, §6.
  • L. Erdődi and F. M. Zennaro (2021) The agent web model: modeling web hacking for reinforcement learning. International Journal of Information Security, pp. 1–17. Cited by: §2.
  • Z. Fang, J. Wang, J. Geng, and X. Kan (2019a) Feature selection for malware detection based on reinforcement learning. IEEE Access 7, pp. 176177–176187. Cited by: §2.
  • Z. Fang, J. Wang, B. Li, S. Wu, Y. Zhou, and H. Huang (2019b) Evading anti-malware engines with deep reinforcement learning. IEEE Access 7, pp. 48867–48879. Cited by: §2.
  • A. Ferdowsi, U. Challita, W. Saad, and N. B. Mandayam (2018) Robust deep reinforcement learning for security and safety in autonomous vehicle systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 307–312. Cited by: §2.
  • J. Forcier (2021) Paramiko: a Python implementation of the SSHv2 protocol. External Links: Link Cited by: §5.4.
  • M. C. Ghanem and T. M. Chen (2018) Reinforcement learning for intelligent penetration testing. In 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pp. 185–192. Cited by: §2.
  • M. C. Ghanem and T. M. Chen (2020) Reinforcement learning for efficient network penetration testing. Information 11 (1), pp. 6. Cited by: §2.
  • A. Greenberg (2018) The untold story of notpetya, the most devastating cyberattack in history. Wired, August 22. Cited by: §1.
  • G. Han, L. Xiao, and H. V. Poor (2017) Two-dimensional anti-jamming communication based on deep reinforcement learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2087–2091. Cited by: §2.
  • Y. Han, B. I. Rubinstein, T. Abraham, T. Alpcan, O. De Vel, S. Erfani, D. Hubczenko, C. Leckie, and P. Montague (2018) Reinforcement learning for autonomous defence in software-defined networking. In International Conference on Decision and Game Theory for Security, pp. 145–165. Cited by: §2.
  • X. He, H. Dai, and P. Ning (2016) Faster learning and adaptation in security games by exploiting information asymmetry. IEEE Transactions on Signal Processing 64 (13), pp. 3429–3443. Cited by: §2.
  • D. Hendler, S. Kels, and A. Rubin (2018) Detecting malicious powershell commands using deep neural networks. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 187–197. Cited by: §2.
  • P. J. Huber (1992) Robust estimation of a location parameter. In Breakthroughs in Statistics, pp. 492–518. Cited by: §3.
  • T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im (2018) A multimodal deep learning method for android malware detection using various features. IEEE Transactions on Information Forensics and Security 14 (3), pp. 773–788. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.3.
  • R. Maeda and M. Mimura (2021) Automating post-exploitation with deep reinforcement learning. Computers & Security 100, pp. 102108. Cited by: 4th item, 5th item, 6th item, §1, §1, §1.
  • D. R. McKinnel, T. Dargahi, A. Dehghantanha, and K. R. Choo (2019) A systematic literature review and meta-analysis on artificial intelligence in penetration testing and vulnerability assessment. Computers & Electrical Engineering 75, pp. 175–188. Cited by: §2.
  • N. Milosevic, A. Dehghantanha, and K. R. Choo (2017) Machine learning aided android malware classification. Computers & Electrical Engineering 61, pp. 266–274. Cited by: §2.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §3, §3.
  • T. T. Nguyen and V. J. Reddi (2019) Deep reinforcement learning for cyber security. arXiv preprint arXiv:1906.05799. Cited by: §2.
  • Z. Ni and S. Paul (2019) A multistage game in smart grid security: a reinforcement learning solution. IEEE Transactions on Neural Networks and Learning Systems 30 (9), pp. 2684–2695. Cited by: §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §5.3.
  • Rapid7 (2021) Metasploit Framework. External Links: Link Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.
  • I. Takaesu (2018) External Links: Link Cited by: 1st item, 5th item, 6th item, §1, §1, §1, §2.
  • Microsoft 365 Defender Research Team (2021) External Links: Link Cited by: §2.
  • X. Wan, G. Sheng, Y. Li, L. Xiao, and X. Du (2017) Reinforcement learning based mobile offloading for cloud-based malware detection. In GLOBECOM 2017-2017 IEEE Global Communications Conference, pp. 1–6. Cited by: §2.
  • Y. Wang, Z. R. Shi, L. Yu, Y. Wu, R. Singh, L. Joppa, and F. Fang (2019) Deep reinforcement learning for green security games with real-time information. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1401–1408. Cited by: §2.
  • L. Xiao, Y. Li, G. Liu, Q. Li, and W. Zhuang (2015) Spoofing detection with reinforcement learning in wireless networks. In 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–5. Cited by: §2.
  • L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, and M. Guizani (2018) Security in mobile edge caching with reinforcement learning. IEEE Wireless Communications 25 (3), pp. 116–122. Cited by: §2.
  • F. M. Zennaro and L. Erdodi (2021) Modeling penetration testing with reinforcement learning using capture-the-flag challenges: trade-offs between model-free learning and a priori knowledge. arXiv preprint arXiv:2005.12632. Cited by: §2.

Appendix A Supplementary material for RL

The hyperparameters used for training are listed below:
Parameter                              Value
Discount rate                          0.995
Learning rate for Adam                 0.001
First moment decay rate for Adam       0.9
Second moment decay rate for Adam      0.999
Maximum number of steps                1000
Number of services                     [1, 20]
Number of autoruns                     [1, 10]
Number of tasks                        [1, 10]
Number of DLLs loaded per service      [1, 4]
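For concreteness, the optimizer-related values above could be instantiated in PyTorch (Paszke et al., 2019) roughly as sketched below. The network policy_net, its layer sizes, and the 38-logit output are illustrative placeholders and not the exact architecture used in this work.

import torch
import torch.nn as nn

# Placeholder policy network; the actual architecture is described in the paper body.
policy_net = nn.Sequential(
    nn.Linear(64, 128),   # hypothetical observation size of 64
    nn.ReLU(),
    nn.Linear(128, 38),   # hypothetical output: one logit per action A1-A38
)

# Adam (Kingma and Ba, 2014) with the learning rate and moment decay rates listed above.
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3, betas=(0.9, 0.999))

gamma = 0.995             # discount rate
max_episode_steps = 1000  # maximum number of steps per episode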

Appendix B Actions Required to Exploit the Vulnerabilities

This section presents the sequences of actions that can be used to exploit the twelve vulnerabilities; actions that appear in multiple scenarios are marked in blue. An illustrative sketch of encoding one of these sequences in code is given after the list.
(1.1) Missing DLL:
A37. Get the current user
A31. Get a list of services
A29. Analyze service executables for DLLs
A30. Search for DLLs
A38. Get the Windows path
A28. Check directory permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A16. Move a malicious DLL to a folder on the Windows path to replace a missing DLL
A9. Start an exploited service

(1.2) Writable DLL:
A37. Get the current user
A31. Get a list of services
A29. Analyze service executables for DLLs
A30. Search for DLLs
A27. Check executable permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A15. Overwrite a DLL
A9. Start an exploited service

(2) Re-configurable Service:
A37. Get the current user
A31. Get a list of services
A25. Check service permissions with accesschk64
A18. Re-configure service to add the user to local administrators
A9. Start an exploited service

(3) Unquoted Service Path:
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A14. Move a malicious executable so that it is executed via an unquoted service path
A9. Start an exploited service

(4) Modifiable ImagePath:
A37. Get the current user
A31. Get a list of services
A26. Check the ACLs of the service registry with Get-ACL
A20. Change the service registry to add the user to local administrators
A9. Start an exploited service

(5) Writable Service Executable:
A37. Get the current user
A31. Get a list of services
A27. Check executable permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service

(6) Missing Service Executable:
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service

(7) Writable AutoRun Executable:
A37. Get the current user
A32. Get a list of AutoRuns
A27. Check executable permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A11. Overwrite the executable of an AutoRun

(8) AlwaysInstallElevated:
A37. Get the current user
A34. Check AlwaysInstallElevated bits
A4. Create a malicious MSI in Kali Linux
A8. Download a malicious MSI in Windows
A21. Install a malicious MSI file

(9) WinLogon Registry:
A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A24. Test credentials

(10) Unattend File:
A22. Search for unattend* sysprep* unattended* files
A23. Decode base64 credentials
A36. Get a list of local users and administrators
A24. Test credentials

(11) Writable Task Binary:
A37. Get the current user
A33. Get a list of scheduled tasks
A27. Check executable permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A12. Overwrite the executable of a scheduled task

(12) Writable Startup Folder:
A37. Get the current user
A32. Get a list of AutoRuns
A28. Check directory permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A11. Overwrite the executable of an AutoRun
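As a rough illustration of how these walkthroughs can be encoded for scripted replay or for labelling agent trajectories, the sketch below represents scenario (1.2) as an ordered list of action identifiers. The dictionary contents are copied from the list above; the variable and function names are hypothetical and not part of the paper's implementation.

# Hypothetical encoding of the action catalogue (only the actions used in scenario 1.2 are shown).
ACTIONS = {
    "A37": "Get the current user",
    "A31": "Get a list of services",
    "A29": "Analyze service executables for DLLs",
    "A30": "Search for DLLs",
    "A27": "Check executable permissions with icacls",
    "A3":  "Compile a custom malicious DLL in Kali Linux",
    "A7":  "Download a malicious DLL in Windows",
    "A15": "Overwrite a DLL",
    "A9":  "Start an exploited service",
}

# Scenario (1.2) Writable DLL expressed as a sequence of action identifiers.
WRITABLE_DLL_SEQUENCE = ["A37", "A31", "A29", "A30", "A27", "A3", "A7", "A15", "A9"]

def describe(sequence):
    """Print a human-readable walkthrough of an action sequence."""
    for step, action_id in enumerate(sequence, start=1):
        print(f"{step}. {action_id}: {ACTIONS[action_id]}")

describe(WRITABLE_DLL_SEQUENCE)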

Appendix C Command-line example

We exemplify our mapping from actions to commands by showing the commands the agent takes to escalate privileges by exploiting a service with weak folder permissions and a missing binary (see Table 6). Note that all the commands disclosed below have been derived from public sources (linked with each action) and can be recreated by security practitioners. Furthermore, none of the commands are proprietary to F-Secure.

A35. Check for passwords in Winlogon registry (source: https://github.com/sagishahar/lpeworkshop/blob/master/Lab%20Exercises%20Walkthrough%20-%20Windows.pdf):
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v DefaultUsername
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v DefaultPassword

A37. Get the current user (source: https://sushant747.gitbooks.io/total-oscp-guide/content/privilege_escalation_windows.html):
whoami

A31. Get a list of services (source: https://book.hacktricks.xyz/windows/windows-local-privilege-escalation):
wmic service get name,pathname,startname,startmode,started /format:csv
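The wmic query above returns CSV that an automated agent must parse into structured observations. A minimal, hypothetical parsing sketch follows; the example raw_output row and the surrounding code are illustrative and not taken from the paper.

import csv
import io

# Illustrative raw output in the shape produced by `wmic ... /format:csv`
# (wmic prepends a Node column and orders the queried fields alphabetically).
raw_output = """Node,Name,PathName,Started,StartMode,StartName
TARGET-PC,missingsvc,"c:\\program files\\missing file service\\missingservice.exe",FALSE,Auto,LocalSystem
"""

# Parse the CSV into one dictionary per service for use as a structured observation.
services = list(csv.DictReader(io.StringIO(raw_output.strip())))
for svc in services:
    print(svc["Name"], "->", svc["PathName"])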

A28. Check directory permissions with icacls:
icacls.exe "c:\windows\system32"
icacls.exe "c:\windows"
(15 rows skipped)
icacls.exe "c:\program files (x86)\microsoft\edge"
icacls.exe "c:\program files\missing file service"
icacls.exe "c:\program files"
(4 rows skipped)
icacls.exe "c:\windows\system32\wbem"
icacls.exe "c:\program files\windows media player"

A2. Create a malicious service executable in Kali Linux (source: https://infosecwriteups.com/privilege-escalation-in-windows-380bee3a2842):
sudo -S msfvenom -p windows/exec CMD='net localgroup administrators user /add' -f exe-service -o java_updater_svc

A6. Download a malicious service executable in Windows (source: https://adamtheautomator.com/powershell-download-file/):
powershell.exe -command "Invoke-WebRequest -Uri '82.130.20.144/java_updater_service' -OutFile 'C:\Users\user\Downloads\java_updater_svc'"
move /y "C:\Users\user\Downloads\java_updater_svc" "C:\Users\user\Downloads\java_updater_svc.exe"

A13. Overwrite a service binary:
copy /y "C:\Users\user\Downloads\java_updater_svc.exe" "c:\program files\missing file service\missingservice.exe"

A9. Start an exploited service:
sc start missingsvc
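For illustration only, a command such as A37 above could be executed on a remote Windows host over SSH using the Paramiko library. The host address and credentials below are placeholders for a lab machine, and this transport is a hedged sketch rather than necessarily the exact mechanism used in our implementation.

import paramiko

# Hypothetical target details; replace with the address and credentials of a lab machine you own.
HOST, USERNAME, PASSWORD = "192.0.2.10", "user", "password"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # acceptable in an isolated lab only
client.connect(HOST, username=USERNAME, password=PASSWORD)

# Run action A37 (get the current user) and capture its output as an observation.
_, stdout, _ = client.exec_command("whoami")
print(stdout.read().decode().strip())

client.close()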