Modeling Penetration Testing with Reinforcement Learning Using Capture-the-Flag Challenges and Tabular Q-Learning

by   Fabio Massimo Zennaro, et al.

Penetration testing is a security exercise aimed at assessing the security of a system by simulating attacks against it. So far, penetration testing has been carried out mainly by trained human attackers and its success critically depended on the available expertise. Automating this practice constitutes a non-trivial problem, as the range of actions that a human expert may attempt against a system and the range of knowledge she relies on to take her decisions are hard to capture. In this paper, we focus our attention on simplified penetration testing problems expressed in the form of capture the flag hacking challenges, and we apply reinforcement learning algorithms to try to solve them. In modelling these capture the flag competitions as reinforcement learning problems we highlight the specific challenges that characterize penetration testing. We observe these challenges experimentally across a set of varied simulations, and we study how different reinforcement learning techniques may help us address these challenges. In this way we show the feasibility of tackling penetration testing using reinforcement learning, and we highlight the challenges that must be taken into consideration, and possible directions to solve them.




1 Introduction

Securing modern systems and infrastructures is a central challenge in computer security. As an increasing amount of data and services are delivered through electronic platforms, guaranteeing their correct functioning is crucial for the working of modern society. Given the complexity of current systems, assessing their security constitutes a hard problem, both from a theoretical and a practical point of view.

A traditional approach to evaluating security adopts a defensive stance, in which systems are analyzed and hardened from the point of view of a defender. An alternative pro-active perspective is offered by an offensive stance. Penetration testing (PT), or ethical hacking, consists in performing authorized simulated cyber-attacks against a computer system, with the aim of identifying weaknesses and assessing the overall security. The usefulness of offensive security as a tool to discover vulnerabilities is undisputed [1]. PT, though, is a complex and costly activity, requiring relevant knowledge of the target system and of the potential attacks that may be carried out against it. Thus, in order to produce relevant insights, PT needs experts able to carefully probe a system and uncover known and, ideally, still unknown vulnerabilities.

A way to train human experts and allow them to acquire ethical hacking knowledge is offered by capture the flag competitions (CTF). In a CTF, participants are given the opportunity to conduct different types of real-world attacks against dedicated systems, with the aim of exploiting vulnerabilities behind which they can collect a flag. A CTF is a simplified and well-defined model of PT, usually designed as an educational exercise.

Software applications have been developed to automate some aspects of PT, but they mostly reduce to tools that carry out specific tasks under the direction of a human user. Traditional approaches from artificial intelligence, such as planning, were also deployed in the hope of further automating PT through the generation of attack plans [2]; however, human input is still critical to model the context and the target system, and to finally derive conclusions about the actual vulnerabilities.

Recent advances in artificial intelligence and machine learning may offer a way to overcome some of the current limitations in automating PT. In particular, the paradigm of reinforcement learning (RL) [3] was proven to be a versatile and effective method for solving complex problems involving agents trying to behave optimally in a given environment. RL applications embrace a large number of algorithms and methods with varying degrees of computational and sample complexity, including methods that require minimal expert modeling. Games provide an excellent benchmark for RL, and state-of-the-art methods have achieved remarkable or super-human performances in solving many complex games, ranging from the traditional Go [4] to modern Atari games [5, 6].

These developments suggest the possibility of adopting RL for tackling the PT problem. As a form of gamification of PT, CTFs provide an ideal setting for deploying RL algorithms and training agents that, in the long run, may learn to carry out complete PT independently of human supervision. This idea is not new, and it was in fact spearheaded some years ago by DARPA, which hosted in 2016 the Cyber Grand Challenge Event, a cyber-hacking tournament open to artificial agents trained using machine learning [7].

In this paper, we study the problem of modeling PT as a set of CTF challenges that can be solved using RL. While adopting the game paradigm offered by CTF may seem a perfect fit for RL, we critically analyze the specific challenges that arise from applying RL to PT. A first challenge is given by obscurity, or the problem of discovering the structure underlying a CTF problem. A second challenge is given by unimodality, or the fact that RL agents rely on a single mode of reasoning (inference) for learning. We analyze these problems experimentally, and we evaluate how different RL techniques (lazy loading, state aggregation, and imitation learning) may help in tackling these challenges. At the end we show that, while RL may in principle allow for model-free learning, reliance on some form of prior knowledge may in practice be required to make the problem solvable. We argue that RL provides an interesting avenue of research for PT not because it allows for pure model-free learning (in contrast with more traditional model-based artificial intelligence algorithms), but because it may offer a more flexible way to trade off the amount of prior knowledge an agent is provided and the amount of structure an agent is expected to discover.

The rest of the paper is organized as follows. Section 2 offers a review of the main ideas in PT and RL relevant to this work, as well as a review of previous related work. Section 3 discusses the problem of modeling PT and CTF as a learning problem, and highlights the specific challenges connected to security. Section 4 gives specific details of our experimental modeling, Section 5 presents the results of simulations, and Section 6 discusses the results in light of the challenges we uncovered. Section 7 suggests future avenues of research. Finally, Section 8 summarizes our conclusions.

2 Background

In this section we provide the basic concepts and ideas in the fields of PT and RL, and we review previous applications of artificial intelligence and machine learning to the PT problem.

2.1 Penetration Testing

Modern computer systems, digital devices and networks may present several types of vulnerabilities, ranging from low-level software binary exploitation to exploitation of network services. These vulnerabilities can be the target of hackers having multiple types of motivations and ways of attacking. PT aims at assuming the perspective of such hackers and at performing attacks in order to unveil potential vulnerabilities.

2.1.1 Hacking attacks

Although there is no strict rule on how to carry out a hacking attack, it is still possible to identify the steps that are common in many scenarios. From the perspective of the attacker, hacking basically consists of steps of information gathering and steps of exploitation (exploiting the vulnerabilities to perform the attack).

In the first stage of information gathering an attacker usually collects technical information on the target by probing the system (e.g., mapping the website content to find useful information, or identifying the input parameters of server-side scripts). The bottleneck of the attack process is determining a specific vulnerability. Vulnerabilities may be very different, and the second step of exploitation requires understanding the dynamics of the target system and tailoring the actions to the identified weakness. An attacker generally has to rely on a wide spectrum of competences, from human logic to intuition, from technical expertise to previous experiences. After successful exploitation the attacker usually has multiple ways to proceed depending on the aim of the attack. She may simply keep an open unauthorized channel to her target, she may extract private or protected information, or she may use the target system to carry out further attacks.

A relevant example of hacking that may be the concern of PT is web hacking. The process of web hacking can be decomposed into several successive and alternative steps. Typically, the attacker starts by identifying a target web service through port scanning and by establishing a communication with the service over the HTTP protocol. She can then access the website files inside the webroot folder, download them, process them within a web browser, and execute client-side scripts locally. Data can also be sent to the remote files in order to be processed on the server side and obtain customized web responses. Server-side scripts can perform many complex actions, such as querying a database, or reading and writing local files; the attacker may send well-crafted inputs in order to compromise these operations. Although web pages may present different and sometimes unique vulnerabilities, typical vulnerabilities can be identified and classified.


2.1.2 Capture the flag hacking competitions

CTFs are a practical learning platform for ethical hackers [9]. CTF events are normally organized as 48-hour competitions during which different hacking challenges are provided to the participants.

CTF competitions usually present a set of well-formalized challenges. Each challenge is defined by one vulnerability (or a chain of vulnerabilities) associated with a flag. The aim of a participant is to exploit the vulnerability in each challenge, and thus capture the associated flag. No further steps are required from a player (such as sending data to a command and control server or maintaining the access); the capture of a flag provides an unambiguous criterion to decide whether a challenge was solved or not. Challenges may be classified according to the type of problem they present (e.g., web hacking challenge or binary exploitation). Normally, human factors are excluded from the solution, so that an attacker has to rely on her knowledge and reasoning, but not on social engineering. In some instances, information about the target system and the vulnerability may be provided to the participants.

Standard CTFs run in Jeopardy mode, meaning that all the participants are attackers, and they are presented with a range of different static challenges. In other variants, participants may be subdivided into a red team, that is a team focused on attacking a target system, and a blue team, that is a team tasked with defending the target system. Alternatively, each team may be provided with an infrastructure they have to protect while, at the same time, attacking the infrastructure of other teams. These last two variants of CTF define non-static, evolving vulnerabilities, as the defenders in the blue team can change the services at run time by observing the red team actions and patching their own vulnerabilities.

In sum, CTFs, especially in the Jeopardy mode, define a set of well-defined problems that can capture the essence of PT and that can be easily cast in the formalism of games.

2.2 Reinforcement Learning

The reinforcement learning (RL) paradigm offers a flexible framework to model complex control problems and solve them using general-purpose learning algorithms [3]. A RL problem represents the problem of an agent trying to learn an optimal behavior or policy within a given environment. The agent is given minimal information about the environment, its dynamics, and the nature or the effects of the actions it can perform; instead, the agent is expected to learn a sensible behavior by interacting with the environment, thus discovering which actions in which states are more rewarding, and finally defining a policy that allows it to achieve its objectives in the best possible way.

2.2.1 Definition of a RL problem

Formally, a RL problem [3] is defined by a tuple or a signature:

$$\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$$

where:

  • $\mathcal{S}$ is the state set, that is the collection of all the states of the given environment;

  • $\mathcal{A}$ is the action set, that is the collection of all the actions available to the agent;

  • $\mathcal{T}: P(s_{t+1} \mid s_t, a_t)$ is the transition function of the environment, that is the probability for the environment of transitioning from state $s_t$ to state $s_{t+1}$ were the agent to take action $a_t$;

  • $\mathcal{R}: P(r_t \mid s_t, a_t)$ is the reward function, that is the probability for the agent of receiving reward $r_t$ were the agent to take action $a_t$ in state $s_t$.

In this setup, it is assumed that the state of the environment is perfectly known to the agent. This setup constitutes a fully observable Markov decision process (MDP) [3].
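The signature above can be written down directly as a data structure. The following is only a minimal sketch: the class name `MDP`, the two-state example, and its reward values are hypothetical illustrations and are not taken from the paper's simulations.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    """Finite MDP given by states, actions, transition and reward distributions."""
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps each next state s' to P(s' | s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a)] maps each reward value r to P(r | s, a)
    reward: Dict[Tuple[State, Action], Dict[float, float]]

    def is_well_formed(self, tol: float = 1e-9) -> bool:
        """Check that every (state, action) pair has distributions summing to 1."""
        for s in self.states:
            for a in self.actions:
                for dist in (self.transition[(s, a)], self.reward[(s, a)]):
                    if abs(sum(dist.values()) - 1.0) > tol:
                        return False
        return True


# Hypothetical toy problem: the attacker must scan before the exploit pays off.
toy = MDP(
    states=["unknown", "found"],
    actions=["scan", "exploit"],
    transition={
        ("unknown", "scan"): {"found": 1.0},
        ("unknown", "exploit"): {"unknown": 1.0},
        ("found", "scan"): {"found": 1.0},
        ("found", "exploit"): {"found": 1.0},
    },
    reward={
        ("unknown", "scan"): {-1.0: 1.0},
        ("unknown", "exploit"): {-1.0: 1.0},
        ("found", "scan"): {-1.0: 1.0},
        ("found", "exploit"): {10.0: 1.0},
    },
)
print(toy.is_well_formed())
```

In this toy case every distribution puts all its mass on one outcome, so the problem is deterministic; a stochastic environment would simply spread the probabilities over several next states or rewards.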

The behavior of the agent is encoded in a behavior policy:

$$\pi(a_t \mid s_t) = P(a_t \mid s_t),$$

that is a probability distribution over the available actions given the state of the environment. We measure the quality of a policy as its return, that is the sum of its expected rewards over a time horizon $T$:

$$G_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k},$$

where $\gamma \in [0, 1]$ is a discount factor that down-weights rewards in the far future with respect to rewards in the near future. The discount factor provides a formal solution to the problem of a potentially infinite sum, and an intuitive weighting that makes our agent favor close-in-time rewards instead of postponement. Given this notion of return, the aim of the agent is to learn the optimal policy $\pi^*$ that maximizes the return $G_t$, that is the policy $\pi^*$, not necessarily unique, such that no other policy produces a higher return. Learning an optimal policy requires the agent to balance between the drive for exploration (finding previously unseen states and actions that provide high reward) and for exploitation (greedily choosing the states and actions that currently are deemed to return the best rewards).
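As a small worked illustration of the return, assuming it is the discounted sum of the rewards collected along one episode (each reward weighted by the discount factor raised to the number of steps elapsed), the sketch below computes it for a hypothetical reward sequence. The function name and the reward values are illustrative assumptions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: two failed attempts, then a successful one.
rewards = [-1.0, -1.0, 10.0]
print(discounted_return(rewards, gamma=1.0))  # 8.0: plain sum, no discounting
print(discounted_return(rewards, gamma=0.5))  # 1.0: -1 + 0.5*(-1) + 0.25*10
```

With a smaller discount factor the late positive reward counts for less, which is what pushes an agent towards strategies that reach the objective in fewer steps.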

Interaction with the environment (and, therefore, learning) happens over steps and episodes. A step is an atomic interaction of the agent with the environment: taking a single action $a_t$ according to the policy $\pi$, collecting the reward $r_t$, and observing the environment evolving from state $s_t$ to state $s_{t+1}$. An episode is a collection of steps from an initial state to an ending state.

Notice that during different episodes, even if the signature of the RL problem is unchanged, the setup may be different. An RL agent is trained not to solve just one specific instance of a problem, but an entire set of problems with a similar structure described by the formalism $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$. This variability among episodes is important, and it allows a RL agent to generalize.

2.2.2 Algorithms for RL

Several algorithms have been proposed to solve the RL problem. One of the simplest, yet well-performing, families of RL algorithms is the family of action-value methods. These algorithms tackle the problem of learning an optimal policy $\pi^*$ through a proxy function $q(s, a)$ meant to estimate the value of each pair of (state, action).

Formally, these methods define an action-value function for a policy $\pi$ as:

$$q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ G_t \mid s_t, a_t \right],$$

that is, the action-value function for a pair (state, action) is the expected return from state $s_t$ after taking action $a_t$ according to policy $\pi$. Generally, action-value methods follow an approach to learning an optimal policy called generalized policy iteration based on two steps:

  1. given a starting policy $\pi_0$, interact with the environment to learn an approximation of the function $q_{\pi_0}$;

  2. improve the policy $\pi_0$ by defining a new policy $\pi_1$ where, in each state $s_t$, the agent takes the action $a_t$ that maximizes the action-value function $q_{\pi_0}$, that is $\pi_1(s_t) = \arg\max_{a_t} q_{\pi_0}(s_t, a_t)$.
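The improvement step (step 2) amounts to an arg-max over the actions in each state. A minimal sketch, assuming the action-value estimates are stored in a dictionary keyed by (state, action); the state and action names and the values are hypothetical:

```python
from typing import Dict, Hashable, List, Tuple


def greedy_policy(
    q: Dict[Tuple[Hashable, Hashable], float],
    states: List[Hashable],
    actions: List[Hashable],
) -> Dict[Hashable, Hashable]:
    """For each state, pick the action with the highest estimated value."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}


# Hypothetical action-value estimates for two states and two actions.
q_estimate = {
    ("s0", "scan"): 1.0,
    ("s0", "exploit"): -2.0,
    ("s1", "scan"): 0.0,
    ("s1", "exploit"): 5.0,
}
improved = greedy_policy(q_estimate, ["s0", "s1"], ["scan", "exploit"])
print(improved)  # {'s0': 'scan', 's1': 'exploit'}
```

Ties are broken in favor of the action listed first, which is one arbitrary but deterministic choice among several reasonable ones.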

Iteratively repeating this process (and allowing space for exploration), the agent will finally converge to the optimal policy $\pi^*$. While the second step is quite trivial, consisting just of a maximizing operation, the first step requires fitting the action-value function, and it may be more challenging and time-consuming. There are two main ways of representing the action-value function:

  • Tabular representation: this representation relies on a matrix or a tensor $Q$ to exactly encode each pair of (state, action) and estimate its value; tabular representations are simple, easy to examine, and statistically sound; however they have limited generalization ability and they do not scale well with the dimension of the state space $\mathcal{S}$ and the action space $\mathcal{A}$.

  • Approximate representation: this representation relies on fitting an approximate function $\hat{q}(s, a)$; usual choices for $\hat{q}$ are parametric functions ranging from simple linear regression to complex deep neural networks; approximate functions solve the problem of dealing with a large state space $\mathcal{S}$ and action space $\mathcal{A}$, and provide generalization capabilities; however they are harder to interpret and they often lack statistical guarantees of convergence.

2.2.3 Q-Learning

A standard action-value algorithm for solving the RL problem is Q-learning. Q-learning is a temporal-difference off-policy RL algorithm; temporal-difference means that the algorithm estimates the action-value function starting from an initial guess (bootstrap), and updates its estimation step by step with reference to the value of future states and actions; off-policy means that Q-learning is able to learn an optimal policy $\pi^*$ while exploring the environment according to another policy $\pi'$. Q-learning constitutes a versatile algorithm that makes it possible to tackle many RL problems; it can be implemented either with a tabular representation $Q$ of the action-value function or with an approximate representation $\hat{q}$.

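Formally, given a RL problem $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$ with a discount $\gamma$, an agent interacting with the environment in real-time can gradually construct an approximation of the true action-value function $q$ via a tabular representation $Q$ by gradually updating its estimation according to the formula:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha$ is a scalar defining a step-size [3]. Intuitively, at every step the estimation of $Q$ moves towards the true action-value function $q$ by a step $\alpha$ in a gradient ascent-like way.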
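A single step of the standard tabular Q-learning rule [3] can be sketched as follows: the current estimate is moved by a step-size `alpha` towards the target "reward plus discounted best value of the next state". The dictionary-based table, the `terminal` flag, and the state and action names are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_learning_update(
    q: QTable,
    s: Hashable,
    a: Hashable,
    r: float,
    s_next: Hashable,
    actions: Sequence[Hashable],
    alpha: float,
    gamma: float,
    terminal: bool = False,
) -> float:
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # A terminal next state has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(q[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


q_table: QTable = defaultdict(float)  # unseen pairs start at 0.0
acts = ["scan", "exploit"]

# Successful exploit that ends the episode: 0 + 0.5 * (10 - 0) = 5.0
first = q_learning_update(q_table, "found", "exploit", 10.0, "done", acts,
                          alpha=0.5, gamma=0.9, terminal=True)
# Scan leading to "found": target = -1 + 0.9 * 5.0 = 3.5, so 0 + 0.5 * 3.5 = 1.75
second = q_learning_update(q_table, "unknown", "scan", -1.0, "found", acts,
                           alpha=0.5, gamma=0.9)
print(first, second)
```

The second call shows the bootstrapping at work: the value learned for the exploit propagates backwards to the scan that made it possible.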

2.3 Related Work

Automated tools for PT consist mainly of security scanners that can send predefined requests and analyze the answers in order to detect specific vulnerabilities (e.g., Nessus [10]). These tools heavily rely on human intervention in defining scripts, analyzing information and carrying out actual exploitation. Some applications, such as sqlmap [11], may perform exploitation too, although always with some degree of user interaction.

Automating the whole process of developing PT strategies has been the object of study for some time, and different models have been proposed to tackle the problem, such as attack graphs, Markov decision processes, partially observable Markov decision processes [12], Stackelberg games [13], or Petri nets [14]. Many of the existing solutions follow a model-based approach: a PT scenario is first encoded in one of these well-defined models relying on domain expertise, and then processed using model checking or artificial intelligence algorithms to produce optimal plans [15, 2, 16]. Recently, the use of RL algorithms has been proposed to analyze these models [14, 17], too. However, one of the main strengths of RL is its ability to tackle the learning problem using a model-free approach: instead of relying on a model carefully designed by an expert, a RL agent can interact with an environment by itself and infer an optimal strategy. This line of research has been studied in [18], with the implementation of tabular and approximate Q-learning algorithms to tackle a paradigmatic PT problem; our work follows the same approach, although our study focuses on a critical assessment of the use of RL across a wider set of CTF problems, and on the evaluation of different RL techniques aimed at addressing the specific problems we have encountered.

It is also worth mentioning that the encounter between PT and RL has been promoted by DARPA through the Cyber Grand Challenge Event hosted in Las Vegas in 2016 [7]. This challenge was a CTF-like competition open to automated agents. The organizers developed a special environment called DECREE (DARPA Experimental Cyber Research Evaluation Environment) where the operating system executed binary files in a modified format and supported only 7 system calls, in order to limit the number of possible actions. Our work takes inspiration from this challenge, and it aims at studying RL agents that may be deployed to solve similar simplified CTF problems.

3 Modeling PT as a RL Problem

In this section we discuss how we can model PT as a RL problem by examining the challenges and the opportunities in this task. We start by arguing that PT can be naively seen as another game that can be solved by RL. We then move to discuss specific issues in dealing with PT as a game: we first focus on the issue of assessing the structure of a problem in CTF challenges; then we review the inferential limitations of RL that may be particularly relevant when dealing with PT. Bringing together these considerations, we state the specific challenges in applying RL to PT that we will experimentally examine in this paper.

3.1 PT as a Game

PT, especially when distilled as a CTF, may be easily expressed in terms of a game. It is immediate to identify the players of the game (a red team and a blue team), the rules of the game (the logic of the target system), and the victory condition (capture of the flag). Given the success of RL in tackling and solving games, it seems natural to try to express PT as a game.

Furthermore, at first sight, the distinction between the types of actions performed by an attacker (information gathering and exploitation) seems to reflect the same division between exploration actions and exploitation actions in RL. Since RL is assumed to learn to balance exploration and exploitation, it may seem that the PT problem would perfectly fit the RL paradigm.

However, casting the PT problem as a simple game solvable by RL risks missing some challenges peculiar to PT. The difficulty for an artificial agent to solve a CTF problem is due not only to the sheer size of the problem domain, but also to the limited structure of the PT problem and the limited number of channels on which an agent can rely to learn.

3.2 Structure of a Learning Problem

A RL agent is able to solve a problem by exploiting some structure underlying the problem itself. This structure is captured by an agent in the probability distribution of its policy; as it interacts with the environment, the agent updates its policy and reconstructs the structure of the problem.

We then need to consider PT problems having some form of structure. From the perspective of a red team agent, a target system may present different levels of vulnerability and structure. At one extreme, we have perfect systems, that is systems whose defense has no vulnerabilities; these systems are of no interest here, since nothing but failure could be learned either by a human or an artificial attacker. Similarly challenging are max-entropy systems, that is systems that have a vulnerability but have no structure allowing an attacker to find this vulnerability. A max-entropy system is a system where each action or set of actions of the attacker returns as feedback only whether that action or set of actions was successful or not; no further inference about other actions may be drawn from the feedback. In this setup, if we represent the starting knowledge of the agent as a policy with a uniform distribution over all actions or over all sets of actions, then every interaction will provide only a single bit of information, that is the binary outcome of the chosen action or set of actions; no information is provided for the agent to learn about other possible courses of action and thus decrease the entropy (uncertainty) of its policy. A max-entropy system is not absolutely secure, but, provided that there must be a vulnerability, it is the safest possible static configuration for a defender. Indeed, the only possible strategy of an attacker against such a system is just to try out all the possible actions. As such, we do not take into consideration this type of problem, as the policy or strategy to be learned is structureless and trivial (this setup may be better suited to be formalized as a multi-armed bandit problem [19]). We will instead focus on CTF systems, that is systems that have a vulnerability and have enough structure to allow an attacker to find such a weakness. This setup is consistent with an actual CTF game, where the red team players, by reasoning and following their intuitions, can discover and exploit the vulnerability. By analogy, an artificial agent is expected to exploit the structure of a system to learn an optimal strategy.
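To make the max-entropy case concrete, the sketch below assumes a hypothetical system with `n` candidate actions, exactly one of which succeeds, chosen uniformly at random. A uniform policy over the candidates then has entropy log2(n) bits, and exhaustive search without repetition needs (n + 1) / 2 attempts on average. The function names and the value n = 64 are illustrative assumptions, not figures from the paper.

```python
import math
import random


def uniform_entropy_bits(n: int) -> float:
    """Entropy, in bits, of a uniform distribution over n actions."""
    return math.log2(n)


def expected_attempts(n: int) -> float:
    """Mean number of tries to hit one uniformly placed target without repeats."""
    return (n + 1) / 2


def simulate_brute_force(n: int, trials: int, seed: int = 0) -> float:
    """Empirical mean number of attempts over randomly ordered exhaustive searches."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        target = rng.randrange(n)
        order = list(range(n))
        rng.shuffle(order)
        total += order.index(target) + 1  # attempts are counted from 1
    return total / trials


print(uniform_entropy_bits(64))   # 6.0
print(expected_attempts(64))      # 32.5
print(simulate_brute_force(64, trials=20000))
```

Since a failed attempt only rules out the action just tried, no ordering of the attempts does better than any other on average, which is why there is nothing for a learning agent to discover in this setting.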

3.3 Single Channel and Mode of Reasoning

A RL agent has some inherent limitations due to the nature of the learning algorithms; the most concerning one with respect to the problem of PT is the restricted number of channels and modes of reasoning that can be used by an RL agent.

A RL agent learns through a single channel, that is, by sending requests to the target system and processing the responses. It starts with no knowledge of the actions it can take, nor with any internal model of the environment. Learning happens only by inference: from the rewards it achieves by interacting with the environment, the agent will infer the value of its actions in various states.

Standard RL makes minimal assumptions on prior knowledge, and thus it defines a much more demanding challenge than the learning challenge faced by a human attacker in the real world; indeed, a human player may rely on several additional channels, modes of reasoning and heuristics to restrict her space of options and direct her action. For instance, she can collect information on the target system from alternative sources on the Internet, rely on social engineering, make deductions about the target systems, or produce and test hypotheses. All these alternatives are normally precluded to any standard RL agent, which has to rely only on its ability to collect experience in a fast and efficient way and perform induction over it.

3.4 Specific Challenges in Applying RL to PT

While RL has proved successful in solving many control problems and in achieving super-human performances in many games, the combination of the peculiar structure of many PT problems and the limitations of RL give rise to unique challenges when we try to apply RL to hacking.

The first big specific challenge follows from the limited amount of structure of many hacking problems, up to the limit of max-entropy systems described above. Ideally, a system to be defended has little structure and it exchanges with the potential attacker messages carrying as little information as possible. In such a setting, the biggest challenge for a red team RL agent is not to learn an optimal strategy over a known structure (as in the case of a game), but to discover the structure of the system itself. CTFs stress the need for exploration: for an RL agent, managing an efficient exploration is as important as developing a complex exploitation strategy. Good RL algorithms should take this aspect into particular consideration.

The second hard challenge derives from the limited amount of information and types of reasoning available to the RL agent. Human red-team players rely on a wider set of knowledge to solve a CTF challenge. This inevitably raises the question of how efficient trial-and-error alone can be in addressing a CTF challenge. Conversely, this problem invites us to consider how RL algorithms can be enriched with side information which would provide useful background knowledge, thus directing and speeding up the learning process.

These two challenges will provide a criterion to study the application of RL to PT and CTF, evaluate our simulations, and suggest future developments.

4 RL Model of a CTF Problem

In this section we move on to propose a formalization of CTF challenges using the formalism of RL. We first identify the classes of CTF challenges that we will study experimentally, and then present a precise formalization of these CTF problems using the formalism of RL.

4.1 Types of CTF Problems

CTFs may be categorized in groups according to the type of vulnerability they instantiate and the type of exploitation that a player is expected to perform. Each class of CTF problems may exhibit peculiar forms of structure and may be modeled independently. In this paper, we will consider the following prototypical classes of CTF problems:

  • Port scanning and intrusion: in this CTF problem, a target server system exposes on the network a set of ports, and an attacker is required to check them, determine a vulnerable one, and obtain the flag behind the vulnerable port using a known exploit;

  • Server hacking: in this CTF problem, a target server system exposes on the network a set of services, and an attacker is required to interact with them, discover a vulnerability, either in the form of a simple unparameterized vulnerability or as a parametrized vulnerability, and obtain the flag by exploiting the discovered vulnerability;

  • Website hacking: a sub-type of a server hacking, in this CTF problem, a target server system exposes on the network a web site, and an attacker is required to check the available pages, evaluate whether any contains a vulnerability, and obtain the flag behind one of the pages by exploiting the discovered vulnerability.

These three classes provide well-known tasks that can be modeled as RL problems at various levels of simplification and abstraction.

4.2 RL Formalism

We consider as a RL agent an artificial red team hacking player interacting with a vulnerable target system. The target system constitutes the environment with which the agent interacts. The goal of the agent is to capture the flag in the target environment in the fastest possible way. Given the RL problem $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$ we set the following requirements and conditions:

  • The state space $\mathcal{S}$ is assumed to be an unstructured finite set of states that encode the state of the environment and, implicitly, the state of knowledge of the agent.

  • The action space $\mathcal{A}$ is assumed to be an unstructured finite set containing all the possible actions that may be performed by the agent. Notice that the set $\mathcal{A}$ is the same in any state; even if some actions may not be available to the agent in some states, this information is not provided to the agent; an agent is expected to learn by experience which actions are possible in any state.

  • The transition function $\mathcal{T}$ is assumed to be a deterministic function that encodes the logic of the specific CTF scenario that will be considered.

  • The reward function $\mathcal{R}$ is assumed to be a deterministic function defining how well the agent is performing. Rewards will normally be dense but not highly informative: the agent receives a small negative reward for each attempt performed (normally ), and a large positive reward for achieving its objective (normally ); this setup will push the agent to learn the most efficient strategy (in terms of attempts) to capture a flag.

In the following experimental analysis we will focus on one particular algorithm, that is, tabular Q-learning. Our choice is motivated by several factors: (i) in general, Q-learning is a classical and well-performing algorithm, allowing us to relate our results with the literature; (ii) under standard conditions, it is guaranteed to converge to an optimal policy; (iii) the use of a tabular representation allows for a simpler interpretation of the results; (iv) Q-learning is step-wise fast and efficient, thus allowing us to easily repeat experiments and guaranteeing reproducibility; (v) Q-learning has few hyper-parameters, allowing for a more effective tuning. The main drawback of adopting tabular Q-learning is scalability, which, implicitly, reduces the complexity of the problems that we will be able to consider. Despite this limitation, though, our results will probe and validate the possibility of solving CTF problems using RL, and they will allow us to assess the relevance of the challenges we identified.
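For reference, the tabular Q-learning update can be written as follows. The symbols are standard notation rather than notation taken from this paper: $\alpha$ is a learning rate, $\gamma$ a discount factor, $r_t$ the reward received after taking action $a_t$ in state $s_t$, and $s_{t+1}$ the resulting state.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

With a finite state set and a finite action set, $Q$ can be stored as a table with one entry per (state, action) pair, which is what makes the learned values directly inspectable.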

5 Experimental Analysis

In this section we provide concrete instances of simple CTF challenges, we model them in the form of RL problems using the formalism discussed in Section 4, and we solve them using Q-learning. We consider CTF challenges with increasing complexity, and as we face the challenges we identified in Section 3, we instantiate different methods from the RL literature to tackle them. All the simulations are implemented following the standard RL interface defined in the OpenAI gym library [20], and they are made available online to guarantee reproducibility and to allow further experiments and extensions. Detailed explanations about the action set and hyperparameter configuration of each simulation are provided in the Supplemental Material.

5.1 Simulation 1: Port Scanning CTF Problem

In this simulation we consider a very simple port scanning problem. We use the basic tabular Q-learning algorithm to solve it, and we analyze our results in terms of structure of the solution and inference steps to convergence.

CTF scenario. The target system is a server which runs only one service affected by a known vulnerability. The port number on which the service runs is unknown; however, once the service port is discovered, the agent knows for certain where the vulnerable service is and how to exploit it. The red team agent can interact with the server by running a port scan or by sending the known exploit to a specific port. In this simplified scenario the vulnerability can be targeted with a ready exploit with no parameters; also it is assumed that no actions are performed by the blue team on the target system.

RL Setup. We define a target server exposing ports, each one providing a different service; one of the services is affected by a vulnerability, and behind it lies the objective flag.

We model the action set as a collection of actions: one port scan action, and one exploitation action for each of the existing ports. We also model the state set as a collection of binary states: one initial state representing the state of complete ignorance of the agent, and one state for each port taking value of one when we discover it is the vulnerable port. The dimensionality of a tabular action-value matrix scales as .

This simple exercise allows us to have a basic assessment of the learning ability of the agent. Notice that the agent is not meant to learn simply the solution to a single instance of this CTF game; in other words, it is not learning that the flag will always be behind port . In every instance of the CTF game the flag is placed behind a different port; thus, the agent has to learn a generic strategy that allows it to solve the problem independently from the initial setup.

In general terms, this problem constitutes a very simple challenge, in which the optimal strategy is easily recognized to be a two-step policy of scanning and then targeting the vulnerable port with an exploit. However, the RL agent is not aware of the semantics of the available actions and it cannot reason out an optimal strategy; it can only learn by trial and error.
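The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation released with the paper: the class name `PortScanCTF`, the function `train`, the reward values (-1 per attempt, +100 for the flag), and the hyperparameters `alpha`, `gamma`, `epsilon` and `max_steps` are all hypothetical choices made for the example. State 0 is the initial state of complete ignorance and state i means that port i is known to be vulnerable; action 0 is the port scan and action i is the exploit on port i.

```python
import random


class PortScanCTF:
    """Toy port-scanning CTF: action 0 scans, action i (1..n) exploits port i."""

    def __init__(self, n_ports, step_reward=-1.0, flag_reward=100.0, rng=None):
        self.n_ports = n_ports
        self.step_reward = step_reward    # assumed value, not taken from the paper
        self.flag_reward = flag_reward    # assumed value, not taken from the paper
        self.rng = rng if rng is not None else random.Random()
        self.reset()

    def reset(self):
        # The flag is placed behind a random port at the start of every episode.
        self.flag_port = self.rng.randint(1, self.n_ports)
        self.state = 0                    # state 0: complete ignorance
        return self.state

    def step(self, action):
        if action == 0:                   # the port scan reveals the vulnerable port
            self.state = self.flag_port
            return self.state, self.step_reward, False
        if action == self.flag_port:      # exploit sent to the vulnerable port
            return self.state, self.flag_reward, True
        return self.state, self.step_reward, False   # exploit sent to a wrong port


def train(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, rng=None):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = rng if rng is not None else random.Random()
    n = env.n_ports + 1
    q = [[0.0] * n for _ in range(n)]     # (n_ports + 1) x (n_ports + 1) table
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n)                        # explore
            else:
                a = max(range(n), key=lambda x: q[s][x])    # exploit current estimate
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q
```

Reading the greedy action off each row of the returned table gives the learned policy. Under these assumptions one would expect it to select the scan in state 0 and the matching exploit in every other state, which corresponds to a diagonal pattern in the table.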

Results. We run our simulation setting ports. We randomly initialized the policy of the agent and we run episodes. We repeat each simulation times in order to collect reliable statistics.

As discussed, in this simple scenario we know what would be the optimal policy and, therefore, what we expect the agent to learn. Figure 1(a) shows a plot of the action-value matrix at the end of the episodes. The matrix shows a clear diagonal pattern, meaning that in state , for , the agent has learned to favor action . This makes sense: in the initial complete-ignorance state the agent selects action corresponding to the port scan action; in state , for , corresponding to the knowledge that port is vulnerable, the agent selects action , corresponding to an exploit on the relative port. We can thus conclude that the agent has successfully learned the desired optimal strategy. The blue plot in Figure 1(b) shows the convergence towards the optimal strategy as a function of the number of episodes. The y-axis reports the ratio between the sum of the diagonal of , and the sum of all the entries of , that is . Since we know that the optimal strategy is encoded along the main diagonal of , this statistic tells us how much of the mass of is distributed along the diagonal. After around episodes the learning of the agent enters a phase of saturation. Notice that this ratio would converge to only in an infinite horizon. The purple plot in Figure 1(b) illustrates the number of steps per episode. After around episodes the agent has learned the optimal strategy and completes the challenges in the minimum number of actions.
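The convergence statistic described above is simple to compute from a learned table. The sketch below assumes that the action-value matrix is stored as a square list of lists and that the sum of its entries is non-zero; the function name `diagonal_mass_ratio` is a hypothetical name chosen for illustration.

```python
def diagonal_mass_ratio(q):
    """Ratio between the sum of the diagonal of q and the sum of all its entries."""
    diagonal = sum(q[i][i] for i in range(len(q)))
    total = sum(sum(row) for row in q)
    return diagonal / total
```

Tracking this value after every episode yields a curve of the kind plotted in Figure 1(b); the ratio is easiest to interpret when the entries of the table share the same sign.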

Fig. 1: Results of Simulation 1. (a) Learned action-value matrix . (b) Plot of ratio as a function of the number of episodes (in blue), and number of steps as a function of the number of episodes (in purple).

Discussion. The success of RL in this proof-of-concept simulation is not surprising; yet, it highlights the specific challenges of addressing hacking using RL: solving the CTF challenge requires learning the structure of the problem; this is feasible, but using only experiential data and inference means that the RL agent has to rely strongly on exploration. Almost two hundred episodes were necessary to converge to a solution, a number of attempts far greater than what is necessary for a human red team to find an optimal strategy.

5.2 Simulation 2: Non-stationary Port-scanning CTF Problem

In this simulation we extend the previous problem by considering a more challenging scenario in which the target system is not stationary, but it may randomly change in response to the actions of the agent.

CTF scenario. In this scenario the blue team is not passive anymore, but it can act in response to actions perpetrated by the red team. We set up the same target system as before: the server has a single exploitable service running on a port whose number is unknown to the attacker. To model an attack-defense scenario, we suppose that the blue team is aware of the exploitable service but that they cannot stop it because this would affect their continuous business operation. The blue team cannot filter out traffic, and the only option they have is to move the service to another port if they observe actions that may be a prelude to an attack. This case is rather unrealistic, but we use it as a simplified attack-defense contest with limited actions.

RL Setup. We consider the same port scanning scenario defined in the previous simulation. However, we add a non-stationary dynamic: whenever the attacker uses a port scan action, the target server detects it with probability ; if the detection is successful the flag is randomly re-positioned behind a new port. Given the non-stationarity, this problem constitutes a more challenging learning problem than the previous one. In particular, knowledge of the structure gained by the agent via port scanning may not be reliable. In this stochastic setting, the optimal strategy is not necessarily the deterministic policy used in Simulation 1.
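The non-stationary dynamic can be sketched as a small change to the step function of a port-scanning environment. The code below is an illustration under assumptions, not the released implementation: the class name `NonStationaryPortScanCTF`, the parameter name `detect_prob`, and the reward values are hypothetical. Two modeling choices are also assumed here because the text does not fix them: the scan first reports the port that is vulnerable at scan time and only afterwards the flag may move, and the new port is drawn uniformly from all ports, so it may coincide with the old one.

```python
import random


class NonStationaryPortScanCTF:
    """Port-scanning CTF in which a detected scan makes the defender move the flag."""

    def __init__(self, n_ports, detect_prob, step_reward=-1.0, flag_reward=100.0, rng=None):
        self.n_ports = n_ports
        self.detect_prob = detect_prob    # probability that a port scan is detected
        self.step_reward = step_reward    # assumed value
        self.flag_reward = flag_reward    # assumed value
        self.rng = rng if rng is not None else random.Random()
        self.reset()

    def reset(self):
        self.flag_port = self.rng.randint(1, self.n_ports)
        self.state = 0                    # state 0: complete ignorance
        return self.state

    def step(self, action):
        if action == 0:
            # The scan reports the port that is vulnerable at scan time.
            self.state = self.flag_port
            if self.rng.random() < self.detect_prob:
                # Detected: the flag is re-positioned (assumed uniform over all ports).
                self.flag_port = self.rng.randint(1, self.n_ports)
            return self.state, self.step_reward, False
        if action == self.flag_port:      # exploit sent to the currently vulnerable port
            return self.state, self.flag_reward, True
        return self.state, self.step_reward, False
```

With a detection probability of zero this reduces to the stationary problem, while with a detection probability of one the state observed after a scan carries no reliable information about the current location of the flag.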

Results. We run our simulation setting ports. All the remaining parameters of this simulation are the same as in Simulation 1. We consider all the possible values of in the set . We repeat each simulation times in order to collect reliable statistics.

Figure 2 reports the action-value matrices learned for , and . While for small values of the action-value matrix resembles closely the pattern we observed in Simulation 1, for higher values of we lose this structure. In the almost-deterministic case (Figure 2(a)) it is reasonable to use a port scan action at the beginning, followed by an exploit action that has a high probability of success; therefore we observe the usual diagonal shape. In the more stochastic case (Figure 2(b)) it is likely that a port scan action is detected and that the flag is moved; yet using a port scanning action and a targeted action is still a reasonable bet, even if less effective (notice the different scale for the matrices in Figure 2(a) and Figure 2(b)). Finally in the completely random case (Figure 2(c)) a port scan action certainly results in a detection, and no plan can be built over the information gathered; the agent is basically reduced to plain random guessing.

Fig. 2: Results of Simulation 2. Learned action-value matrix for: (a) , (b) , and (c) .

Consistently, Figure 3 shows the number of steps per episode when using , , . In the almost-deterministic case, the number of steps settles almost immediately close to the optimum; as we increase the stochasticity the number of steps increases because the agent can only try to guess the location of the vulnerability. Notice that the average number of steps in the completely random setting is higher than the number of ports; this is due to the fact that the agent tries out from time to time the port scan action, thus causing the flag to move, and requiring the agent to re-try its exploit on already checked ports.

Fig. 3: Results of Simulation 2. Number of steps as a function of episodes for , , .

Discussion. A non-stationary and non-monotonic problem constitutes, as is well known, a more challenging learning problem. Despite this, thanks to its formalization, a Q-learning agent is still able to solve this CTF problem in a reasonable, yet sub-optimal, way, as allowed by the degree of stochasticity and non-stationarity. Shifting the value of shows the dependence of the performance of the RL algorithm on its ability to discover a structure underlying the problem. For , the CTF system has a clear structure and the agent can learn an optimal policy; as increases, we slowly move from what we defined as a CTF system to a max-entropy system (see Section 3). Indeed, for our problem represents a max-entropy system: no action provides actual information on the structure of the target server (the port scan action is essentially unreliable and useless); unable to reconstruct any structure, RL has a very limited use: all we can do is guess, that is, try out all the ports one by one looking for the vulnerability. This underlines the role of structure in learning using RL agents.

5.3 Simulation 3: Server Hacking CTF Problem with Lazy Loading

In this simulation we consider a more realistic problem representing a simple server hacking scenario. To manage the large dimensionality of the state and action space and prune non-relevant states we adopt a lazy loading approach for our action-value table . We analyze how learning happens under this scenario, and what effect the adopted approach has on inference.

CTF scenario. In this simulation a target server provides different standard services, such as web, ftp, or ssh. Each service may have a vulnerability, either a simple vulnerability easily exploitable without a parameter (such as a Wordpress page with a plugin that may lead to an information disclosure at a specific known URL) or a vulnerability requiring the attacker to send a special input (such as a Wordpress plugin with SQL injection).

The attacker can carry out three types of information gathering actions. (i) It can check for open ports and services on the server. (ii) It can try to interact with the services using well-known protocols; this allows it to obtain basic information (such as banner information), and discover known vulnerabilities, such as weaknesses recorded in a vulnerability database. (iii) It can interact more closely with potentially unique service setups or customized web pages; this will allow the attacker to identify undocumented vulnerabilities and the input parameters necessary for exploitation; for instance, in the case of an ftp service, the agent may discover the input parameters for username and password, or, in the case of more complex services such as web, it may obtain GET and POST web parameters. In addition, the attacker also has two exploitation actions. (i) It can exploit a non-parametrized vulnerability by accessing the vulnerable service and retrieving the flag. (ii) It can choose a parameter out of a finite pre-defined set, and send it to a service to exploit a parametrized vulnerability and obtain the flag. In this scenario we make the simplifying assumption that the agent can identify just a parameter name from a fixed and limited set, and it does not need to select a parameter value.

RL Setup. We define a target server exposing ports, each providing one of different services. One of the services is taken to be flawed, and behind it lies the objective flag. The vulnerability may be a simple non-parametrized vulnerability or a parametrized vulnerability. In the last case, the vulnerability may be already known, or it may be previously unknown thus requiring deeper probing and analysis of the service. The parameter for the parametrized vulnerability is chosen out of a set of possible parameters.

The collection of basic actions available to the agent gives rise to a larger set of concrete actions, where each action type is instantiated against a specific port. The set of states has a large dimensionality as well, due to the problem of tracking what the agent has learned during its interaction with the server. As a rough estimation, in our implementation we estimate the number of total states as:


Refer to the Supplemental Material for the derivation of this approximation. This encoding forms a sufficient statistic that tracks all the actions of the agent and records all its knowledge. It is not meant to be an optimal encoding, and the dimensionality of the set may be reduced through a smarter representation of the states. However, even if we were to make the encoding more efficient, the overall dimensionality would quickly become unmanageable when the parameters , or were to grow. Instead of relying on expert knowledge of the specific problem at hand, we rely on the assumption that several (state, action) pairs may not be relevant or informative. This leads us to adopt a lazy-loading technique: instead of instantiating from the start a large unmanageable action-value matrix , we progressively build up the data structure of as the agent experiences new (state, action) pairs.
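A lazily loaded action-value table can be sketched with a dictionary keyed by (state, action) pairs. The code below is a minimal illustration, assuming hashable states and a default initial value of zero; the class name `LazyQTable`, its method names, and the default values of `alpha` and `gamma` are hypothetical and are not taken from the released implementation. Entries are allocated only when a pair is actually updated, so reading the value of an unseen pair does not grow the table.

```python
class LazyQTable:
    """Action-value table that allocates entries only for experienced pairs."""

    def __init__(self, n_actions, default=0.0):
        self.n_actions = n_actions
        self.default = default            # value assumed for pairs never updated
        self._q = {}                      # maps (state, action) -> value

    def value(self, state, action):
        # Reading an unseen pair returns the default without allocating it.
        return self._q.get((state, action), self.default)

    def best_value(self, state):
        return max(self.value(state, a) for a in range(self.n_actions))

    def update(self, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
        # Standard Q-learning update; the entry is created here if missing.
        target = reward if done else reward + gamma * self.best_value(next_state)
        old = self.value(state, action)
        self._q[(state, action)] = old + alpha * (target - old)

    def __len__(self):
        return len(self._q)               # number of (state, action) pairs stored
```

The length of such a table is the quantity one would monitor to see how many (state, action) pairs the agent has actually encountered, as opposed to how many exist in principle.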

This problem constitutes a more realistic model of a CTF challenge, presenting a target system with multiple services, each one potentially having a different type of weakness (unparameterized and parametrized vulnerabilities) at different levels (easy vulnerabilities already known or more treacherous vulnerabilities yet unknown). In this more challenging problem it is harder to define a simple deterministic optimal solution as it was in Simulation 1. A standard approach is undoubtedly to use more exploratory actions at the beginning, and leave exploitative actions for the end. However, the variability in the location of the flag and the sharp dynamics of the system make the problem far from trivial.

Results. We run our simulation setting ports, services, parameters. We randomly initialize pairs of (state, action) at run-time, and we run the agent for episodes. We repeat each simulation times in order to collect reliable statistics in a feasible amount of time.

Figure 4(a) reports the number of steps taken by our agent to complete a task, and, conversely, Figure 4(b) shows the reward obtained by the agent. These plots are quite noisy, but they show a clear improvement in the first few thousand episodes: we can clearly see a drop in the number of steps and an increase in the amount of reward collected. Notice that the high variance recorded is in part due to the highly exploratory behavior of the agent () that leads the agent to take a random action almost one third of the time. Interestingly, though, the upper bound of the reward curve approaches a reward of or higher, pointing out that the agent was indeed able to learn a sensible strategy as it was able to solve the CTF problem in few actions compared to the large number of possible combinations of actions it could try.

Fig. 4: Results of Simulation 3. (a) Plot of number of steps as a function of the number of episodes; (b) plot of reward as a function of the number of episodes.

Figure 5 shows the number of entries in the action-value table during the episodes. The plot seems to have a parabolic behavior, growing fast at the beginning and slowing down towards the end. This makes sense, as at the beginning every state encountered by the agent is new and needs to be added to the table . The continual increase in size is due to the strong exploratory policy () followed by the agent. Notice that if we were to substitute the values of , and of this simulation in Equation 2 we would get a rough estimate for of over ; therefore the number of states learned so far is an order of magnitude smaller (), and it has allowed the agent to learn swiftly a reasonable policy with a significantly smaller consumption of memory.

Fig. 5: Results of Simulation 3. Number of entries in the action-value table as a function of number of episodes.

Discussion. This more realistic simulation highlights at the same time the standard strengths and weaknesses of RL agents. An RL agent may be able to tackle a challenging problem with a subtle and sharp structure like the one presented, but, potentially, at a high computational cost. A trivial implementation may still be able to solve the problem, but it may quickly become unmanageable if it were to treat explicitly all the possible states. Instead of relying on expert knowledge to determine which states are important and which are not, lazy loading has allowed the agent to discriminate between relevant and non-relevant states based on its experience. In the following simulations we evaluate alternative ways to improve the learning process of the agent.

5.4 Simulation 4: Website Hacking CTF Problem with State Aggregation

In this simulation we run an environment similar to the previous one, and we adopt an additional strategy to address the challenge of exploration by instructing the agent to perform state aggregation over similar states. Again, we run our simulations and we study the dynamics and the performance of inference and learning.

CTF scenario. In this simulation we assume that the attacker knows the location of a target web page, so no port scan or protocol identification is required. The webpage consists of a set of files: starting from an index file, the attacker can map the visible files by reading the HTML content and by following the links inside the content. The webpage may also host hidden files not linked to the index. Some of the files contain server-side scripts and the attacker may identify customized inputs that may be sent to perform an exploitation and capture the flag. The attacker is given three types of information gathering actions. (i) It can read the index file, follow recursively all links, and thus obtain a map of all the linked files on the server. (ii) It can try to find hidden files by parsing the content of a visible file and inferring the existence of hidden files; for instance, looking at a file on a Wordpress site, the attacker may suspect the existence of /wp-login/index.php. (iii) It can analyze a visible or hidden file in order to find input parameters that can be used for an exploitation. A single exploitation action is possible: the attacker can send an input parameter to a file and, if correctly targeting the vulnerable file, obtain the flag. Here, again, we restrict our model to the problem of identifying a vulnerable parameter name out of a set, and not its parameter value.

RL Setup. We define a target server hosting files, partitioned in visible files and hidden files. Visible files are linked to the index file and connected among them in a complete graph; hidden files are files not openly linked to the index file but referenced or related to one of the visible files. One of the files, either visible or hidden, contains a parametrized vulnerability behind which lies a flag. The vulnerable parameter is chosen out of a set of possible parameters.

As before, the dimensionality of the action-value matrix grows exponentially with the number of files and the number of parameters . In order to make the problem manageable we decided to introduce prior knowledge in our model. We know that files on the target servers may be different, but the way to interact with them is uniform: we explore and inspect files using the same actions; we target files with the same vulnerability in an identical way. Notice that, in the real-world, the concrete way in which we implement actions on different files may be different, but these distinctions are abstracted away in the current model. The dynamics of interacting with files are then homogeneous among all the files. Thus, instead of requiring the agent to learn a specific strategy on each file, we instruct it to learn a single policy that will be used on all the files. We achieve this simplification using state aggregation [3]. At each time step, the agent will be focused only on a single file, interact with it and update a global policy valid for any file.
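The effect of state aggregation on the size of the table can be illustrated with a small sketch. The representation used here is an assumption made for the example and is not the encoding of the released implementation: each file is described by one of three hypothetical knowledge levels (`FILE_STATUSES`), a full state is the tuple of the statuses of all files, and the aggregated state is simply the status of the file currently in focus. The function names are likewise hypothetical.

```python
from itertools import product

# Hypothetical knowledge levels the agent may have about a single file.
FILE_STATUSES = ("unseen", "mapped", "analyzed")


def full_states(n_files):
    """All joint states when the status of every file is tracked separately."""
    return list(product(FILE_STATUSES, repeat=n_files))


def aggregate(full_state, focus):
    """Aggregated state: only the status of the file currently in focus."""
    return full_state[focus]


def shared_update(q, full_state, focus, action, target, alpha=0.1):
    """Move the entry of a table shared by all files towards a target value."""
    key = (aggregate(full_state, focus), action)
    old = q.get(key, 0.0)
    q[key] = old + alpha * (target - old)
    return q
```

With four files, the joint representation has 3^4 = 81 states, whereas the aggregated one has only 3, and experience collected on one file updates the same entry that will later be used on any other file in the same local condition.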

Results. We run our simulations randomly setting visible files and hidden files. We randomly initialize pairs of (state, action) using lazy loading and state aggregation. We run a single agent for episodes and then we test it on episodes during which we set the exploration parameter to 0. We repeat the testing of a trained agent times in order to collect reliable statistics.

Fig. 6: Results of Simulation 4. (a) Number of entries in the action-value matrix as a function of the number of episodes; (b) reward and number of steps as a function of the number of episodes.

Figure 6(a) shows the number of (state, action) pairs in the action-value table of our agent during the episodes of learning. The number of states saturates very quickly, enumerating all the states encountered by the agent. Figure 6(b) shows the reward and the number of steps on further episodes when running the same agent with the exploration parameter () set to zero. As expected the two plots are perfectly complementary, with the number of steps oscillating between and , and the reward between and . Removing the exploration parameter is a risky choice that may lead the agent to get stuck if it were to face sudden changes in the environment, but it allows us to better appreciate the fact that the agent indeed was able to learn a clear policy that allowed it to capture a flag with a minimal number of actions; eight to ten steps is indeed what is necessary to probe the target server, collect information on the files, and finally retrieve the flag.

Discussion. This simulation preserves most of the complexity of Simulation 3, and it shows how, using proper RL algorithms and techniques (lazy loading and state aggregation), a RL agent may manage to solve a challenging CTF problem. Notice that state aggregation allowed us to introduce a form of knowledge that a RL agent would not normally have. A human red team player may reach the conclusion that it is reasonable to act in a uniform way with different files from her previous experience with files; this knowledge provides her with an effective shortcut to reach a solution. A RL agent has no similar possibility as it has no formal concept of files; it could end up learning by inference a policy that is actually uniform for all the files, but this would require collecting a large sample of experiences. State aggregation allowed us to inject useful prior information about the structure of the problem, thus simplifying exploration and reducing the number of (state, action) pairs.

Another interesting feature of this simulation is the use of a graph to represent the filesystem on the target website. In this simulation, given the small size of the graph, comprising between two and six files, we relied on a simple linear exploration of the graph; however, smarter and more sophisticated ways of manipulating and exploring the graph may be taken into consideration to improve the performance of the agent. This may be a further development of interest, but in the next simulation we will consider a different modification and rely again on an alternative state-of-the-art approach to the problem to direct the learning of the RL agent.

5.5 Simulation 5: Web Hacking CTF Problem with Imitation Learning

In this simulation we consider a way to direct the learning process more explicitly by using imitation learning, in which we simulate learning in a teacher-and-student setting. We analyze the behavior of the agent under this setup and we compare the results of this process with the results obtained in the previous simulations.

Hacking scenario. We consider again the same server hacking problem presented in Simulation 3, as this constitutes the most challenging problem we have faced so far.

RL Setup. We consider the same setup used in Simulation 3. Beyond lazy-loading, this time we also rely on another standard RL technique, that is, imitation learning (or learning from demonstrations) [21]. In imitation learning, an agent is provided with a set of trajectories defined by human experts; in our case, these trajectories encode the behavior of a hypothetical human red team player trying to solve the web hacking CTF problem. These trajectories represent samples of successful behavior and provide information to the RL agent about the relevance of different options. Indeed, in imitation learning, the agent, instead of starting in a state of complete ignorance, is offered examples of how actions can be combined to reach a solution of the problem. This simplifies the exploration problem: instead of searching uniformly in the whole space of policies, the search is biased towards expert-defined policies. This bias allows the agent to solve the problem more efficiently, but it also makes it less likely that the agent will discover policies that are substantially different from human behavior.
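One simple way to let a tabular agent benefit from demonstrations is to replay the expert trajectories through the ordinary Q-learning update before any autonomous training takes place. The sketch below illustrates this idea only; it is an assumption about how demonstrations could be used and not necessarily the procedure adopted in our implementation. The function name `pretrain_from_demonstrations`, the trajectory format (lists of `(state, action, reward, next_state, done)` tuples), and the default values of `alpha`, `gamma` and `passes` are hypothetical.

```python
def pretrain_from_demonstrations(demos, n_actions, alpha=0.5, gamma=0.9, passes=10):
    """Seed a lazily built action-value table by replaying expert trajectories."""
    q = {}                                # maps (state, action) -> value

    def best_value(state):
        return max(q.get((state, a), 0.0) for a in range(n_actions))

    for _ in range(passes):
        for trajectory in demos:
            # Replaying each trajectory backwards lets the final reward
            # propagate to earlier steps within a single pass.
            for state, action, reward, next_state, done in reversed(trajectory):
                target = reward if done else reward + gamma * best_value(next_state)
                old = q.get((state, action), 0.0)
                q[(state, action)] = old + alpha * (target - old)
    return q
```

The returned table can then be used as the starting point for standard epsilon-greedy training, so that exploration begins around the demonstrated behavior instead of from a state of complete ignorance.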

Results. We run our simulations using the same setting used in Simulation 3. First, we train a standard RL agent for episodes. Then, we train three imitation learning agents, each one being provided with , and demonstrations respectively; after that the three imitation learning agents are further trained for episodes.

Figure 7 shows the rewards obtained by the different agents. The dotted lines represent the average reward obtained by the imitation learning agents during the episodes of training; notice that these lines are independent from the scale on the x-axis and are plotted as constants for reference. The blue line shows the reward, averaged every episodes, obtained by the standard RL agent during training. The graph shows that the standard RL agent needs to be trained on almost episodes before reaching the average reward that an imitation learning agent can achieve with demonstrations; similarly, the whole training time of episodes is necessary to match an imitation learning agent provided with demonstrations. The overall rewards are still far from being optimal, but imitation learning allows for a reduction of the number of episodes of training of one order of magnitude.

Fig. 7: Results of Simulation 5. Reward achieved by RL agents with and without imitation learning (see text for explanation).

Discussion. Imitation learning proved to be an effective technique to enable faster learning for the RL agent. This improvement is again due to the possibility of introducing in the agent knowledge on the structure of the problem. Indeed, demonstrations are an implicit way to express human knowledge about the structure of the CTF problem: instead of encoding knowledge on the structure of the problem in a formal mathematical way, we provide the RL agent with concrete observations about the structure of the problem. This information can successfully be exploited by the agent in order to learn an optimal policy.

6 Discussion

The simulations in this paper showed the feasibility of using RL for solving CTF problems, as well as the central role that the challenges of discovering structure and providing prior knowledge play in this context. While RL was able to solve optimally a simple CTF with an elementary structure (Simulation 1), we observed that changes in the structure of the CTF problem may make the problem harder to solve. We considered two ways in which the structure of the problem may change and raise concrete challenges. First, a progressively more undefined problem structure, shifting from a stationary CTF system to a max-entropy system, highlighted the limits of learning by inference (Simulation 2). Second, a stationary CTF problem with a progressively more complex structure required an exponential number of samples for the agent to work out the structure of the problem. In this last case, we showed how RL techniques, such as lazy loading, state aggregation, or imitation learning, may allow the RL agent to tackle more complex problems (Simulations 3, 4, 5). These techniques were explained and justified in terms of providing the agent with elementary prior information about the structure of the problem. Lazy loading corresponded to the assumption that certain configurations in the problem space would never be experienced, and therefore could be ignored; state aggregation expressed the assumption that certain configurations would be pragmatically identical to others; and imitation learning codified the assumption that an optimal solution would not be too far from well-known demonstrations. Notice that while imitation learning necessarily requires expert knowledge, lazy loading and state aggregation are based on simple assumptions needing limited expertise. Although implemented in specific simulations, all these forms of prior knowledge are not semantically tied to a specific problem, and they may be easily deployed across a wide range of other CTF problems.
Devising way to efficiently discover the structure of a CTF problem pure inference is then crucial to develop agents that can effectively perform PT.

7 Future Work

Several of the scenarios that we considered in this paper were very simplified versions of CTFs, more similar to toy problems than real challenges. Progressing forward would mean, at the same time, scaling the complexity of CTFs and improving the way in which a RL agent manages structure and prior knowledge.

In terms of scaling structure, a direct way to achieve this would be to increase the sheer complexity of the problems by expanding the size of the state and action spaces in order to resemble more closely what we see in reality. Complexity may also be increased by consistently adopting the assumption of non-stationarity, as we briefly did in Simulation 2. Alternatively, we may require that the interaction between the RL agent and the target system does not use custom-made machine-interpretable messages, but real-world protocol messages that may be processed by a dedicated language processing module.

In terms of learning the structure of the problem and integrating prior knowledge, better generalization (and scalability) can be achieved by switching from tabular algorithms to approximate algorithms, at the cost of interpretability. More interestingly, the agent could learn through multiple channels or rely on other forms of prior knowledge; promising directions would be the integration of planning [4], hierarchical decomposition of a CTF into sub-tasks, reliance on relational inductive biases [22], or integration of logical knowledge in the learning process [23].
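As an illustration of what moving from tabular to approximate algorithms could look like, the sketch below replaces the Q-table with one weight vector per action and applies a semi-gradient Q-learning update over a feature vector. This is one common form of function approximation, chosen here as an assumption; the class name `LinearQ`, the feature encoding, and the hyperparameters are hypothetical.

```python
class LinearQ:
    """Linear action-value approximation: Q(s, a) = w[a] . features(s).

    Hypothetical illustration; names and defaults are assumptions.
    """

    def __init__(self, n_features, n_actions):
        # One weight vector per action, initialised to zero.
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def q(self, features, action):
        """Estimate the value of taking `action` in the featurised state."""
        return sum(w * x for w, x in zip(self.w[action], features))

    def update(self, features, action, reward, next_features, done,
               alpha=0.1, gamma=0.9):
        """Apply one semi-gradient Q-learning step and return the TD error."""
        if done:
            best_next = 0.0
        else:
            best_next = max(self.q(next_features, a)
                            for a in range(len(self.w)))
        td_error = reward + gamma * best_next - self.q(features, action)
        # For a linear model the gradient of Q w.r.t. w[action] is the
        # feature vector itself.
        for i, x in enumerate(features):
            self.w[action][i] += alpha * td_error * x
        return td_error
```

Because states that share features also share weights, an update made in one state changes the estimate for states never visited, which is the source of the generalization; the individual weights are, however, harder to read than the entries of a Q-table.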

Tangentially, other challenges include the use of model learning, in order to allow the agent to learn an approximate model of the transition function of the environment, so that it could learn off-line via simulation; and proper reward shaping, that is, providing rewards that may better guide the learning process. Finally, real-world agents may have to consider the problem of transfer learning [24], that is, how to port the knowledge obtained from one class of CTF problems to another set of CTF problems.
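One well-known way to shape rewards is potential-based shaping, in which the term gamma * phi(s') - phi(s) is added to the environment reward for a chosen potential function phi. The sketch below illustrates this specific technique as an example only; it is not a method evaluated in this paper. The function name `shaped_reward` and the example potential (the number of services discovered so far) are hypothetical.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.9, done=False):
    """Return the reward augmented with a potential-based shaping term.

    `potential` is a hypothetical, user-supplied function mapping a state
    to a number; the potential of a terminal state is taken to be zero.
    """
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


# Example with an assumed state encoding: the set of services found so far.
before = {"ssh"}
after = {"ssh", "http"}
bonus = shaped_reward(-1.0, before, after, potential=len, gamma=0.9)
```

With this potential, an action that reveals a new service receives a less negative reward than one that reveals nothing, which may guide exploration toward informative actions.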

8 Conclusions

In this work we considered CTF competitions as concrete instances of PT, and we modeled them as RL problems. We highlighted that two crucial challenges for an RL agent confronting a CTF problem are: (i) the challenge of discovering a structure that is often limited and protected; (ii) the challenge of learning using only inference, whereas human players may rely on many other forms of knowledge and reasoning. We ran a varied set of simulations, implementing tabular Q-learning agents solving diverse CTF problems. Our results confirmed the relevance of the challenges we identified, and we showed how different RL techniques (lazy loading, state aggregation, imitation learning) may be adopted to address these challenges and make RL feasible.

We observed that while a strength of RL is its ability to solve model-free problems with minimal prior information, some form of side information may be extremely useful for allowing the solution of a CTF in a reasonable time. We believe a constructive approach would be for RL to learn from standard artificial intelligence model-based methods and balance RL model-free inference with model-based deductions and inductive biases.

Our implementations are open and use standard interfaces adopted in the RL research community. It is our hope that this will make an exchange between the two fields easier, with researchers in security able to borrow state-of-the-art RL agents to solve their problems, and RL researchers given the possibility of developing new insights by tackling the specific challenges instantiated in CTF games.