
A Holistic Approach to Evaluating Cyber Security Defensive Capabilities

Metrics and frameworks to quantifiably assess security measures have arisen from needs of three distinct research communities - statistical measures from the intrusion detection and prevention literature, evaluation of cyber exercises, e.g., red-team and capture-the-flag competitions, and economic analyses addressing cost-versus-security tradeoffs. In this paper we provide two primary contributions to the security evaluation literature - a representative survey, and a novel framework for evaluating security that is flexible, applicable to all three use cases, and readily interpretable. In our survey of the literature we identify the distinct themes from each community's evaluation procedures side by side and flesh out the drawbacks and benefits of each. Next, we provide a framework for evaluating security by comprehensively modeling the resource, labor, and attack costs in dollars incurred based on quantities, accuracy metrics, and time. This framework is a more "holistic" approach in that it incorporates the accuracy and performance metrics, which dominate intrusion detection evaluation, the time to detection and impact to data and resources of an attack, favored by educational competitions' metrics, and the monetary cost of many essential security components used in financial analysis. Moreover, it is flexible enough to accommodate each use case, easily interpretable, and comprehensive in terms of costs considered. Finally, we provide two examples of the framework applied to real-world use cases. Overall, we provide a survey and a grounded, flexible framework and multiple concrete examples for evaluating security that addresses the needs of three, currently distinct communities.



1. Introduction

As security breaches continue to affect personal resources, industrial systems, and enterprise networks, there is an ever-growing need to understand, “How secure are my systems?” This need has driven diverse efforts to systematize an answer. In the research literature, evaluation of information security measures has developed from three different but related communities. A vibrant and growing body of research on intrusion detection and prevention systems (IDS/IPS) has produced algorithms, software, and system frameworks to increase security; consequently, evaluation criteria to assess the efficacy of these ideas have been informally adopted. The most common methods require representative datasets with labeled attacks and seek traditional statistical metrics, such as true/false positive rates.

Analogous developments emerged from cyber security exercises, which have become commonplace activities for education and for enterprise self-assessment. Scoring for red-team and capture-the-flag educational exercises is necessary to the framework of the competitions, and generally provides more concrete measures of security, as it can accurately quantify measures of network resources, e.g., the time that a server was online or the number of records stolen.

More pragmatic needs for quantifying security arise at the interface of an organization’s financial and security management. Justifying security budgets and identifying optimal use of resources requires concise but revealing metrics intelligible to security experts and non-experts alike. To this end, intersections of the security and economic research communities have developed cost-benefit analyses that give methods to determine the value of security. As we shall see, the desired summary statistics provide a platform for quantifiable analysis, but they often depend on intangible estimates and future projections, and they lack semantic understanding.

While these sub-domains of security research developed rather independently, their overarching goal is the same: to provide a quantifiable, comparable metric to validate and reason about the efficacy of security measures. This work delivers two primary contributions. First, we provide a representative survey of the security evaluation literature (Section 2). Works are chosen that collectively highlight the trends and landmarks from each subdomain (IDS evaluation, cyber competition scoring, economic cost-benefit analyses), allowing side-by-side comparison. We illustrate drawbacks and beneficial properties of the evaluation techniques.

Second, we propose a general “holistic” framework for evaluating security measures by modeling costs of non-attack and attack states. By holistic, we mean this approach is comprehensive in terms of the real-world factors that contribute to the overall models, and it is flexible enough to satisfy all three use cases. Specifically, it incorporates the accuracy and performance, which dominate IDS evaluation; the time to detection as well as the confidentiality, integrity, and availability of resources, favored by competitions’ metrics; and the dollar costs of resources, labor, and attacks comprised by cost-benefit analyses.

Our model, described in Section 3, is a cost model that can be configured for many diverse scenarios, and permits a variety of granularity in modeling each component to accommodate situations with ample/sparse information. Unlike many previous frameworks, ours uses a single, easy-to-interpret metric, cost in dollars, and is readily analyzable as each component of this cost uses a fairly simple model. As is commonplace for such economic models, finding accurate input values (e.g., maximum possible cost of an attack, or the quantity of false alerts expected) is difficult and a primary drawback of our and all previous similar models (see Section 2.3). In response, we provide a sensitivity analysis in Section 3.3 to identify the model parameters/components that have the greatest effect, so users know where to target efforts to increase accuracy, a practice that our survey reveals is unfortunately rare.

We employ the new model in Section 4. As the driving force behind this research, we give a detailed configuration of our cost model to be used as the evaluation procedure for an upcoming IARPA Challenge involving intrusion detection (Section 4.1). We provide simulated attack and defense scenarios to test our scoring framework and exhibit results confirming that the evaluation procedure encourages a balance of accuracy, timeliness, and resource costs. We expect our simulation work to provide a baseline for future competitors.

Finally, as another example we configure this new model to evaluate the GraphPrints IDS from our previous work (Harshaw et al., 2016). This example shows the efficacy of the evaluation model from many viewpoints. For researchers it provides an alternative to simple accuracy metrics, by incorporating the accuracy findings and resource costs into a realistic, quantifiable cost framework. This allows, for example, optimizing thresholds, accuracy, and performance considerations rather than just reporting each. From the point of view of a security operation center (SOC), we provide an example of how to evaluate a tool with the perspective of a potential purchase. Finally, we consider the model from a vendor’s eye, and derive bounds for the potential licensing costs.

Overall, the main contributions of this work are (1) a survey of three distinct but related areas, and (2) a general framework allowing computation and comparison of security that satisfies the needs of all three use cases with examples of how this metric can be used.

2. Related Work: A Survey of Security Evaluation

Our finding from related work is that evaluation of cyber security measures has developed in three rather independent threads. This section surveys that work and strives to be representative of the main ideas in each thread rather than comprehensive in terms of citing every possibly related paper.

2.1. Evaluation of Intrusion Detection & Prevention Systems

In the intrusion detection and prevention research, which focuses on evaluating and comparing detection capabilities, researchers generally seek statistical evidence for detection accuracy and computational viability as calculated on test datasets. While such evaluations are commonplace, curating convincing test sets and developing relevant metrics for efficacy in real-world use have proven difficult. The default metrics employed, such as computational complexity for performance and the usual accuracy metrics (e.g., true/false positive rate, precision, receiver operating characteristic (ROC) curve, and area under the curve (AUC), to name a few), are indeed important statistics to consider. Yet, these metrics cannot account for many important operational considerations, e.g., valuing earlier detection over later, valuing high-priority resources/data, or including the costs of operators’ time. Research to incorporate these aspects is emerging, e.g., (Garcia et al., 2014), but still does not incorporate all of these real-world concerns. Often the cost to implement the proposed security measures in operations, e.g., hardships of training the algorithms on network-specific data or configuration/reconfiguration costs, is either neglected or considered out of scope.

Further, validation of a proposed method requires data with known attacks and enough fidelity to demonstrate the method’s abilities. In general, there is limited availability of real network datasets with known attacks, and there is little agreement on what qualities constitute a “good”, that is, representative and realistic, dataset. This is exacerbated by privacy concerns that inhibit releasing real data, and by unique characteristics inherent to each network, which limit the generalizability of any given dataset. Notably, a small number of publicly available datasets have catalyzed a large body of detection research, in spite of many of these datasets receiving ample criticism. See Glass-Vanderlan et al. (Glass-Vanderlan et al., 2018) for a list of datasets appearing in the research literature and a survey of IDS works by data type. In other cases, researchers often use one-off, custom-made datasets, e.g., (Jewell and Beaver, 2011; Harshaw et al., 2016). While this can potentially address some of the concerns above, these datasets generally are not made publicly available, which sacrifices reproducibility of results, inhibits meaningful comparison across publications, and makes it difficult or impossible to verify the quality of these datasets.

2.1.1. DARPA Dataset

One of the earliest attempts to systematize this performance evaluation was the DARPA 1998 dataset, which was originally used to evaluate performers in the DARPA/AFRL 1998 “Intrusion Detection Evaluation,” a competition-style project. This dataset included both network and host data, including tcpdump and list files, as well as BSM Audit data. The test network included thousands of simulated machines, generating realistic traffic in a variety of services, over a duration of several weeks; this also included hundreds of attacks of 32 attack types (Lippmann et al., 2000). Interestingly, results were measured in “false alarms per day” (presumably averaged over the days in the dataset), not the raw false alarm rate (number of false positives / number of negatives). This false-alarms-per-day rate was compared with the true positive rate in percent. The authors chose this representation of the false alarm rate to emphasize the costs in terms of analysts’ time; however, future researchers generally used the raw false alarm rate instead, since, for a fixed dataset, the two are directly proportional.

This was apparently the first use in intrusion detection of the receiver operating characteristic (ROC) curve, which plots the true positive rate versus the false positive rate as the detection threshold is varied; it has since become a common practice. As the authors explain:

ROC curves for intrusion detection indicate how the detection rate changes as internal thresholds are varied to generate more or fewer false alarms to tradeoff detection accuracy against analyst workload. Measuring the detection rate alone only indicates the types of attacks that an intrusion detection system may detect. Such measurements do not indicate the human workload required to analyze false alarms generated by normal background traffic. False alarm rates above hundreds per day make a system almost unusable, even with high detection accuracy, because putative detections or alerts generated can not be believed and security analysts must spend many hours each day dismissing false alarms. Low false alarm rates combined with high detection rates, however, mean that the putative detection outputs can be trusted and that the human labor required to confirm detections is minimized.
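The threshold sweep described in this passage can be illustrated with a minimal sketch. It assumes a detector that emits one numeric score per event (higher meaning more suspicious); the function name `roc_points`, the toy scores, and the two-day test period are hypothetical and are not taken from the DARPA evaluation.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int], days: float) -> List[Tuple[float, float, float]]:
    """Return (threshold, detection_rate, false_alarms_per_day) per candidate threshold.

    scores: detector output per event (higher = more suspicious)
    labels: 1 for attack events, 0 for normal events
    days:   length of the test period, used to express false positives per day
    """
    positives = sum(labels)
    points = []
    # Sweep from the strictest threshold (few alerts) to the loosest (many alerts).
    for threshold in sorted(set(scores), reverse=True):
        alerts = [s >= threshold for s in scores]
        tp = sum(1 for a, y in zip(alerts, labels) if a and y == 1)
        fp = sum(1 for a, y in zip(alerts, labels) if a and y == 0)
        points.append((threshold, tp / positives, fp / days))
    return points


# Toy data (hypothetical): six events observed over two days.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
points = roc_points(scores, labels, days=2.0)
for threshold, detection_rate, fa_per_day in points:
    print(f"threshold={threshold:.1f} detection={detection_rate:.2f} false_alarms/day={fa_per_day:.1f}")
```

Lowering the threshold raises the detection rate only at the price of more false alarms per day, which is the analyst-workload tradeoff the quoted passage emphasizes.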

This dataset has faced various criticism for not being representative of real-world network conditions. Some researchers have observed artifacts of the data creation, due to the simulated environment used to generate this data, and explained how these artifacts could bias any detection metrics (Mahoney and Chan, 2003). Other researchers noted that high-visibility but low-impact probing and DoS attacks made up a large proportion of the attacks in the dataset (Brugger and Chow, 2007), giving them increased importance in the scoring, while other work has criticized the attack taxonomy and scoring methodology (McHugh, 2000).

2.1.2. KDD Cup 1999 dataset & evaluation

The KDD Cup competition was based on the same original dataset as DARPA 98, but included only the network-oriented features and pre-processed them into convenient collections of feature vectors. This resulted in a dataset consisting essentially of flow data with some additional annotations. This simplified dataset was easier for researchers to use and spawned a large variety of works which applied existing machine learning techniques to this dataset (Glass-Vanderlan et al., 2018).

In the performer evaluation, the scores were determined by finding the confusion matrix (number of true/false positives/negatives) and multiplying each cell by a factor between 1 and 4. This weighting did penalize false positives for the “user to root” and “remote to local” attack categories more heavily than other types of mis-categorization; however, this was apparently to compensate for the uneven sizes of the classes, and did not seem to consider any relative, real-world costs of false positives versus false negatives, as these costs were discussed in neither the task description nor the evaluation discussion. Other than these weights, the evaluation and the discussion of results placed no particular emphasis on the impact to the operator, in contrast to the DARPA 98 evaluation discussed above. Omitting any ROC curves also seemed to be a step backwards, although this is understandable because not all submissions included the tunable parameter that a ROC curve requires.
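A weighted confusion-matrix score of this kind can be sketched as follows. This is only an illustration: the three-class setup, the cost values, and the choice to average the weighted cells over all examples are assumptions made here, and the matrix shown is not the actual KDD Cup 1999 cost matrix.

```python
from typing import List


def average_cost(confusion: List[List[int]], cost: List[List[float]]) -> float:
    """Weight each confusion-matrix cell by its cost and average over all examples."""
    weighted = sum(
        count * weight
        for count_row, cost_row in zip(confusion, cost)
        for count, weight in zip(count_row, cost_row)
    )
    total_examples = sum(sum(row) for row in confusion)
    return weighted / total_examples


# Rows are actual classes, columns are predicted classes.
# Hypothetical toy classes: normal, probe, user-to-root.
confusion = [
    [90, 5, 5],
    [2, 8, 0],
    [3, 0, 7],
]
# Hypothetical per-cell costs; correct classifications (the diagonal) cost nothing.
cost = [
    [0, 1, 2],
    [1, 0, 2],
    [3, 2, 0],
]
print(average_cost(confusion, cost))
```

A lower average cost is better; changing the off-diagonal weights changes which kinds of error a submission is most penalized for.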

Overall, the performers achieved reasonable results, but none were overwhelmingly successful. Notably, some very simple methods (e.g., a nearest neighbor classifier) performed nearly as well as the winning entries (Elkan, 2000).

Because it was derived from the DARPA dataset above, this dataset inherited many of the same problems discussed earlier. In particular, Sabhnani & Serpen (Sabhnani and Serpen, 2004) note that it was very difficult for performers to classify the user-to-root and remote-to-local categories:

Analysis results clearly suggest that no pattern classification or machine learning algorithm can be trained successfully with the KDD data set to perform misuse detection for user-to-root or remote-to-local attack categories.

These authors explain that the training and testing sets are too different (for these categories) for any machine learning approach to be effective. After merging the training and test sets, they re-tested using five-fold cross validation to observe that the same methods on this modified dataset resulted in vastly superior detection performance.

Other researchers released variants of this dataset to address some of these problems, such as the NSL-KDD dataset (Tavallaee et al., 2009), which removed many redundant flows and created more balanced classes. However, other problems still remain, and at this point the normal traffic and the attacks are no longer representative of modern networks.

2.1.3. Later Developments

Following the release of these early benchmark datasets, researchers have generally focused on creating more recent and/or higher-quality datasets, and generally have not focused on the methodology/metrics used in evaluation. Examples are the UNM dataset (of the University of New Mexico, 2006) of system call traces for specific processes, the ADFA datasets (Creech and Hu, 2013) of Windows and Linux host audit data, and the VAST competition 2012 and 2013 datasets, which focus on various network data sources.

There are many additional data sources which are now publicly available, but most are not suitable as-is for training and testing these systems. Some are more specialized datasets, such as the Active DNS Project (Kountouras et al., 2016), many contain only malicious traffic, and some data sources are both specialized and malicious, such as only containing peer-to-peer botnet command and control traffic.

There are far fewer examples of normal traffic datasets, although a few have been released, such as CAIDA’s anonymized internet traces (the CAIDA UCSD Anonymized Internet Traces). These are generally anonymized and/or aggregated in some way (e.g., flows versus packets, modified addresses, etc.) that can limit some types of analysis. More importantly, there is little consensus on what “normal traffic” means, or what datasets would be representative of what networks. If these systems are deployed on networks with different characteristics than the evaluation datasets, the performance could differ significantly. Some works have proposed criteria which high-quality datasets should strive to achieve, but these are not universally accepted currently (Sharafaldin et al., 2017).

Because of these issues, current datasets can vary significantly in quality, and most are flawed in some way. There is currently no consensus on general-purpose benchmark datasets, the role which the DARPA and KDD datasets used to fill. This situation does seem to be gradually improving over time, but for now these remain significant issues.

2.1.4. CTU Botnet Dataset

One notable recent contribution presents both new datasets and a new methodology for evaluating IDS performance for detecting botnet traffic from PCAPs (packet captures) or just flows (Garcia et al., 2014). The datasets contain a variety of individual malware PCAP files, collected over long time frames, and also some collections of (real) background traffic from a university network. In addition, the authors discuss some important shortcomings of the generally used metrics, and propose some significant improvements.

The classic error metrics were defined from a statistical point of view and they fail to address the detection needs of a network administrator

To address this, they propose new criteria:

  • Performance should be measured by addresses instead of flows. This is important because, for example, one malware sample may generate much more command and control traffic than another, even while performing similar actions. This incidental difference in behavior should not artificially impact the detection scores.

  • When correctly detecting botnet traffic, a True Positive, early detection is better than later.

    • They define this as “A True Positive is accounted when a Botnet IP address is detected as Botnet at least once during the comparison time frame.”

  • When failing to detect actual botnet traffic, a False Negative, an early miss is worse than a later one.

    • They define this as “A False Negative is accounted when a Botnet IP address is detected as Non-Botnet during the whole comparison time frame.”

  • The value of correctly labeling non-botnet traffic (True Negative) is not affected by time.

    • They define this as “A True Negative is accounted when a Normal IP address is detected as Non-Botnet during the whole comparison time frame.”

  • The value of incorrectly alerting on normal traffic (False Positive) is not affected by time.

    • They define this as “A False Positive is accounted when a Normal IP address is detected as Botnet at least once during the comparison time frame.”

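They then use these to define the following time-dependent versions of the usual true/false positive/negative counts:

$$tTP = \frac{c_{TP}\,(1 + e^{-\alpha n})}{N_{BT}}, \qquad tFN = \frac{c_{FN}\,(1 + e^{-\alpha n})}{N_{BT}}, \qquad tFP = \frac{c_{FP}}{N_{NT}}, \qquad tTN = \frac{c_{TN}}{N_{NT}}.$$

Here $c_{TP}$, $c_{FN}$, $c_{FP}$, and $c_{TN}$ are the counts of the address-level events defined above; $n$ is the number of the “comparison time frame,” representing the relative time of the event; $\alpha$ is an adjustable time-scaling parameter; $N_{BT}$ is the number of unique botnet IP addresses in the comparison time frame, and $N_{NT}$ is the number of unique normal IP addresses in the comparison time frame. Finally, the time-respecting analogues of the usual metrics are as follows:

$$TPR = \frac{tTP}{tTP + tFN}, \qquad FPR = \frac{tFP}{tFP + tTN}, \qquad \text{Precision} = \frac{tTP}{tTP + tFP}.$$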

To summarize, detecting or failing to detect real botnet traffic (or other attack traffic) is time-sensitive, while for normal traffic it is not. Also, any number of alerts for flows that are related to the same address should be aggregated into one item for evaluation purposes, since the analyst is primarily concerned with the machine level, not directly with the flow level. Additionally, these authors define new time-based measures of FPR, TPR, TNR, FNR, precision, accuracy, error rate, and F1 score. They provide a public tool to calculate these new scores (Garcia et al., 2014).
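The time-sensitive counting summarized above can be sketched in a few lines. The sketch assumes that the time weighting takes the form $1 + e^{-\alpha n}$, applied only to the botnet-address counts, and that each count is normalized by the number of unique addresses of its kind; the function names and the example numbers are hypothetical, and this is not the authors' public tool.

```python
import math
from typing import Dict


def time_based_counts(c_tp: int, c_fn: int, c_fp: int, c_tn: int,
                      n: int, alpha: float,
                      n_botnet: int, n_normal: int) -> Dict[str, float]:
    """Address-level counts for comparison time frame n, weighted by time.

    Assumption: botnet-related counts (TP, FN) are scaled by 1 + exp(-alpha * n),
    so early frames weigh more; normal-traffic counts (FP, TN) are not time-scaled.
    """
    weight = 1.0 + math.exp(-alpha * n)
    return {
        "tTP": c_tp * weight / n_botnet,
        "tFN": c_fn * weight / n_botnet,
        "tFP": c_fp / n_normal,
        "tTN": c_tn / n_normal,
    }


def time_based_rates(t: Dict[str, float]) -> Dict[str, float]:
    """Time-respecting analogues of the usual rates, built from the weighted counts."""
    return {
        "TPR": t["tTP"] / (t["tTP"] + t["tFN"]),
        "FPR": t["tFP"] / (t["tFP"] + t["tTN"]),
        "precision": t["tTP"] / (t["tTP"] + t["tFP"]),
    }


# Same raw counts observed in an early frame (n=1) and a late frame (n=50).
early = time_based_counts(3, 1, 2, 18, n=1, alpha=0.01, n_botnet=4, n_normal=20)
late = time_based_counts(3, 1, 2, 18, n=50, alpha=0.01, n_botnet=4, n_normal=20)
print(time_based_rates(early)["precision"], time_based_rates(late)["precision"])
```

With identical raw counts, the early frame yields a larger weighted true positive count and therefore a higher precision than the late frame, while the false positive count is unchanged.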

2.1.5. Common Problems with ML Approaches

Sommer & Paxson (Sommer and Paxson, 2010) review the difficulties in using machine learning in intrusion detection, and help to explain why it has been less successful when compared with other domains, such as optical character recognition (OCR). Some of the main issues they highlight involve the lack of quality training data, specifically insufficient quantity of data, training with one-class datasets, and non-representative data causing significant problems. Additionally they re-emphasize some important practical issues, such as the relatively high costs of both false positives and false negatives, and the “semantic gap” referring to the difficulty in interpreting alerts. All of these factors result in difficulties in performing evaluations, with the simple statistical metrics of FP & FN rates being insufficient, and real-world usability being more important, but more difficult to measure. The authors emphasize that designing and performing the evaluation is generally “more difficult than building the detector itself.”

A common problem for intrusion detection metrics is the base-rate fallacy; e.g., see Axelsson (Axelsson, 2000). Concisely, the base-rate fallacy arises when a detector has a low false positive rate (percentage of negatives that are misclassified) yet a high false alert rate (percentage of alerts that are false positives, i.e., one minus the precision). It is often caused by high class imbalance, usually orders of magnitude more negatives (normal data) than positives (attack data). That is, the denominator of the false positive rate calculation is usually an enormous number; hence, for nearly any detector the false positive rate can be exceptionally low. This can give a false sense of success, and it means that ROC curves in effect depict only the true positive rate. On the other hand, the false alert rate, or simply the quantity of false alerts, is often important to take into account.
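A short worked example makes the effect of class imbalance concrete. The numbers below are hypothetical (one attack per 10,000 events, a 99% true positive rate, and a 1% false positive rate), and the function name is ours.

```python
def false_alert_fraction(base_rate: float, tpr: float, fpr: float) -> float:
    """Fraction of all alerts that are false positives (one minus the precision).

    base_rate: fraction of events that are attacks
    tpr:       true positive rate (fraction of attacks that raise an alert)
    fpr:       false positive rate (fraction of normal events that raise an alert)
    """
    true_alerts = base_rate * tpr
    false_alerts = (1.0 - base_rate) * fpr
    return false_alerts / (true_alerts + false_alerts)


# Hypothetical detector on highly imbalanced traffic.
imbalanced = false_alert_fraction(base_rate=1e-4, tpr=0.99, fpr=0.01)
# The same detector on balanced data, as in many test sets.
balanced = false_alert_fraction(base_rate=0.5, tpr=0.99, fpr=0.01)
print(f"imbalanced: {imbalanced:.3f}, balanced: {balanced:.3f}")
```

Under these assumptions roughly 99% of alerts are false on the imbalanced traffic, against about 1% on balanced data, even though the detector's true and false positive rates are identical in both cases.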

Sommer & Paxson’s overall conclusion is that the intrusion detection problem is fundamentally harder than many other machine learning problem domains, and that while these techniques are still promising, they must be applied carefully and appropriately to avoid these additional difficulties.

2.1.6. Evaluating Other Tools

Most of these works discussed above focus on IDSs specifically, but these difficulties also apply to IPSs, malware detection, and related problems. Similarly, these apply regardless of the data source or architecture being considered—host-based, network-based, virtual machine hypervisor-based, or other approaches.

Additionally, when evaluating other security-related systems, such as firewalls, SIEMs (security information and event management systems), ticketing systems, etc., we encounter even more difficulty. Not only are there no widely accepted datasets, but the relevant metrics and the testing methodology are often not considered systematically. Some of these factors, such as the user experience and integration with current workflow and tools, are inherently harder to quantify, and are often organization-dependent. The situation at present is somewhat understandable, but we maintain that a holistic approach to evaluation should help in addressing these areas as well.

2.2. Evaluation criteria for cyber competitions

Red team & capture-the-flag (CTF) competitions exercise both offensive and defensive computing capabilities. These activities are commonly used as educational opportunities and for organizational self-assessment (Drinkwater and Zurkus, 2017; Doupé et al., 2011; Patriciu and Furtuna, 2009; Reed et al., 2013; Werther et al., 2011; Mullins et al., 2007); see also DOE’s cyber-physical defense competition, DefCon’s CTF, NSA’s NCX, the National Collegiate Cyber Defense Competition (NCCDC), and Mitre’s CTF, among others. These competitions require a set of resources to be attacked and/or defended and evaluation criteria to determine winning teams (among other necessities).

In addition to traditional statistical evaluation metrics (e.g., true positive rate), the competitions integrate measures of operational viability, such as the duration or number of resources that remained confidential, unaltered (integrity measure), or functional (availability measure). For example, Patriciu & Furtuna (Patriciu and Furtuna, 2009) list the following scoring measures for cyber competitions. For attackers: the count of successful attacks, the count of accesses to the target system, and the number of successfully identified open services compared to the total number (an analogue of the true positive rate). For defenders: true positive rates for detection (identification) and forensics (classification), the time to recover from an attack, and the downtime of services.

While there is wide variety across competitions, the main trend in evaluation is to augment the usual detection accuracy metrics with some measure of how well an operation remained healthy and unaffected. This greatly increases trust in the evaluation procedure because the effect of the security measures on the operational objective is built into the metrics. We note that the object under evaluation is usually the participants’ skill level, and significant effort is needed to assemble the test environment. While cyber exercise publications often focus on a combination of pedagogy, design, implementation, etc., we only survey the evaluation procedures. Below we discuss two works that give novel evaluation metrics for cyber competitions.

2.2.1. iCTF’s Attacker Evaluation

Doupé et al. (Doupé et al., 2011) describes the 2010 International Capture the Flag Competition (iCTF), which employed a novel “Effectiveness” score for each attacker. For each service, , and time the binary functions , taking values in , are defined as follows: is a binary function that indicates criticality of service at that time ; specifically, it indicates if the function is in use for this application. encodes risk to an attacker, e.g., being detected, and in this competition was simply the opposite bit as , punishing attacks on unused services. is the indicator function for when is positive and represents the “Optimal Attacker”. For an attacker , represents the risk to service by attacker at time . Toxicity is defined as

a score that is increasing with the effectiveness of the attacker. Note that toxicity () is maximal when . Hence, the final score is the normalized toxicity, with

2.2.2. MIT-LL CTF’s CIA Score

Werther et al. (Werther et al., 2011) describe the MIT Lincoln Laboratory CTF exercise, in which teams are tasked with protecting a server while compromising others’ servers. A team’s defensive score is computed as

$$D = W_c\,C + W_i\,I + W_a\,A,$$

where

  1. $C$ is the percent of the team’s flags not captured by other teams (confidentiality),

  2. $I$ is the percent of the team’s flags remaining unmodified (integrity), and

  3. $A$ is the percent of successful functionality tests (availability).

The weights $W_c$, $W_i$, $W_a$ allow flexibility in this score. The offensive score $O$ is the fraction of flags captured from other teams’ servers, and the total score is

$$S = W\,D + (1 - W)\,O,$$

with parameter $W \in [0, 1]$ encoding the tradeoff between offensive and defensive scores. This is an appealing metric because it intuitively captures all three facets of information security: confidentiality, integrity, and availability.
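The scoring just described can be sketched as two weighted sums. The linear form, the function and parameter names, the default weights, and the example values below are assumptions for illustration rather than the exercise's actual scoring code.

```python
def defensive_score(confidentiality: float, integrity: float, availability: float,
                    w_c: float = 1 / 3, w_i: float = 1 / 3, w_a: float = 1 / 3) -> float:
    """Weighted sum of the three CIA components, each given as a fraction in [0, 1]."""
    return w_c * confidentiality + w_i * integrity + w_a * availability


def total_score(defense: float, offense: float, tradeoff: float = 0.5) -> float:
    """Blend defensive and offensive scores; tradeoff in [0, 1] is the weight on defense."""
    return tradeoff * defense + (1.0 - tradeoff) * offense


# Hypothetical team: 80% of flags kept secret, 90% unmodified, all functionality tests passed,
# and 30% of the other teams' flags captured.
defense = defensive_score(0.8, 0.9, 1.0, w_c=0.5, w_i=0.25, w_a=0.25)
score = total_score(defense, offense=0.3, tradeoff=0.6)
print(defense, score)
```

Raising the tradeoff parameter toward 1 rewards teams that protect their own server, while lowering it toward 0 rewards teams that capture more flags.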

2.3. Cost-benefit analyses of security measures

There is a robust literature that bloomed around 2005 providing quantifiable cost-benefit analysis of security measures using applied economics. Arising from the tension between the operational need for security and the organization’s budget constraints, these works provide frameworks for quantifiable comparison of security measures. The clear goal of each work is assisting decision makers (e.g., C-level officers) in optimizing the security-versus-cost balance. To quote Leversage & Byres (Leversage and Byres, 2008),

One of the challenges network security professionals face is providing a simple yet meaningful estimate of a system or network’s security preparedness to management, who typically aren’t security professionals.

While many different models exist, overarching trends are to enumerate/estimate (a) internal resources and their values, (b) adversarial actions (attacks) and their likelihood, and (c) security measures’ costs and effects, then use a given model to produce a comparable metric for all combinations of security measures in consideration. This subject bleeds from academic literature into advisory reports from government agencies and companies (Institute, 2010, 2018b), textbooks for management (Gordon and Loeb, 2006; Tipton and Nozaki, 2007), and security incident summary costs and statistic reports (IC3, 2016; Institute, 2017; FireEye, 2016).

The main drawback is that all proposed models rely on untenable inputs (e.g., likelihood of a certain attack with and without a security tool in place) that are invariably estimated and often impossible to validate. Academic authors are generally open about this, as are we. Perhaps surprisingly, our survey of the literature did not find any use of sensitivity analysis to determine the most critical assumptions, a reasonable step for learning which inputs are most influential, especially when validation of input assumptions is not possible. In response, for our model we provide such a discussion in Section 3.3.

A prevalent, but less consequential drawback is a tendency to oversimplify for the sake of quantification. This often results from unprincipled conversions of incomparable metrics (e.g., reputation to lost revenue), or requiring users to rank importance of incomparable things. The outcome is a single quantity that is simple to compare but hard to interpret.

User studies show that circa 2006, many large organizations used such models as anecdotal evidence to support intuitions on security decisions (Rowe and Gallaher, 2006). The advantages are pragmatic: these models leverage the knowledge of security experts and external security reports to (1) reason about what combination of security measures is the “best bang for the buck”, and (2) provide the financial justification required by chief financial officers to move forward with security expenditures (Institute, 2010).

2.3.1. SAEM: Security Attribute Evaluation Method

Perhaps the earliest publication on cost-benefit analysis for information security, Butler (Butler, 2002) provides a detailed framework for estimating a threat index, which is a single value representing the many expected consequences of an attack. Working with an actual company, Butler describes examples of the many estimates in the workflow. Users are to list (1) all threats, e.g., 28 attacks, each in three strengths, were enumerated by the company using this framework, (2) all potential consequences with corresponding metrics, e.g., loss of revenue measured in dollars, damaged reputation measured on a 0-6 scale, etc., and (3) the impact of each attack on each consequence. Weights are assigned to translate the various cost scales into a uniform “threat index” metric; note that this step allows a single number to represent all consequences, but is hard to interpret. Next, the likelihood of each attack is estimated, and the weighted average gives the threat index per attack. The per-attack threat indices are summed to a single, albeit hard-to-interpret, number. By estimating the effect of a desired security measure on the inputs to the model, analysts can see the plot of costs for each solution versus the change in threat index. Notably, the author mentions that uniformly optimistic or pessimistic estimates will not change rankings of solutions, and suggests a sensitivity analysis, although none is performed.
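
The workflow above can be sketched in a few lines of Python. This is only one plausible reading of the threat-index computation, not Butler's implementation: the attack names, likelihoods, impact values, and weights are hypothetical, and the per-attack index is assumed to be the attack likelihood times the weighted sum of its impacts.

```python
from dataclasses import dataclass

@dataclass
class Attack:
    name: str
    likelihood: float   # estimated likelihood of the attack in the period
    impacts: dict       # consequence name -> impact on that consequence's own scale

# Hypothetical weights translating each consequence scale into threat-index units.
WEIGHTS = {"lost_revenue_usd": 1e-4, "reputation_0_to_6": 10.0}

def attack_threat_index(attack, weights=WEIGHTS):
    """Likelihood times the weighted sum of impacts (assumed form)."""
    weighted_impact = sum(weights[c] * value for c, value in attack.impacts.items())
    return attack.likelihood * weighted_impact

def total_threat_index(attacks, weights=WEIGHTS):
    """Sum of the per-attack threat indices."""
    return sum(attack_threat_index(a, weights) for a in attacks)

def apply_measure(attacks, likelihood_scale):
    """Scale each attack's likelihood by a per-name factor (1.0 = unaffected)."""
    return [
        Attack(a.name, a.likelihood * likelihood_scale.get(a.name, 1.0), a.impacts)
        for a in attacks
    ]

attacks = [
    Attack("phishing", 0.5, {"lost_revenue_usd": 20_000, "reputation_0_to_6": 1}),
    Attack("ransomware", 0.1, {"lost_revenue_usd": 200_000, "reputation_0_to_6": 4}),
]
baseline = total_threat_index(attacks)
# A hypothetical mail filter assumed to halve the likelihood of phishing.
with_filter = apply_measure(attacks, {"phishing": 0.5})
reduction = baseline - total_threat_index(with_filter)
```

Plotting `reduction` against the cost of each candidate measure gives the cost-versus-change-in-threat-index comparison described above; the weights remain the hard-to-interpret part.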

2.3.2. ROSI: Return on Security Investment

Sonnenreich et al. (Sonnenreich et al., 2006) and Davis (Davis, 2005) discuss a framework for estimating the Return on Security Investment (ROSI). The calculation requires estimation of the Annual Loss Expected ($ALE$). Tsiakis et al. (Tsiakis and Stephanides, 2005) provide three formulas for estimating $ALE$. One example is to let $O$ be the set of attacks, $I(O_i)$ the cost of attack $O_i$, $F_i$ the frequency of attack $O_i$, and then $ALE = \sum_i I(O_i) F_i$. ROSI is a formula to compute the percent of security costs saved if implemented. It requires users to estimate the percent of risk mitigated by the security measure, $M$, and the cost of the measure, $C$. Then the expected costs with the measure in place are $ALE\,(1 - M) + C$, and ROSI (the percent of cost returned) is $(ALE \cdot M - C)/C$. These authors expect estimate formulas to vary per organization, point to public cost-of-security reports, e.g., (IC3, 2016), to assist estimation, and suggest internal surveys to estimate the parameters needed. They go on to say that “accuracy of the incident cost isn’t as important as a consistent methodology for calculating and reporting the cost”, a dubious claim.
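
As a worked illustration, the sketch below computes an ALE as a sum of cost times frequency over attacks, and then ROSI as (ALE × fraction of risk mitigated − measure cost) / measure cost. The function names and all numeric inputs are hypothetical; in practice each input is an estimate of the kind discussed above.

```python
def annual_loss_expected(attacks):
    """attacks: iterable of (cost_per_incident, incidents_per_year) pairs."""
    return sum(cost * frequency for cost, frequency in attacks)

def rosi(ale, fraction_mitigated, measure_cost):
    """Return on Security Investment, as a fraction of the measure's cost."""
    if measure_cost <= 0:
        raise ValueError("measure_cost must be positive")
    return (ale * fraction_mitigated - measure_cost) / measure_cost

# Hypothetical inputs: two attack types and one candidate security measure.
attacks = [(50_000, 2.0), (400_000, 0.05)]
ale = annual_loss_expected(attacks)
measure_rosi = rosi(ale, fraction_mitigated=0.75, measure_cost=30_000)
print(f"ALE = {ale:,.0f}, ROSI = {measure_rosi:.0%}")
```

A ROSI above zero means the estimated savings exceed the cost of the measure; because the result is a ratio of estimates, it inherits all of their uncertainty.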

2.3.3. ISRAM: Information Security Risk Analysis Method

Karabacak & Sogukpinar (Karabacak and Sogukpinar, 2005) introduce ISRAM, a survey procedure for estimating attack likelihood and cost, the two inputs of an $ALE$ estimate. For both attack likelihood and attack costs, a survey is proposed. Each survey question (producing an answer which is a probability) is given a weight, and the weighted average is converted to a threat index score, which is averaged across participants. The $ALE$ score is the product of these two averages.

2.3.4. Gordon-Loeb Model

Perhaps the most influential model is that of Gordon and Loeb (GL Model), which provides a principled mathematical bound on the maximum a company should spend on security in terms of their estimated loss. See the 2002 paper (Gordon and Loeb, 2002) for the original model. Work of Gordon et al. (Gordon et al., 2015) extends the model to include external losses of consumers and other firms (along with costs only to the private firm being modeled).

To formulate the GL model, let $L$ denote the monetary value of loss from a potential cyber incident, $v$ the likelihood of that incident, and $S(z)$ denote the likelihood of an attack given $z$ dollars are spent on security measures. Initial assumptions on $S$ are that $S(0) = v$, $S$ is twice differentiable, and strictly convex; e.g., $S(z) = v/(\alpha z + 1)^{\beta}$ for $\alpha > 0$, $\beta \geq 1$ is a particular example. It follows that $vL$ is their $ALE$ estimate. The goal is to optimize the expected cost, $S(z)L + z$ for $z$ positive. In the initial work Gordon & Loeb show that for two classes of $S$ satisfying the above assumptions, $z^* = \operatorname{argmin}_z \left[ S(z)L + z \right]$, the spending amount that minimizes expected costs, satisfies

$$z^* \leq \frac{1}{e}\, vL. \qquad (3)$$

That is, optimal security will cost no more than $1/e \approx 37\%$ of the expected loss of the attack (Gordon and Loeb, 2002)!
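
The bound can be checked numerically for one admissible breach-probability function. The sketch below assumes the form S(z) = v/(αz + 1)^β, which is the first class studied by Gordon and Loeb, minimizes the expected cost S(z)L + z in closed form, and compares the minimizer against vL/e. The parameter values and function names are hypothetical.

```python
import math

def breach_probability(z, v, alpha, beta):
    """Assumed breach function S(z) = v / (alpha*z + 1)**beta, with S(0) = v."""
    return v / (alpha * z + 1.0) ** beta

def expected_cost(z, v, L, alpha, beta):
    """Expected loss plus security spending: S(z)*L + z."""
    return breach_probability(z, v, alpha, beta) * L + z

def optimal_spend(v, L, alpha, beta):
    """Minimizer over z >= 0, from the first-order condition S'(z)*L = -1."""
    root = (v * alpha * beta * L) ** (1.0 / (beta + 1.0))
    return max(0.0, (root - 1.0) / alpha)

# Hypothetical inputs: 60% breach likelihood, $1M loss, and productivity parameters.
v, L, alpha, beta = 0.6, 1_000_000.0, 1e-5, 1.0
z_star = optimal_spend(v, L, alpha, beta)
bound = v * L / math.e
print(f"optimal spend = {z_star:,.0f}; bound vL/e = {bound:,.0f}")
```

With these inputs the optimal spend is roughly $145,000 against a bound of roughly $221,000, so the 37% rule holds with room to spare; other parameter choices move the optimum but, for this class, never past the bound.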

Follow-on mathematical work has shown this bound to be sharp and valid for a much wider class of functions $S$ (Lelarge, 2012; Baryshnikov, 2012). The work of Baryshnikov (Baryshnikov, 2012) is particularly elegant, with mathematical results so striking they are worth a summary. Let $X$ be the set of all security actions a firm could enact, $Z(A)$ the cost of a set of actions $A \subseteq X$, and $S(A)$ the likelihood of an attack after actions $A$ are enacted. Baryshnikov assumes enactable collections of actions are measurable, and $Z$ is a measure; this is a mild assumption and its real-world meaning is simply that the cost of disjoint collections of security actions will be additive, i.e., $Z(A \cup B) = Z(A) + Z(B)$ for disjoint $A$ and $B$. Next, $S$ is also a set function, with $S(A)$ interpreted as the likelihood of an attack after actions $A$ are enacted. There are two critical assumptions:

  1. $\log S(A \cup B) = \log S(A) + \log S(B)$ for disjoint $A$ and $B$, so indeed $\log S$ is a measure.

  2. $\log S$ is a non-atomic measure, i.e., any $A$ can be broken into smaller measurable sets.

These assumptions are made to satisfy the hypotheses of Lyapunov’s convexity theorem (see (Tardella, 1990; Liapounoff, 1940)). Finally, set

$$S(z) := \inf \{ S(A) : Z(A) \leq z \},$$

the likelihood of an attack given one has enacted the optimal set of security actions that cost less than $z$. Lyapunov’s theorem furnishes that the range of the vector-valued measure $(Z, \log S)$ is closed and convex. The closedness implies that for any $z$ (amount of money spent), the optimal set of countermeasures exists, while the convexity can be used to show that $z^*$ (the optimal cost) satisfies the 37% rule (Equation 3)!

This dizzying sequence of mathematics is striking because it starts with few and seemingly reasonable assumptions and proves the cost of optimal security is bounded by 37% of potential losses. The conundrum of these results is they are deduced with no real-world knowledge of a particular organization’s security actions, costs, or attacks. While the assumptions seem mathematically reasonable, e.g., “$S$ is convex” translates to “decreasing returns on investment (the first dollar spent yields more protection than the next)”, the result, the 37% rule, presupposes the solution to a critical question: that for any given dollar amount, $z$, the optimal security measures with cost less than $z$ will be found. No method for finding an optimal set of measures is given or widely accepted.

Gordon et al. (Gordon et al., 2016) focus on “insights for utilizing the GL model in a practical setting”. Since the model is formulated as optimizing a differentiable function, the optimum occurs when $S'(z)L = -1$, or equivalently, the amount to spend is the increment of spending at which the marginal reduction in expected loss per additional dollar is estimated at 1. The authors work with a company as an example, and the company is tasked with identifying resources to protect, the losses if each is breached, and the change in likelihood for each $1M spent. In practice this model mimics the many other works in the area. The burden is on the company to estimate cost, likelihood, and efficacy of potential attacks and countermeasures, and then the reasoning is straightforward. On the other hand, the 37% rule gives an indicator of whether an organization’s security expenses are non-optimal. See Section 4.1 for an application.

2.3.5. Leversage & Byres’ Mean Time to Compromise Estimate

Research by Leversage & Byres (Leversage and Byres, 2008) uses the analogy of burglary ratings of safes, which are given in terms of the time needed for one to physically break into the safe, as a way to quantify security. Specifically, the research seeks an estimate of the average time to compromise a system. Network assets are divided into zones of protection levels, and network connectedness is used to create an attack graph using some simplifying assumptions, e.g., a target device cannot be compromised from outside its zone. Attackers are classified into three skill levels, and functions are estimated that produce the time to compromise assets given the attacker’s level and other needed estimates, such as the average number of vulnerabilities per zone. Finally, a mean time to compromise can be estimated for each adversary level using the paths in the attack graph to targets and the estimated time functions. While this model still requires critical inputs that lack validated methods to estimate, the work addresses the problem of quantifying security in a different light. Unlike the other models discussed here, it embodies the fact that time is an extremely important aspect of security for two reasons: (1) the more adversarial resources are needed to successfully compromise a resource, the less likely adversaries are to pursue/succeed; (2) the more time and actions needed between initial compromise and target compromise, the more chance of detection and prevention before the target is breached (Institute, 2017).

2.3.6. Other works on quantifying security

Tangential to the three research areas discussed above are various studies and non-academic reports that address quantifiability of security.

Vendor and government reports are common resources for estimating costs based on historical evidence. Broad statistics about the cost and prevalence of security breaches are provided annually by the US Federal Bureau of Investigation (FBI) (IC3, 2016). More useful for estimating costs of a breach are industry reports that provide statistics conditioned on location, time, etc. (Institute, 2010, 2018b, 2017, 2018a; FireEye, 2016). Notably, Ponemon’s Cyber Cost report gives the average monetary cost per record compromised per country per year (they report $225 and $233 per record in the US in 2017 and 2018, respectively), an essential estimate for all economic models above. Further, Ponemon reports that if the mean time to contain (MTTC) was under 30 days, the average total cost was nearly $1M less than for breaches with MTTC greater than 30 days.

Acquisti et al. (Acquisti et al., 2006) seek the cost of privacy breaches through statistical analysis of the stock prices of many firms in the time window surrounding a breach. Their conclusion is that short-term negative effects are statistically significant, but longer-term effects are not.

See Rowe & Gallaher (Rowe and Gallaher, 2006) for results of a series of interviews with organizations on how security investment decisions are made (circa 2006). Anderson & Moore (Anderson and Moore, 2006) provide a 2006 panoramic review of the diverse trends and disciplines influencing information security economics.

Verendel (Verendel, 2009) provides a very extensive pre-2008 survey of research seeking to quantify security, concluding that “quantified security is a weak hypothesis”. That is to say, the methods proposed lack repeated testing resulting in refinement of hypotheses and ultimately validation through corroboration.

3. New Evaluation Framework

Our goal is to provide a comprehensive framework for accruing security costs that is flexible enough to accommodate most if not all use cases by modeling and estimating costs of defensive and offensive measures modularly. By design the model balances accuracy of detection/prevention capabilities, resources required (hardware, software, and human), and timeliness of detection/containment of attacks. Viewed alternatively, the model permits cost estimates for true negative (not under attack), true positive (triage and response costs), false negative (under attack without action), and false positive (unnecessary investigative) states.

Our approach can be seen as adopting the same general cost-benefit framework as the works in Section 2.3, and incorporating the more specific metrics described in Sections 2.1 and 2.2 to address the other two use cases, namely IDS evaluation and competition events. Consequently, we rely on many of the same cost estimates as the works in Section 2.3, which does present some practical difficulties, especially when estimating probability or costs of an attack; however, these difficulties in estimation are unavoidable, and we provide estimates for costs based on research to be used as defaults in Section 3.4.

Our general approach for evaluating the impact of any technology or policy changes is to estimate the change in the total cost ($C_{Total}$) by estimating the costs of breaches ($C_{Breach}$) and the cost of all network defenses ($C_{Defense}$). This approach can be applied to a wide variety of technology or policy changes; however, in this discussion we are primarily considering the case of IDS evaluation. The costs of network defense ($C_{Defense}$) can be considered a combination of labor costs $C_L$ and resource costs $C_R$.

$$C_{Total} = C_{Breach} + C_{Defense} = C_{Breach} + C_L + C_R$$

The attack cost, $C_{Breach}$, is analogous to the Annual Loss Expected ($ALE$) following Section 2.3, with the difference being that $C_{Breach}$ covers an arbitrary time period, and can include actual or estimated losses (e.g., cost due to data files stolen). Note that these breach costs include both direct costs (monetary or intellectual property losses) as well as less direct losses such as reputation loss, legal costs, etc. Defense costs include all costs of installing, configuring, running, and using all security mechanisms and policies. While effective defenses will primarily reduce the number of breaches expected, effective incident response will primarily reduce the impact (and therefore the cost) of any specific breach, so both approaches would be expected to reduce $C_{Breach}$, at the cost of somewhat increasing $C_{Defense}$. The defense cost includes both resource costs and labor costs. Both of these will generally include up-front as well as ongoing costs. Ongoing costs can vary over time, and can depend on adversary actions, because analysts will be reacting to adversary actions when detected.

When comparing IDSs, we want to consider the total costs of all candidate systems, meaning the defense costs of the IDS, plus the projected breach costs above. A typical analysis might compare a baseline of no defenses (meaning $C_{Defense} = 0$ and maximal $C_{Breach}$) versus current practices, versus new proposed system(s). The total costs ($C_{Total}$) will be positive in all cases, but successful approaches will minimize this total.

For a simple example, incorporating an enterprise-wide policy that all on-network computers must have a particular host-based anti-virus alerting and blocking system will incur an upfront licensing fee, costs of hardware needed to store and process alerts, labor costs for the time spent installing and configuring, time spent responding to alerts, and a constant accrual of costs in terms of memory, CPU, and HD use per host per hour. However, these costs will presumably be offset by a reduced $C_{Breach}$. Estimating all of the costs included in $C_{Defense}$ is relatively simple; however, estimating $C_{Breach}$ is more difficult. Specific examples for using the model for such evaluations are the topic of Section 4.

This section defines and itemizes attack and defense costs in Subsections 3.1 & 3.2. We strive for relatively fine-grained treatment of costs (e.g., breaking attack cost models into kill-chain phases), permitting one to drill down into costs if their data/estimates permit detailed analysis, or to stay at a more peripheral level and model with coarser granularity. The section concludes with our estimates for quantifying the main components in Subsection 3.4.

3.1. Attacks and Breaches: Definition and Cost Model

Beginning with the familiar triad of confidentiality, integrity and availability, we define a “breach” as any successful action by an attacker that compromises confidentiality or integrity. In principle, attacks against availability could also be considered, but ignoring them at present simplifies the discussion below. This also fits with the common understanding among practitioners, where loss of availability (e.g., due to a DDoS attack) is generally considered much less severe, and may be handled primarily by the network operation center (NOC) instead of the security operation center (SOC) (CyberSponse, 2017; Intellectual Point, 2016).

Building on this, we consider an “attack” to be a series of actions which, if successful, will lead to a loss or corruption of data or resources. The attack begins with the first actions that could lead to compromising confidentiality or integrity, and the attack ends when these are no longer threatened. For example, an attacker may re-try a failed action several times before adapting or giving up, and this would all be considered part of the same attack. An attack can potentially be thwarted by both automated tools and manual response of the SOC.

If each attack were instead viewed as one atomic event, this type of reaction by network defenders would not be possible within that framework; however, real attacks almost always involve a sequence of potentially-detectable attacker actions. For example, the “cyber kill chain” model (Hutchins et al., 2011) describes a seven-phase model of the attacker’s process, beginning with reconnaissance, continuing through exploitation, command and control, and ending with the attacker completing whatever final objectives they may have. At that point, the breach is successful. As the authors describe:

The essence of an intrusion is that the aggressor must develop a payload to breach a trusted boundary, establish a presence inside a trusted environment, and from that presence, take actions towards their objectives, be they moving laterally inside the environment or violating the confidentiality, integrity, or availability of a system in the environment. The intrusion kill chain is defined as reconnaissance, weaponization, delivery, exploitation, installation, command and control (C2), and actions on objectives.

The authors later describe how this model can map specific countermeasures to each of these steps taken by an adversary, and how this model can be used to aid in other areas such as forensics and attribution.

Other authors have expanded this kill chain model to related domains such as cyber-physical systems (Hahn et al., 2015) or proposed related approaches based on the same insights (Caltagirone et al., 2013). The creators of the STIX model discuss this kill-chain approach in some depth when presenting their STIX knowledge representation (Barnum, 2012). They define a “campaign” as “a set of attacks over a period of time against a specific set of targets to achieve some objective.”

Our definitions of “attack” and “breach”, discussed above, are a simplified view of these same patterns. In this case, the specific sequence of actions is less important than the general pattern: an “attack” consists of a series of observable events, which potentially leads to a “breach” if successful. The events that make up an attack can be grouped into “phases”, where one phase consists of similar events, and ends when the attacker succeeds in progressing towards their objectives. An example would be dividing the attack into seven phases corresponding to the kill chain above; however, we generally make no assumptions about what may be involved in each phase, only that they occur sequentially, and that succeeding in one phase is a prerequisite for the next.

3.1.1. Breach Costs Model, $C_{Breach}$

The general pattern for the attack is that each phase incurs a higher cost than the previous phase, until the maximum cost is reached when the attacker succeeds. Within each phase, the cost begins at some initial value, then increases over time until it reaches some maximum value for that phase. Consider an attacker with user-level access to some compromised host. Initially, the attacker may make quick progress in establishing one or more forms of persistence, gathering information on that compromised system, evaluating what data it contains, etc. However, over time the attacker will maximize that system’s value, and will need to move on to some other phase to continue towards their objectives.

Figure 1. Plot of cost versus time for one phase of an attack, with starting cost $a$ at $t = 0$, maximal cost $b$ (the limit in time), and a term $\alpha$ that determines how quickly this maximum cost is approached.

The cost versus time for any phase of the attack can be seen in Figure 1, and can be represented by the equation $f(t) = b - (b - a)e^{-\alpha t}$, where $t \geq 0$, with the starting cost $a$ at $t = 0$, $b$ the maximal cost (limit in time), and the term $\alpha$ determines how quickly this maximum cost is approached. In the worst case, where $\alpha$ is relatively large, this cost versus time curve is approximately a step function.

Figure 2. Plot of cost versus time for all phases of an attack, specifically, $\sum_i f_i(t - t_i)$ with phase $i$ beginning at time $t_i$.

Because an attack is composed of several of these phases, if we assume that each phase is more severe and more costly than the previous, we can view the cost versus time for the attack overall as seen in Figure 2. This can be represented by a sum of the cost of all phases, which using the equation above would be $\sum_i f_i(t - t_i)$ with phase $i$ beginning at time $t_i$ (and $f_i(s) = 0$ for $s < 0$). As $t$ increases, this will approach the maximum cost for this breach.
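
A minimal sketch of this phased model follows. It assumes a saturating-exponential form for each phase, b − (b − a)·exp(−αt), which is one curve consistent with the description (starting cost a, limiting cost b, rate α); the phase start times and parameter values are hypothetical.

```python
import math

def phase_cost(t, a, b, alpha):
    """Cost of one phase, t time units after it begins (zero before it begins).

    Assumed form: starts at a, approaches b, at a rate set by alpha.
    """
    if t < 0:
        return 0.0
    return b - (b - a) * math.exp(-alpha * t)

def attack_cost(t, phases):
    """Total cost at time t; phases is a list of (start_time, a, b, alpha)."""
    return sum(phase_cost(t - start, a, b, alpha) for start, a, b, alpha in phases)

# Hypothetical three-phase attack with escalating costs (time in hours, cost in dollars).
phases = [
    (0.0, 0.0, 1_000.0, 0.5),
    (10.0, 500.0, 5_000.0, 0.2),
    (24.0, 2_000.0, 20_000.0, 0.1),
]
for hour in (5, 15, 48, 200):
    print(hour, round(attack_cost(hour, phases)))
```

Each phase contributes a jump to its starting cost when it begins and then saturates, so the total is a staircase of saturating curves that approaches the sum of the phase maxima.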

Figure 3. Plot of S-curve estimate of attack cost and corresponding marginal cost (derivative) ignoring individual phases of an attack.

This model is crafted to give flexibility based on the situation. In cases where we have sufficient data on a real attack, this approximation may be unneeded, and one can replace the curves above with observed costs for each phase. On the other hand, when estimating the cost of a general attack without specific cost versus time data, we propose two options. First, one may estimate each phase’s cost as constant and estimate the attack as a series of steps (effectively letting each phase’s $\alpha \to \infty$). Second, one may ignore phases and approximate the total cost of the attack as an S-curve, such as $f(t) = b\, e^{-\alpha^2/t^2}$, where $b$ denotes the maximal cost of an attack over time, and $\alpha$ controls how fast the attack cost approaches $b$. See Figure 3.

This second approach is useful when the total cost of a breach can be estimated, but the individual phases of an attack either cannot be modeled well, or are not the primary concern. An example of this may be when estimating future breach costs for planning purposes. Intuitively, this cost estimation gives a marginal cost of $f'(t) = \frac{2\alpha^2 b}{t^3}\, e^{-\alpha^2/t^2}$, a skewed bell curve. This matches the expected costs for a common attack pattern, beginning with low-severity events such as reconnaissance, reaching maximum marginal cost as the attack moves laterally, exfiltrates data, or achieves its main objectives, starting with the most important objectives if possible. Over time, after the main objectives have been completed, and the maximum cost is being approached, the attack will again reach a lower marginal cost simply because few or no objectives remain. To view this another way, this model captures the common-sense view that attacks should be stopped as early as possible, and that stopping an attack after it has largely succeeded provides little value. Interestingly, modeling a particularly slow-moving attacker, or a particularly fast one, can be achieved by varying $\alpha$. In practice one would fit the two parameters to their data/estimates. Examples are given in Sections 3.4.1 and 4.1.
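
The S-curve and its marginal cost can be explored with a short sketch. The functional form used here, b·exp(−α²/t²), is an assumption chosen as one two-parameter S-curve with limiting cost b; the parameter values are hypothetical, and the peak-time formula follows from setting the derivative of the marginal cost to zero for this particular form.

```python
import math

def s_curve_cost(t, b, alpha):
    """Assumed S-curve: cumulative attack cost b * exp(-(alpha/t)**2) for t > 0."""
    if t <= 0:
        return 0.0
    return b * math.exp(-((alpha / t) ** 2))

def marginal_cost(t, b, alpha):
    """Time derivative of s_curve_cost: (2*alpha**2*b / t**3) * exp(-(alpha/t)**2)."""
    if t <= 0:
        return 0.0
    return (2.0 * alpha**2 * b / t**3) * math.exp(-((alpha / t) ** 2))

def peak_marginal_time(alpha):
    """Time at which the marginal cost is largest for this form: alpha*sqrt(2/3)."""
    return alpha * math.sqrt(2.0 / 3.0)

# Hypothetical parameters: $1M maximal cost, alpha = 30 days.
b, alpha = 1_000_000.0, 30.0
t_peak = peak_marginal_time(alpha)
print(f"marginal cost peaks near day {t_peak:.1f}")
print(f"cost incurred by then: {s_curve_cost(t_peak, b, alpha):,.0f}")
```

A smaller `alpha` moves the peak earlier (a fast attacker) and a larger one moves it later (a slow attacker), which is the fitting knob described above.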

3.2. Defense Cost Model

We break defense cost into labor and resource (e.g., hardware) costs, denoted $C_L$, $C_R$, respectively,

$$C_{Defense} = C_L + C_R.$$

Both the labor costs and resource costs can be sub-divided into several terms for easier estimation. These can be represented as a sum of the following:

  • initial costs, $C_I$, covering initial install, configuration, and related tasks,

  • baseline costs, $C_B$, covering ongoing, normal operation when no alerts are present,

  • alert triage costs, $C_T$, representing the cost of determining if an alert is a true positive or a false positive,

  • incident response costs, $C_{IR}$, representing the costs of responding to a real incident after it is detected and triaged.

This can be summarized as the following equations:

$$C_L = C_{I_L} + C_{B_L} + C_{T_L} + C_{IR_L}, \qquad C_R = C_{I_R} + C_{B_R} + C_{T_R} + C_{IR_R}.$$

3.2.1. Labor Cost Model, $C_L$

Labor costs of analyst time and other technical staff time are a significant cost for many organizations. If there is any noticeable cost or productivity impact to the end user, this must also be included. This could include reduced productivity from machine slowdowns, AV false positives incorrectly deleting needed files, false positives in web or DNS filtering blocking useful sites, or any downtime needed to respond to real incidents. These costs can be sub-divided as described above, into initial costs, baseline costs, triage costs, and incident response costs, which allows the labor costs to be related directly to the sensor behavior and the status of any attacks. See Table 1 with functional models to accompany descriptions below.

The initial labor costs, $C_{I_L}$, cover any initial installation, configuration, and all related tasks such as creating/updating any documentation. This also includes the costs of any required training for both analysts and end users.

The baseline labor costs, $C_{B_L}$, cover normal operation when no alerts are present. This would include any patching, routine re-configuration, etc. This would also include any possible impact on the end user from normal operation, such as updating credentials, maintaining two-factor authentication, etc.

The alert triage labor costs, $C_{T_L}$, represent the cost of determining if an alert is a true positive or a false positive. In principle this could apply to both analysts and end users; however, generally end users will not be involved in or aware of this process, so in those cases they would not contribute to costs. Note that the time needed to triage any alerts can depend significantly on their interpretability. For example, an alert giving “anomalous flow from IP <X>, port <x> to IP <Y>, port <y>” would be less useful than “Unusually low entropy for port 22(ssh), this indicates un-encrypted traffic where not expected”.

The incident response labor costs, $C_{IR_L}$, represent the costs of responding to a real incident after it is detected and triaged. The actual cost of this can vary over a large range, but we can make many of the same observations as in Section 3.1.1: the attack can be considered a series of discrete events, grouped into phases of escalating severity and cost, and each phase reaches some maximum cost before potentially advancing to the next phase. Overall, we can model the costs of incident response with a sigmoid function of time, of the same form as the attack cost model, with parameters fit to incident response cost data if available. Like the attack costs model, if we have data from an actual observed attack, we then no longer need this model, and can calculate this cost directly from available information. The incident response will primarily impact the analysts and other staff responding directly; however, some incidents may impact users as well, for example due to re-imaging machines, or due to network resources being unavailable during the response. The impact on users can either be calculated based on real event data, or predicted using a model similar to the analysts’ incident response costs.

Table 1. Labor Costs. Labor costs for analysts and end users (columns: Analysts, End Users), broken out by cost category (rows: Initial, Baseline, Triage, Incident Response). Note that the cost estimate function is not needed if real cost data is available.

3.2.2. Resource Cost Model, $C_R$

Resource costs are another significant component of overall costs of network defense. These can be broken down similarly to the labor costs above, into initial costs $C_{I_R}$, baseline costs $C_{B_R}$, triage costs $C_{T_R}$, and incident response costs $C_{IR_R}$. As shown in Table 2 these resource costs can also be sub-divided by resource type. This specificity helps in estimating costs and in relating costs to IDS performance and attack status.

The sub-categories of resources considered include the following:

  • Licensing - In most cases this will either be free, fixed cost, or a subscription based cost covering some time period. However, this also could potentially involve a cost per host, cost per data volume, or some other system. This will be a significant cost in many cases.

  • Storage - This is one of the easier costs to estimate; this increases approximately linearly with data volume. This will generally be a function of the number of alerts generated, or a function of time if more routine information is being logged, such as logging all DNS traffic.

  • CPU - The computational costs of analysis, after data is collected, will (hopefully) scale approximately linearly with the data volume. This cost can vary based on algorithm, indexing approach, and many other factors. This is a function of time, and does not generally depend on number of alerts, unless considering some process that specifically ingests alerts, e.g., security information and event management (SIEM) systems. There are additional costs of instrumentation and collecting data, for example capturing full system call records will impose some non-trivial cost on the host. Most end-users are not CPU bound under normal workloads, so this cost is minimal as long as it’s under some threshold. In a cloud environment, this may be included in their billing model, or if self-hosting this will reduce the ability to oversubscribe resources, so in either of these cases the costs will be more direct.

  • Memory - There is some memory cost required for analysis and indexing. This is generally a function of time, or in some cases a function of the number of alerts. There is also some memory cost for collection on the host. Like CPU costs on the host, most physical machines are over-provisioned, so costs are minimal if under some threshold. In a cloud environment, this will typically be a linear cost per time.

  • Disk IO - These costs are generally a function of time and/or a function of the number of alerts. This cost is not a major concern until it passes some threshold where it impacts performance on either a server or the user’s environment.

  • Bandwidth - Like Disk IO, these costs are a function of time and of the number of alerts. This is also not a major concern until it passes some threshold that causes performance degradation.

  • Datacenter Space - While in practice this is a large up-front capital cost, it would typically make sense to consider any appliances as ‘leasing’ space from the datacenter. Optionally, the rate set may account for how much of the datacenter’s capacity is currently used, so that space in an underutilized datacenter is considered a lower cost. In commercial cloud environments, this is not a directly visible cost, but is included in other hosting costs.

  • Power and Cooling - These costs are similar to the costs of datacenter space discussed previously, except that representing the costs as a function of time is more direct. In most cases this is not a major concern, but it could be in some cases, and is included for completeness.

Table 2. Resource Costs. Resource costs for each type of resource described in this section (columns: Licensing, Storage, CPU, Memory, Disk IO, Bandwidth, Space, Power), broken out by cost category (rows: Initial, Baseline, Triage, Incident Response).

Initial costs would primarily consist of licensing fees and hardware purchases, as needed. Hardware purchases and related capital costs, such as datacenter capacity, can be either included in the initial costs or averaged over their expected lifespan, which would be captured in the baseline costs $C_{B_R}$. Either is acceptable, as long as they are not over- or under-counted.

Baseline costs represent the cost of normal operation, when no alerts are being generated. This may include licensing costs, if those are on a subscription basis. This also would often include storage, CPU, memory, datacenter costs, and related costs, in cases where hardware costs are amortized over time, or in cases where cloud services are used, and these resources are billed based on usage. This case is what is shown in Table 2.

Alert triage costs represent costs of servicing and triaging alerts, above the baseline costs of normal operation. This is potentially a labor-intensive process, but generally imposes little or no direct resource costs in terms of CPU, memory, etc. The amount of storage and bandwidth needed for each alert is extremely small, and is not significant until alert volumes become much higher than analysts could reasonably handle. There are some exceptions, such as large volumes of low-priority alerts, or unusual licensing arrangements, so these costs are included for completeness.

Incident response costs represent the costs of actually responding to a known attack. Like the triage costs, this is labor-intensive, but involves little or no direct resource costs outside of highly unusual circumstances. This is included here simply for completeness.

3.3. Full Model & Parameter Analysis

Combined, these give the following:


As with all cost-benefit models, the primary difficulty is estimating input parameters; e.g., populating the breach cost term requires estimating the full impact of a future breach over time, an inherently imprecise endeavor. While we give some defaults and examples for many of the estimates in Sections 3.4 and 4, here we give a broad overview of the sensitivity of the model to its parameters, allowing users to target estimation efforts at the inputs that are most influential.

The initial (one-time) cost terms are constants; hence, unless they are very large in some particular situation, they will not have a large effect when estimating costs over long time spans. The ongoing baseline cost terms are linear, increasing functions of time. These will generally have a greater effect than the constant one-time costs. In some cases these can be a substantial contributor, but for most applications we expect them to be less influential than attack, triage, and response costs.

Triage costs, both labor and resource, are linear, increasing functions of the number of alerts, and incident response costs are linear, increasing functions of the number of incidents and their cost. These are potentially very influential on the final costs. Importantly, the false positive and true positive rates/quantities are hidden variables here. The final costs of a security measure can vary widely with the quantity of alerts and the accuracy of detectors, so these terms are very influential. This is supported by our examples, where costs incurred by the quantity of false positives drastically change overall costs.

Finally, breach costs (attack models) are potentially non-linear in time. Consequently, they are the most influential parameters, along with the hidden parameters “how often do we expect to be attacked?” and “what type of attacks do we expect?” As a quick example, Ponemon’s 2018 report (Institute, 2018a) gives statistics for breach costs, but also separate figures for “mega breach” costs, with the difference being two orders of magnitude in cost. Changing an attack or response model based on these two different estimates could potentially change total costs on the order of $100M!

In summary, for most applications, estimates of attack, incident response, and triage costs will be the most influential parameters. Importantly, estimating these requires latent variables such as true/false positive rates, which are in turn very influential.
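The sensitivity argument above can be illustrated with a minimal sketch. It assumes the model is additive across a one-time term, a time-linear term, an alert-linear term, and a per-incident term; the function name, parameter names, and all input values are hypothetical and chosen only for illustration, and the breach cost is simplified to a fixed amount per incident rather than a function of time.

```python
def total_cost(months, initial, baseline_per_month, alerts_per_month,
               triage_per_alert, incidents_per_month, response_per_incident,
               breach_per_incident):
    """Additive cost sketch: one-time + time-linear + alert-linear + incident terms."""
    one_time = initial
    ongoing = baseline_per_month * months
    triage = alerts_per_month * months * triage_per_alert
    incidents = incidents_per_month * months * (
        response_per_incident + breach_per_incident)
    return one_time + ongoing + triage + incidents


# Hypothetical inputs, for illustration only.
base = dict(months=12, initial=10_000, baseline_per_month=300,
            alerts_per_month=100, triage_per_alert=80,
            incidents_per_month=1, response_per_incident=400,
            breach_per_incident=5_000)

reference = total_cost(**base)
double_initial = total_cost(**{**base, "initial": 20_000})
double_alerts = total_cost(**{**base, "alerts_per_month": 200})

print(f"reference:       ${reference:,}")
print(f"2x initial cost: ${double_initial:,}")
print(f"2x alert volume: ${double_alerts:,}")
```

With these illustrative inputs, doubling the one-time cost moves the yearly total far less than doubling the alert volume, which is the qualitative behavior described above.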

3.4. Quantifying Costs

When cost data or information on the effects of actual attacks are available, the cost model’s parameters can be computed relatively precisely. When this data is not available, such as when evaluating a new product or scoring a competition, general estimates are available using prevailing wage information, cloud hosting rates, and similar sources. To aid in application of the model, this section provides examples and reference values for the cost models introduced earlier in the section.

3.4.1. Breach costs

The costs of breaches can be estimated based on historical data, data aggregated from other organizations, and an estimate of the value of the data being protected. For example, if a single host is infected with ransomware, it may simply need to be re-imaged, and the cost of this may be simple to estimate from the labor cost of re-imaging a host (assuming that no data was exfiltrated as part of the attack). If a more advanced adversary can infiltrate the network and persist for long enough to find and exfiltrate valuable data, the cost of the breach rises dramatically after the adversary begins to steal data. Modeling the costs of such an attack will require estimates of the worth of the data in the organization and/or can rely on historical reports of similar breaches.

Using reports on breach costs from 2018, we provide an example of how to estimate an S-curve model (Section 3.1) of costs induced by an attack. The Ponemon Institute provides a yearly report giving statistics on data breach costs and related statistics (Institute, 2017, 2018b, 2018a). From the 2018 report we find “The mean time to identify [a breach] was 197 days,” and that containing a breach in less than 30 days resulted in an average cost of $3.09M, while containment taking more than 30 days cost $4.25M. We use these facts to fit f(t), the cost in $M of a breach given that discovery and containment occurred at t days. As no statistics are given about the distribution of time to discovery, we use the given average, 197 days, as a default detection time in our calculations. From the statement that containment taking more than 30 days cost $4.25M we obtain

For large t, f(t) is close to its asymptote b, hence the limit on the left approaches b, giving b = $4.25M. Next, from the second piece of data we have

Numerically solving gives the time parameter of the S-curve, which together with b = 4.25 determines our fitted S-curve breach cost model f(t). See Figure 4.

Figure 4. Plot of f(t), the S-curve estimate of attack cost given assumptions derived from Ponemon’s 2018 data (Institute, 2018a).

Further examples of attack cost estimates are given in Section 4.1.
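The fitting procedure described in this subsection can be sketched numerically. The sketch below rests on two assumptions that are ours rather than confirmed by the text: that the S-curve has the form f(t) = b·exp(−α/t²), and that the $3.09M figure is read as the average of f(t) over 0 < t ≤ 30 days. Under a different reading of the statistics, the fitted time parameter would differ.

```python
import math


def mean_scurve(b, alpha, horizon):
    """Average of f(t) = b*exp(-alpha/t**2) over 0 < t <= horizon.

    Uses the closed form of the integral:
    int_0^L exp(-a/t^2) dt = L*exp(-a/L^2) - sqrt(pi*a)*erfc(sqrt(a)/L).
    """
    root = math.sqrt(alpha)
    integral = (horizon * math.exp(-alpha / horizon ** 2)
                - math.sqrt(math.pi) * root * math.erfc(root / horizon))
    return b * integral / horizon


def fit_alpha(b, target_mean, horizon, lo=1e-6, hi=1e4, iterations=200):
    """Bisection on alpha; the mean is strictly decreasing in alpha."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_scurve(b, mid, horizon) > target_mean:
            lo = mid   # curve rises too fast, so alpha must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


b = 4.25                                # asymptote in $M (containment > 30 days)
alpha = fit_alpha(b, 3.09, 30.0)        # match $3.09M average for < 30 days
cost_at_197_days = b * math.exp(-alpha / 197 ** 2)

print(f"alpha = {alpha:.2f}")
print(f"f(197) = ${cost_at_197_days:.2f}M")
```

Under these assumptions the default detection time of 197 days sits well onto the flat part of the curve, so the modeled cost there is close to the $4.25M asymptote.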

3.4.2. Resource costs

Many of the resource costs described in Section 3.2.2 involve datacenter and hosting costs. If these are unknown for a particular organization, calculating pricing based on cloud hosting provides a real-world default for these costs. These costs are readily available from cloud hosting providers such as Amazon Web Services (AWS).

As shown in Table 3, these costs primarily depend on the volume of data generated, and on the amount of computational and memory resources needed for processing this volume of data. Table 3 includes only AWS bandwidth costs, and AWS does not charge for uploading data to its cloud services. This does not include any costs from the ISP or other bandwidth costs. Of course, in a real-world scenario, a price for uploading data would be incurred, not through AWS but through bills for internet service, power, etc. This also assumes no software licensing costs; some licenses, such as Splunk, a popular SIEM system, can significantly increase in cost based on data volume.

As noted in an earlier study of SOCs (Bridges et al., 2018):

Reported size of host data varied widely … On the low end an approximation of 300MB/day were given. One respondent works across many organizations and reported 100GB to 10TB per day, with the latter the largest estimate given during our surveys … Overall, Splunk subscription costs were cited directly by some as the constraint for data collection after mentioning they would benefit from more data collection. Perhaps this is unsurprising given estimates from the numbers above: the sheer quantity of host data collected and available to security operator centers is between 1GB-1TB/day, stored for 3 months ≈ 100 days = 100GB-100TB.

Combining the figures in the quote above with the costs from Table 3 results in baseline storage costs ranging from $10 to $10,000 per month; this shows that even though storage and bandwidth costs are very low per unit, they can be substantial across a large organization depending on what sources are collected.
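The range quoted above follows directly from the Table 3 storage rate. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and using a hypothetical helper name:

```python
STORAGE_RATE = 0.10   # dollars per GB per month (Table 3)
GB_PER_TB = 1000      # assumption: decimal units


def monthly_storage_cost(stored_gb):
    """Baseline monthly storage cost in dollars for a given retained volume."""
    return stored_gb * STORAGE_RATE


low = monthly_storage_cost(100)               # 100 GB retained
high = monthly_storage_cost(100 * GB_PER_TB)  # 100 TB retained

print(f"${low:,.0f} to ${high:,.0f} per month")
```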

Notation (Cost)       Storage                   CPU & Memory                  Bandwidth
(Baseline)            $0.10 per GB per month    $20 per instance per month    -
(Triage)              $0.10 per GB per month    -                             $0.09 per GB out
(Incident Response)   -                         -                             -
  • Table of estimated resource costs, taking into account only cloud hosting fees. This is calculated assuming long-term use of reserved t3.large AWS instances with EBS SSD at current (2018) prices. For real-world scenarios, bandwidth would include prices for internet service, etc., and fees for subscriptions that depend on data volume (e.g., the Splunk SIEM tool) would need to be added.

Table 3. Estimates of Resource Costs

Notation (Cost)       Analysts
(Triage)              $80 per alert
(Incident Response)   $400 per incident
  • Table of estimated labor costs for analysts. This assumes little or no impact on end users, which may not be true in all organizations.

Table 4. Estimates of Labor Costs

3.4.3. Estimates of Labor costs

Some estimates place an average analyst salary at around $75k to $80k per year, or about $35 to $38 per hour (based on salaries for “Cyber Security Analyst” and “Information Security Analyst” titles). This figure does not include benefits, which generally make up 30 to 40 percent of total compensation, does not include any bonuses or other non-salary compensation, and does not include any overhead costs. In the absence of more detailed information, an estimate of $70 per hour may currently be a reasonable starting point when including benefits and allowing some padding for other overhead.

Our interaction with SOC operators indicates that tens of thousands of alerts per month are automatically handled (e.g., AV firing and quarantining a file), but a much smaller number require manual investigation, usually tracked through a ticketing system. A typical ticketed alert requires several minutes to triage by tier 1 analysts, and if escalated can require hours (or potentially days) to fully investigate and remediate according to some published sources (Zimmerman, 2014; Sundaramurthy et al., 2016). Our interaction with SOC operators confirmed that tier 1 analysts spend 10 minutes per ticketed alert, that tier 2 analysts use up to 2 hours, and that tier 3 analysts’ time is potentially unbounded. If we assume that 50% of alerts can be triaged and resolved by tier 1 analysts in an average of 10 minutes, and that additional investigation by higher tiers takes an average of 2 hours, then an average alert would cost approximately 10 minutes × $70/hr ($11.67) + 50% × 2 hours × $70/hr ($70), or about $80 in total. (In reality, the 50% figure may be significantly lower; we consulted a few SOC operators who reported that about 90% of ticketed alerts advance to tier 2 analysts. This is highly dependent on the organization and the source of the alerts.)

After triage, incident response begins, i.e., handling cleanup, mitigation, and related tasks after an attack. Section 3.2.1 proposes an S curve of increasing cost over time, and with some information on the costs of incident cleanup one could fit an S curve similar to the example in Section 3.4.1. For a simpler model, if we assume an average of 6 hours for incident response at $70/hr, we obtain a cost of $420, or about $400.
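The labor estimates in this subsection reduce to a few lines of arithmetic. The sketch below reproduces them; the 2,080 working hours per year used to convert salary to an hourly wage is our assumption, and the variable names are illustrative.

```python
HOURS_PER_YEAR = 2080      # assumption: 40 hours/week for 52 weeks
LOADED_RATE = 70.0         # dollars per hour, including benefits and overhead

# Hourly wage implied by the quoted salary range (before benefits).
wage_low = 75_000 / HOURS_PER_YEAR
wage_high = 80_000 / HOURS_PER_YEAR

# Triage: every ticketed alert gets 10 minutes of tier 1 time, and an
# assumed 50% of alerts need a further 2 hours from higher tiers.
tier1_cost = (10 / 60) * LOADED_RATE
escalation_cost = 0.50 * 2 * LOADED_RATE
triage_per_alert = tier1_cost + escalation_cost

# Incident response: flat 6 hours per incident.
response_per_incident = 6 * LOADED_RATE

print(f"hourly wage: ${wage_low:.2f} to ${wage_high:.2f}")
print(f"triage per alert: ${triage_per_alert:.2f} (rounded to $80)")
print(f"response per incident: ${response_per_incident:.0f} (rounded to $400)")
```

If the escalation fraction is closer to the 90% some operators reported, the same formula gives a noticeably higher per-alert cost, so this parameter is worth estimating locally.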

4. Examples

Here we provide specific examples of using the framework. The first example explains estimates and configuration of the framework for an upcoming IARPA grand challenge, a cyber competition and the target application driving this research. Secondly, we evaluate a detection algorithm proposed in our previous work as though it were to be deployed. The second application gives examples of how the evaluation framework is useful from the point of view of the researcher in developing novel tools/algorithms, from that of the SOC in considering purchase of a new tool, and from that of a vendor deciding the worth of their product.

4.1. VirtUE Contegrity Breach Detection Challenge

A target application of this framework is evaluating detection capabilities for a competition as part of the IARPA VirtUE (Virtuous User Environment) research and development (R&D) program. The VirtUE R&D program is developing a computing environment where each of a user’s daily computing roles occupies its own isolated virtual environment (a Virtue) without significant impact on the functionality to a user, e.g., a separate Docker (Merkel, 2014) container could be launched for a user’s email browsing, Internet browsing, and SharePoint administration roles, while the user sees and interacts with a single unified desktop presenting all these roles. Building isolated virtual environments specifically for constrained, well-defined user roles creates enhanced opportunities to sense and protect those environments. VirtUE hopes to contrast this with the traditional user interface model where all user roles are merged indistinguishably into one single shared memory environment.

In the VirtUE Contegrity Challenge, competitors are tasked with accurately identifying attacks on confidentiality and integrity (contegrity) as efficiently as possible. Specifically, the competitors will employ their detection analytics to analyze the security logs of six different role-specific Virtues. Competitors will be tasked with minimizing the total amount of log data that their analytics process while accurately detecting the presence of contegrity attacks on a Virtue. Each Virtue will experience zero to two attacks over a time period of an hour, for a total of 12 possible attacks. The attacks fall into 16 categories (e.g., “Capturing or transporting encryption keys”, “Corrupting output of a computation”), and performers must identify the class of the attack with each alert. An alert with the wrong classification is considered a false positive.

4.1.1. Competition Scoring Model

The goal of this section is to produce a scoring procedure that

  • rewards accuracy of the detector,

  • rewards timeliness of detection,

  • penalizes performers for bandwidth, processing, memory, and storage use,

  • and is practical to compute for evaluating such a competition.

In short, the scoring should take into account the accuracy, timeliness, and resource requirements of the detection capability. We leverage the model above to determine a “cost of security” score for each participant’s detector, and the detector incurring the lowest cost wins.

To model the attacks, we assign a total value to the data each Virtue contains, and this provides the asymptote (b) for the S-curve model as described in Section 3.1.1. Consulting Ponemon’s 2018 report (Institute, 2018a), the average cost per client record affected in a breach was $148 in 2018. (We note that Ponemon’s report focused on stolen customer data, which may not be an accurate estimate for enterprise data.) Assuming 100 files per Virtue furnishes b = 100 × $148 = $14,800. To accommodate the 1-hour competition duration, we choose the time parameter, α, so that 50% of the maximum possible cost, b/2 = $7,400, is obtained in 5 minutes. Thus, α = 25 ln 2 ≈ 17.33. Note that while we model integrity attacks as incurring the same cost as confidentiality attacks, in an alternate scenario one may adjust the integrity model to incur greater cost than confidentiality attacks, following the assumption that in the former the adversary usually both has access to and corrupts data, while in the latter the adversary only has access. Altogether, our competition’s model for the cost of an integrity attack (in thousands of dollars) t minutes after initiation is

\[ f(t) = 14.8 \, \exp\left(-17.33 / t^{2}\right). \]

See Figure 5. Finally, for each attack administered in the competition, we charge the participant f(t) thousand dollars, where t is the attack duration, lasting from the start time of the attack to either the time of correct detection or the end of the 1-hour competition. Note that since f is increasing, this rewards early detection over later detection.

Figure 5. Plot of the S-curve f(t), the estimate of attack cost for the VirtUE challenge.
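The parameter choices for the competition's attack model can be checked with a short sketch. It assumes the S-curve has the form f(t) = b·exp(−α/t²), which is consistent with the half-cost-at-5-minutes condition and with the simulated detector costs reported later in this section; the function and constant names are ours.

```python
import math

COST_PER_RECORD = 148        # dollars (Ponemon 2018, per affected record)
FILES_PER_VIRTUE = 100       # assumption stated in the text

# Asymptote in thousands of dollars.
b = COST_PER_RECORD * FILES_PER_VIRTUE / 1000

# Time parameter: f(5) = b/2  =>  exp(-alpha/25) = 1/2  =>  alpha = 25*ln(2).
alpha = 25 * math.log(2)


def attack_cost(minutes):
    """Attack cost in thousands of dollars after `minutes` (assumed S-curve)."""
    if minutes <= 0:
        return 0.0
    return b * math.exp(-alpha / minutes ** 2)


for t in (1.5, 5, 10, 30, 60):
    print(f"f({t}) = {attack_cost(t):.3f} thousand dollars")
```

The curve is nearly flat at zero for the first minute or two, crosses half of the asymptote at 5 minutes, and is close to the full $14,800 by the end of the hour.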

For labor and resource costs, we follow Section 3.4, namely Tables 3 and 4, with some tweaks. As the goal is to evaluate the efficacy of the detection algorithms, we can ignore licensing and configuration costs for the detection software (both set to $0), which is equivalent to assuming each competitor’s software incurs the same licensing and configuration costs, as well as baseline labor costs (set to $0) and resources needed for incident response (set to $0). Competing detection capabilities will be furnished a uniform CPU and memory platform; hence, CPU and memory costs can be ignored.

As explained in Section 3.4.3, every alert will cost $80 to triage, regardless of whether the alert is a true positive or false positive. If the alert is a true positive, the attack is considered detected and remediated, and ceases to accrue cost; however, a fixed $400 fee is incurred to represent the cost of this remediation.

Ongoing resource costs for use and for triage will be traceable during the competition. To estimate these costs, we monitor the volume of data sent in or out of the detector’s analytic environment and charge a single per-volume rate of $150/MB to account for any bandwidth, storage, or per-volume subscription fees (e.g., SIEMs). Just as the expected number and time-duration of attacks (and therefore alert costs) are condensed to accommodate the 1-hour duration, we inflate the estimated data costs to make them comparable to the expected attack and alert costs.

Here we itemize this estimate. SOCs generally store logs and alerts for at least a year (e.g., see Bridges et al., 2018). This requires movement of data to a datastore, storage fees for a year, and SIEM fees. We estimate bandwidth costs at $1/GB based on the low estimate of $0.09/GB for cloud bandwidth fees (Table 3) and high estimates from mobile networks (e.g., needed by shipping vessels and deployed military units) that can cost $10-$15/GB. For 1 year of storage we consult the cloud costs in Table 3 and obtain $0.10/GB/month × 12 months = $1.20/GB. For storage and management software, we reference Splunk, costing $150/GB/month (price as of 09/25/2018). Our estimate is $150/GB/month × (1/730.5) month/hour ≈ $0.20/GB/hour for the portion of the SIEM fee incurred in this 1-hour competition. Altogether, a reasonable estimate for data moved in or out of the detector’s environment is $3.40/GB, comprising $1/GB for the observed data movement, another $1/GB assuming a copy of it is sent to long-term storage, $1.20/GB for storage fees, and $0.20/GB for SIEM fees. Finally, we need to scale this price to be comparable to the condensed attack costs in the hour. The attack volume is on the order of the number of attacks expected of a single host in perhaps a calendar year, yet the competition involves only about a fifth of the Virtues needed for a single virtual host. Consequently, to make the data costs comparable to the attack and alert triage/remediation costs, we multiply by 5 × 24 hours/day × 365 days/year = 43,800, giving $148,920/GB, or $148.92/MB. For simplicity we use $150/MB in the competition.
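The itemized data rate can be reproduced as follows. This is a sketch of the arithmetic only; it assumes 1 GB = 1000 MB and uses the SIEM figure as rounded in the text ($0.20/GB/hour), with the unrounded value shown for comparison.

```python
# Per-GB components, in dollars (figures taken from the text).
bandwidth_observed = 1.00           # observed data movement
bandwidth_copy = 1.00               # copy sent to long-term storage
storage_one_year = 0.10 * 12        # $0.10/GB/month for 12 months
siem_exact = 150 / 730.5            # $150/GB/month spread over hours in a month
siem_rounded = 0.20                 # value used in the text

per_gb = bandwidth_observed + bandwidth_copy + storage_one_year + siem_rounded

# Scale factor: a fifth of a host's Virtues, and one hour standing in for a year.
scale = 5 * 24 * 365

per_gb_scaled = per_gb * scale
per_mb_scaled = per_gb_scaled / 1000    # assumption: 1 GB = 1000 MB

print(f"SIEM share per hour: ${siem_exact:.4f}/GB (rounded to ${siem_rounded:.2f})")
print(f"unscaled rate: ${per_gb:.2f}/GB")
print(f"scaled rate:   ${per_mb_scaled:.2f}/MB")
```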

Altogether, the scoring evaluation is as follows:

  • When the competition starts, the system is in a true negative state, and only the bandwidth/storage used by the detector will be accruing costs. Total volume of data used by the detector’s virtual environment will incur a fee of $150/MB.

  • Every time an alert is given, an $80 fee will be charged for triage.

  • For every true positive alert, an additional $400 fee will be charged to represent the cost of remediating the attack.

  • Once an attack is detected, t, the time from the start of the attack until detection, is determined. The attack is considered ended and f(t) thousand dollars is charged.

  • Finally, at the end of the competition, for any ongoing (undetected) attacks their duration t (from the start of the attack until the end of the competition) is determined. For each, f(t) thousand dollars is charged.
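The scoring steps above can be sketched as a single function. This is a minimal illustration, not the competition's scoring code: the input format (alerts tagged with the attack they correctly identify, or `None` for false positives and misclassifications) is hypothetical, the S-curve form f(t) = b·exp(−α/t²) is assumed, and only the first correct alert per attack is treated as a detection.

```python
import math

B_THOUSANDS = 14.8                 # asymptote of the attack cost curve
ALPHA = 25 * math.log(2)           # half of the asymptote reached at 5 minutes


def attack_cost_dollars(minutes):
    """Assumed S-curve f(t) = b*exp(-alpha/t^2), converted to dollars."""
    if minutes <= 0:
        return 0.0
    return 1000 * B_THOUSANDS * math.exp(-ALPHA / minutes ** 2)


def score(data_mb, alerts, attack_starts, horizon=60.0):
    """Total cost in dollars (lower is better).

    alerts: list of (time_min, attack_id or None); None marks a false
            positive, including an alert with the wrong classification.
    attack_starts: dict mapping attack_id -> start time in minutes.
    """
    cost = 150.0 * data_mb             # per-volume data fee
    cost += 80.0 * len(alerts)         # triage fee for every alert
    detected = {}
    for time_min, attack_id in sorted(alerts, key=lambda a: a[0]):
        start = attack_starts.get(attack_id)
        if start is None or attack_id in detected or time_min < start:
            continue                   # not a (new) true positive
        detected[attack_id] = time_min
        cost += 400.0                  # remediation fee per detected attack
    for attack_id, start in attack_starts.items():
        end = detected.get(attack_id, horizon)   # undetected: runs to the end
        cost += attack_cost_dollars(end - start)
    return cost


# One attack starting at minute 10, detected at minute 15 after one false alarm.
example = score(data_mb=6.0,
                alerts=[(12.0, None), (15.0, "a1")],
                attack_starts={"a1": 10.0})
print(f"${example:,.2f}")
```

In the example, the total is the data fee ($900), two triage fees ($160), one remediation fee ($400), and the attack cost after 5 minutes ($7,400).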

4.1.2. Testing the Virtue IDS Evaluation Framework: Simulations & Baselines

Importantly, we seek confirmation that the scoring procedure does indeed reward a balance of accuracy, timeliness, and preservation of resources. To investigate, we simulate some attack scenarios and defense schemes and present the results. We create four separate scenarios with number of attacks, 0, 4, 8, and 12, and the attacks occurring at randomly sampled times (rounded to nearest two minutes) in the hour with replacement.

Our detection models are as follows:

  • Null detector - This detector simulates having no security measures. It uses no data, throws no alerts and has maximum time to detection (60m - attack start time).

  • Periodic hunting (10m) - This detector and the one below simulate a full system check at preset intervals without continually monitoring any data. We assume it will detect any ongoing attacks in each scan but will also incur many false positives by issuing all 16 alerts at 10m, 20m, …, 50m. Data cost is $0 as it does not monitor hosts.

  • Periodic hunting (30m) - Same as above but with only two scans at 15m and 45m.

  • Low-data, low-speed detector - This and the detectors below simulate a real-time monitoring IDS. It uses 5 MB of data (initial overhead) plus 0.1 MB per attack, i.e., 5 + 0.1n MB, where n is the number of attacks. It throws 5 alerts plus 5 per attack, i.e., 5 + 5n alerts. (Hence we assume that with no attacks we obtain 5 false positives, and we assume that for each attack it sends 4 false alerts followed by the correct fifth alert.) It detects every attack at 3 minutes.

  • Low-data medium-speed detector - Same as the above detector, but we assume it detects every attack at 1.5m.

  • High-data medium-speed detector - It uses 10 MB of data plus 1 MB per attack, i.e., 10 + n MB, where n is the number of attacks. It throws 4 alerts plus 4 per attack, i.e., 4 + 4n alerts. It detects every attack at 1.5m.

  • High-data high-speed detector - Same as above but it detects every attack at 15s.

Figure 6. For each scenario (No. Attacks = 0, 4, 8, 12), the cost incurred by each simulated detector was computed over 1000 runs and the average cost reported in the bar charts. Top bar chart presents the full results, while the lower bar chart is simply a zoom-in on the last three detectors.

Results are displayed in Figure 6, giving the average cost incurred by each simulated detector over 1000 runs in each attack scenario (0, 4, 8, 12 attacks). As expected, the Null detector (no security measures) incurs no cost if there is no attack, but averages very high attack costs in the attack scenarios. Periodic hunting for threats incurs a large cost during the investigations, but strictly limits the attack costs. Performing the scans for attacks every 10m vastly outperforms periodic hunting on 30m intervals when attacks are present. We see a dramatic drop in costs across all scenarios with attacks when real-time monitoring is assumed (last four detectors), even though only approximately 1 true alert in 5 is assumed. See the bottom bar chart for a zoom-in on the three best simulated detectors. First, note that decreasing time to detection in both the low-data and high-data detectors also decreases costs, as desired. Further, as the costs of these four models increase linearly with the number of attacks, simply looking at the 12-attack scenario suffices. Next, note that the overall best performance is by the low-data, medium-speed detector, which takes 1.5m to identify an attack ($11,010 for the 12-attack scenario). We note that while the high-data, high-speed detector detects attacks six times faster, its use of data increases its cost ($12,260 for the 12-attack scenario), but it is still slightly better than the high-data, medium-speed detector ($12,340 for the 12-attack scenario), as expected. These nearly identical costs for the two high-data detectors imply that detection within the first minute or so of an attack effectively prevents the attack; hence, it is not worth the data costs to further decrease time to detection in this case.
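Because the monitoring detectors detect every attack after a fixed delay, their costs are deterministic and can be recomputed directly from the scoring rules. The sketch below does so for the 12-attack scenario, assuming the S-curve form f(t) = b·exp(−α/t²) with b = 14.8 and α = 25 ln 2; the function names are ours.

```python
import math

B_DOLLARS = 14_800
ALPHA = 25 * math.log(2)


def attack_cost(minutes):
    """Assumed S-curve attack cost in dollars after `minutes`."""
    return B_DOLLARS * math.exp(-ALPHA / minutes ** 2) if minutes > 0 else 0.0


def monitoring_cost(n_attacks, base_mb, mb_per_attack,
                    base_alerts, alerts_per_attack, detect_minutes):
    """Cost in dollars of a real-time detector that catches every attack."""
    data = 150 * (base_mb + mb_per_attack * n_attacks)
    triage = 80 * (base_alerts + alerts_per_attack * n_attacks)
    remediation = 400 * n_attacks
    breach = n_attacks * attack_cost(detect_minutes)
    return data + triage + remediation + breach


detectors = {
    "low-data, low-speed":     (5, 0.1, 5, 5, 3.0),
    "low-data, medium-speed":  (5, 0.1, 5, 5, 1.5),
    "high-data, medium-speed": (10, 1.0, 4, 4, 1.5),
    "high-data, high-speed":   (10, 1.0, 4, 4, 0.25),
}

costs = {name: monitoring_cost(12, *params) for name, params in detectors.items()}
for name, cost in costs.items():
    print(f"{name:26s} ${cost:,.0f}")
```

Under these assumptions the three fastest detectors come out within a few dollars of the figures quoted above, and the low-data, medium-speed detector has the lowest cost.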

Overall, these are comforting results because they suggest that simple heuristics for sending alerts without actually monitoring activity will incur too large a penalty in terms of false positive costs and attack costs to be as effective as intelligent monitoring. Further, this shows that the cost model requires a balance of data use, accuracy, and timeliness to minimize costs.

We hope these simulations provide useful baselines for competitors. As a final baseline, we revisit the rule of the Gordon-Loeb (GL) model (Section 2.3.4), which states that the optimal cost of security should be bounded above by the estimated loss to attacks divided by e. For the 12-attack scenario, the Null detector (no security) simulated cost was $159,093. Dividing by e gives the GL upper bound for optimal security costs of $58,527. We note that both the periodic detectors are above this bound, while all four monitoring simulations are under it.
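The GL upper bound is a one-line computation; a minimal sketch with a hypothetical helper name:

```python
import math


def gordon_loeb_bound(expected_loss):
    """Upper bound on optimal security spending: expected loss divided by e."""
    return expected_loss / math.e


null_detector_loss = 159_093          # simulated 12-attack loss with no security
bound = gordon_loeb_bound(null_detector_loss)
print(f"GL bound: ${bound:,.0f}")
```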

4.2. GraphPrints Evaluation Example

In this section we revisit our previous work (Harshaw et al., 2016) that introduced GraphPrints, a graph-analytic network-level anomaly detector. Our goal is to provide an example of the evaluation framework as an alternative to the usual true-positive/false-positive analysis given in the original paper and commonly used for such research works. Additionally, the example illustrates how the cost-benefit analysis can benefit (1) researchers evaluating a new technology, (2) SOC operators considering adoption, as if GraphPrints were a viable commercial off-the-shelf technology, and (3) a vendor deciding on the price of such a product.

4.2.1. GraphPrints Overview

The GraphPrints algorithm processes network flow data. (Network flow data, or flows, are the metadata of IP-to-IP communications, including but not limited to the source and destination IPs, source and destination ports, protocol, timestamp, and quantities of information (bytes, packets, etc.) sent in each direction.) The algorithm builds a directed graph from a time slice (e.g., 30s) of flows. The graph’s nodes represent IPs, and directed, colored edges represent connections with port information.

Graph-level detection:

For each graph, the graphlets (small, node-induced sub-graphs) are counted. This gives a feature vector encoding the local topology of the communications in that time window. A streaming anomaly detection algorithm is performed on the sequence of graphlet vectors. Specifically, a multivariate Gaussian distribution is fit to the history of observed vectors, and new graphlet vectors with a sufficiently low p-value (equivalently, a high Mahalanobis distance) are detected as anomalous. Finally, the newly scored vector is added to the history of observations, and the process repeats upon receipt of the next vector. This provides an anomaly detector for the whole IP space represented by the graph. We note that the original GraphPrints paper (Harshaw et al., 2016) also describes a related, node-level detector (following Bridges et al. (Bridges et al., 2015)), but for the sake of clarity, we provide the evaluation for only the network-level technology.
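The streaming detection loop described above can be sketched in plain Python. This is an illustration of the general technique (fit a Gaussian to the history, score the new vector by squared Mahalanobis distance, then add it to the history), not the GraphPrints implementation: the warm-up length, the small ridge added to the covariance for invertibility, the threshold, and the toy "graphlet count" vectors are all assumptions of ours.

```python
def mean_and_cov(history, ridge=1e-6):
    """Sample mean and covariance of a list of equal-length vectors."""
    n, d = len(history), len(history[0])
    mean = [sum(v[j] for v in history) / n for j in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in history) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge            # keep the matrix invertible
    return mean, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def mahalanobis_sq(vec, mean, cov):
    """Squared Mahalanobis distance of vec from the fitted Gaussian."""
    diff = [v - mu for v, mu in zip(vec, mean)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))


def stream_scores(vectors, warmup=5):
    """Score each vector against the history so far, then add it to the history."""
    history = [list(v) for v in vectors[:warmup]]
    scores = []
    for vec in vectors[warmup:]:
        mean, cov = mean_and_cov(history)
        scores.append(mahalanobis_sq(vec, mean, cov))
        history.append(list(vec))
    return scores


# Toy graphlet-count vectors: steady traffic with one unusual time window.
windows = [[10, 5, 2], [11, 5, 3], [9, 6, 2], [10, 4, 2], [12, 5, 3],
           [10, 6, 2], [11, 4, 3], [40, 30, 20], [10, 5, 2]]
scores = stream_scores(windows)
THRESHOLD = 100.0                     # hypothetical; would be tuned in practice
flags = [s > THRESHOLD for s in scores]
print([round(s, 1) for s in scores], flags)
```

Thresholding the squared distance is equivalent to thresholding a p-value under the fitted Gaussian, since larger distances correspond to smaller tail probabilities.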

4.2.2. GraphPrints Evaluation

For testing in the original paper (Harshaw et al., 2016), real network flow data was implanted with BitTorrent traffic as a surrogate for an attack. As torrenting was against policy, it indeed constituted anomalous traffic. Secondly, it was chosen because BitTorrent traffic appears as an internal IP contacting many external, abnormal IPs and moving data, potentially similar to malware beaconing or data exfiltration.

The initial evaluation sought to show the existence of a window of thresholds for the detector that gave “good” true/false positive balance. See Figure 7. At the network level with the depicted threshold the test exhibited perfect true positive rate and 2.84% false positive rate. We manually investigated the false positives finding they were IP-scans originating from internal hosts assigned to the company’s IT staff. Presumably this was legitimate activity causing false positives, e.g., a vulnerability or asset scanning appliance.

We provide an instantiation of our evaluation framework as a more informative alternative to the true/false positive analysis of the detection capability. To estimate the initial resource cost, the cost of necessary hardware is tallied. Based on preliminary testing we conducted, to run the algorithm in real time a separate instance should be used to model roughly 2,500 IPs. That is, we expect a large network to be divided into subnets with separate GraphPrints instances per subnet, e.g., an operation with 10,000 IPs would require four GraphPrints servers. Since all costs except the initial subscription will scale with the number of instances, we neglect this factor in the analysis and note that the final figures grow linearly with the network size. We contacted a few SOCs regarding server specifications for such a technology, and they pointed us to Thinkmate and Cisco UCS C220 M4 rack servers costing approximately $2K to $15K depending on configuration options. Additionally, they mentioned adding 15% for hardware not included, e.g., racks, cords, etc. Most software used is open source (e.g., Linux OS). Altogether, we estimate an initial resource cost of $8,625 per instance. Additionally, we estimate the initial labor cost to configure the servers at one day, giving $70/hour × 8 hours = $560, following rates estimated in Section 3.4.3.

Figure 7. Original plot from GraphPrints paper (Harshaw et al., 2016) depicting network-level anomaly scores. Suggested threshold is depicted by the horizontal, red, dashed line. Vertical green dashed lines indicate beginning/ending of simulated attack (positives). Spikes above the threshold, outside the attack period are false positives.

For the baseline resource costs, we assume the operation already collects and stores network flows, so adding this technology will add only alerts to the storage costs, and will add flows and alerts to the bandwidth cost, as alerts are sent to the SIEM and flows must be sent from the flow sensor to the GraphPrints server. For storage costs, if we assume 500KB/month/server, estimates in Table 3 give 0.0005 GB/month × $0.10/GB = $0.00005/month. Since this is a negligible amount of money, we ignore this term in the estimate. For bandwidth costs, assuming $0.10/GB, a low-end estimate for data transfer since it is internal, we estimate about 15GB of flows are produced per subnet per month, giving $1.50/month. This is, again, a comparatively negligible amount, so we ignore it as well.

We estimate that each instance of GraphPrints will require weekly reconfiguration, e.g., threshold adjustment or a heuristic implemented to reduce false positives, and we allocate 1 person-hour per week per instance. From estimates in Section 3.4.3, this gives $70/week × 4 weeks/month = $280/month.

For triage and incident response costs, we reuse the estimates from the VirtUE challenge above; namely, we assume a flat average of $80 per false positive and $480 per true positive. Similarly, for breach costs we use the S curve from the VirtUE challenge given in Equation 9. Given the results above (Figure 7), we assume a perfect true positive rate with near-immediate detection (assume containment within a 1-minute response time) and a 2.84% false positive rate. For each GraphPrints instance, a scored event occurs every 30s time window, giving 86,400 events/month × 2.84% = 2,454 false positives per month, accruing 2,454 × $80 = $196,320/month. If we assume each instance incurs one attack per month, then a 1-minute response time gives response plus attack costs of approximately $480/month (attack costs are negligible with fast detection).
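The false positive load follows from the window length and the measured rate. A minimal sketch, assuming a 30-day month (which is what yields 86,400 thirty-second windows); the variable names are ours:

```python
WINDOW_SECONDS = 30
DAYS_PER_MONTH = 30            # assumption implied by the 86,400 figure
FALSE_POSITIVE_RATE = 0.0284   # measured at the suggested threshold
TRIAGE_PER_ALERT = 80          # dollars

events_per_month = DAYS_PER_MONTH * 24 * 3600 // WINDOW_SECONDS
false_positives = round(events_per_month * FALSE_POSITIVE_RATE)
monthly_triage_cost = false_positives * TRIAGE_PER_ALERT

print(f"{events_per_month:,} events, {false_positives:,} false positives, "
      f"${monthly_triage_cost:,}/month")
```

Because the number of scored events per month is so large, even a false positive rate of a few percent translates into thousands of alerts, which is why triage dominates the monthly cost here.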

Altogether, the cost of adopting this technology, neglecting licensing or subscription fees, is estimated as an initial one-time resource and configuration fee of $8,625 + $560 = $9,185 and an ongoing monthly cost of $(196,320 + 2,400 + 280) = $199,000 per instance! To put this figure in perspective, we consider two alternatives: the estimate without adopting the technology and the estimate assuming reconfiguration addresses the false positives.

Without this technology, if the lone attack per month required 10m for detection and containment, then our cost estimate is simply the $480 response cost plus the attack cost after 10 minutes, which gives $12,905/month/instance. Applying the GL rule of thumb, we see optimal security costs should be below $12,905/e = $4,747 per month per instance, or $56,970 per year per instance.

In the more interesting reconfiguration scenario, we note that the false positives found in testing occurred because legitimate network scanning appliances tripped the GraphPrints detector. Common practice for handling such false positives involves continually tuning tools (Bridges et al., 2018; Sundaramurthy et al., 2016). As we included the labor costs for monthly reconfiguration, it is reasonable to assume that each such false positive would occur one time, after which reconfiguration would prevent the same alert. In this case, there would be only the lone, first false positives in the testing window (Figure 7), so our false positive rate drops to 0.56%. With this false positive rate, we incur 86,400 events × 0.56% = 14 false positives per month per instance, for a cost of 14 × $80 = $1,120. Now the cost for adopting the GraphPrints technology is the initial $9,185 server cost plus $(1,120 + 2,400 + 280) = $3,800/month/instance. We note that this is indeed below the GL upper bound. Neglecting initial costs, the technology promises a yearly savings to customers of 12 months × ($12,905 − $3,800)/month = $109,260. With the initial costs for hardware and configuration included, we see operations will save about $100,000/year.
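The reconfiguration scenario can be checked with a short script. This is a sketch only: the variable names are illustrative, and the count of 14 false positives per month is taken as stated in the text rather than recomputed.

```python
import math

COST_PER_FALSE_POSITIVE = 80         # dollars, flat triage estimate
false_positives_per_month = 14       # taken as stated in the text
response_and_attack_cost = 2_400     # monthly term as quoted in the text
reconfiguration_cost = 280           # monthly labor for tuning
baseline_monthly_loss = 12_905       # monthly estimate without the technology
initial_cost = 9_185                 # one-time hardware and configuration

# Monthly cost per instance once tuning removes repeat false positives.
tuned_monthly_cost = (false_positives_per_month * COST_PER_FALSE_POSITIVE
                      + response_and_attack_cost + reconfiguration_cost)

# Compare against the Gordon-Loeb 1/e bound on monthly spending.
below_gl_bound = tuned_monthly_cost < baseline_monthly_loss / math.e

# Yearly savings, first neglecting and then including the one-time costs.
yearly_savings = 12 * (baseline_monthly_loss - tuned_monthly_cost)
first_year_savings = yearly_savings - initial_cost

print(tuned_monthly_cost, below_gl_bound, yearly_savings, first_year_savings)
```

The first-year figure of $100,075 is what the text rounds to about $100,000 per year.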

From the point of view of a researcher, such an analysis is enlightening, as it allows quantitative reasoning about the impact of the true/false positive analysis and resource requirements. Further, it gives a single metric to optimize when, for example, deciding a threshold for detection. From a SOC’s perspective, provided the numbers above are reasonable estimates, the conclusion is clear: if false positives can be mostly eliminated with one-off reconfiguration tuning, then this is a good investment; if not, then this is a terrible investment. We recommend a testing period to better inform both the figures estimated above and the final decision. Finally, from the perspective of the vendor, such an estimate can help dial in their yearly subscription fees. The yearly cost to use this technology is $9,185 (server cost) + 12 × $3,800 (monthly costs) = $54,785 plus subscription fees. If a subscription is required per instance (so that it scales with the number of instances), then the annual subscription cost can be bounded above by the GL bound minus the operational costs; that is, it should be less than $56,970 − $54,785 = $2,185 to keep total costs under the GL rule of thumb.
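The vendor's subscription ceiling can be derived the same way. In this sketch the names are illustrative, and the GL yearly bound is used as the rounded figure quoted in the text.

```python
initial_cost = 9_185          # one-time server and configuration cost
tuned_monthly_cost = 3_800    # monthly cost per instance after tuning
gl_yearly_bound = 56_970      # yearly GL bound per instance, as quoted

# First-year operating cost per instance, before any subscription fee.
first_year_operating_cost = initial_cost + 12 * tuned_monthly_cost

# Largest yearly subscription fee that keeps total cost under the GL bound.
max_yearly_subscription = gl_yearly_bound - first_year_operating_cost

print(first_year_operating_cost, max_yearly_subscription)
```

Since the one-time cost is counted in full here, the ceiling applies to the first year; later years would leave more room under the same bound.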

5. Conclusion

Useful security metrics are important for estimating the efficacy of new products or new technologies, for evaluating red team or competition events, and for organizations that must weigh the costs versus the benefits of security practices. As we have described, each of those three areas has developed its own generally accepted metrics, but these have been focused too narrowly and cannot easily be applied from one area to another. For example, it is currently difficult to take the statistical metrics from researcher testing of an IDS and estimate the impact on a specific organization. In this paper, we have proposed a holistic approach, which generalizes and combines the traditional metrics in these areas in a flexible framework by comprehensively modeling the various costs involved. This provides a configurable cost model that balances accuracy, timeliness, and resource use. Moreover, it is easy to interpret and analyze. To illustrate the efficacy of the new model, we tune it to be used as the scoring procedure for an upcoming IARPA IDS competition, and use simulated attack/defense scenarios to test the efficacy of the cost framework. Our results support that the model promotes a balance of accuracy, response time, and resource use. Finally, we exhibit the use of this new model to evaluate a new security tool from multiple points of view, specifically those of the researcher, the SOC (client), and the vendor. Our results show the model can provide clear and actionable insights for each.

The authors would like to thank the many reviewers who have helped polish this document. Special thanks to Kerry Long for his insights and guidance during our authorship of this paper, to Miki Verma, Dave Richardson, Brian Jewell, and Jason Laska for helpful discussions, and to the many SOC operators who provided consultation in the preparation of this manuscript. The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the Department of Energy (DOE) under contract D2017-170222007. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.



  • Acquisti et al. (2006) Alessandro Acquisti, Allan Friedman, and Rahul Telang. 2006. Is there a cost to privacy breaches? An event study. ICIS 2006 Proceedings (2006), 94.
  • Anderson and Moore (2006) Ross Anderson and Tyler Moore. 2006. The economics of information security. Science 314, 5799 (2006), 610–613.
  • Axelsson (2000) Stefan Axelsson. 2000. The Base-rate Fallacy and the Difficulty of Intrusion Detection. ACM Trans. Inf. Syst. Secur. 3, 3 (Aug. 2000), 186–205.
  • Barnum (2012) Sean Barnum. 2012. Standardizing cyber threat intelligence information with the Structured Threat Information eXpression (STIX). MITRE Corporation 11 (2012), 1–22.
  • Baryshnikov (2012) Yuliy Baryshnikov. 2012. IT Security Investment and Gordon-Loeb’s 1/e Rule. In WEIS.
  • Bridges et al. (2015) R. A. Bridges, J. P. Collins, E. M. Ferragut, J. A. Laska, and B. D. Sullivan. 2015. Multi-level anomaly detection on time-varying graph data. In 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). 579–583.
  • Bridges et al. (2018) Robert A. Bridges, Michael D. Iannacone, John R. Goodall, and Justin M. Beaver. 2018. How do information security workers use host data? A summary of interviews with security analysts. CoRR abs/1812.02867 (2018). arXiv:1812.02867
  • Brugger and Chow (2007) S Terry Brugger and Jedidiah Chow. 2007. An assessment of the DARPA IDS Evaluation Dataset using Snort. UCDAVIS department of Computer Science 1, 2007 (2007), 22.
  • Butler (2002) Shawn A Butler. 2002. Security attribute evaluation method: a cost-benefit approach. In Proceedings of the 24th International Conference on Software Engineering. ACM, 232–240.
  • Caltagirone et al. (2013) Sergio Caltagirone, Andrew Pendergast, and Christopher Betz. 2013. The diamond model of intrusion analysis. Technical Report. Center for Cyber Intelligence Analysis and Threat Research, Hanover, MD.
  • Creech and Hu (2013) Gideon Creech and Jiankun Hu. 2013. Generation of a new IDS test dataset: Time to retire the KDD collection. In Wireless Communications and Networking Conference (WCNC), 2013. IEEE, Shanghai, China, 4487–4492.
  • CyberSponse (2017) CyberSponse. 2017. The Difference between the Security Operations Center (SOC) & Network Operations Center (NOC). white paper, online. (2017).
  • Davis (2005) Adrian Davis. 2005. Return on security investment–proving it’s worth it. Network Security 2005, 11 (2005), 8–10.
  • Doupé et al. (2011) Adam Doupé, Manuel Egele, Benjamin Caillat, Gianluca Stringhini, Gorkem Yakin, Ali Zand, Ludovico Cavedon, and Giovanni Vigna. 2011. Hit’em where it hurts: a live security exercise on cyber situational awareness. In Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 51–61.
  • Drinkwater and Zurkus (2017) Doug Drinkwater and Kacy Zurkus. 2017. Red team versus blue team: How to run an effective simulation. (July 2017).
  • Elkan (2000) Charles Elkan. 2000. Results of the KDD’99 classifier learning. Acm Sigkdd Explorations Newsletter 1, 2 (2000), 63–64.
  • FireEye (2016) FireEye. 2016. The Uncomfortable Cyber Security Tradeoff. (January 2016).
  • Garcia et al. (2014) Sebastian Garcia, Martin Grill, Jan Stiborek, and Alejandro Zunino. 2014. An empirical comparison of botnet detection methods. computers & security 45 (2014), 100–123.
  • Glass-Vanderlan et al. (2018) T. R. Glass-Vanderlan, M. D. Iannacone, M. S. Vincent, Q. Chen, and R. A. Bridges. 2018. A Survey of Intrusion Detection Systems Leveraging Host Data. ArXiv e-prints (May 2018). arXiv:cs.CR/1805.06070
  • Gordon and Loeb (2002) Lawrence A Gordon and Martin P Loeb. 2002. The economics of information security investment. ACM Transactions on Information and System Security (TISSEC) 5, 4 (2002), 438–457.
  • Gordon and Loeb (2006) Lawrence A Gordon and Martin P Loeb. 2006. Managing cybersecurity resources: a cost-benefit analysis. Vol. 1. McGraw-Hill New York.
  • Gordon et al. (2015) Lawrence A Gordon, Martin P Loeb, William Lucyshyn, and Lei Zhou. 2015. Externalities and the magnitude of cyber security underinvestment by private sector firms: a modification of the Gordon-Loeb model. Journal of Information Security 6, 1 (2015), 24.
  • Gordon et al. (2016) Lawrence A Gordon, Martin P Loeb, and Lei Zhou. 2016. Investing in cybersecurity: Insights from the Gordon-Loeb model. Journal of Information Security 7, 02 (2016), 49.
  • Hahn et al. (2015) Adam Hahn, Roshan K Thomas, Ivan Lozano, and Alvaro Cardenas. 2015. A multi-layered and kill-chain based security analysis framework for cyber-physical systems. International Journal of Critical Infrastructure Protection 11 (2015), 39–50.
  • Harshaw et al. (2016) Christopher R Harshaw, Robert A Bridges, Michael D Iannacone, Joel W Reed, and John R Goodall. 2016. Graphprints: towards a graph analytic method for network anomaly detection. In Proceedings of the 11th Annual Cyber and Information Security Research Conference. ACM, New York, NY, 1–15.
  • Hutchins et al. (2011) Eric M Hutchins, Michael J Cloppert, and Rohan M Amin. 2011. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research 1, 1 (2011), 80.
  • IC3 (2016) FBI IC3. 2016. 2016 Internet Crime Report. (2016).
  • Institute (2010) American National Standards Institute. 2010. The Financial Management of Cyber Risk: An Implementation Framework for CFOs. (2010).
  • Institute (2017) Ponemon Institute. 2017. 2017 Cost of Data Breach Study. (June 2017).
  • Institute (2018a) Ponemon Institute. 2018a. 2018 Cost of Data Breach Study: Global Overview. (June 2018).
  • Institute (2018b) Ponemon Institute. 2018b. The Third Annual Study on the Cyber Resilient Organization. (March 2018).
  • Intellectual Point (2016) Intellectual Point. 2016. Cyber Security: Security Operations Center (SOC) vs. Network Operations Center (NOC). white paper, online. (2016).
  • Jewell and Beaver (2011) Brian Jewell and Justin Beaver. 2011. Host-based data exfiltration detection via system call sequences. In ICIW2011-Proceedings of the 6th International Conference on Information Warfare and Security: ICIW. Academic Conferences Limited, England, 134.
  • Karabacak and Sogukpinar (2005) Bilge Karabacak and Ibrahim Sogukpinar. 2005. ISRAM: information security risk analysis method. Computers & Security 24, 2 (2005), 147–159.
  • Kountouras et al. (2016) Athanasios Kountouras, Panagiotis Kintis, Chaz Lever, Yizheng Chen, Yacin Nadji, David Dagon, Manos Antonakakis, and Rodney Joffe. 2016. Enabling Network Security Through Active DNS Datasets. Springer International Publishing, Cham, 188–208.
  • Lelarge (2012) Marc Lelarge. 2012. Coordination in network security games: a monotone comparative statics approach. IEEE Journal on Selected Areas in Communications 30, 11 (2012), 2210–2219.
  • Leversage and Byres (2008) David John Leversage and Eric James Byres. 2008. Estimating a system’s mean time-to-compromise. IEEE Security & Privacy 6, 1 (2008).
  • Liapounoff (1940) AA Liapounoff. 1940. Sur les fonctions-vecteurs completement additives. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya 4, 6 (1940), 465–478.
  • Lippmann et al. (2000) Richard P Lippmann, David J Fried, Isaac Graf, Joshua W Haines, Kristopher R Kendall, David McClung, Dan Weber, Seth E Webster, Dan Wyschogrod, Robert K Cunningham, et al. 2000. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In DARPA Information Survivability Conference and Exposition, 2000. DISCEX’00. Proceedings, Vol. 2. IEEE, 12–26.
  • Mahoney and Chan (2003) Matthew Mahoney and Philip Chan. 2003. An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In Recent advances in intrusion detection. Springer, Amsterdam, The Netherlands, 220–237.
  • McHugh (2000) John McHugh. 2000. Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory. ACM Transactions on Information and System Security (TISSEC) 3, 4 (2000), 262–294.
  • Merkel (2014) Dirk Merkel. 2014. Docker: lightweight linux containers for consistent development and deployment. Linux Journal 2014, 239 (2014), 2.
  • Mullins et al. (2007) Barry E Mullins, Timothy H Lacey, Robert F Mills, Joseph E Trechter, and Samuel D Bass. 2007. How the cyber defense exercise shaped an information-assurance curriculum. IEEE Security & Privacy 5, 5 (2007).
  • of the University of New Mexico (2006) The Regents of the University of New Mexico. 2006. Sequence-based intrusion detection. (2006).
  • Patriciu and Furtuna (2009) Victor-Valeriu Patriciu and Adrian Constantin Furtuna. 2009. Guide for designing cyber security exercises. In Proceedings of the 8th WSEAS International Conference on E-Activities and information security and privacy. World Scientific and Engineering Academy and Society (WSEAS), 172–177.
  • Reed et al. (2013) Theodore Reed, Kevin Nauer, and Austin Silva. 2013. Instrumenting competition-based exercises to evaluate cyber defender situation awareness. In International Conference on Augmented Cognition. Springer, 80–89.
  • Rowe and Gallaher (2006) Brent R Rowe and Michael P Gallaher. 2006. Private sector cyber security investment strategies: An empirical analysis. In The fifth workshop on the economics of information security (WEIS06).
  • Sabhnani and Serpen (2004) Maheshkumar Sabhnani and Gursel Serpen. 2004. Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set. Intelligent data analysis 8, 4 (2004), 403–415.
  • Sharafaldin et al. (2017) Iman Sharafaldin, Amirhossein Gharib, Arash Habibi Lashkari, and Ali A Ghorbani. 2017. Towards a Reliable Intrusion Detection Benchmark Dataset. Software Networking 2017, 1 (2017), 177–200.
  • Sommer and Paxson (2010) Robin Sommer and Vern Paxson. 2010. Outside the closed world: On using machine learning for network intrusion detection. In Security and Privacy (SP), 2010 IEEE Symposium on. IEEE, 305–316.
  • Sonnenreich et al. (2006) Wes Sonnenreich, Jason Albanese, Bruce Stout, et al. 2006. Return on security investment (ROSI)-a practical quantitative model. Journal of Research and practice in Information Technology 38, 1 (2006), 45.
  • Sundaramurthy et al. (2016) Sathya Chandran Sundaramurthy, John McHugh, Xinming Ou, Michael Wesch, Alexandru G Bardas, and S Raj Rajagopalan. 2016. Turning contradictions into innovations or: How we learned to stop whining and improve security operations. In Proc. 12th Symp. Usable Privacy and Security.
  • Tardella (1990) Fabio Tardella. 1990. A New Proof of the Lyapunov Convexity Theorem. SIAM Journal on Control and Optimization 28, 2 (1990), 478–481.
  • Tavallaee et al. (2009) Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A Ghorbani. 2009. A detailed analysis of the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on. IEEE, Ottawa, ON, Canada, 1–6.
  • Tipton and Nozaki (2007) Harold F Tipton and Micki Krause Nozaki. 2007. Information security management handbook. CRC press.
  • Tsiakis and Stephanides (2005) Theodosios Tsiakis and George Stephanides. 2005. The economic approach of information security. Computers & security 24, 2 (2005), 105–108.
  • Verendel (2009) Vilhelm Verendel. 2009. Quantified security is a weak hypothesis: a critical survey of results and assumptions. In Proceedings of the 2009 workshop on New security paradigms workshop. ACM, 37–50.
  • Werther et al. (2011) Joseph Werther, Michael Zhivich, Tim Leek, and Nickolai Zeldovich. 2011. Experiences in Cyber Security Education: The MIT Lincoln Laboratory Capture-the-Flag Exercise. In CSET.
  • Zimmerman (2014) Carson Zimmerman. 2014. Ten strategies of a world-class cybersecurity operations center. (2014).