Decisions are often based on imprecise, uncertain or vague information. Likewise, the consequences of an action are often equally unpredictable, thus putting the decision maker into a twofold jeopardy. Assuming that the effects of an action can be modeled by a random variable, the decision problem boils down to comparing different effects (random variables) by comparing their distribution functions. Although the full space of probability distributions cannot be ordered, a properly restricted subset of distributions can be totally ordered in a practically meaningful way. We call these loss-distributions, since they provide a substitute for the concept of loss-functions in decision theory. This article introduces the theory behind the necessary restrictions and the total ordering on random loss variables that they make constructible, which enables decisions under uncertainty of consequences. Using data obtained from simulations, we demonstrate the practical applicability of our approach.
1 Introduction
In many practical situations, decision making is a matter of urgent and important choices being based on vague, fuzzy and mostly empirical information. While reasoning under uncertainty in the sense of making decisions with known consequences under uncertain preconditions is a well-researched field (cf. [1, 2, 3, 4, 5, 6, 7] to name only a few), taking decisions with uncertain consequences has received substantially less attention. This work presents a decision framework to take the best choice from a set of options, whose consequences or benefit for the decision maker are available only in terms of a random variable. More formally, we describe a method to choose the better of two random variables by constructing a novel stochastic order on a suitably restricted subset of probability distributions. Our ordering will be total, so that the preference between two actions with random consequences is always well-defined and a decision can be made. As has been shown in [8, 9], there exist several applications where such a framework of decision making on abstract spaces of random variables is needed.
To illustrate our method, we will use a couple of example data sets, the majority of which come from the risk management context. In risk management, decisions typically have uncertain consequences that cannot be measured by a conventional von Neumann-Morgenstern utility function. For example, a security incident in a large company can either be made public, or kept secret. The uncertainty in this case comes either from the public community’s response, if the incident is made public (as analyzed by, e.g., ), or from the residual risk of information leakage (e.g., by whistleblowing). The question here is: Which is the better choice, given that the outcomes can be described by random variables? For such a scenario, suitable methods to determine the consequence distributions using simulations are available, but those methods do not support the decision making process directly.
Typically, risk management is concerned with extreme events, since small distortions may be covered by the natural resilience of the analyzed system (e.g., by an organization’s infrastructure or the enterprise itself, etc.). For this reason, decisions normally depend on the distribution’s tails. Indeed, heavy- and fat-tailed distributions are common choices to model rare but severe incidents in general risk management [11, 12]. We build our construction with this requirement of risk management in mind, but start from the recognized importance that the moments of a distribution play for decision making (cf.). In Section 3, we illustrate a simple use of the first moment in this regard that is common in IT risk management, to motivate the need to include more information in a decision. Interestingly, the ordering that we define here is based on the full moment sequence (cf. Definition 2), but implies similar conditions as other stochastic orders, only with an explicit focus on the probability mass located in the distribution’s tails (cf. Theorem 2). Further, we pick some example data sets from risk management applications in Section 5.2, and demonstrate how a decision can be made based on empirical data.
The main contribution of this work is twofold. First, while any stochastic order could be used for decision making on actions with random variables describing their outcome, not all of them are equally suitable in a risk management context. The ordering we present in this article is specifically designed to fit into this area. Second, the technique of constructing the ordering is new and perhaps of independent scientific interest, with applications beyond our context. In its theoretical parts, this work is a condensed version of [14, 15]; it extends this preliminary research by practical examples and concrete algorithms to efficiently choose best actions despite random consequences and with a sound practical meaning.
2 Preliminaries and Notation
Sets, random variables and probability distribution functions are denoted by upper-case letters like $X$ or $F$. Matrices and vectors are denoted by bold-face upper- and lower-case letters, respectively. The symbols $|X|$ and $|a|$ denote the cardinality of the finite set $X$ and the absolute value of the scalar $a \in \mathbb{R}$, respectively. The $k$-fold cartesian product (with $k = \infty$ permitted) is $X^k$, and $X^\infty$ is the set of all infinite sequences over $X$. Calligraphic letters like $\mathcal{F}$ denote families (sets) of sets or functions. The symbol ${}^*\mathbb{R}$ denotes the space of hyperreal numbers, being a certain quotient space constructed as ${}^*\mathbb{R} = \mathbb{R}^\infty / \mathcal{U}$, where $\mathcal{U}$ is a free ultrafilter. We refer to [16, 17] for details, as ${}^*\mathbb{R}$ is only a technical vehicle whose detailed structure is less important than the fact that it is a totally ordered field. Our construction of a total ordering on loss distributions will crucially hinge on an embedding of random variables into ${}^*\mathbb{R}$, where a natural ordering and full-fledged arithmetic are already available without any further effort.
The symbol $X \sim F_X$ means the random variable (RV) $X$ having distribution $F_X$, where the subscript is omitted if things are clear from the context. The density function of $F_X$ is denoted by its respective lower-case letter $f_X$. We call an RV continuous, if it takes values in $\mathbb{R}$, and discrete, if it takes values on a countably infinite set. A categorical RV is one with only finitely many, say $n$, distinct outcomes. In that case, the density function can be treated as a vector $f_X = (p_1, \ldots, p_n) \in \mathbb{R}^n$.
3 The Decision Framework
Our decision problems will concern choosing actions of minimal loss. Formally, if $AS$ is a set of actions, from which we ought to choose the best one, then a loss-function is usually some mapping $L: AS \to \mathbb{R}$, so that an optimal choice from $AS$ is one with minimal loss under $L$ (see  for a full-fledged treatment and theory in the context of Bayesian decision theory). In IT risk management (being used to illustrate our methods later in Section 5.2), risk is often quantified by
$$\text{risk} = \text{damage} \times \text{likelihood}, \tag{1}$$
which roughly resembles the idea of understanding risk as the expectation of damage. In this quantitative approach, the damage is captured by the aforementioned loss function $L$, whereas the likelihood is obtained from the distribution of the random event causing the damage.
However, losses cannot always be measured precisely. For the introductory example, consider the two actions “publish the incident” and “keep the incident secret”. Either choice has unpredictable consequences, so we replace the deterministic loss-function by a random variable. That is, let $a, b \in AS$ be two arbitrary actions, and write $X := L(a)$ and $Y := L(b)$, respectively, for the random losses implied by taking these actions. The challenge now is to make a decision that minimizes the risk when losses are random.
Obviously, comparing the random losses $X$ and $Y$ in the way suggested by (1), i.e., by their expectations alone, is insufficient, since random variables with equal means can differ considerably in their higher moments (this issue also arises in game theory [19], where the utility of mixed strategies is exactly the expectation of outcomes but normally disregards further moments). For the example of two Gaussian variables $X \sim N(\mu, \sigma_1^2)$ and $Y \sim N(\mu, \sigma_2^2)$ with $\sigma_1 < \sigma_2$, the expectations are equal, but actions resulting in losses measured by $Y$ are undesirable relative to $X$, since the fluctuation around the mean for $Y$ is considerably larger than for actions with consequences described by $X$. An apparent quick fix is to take the variance into account for a decision. However, the previous issue is still not mitigated, since it is equally easy (yet only slightly more involved) to construct two random variables with equal first and second moments, but with different third moments (Example 5 will give two such distributions explicitly). Indeed, the third moment can be taken into account in the straightforward way, which has been discussed in the literature on risk attitudes; see [13, 20, 21] for a few starting references. Towards a more sophisticated approach, we will in the following use the whole object (the random variable) rather than a few representative values thereof to make a decision.
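The point about equal expectations can be checked numerically. The following is a minimal sketch, assuming two hypothetical Gaussian losses with common mean 5 and standard deviations 1 and 3 (these parameters are illustrative choices, not values taken from the text). It uses the standard recurrence $E(X^k) = \mu E(X^{k-1}) + (k-1)\sigma^2 E(X^{k-2})$ for the raw moments of a Gaussian.

```python
def gaussian_raw_moments(mu, sigma, kmax):
    """Raw moments E(X^0), ..., E(X^kmax) of a Gaussian N(mu, sigma^2).

    Uses the recurrence E(X^k) = mu*E(X^(k-1)) + (k-1)*sigma^2*E(X^(k-2)).
    """
    moments = [1.0, float(mu)]
    for k in range(2, kmax + 1):
        moments.append(mu * moments[k - 1] + (k - 1) * sigma ** 2 * moments[k - 2])
    return moments[: kmax + 1]


# Hypothetical parameters: same mean, different spread.
m_x = gaussian_raw_moments(mu=5, sigma=1, kmax=4)
m_y = gaussian_raw_moments(mu=5, sigma=3, kmax=4)

for k in range(1, 5):
    print(k, m_x[k], m_y[k])
```

A comparison based on the first moment alone sees no difference between the two losses, whereas every moment of order two or higher is larger for the variable with the larger spread.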
3.1 The Usual Stochastic Order
Choosing a best action among $a, b \in AS$, we ought to compare the random variables $X, Y$ in some meaningful way. Without any further restrictions on the support or distribution, we may take the usual stochastic order $\leq_{st}$ for that purpose, which calls $X \leq_{st} Y$ if and only if
$$\Pr(X > t) \leq \Pr(Y > t) \quad \text{for all } t \in (-\infty, \infty). \tag{2}$$
Condition (2) can be stated equivalently by demanding $E(\phi(X)) \leq E(\phi(Y))$ for all increasing functions $\phi$ for which the expectations exist (so-called test-functions). In the latter formulation, it is easy to see that, for example, the $\leq_{st}$-ordering in particular entails $E(X) \leq E(Y)$, so that a comparison based on (1) comes out the same under $\leq_{st}$. Moreover, if $X$ and $Y$ are restricted to take on only positive values, as we assume for our loss variables, then $X \leq_{st} Y$ implies that all moments are in pairwise $\leq$-order, since the functions $x \mapsto x^k$ delivering them are all increasing on $[0, \infty)$. Under this restriction, comparisons based on the second and third moment  are also covered under $\leq_{st}$.
3.2 Generalizing $\leq_{st}$: The $\preceq$-Ordering
In cases where it is sufficient to lower risk below an acceptance threshold, rather than truly minimizing it, we may indeed relax the $\leq_{st}$-ordering in several ways: we can require (2) only for large damages in $[t_0, \infty)$ for a threshold $t_0$ that may be different for various application domains, or we may not use all increasing functions, but only a few selected ones (our construction will use the latter and entail the former relaxation). Given that moments are being used to analyze risks and are related to risk attitudes , let us take the functions $\phi_k(x) = x^k$ for $k \in \mathbb{N}$, which are all increasing on $[0, \infty)$. To assure the existence of all moments and the monotonicity of all members in our restricted set of test-functions, we impose the following assumptions on a general random variable $X$, which we hereafter use to quantitatively model “risk”:
Definition 1. Let $\mathcal{F}$ be the set of all random variables $X$ that satisfy the following conditions:
1. $X$ has a known distribution $F$ with compact support (note that this implies that $X$ is upper-bounded).
2. $X \geq 1$ (w.l.o.g., since $X$ is bounded, we can shift it into the region $[1, \infty)$).
3. The probability measure induced by $F$ is either discrete or continuous and has a density function $f$. For continuous random variables, the density function is assumed to be continuous, and piecewise polynomial over a finite partition of the support.
Requirement 1 assures that all moments exist. Requirements 2 and 3 serve technical reasons that will be made clear in Lemma 2. In brief, these two assure that the ordering obtained will be total, and simplify proofs by defining the order as equal to the natural ordering of hyperreal numbers. The requirement of the density to be piecewise polynomial over a finite number of segments is necessary to avoid families of distributions with alternating moments, such as were constructed by . Such families include even some benign members with monotone densities, and have an order that explicitly depends on the ultrafilter, which we do not want (this will be made rigorous in Theorem 1 below). Assuming that the density, if it is continuous, has a finite piecewise polynomial definition excludes these pathological cases. This exclusion is mild, since it can be shown [24, Lemma 3.1] that most distributions can be approximated to arbitrary precision by distributions satisfying the requirements of Definition 1. The permission to restrict our attention to moments rather than the whole random variable is given by the following well-known fact:
Lemma 1. Let two random variables $X, Y$ have their moment generating functions $\mu_X(s), \mu_Y(s)$ exist within a neighborhood $U_\varepsilon(0)$. Assume that $E(X^k) = E(Y^k)$ for all $k \in \mathbb{N}$. Then $X$ and $Y$ have the same distribution.
Proof (Sketch). The proof is a simple matter of combining well-known facts about power-series and moment-generating functions (see  for a description).
In the following, let us write $m_X(k) := E(X^k)$ to mean the $k$-th moment of a random variable $X$. Our next lemma establishes a total relation (so far not an ordering) between two random variables from $\mathcal{F}$, on which our ordering will be based:
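Lemma 2. For any two probability distributions $F_1, F_2$ and associated random variables $R_1 \sim F_1$, $R_2 \sim F_2$ according to Definition 1, there is a $K \in \mathbb{N}$ so that the moments $m_{R_i}(k) = E(R_i^k)$ satisfy either $m_{R_1}(k) \leq m_{R_2}(k)$ for all $k \geq K$, or $m_{R_1}(k) > m_{R_2}(k)$ for all $k \geq K$.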
The proof of Lemma 2 is given in the appendix. The important fact stated here is that between any two random variables $R_1, R_2 \in \mathcal{F}$, either a $\leq$ or a $\geq$ ordering holds asymptotically on the moment sequence. Hence, we can take Lemma 2 to justify the following relaxation of the usual stochastic order:
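Definition 2 ($\preceq$-Preference Relation over Probability Distributions).
Let $R_1, R_2 \in \mathcal{F}$ be two random variables with distribution functions $F_1, F_2$. We prefer $R_1$ over $R_2$, respectively the distribution $F_1$ over $F_2$, written as
$$F_1 \preceq F_2 \;:\Longleftrightarrow\; \exists K \in \mathbb{N} \text{ such that } E(R_1^k) \leq E(R_2^k) \text{ for all } k \geq K.$$
Strict preference is denoted and defined as
$$F_1 \prec F_2 \;:\Longleftrightarrow\; \exists K \in \mathbb{N} \text{ such that } E(R_1^k) < E(R_2^k) \text{ for all } k \geq K.$$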
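To make the moment-based preference tangible, the following is a minimal sketch that compares the moment sequences of two small discrete loss distributions exactly (using rational arithmetic). The two distributions and the helper names `moment` and `tail_index` are hypothetical choices for illustration. Note that checking moments only up to a finite order `kmax` gives numerical evidence for a preference, not a proof, since the definition quantifies over all sufficiently large orders.

```python
from fractions import Fraction


def moment(dist, k):
    """k-th raw moment of a discrete distribution given as {value: probability}."""
    return sum(p * Fraction(x) ** k for x, p in dist.items())


def tail_index(dist_a, dist_b, kmax=50):
    """Smallest K such that m_a(k) <= m_b(k) for all K <= k <= kmax.

    Returns None if the inequality already fails at k = kmax.
    """
    K = None
    for k in range(kmax, 0, -1):
        if moment(dist_a, k) <= moment(dist_b, k):
            K = k
        else:
            break
    return K


# Hypothetical losses on supports inside [1, infinity).
d1 = {1: Fraction(4, 5), 4: Fraction(1, 5)}   # mostly harmless, rarely severe
d2 = {2: Fraction(1, 2), 3: Fraction(1, 2)}   # always moderate

print(moment(d1, 1), moment(d2, 1))  # first moments favour d1
print(tail_index(d2, d1))            # but the moments of d2 are smaller from here on
print(tail_index(d1, d2))            # the reverse inequality fails for large k
```

In this example the expectation of `d1` is smaller, yet from the fourth moment onwards the moments of `d2` are smaller, so the moment-based relation prefers `d2`: the small probability that `d1` puts on the largest loss dominates the growth of its moment sequence.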
For this definition to be a meaningful ordering, we need to show that $\preceq$ behaves like other orderings, say $\leq$ on the real numbers. We get all useful properties almost for free, by establishing an isomorphism between $\preceq$ and another well-known ordering, namely the natural order $\leq$ on the hyperreal space ${}^*\mathbb{R}$:
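Theorem 1. Let $\mathcal{F}$ be according to Definition 1. Assume every element $X \in \mathcal{F}$ to be represented by the hyperreal number $x = (E(X^k))_{k \in \mathbb{N}} \in \mathbb{R}^\infty / \mathcal{U}$, where $\mathcal{U}$ is any free ultrafilter. Let $X, Y \in \mathcal{F}$ be arbitrary. Then, $X \preceq Y$ if $x \leq y$ in ${}^*\mathbb{R}$, irrespectively of $\mathcal{U}$.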
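Proof (cf. ).
Let $F_1, F_2$ be two probability distributions, and let $R_1 \sim F_1$, $R_2 \sim F_2$. Lemma 2 assures the existence of some $K \in \mathbb{N}$ so that $F_1 \preceq F_2$ iff $E(R_1^k) \leq E(R_2^k)$ whenever $k \geq K$. Let $L$ be the set of indices where $E(R_1^k) \leq E(R_2^k)$; then the complement set $\mathbb{N} \setminus L$ is finite (it has at most $K - 1$ elements). Let $\mathcal{U}$ be an arbitrary free ultrafilter. Since $\mathbb{N} \setminus L$ is finite, it cannot be contained in $\mathcal{U}$ as $\mathcal{U}$ is free. And since $\mathcal{U}$ is an ultrafilter, it must contain the complement of a set, unless it contains the set itself. Hence, $L \in \mathcal{U}$, which implies the claim.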
Theorem 1 has quite some useful implications: first, the asserted independence of the ultrafilter spares us the need to explicitly construct $\mathcal{U}$ (note that the general question of whether or not non-isomorphic hyperreal fields would arise from different choices of ultrafilters is still unanswered by the time of writing this article). Second, the $\preceq$-ordering on $\mathcal{F}$ inherits all properties (e.g., transitivity) of the natural ordering $\leq$ on ${}^*\mathbb{R}$, which by the transfer principle , hold in the same way as for $\leq$ on $\mathbb{R}$. More interestingly for further applications, topological properties of the hyperreals can also be transferred to $\mathcal{F}$. This allows the definition of a whole game theory on top of $\preceq$, as was started in . It must be noted, however, that the $\preceq$-ordering still behaves differently from $\leq$ on $\mathbb{R}$, since, for example, the equivalence relation induced by $\preceq$ does not entail an identity between distributions (since a finite number of moments is allowed to mismatch in any case).
Interestingly, although not demanded in first place, the use of moments to compare a distribution entails a similar fact as inequality (2) upon which the usual stochastic order was defined:
Theorem 2. Let $X, Y \in \mathcal{F}$ have the distributions $F_1, F_2$. If $X \preceq Y$, then there exists a threshold $x_0$ so that for every $x \geq x_0$, we have $\Pr(X > x) \leq \Pr(Y > x)$.
The proof of this appears in the appendix. Intuitively, Theorem 2 can be rephrased into saying that:
If $F_1 \preceq F_2$, then “extreme events” are less likely to occur under $F_1$ than under $F_2$.
Summarizing the results obtained, we can say that the $\preceq$-ordering somewhat resembles the initial definition of the usual stochastic order $\leq_{st}$, up to the change of restricting the range from $(-\infty, \infty)$ to a subset of $[1, \infty)$ and allowing a finite number of moments to behave arbitrarily. Although this allows for an explicit disregard of the first few moments, the overall effect of choosing a $\preceq$-minimal distribution is shifting all the probability mass towards regions of lower damages, which is a consequence of Theorem 2. As such, this result could by itself be taken as a justification to define this ordering in the first place. However, in the way developed here, the construction is rooted in moments and their recognized relation to risk attitudes [13, 20, 21], and in the end aligns itself with both the intuition behind $\leq_{st}$ and the focus of risk management on extreme events, without ever having stated this as a requirement to begin with. Still, by converting Theorem 2 into a definition, we could technically drop the assumption of losses being $\geq 1$. We leave this as an avenue for future research. As a justification of the restrictions as stated, note that most risk management in the IT domain is based on categorical terms (see [25, 26, 27, 28, 29]), which naturally map into integer ranks $\geq 1$. Thus, our assumption seems mild, at least for IT risk management applications (applications in other contexts like insurance  are not discussed here and constitute a possible reason for dropping the lower bound in future work).
3.3 Distributions with Unbounded Tails
Theorem 2 tells us that distributions with thin tails would be preferred over those with fat tails. However, catastrophic events are usually modeled by distributions with fat, heavy or long tails. The boundedness condition in Definition 1 rules out many such distributions relevant to risk management (e.g., financial risk management ). Thus, our next step is extending the ordering by relaxing some of the assumptions that characterize $\mathcal{F}$.
The $\preceq$-relation cannot be extended to cover distributions with heavy tails, as those typically do not have finite moments or moment generating functions. For example, Lévy’s $\alpha$-stable distributions  are not analytically expressible as densities or distribution functions, so the expression $\Pr(X > t) \leq \Pr(Y > t)$ from (2) could be quite difficult to work out for the usual stochastic order. Conversely, resorting to moments, we can work with characteristic functions, which can be much more feasible in practice.
Nevertheless, such distributions are important tools in risk management. Things are, however, not drastically restricted, for at least two reasons:
Compactness of the support is not necessary for all moments to exist, as the Gaussian distribution has moments of all orders and is supported on the entire real line (thus violating even two of the three conditions of Definition 1). Still, it is characterized entirely by its first two moments, and thus can easily be compared in terms of the $\preceq$-relation.
Any distribution with infinite support can be approximated by a truncated distribution. Given a random variable $X$ with distribution function $F$, the truncated distribution is the conditional likelihood $\hat{F}(x) = \Pr(X \leq x \mid a \leq X \leq b)$.
By construction, the truncated distribution has the compact support $[a, b]$. More importantly, for a loss distribution with unbounded support $[1, \infty)$ and given any $\varepsilon > 0$, it is easy to choose a compact interval $[a, b]$ large enough inside $[1, \infty)$ so that $|F(x) - \hat{F}(x)| < \varepsilon$ for all $x$. Hence, restricting ourselves to distributions with compact support, i.e., adopting the first condition of Definition 1, causes no more than a numerical error that can be made as small as we wish.
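The truncation argument can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical loss with a shifted exponential distribution on $[1, \infty)$ (chosen only because its distribution function has a closed form); the helper names are illustrative. It builds the distribution function conditioned on $a \leq X \leq b$ and measures the largest deviation from the original distribution function on a grid.

```python
import math


def truncate_cdf(cdf, a, b):
    """CDF of a continuous random variable X conditioned on a <= X <= b."""
    mass = cdf(b) - cdf(a)

    def truncated(x):
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (cdf(x) - cdf(a)) / mass

    return truncated


def shifted_exponential_cdf(x):
    """Hypothetical loss distribution: 1 + Exp(1), supported on [1, infinity)."""
    return 1.0 - math.exp(-(x - 1.0)) if x > 1.0 else 0.0


def max_deviation(cdf, other, lo, hi, steps=10_000):
    """Largest absolute difference of two CDFs on an equidistant grid."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(cdf(x) - other(x)) for x in grid)


truncated = truncate_cdf(shifted_exponential_cdf, a=1.0, b=12.0)
error = max_deviation(shifted_exponential_cdf, truncated, lo=1.0, hi=30.0)
print(error)
```

For this particular distribution the deviation is of the order of the tail mass beyond the truncation point, so enlarging $b$ shrinks the error exponentially fast; for heavier tails the same accuracy requires a much larger interval.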
More interestingly, we could attempt to play the same trick as before, and characterize a distribution with fat, heavy or long tails by a sequence of approximations to it, arising from better and better accuracy $\varepsilon \to 0$. In that sense, we could hope to compare approximations rather than the true density in an attempt to extend the preference and equivalence relations $\preceq$ and $\equiv$ to distributions with fat, heavy or long tails.
Unfortunately, such hope is an illusion, as a distribution is not uniquely characterized by a general sequence of approximations (i.e., we cannot formulate an equivalent to Lemma 1), and the outcome of a comparison of approximations is not invariant to how the approximations are chosen (i.e., there is also no analogue of Lemma 2). To see the latter, take the quantile function $F^{-1}(\alpha)$ for a distribution $F$, and consider the tail quantiles $\overline{F}^{-1}(\alpha) := F^{-1}(1 - \alpha)$. Pick any sequence $(\alpha_n)_{n \in \mathbb{N}}$ with $\alpha_n \to 0$. Since $\lim_{x \to \infty} F(x) = 1$, the tail quantile sequence behaves like $\overline{F}^{-1}(\alpha_n) \to \infty$, where the limit is independent of the particular sequence $(\alpha_n)_{n \in \mathbb{N}}$, but only the speed of divergence is different for distinct sequences.
Now, let two distributions $F_1, F_2$ with infinite support be given. Fix two sequences $(\alpha_n)_{n \in \mathbb{N}}$ and $(\omega_n)_{n \in \mathbb{N}}$, both vanishing as $n \to \infty$ and chosen so that
$$a_n := \overline{F}_1^{-1}(\alpha_n) < \overline{F}_2^{-1}(\omega_n) =: b_n. \tag{4}$$
Let us approximate $F_1$ by a sequence of truncated distributions $\hat{F}_{1,n}$ with supports $[1, a_n]$ and let the sequence $\hat{F}_{2,n}$ approximate $F_2$ on $[1, b_n]$. Since $a_n < b_n$ for all $n$, it is easily verified that the moment sequences of the distributions truncated to $[1, a_n]$ and $[1, b_n]$ diverge so that $\hat{F}_{1,n} \preceq \hat{F}_{2,n}$ ultimately. However, by replacing the “$<$” by a “$>$” in (4), we can construct approximations to $F_1, F_2$ whose truncated supports overlap one another in the reverse way, so that the approximations would always satisfy $\hat{F}_{1,n} \succeq \hat{F}_{2,n}$. It follows that the sequence of approximations cannot be used to unambiguously compare distributions with infinite support, unless we impose some constraints on the tails of the distributions and the approximations. The next lemma (see appendix A.3 for a proof) assumes this situation to simply not occur, which allows us to give a sufficient condition to unambiguously extend strict preference in the way we wish.
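Lemma 3. Let $F_1, F_2$ be two distributions supported on $[1, \infty)$ with continuous densities $f_1, f_2$. Let $(a_n)_{n \in \mathbb{N}}$ be an arbitrary sequence with $a_n \to \infty$ as $n \to \infty$, and let $\hat{F}_{i,n}$ for $i = 1, 2$ be the truncated distribution $F_i$ supported on $[1, a_n]$.
If there is a constant $c < 1$ and a value $x_0$ such that $f_1(x) < c \cdot f_2(x)$ for all $x \geq x_0$, then there is a number $N$ such that all approximations satisfy $\hat{F}_{1,n} \prec \hat{F}_{2,n}$ whenever $n \geq N$.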
By virtue of Lemma 3, we can extend the strict preference relation to distributions that satisfy the hypothesis of the lemma but need not have compact support anymore. Precisely, we would strictly prefer one distribution over the other, if all its truncated approximations are ultimately preferable over those of the other.
Definition 3 (Extended Preference Relation $\prec$).
Let $F_1, F_2$ be distribution functions of nonnegative random variables that have infinite support and continuous density functions $f_1, f_2$. We (strictly) prefer $F_1$ over $F_2$, denoted as $F_1 \prec F_2$, if for every sequence $a_n \to \infty$ there is an index $N$ so that the truncated approximations $\hat{F}_{i,n}$ for $i = 1, 2$ satisfy $\hat{F}_{1,n} \prec \hat{F}_{2,n}$ whenever $n \geq N$.
The $\succ$-relation is defined alike, i.e., the ultimate preference of $F_2$ over $F_1$ on any sequence of approximations.
It is a matter of simple algebra to verify that any two out of the three kinds of extreme value distributions (Gumbel, Fréchet, Weibull) satisfy the above condition, and thus are strictly preferable over one another, depending on their particular parametrization.
Definition 3 cannot, however, be applied to every pair of distributions, as the following example shows.
Take the “Poisson-like” distributions with parameter ,
It is easy to see that no constant can ever make and that all moments exist. However, neither distribution is preferable over the other, since finite truncations to based on the sequence will yield alternatingly preferable results.
An occasionally simpler condition that implies the hypothesis of Definition 3 is
$$\lim_{x \to \infty} \frac{f_1(x)}{f_2(x)} = 0. \tag{5}$$
The reason is simple: if the condition of Definition 3 were violated, then there is an infinite sequence $(x_n)_{n \in \mathbb{N}}$ for which $f_1(x_n) \geq c \cdot f_2(x_n)$ for all $c < 1$. In that case, there is a subsequence $(x_{n_k})_{k \in \mathbb{N}}$ for which $\lim_{k \to \infty} f_1(x_{n_k}) / f_2(x_{n_k}) \geq c$. Letting $c \to 1$, we can construct a further subsequence of $(x_{n_k})_{k \in \mathbb{N}}$ to exhibit that $\limsup_{x \to \infty} f_1(x) / f_2(x) \geq 1 > 0$, so that condition (5) would be refuted. Observe that (5) is similar to the definition of a likelihood ratio order  in the sense that it implies both a likelihood ratio and a $\prec$-ordering. Note, however, that a likelihood ratio order does not necessarily imply a $\prec$-order, since the former only demands $f_2 / f_1$ to be increasing, but not a $<$-relation among the densities.
It must be emphasized that the above line of arguments does not provide us with a means to extend the $\preceq$- or $\equiv$-relations accordingly. For example, an attempt to define $\preceq$ and $\equiv$ as above is obviously doomed to failure, as asking for two densities to satisfy $f_1(x) \leq c_1 \cdot f_2(x)$ ultimately (note the intentional relaxation of $<$ towards $\leq$), and $f_2(x) \leq c_2 \cdot f_1(x)$ ultimately for two constants $c_1, c_2 < 1$ is nonsense.
A straightforward extension of $\preceq$ can be based on the conclusion of Lemma 3:
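Definition 4. Let $F_1, F_2$ be two distributions supported on the entire nonnegative real half-line with continuous densities $f_1, f_2$. Let $(a_n)_{n \in \mathbb{N}}$ be a sequence diverging towards $\infty$, and let $\hat{f}_{i,n}$ for $i = 1, 2$ denote the density $f_i$ truncated to have support $[1, a_n]$, with $\hat{F}_{i,n}$ the respective distribution. We define $F_1 \preceq F_2$ if and only if for every sequence $(a_n)_{n \in \mathbb{N}}$ there is some index $N$ so that $\hat{F}_{1,n} \preceq \hat{F}_{2,n}$ for every $n \geq N$.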
More compactly and informally speaking, Definition 4 demands preference on all approximations with finite support except for at most finitely many exceptions near the origin.
Obviously, preference among distributions with finite support implies the extended preference relation to hold in exactly the same way (since the sequence of approximations will ultimately become constant when $a_n$ overshoots the bound of the support), so Definition 4 extends the $\preceq$-relation in this sense.
3.4 Comparing Distributions of Mixed Type
The representation of a distribution by the sequence of its moments is of the same form for discrete, categorical and continuous random variables. Hence, working with sequence representations (hyperreal numbers) allows us to compare continuous to discrete and categorical variables, as long as there is a meaningful common support. For categorical variables, the framework itself, up to the results stated so far, remains unchanged and is applied to the category’s ranks instead. The ranking is then made in ascending order of loss severity, i.e., the category with lowest rank (index) should be the one with the smallest damage magnitude (examples are found in IT risk management standards like ISO 27005 or the more generic ISO 31000 as well as related standards).
A comparison of mixed types is, obviously, only meaningful if the respective random variables live in the same (metric) space. For example, it would be meaningless to compare ordinal to numeric data. Some applications in natural risk management define categories as numeric ranges (such as [25, 26, 27, 28]), which could make a comparison of categories and numbers meaningful (but not necessarily so).
It must be noted that Definition 2 demands only the existence of some index after which the moment sequences are ordered, without giving any condition to assure this. Likewise, Theorem 2 is non-constructive in asserting the existence of a region onto which the $\preceq$-smaller distribution puts less mass than the other. Hence, practical matters of deciding and interpreting the $\preceq$-ordering are necessary and discussed in the following.
In general, if the two distributions are supported on the sets $[1, a]$ for $F_1$ and $[1, b]$ for $F_2$ with $b > a$, then the mass that $F_2$ puts on the set $(a, b]$ will cause the moments of $F_2$ to grow faster than those of $F_1$. In that case, we can thus immediately conclude $F_1 \preceq F_2$, and we get $x_0 = a$ in Theorem 2. Thus, the more interesting situation arises when the supports are identical, which is assumed throughout the following subsections. Observe that it is herein not necessary to look at overlaps at the lower end of the supports, since the mass assigned near the “right end” of the support is what determines the growth of the moment sequence; the proof of Lemma 2 in the appendix shows this more rigorously.
4.1 Deciding between Categorical Variables
Let $F_1, F_2$ be two distributions over a common support, i.e., a common finite set of categories, hereafter denoted in descending order as $c_1 > c_2 > \ldots > c_n$. Let $f_1 = (p_1, \ldots, p_n)$ and $f_2 = (q_1, \ldots, q_n)$ be the corresponding probability mass functions, where $p_i$ and $q_i$ are the probabilities of category $c_i$ under $F_1$ and $F_2$. For example, these can be normalized histograms (empirical density functions) computed from the available data to approximate the unknown distributions of the random variables $X_1 \sim F_1$, $X_2 \sim F_2$.
Letting the category $c_i$ correspond to its rank $n - i + 1$ within the support, it is easy to check that the expectation of $X_1^k$ by definition is $E(X_1^k) = \sum_{i=1}^n (n - i + 1)^k p_i$, a sequence in $k$ whose growth is determined by whichever distribution puts more mass on categories of high loss. Formally, if $p_1 < q_1$, then $F_1 \preceq F_2$, since the growth of either sum is determined by the largest term (here being $n^k p_1$, respectively $n^k q_1$). Upon the equality $p_1 = q_1$, we can remove the respective terms from both sums (as they are equal), to see whether the second-largest term tips the scale, and so on.
Overall, we end up observing that $\preceq$-comparing distributions is quite simple, and a special case of another common ordering relation:
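Definition 5 (lexicographic ordering).
For two real-valued vectors $x = (x_1, x_2, \ldots)$ and $y = (y_1, y_2, \ldots)$ of not necessarily the same length, we define $x <_{lex} y$ if and only if there is an index $i_0$ so that $x_{i_0} < y_{i_0}$ and $x_i = y_i$ whenever $i < i_0$.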
Our discussion from above is then the mere insight that the following is true:
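Let $X_1 \sim F_1$, $X_2 \sim F_2$ be two categorical random variables with a common ordered support $c_1 > c_2 > \ldots > c_n$, and let $f_1, f_2$ be the respective (empirical) density functions. Then $F_1 \preceq F_2 \iff f_1 \leq_{lex} f_2$, where $f_j = (\Pr(X_j = c_1), \ldots, \Pr(X_j = c_n))$.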
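The lexicographic decision rule for categorical losses can be sketched as follows. This is a minimal illustration under the assumption that each histogram lists its categories from highest to lowest loss (matching the descending ordering of the support used above); the counts and the function names are hypothetical. Python tuples already compare lexicographically, and exact rational arithmetic avoids rounding issues when frequencies are nearly equal.

```python
from fractions import Fraction


def normalize(counts):
    """Turn integer category counts into exact relative frequencies."""
    total = sum(counts)
    return tuple(Fraction(c, total) for c in counts)


def lex_compare(counts1, counts2):
    """Return -1 if the first histogram is preferred, 1 if the second, 0 if tied.

    Both histograms list categories from highest to lowest loss.
    """
    f1, f2 = normalize(counts1), normalize(counts2)
    if len(f1) != len(f2):
        raise ValueError("histograms must share a common support")
    return (f1 > f2) - (f1 < f2)


def moment(freqs, k):
    """k-th moment when the i-th listed category (0-based) has rank n - i."""
    n = len(freqs)
    return sum(p * (n - i) ** k for i, p in enumerate(freqs))


# Hypothetical counts for the categories (high, medium, low).
a = [1, 5, 4]
b = [2, 1, 7]

print(lex_compare(a, b))
fa, fb = normalize(a), normalize(b)
for k in (1, 2, 3, 4, 5):
    print(k, moment(fa, k) < moment(fb, k))
```

Here the first histogram puts less mass on the highest category and is therefore preferred, even though its expectation is larger; the moment comparison printed afterwards agrees with the lexicographic decision from the fourth moment onwards.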
4.2 Deciding between Continuous Variables
Let us assume that the two random variables $R_1 \sim F_1$, $R_2 \sim F_2$ have smooth densities $f_1, f_2 \in C^\infty([1, a])$ for some $a > 1$. Under this assumption, we can switch to yet another useful sequence representation:
$$f \mapsto y := \left((-1)^k f^{(k)}(a)\right)_{k = 0, 1, 2, \ldots} \tag{6}$$
Given two distributions $F_1, F_2$, e.g., constructed from a Gaussian kernel (cf. remark 2 below), let the respective representations according to (6) be $y_1, y_2$. Then, it turns out that the lexicographic ordering of $y_1, y_2$ implies the same ordering w.r.t. $\preceq$, or formally:
$$y_1 <_{lex} y_2 \implies F_1 \preceq F_2.$$
This implication will be demonstrated on our example data set #3, in connection with a kernel density estimate, in Section 5. Practically, we can thus decide the $\preceq$-relation by numerically computing derivatives of increasing order, until the decision is made by the lexicographic ordering (which, for our experiments, happened already at zeroth order in many cases).
The assumption on differentiability is indeed mild, as we can cast any integrable density function $f$ into a $C^\infty$-function by convolution $f * k_h$ with a Gaussian density $k_h$ with zero mean and variance $h$. Clearly, $f * k_h \in C^\infty$ by the differentiation theorem of convolution. Moreover, letting $h \to 0$, we even have $L^1$-convergence of $f * k_h \to f$, so that the approximation can be made arbitrarily accurate by choosing the parameter $h$ sufficiently small. Practically, when the distributions are constructed from empirical data, the convolution corresponds to a kernel density estimation (i.e., a standard nonparametric distribution model). Using a Gaussian kernel then has the additional appeal of admitting a closed form of the $k$-th derivatives $(f * k_h)^{(k)}$, involving Hermite polynomials.
Observation – “”:
As an intermediate résumé, the following can be said:
Under a “proper” representation of the distribution (histogram or continuous kernel density estimate), the $\preceq$-order can be decided as a humble lexicographic order.
This greatly simplifies matters of practically working with $\preceq$-preferences, and also fits into the intuitive understanding of risk and its formal capture by Theorem 2: whichever distribution puts more mass on far-out regions is less favourable under $\preceq$.
4.3 Comparing Deterministic to Random Effects
On certain occasions, the consequence of an action may result in perfectly foreseeable effects, such as fines or similar. Such deterministic outcomes can be modeled as degenerate distributions (point- or Dirac-masses). These are singular and thus outside $\mathcal{F}$ by Definition 1. Note that the canonical embedding of the reals within the hyperreals represents a number $a \in \mathbb{R}$ by the constant sequence $(a, a, a, \ldots)$. Picking up this idea would be critically flawed in our setting, as any such constant sequence would be preferred over any probability distribution (whose moment sequence diverges and thus overshoots $a$ inevitably and ultimately).
However, it is easy to work out the moment sequence of the constant as for all . In this form, the -relation between the number and the continuous random variable supported on can be decided as follows:
If , then : to see this, choose so that is strictly positive on a compact set (note that such a set must exist as is continuous and the support ranges until ). We can lower-bound the -th moment of as
Note that the infimum is positive as is strictly positive on the compact set . The lower bound is essentially an exponential function to a base larger than , since , and thus (ultimately) grows faster than .
If , then , since – in any possible realization – leads to strictly less damage than . The formal argument is now based on an upper bound to the moments, which can be derived as follows:
It is easy to see that for , this function grows slower than as , which leads to the claimed -relation.
If , then we apply the mean-value theorem to the integral occurring in to obtain an for which
for all . Hence, in that case. An intuitive explanation stems from the fact that may assign positive likelihood to events with less damage as , whereas a deterministic outcome is always larger or equal to anything that can deliver.
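The three cases can be illustrated numerically. The sketch below compares the moment sequence $a^k$ of a constant loss $a$ with the moments of a random loss that is, purely as an illustrative assumption, uniformly distributed on $[1, b]$; the symbols $a$, $b$ and all function names are chosen here and are not taken from the text above.

```python
def uniform_moment(k, b):
    """k-th moment of the uniform distribution on [1, b], with b > 1."""
    return (b ** (k + 1) - 1.0) / ((k + 1) * (b - 1.0))

def eventual_dominance_index(a, b, k_max=200):
    """Smallest k0 with E[X^k] > a^k for all k0 <= k <= k_max.

    Returns None if the random loss does not exceed the constant at k_max.
    Only finitely many moments are inspected, so this is a numerical hint,
    not a proof.
    """
    last_fail = 0
    for k in range(1, k_max + 1):
        if uniform_moment(k, b) <= a ** k:
            last_fail = k
    return last_fail + 1 if last_fail < k_max else None

# constant below, at, and above the right end of the support [1, 3]
for a in (2.5, 3.0, 3.5):
    print(a, eventual_dominance_index(a, 3.0))
```

For a constant strictly inside the support, the first few moments of the random loss stay below those of the constant before the exponential growth takes over, which shows why the first moments alone do not decide the preference.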
4.4 On the Interpretation of the Ordering and Inference
The practical meaning of the -preference is more involved than just a matter of comparing the first few moments. Indeed, unlike for IT risk preferences based on (1), the first moment can be left unconstrained while may still hold in either direction.
For general inference, the comparison of two distributions provides a necessary basis (i.e., to define optimality, etc.). For example, (Bayesian) decision theory or game theoretic models can be defined upon , via a much deeper exploration of the embedding of into the hyperreals (by mapping a distribution to its moment sequence), such as the induced topology and calculus based on it. In any case, however, we note that the previous results may help in handling practical matters of inside a more sophisticated statistical decision or general inference process. For practical decisions, some information can be obtained from the value that Theorem 2 speaks about. This helps assessing the meaning of the order, although the practical consequences implied by or are somewhat similar. The main difference is (2) holding only for values in case of . The threshold can hence be found by numerically searching for the largest (“right-most”) intersection point of the respective survival functions; that is, for two distributions , a valid in Theorem 2 is any value for which for all . An approximation of , e.g., computed by a bisective search in common support of both distributions, then more accurately describes the “statistically best” among the available actions, since losses are more likely for all other options. A practical decision, or more general inference based on , should therefore be made upon computing as an explicit auxiliary information, in order to assign a quantitative meaning to “extreme events” in the interpretation underneath Theorem 2. Further issues of practical decision making in the context of IT risk management are discussed along the first empirical example found in section 5.2.
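The numerical search for the right-most intersection point of two survival functions can be sketched as a grid scan from the right followed by bisection. This is a minimal sketch, not the procedure used for the examples; the Gumbel survival function, the function names, and the grid and iteration counts are assumptions made for illustration.

```python
import math

def gumbel_survival(x, loc, scale):
    """Survival function 1 - F(x) of a Gumbel distribution."""
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t
    return -math.expm1(-math.exp(-(x - loc) / scale))

def rightmost_crossing(surv1, surv2, lo, hi, grid=2000, iters=60):
    """Approximate the largest x in [lo, hi] where surv1 - surv2 changes sign.

    Returns None if no sign change is found on the grid.
    """
    def diff(x):
        return surv1(x) - surv2(x)

    step = (hi - lo) / grid
    right = hi
    for _ in range(grid):
        left = right - step
        if diff(left) * diff(right) < 0:
            # refine the bracket [left, right] by bisection
            for _ in range(iters):
                mid = 0.5 * (left + right)
                if diff(left) * diff(mid) <= 0:
                    right = mid
                else:
                    left = mid
            return 0.5 * (left + right)
        right = left
    return None
```

To the right of the returned value, the sign of the difference of the survival functions no longer changes within the scanned range, so the distribution with the smaller survival function there is the one that makes large losses less likely.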
Section 5 will not discuss (statistical) inference, since the details are beyond the scope of this work (we leave this to follow-up work). Instead, the following section is dedicated to numerical illustrations of the ordering only, without assigning any decisional meaning to the preferred distributions. For each example, we will also give an approximate (not the optimal) value of the threshold from Theorem 2.
5 Numerical Examples
Let us now apply the proposed framework to the problem of comparing effects that are empirically measurable, when the precise action/response dynamics is unknown. We start in Section 5.1 with some concrete parametric models of extreme value distributions, to exemplify numerical comparisons of distributions with unbounded tails.
In Section 5.3, we will describe a step-by-step evaluation of our ordering on empirical distributions. The sources and context of the underlying empirical data sets are described in Section 5.2. From the data, we will compile non-parametric distribution models, which are either normalized histograms or kernel density estimators. On these, we will show how to decide the preference relation using the results from Section 4.
5.1 Comparing Parametric Models
We skip the messy algebra tied to the verification of the criteria in Section 3.3, and instead compute the moments numerically to illustrate the growth/divergence of moment sequences as implied by Lemma 2.
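A numerical moment computation of this kind can be sketched with a plain trapezoidal rule. The sketch assumes that the density is restricted to a bounded interval $[1, c]$ and renormalized there; the interval, the Gumbel parameters in the usage lines, and all function names are hypothetical and do not reproduce the parameters of the examples.

```python
import math

def gumbel_pdf(x, loc, scale):
    """Density of a Gumbel distribution with location loc and scale scale."""
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale

def truncated_moments(pdf, lo, hi, k_max, steps=20000):
    """Moments E[X^k], k = 1..k_max, of pdf restricted to [lo, hi].

    The restricted density is renormalized to total mass one; integration
    uses the trapezoidal rule on an equidistant grid.
    """
    dx = (hi - lo) / steps
    xs = [lo + i * dx for i in range(steps + 1)]
    # trapezoidal weights: half weight at both end points
    ws = [pdf(x) * (0.5 if i in (0, steps) else 1.0) * dx
          for i, x in enumerate(xs)]
    mass = sum(ws)
    return [sum(w * x ** k for w, x in zip(ws, xs)) / mass
            for k in range(1, k_max + 1)]

# hypothetical parameters: same scale, different location
m1 = truncated_moments(lambda x: gumbel_pdf(x, 10.0, 2.0), 1.0, 60.0, 5)
m2 = truncated_moments(lambda x: gumbel_pdf(x, 12.0, 2.0), 1.0, 60.0, 5)
print(m1)
print(m2)
```

Printing the two sequences side by side shows how quickly the gap between them widens with the order of the moment.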
Example 3 (different mean, same variance).
Consider two Gumbel-distributions and , where a density for is given by
where and are the location and scale parameter.
Computations reveal that under the given parameters, the means are and . Figure 1 plots the respective densities of (dashed) and (solid line). The respective moment sequences evaluate to
thus illustrating that . This is consistent with the intuition that the preferred distribution gives less expected damage. The concrete region about which Theorem 2 speaks is at least for damages (cf. Theorem 2).
Example 4 (same mean, different variance).
Let us now consider two Gumbel-distributions and , for which but .
Figure 2 plots the respective densities of (dashed) and (solid line). The respective moment sequences evaluate to
thus illustrating that . This is consistent with the intuition that among two actions leading to the same expected loss, the preferred one would be one for which the variation around the mean is smaller; thus the loss prediction is “more stable”. The range on which damages under are less likely than under begins at (cf. Theorem 2).
Example 5 (different distributions, same mean and variance).
Let us now consider a situation in which the expected loss (first moment) and variation around the mean (second moment) are equal, but the distributions are different in terms of their shape. Specifically, let and , with densities as follows:
Figure 3 plots the respective densities of (dashed) and (solid line). The respective moment sequences evaluate to
thus illustrating that . In this case, going with the distribution that visually “leans more towards lower damages” would be flawed, since nonetheless assigns larger likelihood to larger damages. The moment sequence, on the contrary, unambiguously points out as the preferred distribution (the third moment tips the scale here; cf. [13, 20, 21]). The statistical assurance entailed by Theorem 2 about an interval in which high damage incidents are less likely (at least) includes losses .
5.2 Empirical Test Data and Methodology
To demonstrate how the practical matters of comparing distributions work, we will use three sets of empirical data, based on qualitative data from risk estimation, and based on simulating a malware outbreak using percolation.
Test Data Set #1 – IT Risk Assessments:
The common quantitative understanding of risk by the formula (1) is easily recognized as the expectation (i.e., first moment) of a loss distribution. Although standard in quantitative IT risk management, its use is discouraged by the German Federal Office of Information Security (BSI) for several reasons besides the shortcomings that we discussed here (for example, statistical data may be unavailable at the desired precision and an exact formula like (1) may create the illusion of accuracy where there is none).
Best practices in risk management (ranging up to norms like the ISO 27005 [33], the ISO 31000 [34] or the OCTAVE Allegro framework [36]) usually recommend the use of qualitative risk scales. That is, the expert is only asked to utter an opinion about the risk being “low/medium/high” or perhaps using a slightly more fine-grained but in any case ordinal scale. In a slight abuse of formalism, these categories are then still carried into an evaluation of (1) towards finding the decision with the “least” risk in the qualitative terms that (1) gives.
Categorical risk assessments are heavily used in the IT domain due to their good systematization and tool support. Our first test data set is thus a risk assessment made in terms of the Common Vulnerability Scoring System (CVSS) [29]. The CVSS ranks risks on a scale from 0 to 10, as a decimal rounded to one decimal place. Usually, these CVSS values come from domain experts, so there is an intrinsic ambiguity in the opinions on grounds of which a decision shall be made. Table 1, taken from [38] (by kind permission of the author A. Beck), shows an example of such expert data for two security system installments being assessed by experts in the left and right part of the table (separated by the double vertical line). The ordering shall now help to choose the better of the two options, based on the ambiguous and even inconsistent domain expert inputs. For simplicity of the example, we did not work with the fine-grained CVSS scores, but coarsened them into three categories, i.e., score intervals labeled low (L), medium (M) and high (H). We remark that the categorical assessment was added in this work, and is not from the source literature.
Test Data Set #2 – Malware Outbreaks:
Computer malware infections are continuously reported in the news, with an early and prominent example having been the Stuxnet worm in 2008 [39], which infected the Iranian uranium enrichment facilities. Ever since, the control and supervision of cyber-physical systems has gained much importance in risk management, since attacks on the computer infrastructure may have wide effects ranging up to critical supply infrastructures such as water supply, power supply, and many others (e.g., oil, gas or food supply networks).
The general stealthiness of such infections makes an exact assessment of risk difficult. A good approach to estimate that risk is to apply outbreak simulation models, for example based on percolation theory [40, 41]. These simulations provide us with possible infection scenarios, in which the number of infected nodes (after a fixed period of time) can be averaged into a probability distribution describing the outcome of an infection. Repeating the simulation with different system configurations yields various outcome distributions. An example for a network with 20 nodes and 1000 repetitions per simulation is displayed in Table 2. The preference relation shall then help to decide which configuration is better in minimizing the risk of a large outbreak.
|size of the outbreak||config. 1||config. 2|
Test Data Set #3 – Nile Water Level:
As a third data set, we use one that ships with the statistical software suite R. Concretely, we will look at the dataset Nile, which consists of measurements of the annual flow of the Nile river between 1871 and 1970. For comparisons, we will divide the data into two groups of 50 observations each (corresponding to 50 years). The decision problem associated with it is the question of which period was more severe in terms of water level. Extending the decision problem to more than two periods would then mean searching for a trend within the data. Unlike a numerical trend, such as a sliding mean, we would here have a “sliding empirical distribution” to determine the trend in terms of randomness.
5.3 Comparing Empirical Distributions
With the three data sets as described, let us now look into how decisions based on the empirical data can be made.
Categorical Data – Comparing Normalized Histograms:
Compiling an empirical distribution from the example CVSS data in Table 1 gives the histograms shown in Figure 4. Clearly, scenario 1 is preferable here, as it is less frequently rated with high damage than scenario 2. However, the decision is much less informed than it would be if the full numeric data had been used. We will thus revisit this example later.
For the simulated malware infection data in Table 2, the empirical distribution of the number of affected nodes as shown in Figure 5 is obtained by normalization of the corresponding histograms. In this case, configuration 2 is preferable, as the maximal damage of 20 nodes occurred less often than in configuration 1.
Continuous Data – Comparing Kernel Density Estimates:
If the data itself is known to be continuous, then a nonparametric distribution estimate can be used to approximate the unknown distribution.
In the following, let us write $\hat{f}_h$ to mean a general kernel density estimate based on the data (observations) $x_1, \ldots, x_n$, of the form
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),$$
where $K$ is the chosen kernel function, and $h > 0$ is a bandwidth parameter, whose choice is up to any (of many existing) heuristics (see [42, 43] among others). Computing a kernel density estimate from data is most conveniently done by invoking the density command within the R statistical computing software [44].
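For readers working outside R, a Gaussian kernel density estimate with a Silverman-type bandwidth can be sketched in a few lines. The bandwidth function below follows, to the best of our understanding, the rule behind R's default `nrd0` choice (0.9 times the smaller of the standard deviation and the interquartile range divided by 1.34, times $n^{-1/5}$); the Python function names are hypothetical, and the sketch makes no claim to reproduce the output of R's `density` command exactly.

```python
import math
import statistics

def bw_nrd0(data):
    """Silverman-type rule-of-thumb bandwidth (sketch of R's nrd0 rule)."""
    n = len(data)
    sd = statistics.stdev(data)  # sample standard deviation
    # quartiles by linear interpolation between order statistics
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    if spread == 0:
        # fallbacks for degenerate samples
        spread = sd or abs(data[0]) or 1.0
    return 0.9 * spread * n ** (-0.2)

def gaussian_kde(data, h=None):
    """Return the Gaussian kernel density estimate of `data` as a function."""
    h = bw_nrd0(data) if h is None else h
    norm = len(data) * h * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

    return density
```

A call such as `f = gaussian_kde(observations)` returns a function that can be evaluated at arbitrary points, for instance to plot or compare two estimates on a common grid.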
This comparison of two kernel density estimates (KDE) is illustrated in Figure 6. For that purpose, we divided the test dataset #3 (data(Nile) in R) into observations covering the years 1871-1920 and 1921-1970. For both sets the density is estimated with a Gaussian kernel and the default bandwidth choice nrd0 (Silverman’s rule ) yielding a KDE with bandwidth for the years 1871-1920 and a KDE with bandwidth for the years 1921-1970. Further we have the maximal observed values and , and see that
Therefore (and also by visual inspection of Figure 6), the density for the period from 1871 until 1920 assigns higher likelihood to high water levels than the density for the period from 1921 until 1970, thus indicating a “down-trend” in terms of the ordering.
Using Gaussian Kernels:
Commonly, the kernel density approximation is constructed using Gaussian kernels per default, in which case takes the form . Definition 1 is clearly not met, but there is also no immediate need to resort to the extended version of as given by Definition 4. Indeed, if we simply truncate the KDE at any point into the distribution and remember that , the truncated kernel density estimate is again a -density, as required by Lemma 4.
Returning to the CVSS example data in Table 1, we constructed Gaussian kernel density estimates for both scenarios (with bandwidths chosen by the default Silverman’s rule in R); plots of which are given in Figure 7. Using the criterion of Lemma 4 in connection with the lexicographic ordering, we end up finding that scenario 2 is preferred, in contrast to our previous finding. This is, however, only an inconsistency at first glance, and nevertheless intuitively meaningful if we consider the context of the decision and the effect of the nonparametric estimation more closely:
Since scenario 2 is based on more data than scenario 1, its bandwidth is smaller than that of scenario 1. This has the effect of the distribution being “more condensed” around higher categories, as opposed to the distribution for scenario 1, whose tail is much thicker. Consequently, the decision to prefer scenario 2 is implicitly based on the larger data set, and considers the higher uncertainty in the information about scenario 1.
The Gaussian kernel has tails reaching out to infinity, which also assigns positive mass to values outside the natural range of the input data (the scale from 0 to 10 in case of CVSS). In many contexts, observations may not be exhaustive for the possible range (e.g., monetary loss up to the theoretical maximum may – hopefully – not have occurred in a risk management process in the past). By construction, the KDE puts more mass on the tails the more data is available in this region. From a security perspective, this mass corresponds to zero-day exploit events. Thus, such incidents are automatically accounted for by the ordering.
6 Discussion
Our proposed preference relation is designed for IT risk management. In this context, decision makers often rely on scarce and purely subjective data coming from different experts. Bayesian techniques that need large amounts of data are therefore hard to apply (and somewhat ironically, a primary goal of IT risk management is exactly minimizing the number of incidents that could deliver the data). Since the available information may not only be vague but possibly also inconsistent, consensus finding by data aggregation is often necessary. There exist various non-probabilistic methods to do this (such as fuzzy logic, Dempster-Shafer theory, neural networks, etc.), which perform extremely well in practice, but the interpretation of the underlying concepts is intricate and the relation to values and business assets is not trivial.
To retain interpretability, data aggregation often means averaging (or taking the median of) the available risk figures, in order to single out an optimal action. This clearly comes at the cost of losing some information. Stochastic orders elegantly tackle the above issues by letting the entire data go into a probability distribution (and thus preserving all information), and defining a meaningful ranking on the resulting objects. However, not all stochastic orders are equally meaningful for the peculiarities of IT risk management. For example, low damages are normally disregarded as being covered by the system’s natural resilience, i.e., no additional efforts are put on lowering a risk that is considered as low already. The relevance of risks depends on whether or not a certain acceptable damage threshold is exceeded. IT risk managers typically care about significant (extreme) distortions and events with high potential of damage but with only a limited amount of reported evidence so far, such as zero-day exploits or advanced persistent threats.
Consequently, a suitable ordering may reasonably ignore damages of low magnitude, and focus on extreme outcomes, i.e., the tails of the respective loss distributions. This is a major reason for our transition from the usual stochastic order that takes into account the entire loss range (in fact all loss values, according to (2)) to one that explicitly focuses on a left neighborhood of the loss maximum. In a converse approach to the same problem, this could as well be used as a starting point to define an order, but starting from moments instead and finishing with an ordering that is about the heaviness of tails is an interesting lesson learned from our proposed technique of using the hyperreals to construct the ordering here. More importantly, the rich structure of the hyperreals, being available without additional labor, makes our ordering usable with optimization and game theory, so that important matters of security economics can be covered as a by-product. This non-standard technique of constructing an ordering is an independent contribution of this work.
Summarizing our point, decision making based on a stochastic ordering has the appeal of a statistical fundament that is easy to communicate and, more importantly, fits well into existing risk management standards (ISO 27000, ISO 31000, etc.).
The well-defined arithmetic over the hyperreals, into which Theorem 1 embeds the (risk) distribution models, lets us technically work with distributions as if we were in a topological field. This embedding offers an interesting unexplored (and nontrivial) route of future research: though the operations on random variables (say, addition or quotients) do not correspond to the same operations in the hyperreals (which is immediately evident from the definition), many other operations and even functions of random variables can be studied in the hyperreal space rather than on the set of distributions. So we can, for example, do optimization theory over distributions but equipped with the full armory of calculus known from the reals (that analogously holds in the hyperreal space by virtue of Łoś’ theorem or the transfer principle [16]).
Our ordering relation on the set of probability distributions can be extended towards a theory of games on these spaces (this extension is based on the topology that the order induces, upon which Nash’s result on the existence of equilibria can be re-established on our space of probability distributions). First steps into applying the framework to competitive decision-situations have been taken in , and will be further detailed in follow up research articles.
The authors wish to thank the anonymous reviewers for invaluable suggestions and for bringing up a variety of interesting aspects to look at here and in future research. Their input greatly improved the readability and quality of the text. This work was supported by the European Commission’s Project No. 608090, HyRiM (Hybrid Risk Management for Utility Networks) under the 7th Framework Programme (FP7-SEC-2013-1). The project can be found online at https://hyrim.net.
- 1. Shortliffe EH, Buchanan BG. A model of inexact reasoning in medicine. Readings in uncertain reasoning. 1990;p. 259–275.
- 2. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1988.
- 3. Buntine WL. Chain graphs for learning. In: Uncertainty in Artificial Intelligence; 1995. p. 46–54.
- 4. Jensen FV. Bayesian networks and decision graphs. New York: Springer; 2002.
- 5. Halpern JY. Reasoning about Uncertainty. MIT Press; 2003.
- 6. Evans MJ, Rosenthal JS. Probability and Statistics – The Science of Uncertainty. W.H. Freeman and Co.; 2004.
- 7. Koski T, Noble JM. Bayesian Networks. Wiley Series in Probability and Statistics. Wiley; 2009.
- 8. Szekli R. Stochastic Ordering and Dependence in Applied Probability. Lecture Notes in Statistics. Vol. 97. Springer; 1995.
- 9. Stoyan D, Müller A. Comparison methods for stochastic models and risks. Wiley, Chichester; 2002.
- 10. Busby JS, Onggo BSS, Liu Y. Agent-based computational modelling of social risk responses. European Journal of Operational Research. 2016;251(3):1029–1042.
- 11. Embrechts P, Lindskog F, McNeil A. Modelling Dependence with Copulas and Applications to Risk Management; 2003. Handbook of Heavy Tailed Distributions in Finance.
- 12. McNeil A, Frey R, Embrechts P. Quantitative Risk Management – Concepts, Techniques and Tools. Princeton Univ. Press; 2005.
- 13. Eichner T, Wagener A. Increases in skewness and three-moment preferences. Mathematical Social Sciences. 2011;61(2):109–113.
- 14. Rass S. On Game-Theoretic Risk Management (Part One) – Towards a Theory of Games with Payoffs that are Probability-Distributions. ArXiv e-prints. 2015 Jun;http://arxiv.org/abs/1506.07368.
- 15. Rass S. On Game-Theoretic Risk Management (Part Two) – Algorithms to Compute Nash-Equilibria in Games with Distributions as Payoffs; 2015. arXiv:1511.08591.
- 16. Robinson A. Nonstandard Analysis. Studies in Logic and the Foundations of Mathematics. North-Holland, Amsterdam; 1966.
- 17. Rass S, König S, Schauer S. Uncertainty in Games: Using Probability Distributions as Payoffs. In: Khouzani M, Panaousis E, Theodorakopoulos G, editors. Decision and Game Theory for Security, 6th International Conference, GameSec 2015. LNCS 9406. Springer; 2015.
- 18. Robert CP. The Bayesian choice. New York: Springer; 2001.
- 19. Gibbons R. A Primer in Game Theory. Pearson Education Ltd; 1992.
- 20. Chiu WH. Skewness Preference, Risk Taking and Expected Utility Maximisation. The Geneva Risk and Insurance Review. 2010;35(2):108–129.
- 21. Wenner F. Determination of Risk Aversion and Moment-Preferences: A Comparison of Econometric models [PhD Thesis]. Universität St.Gallen; 2002.
- 22. Shaked M, Shanthikumar JG. Stochastic Orders. Springer; 2006.
- 23. Bürgin V, Epperlein J, Wirth F. Remarks on the tail order on moment sequences. arXiv:2104.10572 [math]. 2021 Apr. Available from: http://arxiv.org/abs/2104.10572.
- 24. Rass S, König S, Schauer S, Bürgin V, Epperlein J, Wirth F. On Game Theory Using Stochastic Tail Orders. arXiv:2108.00680 [math]. 2021 Aug. Available from: http://arxiv.org/abs/2108.00680.
- 25. Bundestag D. Unterrichtung durch die Bundesregierung: Bericht über die Methode zur Risikoanalyse im Bevölkerungsschutz 2010 [Information by the government: report on the method of risk analysis for public safety 2010]. In: Verhandlungen des Deutschen Bundestages: Drucksachen, 2010, p. 17/4178.; 2010.
- 26. Bundesamt für Bevölkerungsschutz, Bern. Methode zur Risikoanalyse von Katastrophen und Notlagen für die Schweiz [Methods for risk analysis of catastrophes and crises for Switzerland]; 2013.
- 27. The Network of Analysts for National Security (ANV). National Risk Assessment 2011; 2011. National Institute for Public Health and the Environment (RIVM), the Netherlands.
- 28. Swedish Civil Contingencies Agency (MSB). Swedish National Risk Assessment 2012; 2012. Retrieved Oct.17, 2016. https://www.msb.se/RibData/Filer/pdf/26621.pdf.
- 29. Mell P, Scarfone K. A Complete Guide to the Common Vulnerability Scoring System; 2007. Version 2.0 (last access: Feb. 12th, 2010). http://www.first.org/cvss/cvss-guide.pdf.
- 30. Hogg RV, Klugman SA. Loss distributions. Wiley series in probability and mathematical statistics Applied probability and statistics. New York, NY: Wiley; 1984.
- 31. Bäuerle N, Müller A. Stochastic orders and risk measures: Consistency and bounds. Insurance: Mathematics and Economics. 2006;38(1):132–148. Available from: http://www.sciencedirect.com/science/article/pii/S0167668705001125.
- 32. Nolan J. Stable Distributions: Models for Heavy-Tailed Data. Springer; 2016.
- 33. International Standards Organisation (ISO). ISO/IEC 27005 - Information technology – Security techniques – Information security risk management; 2011. http://www.iso27001security.com/html/27005.html [retrieved: Dec.6, 2016].
- 34. International Standards Organisation (ISO). ISO/IEC 31000 - Risk management – Principles and guidelines; 2009. (accessed: April 11, 2016). http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=43170.
- 35. Münch I. Wege zur Risikobewertung. In: Schartner P, Taeger J, editors. DACH Security 2012. syssec; 2012. p. 326–337.
- 36. Richard A Caralli, James F Stevens, Lisa R Young, William R Wilson. Introducing OCTAVE Allegro: Improving the Information Security Risk Assessment Process; 2016. Technical report CMU/SEI-2007-TR-012 ESC-TR-2007-012.
- 37. Goodpasture JC. Quantitative methods in project management. Boca Raton, Florida: J. Ross Pub; 2004. ISBN: 1-932159-15-0.
- 38. Beck A. Entwicklung einer Metrik zur automatisierten Analyse und Bewertung von Bedrohungsszenarien mit Hilfe neuronaler Netzwerke; 2016.
- 39. Karnouskos S. Stuxnet Worm Impact on Industrial Cyber-Physical System Security. In: IECON 2011 - 37th Annual Conference of the IEEE Industrial Electronics Society (IECON 2011). IEEE; 2011. p. 4490–4494.
- 40. Newman MEJ. The spread of epidemic disease on networks. Physical Review E. 2002;66, 016128.
- 41. König S. Error Propagation Through a Network With Non-Uniform Failure; 2016. arXiv:1604.03558.
- 42. Liu B, Yang Y, Webb GI, Boughton J. A Comparative Study of Bandwidth Choice in Kernel Density Estimation for Naive Bayesian Classification. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho TB, editors. Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-30, 2009 Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p. 302–313.
- 43. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC; 1998.
- 44. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2016. ISBN 3-900051-07-0. Available from: http://www.R-project.org.
Appendix A Proofs
The proofs here first appeared in , and are repeated for the sake of completeness and convenience of the reader.
A.1 Proof of Lemma 2
We first discuss the continuous case, which illustrates the basic idea that can be applied alike to categorical and discrete distributions.
Let denote the densities of the distributions . Fix the smallest so that covers both the supports of and . Consider the difference of the -th moments, given by
Towards a lower bound to (8), we distinguish two cases:
If for all , then and because are continuous, their difference attains a minimum on the compact set . So, we can lower-bound (8) as , as .
Otherwise, we look at the right end of the interval , and define
Without loss of generality, we may assume . To see this, note that if , then the continuity of implies within a range for some , and is the supremum of all these . Otherwise, if on an entire interval for some , then on (the opposite of the previous case) implies the existence of some so that , and is the supremum of all these (see Figure 8 for an illustration). In case that , we would have on , which is either trivial (as for all if ) or otherwise covered by the previous case.
In either situation, we can fix a compact interval and two constants (which exist because are bounded as being continuous on the compact set ), so that the function
as due to and because are constants that depend only on .
In both cases, we conclude that, unless , for sufficiently large where is finite. This establishes the lemma for continuous distributions.
In the discrete or categorical case, the argument remains the same, only adapted to looking at the finite set of values on which . The largest value less than above which equality holds until the end of the support then determines the growth of the difference sequence in the same way as was argued in Section 4.1.
A.2 Proof of Theorem 2
Let be the density functions of . Call the common support of both densities, and take . Suppose there were an so that on every interval whenever , i.e., would be larger than until both densities vanish (notice that on the right of ). Then the proof of lemma 2 delivers the argument by which we would find a so that for every , which would contradict . Therefore, there must be a neighborhood on which