Computation of quantile sets for bivariate data

01/21/2021 ∙ by Andreas H Hamel, et al. ∙ Free University of Bozen-Bolzano

Algorithms are proposed for the computation of set-valued quantiles and the values of the lower cone distribution function for bivariate data sets. These new objects make data analysis possible involving an order relation for the data points in form of a vector order in two dimensions. The bivariate case deserves special attention since two-dimensional vector orders are much simpler to handle than such orders in higher dimensions. Several examples illustrate how the algorithms work and what kind of conclusions can be drawn with the proposed approach.

1 Introduction

Algorithms are presented for computing the quantile sets for bivariate random variables as well as the values of the corresponding lower cone distribution function in the presence of an order relation for their values. Such quantiles in the general multivariate case have been defined in [6] as a common generalization of univariate quantiles and Tukey’s (halfspace) depth regions. Likewise, the lower cone distribution function is a common generalization of the univariate cumulative distribution function and Tukey’s (halfspace) depth function. It can also be used as a ranking function for multi-criteria decision making problems [9].

Moreover, it was shown in [2] that set-valued quantiles and the lower cone distribution functions form Galois connections between complete lattices of sets and the interval [0,1] of real numbers. This generalizes a property which is straightforward and well-known in the univariate case, but has never been discussed with respect to depth functions and depth regions. It is also shown in [2] that set-valued quantiles characterize the distribution of a random set extension of the original random variable as well as its capacity functional.

The bivariate case deserves special attention since convex cones in $\mathbb{R}^2$ have a very simple representation: every closed convex cone is polyhedral, i.e., the intersection of a finite number of halfspaces, and this number is 1 or 2 in almost all cases. Using this, the computations can be done much faster than in the general $\mathbb{R}^d$-valued case, since there are polyhedral cones with arbitrarily many facets already in $\mathbb{R}^3$.

Algorithms for the bivariate location depth and corresponding depth regions were given in [17, 18]. Algorithms for depth functions and regions in general dimensions can be found, for example, in [5, 10, 11, 12, 15]. These references are mainly concerned with Tukey depth functions/regions, i.e., they do not take an order relation for the value of the random variable into account. The reader may compare [4] which is one of the very few references dealing with statistics for multivariate ordered data.

On the other hand, an order relation is often present and intuitive since decision makers have preferences, or the impact of some events is clearly preferred over the impact of others. A few examples illustrating this feature are discussed below: hurricane scales, hail insurance and human resource management. It is beyond the scope of this paper to give an exhaustive statistical analysis of these examples; they are used as showcases for the type of conclusions which can be drawn with the approach initiated in [6], in particular what sets it apart from a mere depth function approach.

The paper is organized as follows. In the next section, vector orders in $\mathbb{R}^2$ are reviewed. Section 3 includes the definition of the main concepts and preparatory results. The algorithms are presented in Section 4 while in Section 5 several examples are discussed including a new view on hurricane scales and the problem of finding best candidates for tasks/jobs.

2 Vector preorders in two dimensions

The basic assumption is that there is a preference relation for the two-dimensional data points in the form of a vector preorder, i.e., a reflexive and transitive relation which is compatible with the algebraic operations in $\mathbb{R}^2$. Such vector preorders are in one-to-one correspondence with convex cones $C \subseteq \mathbb{R}^2$ including $0 \in \mathbb{R}^2$ via

$$y \le_C z \quad \Longleftrightarrow \quad z - y \in C \tag{2.1}$$

(see, for example, [1, Chap. 8]). A convex cone is a set $C \subseteq \mathbb{R}^2$ satisfying $sC \subseteq C$ for all $s > 0$ and $C + C \subseteq C$.

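In the following, it is assumed that the cone $C$ generating the preorder via (2.1) is closed. Such vector preorders in $\mathbb{R}^2$ have some special features compared to the case $\mathbb{R}^d$ with $d \ge 3$. Only the following cases are possible:

1. The cone $C$ is a linear subspace which is either $\{0\}$, or $\mathbb{R}^2$, or a straight line in $\mathbb{R}^2$.

2. The cone $C$ is a ray: $C = \{s v : s \ge 0\}$ for some $v \in \mathbb{R}^2 \setminus \{0\}$.

3. The cone $C$ is generated by two linearly independent vectors:

$$C = \{s v^1 + t v^2 : s, t \ge 0\}$$

for some $v^1, v^2 \in \mathbb{R}^2$ which are linearly independent.

4. The cone $C$ is a closed (homogeneous) halfspace: $C = \{s v^1 + t v^2 : s \in \mathbb{R},\ t \ge 0\}$ for two linearly independent vectors $v^1, v^2 \in \mathbb{R}^2$.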

There are interesting non-closed cones such as the lexicographic ordering cone even in $\mathbb{R}^2$; such cases require a different type of analysis (since the bipolar theorem does not apply) and therefore, they will not be considered here.

The case $C = \{0\}$ leads to the Tukey halfspace depth function and regions; it is dealt with, e.g., already in [18]. While the case $C = \mathbb{R}^2$ is trivial, the case  will not be discussed in this paper. Finally, if $C$ is a closed halfspace, there is $w \in \mathbb{R}^2 \setminus \{0\}$ such that $C = \{z \in \mathbb{R}^2 : w^T z \ge 0\}$; in this case, the order $\le_C$ is a total preorder (a reflexive, transitive relation such that either $y \le_C z$ or $z \le_C y$ or both) and the situation can be reduced to the univariate case.

In this paper, the main subject is the case of a closed convex pointed cone with non-empty interior, i.e., case 3 above; case 4 will appear as an intermediate step.

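If $C$ is generated by two linearly independent vectors $v^1, v^2$, then $C$ is the intersection of exactly two halfspaces, i.e., there are linearly independent $w^1, w^2 \in \mathbb{R}^2$ such that

$$C = \{z \in \mathbb{R}^2 : (w^1)^T z \ge 0,\ (w^2)^T z \ge 0\}$$

(choose $w^1$ and $w^2$ orthogonal to $v^1$ and $v^2$, respectively, such that $(w^1)^T v^2 > 0$, $(w^2)^T v^1 > 0$).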
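The choice of normals described in the parenthesis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the algorithms of the paper: the function names `h_representation`, `in_cone` and `leq`, as well as the argument names `v1`, `v2` (generators) and `w1`, `w2` (normals), are hypothetical, and points are assumed to be given as pairs of numbers.

```python
def dot(u, v):
    """Euclidean inner product of two points in the plane."""
    return u[0] * v[0] + u[1] * v[1]


def h_representation(v1, v2):
    """Return normals (w1, w2) with w1 orthogonal to v1, w2 orthogonal to v2,
    w1.v2 > 0 and w2.v1 > 0, so that C = {z : w1.z >= 0 and w2.z >= 0}."""
    if v1[0] * v2[1] - v1[1] * v2[0] == 0:
        raise ValueError("generators must be linearly independent")
    w1 = (-v1[1], v1[0])          # rotate v1 by 90 degrees
    if dot(w1, v2) < 0:           # flip so that v2 lies on the positive side
        w1 = (-w1[0], -w1[1])
    w2 = (-v2[1], v2[0])          # rotate v2 by 90 degrees
    if dot(w2, v1) < 0:           # flip so that v1 lies on the positive side
        w2 = (-w2[0], -w2[1])
    return w1, w2


def in_cone(z, w1, w2):
    """Membership test for the cone given by its two normals."""
    return dot(w1, z) >= 0 and dot(w2, z) >= 0


def leq(y, z, w1, w2):
    """Vector preorder generated by the cone: y <= z iff z - y lies in the cone."""
    return in_cone((z[0] - y[0], z[1] - y[1]), w1, w2)
```

For the generators `(1, 1)` and `(-1, 1)`, the sketch returns the normals `(-1, 1)` and `(1, 1)`; for the standard generators `(1, 0)` and `(0, 1)` it returns `(0, 1)` and `(1, 0)`, i.e., the componentwise order.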

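It is assumed in the following that $v^1, v^2$ as well as $w^1, w^2$ are known, and the two sets $\{v^1, v^2\}$ and $\{w^1, w^2\}$ are called a V-representation and an H-representation of $C$, respectively. The set

$$C^+ = \{w \in \mathbb{R}^2 : w^T z \ge 0 \text{ for all } z \in C\}$$

is called the (positive) dual cone of $C$ (always a closed convex cone). Under the given assumptions,

$$C^+ = \{s w^1 + t w^2 : s, t \ge 0\} = \{s b : s \ge 0,\ b \in B\} \quad \text{with} \quad B = \{(1 - \lambda) w^1 + \lambda w^2 : \lambda \in [0,1]\},$$

where $B$ is a base of $C^+$, i.e., for each $w \in C^+ \setminus \{0\}$ there are unique $s > 0$ and $b \in B$ such that $w = sb$.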

3 Empirical cone distribution functions and quantiles

In this section, we give the definitions of lower cone distribution functions and associated quantiles for bivariate random variables in the case of a finite data set. Let $\hat X = \{x^1, \ldots, x^N\} \subseteq \mathbb{R}^2$ be a finite collection of data points which could be a sample of a random variable. The following definition provides the bivariate empirical counterpart to the concepts from [6]. Compare also [9] for the $\mathbb{R}^d$-valued case with applications to a multi-criteria decision making problem.

Definition 3.1

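The functions $F_{\hat X,w}, F_{\hat X,C} \colon \mathbb{R}^2 \to [0,1]$ for $w \in C^+ \setminus \{0\}$ and $C$ defined by

$$F_{\hat X,w}(z) = \frac{1}{N}\, \#\{x \in \hat X : w^T x \le w^T z\} \tag{3.1}$$
$$F_{\hat X,C}(z) = \inf_{w \in C^+ \setminus \{0\}} F_{\hat X,w}(z) \tag{3.2}$$

where $N$ is the number of points in $\hat X$, are called empirical lower $w$-distribution function and empirical lower $C$-distribution function, respectively, for the data set $\hat X$.

The functions $N F_{\hat X,w}$ and $N F_{\hat X,C}$ are called the $w$-location depth and the cone location depth for $\hat X$.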

The functions $N F_{\hat X,w}$ and $N F_{\hat X,C}$ can be interpreted as follows. For each point $z \in \mathbb{R}^2$, the $w$-location depth gives the number of data points which are dominated by $z$ with respect to the total preorder generated by $w$, i.e., data points $x$ satisfying $w^T x \le w^T z$. The cone location depth of $z$ gives the minimal number of data points which are dominated by $z$ with respect to all total preorders generated by $w$ for $w \in C^+ \setminus \{0\}$. Thus, a point $z$ dominates at least $N F_{\hat X,C}(z)$ data points with respect to the total preorder generated by $w$ for all $w \in C^+ \setminus \{0\}$, i.e., no matter which weighted average with weights from $C^+ \setminus \{0\}$ is taken. Data points which are higher ranked than others are “deeper” in the sense that they improve with respect to more $w$’s at the same time.
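The interpretation above can be made concrete with a small Python sketch. It is not the rotation algorithm of Section 4; it is a simplified exact evaluation under the assumption that the dual cone is generated by two vectors `w1`, `w2`, so that it suffices to scan the segment $w(\lambda) = (1-\lambda) w^1 + \lambda w^2$, $\lambda \in [0,1]$. Along this segment the count of dominated points can change only at finitely many values of $\lambda$, so the minimum is found by testing these values and the midpoints between them. The names `F_w`, `F_C`, `w1`, `w2` are hypothetical.

```python
from fractions import Fraction


def dot(u, v):
    """Euclidean inner product of two points in the plane."""
    return u[0] * v[0] + u[1] * v[1]


def F_w(z, data, w):
    """Share of data points x with w.x <= w.z (empirical lower w-distribution)."""
    wz = dot(w, z)
    return Fraction(sum(1 for x in data if dot(w, x) <= wz), len(data))


def F_C(z, data, w1, w2):
    """Minimum of F_w(z) over w = (1 - lam) * w1 + lam * w2, lam in [0, 1].

    Assumption of this sketch: w1 and w2 generate the dual cone of C.
    """
    # With a = w1.(x - z) and b = w2.(x - z), the sign of w(lam).(x - z)
    # can only change at lam = a / (a - b).
    breaks = {Fraction(0), Fraction(1)}
    for x in data:
        d = (x[0] - z[0], x[1] - z[1])
        a, b = dot(w1, d), dot(w2, d)
        if a != b:
            lam = Fraction(a) / Fraction(a - b)
            if 0 < lam < 1:
                breaks.add(lam)
    pts = sorted(breaks)
    # Breakpoints plus one interior point of every interval between them.
    candidates = pts + [(p + q) / 2 for p, q in zip(pts, pts[1:])]

    def w_at(lam):
        return ((1 - lam) * w1[0] + lam * w2[0],
                (1 - lam) * w1[1] + lam * w2[1])

    return min(F_w(z, data, w_at(lam)) for lam in candidates)
```

For the four points `(0, 0)`, `(1, 1)`, `(2, 0)`, `(0, 2)` and the componentwise order (`w1 = (1, 0)`, `w2 = (0, 1)`), the sketch gives the value 1/4 at `(0, 0)`, 3/4 at `(1, 1)`, and 1/2 at both `(2, 0)` and `(0, 2)`, two points which are not comparable with each other.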

Proposition 3.2

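(1) For $w \in C^+ \setminus \{0\}$ and $y, z \in \mathbb{R}^2$, $w^T y \le w^T z$ implies $F_{\hat X,w}(y) \le F_{\hat X,w}(z)$;

(2) for $y, z \in \mathbb{R}^2$, $y \le_C z$ implies $F_{\hat X,C}(y) \le F_{\hat X,C}(z)$.

Proof. This follows directly from the monotonicity property of $F_{X,w}$ and $F_{X,C}$, respectively, in [6, Proposition 1 (b)].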

The following example shows that the cone location depth (as well as $F_{\hat X,C}$) can be understood as a ranking function for the data points which reflects the order $\le_C$. This seems to be very much in the spirit of Tukey’s original work. This example also shows that points which are non-comparable with respect to the order $\le_C$ can have the same or very different cone location depths.

Example 3.3

For every data point in Figure 3.1 Tukey’s (halfspace) depth HD and the cone location depth CD are computed. One may already realize that the cone location depth “follows the cone” (increases in directions in which the cone “opens”) whereas the halfspace depth increases toward the center of the data cloud. This means that the cone location depth ranks the data points taking into account the order generated by the cone. This is a new feature not captured by depth functions.

Figure 3.1:

Empirical quantiles are set-valued functions, i.e., they map into the power set $\mathcal{P}(\mathbb{R}^2)$, the set of all subsets of $\mathbb{R}^2$ including the empty set.

Definition 3.4

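The empirical $w$-quantile function $Q^-_{\hat X,w} \colon [0,1] \to \mathcal{P}(\mathbb{R}^2)$ and the empirical $C$-quantile function $Q^-_{\hat X,C} \colon [0,1] \to \mathcal{P}(\mathbb{R}^2)$ associated to $F_{\hat X,w}$ and $F_{\hat X,C}$ are defined by

$$Q^-_{\hat X,w}(p) = \{z \in \mathbb{R}^2 : F_{\hat X,w}(z) \ge p\}, \qquad Q^-_{\hat X,C}(p) = \bigcap_{w \in C^+ \setminus \{0\}} Q^-_{\hat X,w}(p),$$

respectively.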

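The definitions of $F_{\hat X,w}$ and $Q^-_{\hat X,w}$ immediately yield

$$Q^-_{\hat X,w}(p) = \{z \in \mathbb{R}^2 : \#\{x \in \hat X : w^T x \le w^T z\} \ge \lceil Np \rceil\} \tag{3.3}$$
$$Q^-_{\hat X,C}(p) = \bigcap_{w \in C^+ \setminus \{0\}} \{z \in \mathbb{R}^2 : \#\{x \in \hat X : w^T x \le w^T z\} \ge \lceil Np \rceil\} \tag{3.4}$$

where $\lceil r \rceil$ is the value of the ceiling function at $r \ge 0$ defined by $\lceil r \rceil = \min\{k \in \mathbb{N} : r \le k\}$ (the least natural number which is greater than or equal to $r$). Clearly,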

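Let $A \subseteq \mathbb{R}$ be a finite univariate data set. We denote its empirical lower quantile by $q^-_A(p)$. With this notation and $w^T \hat X = \{w^T x : x \in \hat X\}$, one has

$$Q^-_{\hat X,w}(p) = \{z \in \mathbb{R}^2 : w^T z \ge q^-_{w^T \hat X}(p)\} \tag{3.5}$$
$$Q^-_{\hat X,C}(p) = \bigcap_{w \in C^+ \setminus \{0\}} \{z \in \mathbb{R}^2 : w^T z \ge q^-_{w^T \hat X}(p)\} \tag{3.6}$$

as in the general (multi-dimensional) case (see [6, Proposition 6]).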
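The reduction to univariate quantiles can be illustrated by a short Python sketch for a single direction. It assumes that the empirical lower $p$-quantile of a finite univariate data set with $N$ points is its $\lceil Np \rceil$-th smallest value for $p \in (0,1]$, and that a point belongs to the halfspace quantile if its projection is at least this value. The function names and the second, counting-based test are hypothetical and only serve as a cross-check; passing `p` as a `Fraction` avoids rounding effects in the ceiling.

```python
import math
from fractions import Fraction


def dot(u, v):
    """Euclidean inner product of two points in the plane."""
    return u[0] * v[0] + u[1] * v[1]


def lower_quantile(values, p):
    """Empirical lower p-quantile: the ceil(N * p)-th smallest value."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    k = math.ceil(len(values) * p)
    return sorted(values)[k - 1]


def in_w_quantile(z, data, w, p):
    """Halfspace test: w.z is at least the lower p-quantile of the projections."""
    return dot(w, z) >= lower_quantile([dot(w, x) for x in data], p)


def in_w_quantile_by_counting(z, data, w, p):
    """Counting test: at least ceil(N * p) data points satisfy w.x <= w.z."""
    count = sum(1 for x in data if dot(w, x) <= dot(w, z))
    return count >= math.ceil(len(data) * p)
```

Both tests agree because the projection of a point is at least the $k$-th smallest projected data value exactly when at least $k$ projected data values do not exceed it.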

Next, we formally state that at least one data point must be on the boundary of each $w$-quantile.

Proposition 3.5

If , and , then there is such that

(3.7)

Moreover, the following three conditions are equivalent for :

(a) .

(b) One has

(3.8)
(3.9)

(c) .

Proof. Since is a closed halfspace with normal , there is a point such that . By (3.3), . If there would be no data point on the boundary of this halfspace, one even had . Then, there would exist with , hence , but , a contradiction.

(a) (b): If (a) is true, then and (3.8) follows from (3.3). If (3.9) would not be true, then, by a similar argument as in the first part of the proof, would not be in .

(b) (a): Assume . Then , hence

which means . Hence . Conversely, if , then by (3.8). This and (3.9) imply (otherwise, which leads to a contradiction). But then , hence .

(a) (c) Both directions follow from (3.5).

For the following result, some notation is needed which is also used in the remainder of the paper. For we define the following sets

The set includes all data points on the boundary of the shifted halfspace , whereas includes the data points in with from (3.7) in both cases.

Standing assumption. In the remainder of the paper, it is assumed that $C$ is generated by the two linearly independent vectors $v^1, v^2$, i.e., $\{v^1, v^2\}$ is a V-representation of $C$. In this case, the set

$$B = \{(1 - \lambda) w^1 + \lambda w^2 : \lambda \in [0,1]\}$$

is a base of $C^+$. Equivalently, $C^+$ is generated by two linearly independent vectors $w^1, w^2$, i.e., $\{w^1, w^2\}$ is a V-representation of $C^+$.

Proposition 3.6

If and , then can be represented as

such that and

for all . In particular, is a finite set.

Proof. First, take . By Proposition 3.5, where

Since , there is such that . Assume that , i.e., . Then, there are such that , and the two halfspaces

contain exactly the same set of data points as where , . Hence , , and

This means that does not contribute to the intersection in

(3.10)

and it is enough to run it over those with and .

Secondly, take such a and assume . Now, (3.8) implies ”=” in this inequality. Pick such that

(3.11)
(3.12)

One has and and by (3.11), (3.12), hence

(3.13)

Since the data set is finite, it is always possible to find and such that

and the following conditions are satisfied: is kept on the boundary of the halfspace for , and one has , and for , and for (i.e., for ), , one has , and either or .

The underlying geometrical idea is to turn in direction and , respectively, around and until the next data point is hit. The data points in with , , are exactly the same as in .

Then, one has

(3.14)

which means that is indeed redundant in the intersection (3.10). To see this, observe that there is satisfying

( is the intersection point of the two boundary lines of , and it does not have to be a data point, of course).

Claim. , i.e., , for .

Indeed, one has

i.e.,

(3.15)
(3.16)

Multiplying the second equation by and adding the result to the first gives

so one of the two parts of the sum must be , the other . With the help of (3.11), (3.12) one gets

hence

Together with (3.15), (3.16), this gives

so for which proves the claim.

Finally, take

i.e.,

The definitions of and yield

hence

so , , where the inclusion is the claim above. This proves (3.14) which completes the proof of the proposition.

Remark 3.7

For , one can have

In the first case, is also redundant by Proposition 3.6. This is exploited in the algorithm below. In the second, cannot be ruled out by the proposition, but it could still be redundant for the intersection in (3.10).

Remark 3.8

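Under the standing assumption, the situation can be reduced to the case when the cone is $\mathbb{R}^2_+$. Let $C$ be generated by the two linearly independent vectors $v^1, v^2$, i.e., $C = \{s v^1 + t v^2 : s, t \ge 0\}$. Then, there is an invertible matrix $A \in \mathbb{R}^{2 \times 2}$ such that $AC = \mathbb{R}^2_+$ and one can use the affine invariance of the cone distribution function (see [6, Proposition 2.7]) to get

$$F_{\hat X,C}(z) = F_{A\hat X,\mathbb{R}^2_+}(Az).$$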

Indeed, since if, and only if, and , one may easily see that

do the job. Moreover,

so . The procedure now is: first, transform the data and the cone by ; secondly find and ; finally transform back. Clearly, this idea is restricted to the bivariate case.
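The first step of this procedure can be sketched in Python. The sketch assumes that the matrix has the two normals of an H-representation of the cone as its rows, which is a natural choice given the characterization of the cone by two inequalities; the names `make_transform`, `leq_cone`, `leq_orthant`, `w1` and `w2` are hypothetical. Under this assumption, dominance with respect to the cone becomes componentwise dominance of the transformed points.

```python
def dot(u, v):
    """Euclidean inner product of two points in the plane."""
    return u[0] * v[0] + u[1] * v[1]


def make_transform(w1, w2):
    """Return the linear map z -> (w1.z, w2.z), i.e., the matrix with rows w1, w2."""
    def transform(z):
        return (dot(w1, z), dot(w2, z))
    return transform


def leq_cone(y, z, w1, w2):
    """y <= z in the order generated by the cone {d : w1.d >= 0 and w2.d >= 0}."""
    d = (z[0] - y[0], z[1] - y[1])
    return dot(w1, d) >= 0 and dot(w2, d) >= 0


def leq_orthant(y, z):
    """Componentwise order, i.e., the order generated by the non-negative quadrant."""
    return y[0] <= z[0] and y[1] <= z[1]
```

For the cone generated by `(1, 1)` and `(-1, 1)` with normals `w1 = (-1, 1)` and `w2 = (1, 1)`, the two generators are mapped to `(0, 2)` and `(2, 0)`, positive multiples of the unit vectors, so the transformed cone is the non-negative quadrant.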

4 The bivariate algorithms

In this section, we provide some more theoretical background for algorithms which produce the empirical quantiles $Q^-_{\hat X,C}(p)$, with empirical halfspace depth regions as special cases, and the values of $F_{\hat X,C}$, with the halfspace depth function as a special case. Pseudocode for the algorithms is also given.

4.1 The rotation step

The algorithm below is designed such that it starts with and then runs through until it hits . At an intermediate step, a is generated. There are three cases:

(1) and .

(2) If and .

(3) .
Case (3) serves as a stopping criterion. In case (1) and (2), a permutation of the data points is generated which in turn is used to generate a new . Set .

Case (1). The input is , (the only element in ). Find the permutation of such that

(4.1)

One has .

Case (2). The input is . Find the permutation of such that

(4.2)

Find the set

and set . Define a new permutation of the points in by