Smoothed Analysis of the Expected Number of Maximal Points in Two Dimensions

07/18/2018 ∙ by Josep Diaz, et al. ∙ Universitat Politècnica de Catalunya ∙ The Hong Kong University of Science and Technology

The maximal points in a set S are those that are not dominated by any other point in S. Such points arise in multiple application settings, in which they are called by a variety of different names, e.g., maxima, Pareto optima, skylines. Because of their ubiquity, there is a large literature on the expected number of maxima in a set S of n points chosen IID from some distribution. Most such results assume that the underlying distribution is uniform over some spatial region and strongly use this uniformity in their analysis. This work was initially motivated by the question of how this expected number changes if the input distribution is perturbed by random noise. More specifically, let Ballp denote the uniform distribution from the 2-d unit Lp ball, δ Ballq denote the 2-d Lq ball of radius δ, and Ballpq be the convolution of the two distributions, i.e., a point v in Ballp is reported with an error chosen from δ Ballq. The question is how the expected number of maxima changes as a function of δ. Although the original motivation is for small δ, the problem is well defined for any δ and our analysis treats the general case. More specifically, we study, as a function of n and δ, the expected number of maximal points when the n points in S are chosen IID from distributions of the type Ballpq where p, q ∈ {1, 2, ∞} for δ > 0 and also of the type Ballp infty-q, where q ∈ [1, ∞) for δ > 0.


1 Introduction

Let $S$ be a set of 2-dimensional points. The largest points in $S$ are its maximal points and are a well-studied object. More formally (see Footnote 1):

Footnote 1: We restrict our definition to $\mathbb{R}^2$ because that is what this paper addresses; the concept of maxima generalizes naturally to $\mathbb{R}^d$ for $d > 2$ and has been well studied there as well. We discuss this in more detail in the Conclusions and Extensions section.

Definition 1

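For $u \in \mathbb{R}^2$ let $u.x$ ($u.y$) denote the $x$ ($y$) coordinate of $u$. For $u, v \in \mathbb{R}^2$, $u$ is dominated by $v$ if $u \neq v$, $u.x \le v.x$ and $u.y \le v.y$. If $S \subset \mathbb{R}^2$ then

$$\mathrm{MAX}(S) = \{u \in S : u \text{ is not dominated by any point in } S\}$$

are the maximal points of $S$.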
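The dominance rule can be turned into a short computation: a point is maximal when no other point is at least as large in both coordinates. The sketch below is illustrative and not taken from the paper. The names `dominates` and `maximal_points` are hypothetical, points are plain `(x, y)` tuples, and duplicate points are merged before the sweep.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(v: Point, u: Point) -> bool:
    """True if v dominates u: v != u, u.x <= v.x and u.y <= v.y."""
    return v != u and u[0] <= v[0] and u[1] <= v[1]


def maximal_points(points: List[Point]) -> List[Point]:
    """Return the maximal points, listed by decreasing x (assumed helper)."""
    # Visit points from right to left; among equal x, visit larger y first.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    maxima: List[Point] = []
    best_y = float("-inf")
    for p in ordered:
        # Every point visited earlier has x >= p.x, so p is maximal
        # exactly when its y is strictly larger than all y seen so far.
        if p[1] > best_y:
            maxima.append(p)
            best_y = p[1]
    return maxima


def maximal_points_brute_force(points: List[Point]) -> List[Point]:
    """Quadratic-time reference that applies the definition directly."""
    distinct = set(points)
    return [u for u in distinct if not any(dominates(v, u) for v in distinct)]
```

The sweep costs one sort plus a linear pass; the brute-force version is only there to cross-check the sweep on small inputs.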

The problems of finding and estimating the number of maximal points of a point set appear very often in many fields under different names: maximal vectors, skylines, Pareto frontier/points and others; see, e.g., [18, 5, 17, 12, 14] and, for a more exhaustive history of the problems and further references, Sections 1 and 2 in [7].

Figure 1: The diagram shows for two point sets . In both (a) and (b) the circles – both empty and filled – denote the points in and the filled circles are . If the points are considered as being drawn from region denotes the region in that dominates . In (a) is the dotted square and in (b) is the dotted circle.

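Recall that the $L_p$ metric for points $u = (u_1, \dots, u_d)$ and $v = (v_1, \dots, v_d)$ in $d$-dimensional space is defined by

$$d_p(u, v) = \Big(\sum_{i=1}^{d} |u_i - v_i|^p\Big)^{1/p} \ \text{ for } p \in [1, \infty), \qquad d_\infty(u, v) = \max_{1 \le i \le d} |u_i - v_i|.$$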

Let $S_n$ denote a set of $n$ points chosen Independently Identically Distributed (IID) from some 2-D distribution $\mathbf{D}$ and $M_n = |\mathrm{MAX}(S_n)|$ be the random variable counting the number of maximal points in $S_n$. Because maxima are so ubiquitous, understanding the expected number of maxima has been important in many areas and many properties of $M_n$ have been studied.

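More specifically, if $\mathbf{D}$ is the uniform distribution drawn from an $L_p$ ball with $p \in [1, \infty]$ then it is well known [12, 2, 13, 6] that the expected number of maxima, $\mathrm{E}(M_n)$, behaves as follows:

  • If $p = \infty$, then $\mathrm{E}(M_n) = H_n \sim \ln n$, where $H_n$ is the $n$-th harmonic number.
    The same result holds if the points are drawn from some distribution $\mathbf{D} = \mathbf{D}_1 \times \mathbf{D}_2$ where $\mathbf{D}_1$ and $\mathbf{D}_2$ are ANY two 1-dimensional distributions that are independent of each other.

  • If $p \in [1, \infty)$, then

    $$\mathrm{E}(M_n) \sim C_p \sqrt{n}$$

    where $C_p$ is a constant dependent only upon $p$.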

  • Similar results to the above, i.e., that , derived using similar techniques, are known if is a uniform distribution from ANY convex region [11].
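For the independent-coordinates case, a classical fact is that the expected number of maxima of $n$ points equals the harmonic number $H_n$, because only the relative order of the coordinates matters. The sketch below checks this exactly for small $n$ by enumerating all orderings. It is an illustration, not the paper's argument: the function names are hypothetical and the coordinates are assumed to be continuous, so ties have probability zero.

```python
from fractions import Fraction
from itertools import permutations
from typing import Sequence


def harmonic(n: int) -> Fraction:
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def count_maxima_of_permutation(perm: Sequence[int]) -> int:
    """Count maximal points of the point set {(i, perm[i])}."""
    count = 0
    best_y = -1
    # Sweep from the largest x to the smallest; a point is maximal
    # exactly when its y beats every y seen so far.
    for y in reversed(perm):
        if y > best_y:
            count += 1
            best_y = y
    return count


def exact_expected_maxima(n: int) -> Fraction:
    """Average the number of maxima over all n! equally likely rank orders."""
    total = 0
    orders = 0
    for perm in permutations(range(n)):
        total += count_maxima_of_permutation(perm)
        orders += 1
    return Fraction(total, orders)
```

Enumeration is only feasible for small $n$ since it visits all $n!$ orders; it serves as an exact sanity check rather than a practical estimator.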

It is also known [15, 16] that if the points are chosen IID from a 2-D Gaussian distribution then the expected number of maxima is $\Theta(\sqrt{\ln n})$.

There are also generalizations of these results (both the $L_p$ ones and the Gaussian one) to higher dimensions. See [13] for a table containing most known results.

Surprisingly, given the importance of the problem, not much else is known. The motivation for this work is to extend the family of distributions for which the expected number of maxima can be derived.

Consider a point that is originally generated from a uniform distribution over a unit $L_p$ ball but has some error in the $L_q$ metric when measured or reported. The actual reported point can be equivalently considered as being chosen from a new distribution which we denote by $\mathbf{B}_p + \delta \mathbf{B}_q$ (the next section provides formal definitions). Note that the support of this distribution is the Minkowski sum of the two balls.

Figure 2: .

As an example, Figure 2 shows the support of . In the diagram, the shaded inner square is the unit -ball. A point chosen from that square is then perturbed by the addition of another point , drawn uniformly from the ball with radius . The support of this convoluted distribution is the interior of the dotted region in the figure.

Note that the distribution is NOT uniform in this support. Towards the centre the density is uniform, but it decreases approaching the boundary of the support, where it becomes zero. Note too that the rate of decrease differs in different parts of the support. It is this non-uniformity that will cause complications in calculating the expected number of maxima.

Although the problem described above was for small $\delta$, it is well defined for all $\delta > 0$, which is what we analyze in this paper. More specifically, the motivation for the present work is twofold:

  • Explain how the expected number of maxima changes when the distribution is perturbed, and

  • Increase the families of distributions for which the expected number of maxima is understood.

The idea of analyzing how quantities change under perturbations is smoothed analysis [20, 21]. In the classic setting, smoothed analysis of the number of maxima would mean analyzing how, given a fixed set $S$, the number of maxima $|\mathrm{MAX}(S)|$ would change under small perturbations (as a function of the original set $S$). This was the approach in [9, 8] (see also similar work for convex hulls in [10]). This paper differs in that it is the distribution that is being smoothed (or convolved) and not the point set. This paper also differs from recent work [22, 1] on the most-likely skyline and convex hull problems in that those papers assume that each point has a given probability distribution and attempt to find the subset of points that has the highest probability of being the skyline (or convex hull).

2 Definitions and Results

Definition 2

or will denote that is a real number will denote that OR

Definition 3

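Let $\mathbf{D}$ be a distribution over $\mathbb{R}^2$.

  • If $\delta \ge 0$, the distribution $\delta \mathbf{D}$ is generated by choosing a point $u$ using $\mathbf{D}$ and then returning the point $\delta u$.

  • Let $\mathbf{D}_1, \mathbf{D}_2$ be two distributions over $\mathbb{R}^2$. $\mathbf{D}_1 + \mathbf{D}_2$ is the convolution of $\mathbf{D}_1, \mathbf{D}_2$. It is generated by choosing a point $u_1$ from $\mathbf{D}_1$ and a point $u_2$ from $\mathbf{D}_2$ and returning $u_1 + u_2$.

  • A set $S_n = \{u_1, \dots, u_n\}$ of $n$ points is chosen from $\mathbf{D}$ if the $u_i$ are IID with each $u_i$ being generated using distribution $\mathbf{D}$.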

Definition 4

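Let $A \subseteq \mathbb{R}^2$ be a set and $\delta \ge 0$. Then $\delta A = \{\delta u : u \in A\}$.

The Minkowski sum of sets $A$ and $B$ is $A + B = \{u + v : u \in A,\ v \in B\}$.

If $u \in \mathbb{R}^2$, $u + A$ will denote the set $\{u\} + A$.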

Definition 5

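Let $p \in [1, \infty)$, $u \in \mathbb{R}^2$ and $r \ge 0$.

  • The $L_p$ ball of radius $r$ around $u$ is $B_p(u, r) = \{v \in \mathbb{R}^2 : |v.x - u.x|^p + |v.y - u.y|^p \le r^p\}$.

  • The $L_\infty$ ball of radius $r$ around $u$ is $B_\infty(u, r) = \{v \in \mathbb{R}^2 : \max(|v.x - u.x|, |v.y - u.y|) \le r\}$.

Let $B_p$ and $B_\infty$ denote the respective unit balls centered at the origin and $a_p$, $a_\infty$ denote their respective areas.

  • For all $p \in [1, \infty]$, $\mathbf{B}_p$ will denote the uniform distribution that selects a point uniformly from $B_p$. This distribution has support $B_p$ with uniform density $1/a_p$ within $B_p$.

  • $\mathbf{B}_p + \delta \mathbf{B}_q$ will be the convolution of distributions $\mathbf{B}_p$ and $\delta \mathbf{B}_q$. This distribution's support is the Minkowski sum $B_p + \delta B_q$. Note that the density of $\mathbf{B}_p + \delta \mathbf{B}_q$ is NOT uniform in $B_p + \delta B_q$.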
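The convolution described above can be simulated directly: draw one point uniformly from the unit $L_p$ ball, draw an independent error uniformly from the unit $L_q$ ball, scale the error by $\delta$ and add. The sketch below is a Monte Carlo illustration under stated assumptions, not the paper's method. All function names are hypothetical, uniform sampling is done by rejection from the square $[-1, 1]^2$, and the estimate of the expected number of maxima is only an empirical average.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def sample_unit_ball(p: float, rng: random.Random) -> Point:
    """Uniform point in the 2-D unit L_p ball, p in [1, inf], by rejection."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The unit L_inf ball is the whole square, so nothing is rejected.
        if p == float("inf") or abs(x) ** p + abs(y) ** p <= 1.0:
            return (x, y)


def sample_perturbed(p: float, q: float, delta: float, rng: random.Random) -> Point:
    """One draw from the convolution: unit L_p point plus delta times L_q error."""
    ux, uy = sample_unit_ball(p, rng)
    ex, ey = sample_unit_ball(q, rng)
    return (ux + delta * ex, uy + delta * ey)


def count_maxima(points: List[Point]) -> int:
    """Number of maximal points, via a right-to-left sweep over distinct points."""
    count = 0
    best_y = float("-inf")
    for _, y in sorted(set(points), key=lambda pt: (-pt[0], -pt[1])):
        if y > best_y:
            count += 1
            best_y = y
    return count


def estimate_expected_maxima(n: int, p: float, q: float, delta: float,
                             trials: int, seed: int = 0) -> float:
    """Empirical mean of the number of maxima over independent trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pts = [sample_perturbed(p, q, delta, rng) for _ in range(n)]
        total += count_maxima(pts)
    return total / trials
```

For example, `estimate_expected_maxima(1000, float("inf"), 1, 0.1, trials=50)` gives a rough empirical value for one choice of parameters; such estimates are noisy and say nothing rigorous about asymptotic behavior.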

The main result of this paper is

Theorem 1

Fix so that either or and Let be points chosen from the distribution and . Let be a function of . Then behaves as below.

Observations: In

  • When , has exactly the same distribution as if were chosen from so this is an uninteresting case.

  • When is small enough , behaves almost as if were chosen from and when is large enough it behaves almost as if were chosen from

  • Later Lemma 8 will show that has the same distribution for chosen from both and Thus row (iv) gives the behavior for for any and row (v) the behavior for

  • When the behavior starts at , smoothly decreases until reaching and then increases again until reaching . The behavior in the middle is different for and In both cases there is symmetry between and (from Lemma 8).

  • When there is no symmetry. Behavior starts at , decreases to at and then increases again at a different rate to .

  • When , the behavior is asymptotically equivalent for all not just The only difference is in the value of the constant hidden by the The behavior starts at , stays there for a short while and then smoothly increases to

Figure 3: Illustrations of the supports of some of the different distributions in the form examined in Theorem 1. The dotted lines denote the and balls centered at Note that in all cases the density is uniform near the center of the support but then decreases to as the boundary is approached. The gray areas denote, approximately, where the maxima of are concentrated.

3 Basic Lemmas

The following collection of Lemmas comprises the basic toolkit used to derive Theorem 1. They are only stated here, with complete proofs being provided in Section 5.

Definition 6

Let be a distribution over , and a measurable region.

  • will denote the density function of

  • will denote the measure of

If is understood we often simply write and

Definition 7

Let , and

  • .

  • is dominant in or a dominant region in if

Note that, by definition, is a dominant region in

Lemma 1

Let and be chosen from and Then

The following observation will be used to prove most of our lower bounds.

Lemma 2 (Lower Bound)

Let be chosen from . Further let be a collection of pairwise disjoint dominant regions in with for all . Then

Definition 8

Let For define

the preimage of point in

Lemma 3

Fix . Let and let be a point chosen from Let . Then

(1)
(2)
Lemma 4

Fix . Let and be any constant.

The constants implicit in the in (a) and (c) are only dependent upon while the constants implicit in the in (b) and (d) are only dependent upon

Lemma 5 (Mirror)

Let be any distribution with a continuous density function and a set of points chosen from . Let be two disjoint regions in the support that are parameterized by and satisfy:

  1. .

  2. (Monotonicity in ) , and .

  3. (Asymptotic dominance in measure)

Define the random variables

Then

Lemma 6 (Sweep)

Let be any distribution with a continuous density function and a set of points chosen from . Let be two disjoint regions in the support that are parameterized by , satisfy conditions 1-3 of Lemma 5 and, in addition, satisfy

Then

Corollary 7

Fix and choose from Let be the upper-right quadrant of the plane and the first octant , i.e.,

Then

(3)
(4)

Proof: Set

For set

Conditions (1) and (2) of Lemma 5 trivially hold. Condition (3) holds because, by symmetry around the -axis Finally the additional condition of Lemma 6 holds because every point in is below and to the left of every point in . Thus the expected number of maximal points in below the -axis is . Note that this is independent of . Similarly, the expected number of maximal points to the left of the -axis is . This proves Eq. 3

To prove Eq. 4 define the second octant to be

By the symmetry between the and coordinates in the distribution,

Furthermore, since and partition ,

Thus

The fact that for , dominates if and only if dominates implies

Lemma 8 (Scaling)

Fix , and Let be points chosen from and points chosen from . Then and have exactly the same distribution. In particular

Lemma 9 (Limiting Behavior)

Let , , and chosen from . Then

Note that if chosen from , and are independent random variables. Thus, for any if is chosen from and are independent random variables. As noted in the introduction, this means that if is chosen from , is exactly the same as if was just chosen from i.e.,

Now note that Lemma 9 combined with Lemma 8 immediately implies the limiting behavior in columns (b) and (e) of the table in Theorem 1. Note too that for rows (ii) and (iii), column (d) follows directly from applying Lemma 8 to column (c).

Thus, proving Theorem 1 reduces to proving cells (ii) c, (iii) c, (iv) c,d and (v) c,d. In the next sections we sketch how to derive these results with full proofs relegated to the appendix.

4 The General Approach

4.1 A Simple Example:

Figure 4: Illustration of proof when is chosen from All but maxima will be in first quadrant ; (b) and (c) only show . (b) illustrates the lower bound and (c) the upper bound.

Before sketching our results it is instructive to see how the Lemmas in the previous section can be used to re-derive the fact that, if then . This is illustrated in Figure 4.

Even though the behavior of is already well understood, we provide this to sketch the generic steps that are needed to derive . These are exactly the same steps that are needed when , and working through them permits identifying where the complications can arise in those more general cases.

Set and let be the points defined in the figure with and Also set

Finally, for set and The steps in the derivation are:

  1. Restricting to first Quadrant: Corollary 7 implies that it is only necessary to analyze .

  2. Calculating Density and Measure: Because has a uniform density, for all regions

  3. Lower Bound: The are a collection of pairwise disjoint dominant regions with Thus, from Lemma 2,

  4. Upper bound: Note that so

    Since , Lemma 1 implies that for all

    The crucial observation is that for all , the Sweep Lemma (Lemma 6) holds with and . Thus Combining the above completes the upper bound, showing that

4.2 The General Approach

The proof of Theorem 1 will require case-by-case analyses of for different pairs . The analysis for each pair will follow exactly the same 4 steps as the analysis of above. We note where the complications arise.

Step 1 of restricting the analysis to quadrant will be the same for every case.

Step 2, of deriving the measure, will often be quite cumbersome. While Lemma 3 provides an integral formula, in many cases that formula is unusable. The density varies quite widely near the border of the support, which is where most of the maxima are located. A substantial amount of work is involved in finding usable functional representations for the densities/measures in different parts of the support.

Step 3, of deriving the lower bound, is usually a simple application of Lemma 2, given the results of step 2.

Step 4 is the hardest step. The upper bound is usually derived using the Sweep Lemma, with the difficulties arising from how to specify the regions to be swept. This strongly depends upon how the measure is represented.

5 Proofs of Basic Lemmas

Proof of Lemma 2:

First note that, from Lemma 1, Thus implies

If region is dominant then points in can only be dominated by other points in so Since each is dominant, this implies

Finally, since the are pairwise disjoint,

Proof of Lemma 3:

Note that for , and for , .

To see Eq. 2 note that

For Eq. 1 first note that

where (5) comes from the change of variables . Differentiating around yields Eq. 1.

Proof of Lemma 4:

The proof for (a) follows easily from the fact that, for all

so from Eq. 1, . Furthermore, if then

where is only dependent upon . Thus, again from Eq. 1, , proving (b). The proofs for (c) and (d) follow from plugging these inequalities into Eq. 2.

Proof of Mirror Lemma (Lemma 5):

Without loss of generality smoothly rescale so that , and thus .

The informal intuition of the Lemma is that since the “first” point in appears when the sweep line is , Since is asymptotically dominated in measure by and thus

Note that by the continuity of the measure we know that . That is, we may assume that

Now assume that is known. Conditioned on known , the remaining points in are chosen from with the associated conditional distribution. More specifically, if is any one of those points.

Thus, conditioning on , and applying Lemma 1 (b)

and therefore

From the definition of and Lemma 1 (c), with exponentially low probability. Therefore, recalling that

Another application of Lemma 1 (c) shows

Thus

Proof of Sweep Lemma (Lemma 6):

From the setup in Lemma 5, for all all points in are dominated by all points in . By the definition of , contains (exactly) one point. Thus no point in can be maximal, i.e.,

The proof follows from

Proof of Lemma 8:

Let be chosen from . Recall that the process of choosing point from is to choose from , from and return . Choosing a point from is the same except that it returns . Thus the distribution of choosing from is exactly the same as choosing from

Finally, note that dominance is invariant under multiplication by a scalar, i.e., dominates if and only if dominates . Thus and have the same distribution and
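The invariance of dominance under multiplication by a positive scalar can be checked on a small example: scaling every point by the same factor leaves the set of maximal indices unchanged. The sketch below is illustrative only. The helper names `maxima_indices` and `scale` are hypothetical, the points are assumed to be distinct, and the scale factor is assumed to be strictly positive.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def maxima_indices(points: List[Point]) -> List[int]:
    """Indices of the maximal points (assumes all points are distinct)."""
    # Visit indices by decreasing x, breaking ties by decreasing y.
    order = sorted(range(len(points)),
                   key=lambda i: (-points[i][0], -points[i][1]))
    result: List[int] = []
    best_y = float("-inf")
    for i in order:
        # A point is maximal exactly when its y beats every y seen so far.
        if points[i][1] > best_y:
            result.append(i)
            best_y = points[i][1]
    return sorted(result)


def scale(points: List[Point], c: float) -> List[Point]:
    """Multiply every coordinate by the same positive scalar c."""
    if c <= 0:
        raise ValueError("the scalar must be strictly positive")
    return [(c * x, c * y) for x, y in points]
```

A positive scalar preserves the order of each coordinate separately, which is why the same indices come out before and after scaling; a negative scalar would reverse both orders and generally change the answer.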

The proof of Lemma 9 will need an observation that will be reused multiple times in the analysis of and is therefore stated first, in its own lemma.

Lemma 10

Recall from Definition 7. Fix and set