
Stochastic Tverberg theorems and their applications in multi-class logistic regression, data separability, and centerpoints of data

07/23/2019
by   Jesús A. De Loera, et al.

We present new stochastic geometry theorems that give bounds on the probability that m random data classes all contain a point in common in their convex hulls. We apply these stochastic separation theorems to obtain bounds on the probability of existence of maximum likelihood estimators in multinomial logistic regression. We also discuss connections to condition numbers for analysis of steepest descent algorithms in logistic regression and to the computation of centerpoints of data clouds.


1 Introduction

This paper shows how methods from stochastic convex geometry can be successfully used in the foundations of data science. Before presenting the geometric results, we discuss their implications:

Logistic regression is perhaps the most widely used non-linear model in multivariate statistics and supervised learning [12]. Statistical inference for this model relies on the theory of maximum likelihood estimation. In the binary classification case, given independent observations (x_i, y_i), i = 1, ..., n, with covariates x_i in R^d and labels y_i in {-1, 1}, logistic regression links the response to the covariates via the logistic model

P(y_i = 1 | x_i) = exp(<x_i, beta>) / (1 + exp(<x_i, beta>)),

where beta in R^d is the unknown vector of regression coefficients. In this model, the log-likelihood is given by

l(beta) = - sum_{i=1}^n log(1 + exp(-y_i <x_i, beta>)),

and, by definition, the maximum likelihood estimate (MLE) is any maximizer of this functional. The basic intuition behind this method is as follows: we seek coefficients beta so that the sign of <x_i, beta> corresponds as closely as possible with the observations y_i. For example, if y_i and <x_i, beta> have different signs, there is a larger “penalty” expressed in the log-likelihood, since in that case exp(-y_i <x_i, beta>) > 1. See [10] for further discussion and examples.
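As a concrete illustration, the log-likelihood above can be evaluated directly. The following sketch (our own, not the paper's code) assumes labels coded as ±1 and uses NumPy purely as an illustrative tool:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Logistic log-likelihood with labels y_i in {-1, +1}:
    l(beta) = -sum_i log(1 + exp(-y_i <x_i, beta>))."""
    margins = y * (X @ beta)
    return -np.sum(np.log1p(np.exp(-margins)))
```

At beta = 0 every observation contributes -log 2, and coefficients whose signs agree with the labels raise the likelihood, matching the "penalty" intuition above.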

One difficulty arising in machine learning is that the MLE does not exist in all situations. In fact, given two data classes, say one of red points (where y_i = 1) and one of blue points (where y_i = -1), it is well-known that an MLE exists if and only if the convex hull of the blue points intersects the convex hull of the red points [1, 18]. Although an appealing criterion for existence, this geometric characterization leads to another question: how much training data do we need, as a function of the dimension of the covariates, before we expect an MLE to exist with high probability?
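The convex-hull criterion can be checked by a small linear feasibility program: the hulls of two finite point sets intersect exactly when some convex combination of the red points equals some convex combination of the blue points. A sketch using SciPy's `linprog` (an illustrative choice of solver, not part of the paper):

```python
import numpy as np
from scipy.optimize import linprog

def hulls_intersect(red, blue):
    """LP feasibility check: conv(red) and conv(blue) intersect exactly when
    some convex combination of the red points equals some convex combination
    of the blue points."""
    r, b, d = len(red), len(blue), red.shape[1]
    A_eq = np.zeros((d + 2, r + b))
    A_eq[:d, :r] = red.T                 # sum_i lambda_i red_i ...
    A_eq[:d, r:] = -blue.T               # ... minus sum_j mu_j blue_j = 0
    A_eq[d, :r] = 1.0                    # sum lambda = 1
    A_eq[d + 1, r:] = 1.0                # sum mu = 1
    b_eq = np.zeros(d + 2)
    b_eq[d] = b_eq[d + 1] = 1.0
    res = linprog(np.zeros(r + b), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0               # 0 = feasible, hulls intersect
```

By the criterion above, `hulls_intersect` returning True corresponds exactly to the regime where an MLE exists.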

The seminal work of Cover [6] (adapting a technique originally due to Schläfli [17]) provides an answer in a special case. When applied to logistic regression, Cover's main result states the following: assume that the x_i's are drawn i.i.d. from a continuous probability distribution and that the class labels y_i are independent from the x_i and have equal marginal probabilities, i.e., P(y_i = 1) = P(y_i = -1) = 1/2. Then Cover showed that as n and d grow large in such a way that n/d tends to a constant kappa, the convex hulls of the data points asymptotically overlap, with probability tending to one, if kappa > 2, whereas they are separated, also with probability tending to one, if kappa < 2. When the class labels are not independent from the x_i, the problem is more difficult. In this case, Candès and Sur [3] proved that a similar phase transition occurs, and is parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients.

Tukey introduced a notion of depth for a point p relative to a data cloud X, namely the smallest number of data points in a closed half-space whose boundary passes through p (see [21, 16] and references therein). We say a point p has half-space depth k in X if every half-space containing p contains at least k points of X. A centerpoint of an n-point data set X in R^d is a point c such that every half-space containing c has at least n/(d+1) points in X; it is thus a point of depth at least n/(d+1). In a way, a centerpoint generalizes the notion of the median to high-dimensional data. Centerpoints are useful in a variety of applications (see, e.g., [7] for references). Unfortunately, computing an exact centerpoint is expensive, even with the current best randomized algorithms [4, 13]. Thus finding an approximate centerpoint of a set is of interest.
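Half-space depth can be estimated numerically by sampling directions: for each sampled direction one counts the points in the corresponding closed half-space through p, and the minimum over samples is an upper bound on the true depth. A small sketch (the direction-sampling scheme is our own illustration, not an algorithm from the paper):

```python
import numpy as np

def approx_tukey_depth(p, data, n_dirs=2000, seed=0):
    """Estimate the half-space (Tukey) depth of p in data by sampling random
    directions u and counting the points in the closed half-space
    {x : <x - p, u> >= 0}; the minimum over sampled u (and -u) is an upper
    bound on the true depth."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, data.shape[1]))
    proj = (data - p) @ dirs.T                  # shape (n_points, n_dirs)
    pos = (proj >= 0).sum(axis=0)
    neg = (proj <= 0).sum(axis=0)
    return int(min(pos.min(), neg.min()))
```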

Consequences of our geometric results

The first contribution of our paper is to further develop the connection between geometric probability (Cover's result), discrete geometry (Tverberg-type results), and the conditions for the existence of MLEs. Our paper discusses the generalization of Cover's stochastic separation problem to more than two colors by studying so-called Tverberg partitions: partitions of a data set into classes such that the intersection of the convex hulls of all the classes is nonempty.

Each of our stochastic-geometric theorems has a nice implication. Table 1 summarizes our theorems (middle column) as well as their consequences for the existence of the maximum likelihood estimator in terms of the size of the data set (right column).

Deterministic version | Stochastic version | Likely MLE existence
Radon | Cover's theorem [6] | pair of data classes (mentioned above)
Tverberg | Thms 4, 5 | all pairs of data classes (Theorem 2 part 1.)
Radon with tolerance | Thm 7 | pair of data classes with outliers removed (Theorem 2 part 2.)
Tverberg with tolerance | Thms 2, 6, [19] | all pairs of data classes with outliers removed (Theorem 2 part 2.)
Table 1: Stochastic analogues of Tverberg’s theorem and their implications for existence of MLEs. By “Likely MLE Existence”, we mean that one can bound below the probability of MLE existence as a function of the number of input data points, according to the corresponding theorems in the “Stochastic” column.

There are two common approaches to extending binary classification to multi-class classification: “one-vs-rest” and “one-vs-one”. Suppose the data has K classes. In “one-vs-rest”, we train K separate binary classification models, where the k-th classifier is trained to determine whether or not an example belongs to class k. To predict the class for a new sample x, we run all K classifiers on x and choose the class with the highest score. In “one-vs-one” regression, we train K(K-1)/2 separate binary classification models, one for each possible pair of classes. To predict the class for a new sample x, we run all classifiers on x and choose the class with the most votes.
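The pairwise voting scheme can be sketched as follows. For brevity, the binary classifiers here are toy nearest-centroid rules, a simplifying stand-in for the pairwise logistic regressions discussed in the text:

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(X, y, x_new):
    """Toy "one-vs-one" scheme: one binary classifier per pair of classes,
    each casting a vote; the class with the most votes wins. The pairwise
    classifiers are nearest-centroid rules purely for illustration."""
    classes = list(np.unique(y))
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        ca = X[y == a].mean(axis=0)
        cb = X[y == b].mean(axis=0)
        winner = a if np.linalg.norm(x_new - ca) <= np.linalg.norm(x_new - cb) else b
        votes[winner] += 1
    return max(votes, key=votes.get)
```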

To apply “one-vs-one” multinomial logistic regression, we would like to ensure that the MLE exists between the data corresponding to every pair of labels. The next theorem applies our stochastic Tverberg theorems to give a sufficient condition for all these MLEs to exist with high probability (a sequence of events occurs with high probability if its probability tends to one as the index grows).

Theorem 1 (Stochastic Tverberg theorems applied to multinomial regression)

Fix . Assume that the ’s are drawn i.i.d. from a centrally symmetric continuous probability distribution on and that the class labels are independent from , and have equal marginal probabilities; i.e., for all .
Then

  1. Letting the number of data points grow as a function of the number of labels , the MLE exists between the data corresponding to every pair of labels with high probability as long as

  2. Suppose the number of data points is , where we fix the number of labels, and is a function of - the number of outliers to be removed from the data set. Then the MLE exists between the data corresponding to every pair of labels with high probability if any points are removed, so long as .

The same bound applies to “one-vs-rest” logistic regression, since MLE existence in that case is a weaker condition. The various special cases of Stochastic Tverberg theorems are thus useful in different kinds of classification problems, and these observations are summarized in Table 1.

The last two rows of the table summarizing our results were motivated by the challenge of dealing with outlier data and of seeking robust classification; for this we rely on an additional parameter: tolerance. A t-tolerant partition, which will be defined formally later, is a notion of “robust” intersection in the sense that the intersection of the convex hulls of the subsets remains non-empty even after any t points are removed. See Figure 1 for an example of a partition with tolerance one.

Figure 1: A partition in three data classes with tolerance one. All three convex hulls intersect even after any one point is removed.

The parameter of tolerance is also significant in studying MLE existence. A natural observation is that tolerant partitions correspond to robust MLE existence: any t points, possibly corrupted or outlier data, can be removed, and still the convex hulls of the data with each label intersect.
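In dimension one, where convex hulls are just intervals, tolerance can be checked by brute force over removal sets. The following sketch (our own illustration, not an algorithm from the paper) tests whether a 1-D partition has tolerance t:

```python
from itertools import combinations

def is_tolerant_tverberg_1d(parts, t):
    """Brute force over all removal sets of size t: the interval hulls
    [min, max] of all parts must still share a point after any t points are
    removed. (Removing fewer than t points only enlarges the hulls, so
    checking size-t removal sets suffices.)"""
    pts = [(x, j) for j, P in enumerate(parts) for x in P]
    for S in combinations(range(len(pts)), t):
        remaining = [[] for _ in parts]
        for i, (x, j) in enumerate(pts):
            if i not in S:
                remaining[j].append(x)
        if any(len(R) == 0 for R in remaining):
            return False
        lo = max(min(R) for R in remaining)   # largest left endpoint
        hi = min(max(R) for R in remaining)   # smallest right endpoint
        if lo > hi:                            # intervals no longer intersect
            return False
    return True
```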

In fact, the parameter of tolerance is also similar to an important parameter used to guarantee the convergence speed of first-order methods for finding MLEs. Recently, when studying binomial logistic regression, Freund, Grigas and Mazumder [9] introduced the following notion to quantify the extent to which a dataset is non-separable (where the bracketed quantity denotes the negative part):

DegNSEP* is thus the smallest (over all normalized models) average misclassification error of the model over the observations. They showed that the condition number DegNSEP* informs the computational properties and guarantees of the standard deterministic first-order steepest descent solution method for logistic regression. Let us now briefly discuss how the parameter of tolerance for Radon partitions (Tverberg 2-partitions) can be viewed as a discrete analogue of DegNSEP*.

Define PertSEP* as the smallest perturbation (or more precisely, the infimum thereof) of the feature data which renders the perturbed problem instance separable. In Proposition 2.4 of [9] it is shown that DegNSEP* = PertSEP*.

In this paper we introduce a new parameter, defined as the minimal number of data points one could move to make the data set separable, normalized by the total number of data points. The following theorem relates this parameter to the tolerance of a Radon partition:

Theorem 2

Suppose that , is a Radon partition with tolerance precisely equal to . Then viewing as a labeled dataset (with ), we have that
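For small one-dimensional instances, the quantity behind this parameter (before normalization) can be computed by brute force, using the equivalence between moving points and removing them from the proof of Theorem 2. A hedged sketch (our own illustration):

```python
from itertools import combinations

def min_removals_to_separate_1d(red, blue):
    """Brute force in R^1: the smallest number of points whose removal makes
    the two classes linearly separable (disjoint interval hulls). By the
    moving/removing equivalence in the text, this also equals the minimal
    number of points one would have to move. An emptied class counts as
    trivially separable."""
    pts = [(x, 0) for x in red] + [(x, 1) for x in blue]
    for k in range(len(pts) + 1):
        for S in combinations(range(len(pts)), k):
            r = [x for i, (x, c) in enumerate(pts) if c == 0 and i not in S]
            b = [x for i, (x, c) in enumerate(pts) if c == 1 and i not in S]
            if not r or not b or max(r) < min(b) or max(b) < min(r):
                return k
    return len(pts)
```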

Theorem 2, combined with a result of Soberón, has a corollary, stated precisely in the next section, which roughly says that this parameter for a randomly bi-partitioned point set asymptotically approaches the highest possible value one could hope for, since, by definition, the parameter of any two-class data set is bounded above. In fact, this result extends easily to the multi-class setting. In other words, for a large randomly partitioned data set, we expect the parameter of every pair of data classes to be close to its maximum, independent of both the dimension of the covariates and the number of classes.

For further discussion of PertSEP* and for two-class data, including more probabilistic aspects of these condition numbers and many interesting implications for steepest descent algorithms, see [9].

We also discuss how our geometric probability results are related to the problem of computing approximations to centerpoints of datasets. Table 2 and the discussion that follows summarize our contributions. Tverberg's theorem implies that every data set has a centerpoint, as the Tverberg intersection point of a Tverberg partition must be a point of half-space depth at least one in each of the color classes. Hence an effective version is desirable as a method to obtain centerpoints. The proof of Radon's lemma is constructive and, in fact, one of the most notable randomized algorithms for computing approximate centerpoints works by repeatedly replacing subsets of points by their Radon point. In contrast, no polynomial-time algorithm is known for computing exact Tverberg points. Thus, fast algorithms for approximate Tverberg points have been introduced in [5, 14, 15]. If one is interested in probabilistic algorithms for finding Tverberg partitions, the main results of our paper can be used to bound the expected performance of algorithms that obtain Tverberg partitions by random choice, so long as the points come from a balanced distribution.

In particular, our Theorem 5 suggests a trivial algorithm for finding a Tverberg partition among a set of i.i.d. points drawn from a distribution which is balanced about a point: according to Theorem 5, a random equi-partition of such points into sufficiently few sets produces a Tverberg partition with high probability. This trivial randomized algorithm was also suggested by Soberón, except using a random allocation rather than an equi-partition. Our asymptotic results also improve the bounds on the expected performance of Soberón's proposed algorithm (random allocation) for points from a balanced distribution. We summarize the performance and time complexity of various algorithms for obtaining Tverberg partitions, including our own (last two rows), in Table 2.

Method | Number of colors | Time complexity
Tverberg | | PPAD (unknown if polynomial)
Mulzer, Werner [14] | |
Rolnick, Soberón [15] | | weakly poly., with error prob.
Random equi-partition | |
Random allocation | |
Table 2: Approximate Tverberg Partitions for balanced distributions.

Section 2 presents our geometric tools and results, and Section 3 contains the proofs of our new results.

2 Our geometric methods: Stochastic Tverberg-type theorems

We begin by recalling Tverberg's celebrated theorem [22], which generalizes Radon's lemma to m-partitions (see [2, 7] for references and a discussion of the importance of this theorem):

Theorem (H. Tverberg 1966)

Every set of at least (m - 1)(d + 1) + 1 points in Euclidean d-space has at least one m-Tverberg partition (with tolerance zero).
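Tverberg's theorem can be verified on small instances by brute force: enumerate colorings and test, via a linear feasibility program, whether the convex hulls of the parts share a point. An illustrative sketch using SciPy's `linprog` (not an efficient algorithm; as discussed above, exact Tverberg points are not known to be computable in polynomial time):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def hulls_have_common_point(parts):
    """LP feasibility: do conv(P_1), ..., conv(P_m) share a point? We ask for
    convex-combination weights in each part whose weighted sums coincide."""
    d = parts[0].shape[1]
    sizes = [len(P) for P in parts]
    n = sum(sizes)
    offs = np.cumsum([0] + sizes)
    rows, rhs = [], []
    for j in range(1, len(parts)):          # combo of part j equals combo of part 0
        for k in range(d):
            row = np.zeros(n)
            row[offs[0]:offs[1]] = parts[0][:, k]
            row[offs[j]:offs[j + 1]] = -parts[j][:, k]
            rows.append(row)
            rhs.append(0.0)
    for j in range(len(parts)):             # weights in each part sum to one
        row = np.zeros(n)
        row[offs[j]:offs[j + 1]] = 1.0
        rows.append(row)
        rhs.append(1.0)
    res = linprog(np.zeros(n), A_eq=np.array(rows), b_eq=np.array(rhs),
                  bounds=(0, None))
    return res.status == 0

def find_tverberg_partition(points, m):
    """Brute force over all colorings of `points` into m nonempty parts;
    returns a Tverberg coloring (guaranteed to exist when there are at least
    (m - 1)(d + 1) + 1 points), or None."""
    n = len(points)
    for colors in product(range(m), repeat=n):
        parts = [points[[i for i in range(n) if colors[i] == j]] for j in range(m)]
        if all(len(P) > 0 for P in parts) and hulls_have_common_point(parts):
            return colors
    return None
```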

The notion of “tolerant Tverberg theorems” was pioneered by Larman [11] and refined over the years, for instance in the result of Soberón and Strausz [20] below. Here is the precise definition:

Definition 1

Given a set X of points in R^d, a Tverberg m-partition of X with tolerance t is a partition of X into m subsets such that the convex hulls of all the subsets still intersect after any t points are removed.

Theorem (Soberón, Strausz 2012)

Every set of at least (m - 1)(t + 1)(d + 1) + 1 points in R^d has at least one Tverberg m-partition with tolerance t.

More recently, P. Soberón proved the following bound [19]: letting N(d, m, t) denote the smallest positive integer such that a Tverberg m-partition with tolerance t exists among any N(d, m, t) points in dimension d, he showed that N(d, m, t) = mt + o(t) for fixed d and m. The proof of this result relies on the probabilistic method and, as Soberón remarked, can in fact be used to prove a stochastic Tverberg-type result, which we will revisit later.

Prior Stochastic Tverberg theorems

Before stating our main results, we introduce two models for random partitioned data point sets. In both models we will use the term colors instead of subsets, for ease of notation. Hereafter, when we refer to a continuous distribution on R^d, we mean continuous with respect to the Lebesgue measure. We defer the proofs of the new results stated here to the next section.

Our first model is a so-called random equi-partition model, i.e., we ensure that every color has the same number of points. More specifically, given integers n and m (with m dividing n) and a continuous probability distribution on R^d, we consider a random equi-partitioned point set with n points, consisting of m colors and n/m points of each color, distributed independently according to the given distribution.

Our second model is a random allocation model: given integers n and m and a continuous probability distribution on R^d, we consider a random point set with n points i.i.d. according to the distribution, which are independently colored with one of m colors uniformly at random (probability 1/m for each color).
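The two models can be sampled as follows; here the underlying continuous distribution is taken to be a standard Gaussian purely for illustration, and the function names are our own:

```python
import numpy as np

def equipartition_sample(n, m, d, rng):
    """Random equi-partition model: n points i.i.d. (standard normal here,
    for illustration), with exactly n // m points of each of m colors."""
    pts = rng.normal(size=(n, d))
    colors = np.repeat(np.arange(m), n // m)
    return pts, colors

def random_allocation_sample(n, m, d, rng):
    """Random allocation model: n i.i.d. points, each independently colored
    with one of m colors uniformly at random."""
    pts = rng.normal(size=(n, d))
    colors = rng.integers(0, m, size=n)
    return pts, colors
```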

For example, using these models we can state Cover’s result as follows:

Theorem (T. Cover 1965)

If is a continuous probability distribution on , then

In particular, we have

Furthermore, for any and any sequence of continuous probability distributions where each is a distribution on , we have

and

To the best of the authors' knowledge, the first generalization of Cover's 1965 result to more than two colors appeared only recently in Soberón's paper [19]:

Theorem (P. Soberón 2018)

Let be positive integers and let be a real number. Given points in , a random allocation of them into parts is a Tverberg partition with tolerance with probability at least , as long as

This result is quite remarkable. For any fixed m and d, it shows that the probability that a random allocation of n points in R^d into m colors has tolerance at least t approaches one as n goes to infinity. On the other hand, by the pigeonhole principle, any allocation of n points into m colors must leave one color with at most n/m points. Thus, for a fixed number of colors m, the tolerance of a random partition is asymptotically as high as it could possibly be! By Theorem 2, this result yields the following corollary.

Corollary 1

For any sequence of partitioned point sets with a distribution on , and any , we have with high probability.

In fact, for fixed m and d, Corollary 1 can be extended to the multi-class setting. In other words, for a large randomly m-partitioned data set, we expect the parameter of every pair of data classes to be close to its limiting value:

Theorem 3

Fix . For any distribution on and any sequence of -partitioned point sets

we have

with high probability.

Our new stochastic geometric theorems

Our first theorem is a geometric probability result similar to Soberón’s and Cover’s. It yields a Stochastic Tverberg theorem for equi-partitions (without tolerance).

Theorem 4 (Stochastic Tverberg theorem for equi-partitions)

Suppose that a probability distribution is balanced about some point, in the sense that every hyperplane through that point partitions the distribution into two sets of equal measure. Then

In fact, the previous theorem is asymptotically tight in the number of colors. This is shown by our next theorem, which establishes an interesting threshold phenomenon for Tverberg partitions.

Theorem 5 (Tverberg Threshold Phenomena for equi-partitions)

Let be a continuous probability distribution in balanced about some point . Consider the sequence of random equi-partitioned point sets , where , and depends on . Then is Tverberg with high probability if , and is not Tverberg with high probability if .
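The threshold behavior of Theorem 5 can be explored by simulation. In dimension one with a standard Gaussian (which is balanced about 0), a random equi-partition is Tverberg exactly when the m interval hulls share a common point. A Monte Carlo sketch (our own illustration, not the paper's experiment):

```python
import numpy as np

def prob_equipartition_tverberg_1d(m, q, trials=500, seed=0):
    """Monte Carlo estimate, d = 1, standard normal distribution (balanced
    about 0): the fraction of random equi-partitions, m colors with q points
    each, whose m interval hulls share a common point. By Helly in R^1 this
    holds iff the largest minimum is at most the smallest maximum."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        pts = rng.normal(size=(m, q))
        if pts.min(axis=1).max() <= pts.max(axis=1).min():
            hits += 1
    return hits / trials
```

Varying q relative to m exhibits the transition from "Tverberg with high probability" to "not Tverberg with high probability".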

Remark: It is also interesting to consider the same problem in the “box convexity” setting, where the convex hull of a set of points is defined to be the smallest box (with sides parallel to the coordinate axes) enclosing those points. Since checking convex hull membership is easier in the box convexity setting, this setup may be more relevant in certain applications. Our method of proof of Theorem 4 also works in the box convexity setting, and we obtain the same bounds.

We note that the number of points needed to reach the conclusion in Theorem 5 is independent of the dimension, as in the aforementioned result of Soberón [19].

The next two theorems adapt both Cover’s result and Theorem 4 to the setting of tolerance.

Theorem 6 (Stochastic Tverberg with tolerance for equi-partition)

Suppose is a probability distribution on that is balanced about some point .

For the case of random bi-partitions, we can adapt Cover’s result to obtain a Stochastic Radon theorem with Tolerance.

Theorem 7 (Stochastic Radon with tolerance for random allocation)

If is a continuous probability distribution on , then

In particular, we have

Remark: Theorem 7 yields a weaker expected tolerance than Soberón's result, but the proof is shorter and more elementary.

For random allocations with more than two colors, we will use some developments on random allocation problems, including the following notation: if balls are thrown into urns uniformly and independently, we consider the number of throws necessary to obtain at least a prescribed number of balls in each urn.
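The urn quantity described above can be simulated directly; a minimal sketch (function name and interface are our own):

```python
import numpy as np

def throws_until_m_per_urn(k, m, seed=0):
    """Throw balls uniformly and independently into k urns until every urn
    holds at least m balls; return the number of throws used."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(k, dtype=int)
    throws = 0
    while counts.min() < m:
        counts[rng.integers(k)] += 1
        throws += 1
    return throws
```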

Corollary 2 (Stochastic Tverberg for random allocation)

Suppose is a probability distribution on that is balanced about some point .

  1. Then

  2. For the case of Tverberg without tolerance, we also have

  3. Suppose , is a sequence of random partitioned point sets, where depends on .

    Then is Tverberg with high probability if .

These results are improvements on Soberón’s bound when the number of colors is large relative to the desired tolerance.

3 Proofs of our stochastic results

Proof (Proof of Theorem 2)

Let one quantity denote the minimal number of points moved among any perturbation that makes the dataset separable, and another the minimal number of points needing to be removed to make it separable. The first quantity determines our new parameter, and the second determines the tolerance. It suffices to show that the two quantities are equal. In one direction, note that if some points are moved so that the resulting set is separable, then the set with those same points removed is also separable. In the other direction, suppose that the set with some points removed is separable by a hyperplane. Then, moving the removed points to the appropriate sides of that hyperplane, we can construct a separable dataset obtained from the original by moving only those points.

Proof (Proof of Theorem 3)

For fixed and , let denote the event that a random allocation of points in in colors has tolerance at least . By Soberón’s theorem above, asymptotically approaches one as goes to infinity. Now, for fixed and , let denote the event that a random allocation of points into colors has between and points of color , where

. By the law of large numbers,

approaches one as goes to infinity. As the events , where , all have probability approaching one, the probability of the intersection of all these events also approaches one. This can be seen by applying the union bound to their complements. Thus there exists such that the , where , simultaneously occur with probability . Therefore with probability , each pair of colors has at most points, and is a Radon partition of tolerance at least (the tolerance of each bi-partition is a priori bounded below by the tolerance of the -partition). By Theorem 2, of each pair is at least with probability . Since and were arbitrary, this completes the proof.

Proof (Proof of the lower bound in Theorem 4)

After a possible translation, we can assume without loss of generality that the distribution is balanced about the origin. We will prove that

by bounding from below the probability that the origin is a Tverberg point. We may assume without loss of generality that none of the randomly selected points is the origin. Furthermore, we can radially project the points onto a sphere of radius smaller than the minimal norm of the points, since this does not affect whether the origin is a Tverberg point. After this projection, we may assume the points are sampled on a small sphere centered at the origin. The origin is then a Tverberg point as long as the points of each color contain the origin in their convex hull; equivalently, no color has all of its points contained in one hemisphere. For a fixed color, the probability of the points of that color being contained in one hemisphere was computed by Wagner and Welzl [23] (generalizing the celebrated result of Wendel [24], which addresses the case when the distribution is rotationally invariant about the origin) as

(1)

Using this to compute the probability that none of the color classes is contained in one hemisphere we obtain the desired bound above.

Proof (Proof of the upper bound in Theorem 4)

Again, we assume without loss of generality that the distribution is balanced about the origin. We will first treat the case of dimension one, and then explain how to obtain the bound in arbitrary dimension. To bound the probability of a Tverberg partition from above, we bound the probability of the complement from below, namely the event that the convex hulls have empty intersection. In dimension one, this event contains the event that there is at least one color class with all points less than zero and at least one color class with all points greater than zero. Since we assume that the origin equipartitions the distribution, we can rephrase this in terms of the probability that, among several people each flipping fair coins, there is at least one person with all heads and at least one person with all tails. Denoting the events that at least one person gets all heads, or all tails, respectively, we have

Since and , this yields

The probability of a Tverberg partition is thus bounded as follows

This proves the desired bound for dimension one. For higher dimensions, we note that if we consider the projections onto each coordinate axis, the signs of the projected points are independent Bernoulli random variables with probability 1/2 (as the hyperplane orthogonal to each axis equipartitions the distribution, by the assumption that it is balanced about the origin). Thus to have a Tverberg partition, we must have that no pair of the color classes is separated by the origin after projecting onto any of the coordinate axes. Since these events are independent, the probability of this happening is bounded as follows.

Proof (Proof of Theorem 5)

We will show that is Tverberg with high probability if . Fix an . We set and apply the lower bound in Theorem 4 to deduce that

Choosing a constant so that , we have

We will show that the limit as approaches infinity of the left hand side is bigger than for any . Fix . As , there exists an such that for all . Consequently for all . Thus

Since was arbitrary, we see that the probability of a Tverberg partition tends to 1.

Now we show that is not Tverberg with high probability if . As before, we fix a quantity greater than zero and apply the upper bound in Theorem 4 to obtain

For any , when is large, both terms inside the parentheses are smaller than . Since , the probability of a Tverberg partition converges to zero as approaches infinity.

Proof (Proof of Theorem 6)

Again, we assume without loss of generality that the distribution is balanced about the origin. Let denote the set of points of some fixed color. Then we assume that , and we can partition into subsets with for each . By Wagner and Welzl's result (Equation 1 above), for each , contains the origin with probability at least . By independence, the probability that less than of the contain the origin is less than . On the other hand, if at least of the contain the origin, then by the pigeonhole principle contains the origin for any . Thus, with probability at least , we have that contains the origin. Since this probability is independent for each of the colors, the result follows.

Using a similar strategy combined with Cover’s result, we give the proof of Theorem 7 below.

Proof (Proof of Theorem 7)

Given points in colored red and blue by random allocation, we arbitrarily partition them into groups of size at least . By Cover’s result, for each fixed group, the convex hulls (of the red and blue points) in that group intersect with probability at least . For each of the groups, we think of the event that the convex hulls in that group intersect as a “success”. Then the probability that at least groups have intersecting convex hulls is bounded below by the probability that a binomial process with trials and success probability has at least total successes. Computing this binomial probability yields the theorem. (If at least groups have intersecting convex hulls, then removing at most points leaves at least one group with intersecting convex hulls. )

Proof (Proof of Corollary 2)

We split the proof according to the three respective parts of the statement.

  1. The probability that a random allocation of points into colors is an -Tverberg partition with tolerance is bounded below by the probability that a random allocation of points into colors has at least points per color, times the probability that an equipartition of points into colors is Tverberg with tolerance . The result for Tverberg with tolerance then follows from Theorem 6.

  2. The result for the special case of Tverberg without tolerance follows by the same reasoning as part (1), except using Theorem 4 in place of Theorem 6.

  3. To show the asymptotic result, we use a result on urn models due to Erdős and Rényi [8] saying that

    This implies that for any and sequence of points allocated into urns, we have at least points in each urn with high probability. Then we apply Theorem 5, which says that any equi-partition of a point set into colors and points per color is Tverberg with high probability.

4 Acknowledgments

This work was partially supported by NSF grants DMS-1522158 and DMS-1818169. We are grateful to David Rolnick and Pablo Soberón for their comments.

References