# Supervised Learning of Labeled Pointcloud Differences via Cover-Tree Entropy Reduction

We introduce a new algorithm, called CDER, for supervised machine learning that merges the multi-scale geometric properties of Cover Trees with the information-theoretic properties of entropy. CDER applies to a training set of labeled pointclouds embedded in a common Euclidean space. If typical pointclouds corresponding to distinct labels tend to differ at any scale in any sub-region, CDER can identify these differences in (typically) linear time, creating a set of distributional coordinates which act as a feature extraction mechanism for supervised learning. We describe theoretical properties and implementation details of CDER, and illustrate its benefits on several synthetic examples.


## Acknowledgments

All authors partially supported by OSD contract number N00024-13-D6400 via JHU-APL subcontract number 131753. (Distribution Statement A - Approved for public release; distribution is unlimited.) Bendich and Harer were also partially supported by NSF award BIGDATA 1444791. Harer was also partially supported by the DARPA MoDyL program, number HR0011-16-2-0033. Pieloch was supported by the National Science Foundation Graduate Student Fellowship Program through grant DGE 16-44869. We thank Christopher J. Tralie for many helpful insights about cover trees, and we thank David Porter and Michael Williams at JHU-APL for many motivational discussions.

## 1. Overview and Assumptions

We propose a new supervised machine-learning method for classification, where each object to be classified is a weighted

pointcloud, and the classification task is to learn which of a finite set of labels should be applied to the pointcloud. The method is fast, data-driven, multi-scale, and requires no tuning. Additionally, its details are transparently geometric; it does not suffer from the “black box” interpretation difficulties that arise in many machine-learning methods. We call it Cover-tree Differencing via Entropy Reduction [CDER, pronounced “cedar”].

A high-level sketch of CDER is as follows. We construct a partial cover tree [7] on the union of all labeled pointclouds that are given in some training set, and we search the cover tree for convex regions that are likely to be local minima of entropy [3]. For each such region, we build distributional coordinates from the dominant labels’ data. An ordered sequence of these distributional coordinates allows us to determine the likelihood of each label for an unlabeled test pointcloud.

Section 3 explains the notion of distributional coordinates defined on pointclouds as a front-end for supervised learning. Section 4 details the cover-tree algorithm and our enhancements to it. Section 5 explains our simple approach to entropy minimization in the labeled pointcloud context. Section 6 gives a formal description of CDER. Section 7 illustrates CDER with several synthetic examples. The remainder of this section establishes context and notation for this problem.

Fix a Euclidean space ℝ^D. (See the Discussion section for further thoughts on this requirement.) A pointcloud is a finite set X ⊂ ℝ^D. (In this document, |X| indicates the cardinality of a finite set X, whereas ‖x‖ indicates a norm.) We allow weighted pointclouds, where each x ∈ X has a positive weight w(x); the utility of weights is explained in Section 2. A cloud collection is a finite set of pointclouds

 X={X1,X2,…,XN},

with each Xi ⊂ ℝ^D. Let X̲ = X1 ∪ ⋯ ∪ XN denote the union of all pointclouds in a cloud collection, which is itself a pointcloud, with the weights and labels of each point inherited from the respective Xi.

For our supervised learning task, the training data is a cloud collection X, where each pointcloud Xi is labeled from a finite set of labels Λ. For the sake of discussion and visualization, we usually interpret Λ as a set of colors, but for mathematical purposes Λ = {0, 1, …, L − 1}, where L = |Λ|. Let λ denote the label function, and we also use λ to indicate a generic label in Λ.

It is acceptable that the pointclouds Xi have unequal sizes across X. It is also acceptable that the labeled sub-collections have unequal sizes across labels λ ∈ Λ. Pointwise weights can be assigned to compensate for these sizes, as in Section 2. Our only structural hypothesis is that, for each labeled sub-collection, the pointclouds are sampled from an underlying density function—or from several density functions as chosen by a random process—on ℝ^D. For each label λ, let fλ denote this density function.

We aim to efficiently find regions in ℝ^D where one label has greater density than the other labels. That is, we seek convex regions Ω ⊂ ℝ^D that are “characteristic” of a particular label λ because fλ is unusually prominent there, while ignoring regions where the various density functions are similar. See Figure 1.

We analyze the relative density of labels over regions using the information-theoretic notion of entropy (see Section 5), locating subregions of low entropy. Because of the use of cover trees, our method emphasizes regions that are very large, very dense, or small-yet-remote (see Section 4). For each of these regions Ω, we construct a function g that approximates fλ near Ω; such a g is an example of what we call a distributional coordinate (see Section 3).

## 2. Weights

Typically, pointclouds are experimental samples that do not come with pre-assigned weights, and we assign weights as follows. We have the prior assumption that each label is equally likely among the test data. (See the Discussion section for more general thoughts about this assumption.) We give the training set total weight 1. That is, each label is allotted total weight 1/L, where L = |Λ|. Moreover, we make the assumption that each sample is equally representative of its underlying density function fλ, regardless of its size. That is, each pointcloud Xi with label λ is allotted total weight 1/(L·Nλ), where Nλ is the number of training pointclouds with label λ. Finally, we assume all points of Xi have equal weight, so w(x) = 1/(L·Nλ·|Xi|) for each x ∈ Xi. Let X̲ denote the labeled and weighted pointcloud given by the union of all Xi with these labels and weights.

For example, suppose that the training set consists of X = {X1, X2, X3} with labels red, blue, and red, respectively. Suppose we want X̲ to have unit weight. Then each x ∈ X1 has weight 1/(4|X1|), each x ∈ X2 has weight 1/(2|X2|), and each x ∈ X3 has weight 1/(4|X3|).

These weights are chosen to model relative likelihoods in the following way: if each label λ has an underlying density function fλ, and each Xi with label λ is a sample from fλ, then the total weight of the points of label λ that fall in Ω should be proportional to ∫Ω fλ for any measurable set Ω.
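As a concrete sketch, the weighting scheme above can be implemented in a few lines (function and variable names are ours, not from the paper):

```python
from collections import Counter

def assign_weights(clouds, labels):
    """Assign per-point weights so each label gets total weight 1/L and
    each pointcloud shares its label's weight equally (a sketch of the
    scheme in Section 2; names are illustrative)."""
    L = len(set(labels))               # number of distinct labels
    n_per_label = Counter(labels)      # N_lambda: pointclouds per label
    weights = []
    for cloud, lam in zip(clouds, labels):
        w = 1.0 / (L * n_per_label[lam] * len(cloud))  # weight of each point
        weights.append([w] * len(cloud))
    return weights

# Example: X1 (red, 3 pts), X2 (blue, 4 pts), X3 (red, 5 pts)
clouds = [[(0, 0)] * 3, [(1, 1)] * 4, [(2, 2)] * 5]
ws = assign_weights(clouds, ["red", "blue", "red"])
total = sum(sum(w) for w in ws)   # total weight of the union is 1
```

With two red pointclouds and one blue one, each red pointcloud gets total weight 1/4 and the blue one gets 1/2, matching the worked example above.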

## 3. Distributional Coordinates

Given a cloud collection X as above, we would like to map each pointcloud X ∈ X to a vector v(X) ∈ ℝ^k for some fixed dimension k. While this is an interesting topic in its own right, we are mainly concerned with viewing this as a feature extraction method: that is, the vectors v(X) and corresponding labels will be used as input for any supervised learning algorithm that requires collections of vectors in a common Euclidean space, along with class labels.

One could imagine a supervised learning approach to labeled pointclouds that does not perform feature extraction; for example, by first defining a (non-Euclidean) metric between pointclouds, and then applying a nearest-neighbor or clustering method. However, most such metrics are unstable to noise (such as the Hausdorff distance) or are slower-than-linear to compute (such as various Kullback-Leibler divergences [10] or the Wasserstein metric [9]).

This section frames the problem of feature extraction in the common language of distributional coordinates. We start with a definition of this intuitive concept, and then describe the advantages and disadvantages of some standard examples. The approach we advocate comes in the next section.

Let g be an integrable function on Euclidean space. For any pointcloud X = {x1, …, x|X|} with pointwise weights w1, …, w|X|, define

 ∫X g = w1 g(x1) + ⋯ + w|X| g(x|X|).

We refer to g as a distributional coordinate. Any ordered set (g1, …, gk) of distributional coordinates transforms X into a vector (∫X g1, …, ∫X gk) ∈ ℝ^k. It is computationally important to keep the dimension k—or at least the intrinsic dimension of the set of feature vectors—as small as possible, as we hope to minimize the negative effect of the curse of dimensionality on the supervised learning problem.
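A minimal sketch of evaluating distributional coordinates on a weighted pointcloud; the helper names and the particular choices of g are ours, for illustration only:

```python
import math

def integrate(cloud, weights, g):
    """The weighted 'integral' of g against a pointcloud:
    sum_i w_i * g(x_i), as in the definition above."""
    return sum(w * g(x) for x, w in zip(cloud, weights))

def to_vector(cloud, weights, gs):
    """Map a pointcloud to the feature vector (int_X g1, ..., int_X gk)."""
    return [integrate(cloud, weights, g) for g in gs]

cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights = [0.5, 0.25, 0.25]
gs = [lambda x: 1.0,                      # total mass of the pointcloud
      lambda x: x[0],                     # weighted first coordinate
      lambda x: math.hypot(x[0], x[1])]   # weighted distance from origin
vec = to_vector(cloud, weights, gs)
```

Here `vec[0]` recovers the total weight of the pointcloud, since the constant function 1 integrates to the sum of the weights.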

We use the cloud collection shown on the bottom of Figure 2 as a notional working example in the Euclidean plane. There are two class labels, “magenta” and “green.” We imagine that each magenta pointcloud (a typical example on the top-left of the figure) contains a large number of points sampled from a large central blob and a few points sampled from much smaller blobs along the horizontal axis. Each green pointcloud (a typical example on the top-right of the figure) contains a large number of points sampled from the same large central blob and a few points sampled from smaller, different blobs along the horizontal axis.

In a typical supervised learning context, X would be split (say 80:20) into training and testing sets, and we would build a set of distributional coordinates from the labeled training set, which could then infer the correct color labels on the testing set.

### 3.1. Binning

A simple approach would be to let each gi be the indicator function on some region Ωi; that is, gi takes the value 1 on Ωi and 0 elsewhere. Then ∫X gi counts the total weight of the points in X ∩ Ωi. Assuming the regions cover the domain with overlap only on boundaries (for example, the Ωi could be the squares shown in Figure 2), this is what is commonly referred to as binning, and it has been used successfully (for example, [4]) for supervised learning on labeled pointclouds. On the other hand, binning is obviously unstable: consider a point that lies very close to the boundary between two adjacent bins. It can also suffer from the curse of dimensionality: a larger number of bins than are shown in Figure 2 would be needed to distinguish the class labels.
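A sketch of binning as distributional coordinates, using an illustrative 2×2 grid (the grid and function names are ours). Note how a point just left of a bin edge contributes its entire weight to one bin; nudging it across the edge moves all of that weight, which is the instability described above:

```python
def bin_features(cloud, weights, edges):
    """Distributional coordinates given by indicator functions on a square
    grid: each feature is the total weight falling in one bin."""
    k = len(edges) - 1
    feats = [0.0] * (k * k)
    for (x, y), w in zip(cloud, weights):
        ix = next((j for j in range(k) if edges[j] <= x < edges[j + 1]), None)
        iy = next((j for j in range(k) if edges[j] <= y < edges[j + 1]), None)
        if ix is not None and iy is not None:
            feats[ix * k + iy] += w      # whole weight lands in one bin
    return feats

edges = [0.0, 1.0, 2.0]                  # a 2x2 grid of unit-square bins
cloud = [(0.5, 0.5), (1.5, 0.5), (0.99, 0.01)]
feats = bin_features(cloud, [0.2, 0.3, 0.5], edges)
# the point (0.99, 0.01) sits a hair's breadth from the bin boundary at x = 1
```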

### 3.2. Gaussian Mixtures

To address instability, one could define each gi to be a Gaussian with mean μi and covariance matrix Σi. The obvious questions are: how many Gaussians (modes) should be used, what are their means, and what are their covariances? A data-driven approach should be used to fit a Gaussian Mixture Model ([8], VI.8) to X̲.

It is apparent to the human eye that nine Gaussians (1 large shared by magenta and green, 4 small magenta, and 4 small green) model the example in Figure 2. That is, we could define “by hand” a good set of distributional coordinates to establish a feature vector. Moreover, for the purposes of classification, the central “large” distributional coordinate is useless. And perhaps the two innermost “small” Gaussians are less useful, as the remaining six distributional coordinates are sufficient for accurate classification. Our goal is to replicate this observation: we construct Gaussian modes in a data-driven manner from X̲ that are appropriate for use as labeled distributional coordinates for the label classification problem. We now proceed to the computational details.

## 4. Cover Trees

Cover trees were originally conceived [7] as a way to accelerate the nearest-neighbors problem. More recently, they have been used in dictionary learning [2] and for speeding up the analysis of the topology of pointclouds [13], to pick just a few applications. This section gives basic definitions for cover trees as well as a fast algorithm for their construction; our account is far more detailed than usually appears in the literature and is meant to accompany our publicly available code. Sections 5 and 6 use cover trees to define a set of label-driven distributional coordinates on a cloud collection with class labels.

For any finite pointcloud X̲ (for the moment we forget that X̲ is a union of pointclouds and simply treat it as a single pointcloud to be studied), consider a filtration

 A0 ⊂ ⋯ ⊂ Aℓ ⊂ Aℓ+1 ⊂ ⋯ ⊂ X̲.

An element a ∈ Aℓ is called an adult at level ℓ. (To reduce confusion (hopefully) by maintaining a consistent metaphor, we diverge from prior works’ notation in two places. Our adult is called a center elsewhere. Our guardian is usually called a parent. We avoid the word parent because there are many distinct tree relations embedded in this algorithm.) The set Aℓ ∖ Aℓ−1 is the cohort at level ℓ.

A cover tree builds a filtration by covering X̲ with balls of smaller and smaller radius, centered at the points in Aℓ. See Figure 3.

Specifically, a cover tree is a filtration of X̲ with the following additional properties:

1. A0 = {a0} consists of a single point, the root;

2. There is a radius r0 such that X̲ ⊂ B(a0, r0);

3. There is a real number 0 < θ < 1, called the shrinkage ratio, such that, for every ℓ, X̲ ⊂ ⋃_{a ∈ Aℓ} B(a, rℓ), where rℓ = r0 θ^ℓ.

4. For each ℓ, if a, b ∈ Aℓ are distinct, then ‖a − b‖ > rℓ.

5. For each ℓ, each point x ∈ X̲ is assigned to a guardian a ∈ Aℓ such that x lies in the ball B(a, rℓ); we say x is a child of a at level ℓ. Each adult is its own guardian and its own child.

6. There is a tree structure on the (level, adult) pairs (ℓ, a) of the filtration, where there is a tree relation (ℓ, a) → (ℓ + 1, b) if b was a child of a at level ℓ. We say b is a successor of a, and a is a predecessor of b. Note that (ℓ, a) → (ℓ + 1, a) for all a ∈ Aℓ.

### 4.1. Vocabulary

Cover trees can be generated using a fast algorithm relying on the notion of friends, first introduced in [7]. The rough algorithm is familiar to many practitioners, but is not widely published. We enhance this algorithm by giving each point a weight and a label. We describe our method here because the details of the construction are crucial to the supervised-learning task of this article.

Extending the maturation/reproduction metaphor of adults, children, and guardians above, a child x with guardian a at level ℓ is called a teen if ‖x − a‖ > ½rℓ+1, and it is called a youngin if ‖x − a‖ ≤ ½rℓ+1. See Figure 4. When changing from level ℓ to level ℓ + 1, the radius of each ball shrinks to rℓ+1 = θrℓ. Children farther than rℓ+1 from their guardians become orphans. We must decide whether these orphans should be adopted by other adults at level ℓ + 1, or if the orphans should be emancipated as new adults at level ℓ + 1. That is, the newly emancipated adults at level ℓ + 1 comprise the cohort at level ℓ + 1.

Our algorithm uses several notions of friends, set by the following bounds at level ℓ:

 (1) T1(ℓ) = (2 + θ)rℓ,  T2(ℓ) = (2 + 2θ)rℓ,  T3(ℓ) = 2(1 + θ + θ² + ⋯)rℓ = (2/(1 − θ))rℓ.

For each i ∈ {1, 2, 3}, we say that adults a and b at level ℓ are type-i friends if ‖a − b‖ ≤ Ti(ℓ). It is easy to verify the following relations:

 (2) T1(ℓ) ≤ T2(ℓ) ≤ T3(ℓ), and Ti(ℓ + 1) ≤ Ti(ℓ) for each i ∈ {1, 2, 3}.
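The bounds (1) and the relations (2) are easy to check numerically; this small sketch (function names are ours) verifies them for a sample shrinkage ratio:

```python
def thresholds(r0, theta, level):
    """Friend bounds T1, T2, T3 at a given level, following Eq. (1):
    T1 = (2 + theta) r, T2 = (2 + 2 theta) r, T3 = 2/(1 - theta) r."""
    r = r0 * theta ** level
    return (2 + theta) * r, (2 + 2 * theta) * r, 2.0 / (1 - theta) * r

r0, theta = 1.0, 0.5
for lvl in range(5):
    t1, t2, t3 = thresholds(r0, theta, lvl)
    n1, n2, n3 = thresholds(r0, theta, lvl + 1)
    assert t1 <= t2 <= t3                        # Eq. (2): nested bounds
    assert n1 <= t1 and n2 <= t2 and n3 <= t3    # bounds shrink with level
```

With θ = 1/2 the geometric series gives T3(ℓ) = 4rℓ, comfortably above T2(ℓ) = 3rℓ and T1(ℓ) = 2.5rℓ.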

Here are the essential facts about friends, which are immediate consequences of the triangle inequality:

###### Theorem 1.
1. If x is a child of a at level ℓ that is orphaned at level ℓ + 1, then x can be adopted only by either the type-1 friends of a at level ℓ or the newly-emancipated children at level ℓ + 1 of those type-1 friends. See Figure 5.

2. If we choose to re-assign children to the nearest adult, then a teen x with guardian a can be adopted by an adult b at level ℓ + 1 only if the level-ℓ predecessors of a and b are type-2 friends. See Figure 6.

3. If a and b are type-3 friends at level ℓ + 1, then their predecessors a′ and b′ are type-3 friends at level ℓ. See Figure 7.

###### Theorem 2.

Suppose b is an adult at level ℓ + 1 with predecessor a. If x has guardian b at level ℓ + 1 and guardian c at level ℓ, then c is a type-1 friend of a. See Figure 8.

Based on this scenario, we say c is an elder of b. Moreover, if b is its own predecessor, then b is its only elder, so the notion of elder is nontrivial only if b belongs to cohort ℓ + 1.

###### Theorem 3.

For the purpose of computing elders, there are three particularly good choices of the shrinkage ratio θ which require no additional computation. In these cases, the elders of b are the friends (of the respective type) that have cohort ℓ or less. See Figure 4.

These facts allow the following algorithm.

### 4.2. Algorithm

At level ℓ = 0, the cover tree has a single adult, the root a0. All points are its children. Its only friend is itself. Every x ∈ X̲ has some weight and some label, which we assume as given input. For each label λ, compute the total weight and weighted mean of all children of a0 with that label.

Suppose that at level ℓ, the friends, children, per-label weights, and per-label means are known for all adults a ∈ Aℓ. We compute these for level ℓ + 1 as follows:

• Orphan. For each a ∈ Aℓ, find its orphaned children: those x such that ‖x − a‖ > rℓ+1. Sort the orphans so that the first orphan is the one nearest the per-label mean of the label of greatest weight, the second orphan is the one nearest the per-label mean of the label of second-greatest weight, and so on, via a two-dimensional argsort without repetition. The reason for this sorting will become apparent in Section 6.

• Adopt or emancipate. For each a ∈ Aℓ, consider its orphans in order. For each x orphaned by a, check whether x can be adopted by another adult using the type-1 friends of a:

• If so, the adoptive guardian will be an adult b such that ‖x − b‖ ≤ rℓ+1; by Theorem 1, the predecessor of b is a type-1 friend of a.

• If not, then x is a new adult at level ℓ + 1. Its only child is itself. Its friends are initialized to be itself and its predecessor (but these will be updated in the Befriending stage). Note the set Aℓ+1 is ordered by cohort, within a cohort by predecessor, and within a predecessor block by proximity to the predecessor’s per-label means.

• Exchange Teens. (This step is optional for a general cover tree, but it is particularly useful for the computation of elders, which are needed in Section 6.) If x has guardian a, then check whether it is closer to some other guardian using type-2 friends. That is, find the adult b at level ℓ + 1 minimizing ‖x − b‖; by Theorem 1, the predecessor of b is a type-2 friend of the predecessor of a. Note that ‖x − a‖ ≤ ½rℓ+1 implies that no other adult can be closer to x, meaning that only teens (not youngins) can be exchanged.

• Befriend: Now, all (ℓ + 1)-adults and their children are known. We need to find new friends using the recursion relations (2) and the predecessors.

• A pair of adults, a and b, are type-3 friends at level ℓ + 1 if their predecessors a′ and b′ are type-3 friends at level ℓ and ‖a − b‖ ≤ T3(ℓ + 1).

• A pair of adults, a and b, are type-2 friends if they are type-3 friends and ‖a − b‖ ≤ T2(ℓ + 1).

• A pair of adults, a and b, are type-1 friends if they are type-2 friends and ‖a − b‖ ≤ T1(ℓ + 1).

• For each newly emancipated adult of cohort ℓ + 1, find its elders by searching its friends for elements of cohort ℓ or less.

• Weigh: For each adult at level , compute the per-label weights and per-label weighted means for its children. Sort the labels by their weights from greatest to least.

At level ℓ + 1, the friends, children, per-label weights, and per-label means are all known.
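The orphan/adopt-or-emancipate step can be sketched as follows. This toy version searches all adults by brute force, whereas the algorithm above restricts the adoption search to type-1 and type-2 friends; all names are ours:

```python
import math

def dist(p, q):
    return math.dist(p, q)

def refine(points, adults, r_next):
    """One level of cover-tree refinement (brute-force sketch). Children
    farther than r_next from every adult are emancipated as new adults;
    then every point is reassigned to its nearest adult ('exchange teens')."""
    new_adults = list(adults)
    for x in points:
        if min(dist(x, a) for a in new_adults) > r_next:
            new_adults.append(x)   # emancipate: x joins the new cohort
    # reassign children to the nearest adult
    guardians = {x: min(new_adults, key=lambda a: dist(x, a)) for x in points}
    return new_adults, guardians

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.1)]
adults, guardians = refine(points, [(0.0, 0.0)], r_next=0.5)
# (1.0, 0.0) is farther than 0.5 from every adult, so it is emancipated;
# (1.1, 0.1) is then within 0.5 of the new adult and is adopted by it.
```

After the step, every point lies within `r_next` of its guardian and distinct adults are separated by more than `r_next`, matching properties 4 and 5 of the filtration.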

### 4.3. Algorithmic Efficiency

Suppose n = |X̲|, and the cover-tree filtration stops at level ℓmax. We are interested in the storage and computational complexity of this algorithm in terms of n, for a fixed ambient dimension D.

The desired output is the successor/predecessor tree and the children/guardian relationships at each level. Thus, storage of friend lists is the only issue that might lead us to another algorithm. The main question is: given an adult a at level ℓ, how many other adults are within T3(ℓ) of a?

Since no two adults at level ℓ are within rℓ of one another, this is related to the idea of “doubling dimension.” Define D(α) to be the maximum number of points that can be placed in the ball of radius α, including the center, so that no two points have distance less than 1. This number grows with the ambient dimension D. In Euclidean space, it is independent of scale; that is, the same number works for a ball of radius αr using points of distance at least r. The number D(α) can be bounded by the number of simplices with edge length at least 1 that can be put into a ball of radius α. Since these simplices have a minimum volume √(D + 1)/(D! 2^{D/2}), we get a bound like:

 D(α) ≤ Vol(Bα) · D! · 2^{D/2}/√(D + 1) = (2π)^{D/2} α^D D!/((D/2)! √(D + 1)) ∼ O(D! 2^D α^D).

Setting α = T3(ℓ)/rℓ = 2/(1 − θ), we get an upper bound D(2/(1 − θ)) on the expected size of the friends list, which is a constant independent of the level ℓ. Thus, the total storage size is O(n); in particular, it is at most 3·D(2/(1 − θ))·n, where the 3 accounts for the lists of type-1, type-2, and type-3 friends.

Setting α = 1/θ, we get an upper bound D(1/θ) on the expected number of successors of an adult from level ℓ to ℓ + 1, which is a constant independent of the level ℓ. The size of A0 is 1. The maximum expected size of A1 is D(1/θ). The maximum expected size of A2 is D(1/θ)². Generally, the maximum expected size of Aℓ is D(1/θ)^ℓ. Moreover, the larger the size of each cohort, the shallower the cover tree. That is, ℓmax is expected to be of the order log(n).

For each adult a ∈ Aℓ, we must:

1. Sum and sort the weights of its children, by label. The expected number of children is n/|Aℓ|. Sorting the labels is O(L log L) for L labels.

2. Compute the distance (an O(D) operation) to each of its roughly n/|Aℓ| children. These distances yield the orphans and teens.

3. For each orphan, determine whether it should be adopted or emancipated. Compute the distance (an O(D) operation) to the friends or successors of the friends of a. This set has expected size constant in n.

4. For each teen, determine whether it should be exchanged. Compute the distance (an O(D) operation) to the friends or successors of the friends of a. This set has expected size constant in n.

5. For each successor—a list of expected size D(1/θ)—update its friends by computing the distance (an O(D) operation) to the successors of the friends of a. This is expected to be a constant number of distance computations.

For each adult at level ℓ, this is roughly O(n/|Aℓ|) operations, albeit with an additional penalty from the ambient dimension. Level ℓ has at most D(1/θ)^ℓ adults, in which case the total number of computations is roughly

 ∑_{ℓ=0}^{ℓmax} D(1/θ)^ℓ · (n/D(1/θ)^ℓ) · O(D²) ∼ O(n log_D n) · O(D²).

It is interesting that the rapid growth from the ambient dimension is counteracted by the logarithm term. In our experience with applications, this is effectively linear in n, as the sum terminates earlier than this estimate. This is because the estimate of ℓmax by log(n) is typically irrelevant; the maximum level of the cover tree is actually controlled by the ratio of the maximum to the minimum distances in the pointcloud. In applications, these are typically within several orders of magnitude, so ℓmax is small in practice. Moreover, in our supervised-learning application in Section 6, we abort the cover tree much earlier, once the low-entropy regions have been found.

## 5. Entropy Reduction

For each label λ, let X̲λ denote the (weighted) points in X̲ that have label λ. For any convex, compact Ω ⊂ ℝ^D, let Ωλ = Ω ∩ X̲λ. The total weight in Ω of each label is wλ = ∑_{x ∈ Ωλ} w(x), which by Section 2 corresponds to the relative likelihood of a point with label λ being sampled from Ω. Let W = ∑_λ wλ.

To quantify the relative prominence of the various labels, we use the information-theoretic notion of entropy. Entropy is characterized by several nice properties, detailed in [3]. The formula is

 (3) S(Ω) = −∑_λ (wλ/W) log_L(wλ/W) = −(1/W) ∑_λ wλ log_L(wλ) + log_L(W).

If all labels have roughly equal weight, then the entropy is close to 1. If one label has positive weight and the others have weight 0, then the entropy is 0. Generally, entropy decreases as a smaller subset of labels becomes more likely. Thus, low entropy is a good indicator that the region is particularly prominent for a small subset of labels. The question remains: how can we locate low-entropy regions?
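A minimal implementation of formula (3), with logarithms taken base L so that the uniform case has entropy exactly 1 (function and argument names are ours):

```python
import math

def entropy(weights_by_label, L=None):
    """Entropy of a region from its per-label weights w_lambda, using
    logarithms of base L (the number of labels), as in Eq. (3)."""
    L = L or len(weights_by_label)
    W = sum(weights_by_label)
    return -sum(w / W * math.log(w / W, L) for w in weights_by_label if w > 0)

assert abs(entropy([0.3, 0.3, 0.3]) - 1.0) < 1e-9   # uniform labels: entropy 1
assert entropy([0.5, 0.0]) == 0.0                   # single label: entropy 0
assert entropy([0.7, 0.1]) < entropy([0.4, 0.4])    # purer region, lower entropy
```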

Our search for low-entropy regions is motivated by the following elementary theorems; however, we do not insist that these strict regularity hypotheses hold for any given application.

###### Theorem 4.

Fix a function f that is continuous on a compact, convex set Ω. Fix a radius r. For p ∈ Ω, let c(p) denote the centroid of f on the compact, convex set B̄(p, r) ∩ Ω (where B̄ denotes the closed ball). There exists p ∈ Ω such that c(p) = p.

Proof: Note that c(p) varies continuously with p, and apply Brouwer’s fixed-point theorem ([11], II.22).

###### Theorem 5.

Suppose that f is an analytic function with a local max at p. For any ε > 0, there exists a radius r such that the centroid q of f on B̄(p, r) satisfies ‖q − p‖ < ε.

Proof: Expand f as a power series around p, and approximate f by truncating at the second-order term. The level sets of this approximation are concentric ellipsoids about p, as given by a positive-definite symmetric matrix.

###### Theorem 6.

Suppose f and g are positive, continuous functions defined on a compact, convex region Ω. Suppose that f achieves a local max at p ∈ Ω and that g ≤ f on Ω, with g(p) < f(p). Then, there is a radius r such that the average value of f on B̄(p, s) is greater than the average value of g on B̄(p, s) for all s ≤ r.

Proof: The mean value theorem for integrals.

Putting these together, we try to find a region and a ball of radius in that region whose center is near the mean of a particular label. We ask whether the entropy of that ball is non-decreasing with radius; that is, does the entropy become lower (or remain the same) when the ball shrinks? If true, this is consistent with the hypothesis that a local maximum of a particular label occurs near the center of that ball, while other labels remain roughly constant. If false, then we subdivide the region and search around the boundary of the original ball for other smaller regions of low entropy.

Specifically, the compact, convex region corresponds to the children of an adult a in the cover tree, and the sub-balls correspond to the children of the successors of a at level ℓ + 1. This is the heart of the CDER algorithm in Section 6.

## 6. Cover-tree Differencing via Entropy Reduction

The input data is a shrinkage ratio 0 < θ < 1 and a training set of pointclouds X1, …, XN with labels λ(X1), …, λ(XN), respectively. Set weights as in Section 2.

### 6.1. Region Selection

Here we describe the heart of the CDER algorithm, wherein we search the adults of the cover tree as a data-driven method to locate regions on which we want to build distributional coordinates. The construction and weighting of those distributional coordinates occurs in Section 6.2.

For each level ℓ of the cover tree, define a candidate set: a subset of the adults Aℓ that are still potential candidates for distributional coordinates. At level 0, the candidate set is all of A0. Once the candidate set is empty, we break.

The cover tree of is constructed level-by-level as in Section 4.2. At each level of the cover tree (until break), we perform the following computation:

For each adult a in the candidate set at level ℓ, let C denote the set of children of a at level ℓ, let C′ denote the set of children of a at level ℓ + 1, and let E denote the union of the children (at level ℓ) of the elders of a. (If a belongs to cohort ℓ − 1 or less, then E = C.) Note that C′ ⊆ C ⊆ E; see Figure 9. Then, we decide what to do with a:

• if the cover tree is exhausted, then break. Nothing more is needed.

• if a has only one child (itself) at level ℓ, then pass. This region is useless.

• elif a has only one child (itself) at level ℓ + 1, then pass. This region is useless.

• elif the entropy of C′ is low and does not increase as the ball shrinks, then build a distributional coordinate using the children of a of each of the dominant labels. (A label is dominant if its points represent more than its equal share of the total weight of the children. For a two-label task, this is simply the majority label.)

• elif the entropy of C′ is lower than that of C, then append a to the candidate set at level ℓ + 1. The smaller ball may contain a lower-entropy region.

• elif the entropy of the annulus C ∖ C′ is lower than that of C, then append the successors of a—other than a itself!—to the candidate set at level ℓ + 1. The annulus may contain a lower-entropy region.

• elif the entropy is high at every scale, then pass. The region is useless.

• else append the successors of a to the candidate set at level ℓ + 1. There is too little information to decide whether a contains a low-entropy region, so re-analyze everything at the next level.

Recall that our method in Section 4.2 produces the list Aℓ, which is ordered by cohort, within a cohort by predecessor, and within a predecessor block by proximity to the predecessor’s per-label means. Therefore, if a precedes b in this ordering, then a tends to be nearer the mean of the most prominent label of a larger region than b is. Hence, the distributional coordinates are sorted by “granularity,” from coarse to fine.

In some scenarios, detailed analysis of clusters is more important than efficient classification. Therefore, one can implement a “non-parsimonious” version by replacing all pass operations with append. For machine learning applications, the “parsimonious” version is far faster, as the cover tree can be stopped much earlier than the limit suggested in Section 4.3.

### 6.2. Build Distributional Coordinates

For each adult a selected to build, we construct a distributional coordinate in the following way.

For each dominant label λ among the children of a, use PCA/SVD to build a Gaussian model gλ of mass 1. Let ΔS denote the entropy difference on the children of a caused by erasing the non-dominant labels. We amplify (or attenuate) each Gaussian model by the coefficient

 mλ = wλ(1 − ΔS)(rℓ)^D.

The distributional coordinate is mλ·gλ.

One can experiment with different weighting systems; we have found these to be generally effective, and we justify this choice for the following geometric reasons:

• The weight wλ represents the relative likelihood that a point of the given label was selected from the region on which this coordinate was detected. Perhaps, by chance in our selection process, a single region was detected at a later level in the form of several smaller regions. The weights of the smaller regions will sum to the weight of the larger region.

• The entropy term 1 − ΔS is the “certainty” with which this region was selected. The entropy difference ΔS is the information that is lost if the other labels are removed from the region, so 1 − ΔS is the information that remains. In other words, this term penalizes regions according to the impurity of their labels.

• The term (rℓ)^D accounts for the relative remoteness of the sample region. All else being equal, remote regions are particularly distinctive. Because of the cover-tree construction, remoteness can be measured by the size of the cover-tree ball. For example, suppose that two different regions are detected by CDER, each with the same entropy, weight, and configuration of points; the Gaussian coordinates are translates of one another. Suppose that the first region is very distant from the rest of the pointcloud, so it is detected at an early level ℓ1. Suppose that the second region is surrounded closely by other data, so it is not detected by the cover tree until a later level ℓ2 > ℓ1. The first region’s relative volume is then greater by a factor of (rℓ1/rℓ2)^D, so its coordinate receives proportionally more mass.
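The construction of a single Gaussian coordinate can be sketched as follows: compute the dominant label's weighted mean and covariance (whose PCA/SVD yields the Gaussian's axes) together with the coefficient mλ = wλ(1 − ΔS)(rℓ)^D. Function and argument names are ours, for illustration:

```python
def gaussian_model(pts, wts, w_label, delta_S, r_level, D=2):
    """Weighted mean and covariance of a dominant label's points (the paper
    applies PCA/SVD to this covariance to get the Gaussian's principal
    axes), plus the coefficient m = w_lambda * (1 - dS) * r_ell**D."""
    W = sum(wts)
    p = [w / W for w in wts]                      # normalized point weights
    mean = tuple(sum(pi * x[d] for pi, x in zip(p, pts)) for d in range(D))
    cov = [[sum(pi * (x[a] - mean[a]) * (x[b] - mean[b])
                for pi, x in zip(p, pts)) for b in range(D)] for a in range(D)]
    m = w_label * (1.0 - delta_S) * r_level ** D  # amplification coefficient
    return mean, cov, m

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
mean, cov, m = gaussian_model(pts, [1, 1, 1, 1], w_label=0.25,
                              delta_S=0.1, r_level=0.5)
```

For this symmetric toy cloud the weighted mean is (1, 0), the covariance is isotropic, and m = 0.25 · 0.9 · 0.5² = 0.05625.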

## 7. Examples

This section illustrates the promise of CDER on several synthetic examples of varying levels of complexity that are designed to stress certain features of the algorithm.

### 7.1. Blobs

This example shows that CDER ignores high-density regions unless they have low entropy.

Consider two labels: 0/magenta and 1/green. (Labels are indicated by colors, and we have intentionally chosen colors that seem to show up non-catastrophically in black-and-white printouts. Nonetheless, we recommend printing this paper in color, or at least reading this section on a computer screen!) The cloud collection has 25 pointclouds of each label. Each magenta pointcloud consists of:

• 100 points sampled from a standard normal distribution

• 2 points sampled from each of four small-variance normal distributions whose means lie along the horizontal axis

Each green pointcloud consists of:

• 100 points sampled from the same standard normal distribution

• 2 points sampled from each of four small-variance normal distributions whose means lie at different locations along the horizontal axis

This produces the cloud collection shown in Figure 2; the output of CDER appears in Figure 10.
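A cloud collection of this shape can be sketched as follows. The specific blob means and variances used in the paper are not reproduced here; the centers and spread below are placeholder values chosen only to illustrate the structure (a dense standard-normal core plus four tiny, tight blobs per label).

```python
import numpy as np

def make_blob_cloud(rng, blob_means, blob_std):
    """One pointcloud: a 100-point standard-normal core plus four 2-point blobs."""
    core = rng.standard_normal((100, 2))
    blobs = [rng.normal(loc=m, scale=blob_std, size=(2, 2)) for m in blob_means]
    return np.vstack([core] + blobs)

rng = np.random.default_rng(0)
# Placeholder blob centers -- NOT the values from the paper.
magenta_means = [(4, 0), (0, 4), (-4, 0), (0, -4)]
green_means = [(3, 3), (-3, 3), (-3, -3), (3, -3)]
magenta_clouds = [make_blob_cloud(rng, magenta_means, 0.2) for _ in range(25)]
green_clouds = [make_blob_cloud(rng, green_means, 0.2) for _ in range(25)]
```

Each cloud then has 108 points, and the low-entropy signal lives entirely in the eight blob points.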

In a typical run, the algorithm terminates at cover-tree level 8, even though the complete cover tree of ends at level 14. A total of 26 distributional coordinates are produced, and their weights vary by about 3 orders of magnitude. Table 1 shows how conservative the entropy-reduction process is at selecting regions to generate distributional coordinates.

To judge accuracy on testing data in a supervised-learning scenario, we need a method for labeling a test pointcloud using the distributional coordinates. Each test pointcloud is mapped to a point in . Many sophisticated methods are possible, but for simplicity we ask: for a given pointcloud , which is bigger, the Euclidean norm of the magenta Gaussian coordinates evaluated on ,

$$\sqrt{\sum_{g\ \text{magenta}} \left(\int_{X_{\text{test}}} g\right)^{2}}$$

or the Euclidean norm of the green Gaussian coordinates evaluated on ,

$$\sqrt{\sum_{g\ \text{green}} \left(\int_{X_{\text{test}}} g\right)^{2}}\,?$$

With this simple comparison, the algorithm achieves 100% accuracy in a 5-fold cross-validation of this cloud collection with an 80/20 training/testing split. More precisely, for each run of the cross-validation, we take 80 percent of the pointclouds, turn them into a cloud collection, build distributional coordinates from that collection, and then test the results on the remaining 20 percent. This entire procedure is repeated five times.
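The comparison rule can be sketched as follows, assuming each distributional coordinate is stored as a (mean, covariance, weight) triple and approximating the integral of a Gaussian coordinate over a test cloud by summing its weighted density over the cloud's points. This is a stand-in for CDER's actual evaluation, not the authors' implementation.

```python
import numpy as np

def gaussian_integral(X, mean, cov, weight):
    """Approximate weight * integral of the Gaussian g over the pointcloud X
    by summing the Gaussian density over the points of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = weight / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # Quadratic form per row of diff, then the density sum.
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)).sum()

def predict_label(X, coords_by_label):
    """Assign the label whose Gaussian coordinates have the larger Euclidean norm on X."""
    def score(coords):
        return np.sqrt(sum(gaussian_integral(X, m, c, w) ** 2 for m, c, w in coords))
    return max(coords_by_label, key=lambda label: score(coords_by_label[label]))
```

A test cloud concentrated near one label's coordinates scores much higher for that label, which is exactly the comparison of the two square-root expressions above.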

Moreover, the relative masses of the 28 distributional coordinates vary over four orders of magnitude, so for this sort of comparison one could discard many of them in the interest of speed while preserving accuracy. Note that these distributions are contrived to have the same mass (0th moment), mean (1st moment), and variance (2nd moment); elementary statistical tests would not distinguish them, and higher-moment or skewness tests would be necessary.

### 7.2. Blocks

This example shows that CDER is fairly robust against background noise that prevents strict separation. It also demonstrates that smoothness of the underlying distributions is not necessary for good results.

Consider two labels: 0/magenta and 1/green. The cloud collection consists of 100 magenta pointclouds and 100 green pointclouds. Each magenta pointcloud is generated by sampling 30 points uniformly in the unit square, as well as 2 extra points sampled uniformly in a square at , and 2 extra points sampled uniformly in a square at . Each green pointcloud is generated by sampling 30 points uniformly in the unit square, as well as 2 extra points sampled uniformly in a square at , and 2 extra points sampled uniformly in a square at . See Figures 1 and 11.

Using the same simple comparison as in Section 7.1, the algorithm achieves 88% accuracy despite the high background noise.

### 7.3. Deep Field

This example demonstrates that Gaussian mixtures are (nearly) “fixed points” of CDER, and that the algorithm can handle unequal sample sizes via the weighting system in Section 2.

Consider two labels: 0/magenta and 1/green. The underlying distribution is built from 50 Gaussian distributions. For each of these, we choose a random label; a mean point, chosen uniformly on the square; a pair of orthonormal covariances, chosen uniformly from to ; a random rotation angle; and an amplification factor (that is, a sample size), chosen uniformly between 50 and 5000. For a particular random seed, we arrive at combined density functions and , as in Figure 14.

Then, for each label, we produce between 20 and 40 pointclouds, each with between 50 and 500 points. For a particular random seed, we arrive at the cloud collection in Figure 12.

The CDER algorithm builds a Gaussian mixture model focused on regions of low entropy (Figure 13), so it should not be surprising that it builds Gaussians near the original Gaussians, as in Figure 14.

### 7.4. Three Labels

For simplicity, the previous examples have involved only two labels. In a two-label system, a low-entropy region has exactly one dominant label. However, the algorithm is sensible for any number of labels, and it may be that the low-entropy regions are dominated by multiple labels. Hence, an ensemble of regions may be necessary to distinguish pointclouds.
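For any number of labels, the entropy of a region can be computed as the Shannon entropy of its label-mass distribution; a minimal sketch (the function name is ours, not CDER's):

```python
import math

def label_entropy(masses):
    """Shannon entropy (in bits) of a region's label-mass distribution.
    Returns 0 when a single label dominates completely, and log2(k)
    when all k labels are equally represented."""
    total = sum(masses)
    probs = [m / total for m in masses if m > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A region entirely owned by one label has entropy 0; a region shared equally by two of three labels has entropy 1 bit, still well below the maximum log2(3), which is why an ensemble of such regions can still distinguish the labels.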

Consider three labels: 0/magenta, 1/green, and 2/orange. Let be a triple of standard normal distributions: one each at , , and . Let be a triple of standard normal distributions: one each at , , and . Let be a triple of standard normal distributions: one each at , , and . See Figure 15.

The CDER algorithm detects these shared regions perfectly, achieving 100% accuracy in a 5-fold cross-validation of this cloud collection with an 80/20 training/testing split, using the same simple comparison method as in Section 7.1.

## 8. Stability

Ideally, one would like to prove that CDER is stable or backward-stable, in the sense of [14, Lecture 14]. To formulate stability or backward-stability, CDER must be expressed as a numerical approximation of a formal function between normed vector spaces.

Let denote the vector space of piecewise-continuous functions on our compact domain. Define a norm on by , where each term is the norm on our (compact) domain. Given two “input” functions , we consider them as density functions, one for each of two labels. Consider the pointwise entropy,

$$H(x) = -\frac{f_1(x)}{f_1(x)+f_2(x)}\log_2\!\left(\frac{f_1(x)}{f_1(x)+f_2(x)}\right) - \frac{f_2(x)}{f_1(x)+f_2(x)}\log_2\!\left(\frac{f_2(x)}{f_1(x)+f_2(x)}\right).$$

Let denote the formal function defined by , where

$$g_1(x)=\begin{cases} H(x)\,f_1(x), & \text{if } f_1(x)>f_2(x),\\ 0, & \text{otherwise,}\end{cases} \tag{4}$$

and

$$g_2(x)=\begin{cases} H(x)\,f_2(x), & \text{if } f_2(x)>f_1(x),\\ 0, & \text{otherwise.}\end{cases} \tag{5}$$

A numerical algorithm for can be obtained roughly as follows: sample equally from and , then apply CDER to generate weighted distributional coordinates; the weighted sum of the distributional coordinates for label 1 approximates , and that for label 2 approximates . We have not yet succeeded in proving either stability statement for CDER, but the high cross-validation accuracy of the examples above is promising; establishing such a stability result is a subject of future work.
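The pointwise maps above can be sketched numerically on a grid. This is an illustration of the formal definitions (4) and (5) only, not of CDER itself; the helper names are ours, and the convention 0·log 0 = 0 is used where both densities vanish.

```python
import numpy as np

def pointwise_entropy(f1, f2):
    """Binary entropy H(x) of the normalized pair (f1, f2), elementwise,
    with the convention 0 * log2(0) = 0."""
    total = f1 + f2
    with np.errstate(divide='ignore', invalid='ignore'):
        p1 = np.where(total > 0, f1 / total, 0.0)
        p2 = np.where(total > 0, f2 / total, 0.0)
        h = (-np.where(p1 > 0, p1 * np.log2(p1), 0.0)
             - np.where(p2 > 0, p2 * np.log2(p2), 0.0))
    return h

def g_pair(f1, f2):
    """g1 = H * f1 where f1 dominates; g2 = H * f2 where f2 dominates,
    as in equations (4) and (5)."""
    H = pointwise_entropy(f1, f2)
    g1 = np.where(f1 > f2, H * f1, 0.0)
    g2 = np.where(f2 > f1, H * f2, 0.0)
    return g1, g2
```

Note that at each point exactly one of g1, g2 is nonzero (both vanish on ties), mirroring how each low-entropy region is assigned to its dominant label.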

## 9. Discussion

This paper introduced CDER, a data-driven, label-driven feature-extraction method for collections of labeled pointclouds. CDER is fast, both in terms of theoretical complexity and in initial tests on examples. It does not require the user to choose any tuning parameters, and the geometric meaning of its output features is transparently clear. This section outlines some future directions and makes a few generalizing remarks.

The attentive reader of Section 4 may notice that cover trees can be defined in an arbitrary metric space, and indeed this is how they were originally defined [7]. We use them to construct distributional coordinates, and our algorithm for doing so (Section 6) demands that we be able to quickly compute means of sets of points. While the Fréchet mean ([5], IX) can be defined in any metric space, there are not always fast algorithms for its computation (and it need not exist in all cases). So while more generality could be achieved in the definition of CDER, we make our complexity statements only for cloud collections in a common Euclidean space.

All examples in Section 7 were artificial, intended to help visualize the novel and fairly technical CDER algorithm and to emphasize its key properties. Future work will involve applications of CDER to real data from a variety of fields, including large administrative datasets of interest to social scientists, and to vehicle tracking [4].

We also hope that CDER will prove useful as a feature-extraction method in topological data analysis [6], since persistence diagrams can be thought of as point clouds in the plane. Future work will compare the performance of CDER against other such feature-extraction methods (for example, [12] and [1]).

Finally, we recall the weighting discussion in Section 2, where we used the simplifying assumptions that each color/label is equally likely and that each point within a single pointcloud should be given equal weight. We note that CDER can easily be adapted to accommodate other prior assumptions about the relative likelihoods of labels, or even prior assumptions about the outlier status of certain points in a cloud, say as part of a Bayesian learning process.

## References

• [1] A. Adcock, E. Carlsson, and G. Carlsson. The ring of algebraic functions on persistence bar codes. ArXiv e-prints, 2013.
• [2] William K. Allard, Guangliang Chen, and Mauro Maggioni. Multi-scale geometric methods for data sets ii: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 32(3):435 – 462, 2012.
• [3] John C. Baez, Tobias Fritz, and Tom Leinster. A characterization of entropy in terms of information loss. Entropy, 13(11):1945–1957, 2011.
• [4] Paul Bendich, Sang Chin, Jesse Clarke, John deSena, John Harer, Elizabeth Munch, Andrew Newman, David Porter, David Rouse, Nate Strawn, and Adam Watkins. Topological and statistical behavior classifiers for tracking applications. IEEE Trans. on Aero. and Elec. Sys., 52(6):2644–2661, 2016.
• [5] D. Burago, I.U.D. Burago, and S. Ivanov. A Course in Metric Geometry. Crm Proceedings & Lecture Notes. American Mathematical Society, 2001.
• [6] Herbert Edelsbrunner and John Harer. Persistent Homology — a Survey. Contemporary Mathematics, 0000:1–26.
• [7] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148–1184, 2006.
• [8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
• [9] Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov. Geometry Helps to Compare Persistence Diagrams, pages 103–112. 2016.
• [10] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–86, 03 1951.
• [11] James R. Munkres. Elements of Algebraic Topology. Addison Wesley, 1993.
• [12] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4741–4748, 2015.
• [13] Donald R. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.
• [14] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.