Nonparametric ridge estimation

12/20/2012 ∙ by Christopher R. Genovese, et al. ∙ Carnegie Mellon University ∙ Sapienza University of Rome

We study the problem of estimating the ridges of a density function. Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density. It can also be used to find hidden structure in point cloud data. We show that, under mild regularity conditions, the ridges of the kernel density estimator consistently estimate the ridges of the true density. When the data are noisy measurements of a manifold, we show that the ridges are close and topologically similar to the hidden manifold. To find the estimated ridges in practice, we adapt the modified mean-shift algorithm proposed by Ozertem and Erdogmus [J. Mach. Learn. Res. 12 (2011) 1249-1286]. Some numerical experiments verify that the algorithm is accurate.

1 Introduction

Multivariate data in many problems exhibit intrinsic lower dimensional structure. The existence of such structure is of great interest for dimension reduction, clustering and improved statistical inference, and the question of how to identify and characterize this structure is the focus of active research. A commonly used representation for low-dimensional structure is a smooth manifold. Unfortunately, estimating manifolds can be difficult even under mild assumptions. For instance, the rate of convergence for estimating a manifold with bounded curvature, blurred by homogeneous Gaussian noise, is logarithmic [Genovese et al. (2012a)], meaning that an exponential amount of data is needed to attain a specified level of accuracy. In this paper, we offer a way to circumvent this problem. We define an object, which we call a hyper-ridge set, that can be used to approximate the low-dimensional structure in a data set. We show that the hyper-ridge set captures the essential features of the underlying low-dimensional structure while being estimable from data at a polynomial rate.

Let $X_1, \ldots, X_n$ be a sample from a probability density $p$ defined on an open subset of $D$-dimensional Euclidean space, and let $\hat{p}$ be an estimate of the density. We will define hyper-ridge sets (called ridges for short) for both $p$ and $\hat{p}$, which we denote by $R = R(p)$ and $\hat{R} = R(\hat{p})$. We consider two cases that make different assumptions about $p$. In the hidden manifold case (see Figure 1), we assume that the density $p$ is derived by sampling from a $d$-dimensional manifold $M$ and adding $D$-dimensional noise. In the density ridge case, we look for ridges of a density $p$ without assuming any hidden manifold, simply as a way of finding structure in a point cloud, much like clustering. The goal in both cases is to estimate the hyper-ridge set. Although in the former case we would ideally like to estimate $M$, this is not always feasible for reasonable sample sizes, so we use the ridge $R$ as a surrogate for $M$. We focus on estimating ridges from point cloud data; we do not consider image data in this paper.

Figure 1: Synthetic data showing lower dimensional structure. The left plot is an example of the hidden manifold case. The right plot is an example of a hidden set consisting of intersecting manifolds.

A formal definition of a ridge is given in Section 2. Let $0 \le d < D$ be fixed. Loosely speaking, we define a $d$-dimensional hyper-ridge set of a density $p$ to be the set of points where the Hessian of $p$ has $D - d$ strongly negative eigenvalues and where the projection of the gradient onto the corresponding eigenspace is zero. Put another way, a ridge point is a local maximizer of the density when moving in the normal directions defined by the Hessian.

Yet another way to think about ridges is by analogy with modes. We can define a mode to be a point where the gradient is 0 and the second derivative matrix is negative definite, that is, all the eigenvalues of the Hessian are negative. The Hessian defines a $(D-d)$-dimensional normal space (corresponding to the $D - d$ smallest eigenvalues) and a $d$-dimensional tangent space. A ridge point has a projected gradient (the component of the gradient in the normal space) that is 0 and eigenvalues in the normal space that are negative. Modes are simply $0$-dimensional ridges.
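In symbols, anticipating the notation of Section 2 (where $g = \nabla p$ is the gradient, $H$ the Hessian with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_D$, and $V(x)$ the matrix of eigenvectors corresponding to the $D - d$ smallest eigenvalues):

$$\text{mode: } g(x) = 0 \text{ and } \lambda_1(x) < 0; \qquad d\text{-dimensional ridge: } V(x)^T g(x) = 0 \text{ and } \lambda_{d+1}(x) < 0.$$

When $d = 0$, $V(x)$ contains all $D$ eigenvectors, so the two conditions coincide.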

Figure 2: An example of a one-dimensional ridge defined by a two-dimensional density $p$. The ridge $R$ is a circle in the plane. The solid curve is the ridge, lifted onto $p$, that is, the set $\{(r, p(r)) : r \in R\}$.

A stylized example is shown in Figure 2. In this example, the density is $p = w \star \Phi_\sigma$, where $M$ is a circle in $\mathbb{R}^2$, $w$ is a smooth (but nonuniform) density supported on $M$, $\star$ denotes convolution, and $\Phi_\sigma$ is a two-dimensional Gaussian with a variance $\sigma^2$ that is much smaller than the radius of the circle. The ridge $R$ is a one-dimensional subset of $\mathbb{R}^2$. The figure has a solid curve to show the ridge lifted onto $p$, that is, the curve shows the set $\{(r, p(r)) : r \in R\}$. The ridge does not coincide exactly with $M$ due to the blurring by convolution with the Gaussian. In fact, $R$ is a circle with slightly smaller radius than $M$. That is, $R$ is a biased version of $M$. Figure 3 shows $M$, $R$ and $\hat{R}$.

Note that the density is not uniform over the ridge. Indeed, there can be modes ($0$-dimensional ridges) within a ridge. What matters is that the function $p$ rises sharply as we approach the ridge (strongly negative eigenvalues).

One of the main points of this paper is that $R$ captures the essential features of $M$. If we can live with the slight bias in $R$, then it is better to estimate $R$, since $R$ can be estimated at a polynomial rate while $M$ can only be estimated at a logarithmic rate. Throughout this paper, we take the dimension of interest $d$ as fixed and given.

Figure 3: The outer circle denotes the manifold $M$. The dashed circle is the ridge $R$ of the density $p$. The ridge is a biased version of $M$ and acts as a surrogate for $M$. The inner circle shows the ridge $\hat{R}$ from a density estimator with bandwidth $h$. $R$ can be estimated at a much faster rate than $M$.

Many different and useful definitions of a “ridge” have been proposed; see the discussion of related work at the end of this section. We make no claim as to the uniqueness or optimality of ours. Our definition is motivated by four useful properties that we demonstrate in this paper:

1. If $\hat{p}$ is close to $p$, then $\hat{R}$ is close to $R$, where $\hat{R}$ is the ridge of $\hat{p}$ and $R$ is the ridge of $p$.

2. If the data-generating distribution is concentrated near a manifold $M$, then the ridge $R$ approximates $M$ both geometrically and topologically.

3. $R$ can be estimated at a polynomial rate, even in cases where $M$ can be estimated at only a logarithmic rate.

4. The definition corresponds essentially with the algorithm derived by Ozertem and Erdogmus (2011). That is, our definition provides a mathematical formalization of their algorithm.

Our broad goal is to provide a theoretical framework for understanding the problem of estimating hyper-ridge sets. In particular, we show that the ridges of a kernel density estimator consistently estimate the ridges of the density, and we find an upper bound on the rate of convergence. The main results of this paper are (stated here informally):

  • Stability (Theorem 4). If two densities are sufficiently close together, their hyper-ridge sets are also close together.

  • Estimation (Theorem 5). There is an estimator $\hat{R}$ such that

    $$\operatorname{Haus}(R, \hat{R}) = O_P\left(\left(\frac{\log n}{n}\right)^{2/(D+8)}\right), \qquad (1)$$

    where $\operatorname{Haus}$ is the Hausdorff distance, defined in equation (9). Moreover, $\hat{R}$ is topologically similar to $R$ in the sense that small dilations of these sets are topologically similar.

  • Surrogate (Theorem 7). In the hidden manifold case with small noise variance $\sigma^2$ and assuming $M$ has no boundary, the hyper-ridge set $R$ of the density satisfies

    $$\operatorname{Haus}(M, R) = O\bigl(\sigma^2 \log(1/\sigma)\bigr) \qquad (2)$$

    and $R$ is topologically similar to $M$. Hence, when the noise is small, the ridge is close to $M$. Note that we treat $M$ as fixed while $\sigma \to 0$. It then follows that

    $$\operatorname{Haus}(M, \hat{R}) = O_P\left(\left(\frac{\log n}{n}\right)^{2/(D+8)}\right) + O\bigl(\sigma^2 \log(1/\sigma)\bigr). \qquad (3)$$
This leaves open the question of how to locate the ridges of the density estimator. Fortunately, this latter problem has recently been solved by Ozertem and Erdogmus (2011) who derived a practical algorithm called the subspace constrained mean shift (SCMS) algorithm for locating the ridges. Ozertem and Erdogmus (2011) derived their method assuming that the underlying density function is known (i.e., they did not discuss the effect of estimation error). We, instead, assume the density is estimated from a finite sample and adapt their algorithm accordingly by including a denoising step in which we discard points with low density. This paper provides a statistical justification for, and extension to, their algorithm. We introduce a modification of their algorithm called SuRF (Subspace Ridge Finder) that applies density estimation, followed by denoising, followed by SCMS.
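To make the pipeline concrete, here is a minimal sketch of SuRF in Python: a Gaussian kernel density estimator, a denoising step that drops points in low-density regions, and SCMS iterations that move each surviving point along the projected gradient until convergence. The bandwidth h, the density-quantile cutoff tau, and the circle-shaped test data are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def kde_parts(x, data, h):
    """Kernel weights, gradient and Hessian of a Gaussian KDE at x,
    each up to the same positive factor, which does not affect SCMS."""
    u = data - x                                    # (n, D) displacements
    w = np.exp(-0.5 * np.sum(u**2, axis=1) / h**2)  # kernel weights
    grad = (w[:, None] * u).sum(axis=0) / h**2      # proportional to the gradient
    hess = (np.einsum('i,ij,ik->jk', w, u, u) / h**2
            - w.sum() * np.eye(x.size)) / h**2      # proportional to the Hessian
    return w, grad, hess

def surf(data, d=1, h=0.25, tau=0.1, max_iter=300, tol=1e-6):
    """SuRF sketch: denoise (drop low-density points), then run SCMS."""
    dens = np.array([kde_parts(x, data, h)[0].sum() for x in data])
    ridge = data[dens >= np.quantile(dens, tau)].copy()
    for k in range(len(ridge)):
        x = ridge[k]
        for _ in range(max_iter):
            w, g, H = kde_parts(x, data, h)
            m = (w[:, None] * data).sum(axis=0) / w.sum() - x  # mean-shift vector
            evals, evecs = np.linalg.eigh(H)     # eigenvalues in ascending order
            V = evecs[:, :x.size - d]            # normal space: D - d smallest
            step = V @ (V.T @ m)                 # constrain the step to the normal space
            x = x + step
            if np.linalg.norm(step) < tol:
                break
        ridge[k] = x
    return ridge

# Illustration on a noisy circle (a hidden 1-manifold in R^2).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
data = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * rng.normal(size=(500, 2))
ridge_pts = surf(data)   # points pushed onto the estimated circular ridge
```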

Related work. Zero-dimensional ridges are modes, and in this case ridge finding reduces to mode estimation and SCMS reduces to the mean shift clustering algorithm [Fukunaga and Hostetler (1975), Cheng (1995), Li, Ray and Lindsay (2007), Chacón (2012)].

If the hidden structure is a manifold, then the process of finding the structure is known as manifold estimation or manifold learning. There is a large literature on manifold estimation and related techniques. Some useful references are Niyogi, Smale and Weinberger (2008), Caillerie et al. (2011), Genovese et al. (2009, 2012a, 2012b, 2012c), Tenenbaum, de Silva and Langford (2000), Roweis and Saul (2000) and references therein.

The notion of ridge finding spans many fields. Previous work on ridge finding in the statistics literature includes Cheng, Hall and Hartigan (2004), Hall, Peng and Rau (2001), Wegman and Luo (2002), Wegman, Carr and Luo (1993) and Hall, Qian and Titterington (1992). These papers focus on visualization and exploratory analysis. An issue that has been discussed extensively in the applied math and computer science literature is how to define a ridge. A detailed history and taxonomy is given in the text by Eberly (1996). Two important classes of ridges are watershed ridges, which are global in nature, and height ridges, which are locally defined. There is some debate about the virtues of various definitions. See, for example, Norgard and Bremer (2012), Peikert, Günther and Weinkauf (2012). Related definitions also appear in the fluid dynamics literature [Schindler et al. (2012)] and astronomy [Aragón-Calvo et al. (2010), Sousbie et al. (2008)]. There is also a literature on Reeb graphs [Ge et al. (2011)] and metric graphs [Aanjaneya et al. (2012), Lecci, Rinaldo and Wasserman (2013)]. Metric graph methods are ideal for representing intersecting filamentary structure but are much more sensitive to noise than the methods in this paper. It is not our intent in this paper to argue that one particular definition of ridge is optimal for all purposes. Rather, we use a particular definition which is well suited for studying the statistical estimation of ridges.

More generally, there is a vast literature on hunting for structure in point clouds and analyzing the shapes of densities. Without attempting to be exhaustive, some representative work includes Davenport et al. (2010), Klemelä (2009), Adams, Atanasov and Carlsson (2011), Chazal et al. (2011), Bendich, Wang and Mukherjee (2012).

Throughout the paper, we use symbols like $C, C_0, C_1, c, c_0, c_1, \ldots$ to denote generic positive constants whose value may be different in different expressions.

2 Model and ridges

In this section, we describe our assumptions about the data and give a formal definition of hyper-ridge sets, which we call ridges from now on. Further properties of ridges are stated and proved in Section 4.

We start with a point cloud $X_1, \ldots, X_n \subset \mathbb{R}^D$. We assume that these data comprise a random sample from a distribution $P$ with density $p$, where $p$ has at least five bounded, continuous derivatives. This is all we assume for the density ridge case. In the hidden manifold case, we assume further that $P$ and $p$ are derived from a $d$-dimensional manifold $M$ by convolution with a noise distribution, where $0 \le d < D$. Specifically, we assume that $M$ is embedded within a compact subset $\mathcal{K} \subset \mathbb{R}^D$ and that

$$P = (1 - \eta)\,\mathrm{Unif}(\mathcal{K}) + \eta\,(W \star \Phi_\sigma), \qquad (4)$$

where $0 \le \eta \le 1$, $\mathrm{Unif}(\mathcal{K})$ is a uniform distribution on $\mathcal{K}$, $\star$ denotes convolution, $W$ is a distribution supported on $M$, and $\Phi_\sigma$ is a Gaussian distribution on $\mathbb{R}^D$ with zero mean and covariance $\sigma^2 I$. While we could consider a more general noise distribution in (4), we focus on the common assumption of Gaussian noise. In that case, a hidden manifold can only be estimated at a logarithmic rate [Genovese et al. (2012b)], so ridge estimators are particularly valuable. (Even when $M$ can be estimated at a polynomial rate, ridge estimation is often easier in practice than estimating the manifold, which would involve deconvolution.)

The data generating process under model (4) is equivalent to the following steps:

1. Draw $B$ from a Bernoulli($\eta$) distribution.

2. If $B = 0$, draw $X$ from a uniform distribution on $\mathcal{K}$.

3. If $B = 1$, let $X = Z + \sigma\epsilon$, where $Z$ is drawn from $W$ and $\epsilon \sim N(0, I)$ is additional noise.

Points drawn from $\mathrm{Unif}(\mathcal{K})$ represent background clutter. Points drawn from $W \star \Phi_\sigma$ are noisy observations from $M$. When $M$ consists of a finite set of points, this can be thought of as a clustering model. A small simulation of these steps is sketched below.
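As an illustration (not from the paper), the following Python snippet simulates the three steps for a circular manifold $M \subset \mathbb{R}^2$; the choices $\mathcal{K} = [-2, 2]^2$, $\eta = 0.9$ and $\sigma = 0.1$ are assumptions made for the example.

```python
import numpy as np

def sample_model(n, eta=0.9, sigma=0.1, seed=None):
    """Simulate model (4): clutter uniform on K = [-2, 2]^2 with prob. 1 - eta,
    otherwise a draw from the circle M plus N(0, sigma^2 I) noise."""
    rng = np.random.default_rng(seed)
    B = rng.random(n) < eta                        # step 1: Bernoulli(eta)
    X = rng.uniform(-2, 2, size=(n, 2))            # step 2: background clutter on K
    theta = rng.uniform(0, 2 * np.pi, size=n)      # a draw from W on M
    Z = np.c_[np.cos(theta), np.sin(theta)]        # points on the circle M
    X[B] = Z[B] + sigma * rng.standard_normal((n, 2))[B]  # step 3: Z + sigma * eps
    return X

data = sample_model(1000, seed=0)
```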

2.1 Definition of ridges

As in Ozertem and Erdogmus (2011), our definition of ridges relies on the gradient and Hessian of the density function $p$. Recall that $0 \le d < D$ is fixed throughout. Given a function $p: \mathbb{R}^D \to \mathbb{R}$, let $g(x) = \nabla p(x)$ denote its gradient and $H(x)$ its Hessian matrix at $x$. Let

$$\lambda_1(x) \ge \lambda_2(x) \ge \cdots \ge \lambda_D(x) \qquad (5)$$

denote the eigenvalues of $H(x)$ and let $\Lambda(x)$ be the diagonal matrix whose diagonal elements are the eigenvalues. Write the spectral decomposition of $H(x)$ as $H(x) = U(x)\Lambda(x)U(x)^T$. Let $V(x)$ be the last $D - d$ columns of $U(x)$ (i.e., the columns corresponding to the $D - d$ smallest eigenvalues). If we write $U(x) = [V_\diamond(x) : V(x)]$, then $V_\diamond(x)$ contains the first $d$ columns of $U(x)$. Let $L(x) = V(x)V(x)^T$ be the projector onto the linear space defined by the columns of $V(x)$. We call this the local normal space, and the space spanned by the columns of $V_\diamond(x)$ is the local tangent space. Define the projected gradient

$$G(x) = L(x)\,g(x). \qquad (6)$$

If the vector field $G$ is Lipschitz, then by Theorem 3.39 of Irwin (1980), $G$ defines a global flow as follows. The flow is a family of functions $\phi(x, t)$ such that $\phi(x, 0) = x$, $\frac{\partial}{\partial t}\phi(x, t) = G(\phi(x, t))$ and $\phi(x, s + t) = \phi(\phi(x, s), t)$. The flow lines, or integral curves, partition the space (see Lemma 2), and at each $x$ where $G(x)$ is nonnull, there is a unique integral curve passing through $x$. Thus, there is one and only one flow line through each nonridge point. The intuition is that the flow passing through $x$ is a gradient ascent path moving toward higher values of $p$. Unlike the paths defined by the gradient, which move toward modes, the paths defined by the projected gradient move toward ridges. The SCMS algorithm, which we describe later, can be thought of as approximating the flow with discrete, linear steps. [A proof that the linear interpolation of these points approximates the flow in the case $d = 0$ is given in Arias-Castro, Mason and Pelletier (2013).]

A map $\pi: \mathbb{R} \to \mathbb{R}^D$ is an integral curve with respect to the flow of $G$ if

$$\pi'(t) = G(\pi(t)). \qquad (7)$$

Definition: The ridge of dimension $d$ is given by $R = \{x : \|G(x)\| = 0,\ \lambda_{d+1}(x) < 0\}$.

Note that the ridge consists of the destinations of the integral curves: $x \in R$ if $\lim_{t \to \infty} \pi(t) = x$ for some $\pi$ satisfying (7).
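As a quick numerical illustration of the definition (our example, not the paper's), take the radially symmetric bump $p(x) = \exp\{-(\|x\| - 1)^2/(2 \cdot 0.2^2)\}$ in $\mathbb{R}^2$, whose $d = 1$ ridge is the unit circle. The sketch below evaluates $G(x) = V(x)V(x)^T g(x)$ by finite differences; $\|G\|$ vanishes on the circle but not at a nearby off-ridge point.

```python
import numpy as np

def p(x, r0=1.0, s=0.2):
    """Radially symmetric bump: its 1-dimensional ridge is the circle ||x|| = r0."""
    r = np.linalg.norm(x)
    return np.exp(-(r - r0)**2 / (2 * s**2))

def grad_hess(f, x, eps=1e-4):
    """Gradient and Hessian of f at x by central finite differences."""
    D = x.size
    g, H = np.zeros(D), np.zeros((D, D))
    I = np.eye(D) * eps
    for i in range(D):
        g[i] = (f(x + I[i]) - f(x - I[i])) / (2 * eps)
        for j in range(D):
            H[i, j] = (f(x + I[i] + I[j]) - f(x + I[i] - I[j])
                       - f(x - I[i] + I[j]) + f(x - I[i] - I[j])) / (4 * eps**2)
    return g, H

def projected_gradient(f, x, d=1):
    """G(x) = L(x) g(x) with L = V V^T, V spanning the local normal space."""
    g, H = grad_hess(f, x)
    _, U = np.linalg.eigh(H)       # eigenvalues in ascending order
    V = U[:, :x.size - d]          # D - d smallest eigenvalues -> normal space
    return V @ (V.T @ g)

print(np.linalg.norm(projected_gradient(p, np.array([1.0, 0.0]))))  # ~ 0 (on ridge)
print(np.linalg.norm(projected_gradient(p, np.array([1.1, 0.0]))))  # ~ 2.2 (off ridge)
```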

Our definition is motivated by Ozertem and Erdogmus (2011) but is slightly different. They first define the $d$-critical points as those $x$ for which $G(x) = 0$. They call a critical point regular if it is $d$-critical but not $(d-1)$-critical. Thus, a mode within a one-dimensional ridge is not regular. A regular point with $\lambda_{d+1}(x) < 0$ is called a principal point. According to our definition, the ridge lies between the critical set and the principal set. Thus, if a mode lies on a one-dimensional ridge, we include that point as part of the ridge.

2.2 Assumptions

We now record the main assumptions about the ridges that we will require for the results.

Assumption (A0) differentiability. For all $x$, $g(x)$, $H(x)$ and $H'(x)$ exist.

Assumption (A1) eigengap. Let $B_D(x, r)$ denote a $D$-dimensional ball of radius $r$ centered at $x$, and let $R \oplus \delta = \bigcup_{x \in R} B_D(x, \delta)$. We assume that there exist $\beta > 0$ and $\delta > 0$ such that, for all $x \in R \oplus \delta$, $\lambda_{d+1}(x) \le -\beta$ and $\lambda_d(x) - \lambda_{d+1}(x) \ge \beta$.

Assumption (A2) path smoothness. For each $x \in R \oplus \delta$,

$$\|g(x)\|\,\|H'(x)\|_{\max} < \frac{\beta^2}{2\sqrt{D}}, \qquad \text{(A2)}$$

where $H'(x)$ is the matrix of third derivatives defined in Section 3.3, $\|\cdot\|_{\max}$ is the max norm, and $\beta$ and $\delta$ are the constants in (A1).

Condition (A1) says that $p$ is sharply curved around the ridge in the $(D-d)$-dimensional space normal to the ridge. To give more intuition about the condition, consider the problem of estimating a mode in one dimension. At a mode $\theta$, we have that $p'(\theta) = 0$ and $p''(\theta) < 0$. However, the mode cannot be uniformly consistently estimated by only requiring the second derivative to be negative, since $p''(\theta)$ could be arbitrarily close to 0. Instead, one needs to assume that $p''(\theta) \le -\beta$ for some positive constant $\beta$. Condition (A1) may be thought of as the analogous condition for a ridge. (A2) is a third-derivative condition which implies that the paths cannot be too wiggly. (A2) also constrains the gradient from being too steep in the perpendicular direction. Note that these conditions are local: they hold in a size-$\delta$ neighborhood around the ridge.

3 Technical background

Now we review some background. We recommend that the reader quickly skim this section and then refer back to it as needed.

3.1 Distance function and Hausdorff distance

We let $B(x, r)$ denote a $D$-dimensional open ball centered at $x$ with radius $r$. If $A$ is a set and $x$ is a point, then we define the distance function

$$d(x, A) = \inf_{y \in A} \|x - y\|, \qquad (8)$$

where $\|\cdot\|$ is the Euclidean norm. Given two sets $A$ and $B$, the Hausdorff distance between $A$ and $B$ is

$$\operatorname{Haus}(A, B) = \inf\{\epsilon : A \subset B \oplus \epsilon \text{ and } B \subset A \oplus \epsilon\}, \qquad (9)$$

where

$$A \oplus \epsilon = \bigcup_{x \in A} B(x, \epsilon) \qquad (10)$$

is called the $\epsilon$-dilation of $A$. The dilation can be thought of as a smoothed version of $A$. For example, if there are any small holes in $A$, these will be filled in by forming the dilation $A \oplus \epsilon$.

We use Hausdorff distance to measure the distance between sets for several reasons: it is the most commonly used distance between sets, it is a very strict distance, and it is the analog for sets of the familiar $L_\infty$ distance between functions.
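For finite point clouds, the infimum in (9) reduces to a max–min computation over pairwise distances. A minimal NumPy sketch (an illustration, not from the paper):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (m, D) and B (n, D)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # (m, n) pairwise
    return max(dists.min(axis=1).max(),   # sup over a in A of d(a, B)
               dists.min(axis=0).max())   # sup over b in B of d(b, A)

# Two point sets at Hausdorff distance 1:
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hausdorff(A, B))  # 1.0
```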

3.2 Topological concepts

This subsection follows Chazal, Cohen-Steiner and Lieutier (2009) and Chazal and Lieutier (2005). The reach of a set $A$, denoted by $\operatorname{reach}(A)$, is the largest $r > 0$ such that each point in $A \oplus r$ has a unique projection onto $A$. A set with positive reach is, in a sense, a smooth set without self-intersections.

Now we describe a generalization of reach called the $\mu$-reach. The key point is simply that the $\mu$-reach is weaker than reach. The full details can be found in the aforementioned references. Let $K$ be a compact set. Following Chazal and Lieutier (2005), define the gradient $\nabla(x)$ of the distance function $d(x, K)$ to be the usual gradient function whenever this is well defined. However, there may be points at which $d(\cdot, K)$ is not differentiable in the usual sense. In that case, define the gradient as follows. For $x \in K$, define $\nabla(x) = 0$. For $x \notin K$, let $\Gamma(x) = \{y \in K : \|x - y\| = d(x, K)\}$. Let $\Theta(x)$ be the center of the unique smallest closed ball containing $\Gamma(x)$. Define $\nabla(x) = (x - \Theta(x))/d(x, K)$.

The critical points are the points at which $\nabla(x) = 0$. The weak feature size $\operatorname{wfs}(K)$ is the distance from $K$ to its closest critical point. For $0 < \mu < 1$, the $\mu$-reach is $r_\mu(K) = \inf\{d : \chi(d) < \mu\}$, where $\chi(d) = \inf\{\|\nabla(x)\| : d(x, K) = d\}$. It can be shown that $r_\mu(K)$ is nonincreasing in $\mu$, that $\lim_{\mu \to 0} r_\mu(K) = \operatorname{wfs}(K)$ and that $\lim_{\mu \to 1} r_\mu(K) = \operatorname{reach}(K)$.

As a simple example, a circle with radius $r$ has reach $r$. However, if we bend the circle slightly to create a corner, the reach is 0 but, provided the kink is not too extreme, the $\mu$-reach is still positive. As another example, a straight line has infinite reach. Now suppose we add a corner as in Figure 4. This set has 0 reach but has positive $\mu$-reach.

Figure 4: A straight line has infinite reach. A line with a corner, as in this figure, has 0 reach but has positive $\mu$-reach.

Two maps $f: A \to B$ and $g: A \to B$ are homotopic if there exists a continuous map $F: [0, 1] \times A \to B$ such that $F(0, x) = f(x)$ and $F(1, x) = g(x)$. Two sets $A$ and $B$ are homotopy equivalent if there are continuous maps $f: A \to B$ and $g: B \to A$ such that the following is true: (i) $g \circ f$ is homotopic to the identity map on $A$ and (ii) $f \circ g$ is homotopic to the identity map on $B$. In this case we write $A \cong B$. Sometimes $A$ fails to be homotopy equivalent to $B$, but $A$ is homotopy equivalent to $B \oplus \epsilon$ for every sufficiently small $\epsilon > 0$. This happens because $B \oplus \epsilon$ is slightly smoother than $B$. If $A \cong B \oplus \epsilon$ for all small $\epsilon > 0$, we will say that $A$ and $B$ are nearly homotopic and we will write $A \mathrel{\widetilde{\cong}} B$.

The following result [Theorem 4.6 in Chazal, Cohen-Steiner and Lieutier (2009)] says that if a set $K$ is smooth and a set $K'$ is close to $K$, then a smoothed version of $K'$ is nearly homotopy equivalent to $K$.

Theorem 1 ([Chazal, Cohen-Steiner and Lieutier (2009)])

Let $K$ and $K'$ be compact sets and let $\epsilon = \operatorname{Haus}(K, K')$. If

$$\epsilon < C(\mu)\, r_\mu(K), \qquad (11)$$

where $C(\mu) > 0$ is an explicit constant depending only on $\mu$, then $K' \oplus \epsilon \mathrel{\widetilde{\cong}} K$.

3.3 Matrix theory

We make extensive use of matrix theory as can be found in Stewart and Sun (1990), Bhatia (1997), Horn and Johnson (2013) and Magnus and Neudecker (1988).

Let $A$ be an $m \times n$ matrix. Let $a_{ij}$ denote an element of the matrix. Then the Frobenius norm is $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ and the operator norm is $\|A\| = \sup_{\|x\| = 1} \|Ax\|$. We define $\|A\|_{\max} = \max_{i,j} |a_{ij}|$. It is well known that $\|A\| \le \|A\|_F$, that $\|A\|_{\max} \le \|A\|$ and that $\|A\|_F \le \sqrt{mn}\,\|A\|_{\max}$.

The operator $\operatorname{vec}$ converts a matrix into a vector by stacking the columns. Thus, if $A$ is $m \times n$ then $\operatorname{vec}(A)$ is a vector of length $mn$. Conversely, given a vector $a$ of length $mn$, let $[a]$ denote the $m \times n$ matrix obtained by stacking $a$ columnwise into matrix form. We can think of $[\cdot]$ as the “anti-vec” operator.

If $A$ is $m \times n$ and $B$ is $p \times q$, then the Kronecker product $A \otimes B$ is the $mp \times nq$ matrix

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}. \qquad (12)$$

If $A$ and $B$ have the same dimensions, then the Hadamard (entrywise) product $A \circ B$ is defined by $(A \circ B)_{ij} = a_{ij} b_{ij}$.
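The norms, the vec/anti-vec pair, and the two products are all one-liners in NumPy, which makes the stated inequalities easy to check numerically (an illustration, not from the paper):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])

fro = np.linalg.norm(A, 'fro')      # Frobenius norm
op  = np.linalg.norm(A, 2)          # operator (spectral) norm
mx  = np.abs(A).max()               # max norm
assert mx <= op <= fro <= 2 * mx    # ||A||_max <= ||A|| <= ||A||_F <= sqrt(mn) ||A||_max

v = A.flatten(order='F')            # vec(A): stack the columns
assert (v.reshape(2, 2, order='F') == A).all()   # [.]: the "anti-vec" operator

K = np.kron(A, B)                   # Kronecker product, here 4 x 4
Had = A * B                         # Hadamard (entrywise) product
```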

For matrix calculus, we follow the conventions in Magnus and Neudecker (1988). If $F: \mathbb{R}^m \to \mathbb{R}^n$ is a vector-valued map, then the Jacobian matrix will be denoted by $F'(x)$ or $\frac{dF(x)}{dx}$. This is the $n \times m$ matrix with $(j, k)$ element $\partial F_j(x)/\partial x_k$. If $F: \mathbb{R}^m \to \mathbb{R}^{n \times q}$ is a matrix-valued map, then $F'(x)$ is an $nq \times m$ matrix defined by

$$F'(x) = \frac{d\operatorname{vec}(F(x))}{dx}. \qquad (13)$$

If $f: \mathbb{R}^D \to \mathbb{R}$, then the derivative of its Hessian $H(x)$ is a $D^2 \times D$ matrix given by

$$H'(x) = \frac{d\operatorname{vec}(H(x))}{dx}.$$

We then have the following product rule for matrix calculus: if $F: \mathbb{R}^m \to \mathbb{R}^{n \times q}$ and $G: \mathbb{R}^m \to \mathbb{R}^{q \times r}$, then

$$\frac{d\operatorname{vec}(F(x)G(x))}{dx} = (G(x)^T \otimes I_n)\,F'(x) + (I_r \otimes F(x))\,G'(x).$$

Also, if $f: \mathbb{R}^D \to \mathbb{R}$ then $H(x) = g'(x)$, where $g$ denotes the gradient of $f$.

The following version of the Davis–Kahan theorem is from von Luxburg (2007). Let $H$ and $\tilde{H}$ be two symmetric, square matrices. Let $\Lambda$ be the diagonal matrix of eigenvalues of $H$. Let $S \subset \mathbb{R}$ be an interval and let $V$ be the matrix whose columns are the eigenvectors corresponding to the eigenvalues of $H$ in $S$, and similarly for $\tilde{V}$ and $\tilde{H}$. Let

$$\delta = \min\bigl\{|\lambda - s| : \lambda \text{ an eigenvalue of } H \text{ not in } S,\ s \in S\bigr\}. \qquad (14)$$

According to the Davis–Kahan theorem,

$$\|\sin\Theta(V, \tilde{V})\|_F \le \frac{\|H - \tilde{H}\|_F}{\delta}, \qquad (15)$$

where $\Theta(V, \tilde{V})$ is the diagonal matrix of principal angles between the column spaces of $V$ and $\tilde{V}$.

Let $A$ be a square, symmetric matrix with eigenvalues $\alpha_1 \ge \cdots \ge \alpha_D$. Let $B$ be another square, symmetric matrix with eigenvalues $\beta_1 \ge \cdots \ge \beta_D$. By Weyl’s theorem [Theorem 4.3.1 of Horn and Johnson (2013)], we have that

$$\alpha_j + \lambda_D(B - A) \le \beta_j \le \alpha_j + \lambda_1(B - A). \qquad (16)$$

It follows easily that

$$\max_j |\alpha_j - \beta_j| \le \|A - B\|. \qquad (17)$$
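A quick numerical sanity check of the Davis–Kahan and Weyl bounds (our illustration; the matrices and the eigenvalue set $S$ are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(1)

A = rng.normal(size=(5, 5)); H = (A + A.T) / 2          # symmetric H
B = 0.01 * rng.normal(size=(5, 5)); E = (B + B.T) / 2   # small symmetric perturbation
Ht = H + E

vals, vecs = np.linalg.eigh(H)        # eigenvalues in ascending order
tvals, tvecs = np.linalg.eigh(Ht)

# Weyl / (16)-(17): each eigenvalue moves by at most ||E||.
print(np.abs(vals - tvals).max(), "<=", np.linalg.norm(E, 2))

# Davis-Kahan / (14)-(15): take S = the two smallest eigenvalues of H.
V, tV = vecs[:, :2], tvecs[:, :2]
delta = vals[2] - vals[1]             # distance from S to the other eigenvalues
cosines = np.linalg.svd(V.T @ tV, compute_uv=False)
sines = np.sqrt(np.clip(1 - cosines**2, 0, None))       # sines of principal angles
print(np.linalg.norm(sines), "<=", np.linalg.norm(E, 'fro') / delta)
```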

4 Properties of ridges

In this section, we examine some of the properties of ridges as they were defined in Section 2 and show that, under appropriate conditions, if two functions are close together then their ridges are close and are topologically similar.

4.1 Arclength parameterization

It will be convenient to parameterize the gradient ascent paths by arclength. Thus, let $s = A(t)$ be the arclength from $\pi(0)$ to $\pi(t)$:

$$A(t) = \int_0^t \|\pi'(u)\|\,du. \qquad (18)$$

Let $T = A^{-1}$ denote the inverse of $A$. Note that

$$T'(s) = \frac{1}{\|\pi'(T(s))\|} = \frac{1}{\|G(\pi(T(s)))\|}. \qquad (19)$$

Let $\gamma(s) = \pi(T(s))$. Then

$$\gamma'(s) = \pi'(T(s))\,T'(s) = \frac{G(\gamma(s))}{\|G(\gamma(s))\|}, \qquad (20)$$

which is a restatement of (7) in the arclength parameterization.

In what follows, we will often abbreviate notation by using the subscript $s$ in the following way: $p_s = p(\gamma(s))$, $g_s = g(\gamma(s))$, $H_s = H(\gamma(s))$, $G_s = G(\gamma(s))$ and so forth.

4.2 Differentials

We will need derivatives of $p$, $g$, $H$ and $L$. The derivative of $g$ is the Hessian $H$. Recall from (13) that $H'(x) = d\operatorname{vec}(H(x))/dx$. We also need derivatives along the curve $\gamma$. The derivative of a function $f$ along $\gamma$ is

$$\frac{df(\gamma(s))}{ds} = \nabla f(\gamma(s))^T\,\gamma'(s). \qquad (21)$$

Thus, the derivative of the gradient $g$ along $\gamma$ is

$$g_s' = H_s\,\gamma'(s). \qquad (22)$$

We will also need the derivative of $H$ in the direction of a vector $u$, which we will denote by $H'(x; u)$. We can write an explicit formula for $H'(x; u)$ as follows. Note that the elements of $H'(x)u$ are the directional derivatives of the entries of $H$, arranged in a vector. Hence, $H'(x; u) = [H'(x)u]$. (Recall that $[\cdot]$ stacks a vector into a matrix.) Note that $H'(x; u)$ is a $D \times D$ matrix.

Recall that $L(x) = V(x)V(x)^T$. The collection $\{L(x) : x \in \mathbb{R}^D\}$ defines a matrix field: there is a matrix attached to each point $x$. We will need the derivative of this field along the integral curves $\gamma$. For any $x$, there is a unique path $\gamma$ and a unique $s$ such that $\gamma(s) = x$. Define

$$L_s' = \lim_{h \to 0} \frac{L_{s+h} - L_s}{h}, \qquad (23)$$

where $L_s = L(\gamma(s))$ and $s$ is such that $\gamma(s) = x$.

4.3 Uniqueness of the paths

Lemma 2

Conditions (A0)–(A2) imply that, for each $x$, there is a unique path passing through $x$.

We will show that the vector field $G(x)$ is Lipschitz over $R \oplus \delta$. The result then follows from Theorem 3.39 of Irwin (1980). Recall that $G(x) = L(x)g(x)$ and $g$ is differentiable. It suffices to show that $L$ is differentiable over $R \oplus \delta$. Now $L$ is a function of $H$. It may be shown that, as a function of $H$, $L$ is Frechet differentiable. And $H$ is differentiable by assumption. By the chain rule, $L$ is differentiable as a function of $x$. Indeed, $L'(x)$ is the $D^2 \times D$ matrix whose $j$th column is $\operatorname{vec}(dL(H)[H'(x; e_j)])$, where $dL(H)[\cdot]$ denotes the Frechet derivative, and $e_j$ is the vector which is 1 in the $j$th coordinate and zero otherwise.

4.4 Quadratic behavior

Conditions (A1) and (A2) imply that the function $p$ has quadratic-like behavior near the ridges. This property is needed for establishing the convergence of ridge estimators. In this section, we formalize this notion of quadratic behavior. Given a path $\gamma$ with $\gamma(0) \in R$, parameterized so that it moves away from the ridge as $s$ increases, define the function

$$\xi(s) = p(\gamma(0)) - p(\gamma(s)). \qquad (24)$$

Thus, $\xi(s)$ is simply the drop in the function $p$ along the curve $\gamma$ as we move away from the ridge. We write $\xi_x$ if we want to emphasize that $\xi$ corresponds to the path $\gamma_x$ passing through the point $x$. Since $\xi$ is a function of the scalar $s$, we define its derivatives in the usual way, that is, $\xi'(s) = d\xi(s)/ds$.

Lemma 3

Suppose that (A0)–(A2) hold. For all paths $\gamma$ and all $s$, the following are true:

1. $\xi(0) = 0$.

2. $\xi'(0) = 0$ and $\xi'(s) = \|G_s\|$.

3. The second derivative of $\xi$ is:

$$\xi''(s) = -\frac{G_s^T H_s G_s}{\|G_s\|^2} + \frac{g_s^T L_s' g_s}{2\|G_s\|}. \qquad (25)$$

4. $\xi''(s) \ge \beta/2$.

5. $p(\gamma(s))$ is nonincreasing in $s$.

6. $\xi(s) \ge \beta s^2/4$.

1. The first condition is immediate from the definition.

2. Next,

$$\xi'(s) = -\frac{dp(\gamma(s))}{ds} = -g_s^T \gamma'(s) = \frac{g_s^T G_s}{\|G_s\|} = \|G_s\|.$$

Since the projected gradient is 0 at the ridge, we have that $\xi'(0) = 0$.

3. Note that $\xi'(s)^2 = \|G_s\|^2 = g_s^T L_s g_s$. Differentiating both sides of this equation, we have that $2\xi'(s)\xi''(s) = \frac{d}{ds}(g_s^T L_s g_s)$, and hence

$$\xi''(s) = \frac{1}{2\|G_s\|}\,\frac{d}{ds}\bigl(g_s^T L_s g_s\bigr).$$

Now

$$\frac{d}{ds}\bigl(g_s^T L_s g_s\bigr) = 2\,g_s^T L_s g_s' + g_s^T L_s' g_s. \qquad (26)$$

Since $\gamma'(s) = -G_s/\|G_s\|$, we have that $g_s' = H_s\gamma'(s) = -H_s G_s/\|G_s\|$, and hence

$$g_s^T L_s g_s' = -\frac{G_s^T H_s G_s}{\|G_s\|}.$$

Therefore,

$$\frac{d}{ds}\bigl(g_s^T L_s g_s\bigr) = -\frac{2\,G_s^T H_s G_s}{\|G_s\|} + g_s^T L_s' g_s. \qquad (27)$$

Recall that $\xi'(s) = \|G_s\|$. Thus,

$$\xi''(s) = -\frac{G_s^T H_s G_s}{\|G_s\|^2} + \frac{g_s^T L_s' g_s}{2\|G_s\|}. \qquad (28)$$

4. The first term in (28) is $-G_s^T H_s G_s/\|G_s\|^2$. Since $G_s$ is in the column space of $V_s$, we can write $G_s = V_s a_s$ where $a_s = V_s^T g_s$. Hence, from (A1),

$$\frac{G_s^T H_s G_s}{\|G_s\|^2} = \frac{a_s^T V_s^T H_s V_s a_s}{\|a_s\|^2} \le \lambda_{d+1}(\gamma(s)) \le -\beta,$$

and thus the first term is at least $\beta$. Now we bound the second term $g_s^T L_s' g_s/(2\|G_s\|)$, for which it suffices to show that it is at most $\beta/2$; this requires a bound on $\|L_s'\|$. To bound $\|L_s'\|$ we proceed as follows. Let $x = \gamma(s)$ with $x \in R \oplus \delta$. Then, from Davis–Kahan,