1 Introduction
Multivariate data in many problems exhibit intrinsic lower dimensional structure. The existence of such structure is of great interest for dimension reduction, clustering and improved statistical inference, and the question of how to identify and characterize this structure is the focus of active research. A commonly used representation for lowdimensional structure is a smooth manifold. Unfortunately, estimating manifolds can be difficult even under mild assumptions. For instance, the rate of convergence for estimating a manifold with bounded curvature blurred by homogeneous Gaussian noise, is logarithmic [Genovese et al. (2012a)], meaning that an exponential amount of data are needed to attain a specified level of accuracy. In this paper, we offer a way to circumvent this problem. We define an object, which we call a hyperridge set that can be used to approximate the lowdimensional structure in a data set. We show that the hyperridge set captures the essential features of the underlying lowdimensional structure while being estimable from data at a polynomial rate.
Let
be a sample from a probability density
defined on an open subset of dimensional Euclidean space and let be an estimate of the density. We will define hyperridge sets (called ridges for short) for both and , which we denote by and . We consider two cases that make different assumptions about . In the hidden manifold case (see Figure 1), we assume that the density is derived by sampling from a dimensional manifold and adding dimensional noise. In the density ridge case, we look for ridges of a density without assuming any hidden manifold, simply as a way of finding structure in a point cloud, much like clustering. The goal in both cases is to estimate the hyperridge set. Although in the former case, we would ideally like to estimate , this is not always feasible for reasonable sample sizes, so we use the ridge as a surrogate for . We focus on estimating ridges from point cloud data; we do not consider image data in this paper.A formal definition of a ridge is given in Section 2. Let be fixed. Loosely speaking, we define a dimensional hyperridge set of a density to be the points where the Hessian of has
strongly negative eigenvalues and where the projection of the gradient on that subspace is zero. Put another way, the ridge is a local maximizer of the density when moving in the normal direction defined by the Hessian.
Yet another way to think about ridges is by analogy with modes. We can define a mode to be a point where the gradient is 0 and the second derivative is negative, that is, the eigenvalues of the Hessian are negative. The Hessian defines a dimensional normal space (corresponding to the smallest eigenvalues) and a dimensional tangent space. A ridge point has a projected gradient (the gradient in the direction of the normal) that is 0 and eigenvalues in the normal space that are negative. Modes are simply dimensional ridges.
A stylized example is shown in Figure 2. In this example, the density is where , is a circle in , is a smooth (but nonuniform) density supported on and
is a twodimensional Gaussian with a variance
that is much smaller than the radius of the circle. The ridge is a onedimensional subset of . The figure has a solid curve to show the ridge lifted onto , that is, the curve shows the set . The ridge does not coincide exactly with due to the blurring by convolution with the Gaussian. In fact, is a circle with slightly smaller radius than . That is, is a biased version of . Figure 3 shows and .Note that the density is not uniform over the ridge. Indeed, there can be modes (dimensional ridges) within a ridge. What matters is that the function rises sharply as we approach the ridge (strongly negative eigenvalue).
One of the main points of this paper is that captures the essential features of . If we can live with the slight bias in , then it is better to estimate since can be estimated at a polynomial rate while can only be estimated at a logarithmic rate. Throughout this paper, we take the dimension of interest as fixed and given.
Many different and useful definitions of a “ridge” have been proposed; see the discussion of related work at the end of this section. We make no claim as to the uniqueness and optimality of ours. Our definition is motivated by four useful properties that we demonstrate in this paper: [1.]
If is close to , then is close to where is the ridge of and is the ridge of .
If the datagenerating distribution is concentrated near a manifold , then the ridge approximates both geometrically and topologically.
can be estimated at a polynomial rate, even in cases where can be estimated at only a logarithmic rate.
The definition corresponds essentially with the algorithm derived by Ozertem and Erdogmus (2011). That is, our definition provides a mathematical formalization of their algorithm.
Our broad goal is to provide a theoretical framework for understanding the problem of estimating hyperridge sets. In particular, we show that the ridges of a kernel density estimator consistently estimate the ridges of the density, and we find and upper bound on the rate of convergence. The main results of this paper are (stated here informally):

Stability (Theorem 4). If two densities are sufficiently close together, their hyperridge sets are also close together.

Surrogate (Theorem 7). In the Hidden Manifold case with small noise variance and assuming has no boundary, the hyperridge set of the density satisfies
(2) and is topologically similar to . Hence, when the noise is small, the ridge is close to . Note that we treat as fixed while . It then follows that
(3)
This leaves open the question of how to locate the ridges of the density estimator. Fortunately, this latter problem has recently been solved by Ozertem and Erdogmus (2011) who derived a practical algorithm called the subspace constrained mean shift (SCMS) algorithm for locating the ridges. Ozertem and Erdogmus (2011) derived their method assuming that the underlying density function is known (i.e., they did not discuss the effect of estimation error). We, instead, assume the density is estimated from a finite sample and adapt their algorithm accordingly by including a denoising step in which we discard points with low density. This paper provides a statistical justification for, and extension to, their algorithm. We introduce a modification of their algorithm called SuRF (Subspace Ridge Finder) that applies density estimation, followed by denoising, followed by SCMS.
Related work. Zero dimensional ridges are modes and in this case ridge finding reduces to mode estimation and SCMS reduces to the mean shift clustering algorithm [Fukunaga and Hostetler (1975), Cheng (1995), Li, Ray and Lindsay (2007), Chacón (2012)].
If the hidden structure is a manifold, then the process of finding the structure is known as manifold estimation or manifold learning. There is a large literature on manifold estimation and related techniques. Some useful references are Niyogi, Smale and Weinberger (2008) Caillerie et al. (2011), Genovese et al. (2009, 2012a, 2012b, 2012c), Tenenbaum, de Silva and Langford (2000), Roweis and Saul (2000) and references therein.
The notion of ridge finding spans many fields. Previous work on ridge finding in the statistics literature includes Cheng, Hall and Hartigan (2004), Hall, Peng and Rau (2001), Wegman and Luo (2002), Wegman, Carr and Luo (1993) and Hall, Qian and Titterington (1992). These papers focus on visualization and exploratory analysis. An issue that has been discussed extensively in the applied math and computer science literature is how to define a ridge. A detailed history and taxonomy is given in the text by Eberly (1996). Two important classes of ridges are watershed ridges, which are global in nature, and height ridges, which are locally defined. There is some debate about the virtues of various definitions. See, for example, Norgard and Bremer (2012), Peikert, Günther and Weinkauf (2012). Related definitions also appear in the fluid dynamics literature [Schindler et al. (2012)] and astronomy [AragónCalvo et al. (2010), Sousbie et al. (2008)]. There is also a literature on Reeb graphs [Ge et al. (2011)] and metric graphs [Aanjaneya et al. (2012), Lecci, Rinaldo and Wasserman (2013)]. Metric graph methods are ideal for representing intersecting filamentary structure but are much more sensitive to noise than the methods in this paper. It is not our intent in this paper to argue that one particular definition of ridge is optimal for all purposes. Rather, we use a particular definition which is well suited for studying the statistical estimation of ridges.
More generally, there is a vast literature on hunting for structure in point clouds and analyzing the shapes of densities. Without attempting to be exhaustive, some representative work includes Davenport et al. (2010), Klemelä (2009), Adams, Atanasov and Carlsson (2011), Chazal et al. (2011), Bendich, Wang and Mukherjee (2012).
Throughout the paper, we use symbols like to denote generic positive constants whose value may be different in different expressions.
2 Model and ridges
In this section, we describe our assumptions about the data and give a formal definition of hyperridge sets, which we call ridges from now on. Further properties of ridges are stated and proved in Section 4.
We start with a point cloud . We assume that these data comprise a random sample from a distribution with density , where has at least five bounded, continuous derivatives. This is all we assume for the density ridge case. In the hidden manifold case, we assume further that and are derived from a dimensional manifold by convolution with a noise distribution, where . Specifically, we assume that is embedded within a compact subset and that
(4) 
where ,
is a uniform distribution on
, denotes convolution, is a distribution supported on , andis a Gaussian distribution on
with zero mean and covariance . While we could consider a more general noise distribution in (4), we focus on the common assumption of Gaussian noise. In that case, a hidden manifold can only be estimated at a logarithmic rate [Genovese et al. (2012b)], so ridge estimators are particularly valuable. (Even when can be estimated at a polynomial rate, ridge estimators are often easier in practice than estimating the manifold, which would involve deconvolution.)The data generating process under model (4) is equivalent to the following steps: [1.]
Draw from a .
If , draw from a uniform distribution on .
If , let where and is additional noise. Points drawn from represent background clutter. Points drawn from are noisy observations from . When consists of a finite set of points, this can be thought of as a clustering model.
2.1 Definition of ridges
As in Ozertem and Erdogmus (2011), our definition of ridges relies on the gradient and Hessian of the density function . Recall that is fixed throughout. Given a function , let denote its gradient and its Hessian matrix, at . Let
(5) 
denote the eigenvalues of and let be the diagonal matrix whose diagonal elements are the eigenvalues. Write the spectral decomposition of as . Let be the last columns of (i.e., the columns corresponding to the smallest eigenvalues). If we write then we can write . Let be the projector onto the linear space defined by the columns of . We call this the local normal space and the space spanned by is the local tangent space. Define the projected gradient
(6) 
If the vector field
is Lipschitz then by Theorem 3.39 of Irwin (1980), defines a global flow as follows. The flow is a family of functions such that and and . The flow lines, or integral curves, partition the space (see Lemma 2) and at each where is nonnull, there is a unique integral curve passing through . Thus, there is one and only one flow line through each nonridge point. The intuition is that the flow passing through is a gradient ascent path moving toward higher values of . Unlike the paths defined by the gradient which move toward modes, the paths defined by the projected gradient move toward ridges. The SCMS algorithm, which we describe later, can be thought of as approximating the flow with discrete, linear steps. [A proof that the linear interpolation of these points approximates the flow in the case
is given in AriasCastro, Mason and Pelletier (2013).]A map is an integral curve with respect to the flow of if
(7) 
Definition: The ridge of dimension is given by .
Note that the ridge consists of the destinations of the integral curves: if for some satisfying (7).
Our definition is motivated by Ozertem and Erdogmus (2011) but is slightly different. They first define the critical points as those for which . They call a critical point regular if it is critical but not critical. Thus, a mode within a onedimensional ridge is not regular. A regular point with is called a principal point. According to our definition, the ridge lies between the critical set and the principal set. Thus, if a mode lies on a onedimensional ridge, we include that point as part of the ridge.
2.2 Assumptions
We now record the main assumptions about the ridges that we will require for the results.
Assumption (A0) differentiability. For all , , and exist.
Assumption (A1) eigengap. Let denote a dimensional ball of radius centered at and let . We assume that there exists and such that, for all , and .
Assumption (A2) path smoothness. For each ,
(A2) 
where , and .
Condition (A1) says that is sharply curved around the ridge in the dimensional space normal to the ridge. To give more intuition about the condition, consider the problem of estimating a mode in one dimension. At a mode , we have that and . However, the mode cannot be uniformly consistently estimated by only requiring the second derivative to be negative since could be arbitrarily close to 0. Instead, one needs to assume that for some positive constant . Condition (A1) may be thought of as the analogous condition for a ridge. (A2) is a third derivative condition which implies that the paths cannot be too wiggly. (A2) also constrains the gradient from being too steep in the perpendicular direction. Note that these conditions are local: they hold in a size neighborhood around the ridge.
3 Technical background
Now we review some background. We recommend that the reader quickly skim this section and then refer back to it as needed.
3.1 Distance function and Hausdorff distance
We let denote a dimensional open ball centered at with radius . If is a set and is a point then we define the distance function
(8) 
where is the Euclidean norm. Given two sets and , the Hausdorff distance between and is
(9) 
where
(10) 
is called the dilation of . The dilation can be thought of as a smoothed version of . For example, if there are any small holes in , these will be filled in by forming the dilation .
We use Hausdorff distance to measure the distance between sets for several reasons: it is the most commonly used distance between sets, it is a very strict distance and is analogous to the familiar distance between functions for sets.
3.2 Topological concepts
This subsection follows Chazal, CohenSteiner and Lieutier (2009) and Chazal and Lieutier (2005). The reach of a set , denoted by , is the largest such that each point in has a unique projection onto . A set with positive reach is, in a sense, a smooth set without selfintersections.
Now we describe a generalization of reach called reach. The key point is simply that the reach is weaker than reach. The full details can be found in the aforementioned references. Let be a compact set. Following Chazal and Lieutier (2005) define the gradient of to be the usual gradient function whenever this is well defined. However, there may be points at which is not differentiable in the usual sense. In that case, define the gradient as follows. For define for all . For , let . Let be the center of the unique smallest closed ball containing . Define .
The critical points are the points at which . The weak feature size is the distance from to its closest critical point. For , the reach is where . It can be shown that is nonincreasing in , that and that .
As a simple example, a circle with radius has . However, if we bend the circle slightly to create a corner, the reach is 0 but, provided the kink is not too extreme, the reach is still positive. As another example, a straight line as infinite reach. Now suppose we add a corner as in Figure 4. This set has 0 reach but has positive reach.
Two maps and are homotopic if there exists a continuous map such that and . Two sets and are homotopy equivalent if there are continuous maps and such that the following is true: (i) is homotopic to the identity map on and (ii) is homotopic to the identity map on . In this case we write . Sometimes fails to be homotopic to but is homotopic to for every sufficiently small . This happens because is slightly smoother than . If for all small , we will say that and are nearly homotopic and we will write .
The following result [Theorem 4.6 in Chazal, CohenSteiner and Lieutier (2009)] says that if a set is smooth and is close to , then a smoothed version of is nearly homotopy equivalent to .
Theorem 1 ([Chazal, CohenSteiner and Lieutier (2009)])
Let and be compact sets and let . If
(11) 
then .
3.3 Matrix theory
We make extensive use of matrix theory as can be found in Stewart and Sun (1990), Bhatia (1997), Horn and Johnson (2013) and Magnus and Neudecker (1988).
Let be an matrix. Let denote an element of the matrix. Then the Frobenius norm is and the operator norm is . We define . It is well known that , that and that .
The operator converts a matrix into a vector by stacking the columns. Thus, if is then is a vector of length . Conversely, given a vector of length , let denote the matrix obtained by stacking columnwise into matrix form. We can think of as the “antivec” operator.
If is and is then the Kronecker is the matrix
(12) 
If and have the same dimensions, then the Hadamard product is defined by .
For matrix calculus, we follow the conventions in Magnus and Neudecker (1988). If is a vectorvalued map then the Jacobian matrix will be denoted by or . This is the matrix with . If is a matrixvalued map then is a matrix defined by
(13) 
If then the derivative is a matrix given by
We then have the following product rule for matrix calculus: if and then
Also, if then where denotes the gradient of .
The following version of the Davis–Kahan theorem is from von Luxburg (2007). Let and be two symmetric, square matrices. Let be the diagonal matrix of eigenvalues of . Let and let
be the matrix whose columns are the eigenvectors corresponding to the eigenvalues of
in and similarly for and . Let(14) 
According to the Davis–Kahan theorem,
(15) 
Let be a square, symmetric matrix with eigenvalues . Let be another square, symmetric matrix with eigenvalues . By Weyl’s theorem [Theorem 4.3.1 of Horn and Johnson (2013)], we have that
(16) 
It follows easily that
(17) 
4 Properties of ridges
In this section, we examine some of the properties of ridges as they were defined in Section 2 and show that, under appropriate conditions, if two functions are close together then their ridges are close and are topologically similar.
4.1 Arclength parameterization
It will be convenient to parameterize the gradient ascent paths by arclength. Thus, let be the arclength from to :
(18) 
Let denote the inverse of . Note that
(19) 
Let . Then
(20) 
which is a restatement of (7) in the arclength parameterization.
In what follows, we will often abbreviate notation by using the subscript in the following way: and so forth.
4.2 Differentials
We will need derivatives of , , and . The derivative of is the Hessian . Recall from (13) that . We also need derivatives along the curve . The derivative of a functions along is
(21) 
Thus, the derivative of the gradient along is
(22) 
We will also need the derivative of in the direction of a vector which we will denote by
We can write an explicitly formula for as follows. Note that the elements of are the partial derivatives arranged in a matrix. Hence, . (Recall that stacks a vector into a matrix.) Note that is a matrix.
Recall that . The collection defines a matrix field: there is a matrix attached to each point . We will need the derivative of this field along the integral curves . For any , there is a unique path and unique such that . Define
where and with .
4.3 Uniqueness of the paths
Lemma 2
Conditions (A0)–(A2) imply that, for each , there is a unique path passing through .
We will show that the vector field is Lipschitz over . The result then follows from Theorem 3.39 of Irwin (1980). Recall that and is differentiable. It suffices to show that is differentiable over . Now . It may be shown that, as a function of , is Frechet differentiable. And
is differentiable by assumption. By the chain rule,
is differentiable as a function of . Indeed, is the matrix whose th column is where , denotes the Frechet derivative, and is the vector which is 1 in the th coordinate and zero otherwise.4.4 Quadratic behavior
Conditions (A1) and (A2) imply that the function has quadraticlike behavior near the ridges. This property is needed for establishing the convergence of ridge estimators. In this section, we formalize this notion of quadratic behavior. Give a path , define the function
(24) 
Thus, is simply the drop in the function along the curve as we move away from the ridge. We write if we want to emphasize that corresponds to the path passing through the point . Since , we define its derivatives in the usual way, that is, .
Lemma 3
Suppose that (A0)–(A2) hold. For all , the following are true: [1.]
.
and .
The second derivative of is:
(25) 
.
is nonincreasing in .
.
1. The first condition is immediate from the definition.
2. Next,
Since the projected gradient is 0 at the ridge, we have that .
3. Note that . Differentiating both sides of this equation, we have that , and hence
Now
(26) 
Since we have that , and hence
Therefore,
(27) 
Recall that . Thus,
(28) 
4. The first term in is . Since is in the column space of , where . Hence, from (A1),
and thus
Now we bound the second term . Since and , we have . Now . To see this, note that implies implies implies . To bound we proceed as follows. Let with . Then, from Davis–Kahan,
Comments
There are no comments yet.