1 Introduction
Manifold learning is one of the fundamental fields in Machine Learning (van der Maaten et al., 2009; Tenenbaum et al., 2000; Zhang and Zha, 2004; Sun et al., 2012; Weinberger and Saul, 2006; Burges, 2009; Sorzano et al., 2014)
. It is motivated by the notion that high dimensional data sets often exhibit intrinsic structure that is concentrated on or near (sub)manifolds of lower local dimensionality. In addition to this, intrinsic structure that is nonlinear prevents the use of fast linear machine learning algorithms. Furthermore, recent developments have shown that the ridges
^{1}^{1}1The terms density ridge, principal curve and principal surface will be used interchangeably depending on context. of the probability density function can be used for both estimation and denoising of manifolds (Ozertem and Erdogmus, 2011; Genovese et al., 2014; Gerber and Whitaker, 2013; Pulkkinen, 2015; Pulkkinen et al., 2014; Hein and Maier, 2006).In practice manifold learning is often posed as either learning coordinates that describe the intrinsic manifold or simply as unwrapping, stretching or unfolding the manifold such that it can, to some extent, be treated as a linear subspace of .
Most algorithms in manifold learning rely on the socalled manifold assumption (Lin and Zha, 2008)
. The assumption is that the data cloud in a vector space lies on or close to a hypersurface of lower intrinsic dimension. Doing inference on or along this hypersurface would enable the use of simple linear methods which are fast and theoretically welldefined. A much used example to illustrate these concepts is the socalled swiss roll
(Tenenbaum et al., 2000). It consists of a 2D plane embedded in 3D and folded into a swiss roll shape. Calculating distances along this shape using standard Euclidean geometry would fail, and so will simple methods like for example linear regression if we wanted to estimate the data relationships. See e.g. Tenenbaum et al.,
(Tenenbaum et al., 2000), for illustrations and details.There exists many algorithms and methods that globally or locally tries to unwrap or unfold nonlinear manifolds, but none of which use the density ridges for this purpose. Principal curves and density ridge algorithms, (Hastie and Stuetzle, 1989; Ozertem and Erdogmus, 2011; Gerber, 2013), find curves or surfaces that are smooth estimates of the underlying manifold, but the possible intrinsic nonlinear structure will still be present such that linear operations in the ambient space will fail to represent the manifold.
Expanding on these works, our main contribution in this paper is the use of density ridges to unwrap manifolds. In this paper we use the ridges of a probability density function as defined by Ozertem and Erdogmus, (Ozertem and Erdogmus, 2011), based on the gradient and Hessian of a kernel density estimate. This definition, though computationally expensive, gives an explicit nonparametric and nonambiguous description of the density ridge. Furthermore, by tracing curve lengths along the local onedimensional ridges of a mode attraction basin^{2}^{2}2The basin of attraction is the the set of points in a probability density where the gradient flow curves converges to the same point – the local mode. we can estimate distances and a local coordinate system along the manifold. By doing this, a local linear representation along the manifold structure can be constructed. In addition, we propose an algorithm inspired by parallel transport as defined in differential geometry to combine several local coordinate representations to form a global representation along a single connected manifold. Finally we suggest an extension from using onedimensional ridges to unfold the manifold to using dimensional representations. This is done by creating locally flat charts of the manifold combined with approximated geodesic distances to relate the charts together to form a global representation of the manifold.
To summarize, our key contributions are: (1) Introduce a new algorithm for manifold learning that use the ridges of the probability density function. This enables a completely geometrically interpretable scheme for learning the structure of a hidden manifold. Note that this work is an extension of the scheme presented in, (Shaker et al., 2014), which is only local and limited to manifolds with intrinsic onedimensional structure. (2) We present a new algorithm for combining local coordinate charts along a onedimensional density ridge into a global atlas over a single connected manifold. (3) We present a local chart approximation for dimensional manifolds and an algorithm for calculating geodesics on dimensional manifolds represented by density ridges. Combining these gives a global unwrapping of a single connected manifold containing multiple charts.
1.1 Related research
In this work the goal is to unwrap manifolds estimated by density ridges. This yields two fields of literature to reference, the topics related to principal curves/density ridges and topics related to manifold learning. We start with principal curves and surfaces.
Research related to the manifoldestimation part of our work is usually described by the interrelated terms principal curves, principal surfaces and principal manifolds. Principal curves and related subjects have been studied in a wide array of settings, often under different names and configurations. The most common case is principal curves, which are smooth onedimensional curves embedded in . They were originally posed as smooth selfconsistent curves passing through the ‘middle’ of the data by Hastie and Stuetzle (Hastie and Stuetzle, 1989). Kegl et al. proposed to constrain the length of the principal curves, enabling a more thorough theoretical analysis and simpler implementation (Kégl et al., 2000). Einbeck et al. suggested an algorithm for finding local principal curves based on local linear projections (Einbeck et al., 2005). Our work is motivated by the work of Ozertem and Erdogmus, where local principal curves are defined as onedimensional ridges of the data probability density (Ozertem and Erdogmus, 2011). This framework also extends naturally to cover higher dimensional principal surfaces by using higher dimensional density ridges. See also the work of Bas, (Bas, 2011), for further details and applications.
In the last few years several very interesting works related to the density ridge interpretation have been introduced. Genovese et al. have provided a theoretical analysis of nonparametric density ridge estimation (Genovese et al., 2014) and Chen provides asymptotic theory and formulates density ridges as socalled solution manifolds (Chen et al., 2014c, a). Another closely related setting to the principal curves and derivatives of the probability density is filament estimation, which is in practice very similar to principal curves (Genovese et al., 2009, 2012). For estimation purposes, Pulkkinen and Pulkkinen et al. have proposed several practical improvements for the density ridge estimation framework of Ozertem and Erdogmus (Ozertem and Erdogmus, 2011), both in the generative model and as a method of optimization (Pulkkinen, 2015; Pulkkinen et al., 2014). Finally we mention the recent work of Gerber and Whitaker, (Gerber and Whitaker, 2013), where the original framework of Hastie and Stuetzle (Hastie and Stuetzle, 1989) is reformulated to avoid regularization both for principal curves and surfaces.
Beyond metastudies of principal curves and density ridges themselves, several applications can be found. Examples include template based classification (Chang and Ghosh, 1998), ice floe detection (Banfield and Raftery, 1992), galaxy identification (Chen et al., 2014b), character skeletonization (Krzy, 2002) and clustering (Stanford and Raftery, 2000) to name a few.
As an intermediate summary we recall that the work in this paper expands the idea of a principal curve or surface to actually unfolding/unwrapping the curves or surfaces found by the algorithms mentioned above.
In the field of manifold learning within the machine learning literature, several works exists that are closely related to ours in terms of manifold unwrapping. The closest in principles and ideas are the local tangent space alignment algorithm, (Zhang and Zha, 2004), the LOGMAP algorithm for calculating normal coordinates of a manifold, (Brun et al., 2005), and the manifold chart stitching of Pitelis et al. in (Pitelis et al., 2013)
. The manifold parzen algorithm and contractive autoencoders should also be mentioned,
(Vincent and Bengio, 2002; Rifai et al., 2011), as they are, similar to our work, algorithms that tries to learn representations along nonlinear manifolds via estimates of the probability density.1.2 Motivation: Using density ridges to unwrap manifolds
A kernel density ridge can be used to completely describe a biased version, called a surrogate, of a manifold embedded in given sufficient samples and bounded noise (Genovese et al., 2014). The main motivation for using density ridges to locally unwrap nonlinear manifolds comes from the following.

As the ridge is estimated via a probability density estimate, all ridges of dimensions higher than zero will by construction induce a gradient flow contained along the ridge/manifold surrogate. This is a trivial consequence of the definition of critical set by Ozertem and Erdogmus (Ozertem and Erdogmus, 2011).

In the onedimensional case the gradient flow completely describes local sections (attraction basins) of the manifold estimate, and can thus be unwrapped directly by representing the manifold in approximate arclength coordinates. This is analoguous to pulling a curly string taut.

In the dimensional case the gradient flows that converge to the same point can be approximated and unwrapped by a local linear projection. This is analoguous to flattening a curled up sheet.
To further emphasize the first point: Unless the underlying manifold is a straight line or flat in all directions, the kernel size of the density estimators used – presented in equation (2) on page 2 – has to be bounded. This will thus inherently lead to local maxima along the density ridges, which further leads to a gradient flow contained along the ridge.
For clarity, we split the next part of our motivation into the onedimensional case and the general dimensional case, starting with the onedimensional case.
In the one dimensional case all points along the ridge can be described by the piecewise arclength of the gradient flow integral curve from the point to the local mode where the gradient vanishes. This will create a complete description of the coordinate distances from all points on the ridge to their corresponding local maxima collected in local sections of the curve. In Figure 0(a), we see a 2D version of the synthetic ‘swiss roll’ dataset with added Gaussian noise, (Roweis and Saul, 2000). It is clearly a onedimensional manifold (curve) embedded in . The density ridge is shown in dark blue dots, while the noisy data is shown in light green dots. The ridge was found using the framework of Ozertem and Erdogmus, (Ozertem and Erdogmus, 2011), and a RungeKuttaFehlberg ODE solver implemented in MATLAB^{3}^{3}3We will go into further details in section 4. We see that the ridge is a good approximate for the manifold. To further illustrate the properties of the ridges we show a 3D plot with the ridge of the noisy swiss roll on the horizontal axes, and the density values along the ridge shown on the vertical axis in Figure 0(b).
It is evident that there are local maxima along the single connected ridge. Again, this observation is key to our proposition; by following the gradient flow of the density along the ridge we have a way of tracing the underlying manifold and thus calculating distances along the manifold.
Continuing to the dimensional case, the gradient flow will generate unique curves from each point on the density towards the local modes. The manifold can thus be inuitively be separated into parts based where the gradient flow converges to the same critical points. Each part can then be mapped by a linear projection approximating the logarithmic map (see for example Figure 4 on page 4 or Brun et al. (Brun et al., 2005)). In Figure 2 we see a twodimensional manifold embedded in estimated by a density ridge and the corresponding gradient flows from different views in Figure 1(b) and 1(a). In Figure 1(c) we see the sets of points, attraction basins, that have gradient flows that converge to the same point marked with similar colors. We see in Figure 1(d) that restricting the local flattening to cover only the different attraction basins separately will result in meaningful compact local representations. Flattening the entire manifold would, depending on perspective, most certainly introduce some kind of unwanted folding or collapsing.
Continuing from local to global unwrapping, we note that as we are using a kernel density estimate the gradient flow of the probability density is inherently local. Thus piecewise distances – or any other further operations – will also only be local, and to obtain a global unwrapping we need some strategy to combine local unwrappings. This is also clearly seen in Figure 1(d); the local attraction basins are flattened intuitively, but globally the representation is not meaningful.
To obtain a global representation we take inspiration from the concept of parallel transport from differential geometry (Lee, 1997) and introduce a translation along the tangent vectors of the onedimensional ridges (translation along geodesics in the dimensional case) to obtain a complete unwrapped, or simply flat, representation of the ridge. This is analoguous to parallel transporting vectors along a geodesic, see for example Freifeld et al., (Freifeld et al., 2014), except that the piecewise local distance along the geodesic is also added to the transported vector.
The result of global unwrapping^{4}^{4}4The result was obtained by using Algorithm 4 on page 4 of the twodimensional manifold in Figure 2 is shown in Figure 3.
To summarize, the gradient flow of a kernel density allows unwrapping manifolds using the following two steps:

Estimating local coordinate patches in the basins of attractions of the pdf.

Stitching together local coordinate charts by approximate parallel tranport/translation along geodesics.
To conclude this section we add two further comments that should be noted: (1) If the intrinsic manifold turns out to be an embedded surface () or hypersurface (), we have to consider the case of nonzero curvature, where an isometric (distancepreserving) mapping or unfolding cannot be guaranteed. For simplicity, in the rest of this paper we assume that the manifolds we are working with are isometric to . (2) In some cases, the definition of the density ridges allows for multiple orthogonal onedimensional ridges, up to if . In such cases local orthogonal coordinate systems can be estimated by first estimating the orthogonal ridges and then follow the gradient flow along each separate curve. This is useful in situations where the underlying onedimensional manifold is corrupted by noise that changes along the manifold.
1.3 Structure of paper
The rest of the paper is organized as follows: We start by formally introducing principal curves and density ridges as in the framework of Ozertem and Erdogmus (Ozertem and Erdogmus, 2011). We then include some relevant topics from differential geometry, and show connections to density ridges as found by kernel density estimation in Section 3. In Section 4 we present the principal curve/onedimensional density ridge unwrapping algorithm followed by the ridge translation algorithm in Section 5. Section 6 proposes extensions to density ridges of higher dimension and finally in Section 7 we show numerical examples and illustrations of the presented methods on both synthetic and real datasets.
2 Principal curves and density ridges
Principal curves were originally introduced by Hastie and Stuetzle (Hastie and Stuetzle, 1989). Several extensions were made, (Kégl et al., 2000; Einbeck et al., 2005; Tibshirani, 1992), until Ozertem and Erdogmus, (Ozertem and Erdogmus, 2011), redefined principal curves and surfaces as being the ridges of a probability density estimate, (Ozertem and Erdogmus, 2011). Given a probability density , its gradient and Hessian matrix , the ridge can be defined in terms of the eigendecomposition of the Hessian matrix. [Ozertem 2011] A point is on the dimensional ridge, of its probability density function, when the gradient is orthogonal to at least eigenvectors of and the corresponding eigenvalues are all negative. We express the spectral decomposition of as , where is the matrix of eigenvectors sorted according to the size of the eigenvalue and , , is a diagonal matrix of sorted eigenvalues. Furthermore can be decomposed into , where is the first eigenvectors of , and are the smallest. The latter is referred to as the orthogonal subspace due to the fact that when at a ridge point, all eigenvectors in will be orthogonal to .
This motivates the following initial value problem for projecting points onto a density ridge:
(1) 
where at , and . We denote the set of ’s that satisfy equation (1), calculated via the kernel density estimator , as the dimensional ridge estimator .
Of great value is the recent paper by Genovese et al., (Genovese et al., 2014), which showed that the Hausdorff distance between a dimensional manifold embedded in dimensions and the dimensional ridge of the density is bounded under certain restrictions wrt. noise and the closeness of the density estimate to the true density. They also show that the kernel density ridges are consistent estimators of the true underlying ridges and refer to the ridge as a surrogate of the underlying manifold^{5}^{5}5See Theorem 7 in (Genovese et al., 2014).. Thus, is a point set representing the underlying manifold with theoretically established bounds under Hausdorff loss.
2.1 Orthogonal local principal curves
In the special case of a projection onto a principal curve, , we note that the construction of the orthogonal subspace allows for choosing different orthogonal subspaces to project the gradient to. The most intuitive setting, which is most commonly used (Ozertem and Erdogmus, 2011; Genovese et al., 2014; Bas, 2011) is to use the top eigenvector(s) corresponding to eigenvalue(s) as the orthogonal subspace. We recall that the eigenvalues and eigenvectors of the Hessian matrix of a function points to the directions of largest second order change. In the case of a density ridge, the change in density along the ridge is assumed to be much lower than the change in density when moving away from the ridges, hence the choice of largest eigenvector pointing away from the ridge. This local ordering, though intuitive from the Hessian point of view, has not been well studied, and could in some cases actually give nonintuitive results^{6}^{6}6See the Tshaped mixture of Gaussians in (Ozertem and Erdogmus, 2011).. We will not go further into this in this paper.
The importance of this local decomposition is that considering a dimensional ridge, the ridge tracing algorithm, presented in Section 4, can be applied on each of the local orthogonal subspaces. In the case of e.g. different local scales of noise this could yield a better desciption of the data. Figure 4 shows a conceptual illustration of the idea, where the overall dominating nonlinearity can be described by a principal curve/onedimensional density ridge, but since the orthogonal structure varies along the ridge local decomposition provides further detail.
3 Relevant topics from differential geometry
Manifold learning is a framework which is inspired by concepts from differential geometry (Brun et al., 2005; Tenenbaum et al., 2000; Pitelis et al., 2013; Roweis and Saul, 2000; Gerber and Whitaker, 2013; Zhang and Zha, 2004). The main idea is that data sets or data structures seldom fill the vector space they are represented in. Even in low dimensions, e.g. , data often concentrate around clearly bounded subregions or socalled manifolds.
In some works the problem of learning representations on or along a manifold has been termed manifold unwrapping or unfolding (Weinberger and Saul, 2006; Sun et al., 2012). The idea is that learning the structure of a manifold keeps the local structures intact more strongly than global structures, so that the (global) nonlinearities can be unfolded, unrolled or unwrapped. This is perhaps the term most related to our work.
Unfolding or unwrapping a manifold can intuitively be performed in two different ways. Either we use some function that stretches or flattens the manifold such that we can use linear methods to calculate distances along the manifold, or we can somehow estimate the structure of the manifold such that distance measures can be defined directly along the manifold again allowing linear operations in the resulting coordinate system.
The ideas and concepts of manifold learning stems from differential geometry, the field of mathematics which studies smooth geometric objects and, closely related, smooth functions. Its main object of interest is the manifold, which can be roughly regarded as a well behaving smooth (topological) space.
A clear definition of a manifold can be found in either books of Lee or Tu, (Lee, 1997; Tu, 2008): A (topological) manifold is a second countable, locally Euclidean, Hausdorff space. Local Euclidean structure is analoguous to how humans perceive the surface of the earth. At smaller scales traversing a path along the surface will seem like a straight line, but on larger scales paths along the surface of the earth are clearly curved. A Hausdorff space is a space where two separate points have disjoint neighboorhoods (Barile and Weisstein, ). E.g. a surface embedded in that intersects with itself will have points that shares neighborhoods and is thus not Hausdorff.
In simple terms; the intuition behind differential geometry and manifolds is to describe an object locally at a certain point or a certain homogenuous region using derivatives. Although often not explicitly stated the kernel density estimate and its derivatives – see e.g. (Chacón et al., 2013) – can be interpreted in a differential geometry setting. We recall that Genovese et al., (Genovese et al., 2014), has shown that kernel density derivatives can be used to estimate a smoothed version of an underlying manifold sampled from data with noise.
Before we go into further discussion of the connection between the kernel density estimate of data sampled from a manifold and differential geometry, we introduce a few basic concepts.
First: throughout this discussion we are talking about submanifolds of (Lee, 1997). Given a manifold diffeomorphic to , at each point the tangent space, , is the Euclidean space of dimension which is tangent to at (Lee, 1997). See Figure 4 for a sketch of the related concepts. The term tangent to, can intuitively be interpreted as either the space of tangent vectors of all possible curves passing through or the space spanned by the partial derivatives of the parametrization of at (Lee, 1997). A disjoint union of all tangent spaces of is called the tangent bundle. We denote the coordinate transformation from the tangent space to the manifold as the exponential map, and the inverse transformation from the to the as the log map (Brun et al., 2005). Finally we note that vectors in can be expressed by a local basis of differentials . These are called the local coordinates at , (Lee, 1997). These local coordinates represents a Euclidean subspace of same dimension as .
As for the statiscal model, we assume the same model as in (Genovese et al., 2014) where the data points, , are sampled with noise from a distribution supported on . If we let be the distribution of points on and along and
be a Gaussian distribution with zero mean and
covariance – also in – that represents noisy samples that does not lie directly on the manifold, we get the following model^{7}^{7}7 denotes convolution.:The kernel density estimator is defined as follows:
(2) 
where is any symmetric and positive semidefinite kernel function. Note that we skip the normalizing constant, as it is simply a matter of scale and we are only interested in the direction of the gradient and the Hessian eigenvectors. We also note that in this work we restrict ourselves to use the Gaussian kernel . From the kernel density estimate we can calculate the gradient as follows(notation adapted from (Genovese et al., 2014)):
(3) 
And finally the Hessian matrix:
(4) 
where . From these equations we see that the gradient is simply the average of all vectors pointing out from weighted by the kernel function value of the norm of the vectors. We also note that depending on the kernel size the gradient will in practice be the average of the vectors pointing from to all points within the neighborhood covered by the particular kernel size. If a point and its neighbors within the kernel size distance lies on or close to the manifold , this clearly is a good candidate for a local tangent space estimate.
If we look at the expression for the Hessian matrix, we notice that it turns out to be proportional to the sample covariance matrix at
with the variance normalized wrt the kernel size
. Like the kernel density gradient, the weight of the points taken into account in calculating the covariance is governed by the kernel size, leading to a local sample covariance estimate at . We note that if a very large kernel size, in the order of , is used, the Hessian matrix will resemble the standard sample covariance matrix and eigendecomposition would yield standard PCA. Considering this we can interpret the Hessian matrix as fixing and aligning a local Gaussian distribution to a local subset of the data. For a onedimensional curve with multidimensional additive noise the local Gaussian will clearly have a strong correlation along the curve wich will be reflected in the eigenvectors of the local Hessian estimate. A simple illustration of this is shown in Figure 5. Considering the case of a principal surface we can continue our principal component analogy. Assuming the points have been sampled from some surface with some level of noise, the kernel density Hessian estimate at will in practice calculate a covariance matrix estimate based on the points within the range covered by the kernel centered at . Unless the noisy points that does not lie on completely dominate, we expect that most of the variance is concentrated along the manifold, and thus the largest eigenvectors according to eigenvalue will span the local tangent space .Since by definition the kernel density gradient, , lies in the span of Hessian eigenvectors if is on the dimensional ridge, the corresponding eigenvectors forms a natural basis for local coordinates in . Consequently the parallel Hessian eigenvectors, , can be calculated for each that lies on the density ridge, also for out of sample points, and will thus form an approximate tangent bundle on the ridge estimate of the manifold, .
3.1 Properties of density ridges
Throughout this section we consider a single connected submanifold of Euclidean space without boundary that satisifies the properties given in (Genovese et al., 2014).
Recall that a principal curve can be approximated by a onedimensional density ridge and a principal surface by a dimensional density ridge. Some properties are general, while some apply only to principal curves, or vice versa. We start by introducing the gradient flow, which we use in both cases later in this paper.
An integral curve of the positive gradient flow is a curve such that . For every point in the domain of there exists an integral curve that starts at . Given a critical point^{8}^{8}8A local mode in the case where the function is a probability density function. , , all points that have integral curves that converges to the critical point are said to lie in the basin of attraction of (AriasCastro et al., 2013). This property is the foundation of mode based clustering algorithms, e.g mean shift clustering (Comaniciu and Meer, 2002; Myhre et al., 2015).
The idea of a basin of attraction can be extended to hold for density ridges as well. Given a density ridge of dimension a local mode will still satisfy the criteria for being a point on the ridge. The differential equation – equation (1) on page 1 – for projecting points towards the density ridge always follows the gradient, so the density ridge(s) can be divided into basins of attraction based on the original points before projection. This naturally divides a manifold estimated by a density ridge into nonoverlapping subsets. Closely related: a chart of a smooth manifold is a diffeomorphism from a neighborhood on the manifold to , . The local density ridge unwrapping algorithm, presented in Section 4 can be considered an algorithm for learning the map , and we thus have a chart for each separate basin of attraction of the manifold. We note that in classical differential geometry it is common to consider smoothly overlapping charts (Tu, 2008), which is not the case in this work.
We end this section with a short summary of important
properties and observations for principal curves and surfaces separately.
Principal curves:

The gradient flow integral curves are geodesics, allowing for parallel transport, this will be introduced in Section 5.

The gradient flow separates the manifold into nonoverlapping regions  attraction basins, such that

Curves does not have intrinsic curvature, and thus principal curves can always be unfolded isometrically.

, . Thus, the gradient flow curve starting on points on the manifold is completely contained along the onedimensional manifold, .
Principal surfaces:

The gradient flow separates the manifold into nonoverlapping regions  attraction basins. Again .

The gradient flow curves are not necessarily geodesics. If the initial point is on the manifold the gradient flow integral curves will be restricted to the manifold. Thus, if , .

Curvature must be introduced, isometric unfolding can no longer be guaranteed for arbitrary surfaces.

. Thus the direction of the gradient determines the gradient flow along the manifold.
4 Proposed method: Principal curve unwrapping by tracing the ridge
In this section we present the first part of our suggested algorithm for unwrapping manifolds, which deals with data distributed on and along a onedimensional manifold. We start by recalling that the probability density function along a local principal curve/onedimensional density ridge will contain local maxima. For a point the gradient, , is by definition orthogonal to all except one of the eigenvectors of the local Hessian . Thus any gradient on the ridge which is not exactly zero will point along the ridge towards a local mode. This restricts the gradient flow, (AriasCastro et al., 2013), along the ridge to stay on the ridge, and thus we can formulate a differential equation that sends the points on the principal curve, along the curve/ridge towards the local mode. We note that AriasCastro et al. showed that a gradient ascent scheme for estimating the gradient flow lines converges uniformly to the true integral curve lines (AriasCastro et al., 2013). Then, by using either a gradient ascent scheme – like in mean shift clustering – or as in this work a differential equation solver, we can estimate the gradient flow lines along a ridge, and by calculating the piecewise Euclidean distances between the steps of the gradient flow we can estimate the distance from a point to a local mode along the ridge. Since this is the only possible path between the point on the ridge and the local mode it converges to, it is by construction a geodesic. These geodesic distances along the principal curve we will take as local coordinates in the basin of attraction of the local mode, , with the smallest eigenvector of as basis^{9}^{9}9The intuition is that the curvature as described by the Hessian eigenvalues should be lower along the ridge, than orthogonally off the ridge. Thus the choice of lowest eigenvector as basis.. We state the differential equation for following the gradient along the ridge as follows:
(5) 
where , such that .
Solving this equations gives coordinate lengths along the ridge and thus yields a complete local coordinate description for every point in the local basin of attraction of .
(6) 
In practice we solve this using a Runge Kutta fourth and fifth order pair, also known as the RungeKuttaFehlberg method(RKF5)
(Fehlberg, 1969), implemented in MATLAB. This is an adaptive stepwise solver, which yields benefits both in terms of speed and accuracy. For density ridge estimations we use the kernel density estimate from Equation (2), its gradient, Equation (3), and the Hessian matrix from Equation (4) as presented in the previous section. The ridge projection for each data point considered is independent from other projections, so processing each point in parallel is possible. In this work we have used the parallel processing toolbox in MATLAB to achieve a local speedup proportional to the number of cores on the computer.We have summarized the steps necessary to perform local principal curve unwrapping in Algorithm 1.
We conclude this section with an illustration of Algorithm 1. In Figure 6 we see a two dimensional sample from a Gaussian distribution superimposed with an underlying onedimensional nonlinear structure. The two orthogonal density ridges found using the RKF45 solver are shown in red and blue dotted lines.
In Figure 7 we see the stages in Algorithm 1 for a single data point from Figure 6. First the point is projected to each of the two orthogonal density ridges using the RKF45 algorithm, shown in red. Second, the point is projected along the ridge by following the gradient towards the local mode shown in green. We see that by calculating the piecewise Euclidean distances between the gradient ascent steps – the green points – we get approximate coordinate distance along the intrinsic nonlinear geometry.
5 Parallel transport along density ridges
The local density ridge unwrapping algorithm presented in the previous section is only valid for local regions of the manifold. This implies that the local unwrapping algorithm can result in several charts for a single connected manifold if there are multiple local maxima along the ridge. So given a manifold embedded in and a ridge approximator consisting of a set over nonoverlapping attraction bases, we can, by using Algorithm 1, obtain a local coordinate chart for each of the local attraction basins.
In the current setting where the underlying manifold describing the nonlinearity can be described by a onedimensional ridge, the trajectories along the manifold can trivially be considered geodesics^{10}^{10}10The shortest path between two points on a one dimensional curve must be along the curve itself.. This allows the use of parallel transport to transport vectors from one mode to another if they are connected by a principal curve/onedimensional ridge Lee (1997). In this work we will not go further into the differential geometric framework for parallel transport – see for example (Lee, 1997) – , but rather develop a discrete empirical algorithm for sewing together local charts stemming from Algorithm 1 by translation along geodesics.
We make the following assumptions:

If there is a onedimensional ridge between two local modes, it can by definition be considered a geodesic.

The local neighborhood around a density mode can be described by trajectories along local orthogonal density ridges.
To form a global representation of a manifold , we need to relate the coordinate charts of the manifold to each other.
To do this we select a reference mode along the principal curve and translate the other local coordinate systems connected to the same ridge towards the reference mode. This is a straight forward operation, but there are a few issues that needs to be considered. First, each local system needs a basis and consequently the bases along the curve needs to be checked in case they are not equally aligned to the ridge. Second, once a reference mode and basis is chosen, the translation of other charts towards that mode can give ambiguities in terms of sign, if two charts are translated towards the same mode from different directions, a positive and negative direction needs to be defined.
Given a manifold and its ridge estimator , the set of local modes, , along the ridge, the gradient and Hessian at each point in both and , we define the following: At a given a reference mode , we define the eigenvectors of the Hessian of the kernel density estimate at as the basis of the unwrapped coordinates. Using this we can check the sign of the determinant of the other bases along and change signs if the bases have a different orientations. This ensures a consistens orientation, unless the manifold is not orientable (e.g. a Möbius strip) (Spivak, 1965). Given a reference mode , we define the direction towards the closest local mode along the ridge as the positive direction. As the ridge connecting two local modes is onedimensional, tangent vectors by finite differences from mode to reference mode can be used to identify the relevant directions. Even though the ridge estimate and local modes are known, we do not know if two local modes are connected and the order of the points contained in a local attraction base. As preliminary solutions to this, we use Dijkstra’s algorithm for identifying the path along the ridge from mode to mode, and use that path to calculate finite difference tangent vectors (Dijkstra, 1959). To check if two modes are connected we use the flood fill algorithm (Heckbert, 1990). Both algorithms operate on a nearest neighbor graph which in this case consists of the every input point projected to the density ridge estimator as well as the set of local modes found by following the gradients along the ridge for all points. The flood fill algorithm simply checks the number of connected components in a graph. Thus if there are several components along a onedimensional density ridge, either the neighborhood graph must be expanded or the ridge is disconnected. Dijkstras algorithm calculates the shortest path along nodes in a sparse graph. In ridges of dimension higher than one, this would lead to poor estimates, but since all points along the onedimensional ridges lie on a curve this is not a problem. In practice, the Dijkstra algorithm ends up simply sorting the points along the density ridge. Thus we have all steps needed to translate a local chart along the connecting density ridge and obtain a global unwrapping. The density ridge translation algorithm is summarized in in Algorithm 2.
(7) 
We end this section with an example. In Figure 7(a) and Figure 7(b) we see a twodimensional uniform sample superimposed with a nonlinear structure, its density and onedimensional orthogonal density ridges. The color coding of Figure 7(a) represents the order of the data points in the horizontal direction.
Figure 9
shows the results of the density ridge translation algorithm. We see that the underlying uniform distribution is revealed. There are some artifacts at the boundary of the manifold, this is probably due to the smoothness of the kernel density estimate and the fact that the kernel density estimate is close to a mixture of two Gaussians. There will always be a tradeoff between bias in the principal curves and nonsmooth, but well fitting ridges. In other words, the smoother the curve, the larger the bias and the closer the density is to a mixture of a low number of Gaussians
^{11}^{11}11Recall that a kernel density estimate in the extreme case of very small or very large bandwidth is just a mixture of either Gaussians or a single Gaussian respectively..We also have to note that the origin of the data structure has been moved to . This is due to the arbitrary choice of reference mode. One could keep the original reference mode as origin of the unwrapped representation, but this would not affect the result other than a translation.
6 Expansion to dimensional ridges
In the case where the underlying manifold cannot be described by a principal curve, a density ridge of higher dimension, a principal surface, must be considered. This causes a problem with our previous approach: the gradient, , of a point on a dimensional ridge lies in the span of eigenvectors of such that following the gradient flow will not lead to unique coordinates on the manifold. So for example applying Equation (5) on a dimensional ridge would lead to a ridgeconstrained mean shift, i.e. gradient flow trajectories that lies completely on the ridge. Depending on how the data is distributed along the manifold as well as geometric properties such as curvature, a dimensional ridge cannot in general be decomposed into orthogonal onedimensional ridges.
Also, in general, the curvature of the manifold has to be considered in higher dimensions – we can no longer be sure that an isometric unfolding exists. Furthermore, comparing with the previous section, a path between two points is also not necessarily a geodesic even if the path lies completely on the manifold. Thus the parallel transport/translation analogy cannot be directly used.
Based on the aforementioned, we propose an alternative unwrapping scheme for the density ridge framework in the case of dimensional nonlinear structures. We recall that in the one dimensional case we unfold local charts of the density ridge, and if there are several charts, transport them along the ridge to get a global representation. In the dimensional case we cannot directly unfold the local charts by following the gradient flow, so we insted do a local linear approximation by projecting the data points in the local basin of attraction to the tangent space of the local mode of the chart, presented in Section 6.1. Once we have linear approximations of all local charts of the manifold we calculate approximate geodesics and use parallel transport to send all charts to a reference mode. Finally, the unwrapping is done by sending the charts back along the geodesic by parallel transport, while at the same time constraining the piecewise steps of the geodesic to lie in the tangent space of the reference mode, presented in Section 6.3. To approximate the geodesic we use an algorithm proposed by Dollár et al. (Dollár et al., 2007), presented in Section 6.2.
6.1 Local linear approximation
As calculating distances along gradient lines that cover the dimensional manifold will not give consistent orthogonal coordinates we will expand our approach and use local linear approximations. The approximations will consist of a local flattening centered at the local modes of the dimensional density ridge. By flattening at the local modes we get charts centered at the points of highest probability and the points within the chart are points where the gradient flow converges to the same mode. This ensures that the charts represents points that are close in density and that the charts are limited in extension since the gradient flow lines are restricted to locally smooth areas of the manifold. Following this, the flattening process consists of projecting the data points corresponding to a local mode to the space spanned by the Hessian eigenvectors that span the gradient (the local tangent space):
(8) 
where is a local coordinate for a point that has converged to mode #, . This is equivalent to an approximate inverse exponential map on the manifold at – the log map. In this manner we get a local linear approximation for each local mode in the data set.
Also, comparing with the previously noted similarities between the Hessian of the KDE estimate and the local sample covariance matrix, we can view this local flattening directly as local PCA where the neighborhood is determined by the coverage of the kernel bandwidth.
6.2 Approximating geodesics
To obtain a global coordinate representation on a connected manifold that has a set of approximated local coordinate charts we proposed parallel transport of tangent space coordinates towards a reference mode. In the onedimensional case a geodesic was directly available in the principal curve. In the dimensional case we need a numerical scheme to approximate the geodesic. In this paper we reframe an idea presented by Dollár et al. (Dollár et al., 2007), to approximate geodesics on the manifold. The idea is to minimize the distance between two points, while at the same time keeping the starting points and endpoints fixed and making sure that all points on the path lies approximately on the manifold. Formally, given a sequence of points between two points on the manifold and we can formulate the problem of finding a shortest path constrained to the manifold as follows:
(9)  
subject to 
To optimize (9) the path
is initialized using Dijkstra’s algorithm and further discretized with linear interpolation between the
given points. Then minimization is performed by alternating between gradient descent to shorten distance and density ridge projection, equation (1) to ensure that points stay on the manifold. In addition we use another idea from the work of Dollár et al. (Dollár et al., 2007) for fast out of sample projections. After a selection of points have been projected to the density ridge, by equation (1), the tangent space of the ridge estimator is at each point spanned by the parallel Hessian eigenvectors . To project a new outofsample point orthogonally to the ridge, , where is a point on the ridge, needs to be minimized. This can be solved by setting to the closest point of on the manifold and then performing gradient descent as follows:(10) 
6.3 Combining geodesics and local approximations
Once we have local approximations in place we can in principle directly unfold manifolds that are isometric. We recall that if a manifold is isometric it has no intrinsic curvature and can be covered by a global chart, in other words: the diffeomorphism preserves distances. Still, if we use the algorithms presented previously in this paper we cannot in general find a single global chart. Combining the two previous sections we have the following: an estimate of the underlying density ridge which is close to the true manifold, a set of local linear approximations of each local basin of attraction and the possibility of estimating geodesics between points in by Equation (9).
Using these tools we can construct an algorithm for stitching together the local approximations to obtain a global unwrapping. Below we present a short outline of the algorithm, see the appendix for a complete description of the algorithm:

For each mode, project the points in the local attraction basin to the space spanned by the parallel Hessian eigenvectors.

Among all modes, select a reference mode and calculate the piecewise geodesic distance from all other modes to the reference mode
We conclude this section with a short example. Again we turn to the swiss roll example, where a plane is deformed into a swiss roll shape embedded in . See Figure 23 for the data and unwrapped result.
7 Numerical experiments
We now show a selection of experiments to illustrate the presented frameworks. Both real and synthetic data is used. In the real data sets we preprocess the data with pca and reduce the dimension to two or three to enable visual evaluation.
We start by illustrating the density ridge projections using rkf45 and also how the estimates are influenced by the choice of kernel size. Then we illustrate density ridge unwrapping and translation on synthetic data sets and images of faces and the mnist handwritten digits, (LeCun et al., 1998)^{12}^{12}12http://www.cs.nyu.edu/ roweis/data.html.
In all experiments we assume that the dimension of the underlying manifold is known.
7.1 Density ridge estimation
We start by illustrating the density ridge estimation algorithm using the rkf45 scheme implemented in matlab. This is an adaptive solver for initial value problems that reduces the parameter choices to only the kernel size of the kernel density estimate. To emphasize the algorithm we focus on examples in three dimensions so they easily can be visualized.
We start with the ‘helix’ data set, available in the drtoolbox of van der maaten (van der Maaten et al., 2009). It is a onedimensional manifold embedded in , sampled uniformly with gaussian noise. We run the rkf45 scheme over a range of kernel sizes and select the one with the lowest mean square error. The result is shown in figure 11.
The next example shows a two dimensional plane bent into a halfcylinder shape. A uniform sample is drawn from the surface and zero mean gaussian noise is added, see figure 11(a). We show the result using different kernel sizes in figure 11(b), to visually confirm that the bias of the density ridge estimate increases as the kernel size increases. the largest kernel size gives the most smooth result, but the curvature of the halfcylinder is reduced. for the smaller kernel sizes we see that the ridge estimator is closer to the true manifold, but that many of the noisy points become local maxima and is not projected to the density ridge.
7.2 influence of kernel size
The choice of kernel bandwidth is very important in estimating density ridges, and can have great influence on the results.
To illustrate the influence of kernel size we sample uniformly from a sphere, , and add different levels of noise. In table 1 the mean squared error between the true manifold and the ridge for the different kernel bandwidths and levels of noise is shown.
\  .05  .1 

.1  0.0103  0.0973 
.25  0.0079  0.0909 
.5  0.0140  0.0969 
.7  0.0347  0.1267 
1  0.1256  0.1901 
2  0.2046  0.2738 
In this concrete example we see that a kernel size of approximately gives the density ridge that is closest to the underlying manifold. We also see that even in the low noise case, the mean squared error of the estimates deteriorates quickly due to bias in the estimations.
7.3 Manifold unwrapping
In this section we show results on unwrapping onedimensional manifolds with noisy embeddings in higher dimensions – , . We start with the full unimodal example from section 4. In figure 12(a) we see the data set, its two orthogonal density ridges and the contours of the density.
The same data set can be seen in figure 12(a). After applying the local density ridge unwrapping algorithm, we get the results shown in figure 12(b). We see that by calculating distances along the underlying structure we reveal the distribution along the intrinsic geometry. In this case a gaussian kernel was used with a manually chosen kernel bandwidth of . If we inspect closer we see that there is a slight bias in the first ridge, show in red, in figure 12(a). This bias can also be seen in the unwrapped results, as they show a very downscaled version of the original nonlinearity. This could be avoided by either choosing a smaller kernel size, but risking more variance in the ridge and thus less accurate unwrapped coordinates or by using a varibable kernel density estimator (Wand and Jones, 1994; Bas, 2011).
7.4 Density ridge translation
To illustrate the density ridge translation algorithm we use a spiral shaped data set with gaussian noise as shown in figure 14. The underlying manifold is parameterized as , with and . The parameter is indicated by the color coding from dark blue to light yellow. The two local orthogonal density ridges are shown in figure 14(a) and we note that there are two separate local maxima. This will give rise to two different local charts by algorithm 1. In figure 14(b) we see the results after applying the density ridge translation algorithm on the charts.
It is clear that the new unwrapped representation of the data follows the color coding directly, and the total horizontal variation from to fits nicely with a centered version – – of the original parameterization from to . In the top left of the unwrapped data structure we see some distortion with respect to the parameterization/color coding of the original data. This is most probably again due to bias in the density ridge estimation. The nonlinearity of the innermost part of the spiral is most likely on a smaller scale than the kernel bandwidth is able to capture, resulting in a density ridge that is straighter than it should be and thus the projections will be less accurate.
7.5 pca preprocessing and local density ridge unwrapping
In this section we present results on real data sets preprocessed by principal component analysis
(Jolliffe, 2002).7.5.1 Subset of mnist handwritten digits
In this experiment we use a subset of the mnist handwritten images. We preprocess the data by projecting the data to the first three principal components. We use images of the number one – – in this experiment, since it is very susceptible to geometric variations like rotations, translations and also more nonlinear deformations due to its simple shape.
The data projected to the top three principal components is shown in figure 15(a). We clearly see that the data follows a nonlinear submanifold and that the variation of the images varies along a low dimensional surface. To test our framework, we project the data to the twodimensional density ridge, and find the local modes by following the gradient.
If we use a large kernel bandwidth we introduce some bias in the density ridge estimate seen e.g. in the splitting of the curves near the boundary of the manifold, but at the same time we get two global principal curves that cover the entire projected data set, shown in figure 15(b) such that there is no need to translate charts together and we can directly get a global unwrapping. In figure 17 we see the results of applying the local density ridge unwrapping algorithm (algorithm 1).
In figure 17 we see the unwrapped mnist digits with a selection of random images of the actual digits on top. We see that the coordinates represents a clear structure in the data – a straight up version of the digit around the origin and different, but symmetric variations along the two axes. The horizontal axis represents the orientation of the digits, while the vertical axis represesents thickness of the digits. in section 7.9 we compare our results with other known manifold learning algorithms. as a final remark in this setting we note that the use of pca preprocessing is a linear projection such that the geometric properties of the data should not be influenced by the transformation. Ideally pca would remove all ambient space except such that the codimension^{13}^{13}13if is embedded in the codimension is . Is one.
7.5.2 Frey faces
In this section we test the local density ridge unwrapping on the frey faces data set. For visualization purposes we reduce the dimension of the data set down to two by pca – shown in figure 17(a). We clearly see an underlying curved structure.
The curved structure has seemingly large orthogonal variance so we project to both of the two available onedimensional density ridges. The result can be seen in figure 17(b). The first density ridge/principal curve, according to the ordering of the eigenvalues of the local hessian, is shown in red. It captures well the underlying nonlinearity. The second ridge, shown in orange, fits well with the orthogonal variation along the underlying nonlinearity. We also have to note the disconnected ridges in the bottom left part of the data set. This is probably due to a separate curvilinear structure in the data. We could alleviate this by choosing a larger kernel size to increase the smoothness of the ridge, but then there would be a risk of smoothing away the nonlinear curved structure of the data. We proceed with the chosen kernel size and note that this case cannot be described by a single onedimensional manifold.
In figure 19 we see the results of the local density ridge unwrapping algorithm. A random selection of 100 of the original input images are displayed on top of the unwrapped coordinates. We see that the main horizontal direction, exhibiting the largest variance, captures the leftright orientation of the faces. In figure 20 each row shows the largest negative, origin and largest positive coordinate of each of the two coordinates of the unwrapping respectively. We again see that the first ridge  top row  corresponds to the leftright orientation of the faces. The second ridge is harder to interpret, but we note that the orientation of the face is looking straight ahead in all bottom three images in figure 20, which fits well with the second coordinate being orthogonal to the horizontal direction represented by the first coordinate.
The disconnected ridges results in the region in the top right corner. The analysis and discussion of these cases are left for future research.
7.6 dimensional ridges
In this section we present some applications of our nonparametric dimensional framework. Inspired by dollár et al. (Dollár et al., 2007), we can perform several different operations once each point that lies on the manifold has been equipped with a tangent space  spanned by the hessian eigenvectors. Most important in this setting are calculating geodesics as in equation (9), and outofsample projections that are faster than projections by solving differential equations.
We include two examples: the first example shows out of sample projections by projecting so point to the closest local tangent space. The second example shows smooth geodesics found by the alternating least squares algorithm presented in section 6.2.
In figure 21 we see a synthetic hemisphere sampled uniformly with noise added.
In figure 22 we see the points projected to the closest tangent space of the density ridge by equation 10.
It is obvious that the out of sample projections are working as expected, but we also see that when samples are so noisy that they cannot be orthogonally projected to the manifold  in this case they are not directly above the hemisphere  they are projected to the tangent space of the boundary of the manifold, and the structure is thus lost. In this case the flat tangent spaces at the boundary gives the appearance of projecting to a cylinder below the hemisphere. Suggestions and solutions to this problem is left to future research.
7.7 Isometric unwrapping
In the case where the intrinsic manifold has zero gaussian curvature it can be unfolded isometrically by the algorithms from section 6. We start with the ‘swissroll’ example, a two dimensional surface of zero gaussian curvature distorted by a nonlinear function and noise. This is a popular dataset/example to illustrate manifold learning algorithms. It is useful because it represents a strong nonlinear structure, but at the same time it is a simple surface without gaussian curvature such that it can in fact be represented by a single coordinate chart. After projecting to the ridge and running the algorithms, we get the result of unwrapping shown in figure 23.
The color coding illustrates the local basins of attraction. In the unwrapped version  shown in the tangent space of one of the modes  we clearly see that the ordering of the local charts is preserved.
7.8 mnist autoencoder
We end this section with an example of a real data set where multiple dimensional charts emerge such that the isometric unwrapping algorithm, algorithms 3 and 4, has to be applied. Again we use the mnist one digits, but this time we reduce the dimension to three by using a deep autoencoder  as given in (van der Maaten et al., 2009).
In figure 24 we see both the data set and the data projected to the twodimensional density ridge. The underlying ridge fits the data well, and we see that there are several local charts involved, indicated by the color coding.
After running the isometric density ridge unwrapping scheme we get the results shown with a random selection of digits overlaid in figure 25. We again se a clear structure in the data with the ones rotating/changing orientation along the horizontal axis. The structure in the other mnistones example, where the ones changed from thicker to thinner along the vertical axis is not as strong in this example.
7.9 Comparison with other unwrapping algorithms
Finally we compare our algorithm with isomap (Tenenbaum et al., 2000), local linear embedding (Roweis and Saul, 2000), local tangent space alignment (Zhang and Zha, 2004) and maximum variance unfolding (Weinberger and Saul, 2006). We use a synthetic data set consisting of a twodimensional surface embedded in with additive/convolutional gaussian noise in addition to the mnistones used previously.
For the mnistones, we use our isometric unwrapping with a slightly smaller kernel size than the previous example  recall that the previous example a large kernel size was used to obtain onedimensional ridges that cover the entire manifold. This leads to several local modes, and thus several charts over the manifold such that algorithms 3 and 4 have to be used. The result is shown in figure 27. The other algorithms  isomap, lle, ltsa and mvu  all have a nearestneighbor parameter that has to be set so we run them with different values of and select the most visually meaningful results, shown in figure 26. We note that the ltsa had problems with finding the eigenvalue/eigenvectors such that the values had to be set much higher  .
Local tangent space alignment, figure 25(d) shows a similar structure to the density ridge unwrapped result that follows the orientation of the digit, but the structure is compressed to a curve and the structure orthogonal to the curve is lost. There is also no apparent unwrapping, the manifold is still curved. The same goes for the local linear embedding result, figure 25(c), no apparent unwrapping. Also, looking at the images overlaid over the data, there does not seem to be the same smooth consistent structure as in the density ridge unwrapped version. Isomap, figure 25(b) has unwrapped the manifold, but the structures seen in the overlaid images does not seem to follow specific patterns. Finally, maximum variance unfolding in figure 25(a) has also failed in unwrapping the noisy cylinder shape, we also note that the orientation of the numbers is not preserved properly.
Finally we do the same comparison on a synthetic data set with a high level of noise to indicate that our algorithm can handle noisy situations. The data set with and without noise and the density ridge unwrapped version projected to the tangent space of the reference mode is shown in figure 28. We see that even with a quite high level of noise the density ridge unwrapping gives meaningful results. In figure 29 we see the data processed to twodimensions by isomap, lle, mvu, ltsa, laplacian eigenmaps and density ridge unwrapping. The original parameterization is shown in red dots, and the unwrapped coordinates from the different algorithms are shown with the same color coding as in figure 28. All result were obtained with neighborhood parameter and the results as well as the parameterization have been normalized and centered.
Looking at the figures we see that isomap is the closest to ours, but it is not able to capture the manifold without retaining the noise. The other four algorithms, laplacian eigenmaps, local linear embedding, maximum variance unfolding and local tangent space aligment all fail to unfold the underlying manifold and the results are accordingly hard to comment on. We tried different parameters for all algorithms, but all failed to give reasonable results, except isomap and our algorithms, that is why we chose to show only the results of the default parameter .
8 Conclusion and future work
In this paper we have shown that density ridges of a kernel density estimate can be used to unwrap manifolds. We have presented a geometrically intuitive algorithm for unwrapping onedimensional manifolds embedded in euclidean space based on gradient flow. We have extended this algorithm to higher dimensional manifolds, replacing coordinates calculated via integral curves to local linear projections. promising results and illustrations have been shown on both synthetic and real data sets that match the geometric intuition of unfolding.
We conclude this paper with a short section on possible extensions and directions of further research.
8.1 Future research
The work presented in this paper contains suggestions on how to learn linear coordinate systems on nonlinear data distributions that can be represented by principal curves or surfaces (density ridges). The perhaps most obvious lines of further research are related to the properties of the underlying manifold and the estimate of the manifold surrogate which relies on the kernel density estimate of the distribution of points on the manifold.
Regarding the properties of the underlyng manifold theoretical results should be established related to upper bounds on the curvature, boundary effects and smoothness properties in higher dimension. The online appendix for the convergence proofs of the isomap algorithm could be applied here (Tenenbaum et al., 2000).
Furthermore, the properties of the underlying manifold directly influences the density ridge estimate. Especially the choice of kernel size and how it influences the relationship between curvature, bias and smoothness of the density ridge estimate should be established theoretically. Also the properties of the kernel density estimate is known to deteriorate when the dimension increases since the distance distribution of a point set converges to the mean distance in higher dimensions. An upper bound on dimensions where the gradient flow of a kernel density estimate holds could be established and used in our setting.
Recently chacon and duong, Chacón et al. (2013) suggested kernel size estimators that yield different sizes for kernel density its gradient and its hessian. These could be implemented in our setting, perhaps increasing accuracy in density ridge estimations.
Aknowledgements
This work was partially funded by the research council of norway over fripro grant no. 239844 on developing the next generation learning machines (rj).
Appendix A Approximate parallel transport
In this section we present the algorithm for approximate parallel transport in the dimensional setting. Let be the points on the ridge estimate that lies in the basin of attraction of mode and the local coordinates of , . Let be the vector containing the points of the approximate geodesic from to as found by (9). The approximate parallel transport is performed by translating the local coordinates along the finite difference tangent vectors of the approximate geodesic. At each step of the translation the points are projected/rotated to the local tangent space by . This is to ensure that the local coordinates stay in the tangent space all the way along the geodesic towards the target mode . The algorithm for approximate parallel transport is summarized in Algorithm 3.
(11) 
Appendix B Algorithm for unwrapping transported coordinates
Given a framework for approximate parallel transport, we can unwrap isometric manifolds. This is done by first transporting all local linear attraction basin approximations (charts) to a reference mode. Then each chart is iteratively translated back along the geodesic it came from whilst at the same time projecting each step down to the dimension of the tangent space along the same geodesic. This will give an unfolded version of the manifold where all charts lie in the dimension of the tangent space of the manifold centered on the reference mode.
Let be the vector in reversed order. Since lies approximately on the manifold the finite difference tangent vectors will also lie on the manifold. So by pushing the coordinates along the geodesic and at the same time projecting to the intrinsic dimension of the manifold, we get a direct unfolding. The complete algorithm is presented in Algorithm 4.
Consider standing on the north pole of the earth and taking a step in some direction along the surface of the earth. Following the algorithm, one would take a step in the chosen direction and project the vector pointing from the north pole to the destination of the step to the tangent space of the north pole. This would yield a twodimensional step in the tangent space of the north pole. Then the procedure is repeated from the next position; take a step and then project to the tangent space of the origin. The tangent space will still be twodimensional, but the orientation might have changed slightly, so that we need to compensate for the change of basis from tangent space to tangent space.
In this manner one will get a twodimensional representation of the path walked, e. g. walking from the north pole to the south pole along a great circle would yield a straight line in the unfolded coordinates. Note the similarity to stereographic projection coordinates. Of course this example is in effect not relevant since the earth is close to a sphere which cannot be unfolded isometrically, but the analogies to the algorithm still holds.
References
 AriasCastro et al. (2013) Ery AriasCastro, David Mason, and Bruno Pelletier. On the estimation of the gradient lines of a density and the consistency of the meanshift algorithm. Unpublished manuscript, 2013.
 Banfield and Raftery (1992) Jeffrey D Banfield and Adrian E Raftery. Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87(417):7–16, 1992.
 (3) Margherita Barile and Eric W. Weisstein. T2space. URL http://mathworld.wolfram.com/T2Space.html. Visited on 25/02/15.
 Bas (2011) Erhan Bas. Extracting structural information on manifolds from high dimensional data and connectivity analysis of curvilinear structures in 3D biomedical images. PhD thesis, NORTHEASTERN UNIVERSITY, 2011.
 Brun et al. (2005) Anders Brun, CarlFredrik Westin, Magnus Herberthson, and Hans Knutsson. Fast manifold learning based on riemannian normal coordinates. In Image Analysis, pages 920–929. Springer, 2005.
 Burges (2009) Christopher JC Burges. Dimension reduction: A guided tour. Machine Learning, 2(4):275–365, 2009.
 Chacón et al. (2013) José E Chacón, Tarn Duong, et al. Datadriven density derivative estimation, with applications to nonparametric clustering and bump hunting. Electronic Journal of Statistics, 7:499–532, 2013.

Chang and Ghosh (1998)
KyuYu Chang and Joydeep Ghosh.
Principal curve classifiera nonlinear approach to pattern classification.
In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 1, pages 695–700. IEEE, 1998.  Chen et al. (2014a) YenChi Chen, Christopher R Genovese, and Larry Wasserman. Asymptotic theory for density ridges. arXiv preprint arXiv:1406.5663, 2014a.
 Chen et al. (2014b) YenChi Chen, Christopher R Genovese, and Larry Wasserman. Generalized mode and ridge estimation. arXiv preprint arXiv:1406.1803, 2014b.
 Chen et al. (2014c) Yenchi Chen, Larry Wasserman, and Christopher R. Genovese. SOLUTION MANIFOLDS IN STATISTICS By YenChi Chen, Christopher R. Genovese and Larry Wasserman Carnegie Mellon University June 29, 2014. pages 1–33, 2014c.
 Comaniciu and Meer (2002) Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.
 Dijkstra (1959) Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.
 Dollár et al. (2007) Piotr Dollár, Vincent Rabaud, and Serge Belongie. Nonisometric manifold learning: Analysis and an algorithm. In Proceedings of the 24th international conference on Machine learning, pages 241–248. ACM, 2007.
 Einbeck et al. (2005) Jochen Einbeck, Gerhard Tutz, and Ludger Evers. Local principal curves. (1992):301–313, 2005.
 Fehlberg (1969) Erwin Fehlberg. Loworder classical RungeKutta formulas with stepsize control and their application to some heat transfer problems. 1969.

Freifeld et al. (2014)
Oren Freifeld, Soren Hauberg, and Michael J Black.
Model transport: Towards scalable transfer learning on manifolds.
In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1378–1385. IEEE, 2014.  Genovese et al. (2009) Christopher R Genovese, Marco PeronePacifico, Isabella Verdinelli, Larry Wasserman, et al. On the path density of a gradient field. The Annals of Statistics, 37(6A):3236–3271, 2009.
 Genovese et al. (2012) Christopher R Genovese, Marco PeronePacifico, Isabella Verdinelli, and Larry Wasserman. The geometry of nonparametric filament estimation. Journal of the American Statistical Association, 107(498):788–799, 2012.
 Genovese et al. (2014) Christopher R. Genovese, Marco PeronePacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric ridge estimation. The Annals of Statistics, 42(4):1511–1545, August 2014. ISSN 00905364. doi: 10.1214/14AOS1218. URL http://projecteuclid.org/euclid.aos/1407420007.
 Gerber (2013) Samuel Gerber. RegularizationFree Principal Curve Estimation. 14:1285–1302, 2013.
 Gerber and Whitaker (2013) Samuel Gerber and Ross Whitaker. Regularizationfree principal curve estimation. The Journal of Machine Learning Research, 14(1):1285–1302, 2013.
 Hastie and Stuetzle (1989) Trevor Hastie and Werner Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
 Heckbert (1990) Paul S Heckbert. A seed fill algorithm. In Graphics gems, pages 275–277. Academic Press Professional, Inc., 1990.
 Hein and Maier (2006) Matthias Hein and Markus Maier. Manifold denoising. In Advances in neural information processing systems, pages 561–568, 2006.
 Jolliffe (2002) Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
 Kégl et al. (2000) Balázs Kégl, Adam Krzyzak, Tamás Linder, and Kenneth Zeger. Learning and design of principal curves. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(3):281–297, 2000.
 Krzy (2002) Adam Krzy. Piecewise linear skeletonization using principal curves ∗. 7, 2002.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee (1997) John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer, 1997.
 Lin and Zha (2008) Tong Lin and Hongbin Zha. Riemannian manifold learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(5):796–809, 2008.

Myhre and Jenssen (2012)
Jonas Nordhaug Myhre and Robert Jenssen.
Mixture weight influence on kernel entropy component analysis and semisupervised learning using the lasso.
In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012. 
Myhre et al. (2015)
Jonas Nordhaug Myhre, Karl Øyvind Mikalsen, Sigurd Løkse, and Robert
Jenssen.
Consensus clustering using knn mode seeking.
In Image Analysis, pages 175–186. Springer, 2015.  Ozertem and Erdogmus (2011) Umut Ozertem and Deniz Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12(4):1249–1286, 2011.
 Pitelis et al. (2013) Nikolaos Pitelis, Chris Russell, and Lourdes Agapito. Learning a manifold as an atlas. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1642–1649. IEEE, 2013.
 Pulkkinen (2015) Seppo Pulkkinen. Ridgebased method for finding curvilinear structures from noisy data. Computational Statistics & Data Analysis, 82:89–109, 2015.
 Pulkkinen et al. (2014) Seppo Pulkkinen, Marko M Mäkelä, and Napsu Karmitsa. A generative model and a generalized trust region newton method for noise reduction. Computational Optimization and Applications, 57(1):129–165, 2014.
 Rifai et al. (2011) Salah Rifai, Xavier Muller, Xavier Glorot, Grégoire Mesnil, Yoshua Bengio, and Pascal Vincent. Learning invariant features through local space contraction. arXiv preprint arXiv:1104.4153, 2011.
 Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 Shaker et al. (2014) Matineh Shaker, Jonas N Myhre, M Devrim Kaba, and Deniz Erdogmus. Invertible nonlinear cluster unwrapping. In Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on, pages 1–6. IEEE, 2014.
 Sorzano et al. (2014) Carlos Oscar Sánchez Sorzano, J Vargas, and A Pascual Montano. A survey of dimensionality reduction techniques. arXiv preprint arXiv:1403.2877, 2014.
 Spivak (1965) Michael Spivak. Calculus on manifolds, volume 1. WA Benjamin New York, 1965.
 Stanford and Raftery (2000) Derek C Stanford and Adrian E Raftery. Finding curvilinear features in spatial point patterns: principal curve clustering with noise. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(6):601–609, 2000.
 Sun et al. (2012) Ke Sun, Eric Bruno, and Stéphane MarchandMaillet. Stochastic unfolding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.
 Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 Tibshirani (1992) Robert Tibshirani. Principal curves revisited. Statistics and Computing, 2(4):183–190, 1992.
 Tu (2008) Loring W Tu. An introduction to manifolds, volume 200. Springer, 2008.
 van der Maaten et al. (2009) Laurens JP van der Maaten, Eric O Postma, and H Jaap van den Herik. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10(141):66–71, 2009.
 Vincent and Bengio (2002) Pascal Vincent and Yoshua Bengio. Manifold parzen windows. In Advances in neural information processing systems, pages 825–832, 2002.
 Wand and Jones (1994) Matt P Wand and M Chris Jones. Kernel smoothing. Crc Press, 1994.
 Weinberger and Saul (2006) Kilian Q Weinberger and Lawrence K Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, volume 6, pages 1683–1686, 2006.
 Zhang and Zha (2004) Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. Journal of Shanghai University (English Edition), 8(4):406–424, 2004.
Comments
There are no comments yet.