# Estimation via length-constrained generalized empirical principal curves under small noise

In this paper, we propose a method to build a sequence of generalized empirical principal curves, with selected length, so that, in Hausdorff distance, the images of the estimating principal curves converge in probability to the image of $g$.

## 1 Introduction

### 1.1 Preliminary picture of the estimation result

For $n\ge1$, we observe random vectors $X_i^n\in\mathbb{R}^d$, $i=1,\dots,n$, given by

$$X_i^n=g(U_i^n)+\varepsilon_i^n,\quad i=1,\dots,n, \tag{1.1}$$

where the unknown function $g:[0,1]\to\mathbb{R}^d$ is continuous. Moreover, $g$ is assumed to have finite length equal to the 1-dimensional Hausdorff measure of its image, and to have constant speed. Here, the random variables $U_i^n$, $i=1,\dots,n$, taking their values in $[0,1]$, are independent, and their distributions belong to a class of distributions with full support, which includes for instance the uniform distribution as a particular case.

We study an asymptotic context, where the noise tends in probability to 0 (in a sense that will be specified below) when the number of observations tends to infinity.

The main result of this paper is the construction, relying on the principal curve notion, of an estimator which converges to the unknown curve in Hausdorff distance, in the sense that the Hausdorff distance between the image of the estimator and the image of $g$ converges in probability to 0.
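To fix ideas, the observation model (1.1) can be simulated as follows. This is only a sketch under assumptions that are not part of the paper: the curve `g` is a hypothetical half circle in the plane (injective, with constant speed $\pi$), the sampling distribution is uniform, the noise is Gaussian, and the rate $1/\sqrt{n}$ for the noise level is an arbitrary choice illustrating the small-noise regime.

```python
import math
import random


def g(t: float) -> tuple[float, float]:
    """Hypothetical curve: upper half of the unit circle, traversed at constant speed pi."""
    return (math.cos(math.pi * t), math.sin(math.pi * t))


def sample_model(n: int, sigma: float, seed: int = 0) -> list[tuple[float, float]]:
    """Draw X_i = g(U_i) + eps_i, with U_i uniform on [0, 1] and Gaussian noise (assumed)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u = rng.random()
        gx, gy = g(u)
        points.append((gx + rng.gauss(0.0, sigma), gy + rng.gauss(0.0, sigma)))
    return points


if __name__ == "__main__":
    # Small-noise regime: the noise level shrinks as n grows (illustrative rate).
    for n in (10, 100, 1000):
        sample = sample_model(n, sigma=1.0 / math.sqrt(n))
        print(n, sample[0])
```

With `sigma=0.0` the sample lies exactly on the image of `g`, which corresponds to the noiseless situation.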

### 1.2 Related work

The problem of estimating the image of $g$ may be cast into the general context of filament or manifold estimation from observations sampled on or near the unknown shape.

The literature mainly focuses on shapes with a reach bounded away from zero. The reach, which characterizes the regularity of the shape, is the maximal radius of a ball rolling on it (see Federer (1969)). In Genovese et al. (2012a), an additive noise model of the form (1.1) is studied. The curve is parameterized by arc-length, with a normalized parameter interval. The authors assume that the sampling variables have a common density with respect to the Lebesgue measure, bounded and bounded away from zero. The noise has support in a ball and admits a bounded density with respect to the Lebesgue measure, which is continuous, nondecreasing and symmetric, with a regularity condition on the boundary of the support. For an open curve (with endpoints), an additional condition is required. In the plane, these assumptions make it possible to estimate the support of the distribution of the observations, and then the boundary of this set, in order to find its medial axis, which is the closure of the set of points in the support that have at least two closest points in the boundary. In the same article, the authors also consider clutter noise, corresponding to the situation where one observes points sampled from a mixture of a uniform density over some compact set and the density of points on the shape. Another additive model is investigated in Genovese et al. (2012b), for the estimation of manifolds without boundary, with dimension lower than the dimension of the ambient space, contained in a compact set. The model may be written

$$X_i=G_i+\varepsilon_i,\quad i=1,\dots,n, \tag{1.2}$$

where the random vectors $G_i$ are drawn uniformly on the shape, and the noise $\varepsilon_i$ is drawn uniformly on the normal to the manifold, within a bounded distance. The article Genovese et al. (2012c) is also dedicated to manifold estimation, under a reach condition, first in a noiseless model, where the observations are exactly sampled on the manifold, according to some density with respect to the uniform distribution on the manifold, and then in the presence of clutter noise. An additive noise model, with known Gaussian noise, is examined as well. This latter case is related to density deconvolution. Estimating manifolds without boundary, with low dimension and a lower bound on the reach, is also the purpose of Aamari and Levrard (2018, 2019). The points sampled on the manifold have a common density with respect to the Hausdorff measure of the manifold, which is bounded and bounded away from zero. In Aamari and Levrard (2018), estimation relies on Tangential Delaunay Complexes. It is performed in the noiseless case, with bounded additive noise, and under clutter noise. Aamari and Levrard (2019) deal with compact manifolds belonging to particular regularity classes. The authors examine the noiseless situation, as well as centered bounded noise perpendicular to the manifold. Estimators based on local polynomials are proposed.

To sum up, all these models involve strong conditions on the noise, which is either bounded, or of clutter type. Such assumptions allow the authors to derive rates of convergence. Here, we investigate a different situation, with a weak assumption on the noise. In particular, the noise does not need to be bounded. Regarding the regularity of the curve $g$, which has constant speed, there is no reach assumption, and $g$ is not required to be injective. Although rates of convergence cannot be expected here, this weak framework is worth studying, since it is not obvious at first sight that it is even possible to build a convergent estimator without knowledge of either the length or the noise level.

The estimation strategy relies on generalized empirical principal curves.

### 1.3 Extension of the notion of length-constrained principal curve

The notion of principal curve with length constraint has been proposed by Kégl et al. (2000). According to this definition, if $X$ denotes a random vector with finite second moment, a principal curve is a continuous map $f:[0,1]\to\mathbb{R}^d$ minimizing, under a length constraint, the quantity

$$\mathbb{E}\big[d(X,\operatorname{Im} f)^2\big]=\mathbb{E}\Big[\min_{t\in[0,1]}|X-f(t)|^2\Big], \tag{1.3}$$

where $|\cdot|$ denotes the Euclidean norm and $d(\cdot,A)$ stands for the Euclidean distance to a set $A$. This optimization problem may also be seen as a version of the “average distance problem” studied in the calculus of variations community (see, e.g., Buttazzo et al. (2002); Buttazzo and Stepanov (2003)). Originally, a principal curve was defined by Hastie and Stuetzle (1989) as a self-consistent curve, that is, a curve satisfying $f(t)=\mathbb{E}[X\mid t_f(X)=t]$, with $t_f$ given by $t_f(x)=\max\operatorname{argmin}_{t}|x-f(t)|$. In addition to self-consistency, smoothness conditions were required: the principal curve has to be of class $C^\infty$, it does not intersect itself, and it has finite length inside any ball in $\mathbb{R}^d$. Tibshirani (1992) revisited the problem as a mixture model, which forces the curve $g$ in models of the form (1.1) to be a principal curve. The point of view of Kégl et al. (2000), where no smoothness assumption is made, was motivated in particular by the fact that the existence of principal curves defined in terms of self-consistency was only proved for a few particular examples (see Duchamp and Stuetzle (1996a, b)). Note that principal curves introduced by Kégl et al. (2000) include polygonal lines.

As stated in the next lemma, shown in Section A.1, existence of optimal curves is still guaranteed when replacing the squared Euclidean distance in the definition (1.3) by more general distortion measures.

###### Lemma 1.1.

Let be a lower semi-continuous, strictly increasing function, continuous at 0, and such that . Let denote a random vector such that . Then, for any finite length , there exists a curve with length minimizing over all curves with length at most the criterion

$$\Delta(f)=\mathbb{E}\big[V\big(d(X,\operatorname{Im} f)\big)\big].$$

The motivation for introducing this generalized notion of principal curves is that this allows for greater flexibility in the way we measure distances. This framework encloses for instance as particular cases the power functions , . An appropriate choice of may enhance robustness. A typical example in this regard is the function defined by .

In a statistical context, one has at hand independent observations $X_1,\dots,X_n$, and an empirical principal curve is defined as a minimizer, under a length constraint, of the criterion

$$\frac{1}{n}\sum_{i=1}^n d(X_i,\operatorname{Im} f)^2.$$

Similarly, a generalized empirical principal curve may be obtained by minimizing

$$\frac{1}{n}\sum_{i=1}^n V\big(d(X_i,\operatorname{Im} f)\big).$$

Observe that, in this case, existence of a minimizer is more straightforward since the empirical measure is compactly supported.
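As an illustration, the generalized empirical criterion can be evaluated for a polygonal curve in the plane. The sketch below is not taken from the paper: the helper names (`dist_to_segment`, `dist_to_polyline`, `empirical_criterion`), the restriction to dimension 2 and the representation of a curve by its list of vertices are assumptions made for the example. The default choice `V(r) = r**2` corresponds to the squared Euclidean distance.

```python
import math
from typing import Callable, Sequence

Point = tuple[float, float]


def dist_to_segment(x: Point, a: Point, b: Point) -> float:
    """Euclidean distance from the point x to the segment [a, b]."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg2 = dx * dx + dy * dy
    if seg2 == 0.0:  # degenerate segment reduced to a single point
        return math.dist(x, a)
    # Orthogonal projection parameter, clipped to stay on the segment.
    s = ((x[0] - a[0]) * dx + (x[1] - a[1]) * dy) / seg2
    s = min(1.0, max(0.0, s))
    return math.dist(x, (a[0] + s * dx, a[1] + s * dy))


def dist_to_polyline(x: Point, vertices: Sequence[Point]) -> float:
    """Distance from x to the image of the polygonal curve (at least two vertices)."""
    return min(dist_to_segment(x, a, b) for a, b in zip(vertices, vertices[1:]))


def empirical_criterion(
    data: Sequence[Point],
    vertices: Sequence[Point],
    V: Callable[[float], float] = lambda r: r * r,
) -> float:
    """Average of V(d(X_i, Im f)) over the sample, for a polygonal curve f."""
    return sum(V(dist_to_polyline(x, vertices)) for x in data) / len(data)
```

Passing another function `V`, for instance `lambda r: r`, changes the way distances are penalized without changing the rest of the computation.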

### 1.4 Organization of the paper

The manuscript is organized as follows. In Section 2, we set up notation and introduce the model more formally. In Section 3, we state and prove the main result: we build a sequence of generalized empirical principal curves converging to the curve to be estimated in Hausdorff distance. The proof is structured in two subsections. The first one gathers results around the Cauchy-Crofton formula, which allows us to show a useful fact about the considered class of sampling distributions on $[0,1]$. The proof of the existence Lemma 1.1, as well as a technical measurability result, are collected in Appendix A.

## 2 Definitions and notation

### 2.1 Notation

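We consider the space $\mathbb{R}^d$, equipped with the standard Euclidean norm $|\cdot|$, associated to the inner product $\langle\cdot,\cdot\rangle$. Here, $\mathcal{B}(E)$ denotes the Borel sigma-algebra of a space $E$.

Let $\mathcal{H}^1$ denote the 1-dimensional Hausdorff measure in $\mathbb{R}^d$.

In the sequel, for a compact set $A\subset\mathbb{R}^d$, $\operatorname{diam}(A)$ stands for the diameter of $A$ and $d(x,A)$ for the distance from the point $x$ to the set $A$, that is

$$\operatorname{diam}(A)=\max_{x,y\in A}|x-y|,\qquad d(x,A)=\min_{y\in A}|x-y|.$$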

We denote by $d_H(A,B)$ the Hausdorff distance between two sets $A$ and $B$, given by

$$d_H(A,B)=\max\Big(\sup_{a\in A}d(a,B),\ \sup_{b\in B}d(b,A)\Big).$$

Let $\lambda$ stand for the Lebesgue measure and $\delta_x$ for the Dirac measure at $x$.
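For finite point sets, for instance discretizations of two curve images, the Hausdorff distance introduced above reduces to a maximum of minima. The following sketch illustrates this; the function name `hausdorff_distance` and the representation of sets as lists of coordinate tuples are assumptions made for the example, and the quadratic-time double loop is chosen for readability rather than efficiency.

```python
import math
from typing import Sequence

Point = tuple[float, ...]


def hausdorff_distance(A: Sequence[Point], B: Sequence[Point]) -> float:
    """Hausdorff distance between two finite, non-empty point sets."""

    def directed(P: Sequence[Point], Q: Sequence[Point]) -> float:
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(math.dist(p, q) for q in Q) for p in P)

    return max(directed(A, B), directed(B, A))
```

Note that the two directed terms may differ: a set contained in another one has a zero directed distance to it, while the converse term can be large.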

Throughout, an interval will denote an open interval of $[0,1]$ equipped with the induced topology.

Denote by $D$ a metric associated to weak convergence. For a probability measure $\mu$ and a closed set of probability measures $\mathcal{M}$, let $D(\mu,\mathcal{M})=\inf_{\nu\in\mathcal{M}}D(\mu,\nu)$.

For two probability measures $\mu$ and $\mu'$, we define the bounded Lipschitz metric between $\mu$ and $\mu'$ by

$$\|\mu-\mu'\|_{BL}=\sup\Big\{|\mu(h)-\mu'(h)| : \|h\|_\infty\le1,\ \sup_{x\ne y}\frac{|h(x)-h(y)|}{|x-y|}\le1\Big\}.$$

A continuous function from $[0,1]$ to $\mathbb{R}^d$ will be called a curve. If a curve $f$ is rectifiable, its length will be denoted by $\mathcal{L}(f)$. Finally, we will denote by $\mathcal{C}([0,1],\mathbb{R}^d)$ the metric space of continuous functions from $[0,1]$ to $\mathbb{R}^d$, equipped with the topology of uniform convergence.

### 2.2 Description of the model

Let $g:[0,1]\to\mathbb{R}^d$ be a curve with finite length and constant speed, such that its length equals the 1-dimensional Hausdorff measure of its image.

Given a constant $c\in(0,1]$, we define $\mathcal{M}_c$ as the closed family of probability distributions $\mu$ on $[0,1]$ satisfying $\mu\ge c\lambda$ on $\mathcal{B}([0,1])$.
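For a distribution with a density $h$ with respect to the Lebesgue measure on $[0,1]$, the condition $\mu\ge c\lambda$ amounts to $h\ge c$ almost everywhere. The sketch below checks this for a density that is constant on equal-width bins; the histogram representation and the function names `largest_admissible_c` and `belongs_to_class` are assumptions made for the example.

```python
from typing import Sequence


def largest_admissible_c(bin_probs: Sequence[float]) -> float:
    """Largest c such that mu >= c * lambda, for a density constant on k equal bins of [0, 1].

    bin_probs[j] is the probability of the j-th bin, so the density on that bin
    equals bin_probs[j] / (1 / k) = k * bin_probs[j].
    """
    if abs(sum(bin_probs) - 1.0) > 1e-9:
        raise ValueError("bin probabilities must sum to 1")
    k = len(bin_probs)
    return min(k * p for p in bin_probs)


def belongs_to_class(bin_probs: Sequence[float], c: float) -> bool:
    """True when the piecewise-constant distribution satisfies mu >= c * lambda."""
    return largest_admissible_c(bin_probs) >= c
```

The uniform distribution gives the value 1, and any bin with zero probability gives the value 0, in which case the distribution does not have full support.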

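For $n\ge1$, we observe a triangular array of random vectors $X_i^n$, $i=1,\dots,n$, given by the model

$$X_i^n=g(U_i^n)+\varepsilon_i^n,\quad i=1,\dots,n, \tag{2.1}$$

where the $U_i^n$, $i=1,\dots,n$, are independent and, for every $i$, the distribution of $U_i^n$ belongs to $\mathcal{M}_c$.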

Let $V:[0,\infty)\to[0,\infty)$ be a lower semi-continuous, strictly increasing function, continuous at 0, and such that $V(0)=0$. Moreover, we assume that $V$ satisfies the following property: there exists a constant $C>0$ such that, for every $x,y\ge0$,

$$V(x+y)\le C\big(V(x)+V(y)\big).$$
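For instance, power functions satisfy this property. The short derivation below is a standard argument added for illustration, where the exponent $p>0$ is a hypothetical parameter not fixed by the text. For $V(x)=x^p$ and $x,y\ge0$,

$$
\begin{aligned}
p\ge1:&\quad (x+y)^p=2^p\Big(\frac{x+y}{2}\Big)^p\le 2^p\,\frac{x^p+y^p}{2}=2^{p-1}\big(x^p+y^p\big) &&\text{(convexity of } x\mapsto x^p\text{)},\\
0<p\le1:&\quad (x+y)^p\le x^p+y^p &&\text{(subadditivity of } x\mapsto x^p\text{)},
\end{aligned}
$$

so that the constant $C=\max(1,2^{p-1})$ is admissible.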

For a curve $f$, we define

$$\Delta_n(f)=\frac{1}{n}\sum_{i=1}^n V\big(d(X_i^n,\operatorname{Im} f)\big).$$

###### Remark 1.

If we set $V(x)=x^2$, we recover the usual principal curve definition by Kégl et al. (2000).

We also define a function $T$, by setting

$$T(f,x)=\max\operatorname*{argmin}_{t\in[0,1]}|x-f(t)|.$$
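The projection index $T(f,x)$ can be computed explicitly when $f$ is a polygonal curve. The sketch below assumes a planar polygonal curve given by its vertices and parameterized on $[0,1]$ with constant speed; the function name `projection_index` and the numerical tolerance `tol` used to detect ties are assumptions made for the example. Among all parameters achieving the minimal distance, the largest one is returned.

```python
import math
from typing import Sequence

Point = tuple[float, float]


def projection_index(x: Point, vertices: Sequence[Point], tol: float = 1e-12) -> float:
    """Largest t in [0, 1] minimizing |x - f(t)|, for a constant-speed polygonal curve f."""
    seg_lens = [math.dist(a, b) for a, b in zip(vertices, vertices[1:])]
    total = sum(seg_lens)
    if total == 0.0:
        raise ValueError("the polygonal curve must have positive length")
    best_dist, best_t = math.inf, 0.0
    travelled = 0.0  # length of the curve before the current segment
    for a, b, ell in zip(vertices, vertices[1:], seg_lens):
        if ell > 0.0:
            # Position of the orthogonal projection on the segment, clipped to [0, 1].
            s = ((x[0] - a[0]) * (b[0] - a[0]) + (x[1] - a[1]) * (b[1] - a[1])) / (ell * ell)
            s = min(1.0, max(0.0, s))
        else:
            s = 0.0
        proj = (a[0] + s * (b[0] - a[0]), a[1] + s * (b[1] - a[1]))
        dist = math.dist(x, proj)
        t = (travelled + s * ell) / total  # constant-speed parameter of the projection
        if dist < best_dist - tol:
            best_dist, best_t = dist, t
        elif abs(dist - best_dist) <= tol and t > best_t:
            best_t = t  # tie between segments: keep the largest parameter
        travelled += ell
    return best_t
```

For a U-shaped curve and a point at equal distance from its three sides, the returned parameter lies on the last side, in agreement with the maximum in the definition.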

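For every $L\ge0$, let

$$G_n(L)=\min_{\mathcal{L}(f)\le L}\Delta_n(f),$$

and let $\hat f_{n,L}$ denote an empirically optimal curve with length at most $L$, that is, a random variable taking its values in the space of continuous functions from $[0,1]$ to $\mathbb{R}^d$ such that

$$\Delta_n(\hat f_{n,L})=G_n(L).$$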

Moreover, we choose -Lipschitz. We set .

## 3 Main result

We consider the estimation of the curve $g$ in Model (2.1) using a sequence of generalized empirical principal curves, that is, a sequence of curves which are optimal with respect to the criterion $\Delta_n$.

###### Theorem 3.1.

Let be a curve, such that and a.e.. Assume that . We consider Model (2.1), with tending to 0 in probability as tends to infinity. Let be defined by

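$$\hat L_n\in\operatorname*{argmin}_{L\in a_n\mathbb{N}\cap[0,\Lambda_n\wedge\Lambda]}\bigg[L^2\,D\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{T(\hat f_{n,L},X_i^n)},\mathcal{M}_c\bigg)+\Delta_n(\hat f_{n,L})\bigg],$$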

where for every and as . Then, converges in probability to 0 as tends to infinity.

First, let us discuss the assumptions. The requirement of an upper bound $\Lambda$ on the length is technical. It allows us, in the proof, to consider limit points of the constructed sequence of empirical principal curves. From an applied point of view, this is not a limitation of the procedure. Indeed, in practice, we will always consider a finite grid for the length. Moreover, with a fixed number of observations, the minimal length needed to join all points is a finite upper bound for the length. The condition $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$ ensures that the image of $g$ is parameterized with minimal possible length. Indeed, there exist infinitely many parameterizations, with infinitely many possible lengths. In words, generically, a portion of the image of $g$ cannot be traveled several times. The case where $g$ is injective is a particular case. Nevertheless, here, an image with loops is allowed. We also require $|g'(t)|=\mathcal{L}(g)$ a.e., which means that the image of $g$ is parameterized with constant speed $\mathcal{L}(g)$. These assumptions about the parametrization allow us to show a key relation between the distribution class $\mathcal{M}_c$ and its image by $g$ (see Lemma 4.4 below), the proof of which relies on the Cauchy-Crofton formula for the length of a rectifiable curve.

Observe that the main strength of the result is that it provides a convergent estimator in a very general framework. Neither the length nor the noise level, which converges to 0 in a very weak sense, is known. Intuitively, considering a practical situation with a fixed number of observations, the same data cloud could arise from several different generative curves, more or less long, in a model with more or less noise. This illustrates the benefit of an estimator construction which does not require the knowledge of either of the two parameters. Apart from the upper bound $\Lambda$, which does not really need calibration in practice, as already mentioned, the procedure only depends on a single parameter, namely the constant $c$ characterizing the class of possible sampling distributions $\mathcal{M}_c$.

It should be noticed that the theorem does not guarantee that the procedure allows to recover the true underlying length. Nevertheless, the proof below shows that the selected length cannot be too short: for all one has .

If $g$ is closed ($g(0)=g(1)$), then Theorem 3.1 still holds when $\hat f_{n,L}$ is chosen as a closed empirically optimal curve with length at most $L$.

As mentioned in the Introduction, the proof of Theorem 3.1 is split into two parts. First, we state and prove the Cauchy-Crofton formula, together with a related result, and we use them to establish an equivalence linking $\mathcal{M}_c$ and its image by $g$ (Section 4). The rest of the proof of the theorem, divided into several lemmas, is presented thereafter (Section 5).

## 4 Cauchy-Crofton formula and relation linking $\mathcal{M}_c$ to its image

In the sequel, we will make use of the Cauchy-Crofton formula (Cauchy (1850); Crofton (1868)) for the length of a rectifiable curve in $\mathbb{R}^d$ (see, e.g., Ayari and Dubuc (1997)). We recall the formula in the next lemma, and give a proof for the sake of completeness.

Let $S^{d-1}$ denote the unit sphere of $\mathbb{R}^d$. For $\theta\in S^{d-1}$ and $r\ge0$, let

$$D_{\theta,r}=\{z\in\mathbb{R}^d\mid\langle\theta,z\rangle=r\}.$$

###### Lemma 4.1 (Cauchy-Crofton formula).

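The length of a rectifiable curve $f$ is given by

$$\mathcal{L}(f)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(\{t\in[0,1],\ f(t)\in D_{\theta,r}\}\big)\,dr\,d\theta,$$

where $c_d$ is a constant depending on the dimension $d$.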

###### Remark 2.

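This result may also be written in the following equivalent form:

$$\mathcal{L}(f)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{y\in\operatorname{Im} f\cap D_{\theta,r}}\operatorname{Card}\big(f^{-1}(\{y\})\big)\,dr\,d\theta.$$

###### Proof.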

For , consider the polygonal line defined by the segments , . We define and respectively by and . There exist and , , such that . Then, the variation of is Hence,

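$$\int_{S^{d-1}}V(f_{p,\theta})\,d\theta=\sum_{i=1}^{p-1}\rho_i\int_{S^{d-1}}|\langle\theta,\theta_i\rangle|\,d\theta:=c_d\sum_{i=1}^{p-1}\rho_i=c_d\,\mathcal{L}(f_p).$$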

We have and (Alexandrov and Reshetnyak, 1989, Corollary of Theorem 2.1.2). By the Cauchy-Schwarz inequality, , and by definition of the length, . Thanks to Lebesgue’s dominated convergence theorem,

$$\lim_{p\to+\infty}\int_{S^{d-1}}V(f_{p,\theta})\,d\theta=\int_{S^{d-1}}V(f_\theta)\,d\theta.$$

We deduce that

$$\mathcal{L}(f)=\frac{1}{c_d}\int_{S^{d-1}}V(f_\theta)\,d\theta.$$

Besides, according to Banach’s formula (see Banach (1925)), we have

$$V(f_\theta)=\int_0^\infty\operatorname{Card}\big(\{t\in[0,1],\ \langle\theta,f(t)\rangle=r\}\big)\,dr.$$

Consequently, we get the Cauchy-Crofton formula:

$$\mathcal{L}(f)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(\{t\in[0,1],\ \langle\theta,f(t)\rangle=r\}\big)\,dr\,d\theta.$$
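The intermediate identity of the proof, expressing the length as an average over directions of the variation of the projected curve, can be checked numerically for a polygonal line in the plane. The sketch below makes several assumptions that are not in the text: the dimension is $d=2$, the sphere $S^1$ is equipped with the arc-length measure, so that $c_2=\int_0^{2\pi}|\cos\varphi|\,d\varphi=4$, and the integral over directions is approximated by a midpoint rule with `m` nodes. The function names are hypothetical.

```python
import math
from typing import Sequence

Point = tuple[float, float]


def polyline_length(vertices: Sequence[Point]) -> float:
    """Length of the polygonal line, as the sum of its segment lengths."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def variation_of_projection(vertices: Sequence[Point], phi: float) -> float:
    """Total variation of t -> <theta, f(t)> for theta = (cos phi, sin phi).

    The projection is monotone on each segment, so the variation is the sum
    of the absolute projected increments.
    """
    c, s = math.cos(phi), math.sin(phi)
    return sum(abs(c * (b[0] - a[0]) + s * (b[1] - a[1])) for a, b in zip(vertices, vertices[1:]))


def length_from_projections(vertices: Sequence[Point], m: int = 20000) -> float:
    """Approximate (1 / c_2) * integral over S^1 of the projected variation."""
    h = 2.0 * math.pi / m
    integral = h * sum(variation_of_projection(vertices, (k + 0.5) * h) for k in range(m))
    c2 = 4.0  # integral of |cos(phi)| over [0, 2*pi], under the arc-length convention
    return integral / c2
```

For any polygonal line, `length_from_projections` should agree with `polyline_length` up to the quadrature error.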

The next equality, corresponding to the Cauchy-Crofton formula applied to an open subset of $[0,1]$, will be useful in the sequel.

###### Remark 3.

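Let $0\le a<b\le1$. Then,

$$\mathcal{L}(f|_{(a,b)})=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(\{t\in(a,b),\ f(t)\in D_{\theta,r}\}\big)\,dr\,d\theta.$$

Since

$$\mathcal{L}(f|_{(a,b)})=\int_0^1\mathbf{1}_{(a,b)}(t)\,|f'(t)|\,dt,$$

we have

$$\int_0^1\mathbf{1}_{(a,b)}(t)\,|f'(t)|\,dt=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{t\in[0,1]}\mathbf{1}_{(a,b)}(t)\,\mathbf{1}_{\{f(t)\in D_{\theta,r}\}}\,dr\,d\theta.$$

Hence, by linearity, if $(a_i,b_i)$, $i\ge1$, are pairwise disjoint open intervals of $[0,1]$, we have

$$\int_0^1\mathbf{1}_{\bigcup_{i\ge1}(a_i,b_i)}(t)\,|f'(t)|\,dt=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{t\in[0,1]}\mathbf{1}_{\bigcup_{i\ge1}(a_i,b_i)}(t)\,\mathbf{1}_{\{f(t)\in D_{\theta,r}\}}\,dr\,d\theta. \tag{4.1}$$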

In the sequel, we will also use a Cauchy-Crofton-type formula for $\mathcal{H}^1$, taking the form of an equality for measures.

This result relies on the following lemma.

###### Lemma 4.2.

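Let $f$ be a rectifiable curve. Then, the trace of $\mathcal{H}^1$ on $\operatorname{Im} f$ satisfies $\mathcal{H}^1\le\gamma$, where $\gamma$ is the measure defined on every Borel set $A$ by

$$\gamma(A)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}(A\cap D_{\theta,r})\,dr\,d\theta.$$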

As a preliminary result, the next lemma states the measurability of . The proof is postponed to the Appendix (Section A.2).

###### Lemma 4.3.

Let be a rectifiable curve. For , , a Borel subset of , let . Then, the function is measurable for the Lebesgue sigma-algebra.

###### Proof of Lemma 4.2.

An open subset of $\operatorname{Im} f$ may be written

$$O\cap\operatorname{Im} f=f\Big(\bigcup_{k\ge1}(a_k,b_k)\Big)=\bigcup_{k\ge1}f\big((a_k,b_k)\big),$$

where is an open subset of , and , , are pairwise disjoint open intervals of . Let . This set is a Vitali class for , that is, for every , where and every , there exist , such that and . According to Vitali’s covering theorem, for every , there exist intervals , , such that the sets , , are pairwise disjoint, and (see (Falconer, 1985, Theorem 1.10)). Hence, for every , there exist , in , such that

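$$
\begin{aligned}
\mathcal{H}^1(O\cap\operatorname{Im} f)&\le\sum_{i\ge1}|f(x_i)-f(y_i)|+\varepsilon\\
&=\sum_{i\ge1}\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(\{t\in[0,1],\ h_i(t)\in D_{\theta,r}\}\big)\,dr\,d\theta+\varepsilon,
\end{aligned}
$$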

thanks to the Cauchy-Crofton formula applied to the functions , for all . Observe that a.e.. By Lemma 4.3, the function is measurable for the Lebesgue sigma-algebra, for every Borel subset of . If , then . Thus,

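$$
\begin{aligned}
\mathcal{H}^1(O\cap\operatorname{Im} f)&\le\sum_{i\ge1}\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(f([\alpha_i,\beta_i])\cap D_{\theta,r}\big)\,dr\,d\theta+\varepsilon\\
&=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(O\cap\operatorname{Im} f\cap D_{\theta,r}\big)\,dr\,d\theta+\varepsilon.
\end{aligned}
$$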

As is arbitrary, . We define , for every Borel set , by : is a measure, satisfying . According to the Cauchy-Crofton formula, , so that the measure is finite. By outer regularity of finite measures, the trace of on is less than the measure . ∎

###### Remark 4.

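For $g$ such that $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$, the trace of $\mathcal{H}^1$ on $\operatorname{Im} g$ equals $\gamma$. Indeed, since $\mathcal{H}^1\le\gamma$ by Lemma 4.2, it is sufficient to show that both measures have the same mass. Yet, on the one hand, $\mathcal{H}^1(\operatorname{Im} g)\le\gamma(\operatorname{Im} g)$ by Lemma 4.2, and on the other hand, $\gamma(\operatorname{Im} g)\le\mathcal{L}(g)$ by the Cauchy-Crofton formula (Lemma 4.1), so that the assumption $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$ implies $\gamma(\operatorname{Im} g)=\mathcal{H}^1(\operatorname{Im} g)$.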

###### Remark 5.

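For $g$ such that $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$, we have $\operatorname{Card}(g^{-1}(\{y\}))=1$ for almost every $y$ with respect to the trace of $\mathcal{H}^1$ on $\operatorname{Im} g$. This fact follows from

$$\mathcal{H}^1(\operatorname{Im} g)=\gamma(\operatorname{Im} g)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{y\in\operatorname{Im} g\cap D_{\theta,r}}1\,dr\,d\theta,$$

together with the Cauchy-Crofton formula for $g$ (see Remark 2):

$$\mathcal{L}(g)=\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{y\in\operatorname{Im} g\cap D_{\theta,r}}\operatorname{Card}\big(g^{-1}(\{y\})\big)\,dr\,d\theta.$$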

We are now in a position to state the next lemma, which characterizes the image by $g$ of a distribution belonging to the class $\mathcal{M}_c$.

###### Lemma 4.4.

Let be a curve such that , a.e., and . Let be a probability distribution supported in , and let denote a constant. Then,

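$$\mu\ge c\lambda\iff\forall A\subset\mathcal{B}(\mathbb{R}^d)\cap\operatorname{Im} g,\quad\mu\circ g^{-1}(A)\ge c\,\frac{\mathcal{H}^1(A)}{\mathcal{L}(g)}. \tag{4.2}$$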

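Let us denote by $\mathcal{M}_c^g$ the family of probability distributions $\nu$ on $\mathbb{R}^d$, with support $\operatorname{Im} g$, such that $\nu\ge c\,\mathcal{H}^1/\mathcal{L}(g)$ on $\mathcal{B}(\mathbb{R}^d)\cap\operatorname{Im} g$. Hence, the equivalence (4.2) means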

$$\mu\in\mathcal{M}_c\iff\mu\circ g^{-1}\in\mathcal{M}_c^g.$$

In the proof of Lemma 4.4, we will use the fact that the property $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$ may be localized, as shown in the next lemma.

###### Lemma 4.5.

Let be a curve such that , and . Considering a subdivision , we have, for every ,

$$\mathcal{L}\big(g|_{(a_{i-1},a_i)}\big)=\mathcal{H}^1\big(g((a_{i-1},a_i))\big).$$

###### Proof.

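If not, since the length of a curve is never smaller than the $\mathcal{H}^1$-measure of its image, there exists $i$ such that $\mathcal{H}^1(g((a_{i-1},a_i)))<\mathcal{L}(g|_{(a_{i-1},a_i)})$, which implies

$$\mathcal{H}^1\big(g([0,1])\big)\le\sum_{i=1}^n\mathcal{H}^1\big(g((a_{i-1},a_i))\big)<\sum_{i=1}^n\mathcal{L}\big(g|_{(a_{i-1},a_i)}\big)=\mathcal{L}(g),$$

in contradiction with the assumption $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$.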

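###### Proof of Lemma 4.4.

1. Assume that $\mu\ge c\lambda$. An open subset of $\operatorname{Im} g$ may be written

$$O\cap\operatorname{Im} g=g\Big(\bigcup_{i\ge1}(a_i,b_i)\Big)=\bigcup_{i\ge1}g\big((a_i,b_i)\big),$$

where $O$ is an open subset of $\mathbb{R}^d$, and $(a_i,b_i)$, $i\ge1$, are pairwise disjoint open intervals of $[0,1]$.

Thanks to the assumption on $\mu$, we have

$$
\begin{aligned}
\mu\circ g^{-1}(O\cap\operatorname{Im} g)&\ge c\,\lambda\big(g^{-1}(O\cap\operatorname{Im} g)\big)\\
&\ge c\,\lambda\Big(\bigcup_{i\ge1}(a_i,b_i)\Big)\\
&=c\sum_{i\ge1}(b_i-a_i) &&\text{(4.3)}\\
&=c\sum_{i\ge1}\frac{\mathcal{L}\big(g|_{(a_i,b_i)}\big)}{\mathcal{L}(g)} &&\text{(4.4)}\\
&=c\sum_{i\ge1}\frac{\mathcal{H}^1\big(g((a_i,b_i))\big)}{\mathcal{L}(g)} &&\text{(4.5)}\\
&\ge c\,\frac{\mathcal{H}^1(O\cap\operatorname{Im} g)}{\mathcal{L}(g)}.
\end{aligned}
$$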

For the equality (4.3), we used that the intervals are disjoint, for (4.4), the property $|g'(t)|=\mathcal{L}(g)$ a.e., and then for (4.5), the localized version of the equality $\mathcal{L}(g)=\mathcal{H}^1(\operatorname{Im} g)$ (Lemma 4.5).
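To make the step leading to (4.4) explicit, here is a short worked computation, added for illustration and assuming $\mathcal{L}(g)>0$. Since $g$ has constant speed $\mathcal{L}(g)$, the length of the restriction of $g$ to an interval $(a_i,b_i)$ is proportional to the length of the interval:

$$\mathcal{L}\big(g|_{(a_i,b_i)}\big)=\int_0^1\mathbf{1}_{(a_i,b_i)}(t)\,|g'(t)|\,dt=\mathcal{L}(g)\,(b_i-a_i),\qquad\text{hence}\qquad b_i-a_i=\frac{\mathcal{L}\big(g|_{(a_i,b_i)}\big)}{\mathcal{L}(g)}.$$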

The result extends to every Borel subset of $\operatorname{Im} g$, using the outer regularity of probability measures.

###### Remark 6.

Taking $\mu=\lambda$ and $c=1$, we obtain that $\lambda\circ g^{-1}$ is the trace of $\mathcal{H}^1/\mathcal{L}(g)$ on $\operatorname{Im} g$, since both measures are probability measures.

2. Assume that . An open subset of , for the induced topology, has the form , where , are pairwise disjoint open intervals of . Let . Using the assumption, the fact that (Remark 4), and the property for a.e. with respect to the trace of on (Remark 5), we may write

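$$
\begin{aligned}
\mu\circ g^{-1}(O\cap\operatorname{Im} g)&\ge c\,\frac{\mathcal{H}^1(O\cap\operatorname{Im} g)}{\mathcal{L}(g)}\\
&=\frac{c}{\mathcal{L}(g)}\,\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\operatorname{Card}\big(O\cap\operatorname{Im} g\cap D_{\theta,r}\big)\,dr\,d\theta\\
&=\frac{c}{\mathcal{L}(g)}\,\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{y\in O\cap\operatorname{Im} g\cap D_{\theta,r}}1\,dr\,d\theta\\
&=\frac{c}{\mathcal{L}(g)}\,\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{y\in O\cap\operatorname{Im} g\cap D_{\theta,r}}\operatorname{Card}\big(g^{-1}(\{y\})\big)\,dr\,d\theta\\
&=\frac{c}{\mathcal{L}(g)}\,\frac{1}{c_d}\int_{S^{d-1}}\int_0^\infty\sum_{t\in[0,1]}\mathbf{1}_{\bigcup_{i\ge1}(a_i,b_i)}(t)\,\mathbf{1}_{\{g(t)\in D_{\theta,r}\}}\,dr\,d\theta.
\end{aligned}
$$

Thanks to the equality (4.1), and using $|g'(t)|=\mathcal{L}(g)$ a.e., we deduce that

$$
\begin{aligned}
\mu\circ g^{-1}(O\cap\operatorname{Im} g)&\ge\frac{c}{\mathcal{L}(g)}\int_0^1\mathbf{1}_{\bigcup_{i\ge1}(a_i,b_i)}(t)\,|g'(t)|\,dt\\
&=c\sum_{i\ge1}(b_i-a_i)\\
&=c\,\lambda\Big(\bigcup_{i\ge1}(a_i,b_i)\Big).
\end{aligned}
$$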

Let us show that is negligible for .

Let be a negligible set for the trace of on . Then, is negligible for . Indeed, there exists a Borel set , such that and . Since , By Remark 6, . Thus, .

Hence, the fact that is a negligible set for the trace of on (Remark 5) implies that is negligible for .

Consequently,

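$$\mu\Big(\bigcup_{i\ge1}(a_i,b_i)\Big)=\mu\circ g^{-1}(O\cap\operatorname{Im} g),$$

and, thus,

$$\mu\Big(\bigcup_{i\ge1}(a_i,b_i)\Big)\ge c\,\lambda\Big(\bigcup_{i\ge1}(a_i,b_i)\Big).$$

This extends to every Borel subset of $[0,1]$ by outer regularity. Hence, $\mu\ge c\lambda$.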

Now, equipped with Lemma 4.4, let us turn to the proof of Theorem 3.1 itself.

## 5 Proof of Theorem 3.1

Note that . We set, for every , .

### 5.1 Step 1

We will first prove that converges in probability to 0 as goes to infinity.

To begin with, let us consider the term $\Delta_n(f_n^*)$.

###### Lemma 5.1.

$\Delta_n(f_n^*)$ converges in probability to 0 as $n$ tends to infinity.

###### Proof of Lemma 5.1.

We write

 Δn(f∗n) ≤Δn(g) =1nn∑i=1V(d(Xni,Img))=1nn∑i=1V(d(g(Uni)+εni,Img