# On the Estimation of Pointwise Dimension

Our goal in this paper is to develop an effective estimator of fractal dimension. We survey existing ideas in dimension estimation, with a focus on the currently popular method of Grassberger and Procaccia for the estimation of correlation dimension. There are two major difficulties in estimation based on this method. The first is the insensitivity of correlation dimension itself to differences in dimensionality over data, which we term "dimension blindness". The second comes from the reliance of the method on the inference of limiting behavior from finite data. We propose pointwise dimension as an object for estimation in response to the dimension blindness of correlation dimension. Pointwise dimension is a local quantity, and the distribution of pointwise dimensions over the data contains the information to which correlation dimension is blind. We use a "limit-free" description of pointwise dimension to develop a new estimator. We conclude by discussing potential applications of our estimator as well as some challenges it raises.


## I Introduction

Dimension has become an important tool in the study of dynamical systems. Its importance arises from the attempt to understand dynamical systems through their attractors. The dimensions of these attractors provide useful invariants of the systems in question. There are many notions of dimension that one can consider in this context. The most well-known of these is Hausdorff dimension. Hausdorff dimension, however, is difficult to estimate numerically, and this difficulty has given rise to many of the other notions.

Often, one can associate with ergodic systems certain probability measures supported on attractors of interest. There is a dimension theory for probability measures, roughly analogous to the classical dimension theory for sets, which one can bring to bear in the context of such systems. This is the setting for this paper. In particular, we focus on the problem of estimating the pointwise dimension distribution of a probability measure by sampling from it.

In Section II, we discuss various mathematical notions of dimension and the relationships between them.

In Section III, we discuss the currently most popular technique for estimating fractal dimension – that of Grassberger and Procaccia (1983).

This discussion motivates our derivation of our estimator of pointwise dimension, which we present in Section IV.

In Section V, we establish a certain degree of reliability of the estimator proposed in Section IV by performing a dimensional analysis of some test data.

Finally, in Section VI, we discuss potential applications of our estimator as well as certain theoretical questions arising from our considerations.

## II Mathematical Background

### II.1 Pointwise Dimension

Very generally, the purpose of any notion of dimension is to tell apart certain types of objects given a certain notion of equivalence of such objects. Pointwise dimension tells apart Borel probability measures on metric spaces which are not equivalent by a locally bi-Lipschitz injection.

To be precise, let μ1 and μ2 be Borel probability measures on the metric spaces (X1, ρ1) and (X2, ρ2) respectively. For a point x in a metric space (X, ρ) and a number ϵ > 0, let us write

 B(x,ϵ) := {y ∈ X ∣ ρ(y,x) < ϵ}.

A map f: X1 → X2 is called locally bi-Lipschitz if for all x ∈ X1 there exist ϵ > 0 and a constant K ≥ 1 such that for all u, v ∈ B(x,ϵ) we have

 (1/K) ρ1(u,v) ≤ ρ2(f(u),f(v)) ≤ K ρ1(u,v).

Let us say that the measures μ1 and μ2 are equivalent if there exists a locally bi-Lipschitz injection f: X1 → X2 such that for all μ2-measurable U ⊆ X2 we have

 μ1(f−1(U)) = μ2(U). (1)

Pointwise dimension is a source of invariants of such probability measures under this notion of equivalence. Rather than prove this general statement, we will illustrate it in a very simple situation which motivates the concept of pointwise dimension. The illustration contains all the ideas required for the general proof.

Consider the unit interval [0,1] along with its uniform measure ν1 and the unit square [0,1]² along with its uniform measure ν2, both under their natural Euclidean metrics. The “pointwise dimension” method of telling these measures apart is to observe that, for x ∈ [0,1], we have

 ν1(B(x,ϵ)) ∼ ϵ,

whereas, for y ∈ [0,1]², we have

 ν2(B(y,ϵ)) ∼ ϵ2. (2)

If there were an equivalence f between ν1 and ν2, then for each x ∈ [0,1] we would have

 B(x, ϵ/K) ⊆ f−1(B(f(x),ϵ)) ⊆ B(x, Kϵ).

This would force

 ν2(B(f(x),ϵ)) = ν1(f−1(B(f(x),ϵ))) ∼ ϵ,

contradicting (2). Hence no such equivalence exists.

For a general Borel probability measure, growth rates such as the ones we computed above for ν1 and ν2 may not exist at all points, or even at any point, in the corresponding metric space. This motivates the following definition:

###### Definition 1.

Let μ be a Borel probability measure on a metric space X. The lower pointwise dimension of μ at x ∈ X is defined to be

 D̲_μ(x) := liminf_{ϵ→0} log μ(B(x,ϵ)) / log ϵ. (3)

The upper pointwise dimension of μ at x is defined to be

 D̄_μ(x) := limsup_{ϵ→0} log μ(B(x,ϵ)) / log ϵ. (4)

If these two quantities agree, we call the common value the pointwise dimension of μ at x and denote it D_μ(x).

The calculations we performed highlight two properties of pointwise dimension:

1. Pointwise dimension is invariant under bi-Lipschitz injections.

2. If μ is absolutely continuous with respect to Lebesgue measure on ℝ^N, then the pointwise dimension of μ at any point in its support is N.

Let us now see how pointwise dimension behaves for other classes of measures. We begin by considering the Cantor measure – this is the measure resulting from the Cantor middle-thirds construction if at each scale we choose one of the remaining intervals uniformly. To be precise, for an interval I, let u_I denote the uniform measure on I. At the n-th stage of the middle-thirds construction, there are 2^n admissible intervals I_j^(n). Let us write

 ξ_n := (1/2^n) ∑_{j=1}^{2^n} u_{I_j^(n)}.

There is a limiting measure ξ for the family (ξ_n) which we call the Cantor measure. ξ is supported on the Cantor set C.

For x ∈ C and ϵ > 0, let us first calculate ξ(B(x,ϵ)). Note that the scale of this ball in terms of the middle-thirds construction is given by log_{1/3}(ϵ). This means that

 2^{−⌈log_{1/3}(ϵ)⌉} ≤ ξ(B(x,ϵ)) ≤ 2^{−⌊log_{1/3}(ϵ)⌋}.

This shows that

 log(ξ(B(x,ϵ))) / log(ϵ) ∼ log_3(ϵ) · log(2) / log(ϵ),

which in turn proves that D_ξ(x) exists and is equal to log(2)/log(3).
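This scaling law is easy to check empirically. The following sketch (our illustration, not part of the original analysis) samples points from the Cantor measure by drawing uniform ternary digits from {0, 2}, and estimates the pointwise dimension at a sample point from the empirical mass of small balls:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cantor(n, digits=30):
    """Sample n points from the Cantor measure: each point has independent
    uniform ternary digits drawn from {0, 2}, to 30-digit precision."""
    d = 2 * rng.integers(0, 2, size=(n, digits))
    return (d * 3.0 ** -np.arange(1, digits + 1)).sum(axis=1)

pts = sample_cantor(100_000)
x = pts[0]                                   # a point of the Cantor set
for eps in [3.0 ** -k for k in (4, 6, 8)]:
    mass = np.mean(np.abs(pts - x) < eps)    # empirical xi(B(x, eps))
    print(eps, np.log(mass) / np.log(eps))   # should approach log2/log3 ~ 0.6309
```

At scales comparable to the sampling precision the estimate degrades, previewing the finite-data issues discussed in Section III.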

Of course, the Cantor measure is a special case of a much more general class of measures which we call Cantor-like measures. This class of measures was first studied by Billingsley in the context of dimension theory Billingsley (1960, 1961). We restrict our attention here to Borel probability measures on [0,1] and essentially reproduce Billingsley’s definition. It is possible to give a more modern definition in terms of iterated function systems.

Let k ≥ 2 be an integer and let

 π = (p_1, p_2, …, p_k)

be a probability vector. Divide the unit interval

 I = [0,1]

into k subintervals I_1, …, I_k of equal length 1/k. Using our notation for the uniform measure from before, we write

 ξ_{π,1} := ∑_{j=1}^{k} p_j u_{I_j}.

We can further subdivide each of the intervals I_j into k subintervals I_{j,1}, …, I_{j,k} of equal length 1/k². This allows us to define

 ξ_{π,2} := ∑_{j1=1}^{k} p_{j1} ∑_{j2=1}^{k} p_{j2} u_{I_{j1,j2}}.

Iterating this procedure n times yields a probability measure ξ_{π,n}. The family (ξ_{π,n}) has a limiting measure which we denote ξ_π and which generalizes our earlier construction of the Cantor measure. We will now calculate the pointwise dimension of such a measure at a point x of its support.

A point x in the support of ξ_π is uniquely specified by a sequence of integers (j_i) as

 {x} = ⋂_{i=1}^{∞} I_{j1,j2,…,ji}.

Let ϵ > 0. The scale of the ball B(x,ϵ) in terms of the construction of ξ_π is given by log_{1/k}(ϵ). In fact, if we write s(ϵ) := ⌊log_{1/k}(ϵ)⌋ and S(ϵ) := ⌈log_{1/k}(ϵ)⌉, then

 ∏_{i=1}^{S(ϵ)} p_{ji} ≤ ξ_π(B(x,ϵ)) ≤ ∏_{i=1}^{s(ϵ)} p_{ji}. (5)

Let us count, for 1 ≤ j ≤ k,

 f_n(j) := |{i ∣ 1 ≤ i ≤ n, j_i = j}|.

From (5), we see that

 log(ξ_π(B(x,ϵ))) / log(ϵ) ∼ ∑_{j=1}^{k} f_{S(ϵ)}(j) log(p_j) / log(ϵ). (6)

Let us denote by A the set of x for which, as ϵ → 0,

 f_{S(ϵ)}(j) / log(ϵ) → p_j / log(1/k).

Then ξ_π(A) = 1 by the strong law of large numbers. Therefore, almost everywhere with respect to the measure ξ_π, we have

 D_{ξ_π}(x) = −(1/log(k)) ∑_{j=1}^{k} p_j log(p_j). (7)
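Formula (7) is straightforward to evaluate numerically; the following minimal helper (ours, for illustration) uses the convention 0·log 0 := 0:

```python
import math

def billingsley_dimension(p):
    """Almost-sure pointwise dimension of the Cantor-like measure xi_pi,
    per Eq. (7): -(1/log k) * sum_j p_j log(p_j), with 0 log 0 := 0."""
    k = len(p)
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(k)

# pi = (1/2, 0, 1/2) on k = 3 subintervals reproduces the Cantor measure,
# recovering log(2)/log(3):
print(billingsley_dimension([0.5, 0.0, 0.5]))
```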

Furthermore, from (6), we can see that there are infinitely many points in the support of ξ_π at which its pointwise dimension is not defined. We illustrate this idea with the following example:

Put n_0 := 0. For each integer m ≥ 1, define

 n_m := min{ n ∣ (n − n_{m−1})/n > 1 − 1/2^m }.

Let V denote the vector space of measures generated by u_{[a,b]} for [a,b] ⊆ [0,1]. Let us define S: V → V by

 S u_{[a,b]} := (1/2) ( u_{[a, a+(b−a)/5]} + u_{[b−(b−a)/5, b]} ).

Let us define T: V → V by

 T u_{[a,b]} := (1/3) ( u_{[a, a+(b−a)/5]} + u_{[a+2(b−a)/5, a+3(b−a)/5]} + u_{[b−(b−a)/5, b]} ).

Finally, put

 μ_{2m+1} := S^{n_{2m+1}} T^{n_{2m}} ⋯ T^{n_2} S^{n_1} · u_{[0,1]},

and

 μ_{2m} := T^{n_{2m}} S^{n_{2m−1}} ⋯ T^{n_2} S^{n_1} · u_{[0,1]}.

Then each measure μ_n is a Borel probability measure and the family (μ_n) converges to a Borel probability measure μ. It is easy to see that, for every point x in the support of μ, we have

 D̲_μ(x) = log(2)/log(5),

whereas

 D̄_μ(x) = log(3)/log(5).

Therefore, at no point in the support of μ does it have a well-defined pointwise dimension. More serious examples of such pathological behavior have been given by Ledrappier and Misiurewicz (1985), and by Cutler and Dawson (1990a).

Although there is much left to be said on the matter of pointwise dimension, we end this section here, hoping that we have given the reader enough of a sense of this quantity to be comfortable working with it.

### II.2 Other notions of dimension

#### II.2.1 Hausdorff dimension

One of the most well-known notions of dimension is that of Hausdorff (1918). The notion of Hausdorff dimension is built upon the notion of a Hausdorff measure.

Let E ⊆ ℝ^N, and let ϵ > 0. An ϵ-covering of E is a countable collection of sets {E_k} such that E ⊆ ⋃_k E_k and diam(E_k) < ϵ for every k. For α ≥ 0, put

 H_α(E,ϵ) := inf { ∑_k diam(E_k)^α ∣ {E_k} an ϵ-covering of E }, (8)

where the infimum is over ϵ-coverings of E. Then the Hausdorff α-measure H_α(E) is defined by

 H_α(E) := lim_{ϵ→0} H_α(E,ϵ). (9)

Note that this limit either exists or is ∞ because H_α(E,ϵ) is a non-increasing function of ϵ.

Observe that, for β > α, we have the inequality

 H_β(E,ϵ) ≤ ϵ^{β−α} H_α(E,ϵ).

Suppose that H_α(E) < ∞. Then letting ϵ → 0 in the above inequality forces H_β(E) = 0. Therefore there exists a unique critical exponent α₀ ∈ [0,∞] such that

 H_β(E) = 0 for β > α₀, and H_β(E) = ∞ for β < α₀. (10)
###### Definition 2.

For a set E ⊆ ℝ^N, we call the unique exponent α₀ satisfying (10) its Hausdorff dimension, and denote it by D_H(E).

Hausdorff dimension has the following properties:

1. Hausdorff dimension is invariant under locally bi-Lipschitz injections,

2. for a decomposition of E into countably many sets E_k, we have D_H(E) = sup_k D_H(E_k),

3. the Hausdorff dimension of a ball in ℝ^N is N.

Broadly speaking, Hausdorff dimension provides a measure of complexity for sets which is more finely stratified than, say, Lebesgue measure. For example, the Cantor set has Lebesgue measure zero but Hausdorff dimension log(2)/log(3). This makes Hausdorff dimension a very useful tool in the sciences. We now discuss certain developments around Hausdorff dimension in the numerical study of dynamical systems.

Although there is no rigorous criterion by which one may classify a set as fractal, a fractional Hausdorff dimension is one important indicator of such structure. As a consequence, Hausdorff dimension has long been used to detect chaotic behavior in dynamical systems (see, for example, Ott (2002)). For example, the Hausdorff dimension of a smooth attractor is a geometric invariant, and therefore one may use it to classify such attractors.

In practice, Hausdorff dimension is difficult to estimate from data. One of the reasons for this is the definition of the Hausdorff α-measures themselves (Eq. 9). Given a set E, effective estimation of H_α(E,ϵ) requires one to produce an ϵ-covering of E which is close to optimal. The problem is that, at different places in E, such an ϵ-covering would employ sets of different diameters. Furthermore, it is difficult to identify the critical exponent with only estimates of the α-measures in hand. Another reason comes from property 2 of Hausdorff dimension listed above: the set of points in E which dictates its Hausdorff dimension may be relatively “small”, which presents further difficulty in the estimation.

Most numerical estimators of dimension have sought to evade these difficulties by instead estimating quantities which merely provide approximations to Hausdorff dimension. The most well-known of these approximations are capacity or box-counting dimension, correlation dimension, and information dimension. We now present each of these in turn, along with some estimators.

#### II.2.2 Box-counting dimension

The idea behind box-counting dimension is that the dimension of a set is related to the asymptotic behavior, as ϵ → 0, of the number of cubes of side length ϵ required to cover it. For example, for N ∈ ℕ, one requires ϵ^{−N} cubes of side length ϵ to cover the unit cube in ℝ^N. This allows us to recover the dimension as

 N = lim_{ϵ→0} log(ϵ^{−N}) / log(ϵ^{−1}).

This motivates the following general definition:

###### Definition 3.

Let E ⊆ ℝ^N. Let us denote by B(E,ϵ) the minimal number of cubes (or ‘boxes’) of side length ϵ required to cover E.

We define the lower box-counting dimension of E by

 D̲_B(E) := liminf_{ϵ→0} log(B(E,ϵ)) / log(ϵ^{−1}).

Similarly, we define the upper box-counting dimension of E by

 D̄_B(E) := limsup_{ϵ→0} log(B(E,ϵ)) / log(ϵ^{−1}).

If these quantities agree, we call the common value the box-counting dimension of E, and denote it by D_B(E).

If, in the definition of Hausdorff dimension, we take the infimum in (8) over covers of E in which each set has diameter exactly ϵ, and if the corresponding limit (9) exists, we recover the box-counting dimension of E.

Incidentally, the above relationship between the definitions of Hausdorff dimension and box-counting dimension shows that, for any E ⊆ ℝ^N,

 D_H(E) ≤ D̲_B(E). (11)

Frequently, the inequality in (11) is an equality. In a sense, it is the situations in which the inequality is strict that make box-counting dimension a poor notion of dimension. For example, box-counting dimension does not satisfy anything like property 2 of Hausdorff dimension – one cannot calculate the box-counting dimension of a set from a countable decomposition of that set. As an illustration, let Q := ℚ ∩ [0,1]. As Q is dense in [0,1], any covering of Q by ϵ-balls must also cover [0,1]. Therefore, D_B(Q) = 1. On the other hand, if we enumerate Q and write

 Q = ⋃_k {q_k},

we have D_B({q_k}) = 0 for all k.

Despite its flaws, box-counting dimension remains popular because it is so easy to estimate. We will discuss the estimation of box-counting dimension in Section III.
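Part of that ease is that the definition translates almost directly into code. A minimal sketch (ours, for illustration): count occupied grid boxes at several scales and fit a slope.

```python
import numpy as np

def box_counting_estimate(points, scales):
    """Count occupied boxes of side eps at each scale, then fit
    log B(E, eps) against log(1/eps); the slope estimates D_B(E)."""
    counts = []
    for eps in scales:
        boxes = {tuple(np.floor(p / eps).astype(int)) for p in points}
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

rng = np.random.default_rng(1)
square = rng.random((20_000, 2))    # dense uniform sample of the unit square
print(box_counting_estimate(square, [0.2, 0.1, 0.05]))   # ~ 2
```

As with the GP method discussed next, the choice of scales is left to the analyst.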

#### II.2.3 Correlation Dimension

Grassberger and Procaccia (1983) proposed a method for the estimation of fractal dimension. However, it was not originally clear what kind of dimension they were estimating. Since then, this notion has been formalized to produce what we know today as correlation dimension – see, for example, Pesin (1993). In this section, we give a brief account of this historical development.

The basic idea of Grassberger and Procaccia was very similar to the idea which motivated our description of pointwise dimension in Section II.1. The problem that they considered was that of estimating some kind of fractal dimension for an attractor in a dynamical system, given a finite number of iterates of some initial values. They reasoned that the rate at which the cardinality of the intersection of ϵ-balls around the data points with the data decayed, as ϵ → 0, would provide a reasonable notion of dimension, provided that such a rate was well-defined.

To be precise, given the data

 E = {x1, x2, …, xn} ⊂ ℝ^N,

let us fix a norm ∥·∥ on ℝ^N, and define the correlation sum

 C(E,ϵ) := (1/n) ∑_{i=1}^{n} [ (1/(n−1)) ∑_{j≠i} χ_{[0,ϵ)}(∥xi−xj∥) ], (12)

where χ_{[0,ϵ)} denotes the characteristic function of the interval [0,ϵ). Grassberger and Procaccia (1983) suggested that estimating the rate of decay of C(E,ϵ) with ϵ would provide a good notion of dimension for the set E. While this technique was in some sense very easy to apply numerically, it lacked a rigorous mathematical foundation – there was no known notion of dimension that the Grassberger-Procaccia method (GP) could be proved to provide an estimate of in general. Implicitly, Grassberger and Procaccia had introduced a completely new notion of dimension – correlation dimension. The previously cited paper of Pesin (1993) develops correlation dimension from a vague concept into a rigorous quantity.
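For concreteness, the correlation sum (12) can be computed directly from a finite sample; a small sketch (ours), using the Euclidean norm:

```python
import numpy as np

def correlation_sum(E, eps):
    """C(E, eps) of Eq. (12): for each x_i, the fraction of the other
    points x_j with ||x_i - x_j|| < eps, averaged over i."""
    E = np.asarray(E, dtype=float)
    n = len(E)
    d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    close = (d < eps).sum(axis=1) - 1        # strict inequality; drop j = i
    return float(np.mean(close / (n - 1)))

rng = np.random.default_rng(2)
line = rng.random((2000, 1))                 # uniform sample of [0, 1]
print(correlation_sum(line, 0.05))           # near P(|X-Y| < 0.05) = 0.0975
```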

One difficulty in formalizing the concept of correlation dimension is that the correlation sums (12) are defined for finite sets of data. For example, this prevents us from defining the correlation dimension simply as a limit of correlation sums as ϵ → 0. In fact, this particular point introduces much difficulty even in numerical estimation. We will discuss this in detail in Section III. For the purposes of a rigorous formulation, another problem with the finiteness of data is that it is not clear what kind of underlying process of data generation one should consider when trying to define correlation dimension. In this paper, we will assume that the data is sampled according to a probability measure μ, and therefore we define the correlation dimension of μ.

Let μ be a probability measure on ℝ^N, and let X1, X2, … be a sequence of independent random variables with distribution μ. Let us define

 E_n := {X1, X2, …, Xn}.

Then, for a fixed ϵ > 0, the limit

 C(μ,ϵ) := lim_{n→∞} C(E_n,ϵ) (13)

exists almost surely by the strong law of large numbers, and is equal to the probability that ∥X1 − X2∥ < ϵ.

###### Definition 4.

The limit

 D_C(μ) := lim_{ϵ→0} log(C(μ,ϵ)) / log(ϵ),

if it exists, is called the correlation dimension of μ.

If μ is a Borel probability measure on ℝ^N, then let us write, for x ∈ ℝ^N,

 F_{μ,ϵ}(x) := μ(B(x,ϵ)).

In this case, we have

 C(μ,ϵ) = ∥F_{μ,ϵ}∥_1,

and so

 D_C(μ) = lim_{ϵ→0} log(∥F_{μ,ϵ}∥_1) / log(ϵ).

This suggests that there is an entire family of dimensions related to the correlation dimension. These dimensions are called the q-generalized dimensions; they were first introduced by Hentschel and Procaccia (1983) in a manner similar to that in which correlation dimension was introduced by Grassberger and Procaccia (1983). Formally, we define the q-generalized dimension of μ, if it exists, to be

 D_{C,q}(μ) := lim_{ϵ→0} log(∥F_{μ,ϵ}∥_{q−1}) / log(ϵ). (14)

We take the norm in L^{q−1} above in order to be consistent with the notation used in the numerical literature. The reason for this convention is apparent from the derivation of Hentschel and Procaccia (1983). In this paper, we will be concerned mainly with the correlation dimension, although much of our discussion carries over directly to the case of the q-generalized dimensions.

Note that, in definition, there is quite a bit of similarity between the correlation dimension of μ (Definition 4) and its pointwise dimension at a generic point of its support (Definition 1). For example, Cutler (1993) defines the average pointwise dimension of a Borel probability measure μ on ℝ^N by

 D_P(μ) := ∫_{ℝ^N} D_μ(x) dμ(x).

She points out that, while D_P(μ) reflects the average of the local rates of decay of the measure of ϵ-balls around points in the support of μ, D_C(μ) reflects the rate of decay of the average measure of ϵ-balls around points in the support of μ. This seems to have been a source of confusion in the past. We urge the reader to be wary both in reading about and in applying these concepts.

To expand upon this, suppose that for each i ∈ {1,2} we have a Borel probability measure μ_i on ℝ^N with associated exponent α_i such that, for each x in the support of μ_i, we have

 μ_i(B(x,ϵ)) ∼ ϵ^{α_i}.

Let us further assume that the supports of μ1 and μ2 are contained in disjoint open sets in ℝ^N and that α1 < α2. Finally, let us define

 μ = (1/2) μ1 + (1/2) μ2.

As we will see in (16), for each i, we have

 D_C(μ_i) = α_i.

Now, because the supports are disjoint, we also have, for sufficiently small ϵ,

 C(μ,ϵ) = (1/4) C(μ1,ϵ) + (1/4) C(μ2,ϵ).

Thus, for small ϵ, we have

 C(μ,ϵ) ∼ (1/4) ϵ^{α1} (1 + ϵ^{α2−α1}).

Therefore,

 D_C(μ) = lim_{ϵ→0} [log(ϵ^{α1}) + log(1 + ϵ^{α2−α1}) + log(1/4)] / log(ϵ) = α1. (15)

This already highlights one of the major problems with correlation dimension – it cannot even be used to tell apart a simple mixture of measures such as μ from its lowest-dimensional component! We call this the dimension blindness of correlation dimension. It poses a serious problem for relying upon estimators of correlation dimension in the analysis of data. We seek to address this problem with our estimator.
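The effect is easy to reproduce numerically. In the sketch below (our illustration, not from the paper), we mix a sample from a segment (α1 = 1) with a sample from a square of disjoint support (α2 = 2); a GP-style slope for the mixture tracks the smaller exponent:

```python
import numpy as np

rng = np.random.default_rng(0)
seg = np.c_[rng.random(1000), np.zeros(1000)]        # alpha_1 = 1
sq = rng.random((1000, 2)) + np.array([5.0, 5.0])    # alpha_2 = 2, disjoint support
E = np.vstack([seg, sq])                             # sample from the mixture mu

n = len(E)
d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
scales = np.array([0.01, 0.02, 0.04, 0.08])
C = np.array([(np.sum(d < s) - n) / (n * (n - 1)) for s in scales])
slope, _ = np.polyfit(np.log(scales), np.log(C), 1)
print(slope)    # close to alpha_1 = 1, despite half the data being 2-dimensional
```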

### II.3 Dimensions of Measures

Although we have defined Hausdorff dimension as a measure of complexity of sets, it is easy to derive from it a notion of dimension for measures. We follow here the development of Cutler (1993).

###### Definition 5.

Let μ be a Borel probability measure on ℝ^N. We define the Hausdorff dimension distribution μ_H of μ, which is a Borel probability measure on [0,N], by

 μ_H([0,α]) := sup{ μ(E) ∣ D_H(E) ≤ α }.

We define the Hausdorff dimension of μ to be

 D_H(μ) := inf{ α ∈ [0,N] ∣ μ_H([0,α]) = 1 }.

At this point, we have described in some detail three notions of dimension associated with a Borel probability measure μ on ℝ^N – pointwise dimension, correlation dimension, and Hausdorff dimension. A lot is known about the relationships between these three dimensions, particularly when μ has constant pointwise dimension

 D_μ(x) = d,

for μ-almost every x in its support. Young (1982) proved that, when this is the case, we have

 D_H(μ) = d.

Pesin (1993) proved that, in the same case, we have

 D_C(μ) = d. (16)

In fact, the results of Young and Pesin are a bit more general. Cutler has explored in greater detail the relationship between the distribution of pointwise dimensions of μ and its corresponding Hausdorff dimension distribution μ_H:

###### Theorem 1 (Cutler (1993), Theorem 3.2.2).

Let μ be a Borel probability measure on ℝ^N. Let us consider the following subset of the support of μ:

 D_α := {x ∣ D̲_μ(x) ≤ α}.

Then D_H(D_α) ≤ α, and μ_H([0,α]) = μ(D_α).

To give a rough idea of the kind of reasoning required to prove these results, let us consider (16). Suppose that μ satisfies the following condition: there exist constants c, C > 0 and an exponent d such that, for all points x in the support of μ and all sufficiently small ϵ > 0,

 c ϵ^d < μ(B(x,ϵ)) < C ϵ^d. (17)

Then it is easy to see that D_C(μ) = d, as, for such ϵ, we have

 c ϵ^d < C(μ,ϵ) < C ϵ^d.

For a general Borel probability measure with almost everywhere constant pointwise dimension d, there is no such guarantee of uniformity in local scaling. The trick to proving the general result is to observe that any such measure can be written as a limit of measures satisfying the condition (17). To see this, we can restrict μ to those points in its support at which the condition is satisfied for various values of c and C; μ is a limit of such restrictions.

The condition that μ has constant pointwise dimension almost everywhere in its support is not an exceptional one as far as dynamical data is concerned – for example, if μ is ergodic for a map f, then this condition is satisfied under quite general constraints on f. For more details, refer to Cutler (Cutler and Dawson (1990a); Cutler (1992a)). We state here a simplified version of Cutler’s results:

###### Proposition 1.

Let f be a locally bi-Lipschitz, ergodic map with ergodic measure μ for which the pointwise dimension exists μ-almost everywhere. Then μ has constant pointwise dimension almost everywhere in its support.

###### Proof.

Since μ is invariant under f, the map f defines an equivalence between μ and itself in the sense of (1) in Section II.1. This tells us that the function x ↦ D_μ(x) is f-invariant. As μ is ergodic, this means that D_μ is constant almost everywhere. ∎

This can form the basis for a test of ergodicity. We will discuss this in Section VI.3.

## III Numerical Estimation

In this section, we demonstrate how one may use the idea of Grassberger and Procaccia (1983) to estimate the correlation dimension of a measure. We call such methods GP estimators. Currently, GP estimators seem to be the most frequently used estimators of fractal dimension in the applied sciences. We begin by presenting the data to which we will apply the estimator, along with a theoretical analysis. We then describe how one would design a GP estimator for this data. We present the results of an analysis of the data with such an estimator. Finally, we conclude by discussing some of the drawbacks of GP estimators. This will motivate our estimator of pointwise dimension, which we present in the next section.

### III.1 The data

We will use three measures in our demonstration of the GP estimator. The first probability measure that we use is the Cantor measure ξ, which we defined in Section II.1. The second measure we use is

 ζ := (1/2)(ξ + ξ′),

where ξ′ is the measure derived from ξ by mapping the unit interval linearly onto another interval. This represents the image of ξ under a piecewise linear transformation. The third measure that we use is a mixture of two measures with different dimensional behavior:

 η := (1/3) δ0×ξ + (2/3) (ξ∗δ1)×(ξ∗δ1),

where δ_x denotes the point-mass at x.

Before we begin our analysis, we must calculate the correlation dimension of each of these measures. The calculation for the first two measures follows from (16) upon observing that the pointwise dimension of those measures is equal to log(2)/log(3) almost everywhere in their supports. For the measure η, observe that

 D_C(δ0×ξ) = log(2)/log(3),

and

 D_C((ξ∗δ1)×(ξ∗δ1)) = 2 log(2)/log(3).

The calculation of the correlation dimension of η essentially follows the calculation (15), and we have

 D_C(η) = log(2)/log(3).

Once again, dimension blindness rears its ugly head. Still, it is interesting to see how it manifests itself in numerical estimation.

### III.2 Grassberger-Procaccia Estimators

Suppose that we are given a dataset of points sampled from a Borel probability measure μ on ℝ^N, and that we wish to use this data to estimate the correlation dimension of μ. The basic idea of any GP estimator of correlation dimension is to use the correlation sums defined in (12) at various scales to perform this estimation according to Definition 4. Since the data is only available at a coarse resolution, some delicacy is required in inferring the limiting behavior of the correlation sums from the few scales available in the data.

What one often does is evaluate the correlation sums at various scales, decide which range of scales contains the most information about the correlation dimension, and finally use a linear model to estimate the quantity

 log(C(μ,ϵ)) / log(ϵ)

over this range. This burden on the analyst to choose an informative range of scales is problematic, as it can be a large source of bias in the analysis. We will experience the difficulty of making this choice in our analysis of the test data.
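A bare-bones version of this procedure can be sketched as follows (our illustration; the scale range is passed in explicitly to make the analyst's choice visible):

```python
import numpy as np

def gp_estimate(E, scales):
    """GP-style estimate of correlation dimension: evaluate the correlation
    sum at the chosen scales and regress log C(E, eps) on log eps."""
    E = np.asarray(E, dtype=float)
    n = len(E)
    d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    C = np.array([(np.sum(d < eps) - n) / (n * (n - 1)) for eps in scales])
    slope, _ = np.polyfit(np.log(scales), np.log(C), 1)
    return slope

# Example: a uniform sample of the unit square. A different choice of
# scales gives a (slightly) different answer -- the source of bias
# discussed above.
rng = np.random.default_rng(3)
print(gp_estimate(rng.random((1500, 2)), [0.02, 0.04, 0.08]))   # ~ 2
```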

### III.3 Analysis

In Figure 1, we present graphs of log(C(E,ϵ))/log(ϵ) as functions of log(ϵ) for each of the measures of Section III.1. We sampled 10,000 points from each measure up to a precision of 30 ternary digits for this analysis. We have indicated our choices of informative regions on each graph by shading them. The horizontal line in each graph reflects the true correlation dimension of the measure.

Observe that, after taking logs of the correlation sums and of the parameters ϵ, the informative range of scales is the one over which the regression line for the data is closest to horizontal and the total least-squares error is relatively small. It is difficult to state a more precise criterion for choosing this region. For example, in Figure 1(b), there seems to be no clear choice of informative scaling region.

From Figure 1, we obtain the following estimates (the ground truth in each case is log(2)/log(3) ≈ 0.63):

1. ,

2. ,

3. .

From Figure 1(c), one may be tempted to claim that the effects of the mixture are observable – there is a distinct bump in the graph over part of the range. Such a claim is not easily justifiable, however, for the same behavior is present in Figure 1(a).

Figure 1(b) is of particular interest, for it shows that the GP estimates of correlation dimension are not sensitive even to piecewise linear transformations of the data.

## IV The Estimator

### IV.1 Guiding Principles

We observed, in Section III, some of the difficulties involved in estimating dimension using a GP estimator. Broadly, these difficulties arise from two sources:

1. The dimension blindness of correlation dimension.

2. The onus on the user to choose a correct range of scales for the analysis.

These are important issues to address when designing a new estimator.

Bearing in mind the dimension blindness of correlation dimension, it seems desirable to instead estimate a dimension which is more sensitive to the local structure of a given probability measure. Therefore, we choose to estimate pointwise dimension. This enables us to treat dimension as a distribution over the data, which has certain statistical advantages beyond eliminating blindness to varying dimensions. This will become clear in our numerical analysis of the estimator.

Overcoming the problem of sensitivity to scale is much more difficult. The reason this sensitivity is such a problem for GP estimators is that the method of Grassberger and Procaccia (1983) explicitly depends upon limiting behavior to produce good estimates. Addressing this problem in an estimator of pointwise dimension is not easy, because pointwise dimension is fundamentally a local quantity. We utilize the idea that the distances of data points to their nearest neighbors contain information about the local dimensions of the measure from which the data is sampled. We expand upon this idea in more detail in the next section.
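To make the nearest-neighbor idea concrete, here is one simple maximum-likelihood form of it (in the style of Levina and Bickel; an illustration of the principle only, not the estimator we develop below):

```python
import numpy as np

def local_dimension(data, x, k=100):
    """Estimate the local dimension at x from the distances to its k
    nearest neighbors: d_hat = (k-1) / sum_{i<k} log(R_k / R_i)."""
    r = np.sort(np.linalg.norm(data - x, axis=-1))
    r = r[r > 0][:k]                     # distances to the k nearest neighbors
    return (k - 1) / np.sum(np.log(r[-1] / r[:-1]))

rng = np.random.default_rng(4)
cloud = rng.random((20_000, 2))          # uniform sample of the unit square
print(local_dimension(cloud, np.array([0.5, 0.5])))   # ~ 2
```

Evaluating such a quantity at every data point is what makes a naive pointwise approach expensive; that cost is what our clustering step, described below, is meant to mitigate.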

Finally, attempting to estimate pointwise dimension carries with it certain difficulties which are not present when estimating correlation dimension. The principal problem is that getting a true sense of the distribution of pointwise dimensions over the data can be computationally very expensive. As a result of this computational difficulty, although schemes to estimate pointwise dimension have previously been suggested (for example, Cutler (1993)), there has been no large-scale implementation that we are aware of. In our estimator, we utilize clustering techniques to mitigate this cost. The idea is to identify data points at which the underlying measure seems to have similar scaling properties. We discuss this in greater detail in Sections IV.3 and IV.4.

One further situation in which it was difficult to choose scales for analysis was when the data had been transformed by a locally bi-Lipschitz injection. Although we have not emphasized this point very much in our discussion so far, it was a primary motivation in developing our estimator. As pointwise dimension is a purely local quantity, an estimator of pointwise dimension should naturally be less sensitive to the effects of such transformations. In fact, we incorporate such insensitivity to local scaling into our estimator in the form of local density parameters; this is certainly something the reader should be aware of as they proceed through this paper.

To summarize, there are four points that the reader should keep in mind throughout the rest of this paper:

1. Pointwise dimension: We estimate pointwise dimension, which is sensitive to differences in scaling behavior over the data.

2. Limit-free description: We utilize a limit-free description of pointwise dimension, which makes estimation more effective.

3. Transformation invariance: We introduce a local density parameter which gives our estimator the flexibility to cope with data which has been transformed by a locally bi-Lipschitz injection.

4. Clustering: We use clustering techniques to identify data points with similar pointwise dimension characteristics, which saves on computational cost.

### IV.2 A limit-free description of pointwise dimension

In this section, we develop an estimator of pointwise dimension which utilizes the distances from data points to their nearest neighbors. The distributions of these distances are approximated by waiting-time distributions for Poisson processes. Such approximations provide an effective means of estimation for a large class of generating measures. Cutler and Dawson (1989, 1990b) obtained similar results for different classes of measures using very different techniques.

Let us begin by considering a Borel probability measure μ which satisfies the following local uniformity condition:

Suppose that the pointwise dimension Dμ(x) is defined at every point x in the support of μ. Suppose further that, at every such point x, there are constants δx > 0 and ϵx > 0 such that, for all ϵ < ϵx,

 μ(B(x,ϵ)) = δx ϵ^Dμ(x). (18)
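As a concrete sanity check (our own illustration, not part of the original development), the scaling in (18) can be observed numerically: for the uniform measure on the unit square, Dμ(x) = 2 and δx = π at interior points, since μ(B(x,ϵ)) = πϵ². A minimal Monte Carlo sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of mu(B(x, eps)) for the uniform measure on the
# unit square; at interior points mu(B(x, eps)) = pi * eps^2.
samples = rng.random((200_000, 2))
x = np.array([0.5, 0.5])
eps = np.array([0.02, 0.04, 0.08, 0.16])
mass = [(np.linalg.norm(samples - x, axis=1) < e).mean() for e in eps]

# The slope of log mu(B(x, eps)) against log eps recovers the exponent
# D_mu(x) in (18).
slope = np.polyfit(np.log(eps), np.log(mass), 1)[0]
print(slope)  # close to 2
```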

The main difficulty in obtaining the limit-free description of Dμ(x) is that it is difficult to make general statements about the infinitesimal behavior of μ at x. We will show that, as far as measures satisfying the local uniformity condition are concerned, one may obtain effective estimates of pointwise dimension by assuming that the infinitesimal behavior is that of a Poisson process.

Let μ be a Borel probability measure satisfying the local uniformity condition (18) above, and suppose we would like to estimate Dμ(x) given some data sampled from μ. Let us say that the question of whether or not there is a data point at distance r from x is determined by a Poisson process Z whose rate is governed by (18). Explicitly, Z(r) counts the number of data points at distance up to r from x, and is distributed according to

 ζ(n) = (δx^n r^(n·Dμ(x)) / n!) exp(−δx r^Dμ(x)). (19)

Let us define, for n ≥ 1, the waiting times

 Rn(x) := sup{ r > 0 ∣ Z(r) < n }.

These are the distances from x to its successive nearest data points. The probability densities of the random variables Rn(x) can be easily calculated from the distribution of Z as

 ϕn(r) = (Dμ(x) δx^n r^(n·Dμ(x)−1) / (n−1)!) exp(−δx r^Dμ(x)). (20)
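To illustrate how (20) can drive estimation in practice, the following sketch (our own illustration, not the paper's algorithm) fits the n = 1 case to nearest-neighbor distances by maximum likelihood. For fixed Dμ(x) the log-likelihood is maximized at δ = k / Σᵢ rᵢ^D, leaving a one-dimensional search over the dimension:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data sampled from the uniform measure on the 3-dimensional unit cube,
# so the true pointwise dimension is 3 everywhere.
pts = rng.random((1500, 3))
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
r = dists.min(axis=1)  # nearest-neighbor distances R_1(x_i)

# Log-likelihood of (20) with n = 1, profiled over the density delta:
# substituting delta = k / sum(r_i^D) leaves a function of D alone.
def profile_loglik(D, r):
    k = len(r)
    S = np.sum(r ** D)
    return k * np.log(D) + k * np.log(k / S) + (D - 1) * np.log(r).sum() - k

grid = np.linspace(0.5, 6.0, 500)
D_hat = grid[np.argmax([profile_loglik(D, r) for D in grid])]
print(D_hat)  # roughly the true dimension 3 (boundary effects bias it slightly)
```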

The special case n = 1 has been derived as a log gamma distribution in Cutler (1992b) (see also Cutler (1993), Definition 7.2.1) and in Badii and Politi (1985). Essentially the same form as (19) is also used by Nerenberg and Essex (1990) in the context of the error analysis of correlation dimension.

Since μ satisfies the local uniformity condition, it looks like a uniform measure in a small ball around x. Specifically, it looks like a uniform measure which arises from the Poisson process described above by conditioning on the region where the data is sampled. As a result, the distribution corresponding to (20) is a good approximation to the true nearest-neighbor distribution at small scales. This means that, if the data is fine enough, (20) provides an effective way of estimating the pointwise dimension. As we will see in the analysis of our estimator, the sensitivity of this approximation to scale is not very marked.

We end this section by showing how the limit-free description we have given above corresponds to a more conventional derivation of pointwise dimension.

Let us denote the distribution function of (20) by Fn(r). Specifically, it is given by an incomplete gamma function:

 Fn(r) = (1/(n−1)!) ∫_0^(δx r^Dμ(x)) t^(n−1) exp(−t) dt. (21)
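Normalized by 1/(n−1)!, the integral in (21) is the regularized lower incomplete gamma function evaluated at δx r^Dμ(x), which can be verified numerically (the parameter values below are arbitrary illustrations, not values from the paper):

```python
import math

import numpy as np
from scipy import integrate, special

# Arbitrary illustrative parameters.
n, delta, D, r = 2, 1.5, 2.0, 0.7
a = delta * r ** D  # upper limit of the integral in (21)

# F_n(r): the incomplete gamma integral, normalized by 1/(n-1)!.
val, _ = integrate.quad(lambda t: t ** (n - 1) * np.exp(-t), 0.0, a)
val /= math.factorial(n - 1)

# scipy's gammainc is the regularized lower incomplete gamma function.
print(np.isclose(val, special.gammainc(n, a)))  # True
```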

In fact, Fn exhibits the same asymptotic scaling behavior as μ(B(x,r)). Namely, using L'Hôpital's rule, we get

 lim_(r→0) (1/n) · log Fn(r) / log r = Dμ(x). (22)
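For completeness, the limit (22) can also be seen directly from the small-r asymptotics of the incomplete gamma integral (an equivalent route to L'Hôpital's rule):

```latex
\[
F_n(r)=\frac{1}{(n-1)!}\int_0^{\delta_x r^{D_\mu(x)}} t^{\,n-1}e^{-t}\,dt
\;\sim\;\frac{\bigl(\delta_x r^{D_\mu(x)}\bigr)^{n}}{n!}
\qquad (r\to 0),
\]
\[
\frac{1}{n}\,\frac{\log F_n(r)}{\log r}
=\frac{1}{n}\,\frac{n\log\delta_x+nD_\mu(x)\log r-\log n!+o(1)}{\log r}
\;\longrightarrow\;D_\mu(x)
\quad\text{as } r\to 0.
\]
```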

As the densities δx are allowed to vary from point to point, and since we are trying to estimate the distribution of pointwise dimensions of μ, it will be useful for us to distinguish between the distributions corresponding to different choices of the parameters δx and Dμ(x) in (18). Therefore, we consider the conditional distributions

 Fn(r∣δx,Dμ(x)).

We denote their corresponding probability density functions by

 ϕn(r∣δx,Dμ(x)).

In practice, the sample probability density is often a better approximation to the true probability density of the distribution being sampled from than the empirical distribution function is to the true distribution function. Thus we seek to approximate the probability density functions ϕn from the data given to us. For a given n, we call the resulting estimate of Dμ(x) the n-th nearest-neighbor dimension.

As we have seen, the nearest-neighbor dimensions provide good estimates of pointwise dimensions for measures satisfying the local uniformity condition (18). We remark that there are certainly pathological measures for which this condition does not hold – for example, the measures defined in Billingsley (1961), which Cutler (1993) terms “fractal measures”. It is an interesting problem to determine how one may give a limit-free description of pointwise dimension in the case of such measures.

As a final point of note, it is the density parameters δx in this description which enable us to build into our algorithm the desired insensitivity to locally bi-Lipschitz injections of the data which we discussed in the previous section.

### IV.3 Algorithm Overview

Suppose that we are given a sequence

 D={x1,x2,…,xk}

of data points sampled from a Borel probability measure μ. For each i and each j, let us denote by rj(i) the distance from xi to its j-th closest neighbor in D. For each i, we define the sequence

 Ri:={r1(i),r2(i),…,rk(i)}.
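In practice the sequences Ri can be computed with a spatial index rather than by brute force. A small sketch using scipy's k-d tree on illustrative data (the variable names here are our own):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)

# Illustrative data set D = {x_1, ..., x_k} in the plane.
k, N = 500, 3
data = rng.random((k, 2))

# Querying N + 1 neighbors: the first column is each point's zero
# distance to itself, so columns 1..N hold r_1(i), ..., r_N(i).
dist, _ = cKDTree(data).query(data, k=N + 1)
r = dist[:, 1:]
print(r.shape)  # (500, 3)
```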

In our algorithm, a parameter N is specified at the outset. Our goal will be to use the N-nearest neighbor distances of the data in order to obtain an approximation

 μ ≈ ∑_(m=1)^M θm μm, (23)

where

1. θm ≥ 0 for each m, and ∑_(m=1)^M θm = 1.

2. Each μm is a Borel probability measure satisfying the local uniformity condition (18).

3. Each measure μm has constant pointwise dimension dm and constant density δm for all points in its support.

4. The measures μm have disjoint supports.

Given (23), we can partition the data D into clusters Km defined by the supports of the measures μm. The weight θm represents the proportion of the data in the cluster corresponding to μm.

The quality of an approximation of the type specified by (23) depends on the choice of M. For example, if M = 1, then any dimension estimate derived from the approximation reflects an aggregate of the local scaling behaviors of μ around the data points. On the other hand, if M = k (the size of the data set D), then one obtains a clear picture of the individual scaling behaviors, but at great cost. Often it is not necessary to go to this latter extreme in order to estimate the distribution of pointwise dimensions of μ at the data points. An important feature of our algorithm is that it chooses the clusters and the number of clusters adaptively, balancing concerns of efficiency with those of clarity. We describe this in greater detail in Section IV.4.5.

The bigger problem is actually producing the measures μm. In order to do this, we will use a variational Bayesian method. The idea is that, since each μm satisfies the local uniformity condition (18) with constant dimension and density, the N-nearest neighbor distribution for each cluster of the data should have a probability density function which is approximately ϕN from (20) for the appropriate choice of the parameters δm and dm. As a part of this procedure, we use the density functions to produce posterior distributions for the parameters dm and δm, having made a choice of prior distributions. It is important to note that this forces us to treat the dimensions (as well as the other parameters) implicitly as random variables. It is also worth noting that the choice of prior distributions for the relevant parameters can have a marked impact on the performance of the estimator when the data is sparse.

The cluster parameters – the number of clusters M, the cluster assignments, and consequently the weights θm – are estimated concurrently with the parameters dm and δm. In order to do this, we make use of the latent cluster assignment variables

 Zim := 1 if xi ∈ Km, and 0 otherwise.

Most of the optimization in our algorithm is conditioned on the number of clusters M. In these steps, the variables that we seek to estimate from the data are Zim, dm, and δm. Our estimates for the remaining variables are derived from those. Thus we require prior distributions for θ, dm, and δm. We assume that the variables dm and δm follow independent gamma distributions with the parameters (αd, βd) and (αδ, βδ) respectively. For each integer M, we assume that the vector θ = (θ1, …, θM) follows a Dirichlet distribution with the parameter γ = (γ1, …, γM). Figure 2 provides a graphical model depicting the dependencies between these variables.

#### IV.3.1 Glossary

Data:

• D = {x1, …, xk} — a sequence of data points sampled from the Borel probability measure μ for which we are trying to estimate the distribution of pointwise dimensions.

• RN — the sequence of distances from each of the data points in D to its N-th nearest neighbor.

Basic variables used in estimation:

• M — the number of clusters in the approximation (23).

• θm — the weight of the cluster Km in (23).

• dm — the pointwise dimension of μm in (23).

• δm — the density parameter in the Poisson process approximating μm according to the discussion in Section IV.2.

• Zim — the indicator that the i-th data point in D belongs to the cluster Km.

Hyper-parameters:

• (αd, βd) — the parameters for the gamma prior of the dimension parameters dm.

• (αδ, βδ) — the parameters for the gamma prior of the density parameters δm.

• γ = (γ1, …, γM) — the parameters for the Dirichlet prior of the weight parameters θm.

Some short-hand:

• ΘM — a list of the basic variables apart from M, for a fixed number of clusters M.

• Θ — the list of all the basic variables.

• ΩM — the list of hyper-parameters for a fixed number of clusters M.

• Ω — as ΩM, but here the number of clusters is allowed to vary.

#### IV.3.2 The Objective

The goal of the procedure described above is to compute the posterior distribution for the list of variables Θ given Ω and RN:

 P(Θ ∣ Ω, RN) = P(RN ∣ Θ) P(Θ ∣ Ω) / P(RN ∣ Ω). (24)

### IV.4 The Variational Bayesian Algorithm

#### IV.4.1 Phases

As we have a clean description (20) of the relationship between the basic variables, we can feasibly use the variational Bayesian method presented in Jordan et al. (1997); Attias (1999); Beal (2003); Ghahramani and Beal (1999) to estimate the posterior distribution (24). There are two distinct phases in our application of this method:

• The fixed phase — here, we estimate the posterior distribution conditioned upon the number of clusters M.

• The clustering phase — here, we use the techniques of Ueda et al. (2000); Ueda and Ghahramani (2002) to find the number of clusters M for which the estimate from the corresponding fixed phase best approximates the posterior (24).

Although we treat these phases separately, there is considerable interplay between them as we iteratively update our estimate of the posterior (24).

In the following section, we describe the connection between the two phases of our procedure – the objective function that our estimator seeks to maximize.

#### IV.4.2 The Objective Function

At each step, and for each component θ of ΘM, we distinguish between the true marginal posterior distribution derived from (24) and our estimate Q(θ ∣ ΩM) of this distribution, which we also denote Q(θ) for convenience, as we are not varying the hyper-parameters ΩM. Finally – and this is crucial – we assume that these marginal distributions are independent for each fixed number of clusters M:

 Q(ΘM) = Q(θ) ∏_(i=1)^k ∏_(m=1)^M Q(Zim) · ∏_(m=1)^M Q(dm) Q(δm). (25)

We can estimate the quality of the approximation by considering the likelihood P(RN ∣ ΩM), which measures how well the data is determined by the hyper-parameters ΩM. To begin with,

 log P(RN ∣ ΩM) = log ∫ P(RN, ΘM ∣ ΩM) dΘM.

Let us write

 LM(Q) = ∫ Q(ΘM) log [ P(RN, ΘM ∣ ΩM) / Q(ΘM) ] dΘM, (26)

and let us denote by

 κM(Q) := KL( Q(ΘM) ∥ P(ΘM ∣ RN, ΩM) ). (27)

We have the decomposition

 log P(RN ∣ ΩM) = LM(Q) + κM(Q). (28)
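The decomposition (28) can be checked directly in a toy discrete setting, where the integral in (26) becomes a finite sum (all numbers below are arbitrary illustrations):

```python
import numpy as np

# Discrete toy model: three parameter values, a prior, and a likelihood.
prior = np.array([0.5, 0.3, 0.2])   # P(theta | Omega)
lik = np.array([0.10, 0.40, 0.70])  # P(R | theta)

evidence = np.sum(lik * prior)            # P(R | Omega)
posterior = lik * prior / evidence        # P(theta | R, Omega)

# Any approximating distribution Q over the parameter values.
Q = np.array([0.2, 0.3, 0.5])

L = np.sum(Q * np.log(lik * prior / Q))   # the bound L_M(Q) of (26)
KL = np.sum(Q * np.log(Q / posterior))    # kappa_M(Q) of (27)

# The decomposition (28): log P(R | Omega) = L(Q) + KL(Q || posterior).
print(np.isclose(L + KL, np.log(evidence)))  # True
```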

Our goal, in the fixed phase of the procedure, is to find the distribution Q which maximizes the objective function LM(Q). By the decomposition (28), this is equivalent to minimizing the Kullback-Leibler divergence κM(Q), which measures the error in our approximation.

In our implementation, we use the explicit description of the objective functions to perform the optimization. We will describe the details for each of the phases separately once we have presented explicit descriptions of the relevant prior distributions, which the following section is devoted to.

#### IV.4.3 The Prior Distributions

In this section, we present expressions for the prior distributions which we will make use of in our implementation. We also describe the distribution of the data conditioned on the basic variables, which we use in conjunction with the priors to derive the posterior distribution (24).

We begin by describing our assumed prior distributions for the basic parameters given a fixed number of clusters M. We stated our assumptions about these parameters at the end of Section IV.3, but we provide here explicit expressions for their probability densities:

1. The density parameters — The density parameter δm is assumed to follow a gamma distribution with the shape parameter αδ and the rate parameter βδ:

 P(δm ∣ αδ, βδ) = (βδ^αδ δm^(αδ−1) / Γ(αδ)) exp(−βδ δm). (29)
2. The dimension parameters — The dimension parameter dm is assumed to follow a gamma distribution with the shape parameter αd and the rate parameter βd:

 P(dm ∣ αd, βd) = (βd^αd dm^(αd−1) / Γ(αd)) exp(−βd dm). (30)
3. The weight parameters — The weight vector θ follows a Dirichlet distribution with the exponents γm:

 P(θ ∣ γ) = ( Γ(∑_(m=1)^M γm) / ∏_(m=1)^M Γ(γm) ) ∏_(m=1)^M θm^(γm−1). (31)
4. The cluster indicators — The cluster indicators Zi = (Zi1, …, ZiM) follow a multinomial distribution with the probabilities θm:

 P(Zi ∣ θ) = ∏_(m=1)^M θm^Zim. (32)
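Taken together, (29)–(32) and the waiting-time density (20) define a generative model that can be forward-sampled. The sketch below uses arbitrary illustrative hyper-parameter values (not ones prescribed by the paper), and draws r via the substitution t = δ r^d, under which t follows a Gamma(N, 1) distribution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative hyper-parameter values, chosen arbitrarily.
M, k, N = 2, 1000, 1
alpha_d, beta_d = 2.0, 1.0       # gamma prior on the dimensions d_m
alpha_del, beta_del = 2.0, 1.0   # gamma prior on the densities delta_m
gamma = np.full(M, 1.0)          # Dirichlet prior on the weights theta

theta = rng.dirichlet(gamma)                          # (31)
d = rng.gamma(alpha_d, 1.0 / beta_d, size=M)          # (30)
delta = rng.gamma(alpha_del, 1.0 / beta_del, size=M)  # (29)
z = rng.choice(M, size=k, p=theta)                    # cluster draws per (32)

# Under (20), t = delta * r^d is Gamma(N, 1), so r = (t / delta)^(1/d).
t = rng.gamma(N, 1.0, size=k)
r = (t / delta[z]) ** (1.0 / d[z])
print(r.shape)  # (1000,)
```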

Finally, the conditional probability of drawing the samples RN given a set of basic variables ΘM is

 P(RN∣ΘM)=