I Introduction
Dimension has become an important tool in the study of dynamical systems.
Its importance arises from the attempt to understand dynamical systems by their attractors.
The dimensions of these attractors provide useful invariants of the systems in question.
There are many notions of dimension that one can consider in this context.
The most well-known of these is Hausdorff dimension. Hausdorff dimension, however, is difficult to estimate numerically, and this has given rise to many of the other conceptions.
Often, one can associate with ergodic systems certain probability measures supported on attractors of interest. There is a dimension theory for probability measures, roughly analogous to the classical dimension theory for sets, which one can bring to bear in the context of such systems. This is the setting for this paper. In particular, we focus on the problem of estimating the
pointwise dimension distribution of a probability measure by sampling from it. In Section II, we discuss various mathematical notions of dimension and the relationships between them.
In Section III, we discuss the currently most popular technique for estimating fractal dimension – that of Grassberger and Procaccia (1983).
This discussion motivates our derivation of our estimator of pointwise dimension, which we present in Section IV.
In Section V, we establish a certain degree of reliability of the estimator proposed in Section IV by performing a dimensional analysis of some test data.
Finally, in Section VI, we discuss potential applications of our estimator as well as certain theoretical questions arising from our considerations.
II Mathematical Background
II.1 Pointwise Dimension
Very generally, the purpose of any notion of dimension is to tell apart certain types of objects given a certain notion of equivalence of such objects.
Pointwise dimension tells apart Borel probability measures on metric spaces which are not equivalent under a locally bi-Lipschitz injection.
To be precise let and be Borel probability measures on the metric spaces and respectively. For a point in a metric space and a number , let us write
A map is called locally bi-Lipschitz if for all there exists and a constant such that for all we have
Let us say that the measures and are equivalent if there exists a locally biLipschitz injection such that for all measurable we have
(1) 
Pointwise dimension is a source of invariants of such probability measures under this notion of equivalence.
Rather than prove this general statement, we will illustrate this in a very simple situation which motivates the concept of pointwise dimension.
The illustration will contain all the ideas required for the general proof.
Consider the unit interval along with its uniform measure and the unit square along with its uniform measure , both under their natural Euclidean metrics. The “pointwise dimension” method of telling these measures apart is to observe that, for , we have
whereas, for , we have
(2) 
If there were an equivalence between and , then we would have
This would force
which would contradict (2).
For a general Borel probability measure, growth rates such as the ones we computed above for and may not exist at all points or even at any point in the corresponding metric space. This motivates the following definition:
Definition 1.
Let be a Borel probability measure on a metric space . The lower pointwise dimension of at is defined to be
(3) 
The upper pointwise dimension of at is defined to be
(4) 
If these two quantities agree, we call the common value the pointwise dimension of at and denote it .
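For a concrete instance of Definition 1, the ball masses of the uniform measure on the unit interval are available in closed form, so the ratio inside the limit can be computed directly. A minimal Python sketch (the base point and scales are our choices for illustration):

```python
import math

# Pointwise dimension of the uniform measure on [0, 1] at an interior
# point x: mu(B(x, r)) = 2r for small r, so log mu(B(x, r)) / log r -> 1.
def mu_ball(x, r):
    # Mass that the uniform measure on [0, 1] assigns to the ball B(x, r)
    return max(0.0, min(1.0, x + r) - max(0.0, x - r))

x = 0.3
estimates = [math.log(mu_ball(x, r)) / math.log(r) for r in (1e-2, 1e-4, 1e-6)]
print(estimates)  # increases toward 1 as r shrinks
```

The slow convergence (the constant 2 in 2r only washes out logarithmically) is already a hint of why finite-scale dimension estimates require care.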
The calculations we performed highlight two properties of pointwise dimension:

Pointwise dimension is invariant under bi-Lipschitz injections.

If is absolutely continuous with respect to Lebesgue measure on , then the pointwise dimension of at any point in its support is .
Let us now see how pointwise dimension behaves for other classes of measures. We begin by considering the Cantor measure – this is the measure resulting from the Cantor middle-thirds construction if at each scale we choose one of the remaining intervals uniformly. To be precise, for an interval , let denote the uniform measure on . At the stage of the middle-thirds construction, there are admissible intervals . Let us write
There is a limiting measure for the family which we call the Cantor measure.
is supported on the Cantor set .
For and , let us first calculate . Note that the scale of this ball in terms of the middle-thirds construction is given by . This means that
This shows that
which in turn proves that exists and is equal to .
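This calculation is easy to reproduce numerically. The Python sketch below (sample size and scale are our choices) samples from the Cantor measure by drawing ternary digits uniformly from {0, 2}, estimates the mass of a small ball around a base point of the support, and forms the log-ratio, which is close to log 2 / log 3 ≈ 0.6309:

```python
import math
import random

random.seed(0)

def sample_cantor(n_digits=25):
    # A draw from the Cantor measure: ternary digits chosen uniformly
    # from {0, 2}
    return sum(random.choice((0, 2)) / 3 ** j for j in range(1, n_digits + 1))

pts = [sample_cantor() for _ in range(20000)]
x0 = sample_cantor()  # base point in the support

# Empirical mass of the ball B(x0, 3**-k) and the corresponding
# log-ratio estimate of the pointwise dimension at x0
k = 5
r = 3.0 ** -k
mass = sum(abs(p - x0) <= r for p in pts) / len(pts)
d_est = math.log(mass) / math.log(r)
print(d_est)  # close to log 2 / log 3 ~ 0.63
```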
Of course, the Cantor measure is a special case of a much more general class of measures which we call Cantor-like measures. This class of measures was first studied by Billingsley (1960, 1961) in the context of dimension theory.
We restrict our attention here to the case of Borel probability measures on and essentially reproduce Billingsley’s definition. It is possible to give a more modern definition in terms of iterated function systems.
Let be an integer and let
be a probability vector. Divide the unit interval
into subintervals of equal length . Using our notation for the uniform measure from before, we write
We can further subdivide each of the intervals into subintervals of equal length . This allows us to define
Iterating this procedure times yields a probability measure .
The family has a limiting measure which we denote and which generalizes our earlier construction of the Cantor measure. We will now calculate the pointwise dimension of such a measure at a point .
A point is uniquely specified by a sequence of integers as
Let . The scale of the ball in terms of the construction of is given by . In fact, if we write and
(5) 
Let us count, for ,
From (5), we see that
(6) 
Let us denote by the set of for which, as ,
Then
by the strong law of large numbers. Therefore, almost everywhere with respect to the measure
, we have
(7) 
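The strong-law argument behind (7) can be checked by simulation. In the Python sketch below, the probability vector is a hypothetical choice of ours; the empirical average of -log p over the sampled digits, normalized by log n, converges to the entropy ratio predicted by (7):

```python
import math
import random

random.seed(1)

# A hypothetical probability vector over n = 2 equal subintervals; by (7),
# the a.e. pointwise dimension of the limiting measure is entropy / log n.
p = [0.2, 0.8]
n = len(p)
d_theory = -sum(q * math.log(q) for q in p) / math.log(n)

# Strong-law check: at a typical point, mu(B(x, n**-k)) is a product of
# digit probabilities, so the average of -log p_digit / log n converges.
m = 200000
digits = random.choices(range(n), weights=p, k=m)
d_emp = sum(-math.log(p[i]) for i in digits) / (m * math.log(n))
print(d_theory, d_emp)
```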
Furthermore, from (6), we can see that there are infinitely many points in the support of at which its pointwise dimension is not defined. We illustrate this idea with the following example:
Put . For each integer , define
Let denote the vector space of measures generated by for in . Let us define by
Let us define by
Finally, put
and
Then each measure is a Borel probability measure and the family converges to a Borel probability measure . It is easy to see that, for every point in the support of , we have
whereas
Therefore, at no point in the support of does it have a well-defined pointwise dimension.
More serious examples of such pathological behavior have been given by Ledrappier and Misiurewicz (1985) and by Cutler and Dawson (1990a).
Although there is much left to be said on the matter of pointwise dimension, we end this section here, hoping that we have given the reader enough of a sense of this quantity to be comfortable working with it.
II.2 Other notions of dimension
II.2.1 Hausdorff dimension
One of the most well-known notions of dimension is that of Hausdorff (1918). The notion of Hausdorff dimension is built upon the notion of a Hausdorff measure.
Let , and let . An covering of is a countable collection of sets such that and . For , put
(8) 
where the infimum is over coverings of . Then Hausdorff measure () is defined by
(9) 
Note that this limit either exists or is infinite, because is a non-increasing function of .
Observe that, for , we have the inequality
Suppose that . Then allowing in the above inequality forces . Therefore there exists a unique critical exponent such that
(10) 
Definition 2.
For a set , we call the unique exponent satisfying (10) its Hausdorff dimension, and denote it by .
Hausdorff dimension has the following properties:

Hausdorff dimension is invariant under locally bi-Lipschitz injections,

for a decomposition of into countably many sets , we have ,

the Hausdorff dimension of a ball in is .
Broadly speaking, Hausdorff dimension provides a measure of complexity for sets which is more finely stratified than, say, Lebesgue measure. For example, the Cantor set has Lebesgue measure zero but Hausdorff dimension . This makes Hausdorff dimension a very useful tool in the sciences.
We now discuss certain developments around Hausdorff dimension in the numerical study of dynamical systems.
Although there is no rigorous criterion by which one may classify a set as fractal, a fractional Hausdorff dimension is one important indicator of such structure. As a consequence, Hausdorff dimension has long been used to detect chaotic behavior in dynamical systems (see, for example, Ott (2002)). For example, the Hausdorff dimension of a smooth attractor is a geometric invariant, and therefore one may use it to classify such attractors.
In practice, Hausdorff dimension is difficult to estimate from data. One of the reasons for this is the definition of the Hausdorff measures themselves (Eq. 9). Given a set , effective estimation of requires one to produce an covering of which is close to optimal. The problem with this is that at different places such an covering would employ sets of different diameters.
Furthermore, it is difficult to identify the critical exponent with only estimates of the measures.
This is one reason that Hausdorff dimension is difficult to estimate.
Another reason comes from property 2 of Hausdorff dimension listed above. The set of points in which dictate its Hausdorff dimension may be relatively “small” which presents difficulty in the estimation.
Most numerical estimators of dimension have sought to evade these difficulties by instead estimating quantities which merely provide approximations to Hausdorff dimension. The most well-known of these approximations are capacity (box-counting) dimension, correlation dimension, and information dimension. We now present each of these in turn, along with some estimators.
II.2.2 Box-counting dimension
The idea behind box-counting dimension is that the dimension of a set is related to the asymptotic behavior, as , of the number of cubes of side length required to cover it. For example, for , one requires cubes of side length to cover the unit cube in . This allows us to recover the dimension as
This motivates the following general definition:
Definition 3.
Let . Let us denote by the minimal number of cubes (or ‘boxes’) of side length required to cover .
We define the lower box-counting dimension of by
Similarly, we define the upper box-counting dimension of by
If these quantities agree, we call the common value the box-counting dimension of , and denote it by .
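As a numerical illustration of Definition 3 (the set, sample size, and scale are our choices), the sketch below counts the boxes of side 3^-k occupied by a sample of the middle-thirds Cantor set; the resulting ratio is close to log 2 / log 3:

```python
import math
import random

random.seed(2)

# Box counting for the middle-thirds Cantor set, sketched by sampling
# points of the set and counting occupied grid boxes of side 3**-k.
pts = [sum(random.choice((0, 2)) / 3 ** j for j in range(1, 21))
       for _ in range(50000)]

def n_boxes(points, eps):
    # Number of grid boxes of side eps hit by the sample; the small
    # shift guards against floating-point round-off at box boundaries
    return len({int(x / eps + 1e-9) for x in points})

k = 5
eps = 3.0 ** -k
d_box = math.log(n_boxes(pts, eps)) / math.log(1.0 / eps)
print(d_box)  # close to log 2 / log 3 ~ 0.63
```

Here the box index at scale 3^-k is exactly the first k ternary digits of a point, so for a large sample all 2^k admissible boxes are hit and the ratio is k log 2 / k log 3.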
If, in the definition of Hausdorff dimension, we take the infimum in (8) over covers of where each has a diameter of exactly , and if the corresponding limit (9) exists, we recover the box-counting dimension of .
Incidentally, the above relationship between the definitions of Hausdorff dimension and box-counting dimension shows that, for any ,
(11) 
Frequently, it is the case that the inequality in (11) is an equality. In a sense, it is the situations in which the inequality is strict which make box-counting dimension a poor notion of dimension. For example, box-counting dimension does not satisfy anything like Property (2) of Hausdorff dimension – one cannot calculate the box-counting dimension of a set using a countable decomposition of that set. As an illustration, let . As is dense in , any covering of by balls must also cover . Therefore, . On the other hand, if we enumerate and write
we have for all .
Despite its flaws, box-counting dimension remains popular because it is so easy to estimate. We will discuss the estimation of box-counting dimension in Section III.
II.2.3 Correlation Dimension
Grassberger and Procaccia (1983) proposed a method for estimating fractal dimension.
However, it was not originally clear what kind of dimension they were estimating. Since then, this notion has been formalized to produce what we know today as correlation dimension – see, for example, Pesin (1993).
In this section, we give a brief account of this historical development.
The basic idea of Grassberger and Procaccia was very similar to the idea which motivated our description of pointwise dimension in Section II.1.
The problem that they considered was that of estimating some kind of fractal dimension for an attractor in a dynamical system given a finite number of iterates of some initial values.
They reasoned that the rate at which the cardinality of intersections of balls around the data points with the data decayed would provide a reasonable notion of dimension provided that such a rate was welldefined.
To be precise, given the data
let us fix a norm on , and define the correlation sum
(12) 
where
denotes the characteristic function of the interval
. Grassberger and Procaccia (1983) suggested that estimating the rate of decay of with would provide a good notion of dimension for the set . While this technique was in some sense very easy to apply numerically, it lacked a rigorous mathematical foundation – there was no known notion of dimension that the Grassberger–Procaccia method (GP) could be proved to estimate in general. Implicitly, Grassberger and Procaccia had introduced a completely new notion of dimension – correlation dimension. The previously cited paper of Pesin (1993) develops correlation dimension from a vague concept into a rigorous quantity.
One difficulty in formalizing the concept of correlation dimension is that correlation sums (12)
are defined for finite sets of data.
For example, this prevents us from defining the correlation dimension as simply a limit of the correlation sums as . In fact, this particular point introduces much difficulty even in numerical estimation. We will discuss this in detail in Section III.
For the purposes of rigorous formulation, another problem with the finiteness of data is that it is not clear what kind of underlying process of data generation one should consider when trying to define correlation dimension.
In this paper, we will assume that the data is sampled according to a probability measure , and therefore we define the correlation dimension of .
Let be a probability measure, and let
be a sequence of independent random variables with distribution
. Let us define the correlation sums as in (12). Then, for a fixed , the limit
(13) 
exists almost surely by the strong law of large numbers, and is equal to the probability that .
Definition 4.
The limit
if it exists, is called the correlation dimension of .
If is a Borel probability measure on , then let us write for
In this case, we have
and so
This suggests that there is an entire family of dimensions related to the correlation dimension. These dimensions are called the generalized dimensions; they were first introduced by Hentschel and Procaccia (1983), in a manner similar to that in which correlation dimension was introduced by Grassberger and Procaccia (1983). Formally, we define the generalized dimension of , if it exists, to be
(14) 
We take the norm in above in order to be consistent with the notation used in the numerical literature.
The reason for this convention is apparent from the derivation of Hentschel and Procaccia (1983).
In this paper, we will be concerned mainly with the correlation dimension although much of our discussion of correlation dimension carries over directly to the case of the generalized dimension.
Note that, in their definitions, there is quite a bit of similarity between the correlation dimension of (Definition 4) and its pointwise dimension at some generic point in its support (Definition 1). For example, Cutler (1993) defines the average pointwise dimension of a Borel probability measure on by
She points out that, while reflects the average of the local rates of decay of the measure of balls around points in the support of , reflects the rate of decay of the average measure of balls around points in the support of .
This seems to have been a source of confusion in the past.
We urge the reader to be wary both in reading about and applying these concepts.
To expand upon this, suppose that for each we have a Borel probability measure on with associated such that for each we have
Let us further assume that the supports of and are contained in disjoint open sets in and that . Finally, let us define
As we will see in (16), for each , we have
Now, we also have, for ,
Thus, for , we have
Therefore,
(15) 
This already highlights one of the major problems with correlation dimension – it cannot even be used to tell apart a simple mixture of measures such as from its lowest-dimensional component! We call this the dimension blindness of correlation dimension. This poses a serious problem for relying upon estimators of correlation dimension in the analysis of data. We seek to address this problem with our estimator.
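The blindness computed in (15) can be seen numerically. In the sketch below (a Python illustration under our own choice of mixture), half of the sample is drawn from a point mass and half from the uniform measure on the unit interval; the correlation sums then flatten out near 1/4 at small scales, so the estimated log-log slope is close to 0, the dimension of the lowest-dimensional component, rather than to any average of 0 and 1:

```python
import math
import random

random.seed(3)

# Mixture of a point mass at 1/2 and the uniform measure on [0, 1]:
# pairs drawn from the atom sit at distance 0, so C(r) stays near 1/4
# for every r > 0 and the log-log slope tends to 0.
xs = [0.5 if random.random() < 0.5 else random.random() for _ in range(1500)]

def corr_sum(points, r):
    # Correlation sum as in (12): fraction of pairs within distance r
    n = len(points)
    count = sum(abs(points[i] - points[j]) <= r
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * count / (n * (n - 1))

r1, r2 = 1e-4, 1e-2
slope = (math.log(corr_sum(xs, r2)) - math.log(corr_sum(xs, r1))) \
        / (math.log(r2) - math.log(r1))
print(slope)  # near 0, blind to the uniform component
```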
II.3 Dimensions of Measures
Although Hausdorff dimension, as we have defined it, provides a measure of complexity for sets, it is easy to derive from it a notion of dimension for measures. We follow here the development of Cutler (1993).
Definition 5.
Let be a Borel probability measure on . We define the Hausdorff dimension distribution of , which is a Borel probability measure on , by
We define the Hausdorff dimension of to be
At this point, we have described in some detail three notions of dimension associated with a Borel probability measure on – pointwise dimension, correlation dimension, and Hausdorff dimension. A lot is known about the relationships between these three dimensions, particularly when has constant pointwise dimension
for almost every in its support. Young (1982) proved that, when this is the case, we have
Pesin (1993) proved that, in the same case, we have
(16) 
In fact, the results of Young and Pesin are a bit more general. Cutler has explored in greater detail the relationship between the distribution of pointwise dimensions of and its corresponding Hausdorff dimension distribution :
Theorem 1 (Cutler (1993), Theorem 3.2.2).
Let be a Borel probability measure on . Let us consider the following subset of the support of :
Then , and .
To give a rough idea of the kind of reasoning required to prove these results, let us consider (16). Suppose that satisfies the following condition: there exist constants and such that, for all points in the support of and all ,
(17) 
Then it is easy to see this since, for , we have
For a general Borel probability measure with almost everywhere constant pointwise dimension , there is no such guarantee of uniformity in local scaling.
The trick to proving the general result is to observe that any such measure can be written as a limit of measures satisfying the condition (17).
To see this, we can restrict to those points in its support at which the condition is satisfied for various values of and ; the measure is a limit of such restrictions.
The condition that has constant pointwise dimension almost everywhere in its support is not an exceptional one as far as dynamical data is concerned – for example, if is ergodic for a map , then this condition is satisfied under quite general constraints on . For more details, refer to Cutler and Dawson (1990a) and Cutler (1992a). We state here a simplified version of Cutler’s results:
Proposition 1.
Let be a locally bi-Lipschitz, ergodic map with ergodic measure for which the pointwise dimension exists almost everywhere. Then has constant pointwise dimension almost everywhere in its support.
Proof.
This can form the basis for a test of ergodicity. We will discuss this in Section VI.3.
III Numerical Estimation
In this section, we demonstrate how one may use the idea of Grassberger and Procaccia (1983) to estimate the correlation dimension of a measure. We call such methods GP estimators.
Currently, GP estimators seem to be the most frequently used estimators of fractal dimension in the applied sciences. We begin by presenting the data to which we will apply the estimator, along with a theoretical analysis.
We then describe how one would design a GP estimator for this data.
We present the results of an analysis of the data with such an estimator.
Finally, we conclude by discussing some of the drawbacks of GP estimators.
This will motivate our estimator of pointwise dimension, which we present in the next section.
III.1 The data
We will use three measures in our demonstration of the GP estimator. The first probability measure that we use is the Cantor measure , which we defined in Section II.1. The second measure we use is
where is the measure derived from by mapping the interval linearly onto . This represents the image of
under a piecewise linear transformation. The third measure that we use is a mixture of two measures with different dimensional behavior:
where denotes the pointmass at .
Before we begin our analysis, we must calculate the correlation dimensions of each of these measures. The calculation for the first two measures follows from (16) upon observing that the pointwise dimension of those measures is equal to almost everywhere in their supports. For the measure , observe that
and
The calculation of the correlation dimension of essentially follows the calculation (15), and we have
Once again, dimension blindness rears its ugly head. Still, it is interesting to see how it manifests itself in numerical estimation.
III.2 Grassberger–Procaccia Estimators
Suppose that we are given a dataset of points sampled from a Borel probability measure on , and that we wish to use this data to estimate the correlation dimension of .
The basic idea in any GP estimator of correlation dimension is to use the correlation sums
defined in (12) at various scales to perform this estimation according to Definition 4.
Since the data is only available at a coarse resolution, some delicacy is required in inferring the limiting behavior of the correlation sums from the few which are available from the data.
What one often does is evaluate correlation sums at various scales, decide which range of scales contains the most information about the correlation dimension, and finally use a linear model to estimate the quantity
over this range. This burden on the analyst to choose an informative range of scales is problematic as it can potentially be a large source of bias in the analysis. We will experience the difficulty of making this choice in our analysis of the test data.
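To make the procedure just described concrete, here is a minimal GP-style estimator in Python: it evaluates the correlation sums (12) over a hand-picked range of scales and fits a least-squares line to the (log r, log C(r)) pairs. The test data (a sample of the Cantor measure, whose correlation dimension is log 2 / log 3) and the choice of scaling region are our assumptions for illustration:

```python
import math
import random

random.seed(4)

# Sample of the Cantor measure: ternary digits uniform on {0, 2}
xs = [sum(random.choice((0, 2)) / 3 ** j for j in range(1, 21))
      for _ in range(1000)]

def corr_sum(points, r):
    # Correlation sum (12): fraction of pairs within distance r
    n = len(points)
    count = sum(abs(points[i] - points[j]) <= r
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * count / (n * (n - 1))

# An analyst-chosen "informative" range of scales, 3**-6 .. 3**-2
scales = [3.0 ** -k for k in range(2, 7)]
lx = [math.log(r) for r in scales]
ly = [math.log(corr_sum(xs, r)) for r in scales]

# Least-squares slope of log C(r) against log r over that range
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) \
        / sum((a - mx) ** 2 for a in lx)
print(slope)  # close to log 2 / log 3 ~ 0.63
```

Note that the choice of the range in `scales` is exactly the subjective step discussed above; shifting it toward very large or very small r degrades the estimate.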
III.3 Analysis
In Figure 1, we present graphs of as functions of for each of the measures of Section III.1.
We sampled 10,000 points from each measure up to a precision of 30 ternary digits for this analysis.
We have indicated our choices of informative regions on each graph by shading them.
The horizontal line reflects the true correlation dimension of in each graph.
Observe that, after taking logs of the correlation sums and the parameters , the informative range of scales is the one over which the regression line for the data is closest to horizontal and the total least-squares error is relatively small. It is difficult to state a more precise criterion for choosing this region.
For example, in Figure 1(b), there seems to be no clear choice of informative scaling region.
From Figure 1, we obtain the following estimates (the ground truth is ):

,

,

.
From Figure 1(c), one may be tempted to claim that the effects of the mixture are observable – there is a distinct bump in the graph over the range .
Such a claim is not easily justifiable, however, for the same behavior is present in Figure 1(a).
Figure 1(b) is of particular interest, for it shows that the GP estimates of correlation dimension are not sensitive even to piecewise linear transformations of the data.
IV The Estimator
IV.1 Guiding Principles
We observed, in Section III, some of the difficulties involved in estimating dimension using a GP estimator. Broadly, these difficulties arise from two sources:

The dimension blindness of correlation dimension.

The onus on the user to choose the correct scales in the analysis.
These are important issues to address when designing a new estimator.
Bearing in mind the dimension blindness of correlation dimension, it seems desirable to instead estimate a dimension which is more sensitive to the local structure of a given probability measure.
Therefore, we choose to estimate pointwise dimension.
This enables us to treat the dimension as a distribution, which has certain statistical advantages beyond eliminating blindness to varying dimensions. This will become clear in our numerical analysis of the estimator.
Overcoming the problem of the sensitivity of data to scale is much more difficult.
The reason that this sensitivity is such a problem in GP estimators is that the method of Grassberger and Procaccia (1983) explicitly depends upon limiting behavior to produce good estimates.
Addressing this problem in the case of an estimator of pointwise dimension is not easy because pointwise dimension is fundamentally a local quantity.
We utilize the idea that distances of data points to their nearest neighbors contain information about local dimensions of the measure that the data is sampled from.
We expand upon this idea in more detail in the next section.
Finally, attempting to estimate pointwise dimension carries with it certain difficulties which are not present when estimating correlation dimension.
The principal problem is that getting a true sense of the distribution of pointwise dimensions over the data can be computationally very expensive.
As a result of this computational difficulty, although schemes to estimate pointwise dimension have previously been suggested (for example, Cutler (1993)), there has been no large-scale implementation that we are aware of. In our estimator, we utilize clustering techniques to mitigate this cost.
The idea is to identify data points at which the underlying measure seems to have similar scaling properties.
We discuss this in greater detail in Sections IV.3 and IV.4.
Finally, one situation in which it was difficult to choose scales for analysis was when the data had been transformed by a locally bi-Lipschitz injection.
Although we have not emphasized this point very much in our discussion so far, it was a primary motivation in developing our estimator.
As pointwise dimension is a purely local quantity, an estimator of pointwise dimension should naturally be less sensitive to the effects of such transformations.
In fact, we incorporate such insensitivity to local scaling in our estimator in the form of local density parameters; this is certainly something that the reader should be aware of as they proceed through this paper.
To summarize, there are four points that the reader should keep in mind throughout the rest of this paper:

Pointwise dimension: We estimate pointwise dimension, which is sensitive to differences in scaling behavior over the data.

Limit-free description: We utilize a limit-free description of pointwise dimension, which makes estimation more effective.

Transformation invariance: We introduce a local density parameter which gives our estimator the flexibility to cope with data which has been transformed by a locally bi-Lipschitz injection.

Clustering: We use clustering techniques to identify data points with similar pointwise dimension characteristics, which saves on computational cost.
IV.2 A limit-free description of pointwise dimension
In this section, we develop an estimator of pointwise dimension which utilizes the distances from data points to their nearest neighbors.
The distributions of these distances are approximated by waiting time distributions for Poisson processes.
Such approximations provide an effective means of estimation for a large class of generating measures .
Cutler and Dawson (1989, 1990b) have obtained similar results for different classes of measures using very different techniques.
Let us begin by considering a Borel probability measure which satisfies the following local uniformity condition:
Suppose that the pointwise dimension is defined at every point in the support of . Suppose further that, at every such point , there are constants and such that, for all ,
(18) 
The main difficulty in obtaining the limit-free description of is
that it is difficult to make general statements about the infinitesimal behavior of at .
We will show that,
as far as measures satisfying the local uniformity condition are concerned,
one may obtain effective estimates of pointwise dimension by assuming
that the infinitesimal behavior is that of a Poisson process.
Let be a Borel probability measure on satisfying the local uniformity condition (18) above, and suppose we would like to estimate given some data sampled from . Let us say that the question of whether or not there is a data point at distance from is determined by a Poisson process with rate . Explicitly, the process counts the number of data points at distance up to from , and has density function
(19) 
Let us define, for , the waiting times
These are the distances from to its nearest data points. The probability densities of the random variables can be easily calculated from that of as
(20) 
The special case
has been derived by Cutler (1992b) as a log-gamma distribution (see also Cutler (1993), Definition 7.2.1) and by Badii and Politi (1985). Essentially the same form as (19) is also used by Nerenberg and Essex (1990) in the context of error analysis of correlation dimension.
Since satisfies the local uniformity condition, it looks like a uniform measure in a small ball around .
Specifically, it looks like a uniform measure which arises from the Poisson process described above by conditioning on the region where the data is sampled.
As a result, the distribution corresponding to (20) is a good approximation to the true nearest neighbor distribution at small scales.
This means that, if the data is fine enough, (20) provides an effective way of estimating the pointwise dimension.
As we will see in our analysis of our estimator, this sensitivity of the approximation to scale is not very marked.
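One concrete way to extract an estimate from the waiting-time densities (20) is maximum likelihood: under the Poisson approximation with rate θ r^d, the likelihood of the k nearest-neighbor distances r_1 ≤ … ≤ r_k at a point yields the closed form d̂ = (k - 1) / Σ_j log(r_k / r_j). We stress that this particular closed form is offered only as an illustration of the nearest-neighbor idea, not as the estimator developed in this paper. Applied in Python to uniform data in the unit square, where the true pointwise dimension is 2:

```python
import math
import random

random.seed(5)

# Data with known pointwise dimension 2: uniform points in the unit square
pts = [(random.random(), random.random()) for _ in range(3000)]

def local_dim(p, points, k=20):
    # MLE for d under the Poisson approximation with rate theta * r**d,
    # using the k nearest-neighbor distances r_1 <= ... <= r_k at p:
    #     d_hat = (k - 1) / sum_j log(r_k / r_j)
    ds = sorted(math.dist(p, q) for q in points if q is not p)[:k]
    return (k - 1) / sum(math.log(ds[-1] / d) for d in ds[:-1])

base = random.sample(pts, 100)
d_hat = sum(local_dim(p, pts) for p in base) / len(base)
print(d_hat)  # close to 2
```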
We end this section by showing how the limit-free description we have given above corresponds to a more conventional derivation of pointwise dimension.
Let us denote the distribution function of (20) by . Specifically, it is given by an incomplete gamma function:
(21) 
In fact, gives the same asymptotic scaling behavior as . Namely, using L’Hôpital’s rule, we get
(22) 
As the densities are allowed to vary, and since we are trying to estimate the distribution of pointwise dimensions of , it will be useful for us to distinguish between the distributions corresponding to different choices of parameters in (18). Therefore, we consider the conditional distributions
We denote their corresponding probability density functions by
In practice, the sample probability density is often a better approximation to the true probability density of the distribution being sampled from than the sample distribution function is to the true distribution function.
Thus we seek to approximate the probability density functions
from the data given to us.
For a given ,
we call the estimate this gives us the nearest-neighbor dimension.
As we have seen, the nearest-neighbor dimensions provide good estimates of pointwise dimensions for measures satisfying the local uniformity condition (18).
We remark that there are certainly pathological measures for which this condition does not hold – for example, the measures defined in Billingsley (1961), which Cutler (1993) terms “fractal measures”.
It is an interesting problem to determine how one may give a limit-free description of pointwise dimension in the case of such measures.
As a final point of note, it is the densities in this description which enable us to build into our algorithm the desired insensitivity to locally bi-Lipschitz injections of the data, which we discussed in the previous section.
iv.3 Algorithm Overview
Suppose that we are given a sequence
in of data sampled from a Borel probability measure . For each and each , let us denote by the distance from to its closest neighbor in . For each , we define the sequence
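As an illustration of the nearest-neighbor distances just introduced, here is a minimal brute-force sketch; the function name and array layout are ours, not the paper's, and a k-d tree would replace the quadratic loop for large samples:

```python
import numpy as np

def nearest_neighbor_distances(points):
    """For each point, the distance to its closest neighbor in the sample.

    `points` is an (N, d) array. This is a brute-force O(N^2) sketch,
    adequate for illustration only.
    """
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances; mask the diagonal (self-distances)
    # with +inf before taking the row-wise minimum.
    diff = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)

# Example: four points on a line at 0, 1, 3, 7.
eps = nearest_neighbor_distances([[0.0], [1.0], [3.0], [7.0]])
# eps = [1, 1, 2, 4]
```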
In our algorithm, a parameter is specified at the outset. Our goal will be to use the nearest neighbor distances of the data in order to obtain an approximation
(23) 
where

— , .

— Each is a Borel probability measure satisfying the local uniformity condition of (18).

— Each measure has constant pointwise dimension and constant density for all points in its support.

— The measures have disjoint supports.
Given (23), we can partition the data into clusters defined by the supports of the measures ; the weight represents the proportion of the data in the cluster corresponding to .
The quality of an approximation of the type specified by (23) depends on the choice of .
For example, if , then any dimension estimate derived from the approximation reflects an aggregate of the local scaling behaviors of around the data points. On the other hand, if (the size of the data ), then one obtains a clear picture of the individual scaling behaviors, but at great cost. Often it is not necessary to go to this latter extreme in order to estimate the distribution of pointwise dimensions of at the data points.
An important feature of our algorithm is that it chooses the clusters and the number of clusters adaptively, balancing concerns of efficiency with those of clarity.
We will describe this in greater detail in Section IV.4.5.
The bigger problem is actually producing the measures .
In order to do this, we will use a Variational Bayesian method.
The idea is that, since each satisfies the local uniformity condition (18) with constant dimension and density, the nearest-neighbor distribution for each cluster of the data should have a probability density function which is approximately from (20) for the appropriate choice of parameters and .
As a part of this procedure, we use the density functions to produce posterior distributions for the parameters and having made a choice of prior distributions.
It is important to note that this forces us to treat the dimension (as well as the other parameters) implicitly as random variables. It is also worth noting that the choice of prior distributions for the relevant parameters can have a marked impact on the performance of the estimator when the data is sparse.
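As a concrete instance of such a posterior update, suppose (our assumption, standing in for (20)) that the within-cluster nearest-neighbor distances have the Weibull-type likelihood p(e) = α·θ·e^(α−1)·exp(−θ·e^α). Then, with the dimension α held fixed, a gamma prior on the density parameter θ is conjugate and the update has closed form:

```python
import numpy as np

def theta_posterior(eps, alpha, a0, b0):
    """Gamma posterior for the density parameter theta.

    Assumes the likelihood p(e) = alpha * theta * e**(alpha - 1)
    * exp(-theta * e**alpha) with the dimension alpha held fixed,
    and the prior theta ~ Gamma(shape=a0, rate=b0). The likelihood is
    proportional to theta**n * exp(-theta * sum(e**alpha)), so the
    posterior is Gamma(a0 + n, b0 + sum(e**alpha)).
    """
    eps = np.asarray(eps, dtype=float)
    return a0 + eps.size, b0 + np.sum(eps ** alpha)

# Two distances of 0.5 with alpha = 2: shape 1 + 2 = 3, rate 1 + 0.5 = 1.5.
a_post, b_post = theta_posterior([0.5, 0.5], alpha=2.0, a0=1.0, b0=1.0)
```

When the dimension is itself uncertain, as in the paper's setting, no such closed form is available jointly, which is one reason a variational treatment is attractive.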
The cluster parameters – the number of clusters , the cluster assignments , and consequently the weights – are estimated concurrently with the parameters and . In order to do this, we make use of the latent cluster assignment variables
Most of the optimization in our algorithm is conditioned on the number of clusters .
In these steps, the variables that we seek to estimate from the data are , and .
Our estimates for the variables are derived from those.
Thus we require prior distributions for , and .
We assume that the variables and follow independent gamma distributions with parameters and , respectively.
For each integer , we assume that the vector follows a Dirichlet distribution with the parameter .
Figure 2 provides a graphical model depicting the dependencies among these variables.
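A minimal sketch of drawing from the assumed priors; the hyperparameter values here are placeholders rather than the paper's, and note that NumPy parameterizes the gamma distribution by scale (the reciprocal of the rate):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3  # a fixed number of clusters

# Independent gamma priors for the per-cluster dimension and density
# parameters. Shape/scale values are illustrative placeholders.
alpha = rng.gamma(shape=2.0, scale=1.0, size=K)  # dimension parameters
theta = rng.gamma(shape=2.0, scale=1.0, size=K)  # density parameters

# Symmetric Dirichlet prior for the mixture weights.
w = rng.dirichlet(np.ones(K))
```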
iv.3.1 Glossary
Data:

— a sequence of data points in sampled from the Borel probability measure for which we are trying to estimate the distribution of pointwise dimensions.

— the sequence of distances from each of the data points in to its nearest neighbor.
Basic variables used in estimation:
Hyperparameters:

— the parameters for the gamma prior of the dimension parameter .

— the parameters for the gamma prior of the density parameter .

— the parameters for the Dirichlet prior of the weight parameters .
Some shorthand:

— a list of the basic variables apart from .

— the list of all the basic variables.

— the list of hyperparameters for a fixed number of clusters.

— here, the number of clusters is allowed to vary.
iv.3.2 The Objective
The goal of the procedure described above is to compute the posterior distribution for the list of variables given and :
(24) 
iv.4 The Variational Bayesian Algorithm
iv.4.1 Phases
As we have a clean description (20) of the relationship between the basic variables, we can feasibly use the variational Bayesian method presented in Jordan et al. (1997); Attias (1999); Beal (2003); Ghahramani and Beal (1999) to estimate the posterior distribution (24). There are two distinct phases in our application of this method:

The fixed phase — here, we estimate the posterior distribution conditioned upon the number of clusters .
Although we treat these phases separately, there is considerable interplay between them as we iteratively update our estimate of the posterior (24).
In the following section, we describe the connection between the two phases of our procedure – the objective function that our estimator seeks to maximize.
iv.4.2 The Objective Function
At each step, and for each component of , we distinguish between the true marginal posterior distribution derived from (24) and our estimates of this distribution , which we also denote for convenience as we are not varying the hyperparameters . Finally, and this is crucial, we assume that these marginal distributions of are independent for each fixed number of clusters :
(25) 
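This factorization assumption can be seen in a toy discrete setting: replacing a joint posterior over two latent variables by the product of its marginals preserves normalization but discards the correlation between them. The numbers below are illustrative only:

```python
import numpy as np

# A joint posterior p(z1, z2 | data) over two binary latents
# (rows index z1, columns index z2).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
q1 = joint.sum(axis=1)   # marginal factor for z1
q2 = joint.sum(axis=0)   # marginal factor for z2
q = np.outer(q1, q2)     # the factorized (mean-field) surrogate

# q still sums to 1, but generally q != joint: the independence
# assumption throws away the z1-z2 correlation.
```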
We can estimate the quality of the approximation by considering the likelihood , which measures how well the data is explained by the hyperparameters . To begin with, let us write
(26) 
and let us denote by
the Kullback-Leibler divergence
(27) 
We have the decomposition
(28) 
Our goal, in the fixed phase of the procedure, is to find the distribution which maximizes the objective function . By the decomposition above, this is equivalent to minimizing the Kullback-Leibler divergence , which measures the error in our approximation.
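The decomposition (28) can be checked numerically in a toy discrete model: for any choice of variational distribution, the log evidence equals the objective (the evidence lower bound) plus the Kullback-Leibler divergence to the true posterior. All numbers below are illustrative:

```python
import numpy as np

# A single binary latent z with a prior and a likelihood for observed data D.
p_z = np.array([0.4, 0.6])          # prior p(z)
p_d_given_z = np.array([0.9, 0.2])  # likelihood p(D | z)
p_joint = p_z * p_d_given_z         # p(D, z)
evidence = p_joint.sum()            # p(D)
posterior = p_joint / evidence      # p(z | D)

q = np.array([0.5, 0.5])            # an arbitrary variational distribution
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))      # the objective
kl = np.sum(q * (np.log(q) - np.log(posterior)))      # the approximation error

# log p(D) = ELBO + KL, for ANY q; maximizing the ELBO over q therefore
# minimizes the KL divergence, since the evidence does not depend on q.
```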
In our implementation, we use the explicit description of the objective functions to perform the optimization. We will describe the details for each of the phases separately once we have presented explicit descriptions of the relevant prior distributions, to which the following section is devoted.
iv.4.3 The Prior Distributions
In this section, we present expressions for the prior distributions which we will make use of in our implementation. We also describe the distribution of the data conditioned on the basic variables which we use in conjunction with the priors to derive the posterior distribution .
We begin by describing our assumed prior distributions for the basic parameters given a fixed number of clusters . We stated our assumptions about these parameters at the end of Section IV.3, but we provide here explicit expressions of their probability densities:

The density parameters — The density parameter is assumed to follow a gamma distribution with the shape parameter and the rate parameter :
(29) 
The dimension parameters — The dimension parameter is assumed to follow a gamma distribution with the shape parameter and the rate parameter :
(30) 
The weight parameters — The weight parameter follows a Dirichlet distribution with the exponents :
(31) 
The cluster indicators — The cluster indicators follow a multinomial distribution with the probabilities :
(32)
Finally, the conditional probability of drawing the samples given a set of basic variables is