Multiscale geometric feature extraction for high-dimensional and non-Euclidean data with application

Gabriel Chandler, et al. (11/26/2018)

A method for extracting multiscale geometric features from a data cloud is proposed and analyzed. The basic idea is to map each pair of data points into a real-valued feature function defined on [0,1]. The construction of these feature functions is heavily based on geometric considerations, which has the benefit of enhancing interpretability. Further statistical analysis is then based on the collection of the feature functions. The potential of the method is illustrated by different applications, including classification of high-dimensional and non-Euclidean data. For continuous data in Euclidean space, our feature functions contain information about the underlying density at a given base point (small scale features), and also about the depth of the base point (large scale feature). As shown by our theoretical investigations, the method combats the curse of dimensionality, and it also shows some adaptivity towards sparsity. Connections to other concepts, such as random set theory, localized depth measures and nonlinear multidimensional scaling, are also explored.


1 Introduction

Extracting qualitative features from a multivariate data cloud is an important task that has received a lot of attention in the literature. A popular approach for the construction of such methods is based on the kernel trick (e.g. see Shawe-Taylor and Cristianini, 2004), that is, mapping the observed data points into a usually infinite-dimensional feature space. In this work, we propose a new feature extraction method that, similar to the kernel trick, also maps points to an infinite-dimensional feature space. However, the starting point of our method is the construction of a matrix of feature functions: a feature function is constructed for each pair of data points. By construction, these feature functions encode certain information about the geometry and shape of the point cloud. This geometric aspect underlying our construction lends intuition to the resulting methodology and enhances interpretability. Moreover, our feature functions are real-valued functions of a one-dimensional variable, even for multivariate or infinite-dimensional data. Consequently, these functions can be plotted, resulting in a visualization tool. The contributions of this work can be summarized as follows:

  • Present a methodology for extracting information about the shape of the underlying distribution based on an iid sample of size n from a Hilbert space, and accomplish this in a multi-scale fashion;

  • Provide supporting visualization tools;

  • Introduce the novel idea of a distribution of local depth values of a given point;

  • Provide supporting theory for the case of Euclidean data, which in particular shows that the estimation approach to a certain extent combats the curse of dimensionality, and that there is some adaptivity to sparsity;


  • Show usefulness of the extracted features through various applications, including classification;

  • Implicitly, some fundamental challenges in non-parametric statistics are addressed: How to choose smoothing/tuning parameters? What is the ‘right’ scale in high dimensions? How to find informative regions or directions in high dimensions?

To heuristically describe the underlying novel idea, assume that the data lie in a Euclidean space. For each point, we will construct a distribution of depth values, and the corresponding quantile function will be the feature function associated with that point. The distribution of depths is obtained by randomly selecting subsets of the space containing the point, and by then finding the depth of the point within these subsets. In order to avoid the curse of dimensionality, the random sets considered here are parametrized by a finite-dimensional parameter. The notion of depth used here is specified below. The consideration of distributions of depths was motivated by the work of Ting et al. (2012), who consider random split points in the context of what they call ‘mass estimation’.

By construction, our depth distribution is such that small quantiles (small scales) provide information about the value of the density, and large quantiles (large scales) contain information about the global depth (cf. Lemma 3.2). Intermediate scales might be most important for feature extraction, in particular in high dimensions. A related multi-scale approach to classification is described in Dutta and Ghosh (2015); their approach is also based on the idea of local depth.

The relation between (local) depth and density warrants some more discussion. First, a well-known conceptual difference between a density and a classical depth measure is that the former is a local and the latter a global quantity. The global nature of a depth function has the advantage of leading to fast (parametric) rates of convergence. On the other hand, depth contours corresponding to standard multivariate depth functions are known to be convex, no matter what the underlying density looks like. While this ‘insensitivity’ of the shape of the depth contours is conceptually well motivated and also suggests some robustness properties, the inflexibility of the depth contours means that depth-based methods do not adapt well to the underlying distribution. As a result, statistical procedures defined via depth functionals, such as depth classifiers, are often non-optimal. This has led to efforts to consider refined notions of data depth in order to increase flexibility and to allow for non-convex depth contours. Conceptually, these efforts can be considered as ‘localizing depth’. For recent work in this direction see Agostinelli and Romanazzi (2011), Paindaveine and van Bever (2012), and Dutta and Ghosh (2015). There, depth is defined within neighborhoods of a given point, and it might therefore not be too surprising that, in the limit as the neighborhood size tends to zero, the depth measure is related to the density.

A similar phenomenon comes into play in our approach, even though our ‘neighborhoods’ are not local. This is due to a combination of the choice of subsets and the notion of depth that we are using. Another important point to make here is that we do not consider the quantile level as a tuning parameter. Instead, we consider the entire quantile function. In some sense this is similar to the concepts underlying the mode tree (Minnotte and Scott, 1993), SiZer (Chaudhuri and Marron, 2000), or the Betti curves considered in topological data analysis (TDA). These methods also do not choose a tuning (smoothing) parameter, but instead consider all possible values of the smoothing parameter. However, in those situations, the tuning parameter is the bandwidth of a kernel density estimator, and for both small and large values of the bandwidth, the kernel estimators do not contain useful information, as they are degenerate. This is in stark contrast to the methodology presented here, where all values of the tuning parameter lead to non-degenerate estimators, and thus to useful information. Also, the other methods just mentioned summarize the entire data cloud in essentially one function (or in a few, as in the case of TDA). In contrast, our approach constructs functions for (pairs of) individual data points.

As hinted above, in contrast to the kernel trick, our construction is not related to an RKHS. Instead, our construction is more directly guided by the objective to extract geometric information. The computational burden might be somewhat higher than for RKHS-based methodologies, but given the currently available computing power, it is manageable. The algorithmic complexity of our procedure is quadratic in the sample size and linear in the dimension, meaning that a high dimension is not the major computational challenge. It turns out that, to a certain extent, the same applies to the statistical behavior.

The remainder of the paper is organized as follows. The construction of our feature functions is described in Section 2, where some visualization aspects are also discussed. Section 3 introduces population versions of our feature functions and discusses some of their properties. Several applications of our approach are presented in Section 4. Some relations to random set theory are discussed in Section 5. Section 6 discusses some aspects underlying the choice of tuning parameters. Section 7 presents a theoretical analysis of the behavior of our feature functions, including the robustness to the curse of dimensionality and some adaptation to sparsity; we also derive their asymptotic distribution, both pointwise and as a process. Proofs can be found in Section 8.

2 Construction of the feature functions for data in Euclidean space

Our feature functions are quantile functions of a distribution of depths of a given anchor point, obtained by randomly selecting subsets of the space containing the anchor point. In an attempt to balance complexity and computational cost, the usual challenge in high-dimensional situations, our approach uses cones as subsets. The choice of cones turns out to have other advantages as well (see Section 2.1 below).

Given the data, we construct our feature functions as follows.

  • For each pair of data points, let the direction be the one determined by the two points, and choose an anchor point on the line through them (we use their midpoint; see Section 6). The anchor point and the direction define the line passing through the two data points.

  • Fix an opening angle, and consider the class of all (right spherical) cones with this opening angle and with the line as their axis of symmetry. Each such cone is determined by the signed distance of its tip from the anchor point along the line; the Euclidean distance of the tip to the anchor point equals the absolute value of this signed distance, and cones whose tips lie on opposite sides of the anchor point open in opposite directions.

  • For each such cone, project all the data points falling inside the cone onto the line, and find the (one-dimensional) Tukey depth of the anchor point among these projections. Formally,

    (2.1)
  • By picking the tip randomly on the line according to a known distribution on the real line, we obtain a distribution of empirical depths (given the data) for the given anchor point. The corresponding quantile function is then our feature function. Formally,

    (2.2)

Note that there are three ingredients of the procedure that require tuning: the opening angle; the distribution on the line used to select the tip; and the rule for choosing the anchor point. The choice of these ingredients can, to a certain extent, be guided by the statistical application one has in mind. We illustrate this below by considering classification.

Observe that, by our definition, the two depth functions obtained from the two orderings of a data pair are mirrored versions of each other, but they lead to the same quantile function.

Figure 1: Illustration of the construction. The two panels show one positive and one negative value of the (signed) tip position. The dashed line is the line through the data pair; the anchor point is shown in red; the cones are shown in blue; the projections of the data inside the cones onto the line are indicated.
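To make the construction concrete, the following is a minimal sketch in R of the steps just described, for a single data pair. It is an independent illustration, not the authors' implementation (R code by the authors is available at https://github.com/GabeChandler/depthity/, see Section 4.1); in particular, the function name dqf_pair, the use of the midpoint as anchor point, the interpretation of the angle as the half-opening angle of the cones, the uniform tip distribution, and the normalization of the depth by the total sample size are illustrative assumptions.

```r
## Minimal illustrative sketch (not the authors' implementation); see the
## lead-in text for the assumptions made.
dqf_pair <- function(X, i, j, angle = 45, n_tips = 500,
                     probs = seq(0.01, 1, by = 0.01)) {
  xi <- X[i, ]; xj <- X[j, ]
  z  <- (xi + xj) / 2                        # anchor point (midpoint of the pair)
  v  <- (xj - xi) / sqrt(sum((xj - xi)^2))   # direction of the line through xi, xj

  centered <- sweep(X, 2, z)                 # coordinates relative to the anchor
  a <- as.vector(centered %*% v)             # signed projection onto the line
  b <- sqrt(pmax(rowSums(centered^2) - a^2, 0))  # distance from the line

  ## tip positions drawn from a uniform distribution on a wide interval (cf. Section 6)
  tips   <- runif(n_tips, min = -max(abs(a)) - 1, max = max(abs(a)) + 1)
  tanpsi <- tan(angle * pi / 180)

  depths <- sapply(tips, function(t) {
    ## cone with tip at z + t * v, axis of symmetry the line, opening towards z
    h <- if (t >= 0) t - a else a - t        # distance from the tip along the axis
    in_cone <- (h >= 0) & (b <= h * tanpsi)
    if (!any(in_cone)) return(0)
    ## one-dimensional Tukey depth of the anchor (a = 0) among the projections
    min(sum(a[in_cone] <= 0), sum(a[in_cone] >= 0)) / nrow(X)
  })

  quantile(depths, probs = probs, names = FALSE)   # the feature (quantile) function
}

## Example: feature function for the first two observations of a toy sample
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
plot(seq(0.01, 1, by = 0.01), dqf_pair(X, 1, 2), type = "l",
     xlab = "quantile level", ylab = "depth")
```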

Relation to local depth. One of the basic building blocks of our methodology is the collection of projection depths of the anchor points within cones. These depths can, to a certain extent, be considered as versions of a local depth. However, localization is not obtained by choosing small neighborhoods. This will be discussed in more detail below.

2.1 Depth quantile functions for object data, the kernel-trick, and non-linear multi-dimensional scaling

2.1.1 Object data and the kernel trick

The approach discussed above can also be applied to object data in a Hilbert space. The simple, but key underlying observation here is that determining our depth quantile functions only requires the ability to define cones, to calculate which data points fall into the cones, and to find distances between points. All of this, of course, hinges on the existence of an inner-product. So, if our underlying data are objects lying in a Hilbert space equipped with an inner-product then we can simply apply our methodology outlined above to the object data with this inner-product.

Interestingly, applying our methodology to functional data will map the functional data into a different set of functional data, and it still needs to be investigated when such an approach might perform better (or worse) than working with the functional data directly.

Another extension of our methodology is its application to other object data in conjunction with the kernel trick. The kernel trick consists of mapping the data into an RKHS via a feature map such that the inner product of two mapped data points is given by a kernel evaluated at the original points. We can then simply apply our methodology, as outlined above, to the transformed data with the corresponding inner product. Kernel methods are available for a variety of objects, including trees, graphs, matrices, strings, tensors, functions, persistence diagrams, etc. (e.g. see Genton, 2001; Shawe-Taylor and Cristianini, 2004; Cuturi, 2010). Since the elements in the feature space lie in a Hilbert space, our methodology applies, and it can be used to investigate and to compare the geometry of the data in feature spaces.

2.1.2 Relation to non-linear multidimensional scaling

Given a pair of data points, with the line passing through them and the anchor point on this line, all we need in order to determine the depth functions, or the quantile functions, are two numbers for each of the remaining data points: the signed distance, measured along the line from the anchor point, of the point's projection onto the line (call it a), and the (minimum) distance of the point from the line (call it b). Given the opening angle of our cones, these two numbers tell us whether there exists a cone with a given tip and the line as its axis of symmetry that contains the point, and whether two points lie in the same cone.

So the pairs (a, b) fully determine the depth functions, and thus the quantile functions, once an opening angle is fixed and the distribution for selecting the random tips of the cones is chosen. A plot of these pairs, one for each data point, is what we call an a-b plot. It has the flavor of a non-linear projection onto two dimensions, in the spirit of multidimensional scaling, and it provides a visual impression of the geometry of the data cloud, as seen along the chosen direction, that is relevant to our methodology.
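Since only inner products enter this computation, the pairs (a, b) can be obtained from a kernel (Gram) matrix alone, which is what makes the feature-space plots discussed next possible. The sketch below is illustrative (the helper name ab_pairs_kernel and all settings are assumptions, not the authors' implementation); with the linear kernel K = X X' it reduces to the Euclidean computation of Section 2.

```r
## Illustrative sketch: the pairs (a, b) require only inner products, so they
## can be computed from a kernel (Gram) matrix alone.
ab_pairs_kernel <- function(K, i, j) {
  ## K: n x n kernel matrix; (i, j): the pair defining the line;
  ## the anchor point is the midpoint of the two feature vectors.
  d2 <- K[i, i] + K[j, j] - 2 * K[i, j]      # squared feature-space distance of the pair
  a  <- (K[, j] - K[, i] - (K[j, j] - K[i, i]) / 2) / sqrt(d2)  # signed projection
  m2 <- diag(K) - (K[, i] + K[, j]) + (K[i, i] + 2 * K[i, j] + K[j, j]) / 4
  b  <- sqrt(pmax(m2 - a^2, 0))              # distance from the line
  cbind(a = a, b = b)
}

## Example with the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
set.seed(1)
X <- matrix(rnorm(60 * 4), 60, 4)
K <- exp(-0.5 * as.matrix(dist(X))^2)        # gamma = 0.5
plot(ab_pairs_kernel(K, 1, 2), xlab = "a", ylab = "b")  # a-b plot in the RKHS
```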

As an example, this plot allows us to visualize aspects of the underlying infinite-dimensional feature space induced by the popular radial basis function (RBF) kernel k(x, y) = exp(-γ ||x - y||^2). Figure 2 shows the change in the underlying geometry as γ is increased from 0 for the first two classes (Setosa vs. Versicolor) of Fisher's famous Iris data; here we consider a particular pair of observations. When γ is large, all feature vectors are essentially orthogonal, since the kernel values between distinct points are close to zero, and thus the points approximately map to a single location in the a-b plot. As γ tends to 0, the points converge to (0,0).

If good classification is possible based on the value of b alone, this means that the class with smaller values of b tends to live within a cylinder centered at the line, while the other class tends to live outside of this cylinder. Figure 2 shows that this is not true for this particular pair of points, suggesting that this line points roughly in the direction connecting the modes of the two classes.

Figure 2: Aggregated a-b plot visualizing the point cloud geometry induced by the RBF kernel with increasing parameter γ, for two classes of the Iris data (color coded). Larger values of γ correspond to larger plotted points.

Of course, this plot changes with the choice of the line; that is, there are as many such plots as there are pairs of data points. Inspecting all of these plots visually is not feasible. Instead, one needs to extract relevant information from these point clouds. Our depth quantile functions can be considered as summarizing the geometry of these point clouds into a class of functions, which can then be handled by more traditional statistical methodology.

3 Interpretation of depth quantiles via their population versions

Let the underlying distribution on Euclidean space have a Lebesgue density. A line is determined by a point it passes through and a direction, the latter being an element of the unit sphere.
As above, population versions of the depth quantile functions, for a fixed anchor point lying on the line, are defined by using cones that all have the line as their common axis of symmetry. In dimension one, cones reduce to intervals, and the geometric considerations are somewhat different in this case; in what follows we therefore only consider dimension two and higher, and the one-dimensional case will be discussed in separate work. For the following discussion, we fix a line.
For a given signed tip position, consider the cone of fixed opening angle that contains the anchor point, has the line as its axis of symmetry, and has its tip at the corresponding (signed) distance from the anchor point. Note that this includes cones opening to both sides, with the orientation of the cone depending on the sign of the tip position. We now define the depth of the anchor point within such a cone as

(3.1)

In words, this is the Tukey depth of the anchor point with respect to the (in general improper) distribution on the real line obtained by restricting the underlying distribution to the cone and projecting it (orthogonally) onto the line. The empirical version defined above is obtained by replacing this distribution with the empirical distribution of the data falling into the cone; there, the line and the anchor point are both random, being determined by a pair of data points.
A different representation that will be convenient is as follows. Consider the hyperplane passing through the anchor point whose normal direction is the direction of the line, together with the two halfspaces it defines. This hyperplane splits each cone into two closed subsets: the one containing the tip, which is itself a cone with the corresponding hyperplane section as its base, and the other, which is a frustum. We can now write

(3.2)

Now assume that the tip position is randomly chosen according to a distribution on the real line, meaning that we choose the tip of the cone randomly on the given axis of symmetry. Define the population depth quantile function as the quantile function of the resulting distribution of depth values of the anchor point, i.e.

(3.3)

The tip distribution can be considered a tuning parameter, and its choice will be discussed below. Throughout this paper, we assume it to be a continuous distribution.
The following two lemmas describe some properties of the depth function (viewed as a function of the tip position) and of the corresponding depth quantile function. We will also make use of the sublevel sets of the depth function.

Lemma 3.1

Let the line and the anchor point be fixed.

  • The depth function is U-shaped: non-increasing for negative and non-decreasing for positive tip positions. Thus, its sublevel sets are intervals containing zero.

  • If the underlying distribution has a positive Lebesgue density, the depth function is continuous and strictly U-shaped, with its unique minimum attained at zero. The sublevel intervals are thus all closed.

Lemma 3.2

Let the line and the anchor point be fixed.

  • The depth quantile function is non-decreasing on [0, 1].

  • Suppose that both the underlying distribution and the tip distribution have positive Lebesgue densities. Then we have that

    (3.4)

    Moreover, the depth quantile function is strictly increasing, and we have:

    Large scales: as the quantile level tends to one, the depth quantile function tends to the global Tukey depth of the anchor point.
    (3.5)
    Small scales: as the quantile level tends to zero, the (suitably normalized) depth quantile function tends to the value of the density at the anchor point, up to a known constant. (3.6)

Property (3.6) makes precise the above statement that, for small quantile levels, the depth quantile function contains information about the density at the anchor point. Observe that this localization is achieved even though we are not using local neighborhoods of the anchor point. Instead, localization follows from a combination of the notion of Tukey depth and the choice of cones used to define the distribution of depths (it is the acute opening angle of the cones that is important here). The effect of this combination is that, for small quantile levels, the depth is attained at a small cone, with the anchor point lying in the base of the cone and with the tip close to the anchor point, and the depth itself is then given by the probability content of this small cone. This cone is a certain type of local neighborhood of the anchor point, whose shape is the same for each point because the opening angle of the cones under consideration is fixed. So dividing by the volume of this small cone and letting the quantile level tend to zero gives the density at the anchor point up to a fixed constant.
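The following back-of-the-envelope display records this argument in formulas. The notation is not taken from the original (f denotes the density, z the anchor point, C_h the small cone of height h near z, and c_{d,psi} the volume of the corresponding cone of unit height, which depends only on the dimension and the fixed opening angle); it is a sketch of the heuristic, not the precise statement (3.6).

```latex
% Illustrative sketch of the small-scale heuristic; notation not from the original.
\begin{align*}
  \operatorname{vol}(C_h) &= c_{d,\psi}\, h^{d}
     && \text{(scaling of the cone volume with its height)}\\
  D(C_h) &\approx f(z)\,\operatorname{vol}(C_h)
     && \text{(probability content of a small cone, } f \text{ continuous at } z\text{)}\\
  \Longrightarrow\quad f(z) &\approx \frac{1}{c_{d,\psi}}\,\frac{D(C_h)}{h^{d}}
     \qquad \text{as } h \to 0 .
\end{align*}
```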

4 Applications

4.1 Topological Data Analysis

R code for computing the depth quantile functions is available at https://github.com/GabeChandler/depthity/

Consider observing a point cloud in 8-dimensional Euclidean space, where the data are either generated uniformly in a ball of fixed radius or generated uniformly in this ball with a smaller concentric ball removed. We call these two supports the 8-ball and the 8-annulus, respectively. Specifically, our larger ball has radius 1.5 and the inner ball has radius 1.25. We consider a sample size of n = 100 and note that, due to the dimensionality, the inner ball only accounts for about 25 percent of the total volume. We ask whether it is possible to tell whether a hole is present. We consider four approaches: multidimensional scaling, examination of the pairwise distances through their empirical distribution function, looking at the depth quantile functions of the midpoints, and considering only their right-hand limit. Figure 3 shows that more information is provided by considering the entire functions rather than a single point.
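A minimal sketch of this simulation setup is given below; the radial sampling method and the function name runif_ball are illustrative, while the radii and the sample size are taken from the text.

```r
## Sketch of the simulation setup: uniform samples on the 8-ball and 8-annulus.
runif_ball <- function(n, d, r_outer, r_inner = 0) {
  u <- matrix(rnorm(n * d), n, d)
  u <- u / sqrt(rowSums(u^2))                                     # uniform directions
  r <- (r_inner^d + runif(n) * (r_outer^d - r_inner^d))^(1 / d)   # radial part
  u * r                                                           # scales row i by r[i]
}

set.seed(1)
ball    <- runif_ball(100, d = 8, r_outer = 1.5)                  # "8-ball"
annulus <- runif_ball(100, d = 8, r_outer = 1.5, r_inner = 1.25)  # "8-annulus"
(1.25 / 1.5)^8   # fraction of the larger ball's volume taken up by the inner ball
```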

(a) MDS for 8-ball and 8-annulus
(b) Depth Quantile Functions
(c) Global depth performance vs. the 0.39 quantile
Figure 3: Comparison of observation level information for distinguishing between 8-ball and 8-annulus.

Figure 3 shows that classification in this particular case is possible by considering only a single observation and how it relates to all the other observations in the data set. Unsurprisingly, using all the information together yields improved results. Figure 4 considers taking the average of all depth quantile functions (thus converting them into a single function), which outperforms the EDF of the pairwise distances, itself a single function computed from all pairwise comparisons, even when the average depth quantile function is evaluated at only a single point. Multidimensional scaling seems to produce plots that do not distinguish between the two cases.

(a) EDF of pairwise distances
(b) Average Depth Quantile Functions
Figure 4: Data set level information for 8-dimensional simulated data

4.2 Classification

As the quantile functions we generate move monotonically from a local measure (density) to a global measure (depth), the behavior of the functions, particularly at intermediate levels, provides interesting information about the geometry of the underlying point cloud at given locations. Based on the previous example, our method seems particularly attuned to finding holes or anti-modes in point clouds, partly due to the choice of the midpoint as the anchor point. A classical application in statistics is the classification problem, where the task is to guess the class of an observation given its covariate values. In the case of modal clusters, where an anti-modal (low-density) region exists between the convex modal regions of the two classes, we might suspect that within-class comparisons will produce different functions than between-class comparisons. We expect that within-group comparisons, with midpoints tending to live in modal regions, will tend to have higher density, though relatively lower depth, in contrast with midpoints of between-group comparisons, which tend to live in the valley between the two modes and thus have comparatively lower density, but higher depth, as they tend to lie between the two point clouds. Of course, so long as the geometry of within-class and between-class comparisons differs, classification information should be available.

We consider using empirical depth quantile functions to perform binary classification of Euclidean observations. The collection of depth quantile functions is treated as a sample of random functions, and ideas from multivariate functional data analysis are used to construct the classifiers. For each point in the training data, depth quantile functions are computed with base points the midpoints formed with every other point in the training set. The tip distribution is chosen to be uniform, centered at the midpoint, with support extending far enough to ensure that the global depth is attained at the boundary of the support. This produces one function for each pair of training points. Our heuristic then computes two functions for every observation: the average of its depth quantile functions over points within the same class, and the average over points in the opposite class. This yields two functions per observation. Functional principal component analysis (fPCA) is then performed on this collection of functions, and the loadings associated with the first four principal functions are retrieved. Thus, each data point in our original data set is associated with an 8-dimensional vector (a concatenation of the loadings of its two average functions).
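The following sketch outlines this pipeline. It assumes a precomputed array dqf[i, j, ] holding the empirical depth quantile function for the pair (i, j) on a common grid (for instance from the dqf_pair sketch of Section 2); ordinary principal components of the gridded functions stand in for fPCA, and the support vector classifier is taken from the e1071 package. All names and settings are illustrative, not the authors' implementation.

```r
## Sketch of the classification heuristic; see the lead-in text for assumptions.
library(e1071)

dqf_features <- function(dqf, y, n_load = 4) {
  n   <- length(y)
  avg <- function(i, sel) apply(dqf[i, sel, , drop = FALSE], 3, mean)
  ## per observation: average DQF over same-class and over opposite-class partners
  within  <- t(sapply(1:n, function(i) avg(i, y == y[i] & seq_len(n) != i)))
  between <- t(sapply(1:n, function(i) avg(i, y != y[i])))
  ## fPCA on a common grid ~ ordinary PCA of the discretized functions;
  ## keep the first n_load loadings of each of the two average functions
  pc <- prcomp(rbind(within, between))
  cbind(pc$x[1:n, 1:n_load], pc$x[(n + 1):(2 * n), 1:n_load])
}

loo_rate <- function(feat, y) {              # leave-one-out correct classification rate
  pred <- sapply(seq_along(y), function(i) {
    fit <- svm(x = feat[-i, ], y = factor(y[-i]), kernel = "linear")
    as.character(predict(fit, feat[i, , drop = FALSE]))
  })
  mean(pred == as.character(y))
}

## Smoke test with a synthetic placeholder array (n = 40, grid of 99 levels)
set.seed(1)
n <- 40; y <- rep(c(0, 1), each = n / 2)
dqf <- array(runif(n * n * 99), dim = c(n, n, 99))
loo_rate(dqf_features(dqf, y), y)
```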

We note that there are many tuning parameters in this method, from the choice of the tip distribution and the opening angle, to the several that are inherent in the pre-existing tools we then use (the construction of functional objects from our discrete data, the number of loadings retained, and the choices for the support vector classifier). We do not claim that this heuristic is a preferred classification tool compared to existing techniques specifically engineered for this task; rather, these functions offer interesting information that can be visualized and is reasonably competitive in the classification domain, as well as in others.

4.3 Outlier Detection

We also conjecture that points whose quantile functions are unusual likely correspond to points that might be deemed outliers, or that are at least unusual with respect to the overall point cloud. As a brief exploration of this possibility, we consider the 8-annulus data from the example at the start of this section and compare the distance of each point from the center to the distance (in function space) of the observation-level average depth quantile function from the overall average depth quantile function. We see in Figure 5 that points exhibiting large distances from the overall average tend to lie near one of the hyperspheres that form the boundary of the support.
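A small sketch of this diagnostic is given below. The matrix obs_avg of observation-level average depth quantile functions and the use of a (discretized) L2 distance are assumptions made for illustration; placeholders are generated so that the snippet runs stand-alone.

```r
## Sketch of the diagnostic behind Figure 5; placeholder inputs for illustration.
set.seed(1)
n <- 100; m <- 99
annulus <- matrix(rnorm(n * 8), n, 8)   # placeholder for the 8-annulus sample
obs_avg <- matrix(runif(n * m), n, m)   # placeholder for per-observation average DQFs

overall_avg <- colMeans(obs_avg)        # overall average depth quantile function
## discretized L2 distance of each observation-level average to the overall average
l2_dist <- sqrt(rowSums(sweep(obs_avg, 2, overall_avg)^2) / m)
plot(sqrt(rowSums(annulus^2)), l2_dist,
     xlab = "distance of observation from the center",
     ylab = "distance to the overall average DQF")
```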

Figure 5: Distance of the observation-level average depth quantile function to the overall average, plotted against the norm of the observation, for the 8-annulus simulated data.

4.3.1 Simulated Data

The first data set we consider shows that, while our original motivation involved convex modal regions, discriminant information exists so long as the geometry of the point clouds around the midpoints of same-class comparisons differs from that around the midpoints of between-class comparisons. Figure 6 considers a data set in which the supports of the two classes are a disc and a concentric annulus, such that a third annulus with density zero lies between the two. Comparisons between classes tend to yield midpoints in this zero-density region, and thus valuable information is revealed at small quantile levels. The kernel makes the data linearly separable, though it causes the support to live on an elliptic paraboloid, so that all midpoints live in a zero-density region. Interestingly, it is the intermediate quantile levels that now reveal useful information (see Figure 6(c)).

(a) Averaged depth quantile functions based on the original data
(b) Data
(c) Averaged depth quantile functions based on the data in the feature space
Figure 6: A simulated data set of two classes, together with the averaged depth quantile functions based on the original data and on the data in the feature space. Blue lines represent between-class comparisons, red comparisons within the inside group, and pink comparisons within the outside group.

4.4 Real Data Applications

We illustrate the performance of our heuristic classification technique on several well-known data sets. First is Fisher's Iris data, consisting of 3 classes of 50 observations each, with 4 covariates (note that our heuristic actually performs the classification task in double the dimension in this example). We evaluate the leave-one-out performance of the method and find a correct classification rate of 97.33%. See Figure 7 for two plots of the depth quantile functions, corresponding to the easiest and hardest pairwise comparison, as well as for the confusion matrix. Additional information afforded by the graphical nature of our method is that the point clouds of the Setosa and Versicolor varieties differ (compare the red and pink curves in Figure 7(a)), which is also suggested by examining their correlation matrices.

(a) Depth quantile functions: Setosa vs Versicolor
(b) Depth quantile functions: Versicolor vs Virginica
Confusion matrix (rows: predicted, columns: true class):
          Se   Ve   Vi
  Se      50    0    0
  Ve       0   48    2
  Vi       0    2   48
Figure 7: Fisher's Iris data set: depth quantile functions and confusion matrix

Our second real data set is the Wine data set, available from the UCI Machine Learning repository. There are 178 observations over 3 classes in 13 dimensions. Our heuristic yields a leave-one-out correct classification rate of 96%. Again, as is shown in Figure 8(a), the difference in geometries between the two classes is quite apparent.

(a) Depth quantile functions: Class 1 vs Class 3
(b) Depth quantile functions: Class 2 vs Class 3
Confusion matrix (rows: predicted, columns: true class):
           A    B    C
  A       56    1    0
  B        2   67    0
  C        1    3   48
Figure 8: Wine data: depth quantile functions and confusion matrix

We next consider a gene expression data set used for colon cancer detection (Tan and Gilbert, 2003), which consists of 62 observations on 2000 variables. We scale the data so that each variable has mean 0 and standard deviation 1, and, due to the dimensionality, we use a larger opening angle of 50 degrees for our cones. Based on the depth quantile plots (Figure 9), it is suggested that the observations corresponding to the presence of cancer live in a higher-density region (early initial dominance of the purple curve) near the boundary of the point cloud (small depth values, despite nearly 2/3 of the observations corresponding to this class). A multidimensional scaling of the loadings from an fPCA of these curves (the first 4 loadings correspond to class 0 comparisons, the last 4 to class 1 comparisons) yields more structure for the classification problem than one based on the distance matrix of the raw data. This structure is consistent with results in the literature showing that discriminant information is available. In fact, a leave-one-out cross-validation based on the loadings of the fPCA yielded a correct classification rate of nearly 84%.

Figure 9: Depth quantile functions (left), and MDS plots based on the depth quantile functions (center) and on the raw data (right). In the left plot, purple corresponds to cancer-cancer comparisons and blue to within-group comparisons for the no-cancer class; between-group comparisons are red and pink. Group-level averages are added in bold.

Finally, as an infinite-dimensional example, we study Spellman's yeast cell expression data (Spellman et al., 1998), a continuous-valued time series. We subsample the data so that only n = 20 observations are available for each class. Figure 10 shows the raw data, as well as an MDS plot of the data. The second panel considers an easier pairwise comparison (G1 vs M) based on MDS, as well as one that is suggested to be more difficult (S vs G2). In the latter comparison, we can see that the geometry of the within-class comparisons for one group looks nearly identical to that of the between-class comparisons of the two groups.

(a) MDS plot and raw data of the Spellman data
(b) Depth quantile functions for an easy and a hard classification problem; blue and purple are within-class comparisons, and thicker lines are average functions.
Figure 10: From the MDS plot, G1 and M (red vs cyan) are easy to discriminate between, while S vs G2 (green vs blue) is more difficult.

5 Interpreting average depth quantiles via random set theory

In this section, we use random set theory to provide another point of view on the construction of our feature functions. Consider a set such that

(5.1)

Recall that the anchor point is the point whose depths we calculate, and that the direction determines the line passing through it. In the empirical version discussed above, both the line and the anchor point are random, determined by a pair of data points. In this case, we have

Below, we will study the behavior of the empirical depth functions and the corresponding empirical quantile functions as estimators of their population counterparts. For this discussion, we assume that the minimizer appearing in the population depth is unique. Note that the population quantity is a random function that is not observable; the joint distribution of the pair of data points defining the line and the anchor point induces the distribution of this random function.

Consider the relevant subset of the cone (cf. the discussion below (3.2)) as a random closed set. Then, for a fixed tip position, the depth can be expressed through this random set, with the additional randomness chosen independently. This leads to the corresponding random depth quantile function, as in Lemma 3.2. Mimicking the construction of the average empirical feature functions used in the classification context, we fix one of the two data points defining the line and take the expectation over the other. This leads to

The resulting (random) function is a version of a hitting function of the random set, given the data point that is held fixed. The distribution of this random set depends on the distribution of the other data point. Assuming that we consider a binary classification problem, there are three cases, corresponding to three different such functions: two of them correspond to both data points lying in the same class, i.e. the two random variables have the same distribution, and the third case is that the two data points have different distributions, i.e. they come from different classes. In order for our classification procedure to work well, these three cases should result in significant differences between the random functions. In fact, our classification procedure only considers their expected values.

Note that if, for some tip position, the minimizing set is not uniquely defined (see the discussion right below (3.2)), then, while the hitting function does depend on the actual choice of the minimizer, the quantities used by our procedure do not.

Because of the dependence on the tip position, we effectively compare an entire class of hitting probabilities. The randomness still present (through the data point that is held fixed) makes these functions random functions, which motivates the use of FDA methodology. Of course, these random functions are unknown, and we need to estimate them. This is what is done implicitly by the proposed methodology.

6 Discussion of choice of tuning parameters

The proposed procedure has three ingredients that can be considered as tuning parameters. The most obvious one is the opening angle of the cones; then there is the distribution used to choose the distance of the tip of the cone from the base point; and last, but not least, there is the choice of the base point itself. We briefly discuss some aspects of the choice of these ingredients.

Choice of base point. In this paper, we always use the midpoint of the two data points defining the line. This is motivated by the application to classification, as discussed in Section 4.2. Other applications might warrant a different choice, for instance one of the two data points themselves.

Choice of opening angle. Theoretically, the opening angle of the cones should depend on the dimension. In fact, as the dimension increases, the opening angle needs to increase as well, to ensure that there is non-negligible mass within the cones and hence that the depth functions are not degenerate. In our simulation studies, however, a fixed choice of the angle seems to work well over a large range of dimensions. For non-Euclidean data, the choice of the angle appears more challenging. Note, however, that when the observations are mapped to a feature space via the RBF kernel, the norm of all observations in the corresponding RKHS equals 1, meaning that the observations all lie on the unit sphere. In terms of a good choice of the angle, this case surprisingly behaved, in our simulations, similarly to a low-dimensional Euclidean space.
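The dimension effect can be illustrated with a small Monte Carlo experiment (not taken from the paper): the fraction of standard Gaussian mass falling inside a cone with tip at the origin, a fixed axis and a fixed half-angle shrinks rapidly as the dimension grows, so a fixed narrow angle eventually leaves the cones nearly empty.

```r
## Monte Carlo illustration of the dimension effect; not from the paper.
cone_mass <- function(d, half_angle_deg, n = 1e5) {
  x      <- matrix(rnorm(n * d), n, d)
  cosang <- x[, 1] / sqrt(rowSums(x^2))            # cosine of the angle to the axis
  mean(cosang >= cos(half_angle_deg * pi / 180))   # fraction falling inside the cone
}
set.seed(1)
sapply(c(2, 5, 10, 50), cone_mass, half_angle_deg = 45)
## about 0.25 for d = 2, decaying rapidly towards 0 as d grows: with a fixed
## narrow angle the cones become asymptotically empty, hence the angle has to
## widen with the dimension.
```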

Choice of the tip distribution. The choice of the distribution of the tip essentially amounts to choosing a non-linear transformation of the horizontal axis, thereby emphasizing different regions of the depth quantile plots. Choosing it to be unimodal about zero, for instance, would put higher weight on tips close to the base point. In our applications, we always chose the uniform distribution on an interval wide enough that all the projections used to determine the empirical depth quantiles fall into its support. The motivation for this choice is similar to that for a non-informative prior.

7 Estimation of depth quantile curves

Given iid data from the underlying distribution, an empirical counterpart of the population depth is constructed as follows. First we replace, in (3.2), the underlying distribution by the empirical distribution based on the observed data. This results in the empirical depth of the anchor point in a given cone, with the line and the anchor point as above.

Picking the tip positions randomly from the tip distribution, i.e. picking the tip of the cone randomly on the line, gives the empirical depth quantile function for a given line:

(7.1)

Observe that, if the line and the anchor point are based on a pair of data points as above, then we obtain the depth quantile function considered in Section 2.
The empirical depth functions and the corresponding empirical quantile functions have properties similar to those of their population counterparts (see Lemma 10.1 presented in the appendix).

7.1 Concentration and large sample behavior

For measurable sets, define the pseudometric given by the probability of the symmetric difference. Consider the set of all cones (with the given opening angle) having the fixed line as their axis of symmetry, together with the class of all halfspaces whose normal vector points along this line. Both are VC-classes of small VC-dimension (each is a union of two nested classes of sets), and thus the class of their intersections is a VC-class whose VC-dimension is bounded by a constant not depending on the dimension of the ambient space. The important point is that this VC-dimension does not vary with the dimension, which is the reason why the estimators of the depth quantile functions do not suffer from the curse of dimensionality, as is shown in the following result. One can also consider the richer class of sets obtained by letting the line vary; this is again a VC-class, but its VC-dimension does depend on the dimension of the ambient space.

7.1.1 Results for empirical depth functions

Proposition 7.1

For every given line there exists a constant not depending on the dimension and the opening angle such that, for all

(7.2)

Now recall that our methodology depends on the empirical depth quantile functions. We show that these functions are uniformly close to the following population-based functions.

While this population-based function depends on the population distribution, it is still a random quantity, but only through the choice of the line and the base point. We now have the following result:

Proposition 7.2

With and , both nonrandom sequences, we have,

(7.3)

Moreover, if the distribution is concentrated on a lower-dimensional hyperplane, then (7.3) holds with the ambient dimension replaced by the dimension of that hyperplane.

Remark. The ‘min’ appearing in the rate tells us that there is a dimension-independent uniform bound for the rate of the deviations of the empirical functions from their population-based counterparts. The asserted adaptivity to some kind of ‘sparsity’ is due to a geometric fact: consider a cone in the ambient space with its tip at zero and with one of the coordinate axes as its axis of symmetry. The intersection of such a cone with a lower-dimensional linear subspace is again a cone in that subspace.

Our next result concerns the asymptotic distribution of the empirical depth quantile function as a process. The behavior of the process depends on whether the minimizing set in the definition of the depth is unique or not (see Corollary 7.1). We also face a challenge in small neighborhoods of points with non-unique minimizers. This is due to the fact that the empirical minimizer, i.e. the set minimizing the empirical depth, might not stabilize in such neighborhoods. Moreover, if the minimizer switches from one of the two candidate sets to the other as the tip position moves through a small neighborhood of such a point, then the resulting map is not continuous at that point (with respect to the relevant norm), because the two candidate sets are not nested.

The following definition serves to exclude such values. With the notation introduced above, let

(7.4)

Note that a vanishing value of this quantity means that the minimizer in the definition of the depth is not unique (both candidate sets are minimizers). We also introduce notation for the set of minimizers.

Observe that this set of minimizers consists of either exactly one or exactly two elements, the latter holding exactly when the minimizer is not unique.

Theorem 7.1

For set , and let be a sequence of real numbers with . Then, on an appropriate probability space, there exists a sequence of -bridges , i.e. tight, mean zero Gaussian processes with and almost surely continuous sample paths, such that, for any given line , we have with that

where the -term does not depend on and . Moreover, we have

where the -term does not depend on , but it does depend on .

Remark. In the somewhat related context of minimum volume sets (or generalized quantile processes), a result of similar type has been obtained in Einmahl and Mason (1992). There the approximating process (to the generalized quantile process) is a maximum of -bridges taken over the, in general non-unique, generalized quantiles. Also, Massé (2004) showed convergence of the multivariate Tukey depth process to a similar type of limit process.

Corollary 7.1

For set