A local depth measure for general data

We introduce a general local depth measure for data in a Banach space, based on the use of one-dimensional projections. We study theoretical properties of the local depth measure, as well as strong consistency results for the local depth measure and for the local depth regions. In addition, we propose a clustering procedure based on local depths. The clustering procedure is illustrated on artificial and real data sets of multivariate, functional and multifunctional data, with very promising results.

1 Introduction

Data depth measures play an important role in the analysis of complex data sets, such as functional or high-dimensional data. The main goal of depth measures is to provide a center-outward ordering of the data, generalizing the concept of median. Depth measures are also useful for describing different features of the underlying distribution of the data. Moreover, they are powerful tools for several inference problems, such as location and symmetry tests, classification, and outlier detection.

Nonetheless, since one of their major characteristics is that the depth values decrease along any half-line from the center, they are not suitable for capturing characteristics of the distribution when the data are multimodal. Hence, over the last few years, several definitions of local depth have been introduced, with the aim of revealing the local features of the underlying distribution. The basic idea is to restrict a global depth measure to a neighborhood of each point of the space. In this way, a local depth measure should behave as a global depth measure with respect to the neighborhoods of the different points. Agostinelli and Romanazzi (2011) gave the first definition of local depth for multivariate data. They extended the concepts of simplicial and half-space depth so as to record the local geometry of the space near a given point. For simplicial depth, they consider only random simplices with sizes no greater than a certain threshold, while for half-space depth, the half-spaces are replaced by infinite slabs of finite width. Both definitions strongly rely on a tuning parameter, which retains a constant-size neighborhood of every point of the space and plays a role analogous to that of the bandwidth in density estimation. Desirable statistical properties are attained for univariate absolutely continuous distributions. Paindaveine and Van Bever (2013) introduced a general procedure for multivariate data that converts any global depth into a local depth. The main idea of their definition is to study local environments, that is, to regard the local depth as a global depth restricted to some neighborhood of the point of interest. They obtain strong consistency of the sample version to its population counterpart. All these proposals provide a continuum between local and global depth. More recently, for functional data, Agostinelli (2018) gave a definition of local depth extending the half-region depth introduced by Lopez-Pintado and Romo (2011). This definition is also suitable for large finite-dimensional datasets, and asymptotic results are obtained.

Our goal is to give a general definition of local depth for random elements in a Banach space, extending the global depth given by Cuevas and Fraiman (2009), the Integrated Dual Depth (IDD). The main idea of the IDD is to combine one-dimensional projections with a notion of one-dimensional depth. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and $E$ a separable Banach space. Denote by $E^*$ the separable dual space. Let $X$ be a random element in $E$ with distribution $P$, and let $Q$ be a probability measure on $E^*$ independent of $X$. The IDD is defined as

$$\mathrm{IDD}(x, P) = \int_{E^*} D(f(x), P_f) \, dQ(f), \qquad (1)$$

where $D$ is a univariate depth (for instance, the simplicial or the Tukey depth), and $P_f$ is the univariate distribution of $f(X)$.

In the present paper we define the Integrated Dual Local Depth (IDLD). The main idea is to replace the global depth measure in Equation (1) by a local one-dimensional depth measure, following the definition given by Paindaveine and Van Bever (2013). We study how the classical properties introduced by Zuo and Serfling (2000) should be analyzed within the framework of local depth, and we prove, under mild regularity conditions, that our proposal enjoys those properties. Moreover, uniform strong consistency results are obtained for the empirical local depth with respect to its population counterpart, and also for the local depth regions. The main advantages of our proposal are its flexibility in dealing with general data and its low computational cost, which enables it to work with high-dimensional data. As a natural application, we propose a clustering procedure based on local depths, and illustrate its performance with synthetic and real data, for different kinds of data.

The remainder of the paper is organized as follows. In Section 2 we define the integrated dual local depth, and study its basic properties. Section 3 is devoted to the asymptotic study of the proposed local depth measure. In Section 4 the local depth regions are defined and the consistency results are exhibited. A clustering procedure based on local depth regions is proposed in Section 5. Simulations and real data examples are given in Section 6. Some concluding remarks are given in Section 7. All the proofs appear in the Appendix.

2 General Framework and Definitions

In this section, we first review the concept of local depth for the univariate case. Then we define the Integrated Dual Local Depth, and we finally show that, under mild regularity assumptions, our proposal has good theoretical properties corresponding to those established in Paindaveine and Van Bever (2013).

Let $P$ be a probability measure on $\mathbb{R}$, with cumulative distribution function $F$, and let $\beta \in (0, 1]$. Let $LD_\beta(x, P)$ be the local depth of $x \in \mathbb{R}$ with respect to $P$; for example, the univariate local simplicial depth, that is,

$$LS_\beta(x, P) = \frac{2 \, [F(x) - F(x - t_\beta(x))] \, [F(x + t_\beta(x)) - F(x)]}{\beta^2}, \qquad (2)$$

where $t_\beta(x)$ is the neighborhood width defined as follows.

Definition 1.

Let $F$ be a univariate cumulative distribution function and $x \in \mathbb{R}$. Then, for $\beta \in (0, 1]$, we define the neighborhood width by

$$t_\beta(x) = \inf\{ t \ge 0 : F(x + t) - F(x - t) \ge \beta \}, \qquad (3)$$

where $\beta$ is the locality level.

Remark 1.

If $F$ is absolutely continuous, the infimum in Equation (3) is attained, and hence $F(x + t_\beta(x)) - F(x - t_\beta(x)) = \beta$. Even more, it is clear that if $\beta_1 \le \beta_2$, then $t_{\beta_1}(x) \le t_{\beta_2}(x)$.

The locality level $\beta$ is a tuning parameter that determines the centrality of a point of the space conditional on a given window around it. If the value of $\beta$ is high, the local depth approaches the regular depth of the point, whereas if it is low, it only describes the centrality in a small neighborhood of $x$. As $\beta$ tends to one, the local depth measure tends to the global depth measure.
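As a concrete illustration, the following is a minimal R sketch of the empirical versions of Equations (2) and (3); the helper names are ours, and the formula implemented is the conditional-simplicial form given in Equation (2).

```r
## Minimal R sketch (helper names are ours): empirical neighborhood width
## of Equation (3) and univariate local simplicial depth of Equation (2).
neighborhood_width <- function(x, sample, beta) {
  d <- sort(abs(sample - x))          # distances from the sample to x
  d[ceiling(beta * length(sample))]   # smallest width covering a beta-fraction
}

local_simplicial_depth <- function(x, sample, beta) {
  t <- neighborhood_width(x, sample, beta)
  Fhat <- ecdf(sample)                # empirical CDF F_n
  2 * (Fhat(x) - Fhat(x - t)) * (Fhat(x + t) - Fhat(x)) / beta^2
}

## Example: a bimodal sample has high local depth at each mode.
set.seed(1)
z <- c(rnorm(200, -3), rnorm(200, 3))
local_simplicial_depth(-3.0, z, beta = 0.2)  # near a mode: close to 1/2
local_simplicial_depth(-1.5, z, beta = 0.2)  # off-center: smaller
```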

We can also define, in an analogous way, the univariate local Tukey depth.

In what follows, without loss of generality, we restrict our attention to the local simplicial depth $LS_\beta$.

2.1 Integrated Dual Local Depth

Our aim in this section is to extend the IDD introduced by Cuevas and Fraiman (2009) to the local setting. The IDD is a depth measure defined for random elements in a general Banach space. The idea is to project the data along random directions and to compute the univariate depth of the projected one-dimensional data. To obtain a global depth measure, these univariate depths are integrated. Under mild regularity conditions, the IDD satisfies the basic properties of depth measures described by Zuo and Serfling (2000), and it is strongly consistent. In addition, it is important to remark that its computational cost is low, even in high dimensions, since it is based on the repeated computation of one-dimensional projections.

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and $E$ a separable Banach space, with $E^*$ its separable dual space. Let $X$ be a random element in $E$ with distribution $P$, let $Q$ be a probability measure on $E^*$ independent of $X$, and let $\beta \in (0, 1]$. We define the Integrated Dual Local Depth (IDLD),

$$\mathrm{IDLD}_\beta(x, P) = \int_{E^*} LS_\beta(f(x), P_f) \, dQ(f), \qquad (4)$$

where $LS_\beta$ is the univariate local depth given in Equation (2), and $P_f$ is the univariate distribution of $f(X)$. As suggested by Cuevas and Fraiman, in the infinite-dimensional setting $Q$ may be chosen to be a non-degenerate Gaussian measure, and in the multivariate setting, the uniform distribution on the unit sphere. With a slight abuse of notation, we write $F_f$ for the cumulative distribution function of $f(X)$. It is clear that the IDLD is well-defined, since it is bounded and non-negative.

Zuo and Serfling (2000) established the general properties that depth measures should satisfy (P. 1 - P. 6). Paindaveine and Van Bever (2013) extended those properties to the local depth framework. We now describe the properties satisfied by the IDLD.

The first property deals with the invariance of local depths. In the finite-dimensional case, the IDLD is independent of the coordinate system. This property is inherited from the IDD. Since the IDLD is a generalization of the IDD, which is not in general affine invariant (i.e., if $A$ is a non-singular linear transformation of $\mathbb{R}^d$ and $P_A$ denotes the distribution of $AX$, then $\mathrm{IDD}(Ax, P_A)$ need not equal $\mathrm{IDD}(x, P)$), neither is the IDLD. It is clear that the IDLD is also invariant under translations and changes of scale.

P. 1. (affine invariance).

Let $E$ be a finite-dimensional Banach space, $X$ a random vector in $E$, and $Q$ the Haar measure on the unit sphere of $E^*$, independent of $X$. Let $A$ be a linear transformation such that $A = cO$ with $c > 0$ and $O$ orthogonal, let $b \in E$, and denote by $\tilde{P}$ the distribution of $AX + b$. Then $\mathrm{IDLD}_\beta(Ax + b, \tilde{P}) = \mathrm{IDLD}_\beta(x, P)$.

The proof appears in Appendix A.

Remark 2.

It is well known that the spatial median is not affine invariant; hence, transformation and retransformation methods have been designed to construct affine equivariant multivariate medians (Chakraborty and Chaudhuri 1996, 1998). The IDLD can be modified following the ideas of Kotík and Hlubinka (2017) to attain this property.

Depth measures are powerful analytical tools, especially when the random element enjoys symmetry properties. Local depths should inherit these properties locally (restricted to certain neighborhoods). Hence, we give an appropriate definition of local symmetry.

Definition 2.

Let $X$ be a real random variable and $\beta \in (0, 1]$. Then $X$ is said to be $\beta$-symmetric about $x_0$ if the cumulative distribution function satisfies

$$F(x_0 + t) - F(x_0) = F(x_0) - F(x_0 - t) \quad \text{for every } 0 \le t \le t_\beta(x_0). \qquad (5)$$

A random element $X$ in a Banach space is $\beta$-symmetric about $x_0$ if, for every $f \in E^*$, $f(X)$ is $\beta$-symmetric about $f(x_0)$.

The notion of $\beta$-symmetry aims to capture locally the behavior of a unimodal random variable on a neighborhood of probability $\beta$ about the locally deepest point. Figure 1(a) and (b) exhibit a bimodal distribution with modes $x_1$ and $x_2$. In the former, both modes are local symmetry points for the given locality level, while in the latter $x_1$ is a local symmetry point for a small locality level but not for a larger one: the shaded area around it is non-symmetric.

(a) $x_1$ and $x_2$ are local symmetry points at locality level $\beta$.
(b) $x_1$ is a local symmetry point at locality level $\beta_1$, while it is not a local symmetry point at locality level $\beta_2$.
Figure 1: Local symmetry points.

An important property of depth measures is maximality at the center: if $X$ is symmetric about $x_0$, then the depth attains its maximum value at that point. This property should be inherited by local depths if the distribution of $X$ is unimodal and convex. Local depths are relevant for detecting local features, for instance local centers; hence, our aim is to extend the property of maximality at the center to each $\beta$-symmetry point.

P. 2. (maximality at the center).

Let $X$ be a continuous random element, $\beta$-symmetric about $x_0$. For every $\beta' \le \beta$ we have that

$$\mathrm{IDLD}_{\beta'}(x_0, P) = \sup_{x \in E} \mathrm{IDLD}_{\beta'}(x, P). \qquad (6)$$

The proof appears in Appendix A.

Proposition 1 bridges the definition of $\beta$-symmetry with the usual definition of symmetry (see Zuo and Serfling 2000).

Proposition 1.

Let $X$ be a continuous random element, $\beta$-symmetric about $x_0$. Then $X$ is $\beta'$-symmetric about $x_0$ for each $\beta' \le \beta$.

The proof appears in Appendix A.

Proposition 2 describes the $\beta$-symmetry points of $X$.

Proposition 2.

Let $X$ be a symmetric random element in $E$ about $x_0$ such that, for every $f \in E^*$, the distribution of $f(X)$ is unimodal. Then $x_0$ is a $\beta$-symmetry point for every $\beta \in (0, 1]$.

The proof appears in Appendix A.

P. 3 establishes that the local simplicial depth is monotone relative to the deepest point. Several auxiliary results, which appear in Appendix A, must be stated before proving this property.

P. 3. (monotonicity relative to the deepest point).

Let $E$ be a separable Banach space and $E^*$ the corresponding separable dual space. Let $X$ be a random element, $\beta$-symmetric about $x_0$, with probability measure $P$. Let $Q$ be a probability measure on $E^*$, independent of $X$, and assume that, for every $f \in E^*$, $f(X)$ has a unimodal density function about $f(x_0)$ and fulfills Inequality (7). Then, for every $x \in E$ and $\lambda \in [0, 1]$,

$$\mathrm{IDLD}_\beta(x_0 + \lambda (x - x_0), P) \ge \mathrm{IDLD}_\beta(x, P).$$

The proof appears in Appendix A.

Remark 3.

It is easy to see that Inequality (15) holds for the standard normal distribution. Hence, the projections of a Gaussian process fulfill P. 3.

In what follows, we show that the IDLD vanishes at infinity, under mild regularity conditions.

P. 4. (vanishing at infinity).

Assume that $LS_\beta(f(x), P_f) \le g(f(x))$ for $Q$-almost every $f \in E^*$, where $g$ is a function such that $g(t) \to 0$ as $|t| \to \infty$. Then $\mathrm{IDLD}_\beta(x, P) \to 0$ as $\|x\| \to \infty$.

The proof appears in Appendix A.

Proposition P. 5 shows that $\mathrm{IDLD}_\beta(x, P)$ is continuous as a function of $x$.

P. 5. (continuity as a function of $x$).

Let $X$ be a continuous random element and $\beta \in (0, 1]$. Then $x \mapsto \mathrm{IDLD}_\beta(x, P)$ is continuous.

The proof appears in Appendix A.

Finally, we prove that $\mathrm{IDLD}_\beta(x, P)$ is continuous as a functional of $P$.

P. 6. (continuity as a functional of $P$).

For every $x \in E$, $\mathrm{IDLD}_\beta(x, \cdot)$ is continuous as a functional of $P$.

The proof appears in Appendix A.

3 Empirical Version and Asymptotic Results

In this section we introduce the empirical counterpart of the IDLD and give the main asymptotic results.

First of all, recall the definition given by Paindaveine and Van Bever (2013) of the empirical local univariate simplicial depth. Let $F_n$ be the empirical cumulative distribution function of the sample. Then

$$LS_{\beta, n}(x, P_n) = \frac{2 \, [F_n(x) - F_n(x - t_{\beta, n}(x))] \, [F_n(x + t_{\beta, n}(x)) - F_n(x)]}{\beta^2},$$

where $t_{\beta, n}(x)$ is the empirical neighborhood width. Remark 4 entails the well-definedness of the empirical neighborhood width.

Remark 4.

Let $\beta \in (0, 1]$ and let $X_1, \ldots, X_n$ be a random sample of iid variables with distribution $F$. Given $x \in \mathbb{R}$, put $d_i = |X_i - x|$ for each $i$, and let $d_{(1)} \le \cdots \le d_{(n)}$ denote the order statistics of $d_1, \ldots, d_n$. Let $k = \lceil \beta n \rceil$. It is clear that $1 \le k \le n$, hence $d_{(k)}$ is well defined, and so the empirical neighborhood width is $t_{\beta, n}(x) = d_{(k)}$.

Then the empirical counterpart of the IDLD is given as follows.

Definition 3.

Let $X$ be a continuous random element and $X_1, \ldots, X_n$ a random sample with the same distribution as $X$. For each $f \in E^*$, $x \in E$ and $\beta \in (0, 1]$, define

$$LS_{\beta, n}(f(x), P_{n, f}), \qquad (8)$$

the empirical univariate local simplicial depth of $f(x)$ with respect to the empirical distribution $P_{n, f}$ of $f(X_1), \ldots, f(X_n)$. Let $f_1, \ldots, f_m$ be iid elements of $E^*$ with distribution $Q$. The empirical version of the IDLD at locality level $\beta$ is

$$\mathrm{IDLD}_{\beta, n, m}(x) = \frac{1}{m} \sum_{j=1}^{m} LS_{\beta, n}(f_j(x), P_{n, f_j}). \qquad (9)$$
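A minimal R sketch of Equation (9) for multivariate data, reusing the univariate helpers introduced after Equation (3) (the function names are ours); the integral over $Q$ is approximated by a Monte Carlo average over random directions with standard normal coordinates.

```r
## Minimal R sketch (names are ours): empirical IDLD of Equation (9) for
## multivariate data, averaging the univariate local simplicial depth of
## the projected sample over m random directions.
idld <- function(x, data, beta, n_proj = 100) {
  mean(replicate(n_proj, {
    f <- rnorm(ncol(data))                    # random direction f_j
    local_simplicial_depth(sum(f * x),        # projected point f_j(x)
                           drop(data %*% f),  # projected sample f_j(X_i)
                           beta)
  }))
}
```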

In order to establish the uniform strong convergence of the one-dimensional local simplicial depth, the following lemmas must be proved in advance.

Lemma 1.

Let $X$ be an absolutely continuous random variable with distribution function $F$, and let $X_1, \ldots, X_n$ be iid random variables, also with distribution $F$. Let $q_\alpha$ be the $\alpha$-quantile of $F$ and $q_{\alpha, n}$ the $\alpha$-quantile of $F_n$, the empirical cumulative distribution function of the sample. Then $q_{\alpha, n} \to q_\alpha$ a.s.

Lemma 2.

Let $X_1, \ldots, X_n$ be a real random sample with cumulative distribution function $F$. Let $\beta \in (0, 1]$ and $x \in \mathbb{R}$. Then,

$$t_{\beta, n}(x) \xrightarrow[n \to \infty]{} t_\beta(x) \quad \text{a.s.} \qquad (10)$$

The proof appears in Appendix B.

The theorems below establish the uniform strong convergence of the empirical counterpart of the univariate simplicial local depth to the population counterpart.

Theorem 1.

Let $E$ be a separable Banach space with separable dual space $E^*$. Suppose given a random sample $X_1, \ldots, X_n$ of elements of $E$ with probability measure $P$, and let $\beta \in (0, 1]$. Then, for every $f \in E^*$, we have

  1. $\sup_{x \in E} \big| t_{\beta, n}(f(x)) - t_\beta(f(x)) \big| \to 0$ a.s.; (11)
  2. $\sup_{x \in E} \big| LS_{\beta, n}(f(x), P_{n, f}) - LS_\beta(f(x), P_f) \big| \to 0$ a.s. (12)

The proof appears in Appendix B.

Theorem 2.

Let $X$ be a continuous random element on a separable Banach space $E$ with associated probability measure $P$. Let $X_1, \ldots, X_n$ be a random sample following the same distribution as $X$, and let $\beta \in (0, 1]$. Then,

$$\sup_{x \in E} \big| \mathrm{IDLD}_{\beta, n, m}(x) - \mathrm{IDLD}_\beta(x, P) \big| \xrightarrow[n, m \to \infty]{} 0 \quad \text{a.s.}$$

The proof appears in Appendix B.

4 Local Depth Regions

In this section we define the local depth inner region at locality level $\beta$, which will be instrumental in applications of local depth functions. Ideally, these central regions are independent of the coordinate system and nested. We also study, under mild regularity conditions, their asymptotic behavior.

Denote by $LD_\beta$ a local depth measure and by $LD_{\beta, n}$ its empirical counterpart. In particular, one can consider the integrated dual local depth defined in Section 2.

Definition 4.

Let $E$ be a separable Banach space and let $X$ be a random element with associated probability measure $P$. Fix a locality level $\beta$ and $\alpha \in (0, 1)$. The local inner region at locality level $\beta$ of level $\alpha$ is defined to be

$$R_\beta(\alpha) = \{ x \in E : LD_\beta(x, P) \ge \lambda_\alpha \}, \qquad (13)$$

where $\lambda_\alpha$ is chosen so that $P(R_\beta(\alpha)) \ge \alpha$. Let $X_1, \ldots, X_n$ be a random sample of elements of $E$. Then the empirical counterpart $R_{\beta, n}(\alpha)$ of $R_\beta(\alpha)$ is obtained by replacing $LD_\beta$ and $P$ by their empirical versions.

Throughout this section the locality level $\beta$ remains fixed; hence, we write $R(\alpha)$ (respectively, $R_n(\alpha)$) for $R_\beta(\alpha)$ (respectively, $R_{\beta, n}(\alpha)$) when no ambiguity is possible.

Remark 5.

If $E$ is a finite-dimensional space, then $R(\alpha)$ is invariant under orthogonal transformations.

Remark 6.

If $\alpha_1 \le \alpha_2$, then $R(\alpha_1) \subseteq R(\alpha_2)$.
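A minimal R sketch (names are ours) of the empirical inner region, under the level-set reading of Definition 4 given above: the core is the set of observations whose empirical local depth exceeds the $(1 - \alpha)$ sample quantile of the depth values.

```r
## Minimal R sketch (names are ours): indices of the observations in the
## empirical local depth inner region R_n(alpha), i.e. the top
## alpha-fraction of the sample by empirical IDLD.
inner_region <- function(data, alpha, beta, n_proj = 100) {
  depths <- apply(data, 1, idld, data = data, beta = beta, n_proj = n_proj)
  which(depths >= quantile(depths, 1 - alpha))
}
```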

Theorem 3 shows that, under mild regularity conditions, the empirical local depth inner region at locality level $\beta$ is strongly consistent with its population counterpart.

Theorem 3.

Let $E$ be a separable Banach space and let $X$ be a random element with associated probability measure $P$. Assume that

  1. $\sup_{x \in E} | LD_{\beta, n}(x) - LD_\beta(x, P) | \to 0$ a.s.

Then, for every $\alpha \in (0, 1)$ and every sequence $\alpha_n \to \alpha$:

  1. There exists an $n_0$ such that, for every $n \ge n_0$ and every $\varepsilon > 0$, $R_n(\alpha_n) \subseteq R(\alpha + \varepsilon)$ a.s.

  2. If $\alpha_n \downarrow \alpha$, then $R_n(\alpha_n) \to R(\alpha)$ a.s.

The proof appears in Appendix C.

5 A Local-Depth Based Clustering Procedure

In this section we introduce a centroid-based clustering procedure based on local depths (LDC). We propose the two-stage partition method described below. The R routines needed to compute the IDLD appear in Appendix D.

Let $X$ be a random element in a separable Banach space $E$ with distribution $P$.

  • Core clustering region.

    • Consider the local depth inner region $R(\alpha)$ at locality level $\beta$, defined in Equation (13).

    • Consider a partition of $R(\alpha)$ into $k$ clusters $G_1, \ldots, G_k$, such that $\bigcup_{j=1}^{k} G_j = R(\alpha)$ and $G_i \cap G_j = \emptyset$ for $i \ne j$.

  • Final clustering allocation.

    Based on the initial clustering configuration for the points in $R(\alpha)$, proceed to the final clustering allocation following a minimum distance rule, i.e., allocate $x$ to the cluster $\arg\min_{j} d(x, G_j)$, where $d(x, G_j) = \inf_{y \in G_j} \| x - y \|$.

The main idea of the proposal is to determine the center of each cluster as a region of the space rather than a single point. It is well known that there is no "one size fits all" clustering procedure and that the choice of clustering procedure relies heavily on the underlying distribution. Our idea is to have centers with a flexible shape, allowing better capture of the cluster distribution. Typically, center-based clustering proposals perform very well under spherical distributions. More flexibility in the shape of the central region should be reflected in a better performance at detecting the true clustering structure under a wide range of distributions, including elliptical distributions. In addition, since depth measures are closely related to robustness, the core clustering regions are expected to be resistant to the presence of outliers.

In Step 1 part b), any clustering procedure can be considered; for the sake of simplicity, in what follows we use the classical $k$-means algorithm. If the number of clusters $k$ is not given beforehand, it can be estimated using any procedure existing in the literature.

The empirical counterpart of the proposal is given in a straightforward way, employing a classical plug-in procedure.

Let $X_1, \ldots, X_n$ be iid observations in a separable Banach space, with a cluster structure. Denote by $R_n(\alpha)$ the empirical local depth inner region at locality level $\beta$, and let $G_{1, n}, \ldots, G_{k, n}$ denote the initial partition obtained in Step 1 part b). The final allocation assigns $X_i$ to the cluster $\arg\min_{j} d(X_i, G_{j, n})$, where $d(X_i, G_{j, n}) = \min_{X_l \in G_{j, n}} \| X_i - X_l \|$. A sketch of the whole procedure is given below.
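The following is a minimal R sketch of the two-stage procedure under the conventions above (function names are ours; the core is clustered with $k$-means, as in Step 1 part b)).

```r
## Minimal R sketch (names are ours) of LDC: cluster the empirical core
## region, then allocate every observation to the nearest core group.
ldc <- function(data, k, alpha, beta, n_proj = 100) {
  core <- inner_region(data, alpha, beta, n_proj)   # Step 1 a): core region
  core_mat <- data[core, , drop = FALSE]
  km <- kmeans(core_mat, centers = k, nstart = 10)  # Step 1 b): k-means on core
  # Step 2: minimum distance rule; each observation inherits the group
  # of its nearest core observation.
  apply(data, 1, function(x) {
    km$cluster[which.min(rowSums(sweep(core_mat, 2, x)^2))]
  })
}
```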

Remark 7.

The core observations of the clustering procedure can be selected using any local depth, as long as its empirical version is consistent.

6 Simulations and Real Data Examples

In this section we numerically analyze the performance of the clustering procedure introduced in Section 5. Simulations were run in both the finite- and the infinite-dimensional settings. In addition, real data examples are analyzed. The LDC procedure can be implemented using not only the IDLD but also any other local depth proposal available in the literature.

6.1 Simulations: Multivariate data

The main aim of this section is to evaluate the performance of our clustering proposal under a wide range of clustering configurations. Specifically, we analyze cases where the data present sparseness or outliers, or where the sizes of the groups are not balanced. To this end, we work under fourteen different scenarios. The original variable distributions were proposed by Witten and Tibshirani (2010) and extended by Kondo et al. (2016). Our proposal is compared against several well-known clustering procedures, which are briefly described below.

In all cases the data have a three-group structure, and each group has 300 observations. The data are generated as follows.

Model 1: The data are spherically generated, following $N(\mu_j, I)$ for $j = 1, 2, 3$, with centers $\mu_1, \mu_2, \mu_3$ and identity covariance matrix.

Model 2: The data are ellipsoidally generated, following $N(\mu_j, \Sigma)$ for $j = 1, 2, 3$, with centers $\mu_1, \mu_2, \mu_3$ and a common covariance matrix $\Sigma$.

In these two models, the first two variables are informative while the last one is noise.

Model 3 (respectively, Model 4) is a five-dimensional dataset. The first three variables have the same distribution as in Model 1 (respectively, Model 2); the remaining variables are two independent noisy variables.

We then consider two different contamination settings. In each of them we add five outliers, replacing a single coordinate by a variable generated from a uniform distribution on a wide interval. In the first setting, Models 5-8, the contamination replaces the first coordinate (an informative variable) of the first five observations of the first cluster, while the rest of the distribution remains as in Models 1-4. In Models 9-12, the contamination has the same distribution but is placed in the last coordinate, which is a non-informative variable. The two remaining models, 13 and 14, have clusters of unbalanced sizes: the same distributions are followed as in Models 1 and 2, but instead of all clusters having the same number of observations, the proportions of observations in the three clusters differ, with the last two clusters of equal size.
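For concreteness, a sketch of the Model 1 design follows; the true centers are not recoverable here, so the values of $\mu_j$ below are illustrative placeholders only.

```r
## Sketch of Model 1 (the centers mu_j are illustrative placeholders,
## not the values used in the paper): three spherical Gaussian groups
## of 300 observations, with only the first two coordinates informative.
gen_model1 <- function(n = 300) {
  centers <- rbind(c(0, 0, 0), c(4, 0, 0), c(0, 4, 0))  # assumed centers
  x <- do.call(rbind, lapply(1:3, function(j)
    matrix(rnorm(n * 3), n, 3) + rep(centers[j, ], each = n)))
  list(x = x, labels = rep(1:3, each = n))
}
```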

The benchmark clustering procedures are:

  • The $k$-means algorithm, with ten random initializations.

  • The sparse $k$-means clustering procedure (SKM), introduced by Witten and Tibshirani (2010). The tuning parameter (the $L_1$ bound) is chosen as suggested in the literature, and five random initializations are considered.

  • The robust and sparse $k$-means clustering procedure (RSKM), proposed by Kondo et al. (2016). Two tuning parameters must be set; both have been set as suggested in [12]: the parameter corresponding to the $L_1$ norm and the trimming proportion.

  • The model-based clustering procedure (MCLUST), proposed by Fraley and Raftery (2002, 2009), designed to cluster mixtures of normal distributions.

SKM is designed to cluster observations in a high-dimensional setting with a low proportion of clustering-informative variables. RSKM is a robust extension of SKM.

The LDC introduced in Section 5 has been implemented using three definitions of local depth; in every case the parameters were chosen following Hennig [8], and the results were very stable.

  • The simplicial local depth procedure (LDCS), introduced by Agostinelli and Romanazzi (2011). We used the R package localdepth; the threshold value for the evaluation of the local depth was calculated with the quantile.localdepth function, with the quantile order of the statistic set as suggested in the same R package.

  • The local version of depth (LDCPV), according to the proposal of Paindaveine and Van Bever (2013), using the R package DepthProc.

  • The integrated dual local depth (LDCI) introduced in Section 2. As with LDCPV, we used the same locality level and generated the random projections with a standard normal distribution. Routines are available in Appendix E of the Supplementary Material.

The parameter $\alpha$ represents the proportion of the data contained in the core regions of the clusters. If this value is very small, the procedure behaves very similarly to $k$-means, and is unable to capture the shape of the clusters. If it takes high values, the core regions will contain observations with moderate local depth, which can lead to assignment errors. For these reasons we suggest taking moderate values of $\alpha$. To set this parameter we performed a sensitivity analysis, following the resampling ideas proposed by Hennig [8], from which we could see that the method is stable in all cases; since one value showed slightly better performance in most cases, we kept it fixed throughout the study. The same number of replicates was performed for each model.

There is no commonly accepted criterion for evaluating the performance of a clustering procedure. Nonetheless, since we are dealing with synthetic datasets, we know the true label of each observation; hence, in these cases we may use the Correct Classification Rate (CCR). We denote the original clusters by $G_1, \ldots, G_k$. Let $\ell_i$ be the group label of observation $i$ and $\hat{\ell}_i$ the class label assigned by the clustering algorithm. Let $\Pi_k$ be the set of permutations over $\{1, \ldots, k\}$. Then the CCR is given by:

$$\mathrm{CCR} = \max_{\pi \in \Pi_k} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{ \ell_i = \pi(\hat{\ell}_i) \}. \qquad (14)$$
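A minimal R sketch of Equation (14) (names are ours; it assumes the combinat package for enumerating permutations, and labels coded in $1, \ldots, k$).

```r
## Minimal R sketch of the CCR in Equation (14): maximize the empirical
## agreement over all relabelings of the k clusters.
## `true` and `assigned` are label vectors with values in 1..k.
ccr <- function(true, assigned, k = max(true)) {
  agree <- sapply(combinat::permn(k),           # all permutations of 1..k
                  function(p) mean(true == p[assigned]))
  max(agree)
}
```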

The results of the simulation are exhibited in Table 1. As expected, all the clustering procedures have an excellent performance for Models 1 and 3, where all the clusters are spherical and there are no outliers. For Models 2 and 4, where the clusters have an elliptical distribution, MCLUST has an outstanding performance, and it is clear that LDC (with any local depth measure) performs better than the other three alternatives. In Models 5 to 12, since $k$-means, SKM and MCLUST are non-robust procedures, they fail in the classification of the observations: typically, the five outliers make up one group, and one of the true clusters is split into two. LDC and RSKM are based on more robust clustering criteria, hence both methods perform well; RSKM seems to perform better under spherical distributions, while LDC performs better under elliptical distributions. It is clear that LDC has a good performance for Models 1 to 12, and that the choice of the local depth is not crucial. Nonetheless, when the cluster sizes are unbalanced, the only criteria able to correctly detect the cluster structure are MCLUST and LDC with the integrated dual local depth. It is clear that LDC combined with the other two local depth proposals is not able to detect the centers of the clusters. The remaining clustering procedures perform well in the spherical case but fail in the elliptical case. In summary, LDCI is the only clustering procedure versatile enough to detect clusters under adverse situations (sparse data, outliers and unbalanced cluster sizes).

Table 1: Mean CCR for each clustering criterion ($k$-means, SKM, RSKM, MCLUST, LDCS, LDCPV, LDCI) and each distribution configuration (Models 1-14).

In what follows we compare the computational times of the three local depth measures. The simulations were based on data generated according to Model 3, but instead of having three noise variables, we added independent normal noise variables centered at the origin with unit standard deviation until reaching total dimension $p$, for $p = 5, 35, 65$. We also considered different sample sizes $n$. For the IDLD, the random directions were generated as before. Since the computational time increases exponentially as the dimension increases, we only performed a reduced number of replicates under each scenario.

Table 2: Mean computing time for LDS, LDPV and IDLD, for dimensions $p = 5, 35, 65$ and several sample sizes $n$.

From Table 2 we can see that IDLD is the fastest procedure in every case; moreover, it is not affected by the dimension of the dataset, while the computational effort required by LDS and LDPV grows dramatically as $p$ increases. LDPV is overall the slowest procedure. Even though all the procedures demand more time as the sample size grows, IDLD has the least pronounced growth rate.

6.2 Simulations: Multivariate functional data

In this section we present the results of a simulation study for multivariate functional data, a setting for which clustering procedures are scarce. We replicate the simulations of Schmutz et al. (2017), who present three different scenarios. In every case, the data are bivariate.

Model A. Three groups, each of them with the same number of observations.

Group 1:
Group 2:
Group 3:

Here $\varepsilon_1$ is white noise with variance $\sigma_1^2$, and $\varepsilon_2$ is white noise with variance $\sigma_2^2$. The curves are generated at equidistant points of the interval.

Model B. Four groups, each of them with the same number of observations.

Group 1:
Group 2:
Group 3:
Group 4:

Here $U$ is a random coefficient and $\varepsilon$ is white noise independent of $U$. The functions $h_1$ and $h_2$ are defined by means of the positive part $(\cdot)_+$. The curves are generated at equidistant points of the interval.

Model C. Four groups, each of them with the same number of observations.

Group 1:
Group 2:
Group 3:
Group 4:

Here, $U$, $h_1$ and $h_2$ are defined as before. The curves are generated at equidistant points of the interval.

As in the original paper, the estimated partition is compared with the theoretical one via the Adjusted Rand Index (ARI), computed with the adjustedRandIndex function from the mclust R package. For each model, 50 replications were carried out. Schmutz et al. (2017) report the ARI for several settings of their proposal, and also for funclust (Jacques and Preda 2014) as well as for two proposals introduced by Ieva et al. (2013). In Table 3 we present the maximum value of the ARI for Schmutz et al. and for the remaining procedures. It is clear that LDCI outperforms the rest of the proposals by far, since it does not misclassify any observation throughout the simulation study.
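For reference, the ARI computation on a replicate would look as follows (the label vectors are hypothetical placeholders).

```r
## ARI between the estimated and the true partitions (mclust package);
## `estimated` and `truth` are hypothetical label vectors.
library(mclust)
adjustedRandIndex(estimated, truth)
```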

Table 3: ARI for the different clustering procedures (LDCI, best setting of Schmutz et al., funclust) for multivariate functional data under Models A, B and C.

Computational results for functional data, considering synthetic and real examples, appear in Appendix D.

6.3 Real data examples for mixed-type datasets

Our aim in this section is to analyze the AEMET dataset, from the R library fda.usc. This dataset contains series of daily summaries of Spanish weather stations for the period 1980-2009. We analyze the clustering structure of the dataset formed by the following variables: the mean daily wind speed between 1980 and 2009 (which is a functional variable) and the geographic information of each station: longitude, latitude and altitude, which are real variables. Analyzing these variables together is relevant, given that altitude influences the intensity of the wind. Although the sensors are located at the same height above the ground, phenomena related to the climate of the region may generate deformations in the wind intensity curves. To apply the LDC clustering criterion, we must be precise about the definition of the IDLD for data sets with these characteristics. Our proposal is to project the functional variable as we have done in Section 6.2 and the multivariate variables as in Section 6.1; we then join those two projections with equal weight and compute the IDLD. We look for two clusters; the parameters of the clustering procedure were settled upon visual considerations of the dataset. After performing the clustering analysis we obtained two groups: one of them corresponds to the coastal stations (orange) while the other corresponds to the continental ones (red), as can be seen in Figure 2. This classification corresponds to the well-known fact that the wind speed is more constant over coastal areas, a fact exploited, for example, by wind farms.
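A minimal R sketch of the equal-weight mixed projection described above; the variable selection, standardization and column names are our assumptions, with the aemet object taken from fda.usc.

```r
## Minimal R sketch of the mixed projection used for the IDLD on AEMET:
## a random functional direction for the wind-speed curves plus a random
## vector for the (standardized) geographic covariates, equally weighted.
library(fda.usc)
data(aemet)
curves <- aemet$wind.speed$data                 # stations x grid matrix
covars <- scale(aemet$df[, c("longitude", "latitude", "altitude")])
# (column names in aemet$df are assumed here)

project_mixed <- function(curves, covars) {
  f1 <- rnorm(ncol(curves)) / sqrt(ncol(curves))  # functional direction
  f2 <- rnorm(ncol(covars))                       # multivariate direction
  drop(curves %*% f1) + drop(covars %*% f2)       # equal-weight combination
}
```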

Figure 2: Geographical position of each meteorological station. The stations that belong to the coastal group are in orange, while the ones that belong to the continental group appear in red.

Finally, to understand the conformation of the groups in an integral way, it is convenient to analyze the core regions for the mean wind speed and for the altitude of the stations. The stations corresponding to the continental core region are at higher altitudes and suffer more variability in wind intensity, as shown in the central and right panels of Figure 3. In contrast, the coastal stations, which are in lower zones, have less daily variability and apparently stronger winds, as can be seen in the left and right panels of Figure 3.

Figure 3: Left: the red curves correspond to the core observations of the mean wind speed for the coastal cluster. Center: the yellow curves are the core observations of the mean wind speed for the continental cluster. Right: group conformation for the altitude, with the coastal cluster in red and the continental cluster in yellow.

7 Final remarks

In this paper, we introduced a local depth measure, the IDLD, suitable for data in a general Banach space and with a low computational burden. It is an exploratory data analysis tool which can be used in any statistical procedure that seeks to study local phenomena. From the theoretical perspective, local depths are expected to be generalizations of a global depth measure, and our proposal has this property. Additionally, local depths are expected to inherit good properties from global depths, a point that has been overlooked in the local depth literature. Strong consistency results for the local depth and for the local depth regions have been proved.

From the practical point of view, we explored the use of local depth measures in cluster analysis, introducing a simple clustering procedure. The first stage splits the local inner region into groups; the remaining points are then assigned to the closest group of the local inner region. The flexibility in the shape of the groups made up by the points in the local inner region translates into flexibility in the shapes of the groupings of the entire space. Computational experiments reflect this fact, showing an excellent performance under a wide range of clustering configurations.

References

  • [1] Agostinelli, C. (2018). “Local half-region depth for functional data.” Journal of Multivariate Analysis 163, 67-79.
  • [2] Agostinelli, C., and M. Romanazzi. (2011). “Local Depth.” Journal of Statistical Planning and Inference, 141, 817-830.
  • [3] Chakraborty, B., and P. Chaudhuri (1996). “On transformation and retransformation technique for constructing an affine equivariant multivariate median.” Proceedings of the American Mathematical Society 124, 2539-2547.
  • [4] Chakraborty, B., and P. Chaudhuri (1998). “Operating transformation retransformation on spatial median and angle test.” Statistica Sinica 8, 767-784.
  • [5] Cuevas, A., and R. Fraiman (2009). “On depth measures and dual statistics. A methodology for dealing with general data.” Journal of Multivariate Analysis 100(4), 753-766.
  • [6] Fraley, C., and A. E. Raftery (2002). “Model-based clustering, discriminant analysis, and density estimation.” Journal of the American Statistical Association 97, 611-631.
  • [7] Fraley C., and A. E. Raftery (2009). “MCLUST Version 3 for R: Normal Mixture Modeling and Model-based Clustering,” Technical Report No. 504, Department of Statistics, University of Washington.
  • [8] Hennig, C. (2007). “Cluster-wise assessment of cluster stability.” Computational Statistics and Data Analysis 52, 258-271.
  • [9] Ieva, F., A. M. Paganoni, D. Pigoli, and V. Vitelli (2013). “Multivariate functional clustering for the morphological analysis of electrocardiograph curves.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 62, 401-418. doi:10.1111/j.1467-9876.2012.01062.x
  • [10] Jacques, J., and C. Preda (2014). “Model-based clustering of functional data.” Computational Statistics and Data Analysis 71, 92-106. DOI:10.1016/j.csda.2012.12.004
  • [11] Kotík, L., and D. Hlubinka (2017). “A weighted localization of halfspace depth and its properties.” Journal of Multivariate Analysis 157, 53-69.
  • [12] Kondo, Y., M. Salibian-Barrera, and R. H. Zamar (2016). “A robust and sparse K-means clustering algorithm.” Journal of Statistical Software 72(5).
  • [13] Lopez-Pintado, S., and J. Romo (2011). “A half-region depth for functional data.” Computational Statistics and Data Analysis 55(4), 1679-1695.
  • [14] Paindaveine, D., and G. Van Bever (2013). “From depth to local depth: A focus on centrality.” Journal of the American Statistical Association 108(503), 1105-1119.
  • [15] Schmutz, A., J. Jacques, C. Bouveyron, L. Cheze, and P. Martin (2017). “Clustering multivariate functional data in group-specific functional subspaces.” (Unpublished) https://hal.inria.fr/hal-01652467/file/Clusteringmultivariatefunctionaldata.pdf
  • [16] Witten, D., and R. Tibshirani (2010). “A framework for feature selection in clustering.” Journal of the American Statistical Association 105(490), 713-726.
  • [17] Zuo, Y., and R. Serfling (2000). “General notions of statistical depth function.” The Annals of Statistics 28(2), 461-482.

8 Appendix A: Proofs of properties P. 1 - P. 6

Proof: P. 1. (affine invariance).

Since $E$ has finite dimension, without loss of generality we assume that $E = \mathbb{R}^d$. By the change of variables theorem, the univariate distribution of the projection of $AX + b$ along a direction $f$ coincides with that of the projection of $X$ along the transformed direction. Since the Haar measure is invariant under unitary linear transformations, and the univariate local simplicial depth is invariant under translations and changes of scale, we have that $\mathrm{IDLD}_\beta(Ax + b, \tilde{P}) = \mathrm{IDLD}_\beta(x, P)$. ∎

Proof: P. 2. (maximality at the center).

It is enough to show that, for each $f \in E^*$ and $\beta' \le \beta$, the univariate local depth $LS_{\beta'}(\cdot, P_f)$ attains a global maximum at $f(x_0)$. Since the supremum is bounded and attained, $LS_{\beta'}(\cdot, P_f)$ has a global maximum at $f(x_0)$. Then, integrating over $f$, it is clear that $x_0$ is also a global maximum of $\mathrm{IDLD}_{\beta'}(\cdot, P)$. ∎

Proof (Proposition 1).

$X$ is $\beta$-symmetric about $x_0$ if, for every $f \in E^*$, $f(X)$ is $\beta$-symmetric about $f(x_0)$. Then, for every $\beta' \le \beta$, we have $t_{\beta'}(f(x_0)) \le t_\beta(f(x_0))$, so the symmetry condition (5) holds for every $0 \le t \le t_{\beta'}(f(x_0))$. Finally, $f(X)$ is $\beta'$-symmetric about $f(x_0)$ for every $f \in E^*$, which is what we wanted to show. ∎

Proof (Proposition 2).

First note that, given $f \in E^*$ and $t \ge 0$, the symmetry of $X$ about $x_0$ implies that $f(X)$ is symmetric about $f(x_0)$. From the definition of the neighborhood width, it is clear that the $\beta$-neighborhood of $f(x_0)$ is symmetric about $f(x_0)$. Since the unimodal density of $f(X)$ attains a global maximum at $f(x_0)$, the symmetry condition (5) holds on the neighborhood, and hence $x_0$ is a $\beta$-symmetry point. ∎