Statistical depth in abstract metric spaces

The concept of depth has proved very important for multivariate and functional data analysis, as it essentially acts as a surrogate for the notion of a ranking of observations, which is absent in more than one dimension. Motivated by the rapid development of technology, in particular the advent of ‘Big Data’, we extend here that concept to general metric spaces, propose a natural depth measure and explore its properties as a statistical depth function. Working in a general metric space allows the depth to be tailored to the data at hand and to the ultimate goal of the analysis, a very desirable property given the polymorphic nature of modern data sets. This flexibility is thoroughly illustrated by several real data analyses.

1 Introduction

Huge parts of statistical theory, especially its nonparametric side, rely heavily on the notion of ranks; see for instance Gibbons and Chakraborti (2010). However, ranks are not well defined in a multivariate framework, as there exists no natural ordering in more than one dimension. This fact motivated Tukey (1975) to introduce the notion of statistical depth as a surrogate for ‘multivariate ranks’. Concretely, a depth is a measure of how central (or how outlying) a given point is with respect to a multivariate probability distribution.

Zuo and Serfling (2000), following some earlier considerations in Liu (1990), formulated the properties that a valid depth measure should satisfy. Since then, depth-based procedures have proved very important tools for robust multivariate statistical analyses, e.g. see Liu et al (1999) and Li and Liu (2004, 2008). Serfling (2006) and Mosler (2013) offer excellent short reviews of the ideas surrounding the concept of depth, while Hallin et al (2021) recently shed new light on the problem of ‘multivariate ranks’.

The early 21st century has also seen such technological progress in recording devices and memory capacity, that any spatio-temporal phenomenon can now be recorded essentially in continuous time or space, giving rise to ‘functional’ random objects. As a result, a solid theory for Functional Data Analysis (FDA) has been developed as well, allowing the extension of most of the classical problems of statistical inference from the multivariate context to the inherently infinite-dimensional functional case. In particular, functional versions of statistical depth have been investigated (Fraiman and Muniz, 2001, Cuevas et al, 2007, López-Pintado and Romo, 2009, Dutta et al, 2011, López-Pintado and Romo, 2011, Sguera et al, 2013, Chakraborty and Chaudhuri, 2014, Hlubinka et al, 2015, Nieto-Reyes and Battey, 2021, Nieto-Reyes et al, 2021). It is worth noting that an infinite-dimensional environment implies specific theoretical and practical challenges, making the extension from ‘multivariate’ to ‘functional’ a non-trivial one (Nieto-Reyes and Battey, 2016).

In this paper, we carry on with this gradual extension process by defining the statistical depth for complex random objects living in abstract metric spaces. Again, this extension is motivated by the rapid development of technology. Indeed, this is the ‘Big Data’ era, in which digital data are recorded everywhere, all the time. The information that this huge amount of data contains may enable next-generation scientific breakthroughs, drive business forward or hold governments accountable. However, this is conditional on the existence of a statistical toolbox suitable for such Big Data, whose profusion and nature induce commensurate challenges. Indeed, those data consist of objects as various as high-dimensional/infinite-dimensional vectors, matrices or functions representing images, shapes, movies, texts, handwriting or speech (to cite a few), and live streaming series thereof; this is often summarised as ‘3V’ (Volume, Variety and Velocity).

Mainstream statistical techniques often fall short for analysing such complex mathematical objects. Yet, it remains true that any statistical analysis requires a sense of how close two instances of the object of interest are to one another. It is then only natural to assume that they live in a space where distances can be defined – that is, in a certain metric space (Snášel et al, 2017). This motivates the need for a statistical depth defined in an abstract metric space; hence our proposal of a ‘metric depth’.

The idea that the concept of multivariate statistical depth could be extended to general non-Euclidean settings can be traced back to Carrizosa (1996, Section 3.1). Later, Li et al (2011) considered a depth-based procedure for analysing abundance data, which are typically high-dimensional discrete data with many observed 0’s. Because of that particular structure, the classical Euclidean distance is not optimal for quantifying (dis)similarities between observations, and analysts in the field usually prefer more specific metrics such as the Bray-Curtis distance (Bray and Curtis, 1957); strictly speaking, the Bray-Curtis ‘distance’ does not satisfy the triangle inequality, hence it is rather a semi-distance. In consequence, inspired by earlier works by Maa et al (1996) and Bartoszynski et al (1997), Li et al (2011) devised a depth measure which allows the proximity between observations to be quantified by a specific, user-chosen distance/dissimilarity measure.

This flexibility appears even more desirable when dealing with the polymorphous objects commonly found in modern data sets, as described above. For instance, functional objects are much richer than just infinite-dimensional vectors, and they can be compared on many different grounds: general appearance, short- or long-range variation, oscillating behaviour, etc. This makes the choice of the ‘proximity measure’ between two such objects a crucial one (Ferraty and Vieu, 2006, Chapter 3). On a more theoretical basis, an appropriate choice of such ‘proximity measure’ sometimes allows one to get around issues caused by the ‘Curse of Dimensionality’ (Geenens, 2011a).

Quantifying (dis)similarities between non-numeric objects is even more subject to discretionary choice. As an example, for comparing pieces of text, the literature in text mining, linguistics and natural language processing has proposed numerous metrics such as the Levenshtein distance, the Hamming distance, the Jaccard index or the Dice coefficient, each targeting different dimensions of words, sentences or texts, such as similarity in spelling or similarity in meaning (Wang and Dong, 2020). It is, therefore, paramount to have access to statistical procedures which allow a free choice of metric, and may be tailored to the kind of data at hand and to the ultimate purpose of the analysis.

Indeed, our proposed ‘metric depth’ ($MD$), defined in Section 2, enables such flexible analyses. Its main properties are explored in Section 3 and an empirical version (computable from a sample) is described in Section 4. Section 5 illustrates its capabilities on several real data sets, including an application in ‘text mining’ (Section 5.5). Section 6 concludes.

2 Statistical depth in metric spaces: definition

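Assume that the random object of interest, say $X$, lives in a certain space $\mathcal{M}$ which can be equipped with a distance $d$. To avoid dispensable technical complications, it will be assumed throughout that $(\mathcal{M}, d)$ is a complete and separable metric space. Let $\mathcal{A}$ be the $\sigma$-algebra on $\mathcal{M}$ generated by the open $d$-metric balls and $\mathcal{P}(\mathcal{M})$ be the space of all probability measures defined on the Borel sets of $\mathcal{M}$. (In a separable metric space, the Borel $\sigma$-algebra is generated by the open balls.) This makes $(\mathcal{M}, \mathcal{A}, P)$ a proper probability space for any $P \in \mathcal{P}(\mathcal{M})$. In particular, it will be assumed that the distribution $P$ of $X$ belongs to $\mathcal{P}(\mathcal{M})$. Note that the cartesian product space $(\mathcal{M} \times \mathcal{M}, \mathcal{A} \otimes \mathcal{A}, P \otimes P)$ is then also a valid probability space (Parthasarathy, 1967, Theorem I.1.10). We denote:

$$P\big(S(X_1, X_2)\big) = \int_{\mathcal{M}} \int_{\mathcal{M}} \mathbb{1}\{S(x_1, x_2)\} \, dP(x_1) \, dP(x_2)$$

for any $\mathcal{A} \otimes \mathcal{A}$-measurable statement $S(x_1, x_2)$, where the indicator $\mathbb{1}\{\cdot\}$ returns the value 1 if the statement is true, and 0 otherwise. So, $P(S(X_1, X_2))$ returns the probability that $S(X_1, X_2)$ is true if $X_1, X_2$ are two independent replications of $X$, whose distribution is $P$.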

Then we give the following definition:

Definition 2.1.

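The ‘metric depth’ (‘$MD$’) of the point $\chi$ in the metric space $(\mathcal{M}, d)$ with respect to the probability measure $P \in \mathcal{P}(\mathcal{M})$ is defined as:

$$MD(\chi, P) = P\big(d(X_1, X_2) > \max\{d(X_1, \chi), d(X_2, \chi)\}\big). \tag{2.1}$$

For each fixed $\chi \in \mathcal{M}$, the set $\{(x_1, x_2) \in \mathcal{M} \times \mathcal{M} : d(x_1, x_2) > \max\{d(x_1, \chi), d(x_2, \chi)\}\}$ belongs to the $\sigma$-algebra $\mathcal{A} \otimes \mathcal{A}$ with $\mathcal{A}$ defined above, making the probability statement in (2.1) a well-defined one for any $P \in \mathcal{P}(\mathcal{M})$.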

The interpretation of (2.1) in terms of depth is clear: a point $\chi$ is deep with respect to the distribution $P$ if it is likely to be found ‘between’ two objects $X_1$ and $X_2$ in $\mathcal{M}$ randomly generated from $P$. ‘Between’ here means that the side joining $X_1$ and $X_2$ is the longest in a ‘triangle’ of $\mathcal{M}$ with vertices $X_1$, $X_2$ and $\chi$, or, in other words, that $\chi$ belongs to the intersection of the two open $d$-balls $B(X_1, d(X_1, X_2))$ and $B(X_2, d(X_1, X_2))$, where $B(x, r)$ is the ball with centre $x$ and radius $r$. In this sense, (2.1) is an extension of the vectorial ‘lens depth’ (Liu and Modarres, 2011). If we define

$$L(x_1, x_2) = B(x_1, d(x_1, x_2)) \cap B(x_2, d(x_1, x_2)), \tag{2.2}$$

the ‘lens’ defined by $x_1$ and $x_2$ in $\mathcal{M}$, then $MD(\chi, P) = P(\chi \in L(X_1, X_2))$. This is the probability that a random set contains a certain element $\chi$, and interesting parallels can be drawn with the theory of random sets, in particular Choquet capacities and related ideas (Molchanov, 2005, Chapter 1). Note that, independently of this work, Cholaquidis et al (2020) recently explored the extension of the ‘lens depth’ to general metric spaces as well. Their focus and the content of their paper are, however, quite different from what is investigated here.
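As a quick sanity check on the definition, consider the simplest possible setting (an illustration added here, not part of the original development): $\mathcal{M} = \mathbb{R}$ with $d(x, y) = |x - y|$, and a distribution $P$ with continuous distribution function $F$. Writing $MD(\chi, P)$ for the depth (2.1) and $L(x_1, x_2)$ for the lens (2.2), the lens reduces to the open interval between the two points:

```latex
\begin{align*}
L(x_1, x_2) &= (x_1 - r,\, x_1 + r) \cap (x_2 - r,\, x_2 + r), \qquad r = |x_1 - x_2|, \\
            &= \big(\min\{x_1, x_2\},\, \max\{x_1, x_2\}\big), \\
MD(\chi, P) &= P(X_1 < \chi < X_2) + P(X_2 < \chi < X_1) = 2 F(\chi)\{1 - F(\chi)\}.
\end{align*}
```

In this special case the depth is largest (equal to $1/2$) at the median of $P$ and decreases to 0 in both tails, which is the ranking-from-the-centre behaviour one expects from a depth.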

3 Main properties

The fact that the distance $d$ is left free really makes the metric depth a very flexible tool, as any meaningful distance $d$ equipping $\mathcal{M}$ can be used in (2.1) without altering the theoretical properties which we explore below.

In addition, we note that nowhere in the developments do we explicitly use the fact that $d(\chi, \xi) = 0 \iff \chi = \xi$ for any two $\chi, \xi \in \mathcal{M}$ (identity of indiscernibles). A proximity measure which satisfies all the properties of a distance (non-negativity, symmetry and triangle inequality) but not ‘identity of indiscernibles’ is called a pseudo-distance. Hence, the metric depth (2.1) can be used in conjunction with a pseudo-distance, while keeping its essential features. We can, for instance, assess the proximity between two objects by comparing the coefficients of their leading terms when expanded in certain bases, such as a spline basis in the case of functional data when smoothing the original data is necessary (Ramsay and Silverman, 2005, Chapter 3). Other examples are given in Section 5.

3.1 Elasticity invariance

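  ($\mathcal{P}_1$) Let $\phi: \mathcal{M} \to \mathcal{M}$ be an ‘elastic’ map in the sense that for any $\chi, \xi, \zeta, \nu \in \mathcal{M}$, $d(\chi, \xi) \le d(\zeta, \nu) \iff d(\phi(\chi), \phi(\xi)) \le d(\phi(\zeta), \phi(\nu))$. Then, $MD(\phi(\chi), P_\phi) = MD(\chi, P)$, where $P_\phi$ is the push-forward distribution of the image through $\phi$ of a random object of $\mathcal{M}$ having distribution $P$.

This follows from the fact that for such a map $\mathbb{1}\{d(X_1, X_2) > \max\{d(X_1, \chi), d(X_2, \chi)\}\} = \mathbb{1}\{d(\phi(X_1), \phi(X_2)) > \max\{d(\phi(X_1), \phi(\chi)), d(\phi(X_2), \phi(\chi))\}\}$. Such maps obviously include any isometry, such that $d(\phi(\chi), \phi(\xi)) = d(\chi, \xi)$, and other dilation-type transformations such that $d(\phi(\chi), \phi(\xi)) = \alpha \, d(\chi, \xi)$ for some positive scalar constant $\alpha$, but are not limited to those. Clearly, ($\mathcal{P}_1$) establishes $MD$ as a purely topological concept. On another note, ($\mathcal{P}_1$) may be thought of as an extension of property P1 in Zuo and Serfling (2000, p. 463): that a depth measure in $\mathbb{R}^d$ ‘should not depend on the underlying coordinate system or, in particular, on the scales of the underlying measurements’.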
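The invariance can be checked numerically on a sample. The following is a minimal sketch, not the authors' implementation: it assumes synthetic points in the plane, the Euclidean distance, and a hypothetical map `phi` combining a rotation, a dilation and a translation, which preserves the ordering of pairwise distances. The helper `empirical_depth` simply counts the share of sample pairs whose lens contains the point.

```python
import numpy as np
from itertools import combinations


def empirical_depth(chi, sample, dist):
    """Share of sample pairs (x_i, x_j) whose lens contains chi."""
    d_chi = [dist(x, chi) for x in sample]
    pairs = list(combinations(range(len(sample)), 2))
    hits = sum(
        dist(sample[i], sample[j]) > max(d_chi[i], d_chi[j]) for i, j in pairs
    )
    return hits / len(pairs)


def euclid(x, y):
    return float(np.linalg.norm(x - y))


rng = np.random.default_rng(0)
sample = list(rng.normal(size=(30, 2)))  # synthetic points, for illustration only
chi = np.array([0.2, -0.1])

theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])


def phi(x):
    """Hypothetical elastic map: rotate, dilate by 3, then translate."""
    return 3.0 * (rotation @ x) + np.array([5.0, -2.0])


depth_before = empirical_depth(chi, sample, euclid)
depth_after = empirical_depth(phi(chi), [phi(x) for x in sample], euclid)
print(depth_before, depth_after)
```

Since only the ordering of distances enters the indicator, the two printed values coincide; a map that reorders pairwise distances would in general change the depth.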

3.2 Vanishing at infinity

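Assume that $(\mathcal{M}, d)$ is an unbounded metric space, i.e., $\sup_{\chi, \xi \in \mathcal{M}} d(\chi, \xi) = \infty$. Then:

  ($\mathcal{P}_2$) For any $P \in \mathcal{P}(\mathcal{M})$ and $\xi \in \mathcal{M}$, $\lim_{d(\chi, \xi) \to \infty} MD(\chi, P) = 0$.

This follows from Proposition 1(a) in Cholaquidis et al (2020). It is obviously the analogue to Zuo and Serfling (2000)’s P4: ‘The depth of a point $x$ should approach 0 as $\|x\|$ approaches infinity’.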

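Now suppose that, for $P \in \mathcal{P}(\mathcal{M})$ and $\chi \in \mathcal{M}$,

$$P\big(d(X_1, X_2) = \max\{d(X_1, \chi), d(X_2, \chi)\}\big) = 0. \tag{3.1}$$

This kind of continuity condition guarantees that, with probability 1, a given $\chi$ will not lie exactly on the boundary of a random lens such as (2.2). Then, we can prove the following properties ($\mathcal{P}_3$) and ($\mathcal{P}_4$).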

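3.3 Continuity in $\chi$

  ($\mathcal{P}_3$) For any $P \in \mathcal{P}(\mathcal{M})$ and $\chi \in \mathcal{M}$ such that (3.1) holds, and $\epsilon > 0$, there exists $\delta > 0$ such that $|MD(\xi, P) - MD(\chi, P)| < \epsilon$ for all $\xi \in \mathcal{M}$ with $d(\chi, \xi) < \delta$.

Indeed, for any $\chi \in \mathcal{M}$, take $\xi \in \mathcal{M}$ such that $d(\chi, \xi) < \delta$, for some $\delta > 0$. Then, by the triangle inequality, for any $x_1, x_2 \in \mathcal{M}$,

$$\max\{d(x_1, \chi), d(x_2, \chi)\} - \delta < \max\{d(x_1, \xi), d(x_2, \xi)\} < \max\{d(x_1, \chi), d(x_2, \chi)\} + \delta.$$

Hence, $MD(\xi, P)$ is such that

$$P\big(d(X_1, X_2) \ge \max\{d(X_1, \chi), d(X_2, \chi)\} + \delta\big) \le MD(\xi, P) \le P\big(d(X_1, X_2) \ge \max\{d(X_1, \chi), d(X_2, \chi)\} - \delta\big).$$

Now, see that

$$F_\chi(t) = P\big(\max\{d(X_1, \chi), d(X_2, \chi)\} - d(X_1, X_2) \le t\big)$$

is a cumulative distribution function assumed to be continuous at $t = 0$ by (3.1). This means that, for any $\epsilon > 0$, we can find a $\delta > 0$ such that $F_\chi(\delta) - F_\chi(-\delta) < \epsilon$. As the bounds above read $F_\chi(-\delta) \le MD(\xi, P) \le F_\chi(\delta)$, and $F_\chi(-\delta) \le MD(\chi, P) \le F_\chi(\delta)$ as well, the claim follows.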

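3.4 Continuity in $P$

  ($\mathcal{P}_4$) For any $P \in \mathcal{P}(\mathcal{M})$ such that (3.1) holds, and $\epsilon > 0$, there exists $\delta > 0$ such that $|MD(\chi, Q) - MD(\chi, P)| < \epsilon$ $P$-almost surely for all $Q \in \mathcal{P}(\mathcal{M})$ with $\rho(Q, P) < \delta$ $P$-almost surely, where $\rho$ metricises the topology of weak convergence on $\mathcal{P}(\mathcal{M})$.

This follows directly from classical results on convergence of probability measures on separable metric spaces, e.g. Dudley (2002, Theorem 11.1.1), as $MD(\chi, P)$ and $MD(\chi, Q)$ in (2.1) are simple probability statements on elements of $\mathcal{A} \otimes \mathcal{A}$. Note that (3.1) guarantees that the ‘lens’ (2.2) is a continuity set in the sense of Dudley (2002, Section 11.1).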

3.5 Further comments

Zuo and Serfling (2000) listed two more desirable properties for a depth measure on $\mathbb{R}^d$: ‘Maximality at centre’ and ‘Monotonicity relative to deepest point’ (their properties P2 and P3). Similar features are difficult to investigate here for $MD$ without giving a stronger structure to $(\mathcal{M}, d)$, such as some sort of convexity, or requiring $d$ to satisfy a parallelogram inequality, for example. As an illustration, Zuo and Serfling (2000)’s P2 ‘Maximality at centre’ requires the depth to be maximum at a uniquely defined ‘centre’ with respect to some notion of symmetry. Without assuming a stronger structure for $(\mathcal{M}, d)$, even the very definition of symmetry in $\mathcal{M}$ is unclear. As our aim here is to stay as flexible as possible with the proposed metric depth, we do not investigate further in that direction. Those properties of $MD$ may (or may not) be established in specific applications when $\mathcal{M}$ and $d$ are precisely defined, though.

On a side note, even if Liu and Modarres (2011) supposedly showed (their Theorem 6) that their Euclidean ‘lens depth’ in $\mathbb{R}^d$, of which (2.1) can be thought of as an extension, satisfies ‘Maximality at centre’ for centrally symmetric distributions, their proof appears wrong, as pointed out in Kleindessner and von Luxburg (2017). Yet, Kleindessner and von Luxburg (2017) conceded that they believe that the statement is true. In the Appendix, we give three counter-examples, establishing that the statement is actually not true: Liu and Modarres (2011)’s ‘lens depth’ generally satisfies neither ‘Maximality at centre’ nor ‘Monotonicity relative to deepest point’ (Zuo and Serfling (2000)’s P2 and P3) for centrally symmetric distributions on $\mathbb{R}^d$. (Kleindessner and von Luxburg (2017) noticed that analogous proofs for the spherical depth (Elmore et al, 2006) and the $\beta$-skeleton depth (Yang and Modarres, 2017) are mistaken as well. Furthermore, we have found that the proof of a similar property for the band depth given in López-Pintado and Romo (2009, Theorem 1(2)) is likewise erroneous.)

A last important point is the following. Suppose that the open balls $B(x, r)$, with centre $x$ and radius $r$, are convex in $\mathcal{M}$. Then it can easily be checked that, for any non-degenerate distribution $P$ (i.e., not a unit point mass at some $\chi \in \mathcal{M}$), $MD(\cdot, P)$ cannot be degenerate in the sense that $MD(\chi, P) = 0$ for all $\chi \in \mathcal{M}$. Indeed, by convexity, the intersection $B(x_1, d(x_1, x_2)) \cap B(x_2, d(x_1, x_2))$ is non-empty as soon as $x_1 \neq x_2$, so there always exists some $\chi$ which gets a positive depth by (2.1). It is known that some instances of statistical depth admit such a degenerate behaviour. For instance, that is the case of López-Pintado and Romo (2009, 2011)’s band and half-region depths for a wide class of distributions on common functional spaces (Chakraborty and Chaudhuri, 2014, Theorems 3 and 4).

4 Empirical metric depth

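Assume now that we have a random sample $\{X_1, \ldots, X_n\}$ of $n$ realisations of the object $X$ in $\mathcal{M}$. Then the depth of some point $\chi \in \mathcal{M}$ with respect to $P$ must actually be estimated. The empirical analogue of (2.1) is naturally $MD(\chi, \hat{P}_n)$, where $\hat{P}_n$ is the empirical measure of the sample, i.e., the collection of $(1/n)$-weighted point masses at the observed $X_i$. This yields

$$MD(\chi, \hat{P}_n) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \mathbb{1}\{d(X_i, X_j) > \max\{d(X_i, \chi), d(X_j, \chi)\}\}. \tag{4.1}$$

Obviously, $\hat{P}_n$ converges weakly to $P$ almost surely as $n \to \infty$, which guarantees under (3.1) the strong pointwise consistency of the estimator $MD(\chi, \hat{P}_n)$, that is

$$MD(\chi, \hat{P}_n) \to MD(\chi, P) \quad \text{almost surely as } n \to \infty \tag{4.2}$$

for all $\chi \in \mathcal{M}$. This easily follows from Property ($\mathcal{P}_4$).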
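The empirical depth only requires a function returning the distance between two objects, so it can be sketched in a few lines. The code below is an illustrative sketch rather than the authors' implementation: the function names and the toy samples are hypothetical, and the normalisation by the number of pairs is an assumption consistent with the $U$-statistic structure of (4.1). The same function is applied to numbers with the absolute-value distance and to short words with the Hamming distance.

```python
from itertools import combinations


def empirical_metric_depth(chi, sample, dist):
    """Share of sample pairs (x_i, x_j) with d(x_i, x_j) > max(d(x_i, chi), d(x_j, chi))."""
    d_chi = [dist(x, chi) for x in sample]
    pairs = list(combinations(range(len(sample)), 2))
    hits = sum(
        dist(sample[i], sample[j]) > max(d_chi[i], d_chi[j]) for i, j in pairs
    )
    return hits / len(pairs)


def abs_dist(x, y):
    return abs(x - y)


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(ca != cb for ca, cb in zip(a, b))


# Toy numeric sample on the real line.
numbers = [0.0, 1.0, 2.0, 3.0, 4.0]
number_depths = [empirical_metric_depth(x, numbers, abs_dist) for x in numbers]
print(number_depths)  # [0.0, 0.3, 0.4, 0.3, 0.0]

# Toy non-numeric sample: three-letter words compared letter by letter.
words = ["cat", "cot", "cut", "dog", "dot"]
word_depths = {w: empirical_metric_depth(w, words, hamming) for w in words}
print(word_depths["cot"], word_depths["dog"])  # 0.4 0.0
```

On the real line the sample median gets the largest depth and the extremes get depth 0; among the words, "cot" sits between the other spellings while "dog" is flagged as the most outlying. With a discrete distance such as Hamming's, ties are frequent, so condition (3.1) should not be expected to hold.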

Remark 4.1.

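Universal uniform strong consistency. A desirable and much stronger result than (4.2) is the universal uniform strong consistency of the depth measure on a subset $E$ of $\mathcal{M}$. This is defined as

$$\sup_{\chi \in E} \big| MD(\chi, \hat{P}_n) - MD(\chi, P) \big| \to 0 \quad \text{almost surely as } n \to \infty$$

for any $P \in \mathcal{P}(\mathcal{M})$. Note that, from (2.2), (4.1) can also be written

$$MD(\chi, \hat{P}_n) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \mathbb{1}\{\chi \in L(X_i, X_j)\}. \tag{4.3}$$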

From Gijbels and Nagy (2015)’s analysis of López-Pintado and Romo (2009)’s band depth, though, one can understand how the non-continuous indicator function $\mathbb{1}\{\cdot\}$ in this expression may prevent $MD(\chi, \hat{P}_n)$ from being universally strongly consistent, even on small and well-behaved subsets $E$. However, Gijbels and Nagy (2015) showed how substituting $\mathbb{1}\{\cdot\}$ with some smoothed version of it fixes such issues. Specifically, define an adjustment function $w: [0, \infty) \to [0, 1]$ such that $w$ is non-increasing on $[0, \infty)$, $w(0) = 1$ and $\lim_{t \to \infty} w(t) = 0$ (Gijbels and Nagy, 2015, Definition 3). Now replace $\mathbb{1}\{\chi \in L(X_i, X_j)\}$ in (4.3) by $w(d(\chi, L(X_i, X_j)))$, where the distance between $\chi$ and a subset $A$ of $\mathcal{M}$ is defined as $d(\chi, A) = \inf_{\xi \in A} d(\chi, \xi)$. The choice $w(t) = \mathbb{1}\{t = 0\}$ reduces to the original expression, while a continuous $w$ produces a smoothed version of it. Gijbels and Nagy (2015) showed that indeed, for continuous $w$ as above, the ‘adjusted’ version of the band depth is universally strongly consistent on every equicontinuous subset $E$. We note that this result was derived in Gijbels and Nagy (2015) only for the case of the band depth on the space $\mathcal{C}([0, 1])$ of continuous functions with supremum norm; in particular, its proof involves the Arzelà-Ascoli Theorem, a result specific to $\mathcal{C}([0, 1])$. Although an attempt to extend this result to (4.3) could be pursued, Gijbels and Nagy (2015) admitted that their adjustment is primarily motivated by theoretical considerations and plays very little role in practice. Therefore, we will not consider this any further here.

Finally, the obvious $U$-statistic structure of (4.1) allows us to easily deduce, through an appropriate Central Limit Theorem, the asymptotic normality of $MD(\chi, \hat{P}_n)$, a result that could be used for inference, for instance to build a confidence region for the ‘true’ median element, i.e. the deepest element with respect to the population distribution (Serfling and Wijesuriya, 2017).

5 Data examples

In this section, we illustrate the usefulness of the proposed metric depth on 5 real data sets: two one-dimensional functional datasets (Sections 5.1 and 5.2), a bidimensional functional dataset (Section 5.3), a symbolic data set (Section 5.4) and a non-numeric data set (text) (Section 5.5).

5.1 Canadian weather data

The Canadian temperature data set is a classical functional data set available from the R package fda. The data give the daily temperature records of 35 Canadian weather stations over a year (365 days, day 1 being 1 January), averaged over 1960 to 1994; see Figure 5.1. First, the depth of the 35 curves with respect to the sample has been computed from the empirical functional metric depth (4.1) with $d$ being the usual $L_2$ distance between two square-integrable functions, i.e. $d(\chi, \xi) = \left(\int (\chi(t) - \xi(t))^2 \, dt\right)^{1/2}$. The 5 deepest and 5 least deep curves are shown in Figure 5.2. The suggested depth measure identifies the Sherbrooke (deepest curve), Thunder Bay, Fredericton, Quebec and Calgary stations as the most representative of a median Canadian weather, in terms of temperature. On the other hand, the most outlying curves are seen to be Resolute (least deep curve), Victoria, Vancouver, Inuvik and Iqaluit. It is visually obvious that those curves are quite different from the others: Resolute, Inuvik and Iqaluit are Arctic stations, with much colder temperatures across the year than the other stations, while Vancouver and Victoria lie on the south Pacific coast of Canada and enjoy much milder winters. We can appreciate that Vancouver and Victoria are ‘shape outliers’, whereas the Arctic stations are ‘location outliers’. Both types are flagged equally easily by the metric depth. This has to be stressed, as some functional depths have been shown to be able to identify one type of outlier but not the other (Serfling and Wijesuriya, 2017).

Figure 5.1: Canadian weather data – average daily temperatures at 35 stations.
Figure 5.2: Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with $d$ the $L_2$ distance.

Of course, the daily average temperature curves are particularly noisy, which could heavily affect the $L_2$-distances computed between pairs of curves, hence the whole calculation of the depths. One can deal with the roughness of those curves in different manners: first, one could use smoothed versions of the initial curves, for instance the monthly average temperatures as in Serfling and Wijesuriya (2017); second, one could use for $d$ a distance less affected by such noise than the $L_2$ one, for instance the supremum ($L_\infty$) distance; finally, one can expand the different curves in a certain basis and focus only on the first terms when assessing the proximity between them. We achieved the latter by expanding each curve in the empirical Principal Components basis (Hall, 2011) and keeping only the first two principal scores: the curves reconstructed from those two components only are indeed smooth approximations to the initial, rough curves. So, each curve is now represented by a point in the 2-dimensional space of the first two Principal Components, and the proximity between two curves is quantified by the $L_2$-distance between the corresponding two points. In effect, this defines a pseudo-distance between the initial curves, see Ferraty and Vieu (2006, Section 3.4.1). The depths assigned to each station according to these 4 methods are shown in Table 5.1. The four depth measures are in very good agreement, essentially identifying the same central and outlying curves. This suggests that the depth measure (2.1) and its empirical version (4.1) are quite robust to any reasonable choice of $d$.

Station       $d_1$  $d_2$  $d_3$  $d_4$    Station       $d_1$  $d_2$  $d_3$  $d_4$
Sherbrooke    .514   .474   .516   .506     Thunder B.    .511   .513   .504   .496
Fredericton   .504   .479   .504   .491     Quebec        .504   .513   .506   .497
Calgary       .492   .375   .492   .503     Bagottville   .482   .457   .484   .476
Edmonton      .474   .459   .472   .494     Arvida        .469   .476   .472   .476
Regina        .429   .435   .427   .420     Charlottvl    .425   .435   .424   .435
Pr.George     .422   .380   .425   .469     Ottawa        .410   .479   .408   .403
Winnipeg      .403   .418   .405   .398     Pr.Albert     .400   .383   .400   .390
Montreal      .383   .457   .380   .375     Halifax       .380   .348   .378   .368
Whitehorse    .371   .378   .373   .395     The Pas       .360   .358   .358   .348
Sydney        .324   .284   .333   .348     Uranium C.    .316   .257   .319   .311
Toronto       .313   .363   .309   .309     Scheffervll   .294   .318   .292   .291
St.Johns      .271   .237   .272   .257     London        .267   .324   .267   .261
Yellowknife   .227   .207   .224   .200     Yarmouth      .208   .195   .208   .205
Dawson        .207   .138   .208   .242     Churchill     .160   .242   .160   .160
Kamloops      .150   .202   .148   .148     Pr.Rupert     .106   .108   .106   .109
Iqaluit       .096   .134   .096   .096     Inuvik        .066   .064   .066   .066
Vancouver     .045   .050   .045   .040     Victoria      .017   .002   .017   .018
Resolute      0      0      0      0
Table 5.1: Canadian weather data – metric depth measures for 4 different (pseudo-)distances $d$: $d_1$: $L_2$ distance; $d_2$: supremum ($L_\infty$) distance; $d_3$: $L_2$ distance on the average monthly temperature curves; $d_4$: $L_2$ distance in the plane of the first two principal components.
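To make the comparison of (pseudo-)distances concrete, here is a hedged sketch of how three of the four choices in Table 5.1 could be computed for curves observed on a common grid. It does not use the actual Canadian weather data: the 35 curves below are synthetic stand-ins (34 noisy seasonal curves plus one much colder ‘location outlier’), and all function names are hypothetical. The depths are the in-sample empirical depths computed from each matrix of pairwise distances.

```python
import numpy as np


def integrate(values, t):
    """Trapezoidal rule along the last axis."""
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * np.diff(t), axis=-1)


def l2_distances(curves, t):
    diff = curves[:, None, :] - curves[None, :, :]
    return np.sqrt(integrate(diff ** 2, t))


def sup_distances(curves):
    diff = curves[:, None, :] - curves[None, :, :]
    return np.max(np.abs(diff), axis=-1)


def pc_score_distances(curves, n_components=2):
    """Euclidean distance between the first principal scores (a pseudo-distance)."""
    centred = curves - curves.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    diff = scores[:, None, :] - scores[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def in_sample_depths(dist_matrix):
    """Empirical depth of each sample element from the pairwise distance matrix."""
    n = dist_matrix.shape[0]
    i, j = np.triu_indices(n, k=1)
    pair_dist = dist_matrix[i, j]
    return np.array([
        np.mean(pair_dist > np.maximum(dist_matrix[i, k], dist_matrix[j, k]))
        for k in range(n)
    ])


# Synthetic stand-in for 35 daily temperature curves (NOT the real data).
rng = np.random.default_rng(1)
days = np.arange(1.0, 366.0)
levels = np.append(rng.uniform(-5, 5, size=34), -40.0)  # last curve: location outlier
curves = (levels[:, None]
          - 15 * np.cos(2 * np.pi * days / 365)
          + rng.normal(scale=1.0, size=(35, 365)))

depths = {
    "L2": in_sample_depths(l2_distances(curves, days)),
    "sup": in_sample_depths(sup_distances(curves)),
    "PC scores": in_sample_depths(pc_score_distances(curves)),
}
for name, values in depths.items():
    print(name, "least deep curve:", int(np.argmin(values)))
```

In this artificial setting the outlying curve receives depth 0 under the $L_2$ and supremum distances, mirroring the behaviour of Resolute in Table 5.1; swapping the distance only requires swapping the function that builds the distance matrix.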

5.2 Lip movement data

Malfait and Ramsay (2003) studied the relationship between lip movement and time of activation of different face muscles, see also Ramsay and Silverman (2002, Chapter 10) and Gervini (2008). The study involved a subject saying the word ‘bob’ 32 times, and the movement of their lower lip was recorded each time. Those trajectories are shown in Figure 5.3, and all share the same pattern: a first peak corresponding to the first /b/, then a plateau corresponding to the /o/ and finally a second peak for the second /b/. These functions being very smooth (actually, they are smoothed versions of raw data not publicly available), it seems natural to use again the classical $L_2$ distance for assessing their relative proximity. Hence, the respective depth of each curve with respect to the sample was obtained by (4.1) with $d$ the $L_2$ distance. The 5 deepest and 5 least deep curves are shown in the top row of Figure 5.4. In particular, this depth identifies as outliers the three curves showing a second peak at a much later time than for the rest of the curves, which were already singled out by Gervini (2008). The remaining two outlying curves show two peaks of lower amplitude than the others, with a second peak occurring earlier than in the bulk of the sample.

Figure 5.3: Lip movement data.
Figure 5.4: Top and middle row: Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with (i) $d$ the $L_2$ distance between the curves (top row) and (ii) $d$ the $L_2$ distance between the second derivatives of the curves (middle row). Bottom row: Five deepest acceleration curves (left; the darkest curves are the deepest) and five least deep acceleration curves (right; the lightest curves are the least deep).

Now, Malfait and Ramsay (2003), in their original study, were more interested in the acceleration of the lip during the process than in the lip motion itself. The study aimed at explaining time of activation of face muscles, and the acceleration reflects the force applied to tissue by muscle contraction. Hence, in this application, it may be worth contrasting the lip trajectories in terms of their corresponding accelerations, that is, comparing the second derivatives of the position curves. The $L_2$ distance between the second derivatives of the curves is naturally a pseudo-distance between the initial curves (Ferraty and Vieu, 2006, Section 3.4.3), which can be used in (4.1). The 5 deepest and 5 least deep curves, according to (4.1) based on the ‘acceleration’ pseudo-distance, are shown in the middle row of Figure 5.4 and differ from those in the first row of Figure 5.4. Naturally, the focus here is no longer on the exact position of the curves, but rather on the more fundamental underlying dynamics. For instance, the 5 deepest curves show a first peak of distinctly different heights, but in terms of their second derivatives, they are in fact quite similar and representative of the sample (bottom row of Figure 5.4), and that is what matters in Malfait and Ramsay (2003)’s study. As argued in Section 1, the flexibility of (2.1) in terms of the choice of $d$ allows the analyst to tailor the depth measure to the data at hand and the goal of the analysis.
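The ‘acceleration’ pseudo-distance can be sketched as follows for curves sampled on a common time grid. This is an illustration under stated assumptions, not the authors' code: the second derivatives are approximated by finite differences via `numpy.gradient` (the original curves may well have been differentiated from a smooth basis expansion instead), and the three trajectories are hypothetical.

```python
import numpy as np


def second_derivative(curve, t):
    """Finite-difference approximation of the second derivative."""
    return np.gradient(np.gradient(curve, t), t)


def l2_norm(values, t):
    """L2 norm on the grid t, using the trapezoidal rule."""
    sq = values ** 2
    return float(np.sqrt(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(t))))


def position_distance(f, g, t):
    return l2_norm(f - g, t)


def acceleration_pseudo_distance(f, g, t):
    return l2_norm(second_derivative(f, t) - second_derivative(g, t), t)


t = np.linspace(0.0, 1.0, 201)
f = np.sin(2 * np.pi * t)   # hypothetical trajectory
g = f + 2.0 + 3.0 * t       # same dynamics, shifted and tilted
h = np.sin(4 * np.pi * t)   # genuinely different dynamics

d_pos_fg = position_distance(f, g, t)
d_acc_fg = acceleration_pseudo_distance(f, g, t)
d_acc_fh = acceleration_pseudo_distance(f, h, t)
print(d_pos_fg, d_acc_fg, d_acc_fh)
```

The curves `f` and `g` are far apart in position but essentially indistinguishable in acceleration, which is exactly why this proximity measure is a pseudo-distance rather than a distance: it can vanish for two different curves.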

5.3 Handwriting data

The ‘handwriting’ data set consists of twenty replications of the printing of the three letters ‘fda’ by a single individual. The position of the tip of the pen has been sampled 200 times per second. The data, available in the R package fda, have already been pre-processed so that the printed characters are scaled and oriented appropriately, see Figure 5.5.

Figure 5.5: Handwriting data.

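These data are essentially bivariate functional data. Indeed, each instance $\chi$ of the word ‘fda’ arises through the simultaneous realisation of two components $(x(t), y(t))$, where $x(t)$ and $y(t)$ give the position along the horizontal axis and the vertical axis, respectively, of the pen at time $t$. This is illustrated for one instance of ‘fda’ in Figure 5.6. Hence, an appropriate functional metric space here could be $(\mathcal{M}, d)$ with $\mathcal{M} = L_2(\mathcal{T}) \times L_2(\mathcal{T})$, $\mathcal{T}$ being the time interval on which the position of the pen was recorded, and $d$ being the Euclidean distance on $\mathcal{M}$ whose square is defined by

$$d^2(\chi_1, \chi_2) = \int_{\mathcal{T}} (x_1(t) - x_2(t))^2 \, dt + \int_{\mathcal{T}} (y_1(t) - y_2(t))^2 \, dt. \tag{5.1}$$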

This distance can be used directly in (4.1) to identify the 5 deepest and 5 least deep instances of ‘fda’, see Figure 5.7. The bivariate nature of the data at hand does not cause any particular complication and the definition (2.1) need not be re-adapted to this case. Again, the so-defined depth only focuses on the ‘drawings’ fda themselves, and identifies the deepest instances. However, it was argued in the related literature that the tangential acceleration of the pen during the process was also a key element to analyse for understanding the writing dynamics, for instance for discriminating between genuine handwritings and forgeries (Geenens, 2011a, b). As in Subsection 5.2, one could therefore use (4.1) with a pseudo-distance assessing the proximity between two instances of fda through their tangential acceleration curves only, if that was to be the focus of the analysis.
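Both proximity measures mentioned above can be sketched for planar curves sampled on a common time grid. The code is illustrative only: it assumes that ‘tangential acceleration’ means the time derivative of the pen speed, approximates derivatives by finite differences, and uses a hypothetical pen trajectory (a loop) together with a copy of it drawn one unit to the right.

```python
import numpy as np


def integrate(values, t):
    """Trapezoidal rule on the grid t."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(t)))


def bivariate_l2(a, b, t):
    """Distance (5.1); a and b have shape (len(t), 2) with columns x(t), y(t)."""
    return np.sqrt(integrate(np.sum((a - b) ** 2, axis=1), t))


def tangential_acceleration(a, t):
    """Time derivative of the pen speed (assumed definition, finite differences)."""
    speed = np.hypot(np.gradient(a[:, 0], t), np.gradient(a[:, 1], t))
    return np.gradient(speed, t)


def tangential_pseudo_distance(a, b, t):
    diff = tangential_acceleration(a, t) - tangential_acceleration(b, t)
    return np.sqrt(integrate(diff ** 2, t))


t = np.linspace(0.0, 1.0, 201)
loop = np.column_stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
shifted = loop + np.array([1.0, 0.0])  # same pen movement, one unit to the right

d_position = bivariate_l2(loop, shifted, t)
d_dynamics = tangential_pseudo_distance(loop, shifted, t)
print(d_position, d_dynamics)
```

The two drawings are at distance 1 in the sense of (5.1) but at (numerically) zero pseudo-distance in terms of tangential acceleration, so a depth built on the latter would rank instances by writing dynamics rather than by where the letters sit on the page.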

Figure 5.6: One instance of the handwriting data, and its $x$- and $y$-components.
Figure 5.7: Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with $d$ being the distance (5.1) on $L_2(\mathcal{T}) \times L_2(\mathcal{T})$.

5.4 Age distribution in European countries

Symbolic Data Analysis (SDA) has recently grown into a popular research field in statistics (Billard and Diday, 2003, 2007). Indeed, the intractably large ‘Big Data’ sets often need to be summarised so that the resulting summary datasets are of a manageable size, and so-called ‘symbolic data’ typically arise from such a process. No longer formatted as single values like classical data, they are meant to be ‘aggregated’ variables, typically represented by lists, intervals, histograms, distributions and the like. In this section we take a closer look at a ‘distribution-valued’ symbolic data set. Specifically, we analyse the distribution of the age of the population of the 44 European countries (see Table 5.2).

The 2017 data were obtained from the US Census Bureau (www.census.gov/population/international/data/). Typically, the population distribution for a given country is presented under the form of a population pyramid (that is, a histogram), from which a proper distribution function for population age can easily be extracted (Kosmelj and Billard, 2011). Hence, each country (here: ‘individual’, also called ‘concept’ in the SDA literature) is characterised by a distribution. Figure 5.8 displays the sample of age distributions. Here we will use the suggested metric depth to analyse which countries are most representative of the ‘European’ age distribution, and which countries can be regarded as ‘outliers’ in that respect.

Country          Depth    Country          Depth    Country        Depth    Country       Depth
Switzerland      .59      Liechtenstein    .52      Hungary        .519     Malta         .508
Czech Republic   .503     Ukraine          .498     Netherlands    .488     Croatia       .478
Portugal         .477     Poland           .474     Belgium        .451     Serbia        .449
Denmark          .434     Romania          .422     UK             .421     Belarus       .406
Spain            .406     Estonia          .401     Bulgaria       .36      Montenegro    .36
Slovakia         .349     Latvia           .348     Sweden         .322     Lithuania     .319
Luxembourg       .318     Russia           .292     Norway         .28      Austria       .277
Bosnia-Herz.     .276     France           .27      Finland        .261     San Marino    .228
Andorra          .221     Macedonia        .201     Slovenia       .178     Moldova       .151
Greece           .141     Iceland          .14      Ireland        .088     Italy         .087
Albania          .044     Germany          .044     Kosovo         0        Monaco        0
Table 5.2: Age distribution in European countries – metric depth for the age distributions of the 44 European countries, based on the Wasserstein distance.
Figure 5.8: Age distribution in European countries

The data here being distribution functions of nonnegative variables, the underlying space can be identified with a space of distribution functions supported on $[0,\infty)$, i.e. a space of nondecreasing càdlàg functions $F$ with $F(x) = 0$ for $x < 0$ and $\lim_{x \to \infty} F(x) = 1$, equipped with an appropriate distance. The Wasserstein distance has proved useful for a wide range of problems explicitly involving distribution functions (Rachev, 1984; Panaretos and Zemel, 2020), hence seems a natural choice in this setting as well. For some $p \geq 1$, the Wasserstein distance between two distributions $F$ and $G$ whose $p$th moments exist is defined as

$$ W_p(F, G) = \inf \left( E|X - Y|^p \right)^{1/p}, $$

where the infimum is taken over the set of all joint bivariate distributions of $(X, Y)$ whose marginal distributions are $F$ and $G$ respectively. Properties of this distance are described in Major (1978) and Bickel and Freedman (1981). In particular, it is known that $W_p$ is essentially the usual $L_p$-distance between the quantile functions $F^{-1}$ and $G^{-1}$ over $[0,1]$. Also, it is known that convergence in the Wasserstein distance is equivalent to convergence in distribution together with convergence of the first $p$ moments. Hence, the distance quantifies the proximity between two distributions through both their general appearance and the values of their moments. In what follows, we take $p = 2$, hence we consider functional data in the space of all probability distribution functions with finite second moment, equipped with $W_2$.
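The quantile-function characterisation above lends itself to a direct numerical approximation. The following is only a minimal sketch, not the computation used for Table 5.2: it assumes each age distribution is available as a histogram (bin edges and strictly positive counts), treats ages as uniformly spread within each bin, and approximates the integral over (0, 1) by a midpoint rule. The function names and the toy pyramids are hypothetical.

```python
import numpy as np

def quantile_function(bin_edges, counts, u):
    """Quantile function of a histogram, assuming uniform mass within each bin."""
    edges = np.asarray(bin_edges, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if edges.size != counts.size + 1 or np.any(counts <= 0):
        raise ValueError("need len(edges) == len(counts) + 1 and positive counts")
    # Piecewise-linear CDF evaluated at the bin edges (strictly increasing here).
    cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    # Inverting a piecewise-linear CDF is interpolation with the axes swapped.
    return np.interp(u, cdf, edges)

def wasserstein_p(edges_f, counts_f, edges_g, counts_g, p=2, n_grid=10_000):
    """Approximate the L_p distance between two quantile functions on (0, 1)."""
    u = (np.arange(n_grid) + 0.5) / n_grid  # midpoints of a regular grid
    qf = quantile_function(edges_f, counts_f, u)
    qg = quantile_function(edges_g, counts_g, u)
    return float(np.mean(np.abs(qf - qg) ** p) ** (1.0 / p))

# Hypothetical toy pyramids (not census data): a younger and an older population.
edges = [0, 20, 40, 60, 80, 100]
young = [30, 30, 20, 15, 5]
old = [15, 20, 25, 25, 15]
print(wasserstein_p(edges, young, edges, old))
```

Applying such a function to every pair of countries would give the distance matrix on which a depth computation can then be based.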

The flexibility of (2.1) allows us to base the metric depth on the Wasserstein distance, and so to define a depth measure specific to distribution functions without any difficulty. The ‘Wasserstein-depths’ of the 44 countries are given in Table 5.2. The 5 deepest and 5 least deep age distributions are shown in Figure 5.9. The deepest distribution, hence the most representative of the age distributions in Europe, appears to be that of Switzerland, a country located at the very heart of Europe, in between the Western and Eastern countries and in between the Northern and Southern countries, at the meeting point between the ‘Germanic’ world (Germany, Austria) and the ‘Latin’ world (France, Italy). From that perspective, Switzerland can be regarded as truly representative of a ‘median’ European country in many respects. On the other hand, the Wasserstein-metric depth is null for Kosovo and Monaco, and indeed the distributions for those two countries clearly lie outside the bundle of the other distributions. Monaco is a mild-climate microstate (and, incidentally, a tax haven) which attracts a large number of wealthy retirees from all over the continent (if not the world); hence its population is overall much older than that of other countries and its age distribution function lies below the others. Monaco aside, Germany and Italy have the oldest populations in Europe overall. Kosovo was until recently at the heart of an armed conflict in the Balkans, which explains the low proportion of older people in that country and the position of its age distribution above all the others. To some extent, this also explains the outlyingness of Albania’s curve. In any case, this example illustrates that one can readily define a depth measure tailored to distribution curves, which paves the way for developing rank-like procedures in Symbolic Data Analysis as well.

Figure 5.9: Five deepest age distributions (left; the darkest curves are the deepest) and five least deep age distributions (right; the lightest curves are the least deep) according to (4.1), with the distance being the Wasserstein distance between distributions.

5.5 Authorship attribution by intertextual distance

Author identification for an unknown or doubtful text is one of the oldest statistical problems applied to literature. Here the capability of the proposed metric depth is illustrated within that framework. William Shakespeare and Thomas Middleton were contemporaries (late 16th to early 17th centuries), and their oeuvres are often compared. To that end, Merriam (2003) examined 9 Middleton plays and 37 Shakespeare texts, and computed between each pair of them the so-called ‘inter-textual distance’ proposed by Labbé and Labbé (2001). (Footnote 4: It is not the purpose of this paper to describe how this index is computed or what it represents; neither do we imply that it is the panacea for the considered problem; for that matter, it has been criticised (Viprey and Ledoux, 2006). Here we use it for illustrative purposes only.) Although the entities of interest are here purely non-numerical (famous literary pieces), the obtained matrix of distances allows us to outline the relative position of each text, and this is essentially all that is needed for the metric depth to come into play.

As an example, Table 5.3 (recovered from Appendix 2 in Merriam (2003)) reports the ‘inter-textual’ distances between the 9 essential plays of Middleton. Computing the empirical metric depth (4.3) of each entry in the ‘Middleton sample’ reveals that the two deepest observations are ‘More Dissemblers Besides Women’ and ‘A Trick to Catch the Old One’ (both have depth 0.4167). They may, therefore, be considered as the most typical Middleton plays (as long as the ‘inter-textual’ distance is the relevant metric).
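Since only a matrix of pairwise distances is needed, the computation is short. The following is a minimal sketch, assuming that the empirical depth (4.3) is the proportion of sample pairs $(X_i, X_j)$, $i < j$, for which the point lies in both open balls of radius $d(X_i, X_j)$ centred at $X_i$ and $X_j$, that is, for which $d(X_i, X_j)$ exceeds both distances from the point to $X_i$ and to $X_j$. Under that assumption, a depth of 0.4167 in a sample of 9 items would correspond to 15 of the 36 pairs. The function names and the toy data (five points on a line, not the Middleton distances) are hypothetical.

```python
import numpy as np

def empirical_metric_depth(dist_to_sample, sample_dist):
    """Proportion of sample pairs (i < j) whose 'lens' contains the point.

    dist_to_sample: length-n vector of distances from the point to the sample.
    sample_dist:    n x n symmetric matrix of distances within the sample.
    """
    d = np.asarray(dist_to_sample, dtype=float)
    D = np.asarray(sample_dist, dtype=float)
    i, j = np.triu_indices(D.shape[0], k=1)  # all pairs with i < j
    # The point is 'between' X_i and X_j when it is strictly closer to each
    # of them than they are to one another.
    inside = D[i, j] > np.maximum(d[i], d[j])
    return float(inside.mean())

def in_sample_depths(sample_dist):
    """Depth of every sample item with respect to the whole sample."""
    D = np.asarray(sample_dist, dtype=float)
    return np.array([empirical_metric_depth(D[k], D) for k in range(D.shape[0])])

# Toy example: five points on the real line with the absolute-difference metric.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
D_toy = np.abs(x[:, None] - x[None, :])
print(in_sample_depths(D_toy))  # depths 0, 0.3, 0.4, 0.3, 0: the middle point is deepest
```

Pairs involving the point itself never count, because the inequality is strict; this is why the two extreme points of the toy sample receive depth 0.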

Table 5.3: Matrix of ‘inter-textual’ distances between 9 essential plays by Thomas Middleton: Phn: ‘The Phoenix’; Mad: ‘A Mad World, My Masters’; Trk: ‘A Trick to Catch the Old One’; Pur: ‘The Puritan’; Alm: ‘The Almanac’; CMC: ‘A Chaste Maid in Cheapside’; Dis: ‘More Dissemblers Besides Women’; Val: ‘The Nice Valour’; WBW: ‘Women Beware Women’.

Focusing this time on the 37 Shakespeare texts only, ‘Antony and Cleopatra’ is identified as Shakespeare’s most typical text, i.e., the deepest in the considered sample (depth: 0.5255); see Table 5.4 (left column). The next most representative Shakespeare plays are ‘The Tempest’ (0.5135), ‘Othello’ (0.5030) and ‘Romeo and Juliet’ (0.5015). The most outlying piece of work is the verse part of ‘Henry V’ (depth: 0), which tends to support a conjecture held by many experts on Shakespeare’s oeuvre: the verse part of ‘Henry V’ was not written by Shakespeare himself, but by Christopher Marlowe (Merriam, 2002).

Now, if we compute the metric depth of the 9 Middleton plays with respect to the Shakespeare sample, all receive depth 0: all are ‘outlying’ in Shakespeare’s oeuvre. This clearly indicates that Middleton’s work cannot be confused with Shakespeare’s, and it should be easy to assign a new piece of text to one or the other based on the metric depth. Further, it is interesting to analyse the depth of each text in a combined sample made up of the works of both Middleton and Shakespeare. In particular, some of Shakespeare’s texts which have a low depth in the ‘Shakespeare only’ sample see their depth increase substantially in the combined sample. This indicates that these pieces may have a strong Middleton flavour, to some extent. This hypothesis is confirmed for at least one of those plays: ‘Timon of Athens’ sees its depth increase from 0.1141 to 0.3971 if one includes Middleton’s works in the reference sample; and indeed, extensive research on the topic has provided ample evidence that Middleton wrote approximately one third of that play (Taylor, 1987).

Note that computing and comparing the depth of certain observations in two different samples is in the spirit of the DD-plot and the DD-classifier proposed by Li et al (2012). These procedures can naturally be used in conjunction with the metric depth, enabling similarly powerful depth-based analyses in abstract metric spaces.
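To make the idea concrete, the sketch below computes the pair of depths of one item with respect to two reference samples (its coordinates in a DD-plot) from a single full distance matrix, and then assigns the item to the sample in which it is deeper. This maximum-depth rule is a deliberate simplification: it corresponds to separating the DD-plot along its diagonal, not to the fitted separating curve of the DD-classifier of Li et al (2012). The empirical depth is again assumed to be the proportion of sample pairs whose two open balls contain the item; all names and the toy data are hypothetical.

```python
import numpy as np

def metric_depth(dist_to_sample, sample_dist):
    """Proportion of sample pairs (i < j) with d(X_i, X_j) > max of distances to the point."""
    d = np.asarray(dist_to_sample, dtype=float)
    D = np.asarray(sample_dist, dtype=float)
    i, j = np.triu_indices(D.shape[0], k=1)
    return float(np.mean(D[i, j] > np.maximum(d[i], d[j])))

def dd_coordinates(full_dist, idx_a, idx_b, k):
    """Depths of item k with respect to samples A and B (one point of a DD-plot)."""
    F = np.asarray(full_dist, dtype=float)
    depth_a = metric_depth(F[k, idx_a], F[np.ix_(idx_a, idx_a)])
    depth_b = metric_depth(F[k, idx_b], F[np.ix_(idx_b, idx_b)])
    return depth_a, depth_b

def max_depth_label(depth_a, depth_b):
    """Simplified rule: assign to the sample in which the item is deeper."""
    if depth_a > depth_b:
        return "A"
    if depth_b > depth_a:
        return "B"
    return "tie"

# Toy example on the real line: two well-separated samples and one new item.
pts = np.array([0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 2.5])
full = np.abs(pts[:, None] - pts[None, :])
idx_a, idx_b, new_item = [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], 10

da, db = dd_coordinates(full, idx_a, idx_b, new_item)
print(da, db, max_depth_label(da, db))  # 0.6 0.0 A
```

The toy item has depth 0 in the sample it does not belong to, the same pattern as the Middleton plays in the Shakespeare sample.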

Play  Depth (Shakespeare only)  Depth (combined)
The Two Gentlemen of Verona 0.3784 0.5072
The Taming of the Shrew 0.2192 0.4396
Henry VI - Part II 0.2508 0.2097
Henry VI - Part III 0.0526 0.0425
Henry VI - Part I 0.1532 0.1227
Titus Andronicus 0.1937 0.1594
Richard III 0.4760 0.4570
The Comedy of Errors 0.4595 0.5353
Love’s Labour’s Lost 0.4910 0.4889
A Midsummer Night’s Dream 0.3859 0.3691
Romeo and Juliet 0.5015 0.5227
Richard II 0.1006 0.0841
King John 0.2267 0.2048
The Merchant of Venice 0.4189 0.5179
Henry IV, Part I 0.4129 0.4058
The Merry Wives of Windsor 0.3829 0.4483
Henry IV, Part II 0.3619 0.4744
Much Ado about Nothing 0.2252 0.4444
Henry V (prose part) 0.3078 0.2870
Henry V (verse part) 0.0000 0.0000
Julius Caesar 0.4294 0.3990
As You Like It 0.2763 0.4792
Hamlet 0.4715 0.4473
Twelfth Night 0.0075 0.3488
Troilus and Cressida 0.3333 0.2850
Measure for Measure 0.3378 0.5063
Othello 0.5030 0.5585
All’s Well that Ends Well 0.0450 0.3710
Timon of Athens 0.1141 0.3971
King Lear 0.4640 0.5478
Macbeth 0.3649 0.3121
Antony and Cleopatra 0.5255 0.5401
Coriolanus 0.4655 0.4309
The Winter’s Tale 0.2462 0.4261
Cymbeline 0.1607 0.4184
The Tempest 0.5135 0.5246
Henry VIII 0.4099 0.5024
Table 5.4: 37 Shakespeare plays (shown in chronological order): empirical metric depth in the sample of Shakespeare’s plays only (left column) and in the combined sample of Shakespeare’s and Middleton’s works (right column).

6 Conclusion

In this paper, we have proposed a new statistical depth function, called ‘metric depth’, defined in an abstract metric space. It is explicitly constructed on a distance $d$ that must be chosen by the analyst, which allows them to tailor the depth to the data at hand and to the ultimate goal of the analysis. This offers an unmatched flexibility in the range of problems and applications that can be addressed using the said depth measure. The usefulness of the metric depth has been illustrated on several real data sets, including one in the emergent field of Symbolic Data Analysis and an application in text mining (authorship attribution). Rejuvenating an old idea of Bartoszynski et al (1997), its definition is very intuitive: the depth of a functional point $\chi$ with respect to a distribution $P$ is the probability of finding it ‘between’ two functional objects $X_1$ and $X_2$ randomly generated from $P$, ‘between’ meaning here that $\chi$ belongs to the intersection of the two open $d$-balls $B_d(X_1, d(X_1, X_2))$ and $B_d(X_2, d(X_1, X_2))$. This definition is natural and enjoys many pleasant properties.

Acknowledgements

Gery Geenens’ research was supported by a Faculty Research Grant from the Faculty of Science, UNSW Sydney, Australia. Alicia Nieto-Reyes’ research was funded by the Spanish Ministerio de Ciencia, Innovación y Universidades grant number MTM2017-86061-C2-2-P.

References

  • Bartoszynski et al (1997) Bartoszynski, R., Pearl, D.K. and Lawrence, J. (1997), A multidimensional goodness-of-fit test based on interpoint distances, J. Amer. Statist. Assoc., 92, 577-586.
  • Bickel and Freedman (1981) Bickel, P.J. and Freedman, D.A. (1981), Some asymptotic theory for the bootstrap, Ann. Statist., 9, 1196-1217.
  • Billard and Diday (2003) Billard, L. and Diday, E. (2003), From the statistics of data to the statistics of knowledge: Symbolic Data Analysis, J. Amer. Statist. Assoc., 98, 470-487.
  • Billard and Diday (2007) Billard, L. and Diday, E. (2007), Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley Series in Computational Statistics, Wiley.
  • Bray and Curtis (1957) Bray, J.R. and Curtis, J.T. (1957), An ordination of the upland forest communities of southern Wisconsin, Ecological Monographs, 27, 325-349.
  • Carrizosa (1996) Carrizosa, E. (1996), A characterization of halfspace depth, J. Multivariate Anal., 58, 21-26.
  • Chakraborty and Chaudhuri (2014) Chakraborty, A. and Chaudhuri, P. (2014), On data depth in infinite dimensional spaces, Ann. Inst. Statist. Math., 66, 303-324.
  • Cholaquidis et al (2020) Cholaquidis, A., Fraiman, R., Gamboa, F. and Moreno, L. (2020), Weighted lens depth: some applications to supervised classification, Manuscript, arXiv:2011.11140.
  • Cuevas et al (2007) Cuevas, A., Febrero, M. and Fraiman, R. (2007), Robust estimation and classification for functional data via projection-based depth notions, Computational Statistics, 22, 481-496.
  • Dudley (2002) Dudley, R.M. (2002), Real analysis and probability, Cambridge studies in Advanced Mathematics, Cambridge University Press.
  • Dutta et al (2011) Dutta, S., Ghosh, A.K. and Chaudhuri, P. (2011), Some intriguing properties of Tukey’s half-space depth, Bernoulli, 17, 1420-1434.
  • Elmore et al (2006) Elmore, R.T., Hettmansperger, T.P. and Xuan, F. (2006), Spherical data depth and a multivariate median, In: Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 87.
  • Ferraty and Vieu (2006) Ferraty, F. and Vieu, P. (2006), Nonparametric Functional Data Analysis, Springer-Verlag, New York.
  • Fraiman and Muniz (2001) Fraiman, R. and Muniz, G. (2001), Trimmed means for functional data, Test, 10, 419-440.
  • Francisci et al (2020) Francisci, G., Nieto-Reyes, A. and Agostinelli, C. (2020), Generalization of the simplicial depth: no vanishment outside the convex hull of the distribution support, manuscript, arXiv:1909.02739v2
  • Geenens (2011a) Geenens, G. (2011a), Curse of dimensionality and related issues in nonparametric functional regression, Stat. Surv., 5, 30-43.
  • Geenens (2011b) Geenens, G. (2011b), A nonparametric functional method for signature recognition, In: Recent Advances in Functional Data Analysis and Related Topics, Ferraty, F. and Vieu, P. (Eds), Physica-Verlag HD, pp. 141-147.
  • Gervini (2008) Gervini, D. (2008), Robust functional estimation using the median and spherical principal components, Biometrika, 95, 587-600.
  • Gibbons and Chakraborti (2010) Gibbons, J. and Chakraborti, S. (2010), Nonparametric Statistical Inference, 5th Edition, Chapman and Hall/CRC.
  • Gijbels and Nagy (2015) Gijbels, I. and Nagy, S. (2015), Consistency of non-integrated depths for functional data, J. Multivariate Anal., 140, 259-282.
  • Hall (2011) Hall, P. (2011), Principal component analysis for functional data: methodology, theory and discussion, In: Oxford handbook on functional data analysis, Ferraty, F. and Romain, Y. (Eds), Oxford University Press, pp. 210-234.
  • Hallin et al (2021) Hallin, M., del Barrio, E., Cuesta-Albertos, J. and Matrán, C. (2021), Distribution and quantile functions, ranks and signs in dimension $d$: a measure transportation approach, Ann. Statist., 49, 1139-1165.
  • Hlubinka et al (2015) Hlubinka, D., Gijbels, I., Omelka, M. and Nagy, S. (2015), Integrated data depth for smooth functions and its applications in supervised classification, Comput. Statist., 30, 1011-1031.
  • Kleindessner and von Luxburg (2017) Kleindessner, M. and von Luxburg, U. (2017), Lens depth function and $k$-relative neighborhood graph: versatile tools for ordinal data analysis, J. Mach. Learn. Res., 18, 1-52.
  • Kosmelj and Billard (2011) Kosmelj, K. and Billard, L. (2011), Clustering of population pyramids using Mallows’ distance, Metodoloski zvezki, 8, 1-15.
  • Labbé and Labbé (2001) Labbé, C. and Labbé, D. (2001), Inter-textual distance and authorship attribution: Corneille and Molière, J. Quant. Linguist., 8, 213-231.
  • Li and Liu (2004) Li, J. and Liu, R.Y. (2004), New nonparametric tests of multivariate locations and scales using data depth, Statist. Sci., 19, 686-696.
  • Li and Liu (2008) Li, J. and Liu, R.Y. (2008), Multivariate spacings based on data depth: I. Construction of nonparametric tolerance regions, Ann. Statist., 36, 1299-1323.
  • Li et al (2011) Li, J., Ban, J. and Santiago, L.S. (2011), Nonparametric tests for homogeneity of species assemblages: a data depth approach, Biometrics, 67, 1481-1488.
  • Li et al (2012) Li, J., Cuesta-Albertos, J.A. and Liu, R.Y. (2012), DD-classifier: nonparametric classification procedure based on DD-plot, J. Amer. Statist. Assoc., 107, 737-753.
  • Liu (1990) Liu, R.Y. (1990), On a notion of data depth based on random simplices, Ann. Statist., 18, 405-414.
  • Liu and Modarres (2011) Liu, Z. and Modarres, R. (2011), Lens depth and median, J. Nonparametr. Stat., 23, 1063-1074.
  • Liu et al (1999) Liu, R.Y., Parelius, J.M. and Singh, K. (1999), Multivariate analysis by data depth: descriptive statistics, graphics and inference, Ann. Statist., 27, 783-858.
  • López-Pintado and Romo (2009) López-Pintado, S. and Romo, J. (2009), On the concept of depth for functional data, J. Amer. Statist. Assoc., 104, 718-734.
  • López-Pintado and Romo (2011) López-Pintado, S. and Romo, J. (2011), A half-region depth for functional data, Comput. Statist. Data Anal., 55, 1679-1695.
  • Maa et al (1996) Maa, J.F., Pearl, D.K. and Bartoszynski, R. (1996), Reducing multidimensional two-sample data to one-dimensional interpoint comparisons, Ann. Statist., 24, 1069-1074.
  • Major (1978) Major, P. (1978), On the invariance principle for sums of independent, identically distributed random variables, J. Multivariate Anal., 8, 487-501.
  • Malfait and Ramsay (2003) Malfait, N. and Ramsay, J.O. (2003), The historical functional linear model, Canad. J. Statist., 31, 115-128.
  • Merriam (2002) Merriam, T. (2002), Marlowe in Henry V: A crisis in Shakespearian identity? Oxquarry Books, Oxford.
  • Merriam (2003) Merriam, T. (2003), An application of authorship attribution by intertextual distance in English, Corpus, 2.
  • Molchanov (2005) Molchanov, I. (2005), Theory of Random Sets, Springer-Verlag, New York.
  • Mosler (2013) Mosler, K. (2013), Depth statistics, In: Robustness and Complex Data Structures, Becker C. et al (Eds), Springer Berlin Heidelberg, pp. 17-34.
  • Nieto-Reyes and Battey (2016) Nieto-Reyes, A. and Battey, H. (2016), A topologically valid definition of depth for functional data, Statist. Sci., 31, 61-79.
  • Nieto-Reyes and Battey (2021) Nieto-Reyes, A. and Battey, H. (2021), A topologically valid construction of depth for functional data, J. Multivariate Anal., 184, 104738.
  • Nieto-Reyes et al (2021) Nieto-Reyes, A., Battey, H. and Francisci, G. (2021), Functional Symmetry and Statistical Depth for the Analysis of Movement Patterns in Alzheimer’s Patients. Mathematics, 9, 820.
  • Panaretos and Zemel (2020) Panaretos, V.M. and Zemel, Y. (2020), An invitation to statistics in Wasserstein spaces, Springer.
  • Parthasarathy (1967) Parthasarathy, K.R. (1967), Probability measures on metric spaces, Series: Probability and mathematical statistics, Academic Press.
  • Rachev (1984) Rachev, S. T. (1984), The Monge-Kantorovich problem on mass transfer and its applications in stochastics, Theor. Probab. Appl., 29, 647-676.
  • Ramsay and Silverman (2002) Ramsay, J.O. and Silverman, B.W. (2002), Applied functional data analysis; methods and case study, Springer-Verlag, New York.
  • Ramsay and Silverman (2005) Ramsay, J.O. and Silverman, B.W. (2005), Functional Data Analysis, 2nd Ed., Springer-Verlag, New York.
  • Serfling (2006) Serfling, R. (2006), Depth functions in nonparametric multivariate inference, In: Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications, DIMACS Ser. Discrete Math. Theoret. Comput. Sci., 72, 1-16.
  • Serfling and Wijesuriya (2017) Serfling, R. and Wijesuriya, U. (2017), Depth-based nonparametric description of functional data, with emphasis on use of spatial depth, Comput. Statist. Data Anal., 105, 24-45.
  • Sguera et al (2013) Sguera, C., Galeano, P. and Lillo, R.E. (2014), Spatial depth-based classification for functional data, TEST, 23, 725-750.
  • Snášel et al (2017) Snášel, V., Nowaková, J., Xhafa, F. and Barolli, L. (2017), Geometrical and topological approaches to Big Data, Future Generation Computer Systems, 67, 286-296.
  • Taylor (1987) Taylor, G. (1987), The Canon and Chronology of Shakespeare’s Plays, Clarendon Press, Oxford.
  • Tukey (1975) Tukey, J. (1975), Mathematics and Picturing Data, Proceedings of the 1975 International Congress of Mathematics, 2, 253-531.
  • Viprey and Ledoux (2006) Viprey, J.M. and Ledoux, C.N. (2006), About Labbe’s “intertextual distance”, J. Quant. Linguist., 13, 265-283.
  • Wang and Dong (2020) Wang, J. and Dong, Y. (2020), Measurement of text similarity: a survey, Information, 11, 421.
  • Yang and Modarres (2017) Yang, M. and Modarres, R. (2017), $\beta$-Skeleton depth functions and medians, Comm. Statist. Theory Methods, 47, 5127-5143.
  • Zuo and Serfling (2000) Zuo, Y. and Serfling, R. (2000), General notions of statistical depth functions, Ann. Statist., 28, 461-482.

Appendix A

Here we give three counter-examples illustrating that the Euclidean ‘lens depth’ of Liu and Modarres (2011) does not satisfy properties P2 ‘Maximality at centre’ and P3 ‘Monotonicity relative to deepest point’ of Zuo and Serfling (2000) for centrally symmetric distributions; indeed, these two properties are related. For simplicity, we work in $\mathbb{R}^2$.

Example A.1 (Mixture of two normal distributions).

Let $P$ be a mixture of two bivariate normal distributions with respective means $\mu_1$ and $\mu_2$, identity covariance matrices and equal weights; viz., the density of $P$ is $\frac{1}{2}\phi(x - \mu_1) + \frac{1}{2}\phi(x - \mu_2)$, for $\phi$ the standard bivariate normal density.

Example A.2 (Bivariate normal distribution truncated to 4 squares).

Let be the distribution whose density function is where is the standard bivariate normal density, with , , and .

Example A.3 (Bivariate normal distribution truncated to a frame).

Let be as in Example A.2 but with , , and .

The distribution is clearly centrally symmetric about the origin in each case: in Example A.1 because the two equally weighted components have opposite means, and in Examples A.2-A.3 because the standard bivariate normal distribution is centrally symmetric about the origin and the truncation region is symmetric with respect to the origin. However, Figure A.1 reveals that the Euclidean lens depth function of Liu and Modarres (2011) with respect to any of these three distributions is not maximal at the centre, nor is it monotonic away from the deepest point(s); see, for instance, the local maximum that the depth function admits in Example A.2.

Figure A.1: Top row: Example A.1; Central row: Example A.2; Bottom row: Example A.3. From left to right, density function (first column), sample lens depth constructed with sample draws from (second column), corresponding heat-map (third column) and its section along the line (top-right panel), (central-right panel) and (bottom-right panel).