Fast nonparametric classification based on data depth

07/20/2012
by   Tatjana Lange, et al.

A new procedure, called the DDα-procedure, is developed to solve the problem of classifying d-dimensional objects into q >= 2 classes. The procedure is completely nonparametric; it uses q-dimensional depth plots and a very efficient algorithm for discrimination analysis in the depth space [0,1]^q. Specifically, the depth is the zonoid depth, and the algorithm is the alpha-procedure. In the case of more than two classes several binary classifications are performed and a majority rule is applied. Special treatments are discussed for 'outsiders', that is, data having zero depth vector. The DDα-classifier is applied to simulated as well as real data, and the results are compared with those of similar procedures that have been recently proposed. In most cases the new procedure has comparable error rates, but is much faster than other classification approaches, including the SVM.

1 Introduction

A steady interest in statistical learning theory has intensified recently since nonparametric tools have become available. A new impetus has been given to supervised classification by employing depth functions such as Tukey's [25] halfspace depth or Liu's [18] simplicial depth. In supervised learning a function is constructed from labeled training data that classifies an arbitrary data point by assigning it one of the labels [12]. Given two or more labeled clouds of training data in d-dimensional space, a data depth measures the centrality of a point with respect to these clouds. For any point it indicates the degree of closeness to each class. This can be employed in different ways for solving the classification task. Many authors have made use of data depth ideas in supervised classification.

Liu et al. [19] were the first who stressed the usefulness and versatility of depth transformations in multivariate analysis. They introduced the notion of a DD-plot, that is, the two-dimensional representation of multivariate objects by their data depths with regard to two given distributions. In a straightforward way, an object can be classified to the class in which it is deepest, that is, according to its maximum depth. Jornsten [14] and Ghosh and Chaudhuri [11] have followed this and similar approaches; see also Hoberg and Mosler [22]. Dutta and Ghosh [7, 6] employ a separator that is linear in a density based on kernel estimates of the projection depth and the L_p-depth, respectively. Recently, Li et al. [17] have used polynomial separators of the DD-plot to classify objects by their depth representation. These methods differ in the notion of depth used and allow for adaptive and other extensions.

The quoted literature has in common that a (possibly high-dimensional) space of objects is transformed into a lower-dimensional space of depth values of these objects and the classification task is performed in the depth space. In this context several questions arise:

  1. Which particular notion of depth should be employed?

  2. Which classification procedure should be applied to the depth-represented data?

  3. How does the procedure extend to more than two classes?

The above literature answers these questions in different ways. Ad (1), halfspace and simplicial depths, among others, have been employed in [10, 17, 19]. They depend only on the combinatorial structure of the data, being constant on the compartments spanned by them. Consequently, these depths are rather robust to outlying data, but calculating them in higher dimensions can be cumbersome if not impossible. On the other hand, the Mahalanobis depth [20], which has also been used by these authors, is easily calculated but highly non-robust. Moreover, it depends on the first two moments only and does not reflect any asymmetries of the data. More robust forms of the Mahalanobis depth remain insensitive to data asymmetries. The L_1-depth as used in [14] has similar drawbacks. [6] employ L_p-depths, which are easily calculated if p is known, and choose p in an adaptive procedure; however, the latter needs heavy computations. In [22] the maximum zonoid depth and a combination of it with the Mahalanobis depth are used; both can be efficiently calculated also in high dimensions but lack robustness. Ad (2), Li et al. [17] solve the classification problem of the DD-plot by designing a polynomial line that separates the unit square and provides a minimal average misclassification rate (AMR); the order (up to three) of the polynomial is selected by cross-validation. Similarly, separators are determined in [7] and [6] by cross-validation.

Ad (3), with more than two classes a given point is usually classified in two steps according to a majority rule: firstly, binary classifications are performed for each pair of classes in the object space, and secondly the point is assigned to that class to which it was most often assigned in step 1.

In this paper, ad (1), we employ the zonoid depth [15, 21], as it can be efficiently calculated also in higher dimensions and has excellent theoretical properties regarding continuity and statistical inference. However, the zonoid depth has a low breakdown point. If, in a concrete application, robustness is an issue, the data have to be preprocessed by some outlier detection procedure. Ad (2), for the final classification in the depth space a variant of the α-procedure is employed. It operates simply and very efficiently on low-dimensional spaces like the depth spaces considered here. The α-procedure has been originally developed by Vasil'ev [27, 28] and Lange [29]. Ad (3), we employ DD-plots if there are two classes and q-dimensional depth plots if there are q > 2 classes. Assignment of a given point to a class is based on binary classifications in the q-dimensional depth space plus a majority rule. Note that in each binary classification the whole depth information regarding all q classes is used.

We call our approach the DDα-approach and apply it to simulated as well as real data. The results are contrasted with those obtained in [17], [7], and [6].

The contribution of this paper is threefold. A classification procedure is proposed that

  1. is efficiently computable for objects of higher dimensions,

  2. employs a very fast classification procedure for the D-transformed data,

  3. uses the full multivariate information when classifying into two or more classes.

The rest of the paper is organized as follows. Section 2 introduces the depth transform, which maps the data from the d-dimensional object space to the q-dimensional depth space, and provides a first discussion of the problem of 'outsiders', that is, points having a vanishing depth vector. In Section 3 our modification of the α-procedure is presented in some detail. Section 4 provides a number of theoretical results regarding the behavior of the DDα-procedure on elliptical and mirror symmetric distributions. Section 5 contains extensive simulation results and comparisons. Calculations on real-data benchmark examples are reported in Section 6, as well as a comparison of the DDα-procedure with the SVM approach. Section 7 concludes.

2 Depth transform

A data depth is a function that measures, in a certain sense, how close a given point x in R^d is located to the "center" of a finite set X in R^d, that is, how "deep" it is in the set. More precisely, a data depth is a function x -> D(x|X) in [0,1] that satisfies the following restrictions: it is affine invariant, upper semicontinuous in x, quasiconcave in x (that is, having convex upper level sets), and vanishing as ||x|| tends to infinity. Sometimes two weaker restrictions are imposed: orthogonal invariance, and decreasingness on rays from a point of maximal depth (that is, starshapedness of the upper level sets). For surveys of these restrictions and many special notions of data depth, see e.g. [30, 21, 8, 24, 2].

Now, assume that data in R^d are to be classified into q >= 2 classes and that X_1, ..., X_q are training sets for these classes, each having finite size. Let D be a data depth. The function mapping

x -> ( D(x|X_1), ..., D(x|X_q) ) =: ( d_1(x), ..., d_q(x) ),   R^d -> [0,1]^q,     (1)

will be mentioned as the depth representation. Each object x is represented by a vector whose components indicate its depth or closeness regarding the q classes. In particular, the training sets are transformed to sets in [0,1]^q that represent the classes in the depth space. It should be noted that 'closeness' of points in the original space translates to 'closeness' of their representations. The classification problem then becomes one of partitioning the depth space [0,1]^q into q parts.
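To make the transform in (1) concrete, the following is a minimal sketch, not the authors' implementation, of the depth representation with the zonoid depth. The depth is obtained here from a plain linear program that minimizes the largest weight in a convex combination of data points representing x; the paper instead relies on the much faster specialized algorithm of [9], and the function names below are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def zonoid_depth(x, X):
    """Zonoid depth of x (shape (d,)) w.r.t. the data cloud X (shape (n, d))."""
    n, d = X.shape
    # variables (lambda_1, ..., lambda_n, t); minimize t, an upper bound on all lambda_i
    c = np.r_[np.zeros(n), 1.0]
    A_ub = np.c_[np.eye(n), -np.ones((n, 1))]            # lambda_i - t <= 0
    b_ub = np.zeros(n)
    A_eq = np.vstack([np.r_[np.ones(n), 0.0],            # sum_i lambda_i = 1
                      np.c_[X.T, np.zeros((d, 1))]])     # sum_i lambda_i x_i = x
    b_eq = np.r_[1.0, np.asarray(x, dtype=float)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n + 1), method="highs")
    if not res.success:              # x lies outside conv(X): zero depth
        return 0.0
    return 1.0 / (n * res.x[-1])     # depth = 1 / (n * min max_i lambda_i)

def depth_representation(x, training_sets):
    """Map x in R^d to its depth vector (d_1, ..., d_q) as in (1)."""
    return np.array([zonoid_depth(x, X) for X in training_sets])
```

A point lying outside the convex hull of a class receives depth zero with respect to that class, which is exactly the 'outsider' issue discussed below.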

A simple rule, e.g., is to classify a point to that class where it has the largest depth value; see [11, 14]. This means that the depth space decomposes into q compartments which are separated by (parts of) bisecting hyperplanes. Maximum depth classification is a linear rule. A nonlinear classification rule is used in Li et al. [17], who treat the case q = 2 by constructing a polynomial line of degree up to 3 that separates the depth space [0,1]^2; see also [7, 6].
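As a small illustration, the maximum depth rule reduces to an argmax over the components of the depth representation (a sketch, reusing the hypothetical depth_representation above):

```python
import numpy as np

def max_depth_classify(depth_vector):
    """Assign the point to the class where its depth is largest
    (ties resolved in favor of the first such class); cf. [11, 14]."""
    return int(np.argmax(depth_vector))
```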

With several important notions of data depth, D(x|X) vanishes outside the convex hull of X. This is, e.g., the case with the halfspace, simplicial, and zonoid depths, but not with the Mahalanobis and L_p-depths. A point that lies outside the convex hull of every training set is then mapped to the origin of the depth space. Such a point will be mentioned as an outsider. Of course, it can neither be regarded as correctly classified nor be ignored. To classify such a point we may consider three principal approaches, each allowing for several variants.

  • Classify randomly, with probabilities equal to the expected class proportions among the points to be classified.

  • Use the k-nearest neighbors method with a properly chosen distance: Euclidean distance, Mahalanobis distance with moment estimates, or Mahalanobis distance with robust estimates (MCD, cf. e.g. [13]).

  • Classify with maximum Mahalanobis depth (using moment estimates or MCD) or with the maximum of another depth that is properly extended beyond the convex hull as e.g. in [22].

In the sequel we will use either random classification, k-nearest neighbors (with different distances), or maximum Mahalanobis depth (with moment and robust estimates); a minimal sketch of the k-NN fallback is given below.
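The k-NN fallback can be sketched as follows; the Euclidean distance is used for brevity (a Mahalanobis variant would only change the metric), and the function name and the default k = 1 are illustrative choices, not the authors' code.

```python
import numpy as np

def classify_outsider_knn(x, training_sets, k=1):
    """k-NN vote in the original object space over all labeled training points."""
    pooled = np.vstack(training_sets)
    labels = np.concatenate([np.full(len(X), j) for j, X in enumerate(training_sets)])
    dists = np.linalg.norm(pooled - np.asarray(x), axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return int(np.bincount(nearest).argmax())
```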

3 The α-procedure

To separate the classes in the multi-depth space we use the α-procedure, which has been developed by Vasil'ev [27, 28] and Lange [29]; see also [16]. Among others, the regression depth method (see [23, 3] or [4]) or the support vector machine (see [26] and [4]) seem to be good alternatives. In contrast with those, the α-procedure, in application to the current task, is substantially faster and produces a unique decision rule. Besides that, it focuses on features of the extended depth space, i.e. depths and their products, which, by their nature, are rather relevant. Moreover, by selecting a few important features only, the α-procedure yields a rather stable solution.

Let us first present the procedure in the case of q = 2 classes. As above, consider two clouds of training data in R^d, X_1 and X_2, of sizes n_1 and n_2. By calculating the depth of every training point with respect to each of the two clouds, its depth representation (d_1(x), d_2(x)) is obtained. The set

{ (d_1(x), d_2(x)) : x in X_1 or X_2 }

is the DD-plot of the data ([19]).

We use a modified version of the α-procedure to construct a nonlinear separator in [0,1]^2 that classifies the D-represented data points. The construction is based on the depth values and on products of depth values up to some degree p, which can be either chosen a priori or determined by cross-validation. For this, a linearized representation of the two classes in a depth feature space is constructed as follows. Each element of the extended D-representation is mentioned as a basic D-feature and the space as the feature space. When the maximum exponent is p, z(x) is a vector whose components are the monomials in the two depths of total degree between 1 and p,

z(x) = ( d_1, d_2, d_1^2, d_1 d_2, d_2^2, ..., d_2^p ),  with d_j = d_j(x).     (2)

The number of basic D-features, that is the dimension of the feature space, equals p(p+3)/2, which is easily seen by induction. We index the basic D-features by k = 1, ..., p(p+3)/2 and write z = (z_1, z_2, ...).
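Assuming, as the text suggests, that the basic D-features are the monomials in the two depths of total degree between 1 and p, they can be enumerated as in the following sketch (the helper name is illustrative):

```python
from itertools import product

def basic_d_features(d1, d2, p):
    """All monomials d1^a * d2^b with 1 <= a + b <= p; their number is p(p+3)/2."""
    return [(a, b, d1 ** a * d2 ** b)
            for a, b in product(range(p + 1), repeat=2) if 1 <= a + b <= p]

# e.g. p = 3 yields 3 * (3 + 3) / 2 = 9 basic D-features per point
```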

The α-procedure now performs, in a stepwise way, linear discrimination in subspaces of the feature space. It is a bottom-up approach that successively builds new features from the basic D-features. In each step certain two-dimensional subspaces of the feature space are considered, and the projection of the data to each of these subspaces is separated by a straight discrimination line. Out of these subspaces the α-procedure selects a subspace whose discrimination line provides the least classification error. Clearly, any discrimination line that separates the DD-plot must pass through the origin, since a point with d_1(x) = d_2(x) = 0 cannot be classified to either of the two classes. The same must hold for all discrimination lines in subspaces of the extended depth space.

Figure 1: α-procedure; step 1.

In a first step a pair (z_k, z_l) of basic D-features (2) is chosen, subject to the restriction that the two D-features do not solely relate to one of the classes. A straight discrimination line is calculated in the two-dimensional coordinate subspace defined by the pair. As the line passes through the origin it is characterized by an angle α. The best discriminating angle is determined by minimizing the average misclassification rate (AMR),

Δ(α) = (1 / (n_1 + n_2)) Σ_{x in X_1 ∪ X_2} I( x is assigned to the wrong class by the line at angle α ).     (3)

Here I(·) denotes the indicator function. If the minimum is attained in an interval, its middle value is selected for the angle; see Figure 1. The same is done for all pairs of D-features satisfying the above restriction, and the pair is selected that minimizes (3). If the minimum is not unique, the pair with the smallest indices is chosen. Let α_1 denote the selected angle and Δ_1 the respective AMR. Next the D-features z_k and z_l are replaced by a new D-feature z^(1), which gives value

z^(1)(x) = z_l(x) cos α_1 − z_k(x) sin α_1     (4)

to each x. Geometrically, the values z^(1)(x) are obtained by projecting (z_k(x), z_l(x)) to a straight line in the (z_k, z_l)-plane that is perpendicular to the discrimination line; see Figure 1. The first step results in the new D-feature z^(1) and the AMR Δ_1 produced by classifying according to this feature.
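A schematic reading of this first step is sketched below: for a fixed pair of D-features, candidate lines through the origin are scanned and the angle with the smallest AMR is kept. The grid over angles and the function name are illustrative stand-ins, not taken from the paper.

```python
import numpy as np

def best_angle(zk, zl, y, n_grid=180):
    """zk, zl: values of two D-features on the training points; y: labels in
    {+1, -1}.  Returns the angle of the best separating line through the
    origin in the (zk, zl)-plane and the corresponding AMR, cf. (3)."""
    best = (None, np.inf)
    for alpha in np.linspace(0.0, np.pi, n_grid, endpoint=False):
        # signed coordinate along the direction perpendicular to the line
        proj = zl * np.cos(alpha) - zk * np.sin(alpha)
        amr = min(np.mean(np.sign(proj) != y),    # try both orientations
                  np.mean(np.sign(-proj) != y))
        if amr < best[1]:
            best = (alpha, amr)
    return best
```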

The second step couples the new D-feature z^(1) with each of the basic D-features that have not been replaced so far. For each of these pairs of D-features a best discriminating angle is determined, and among these pairs the one is selected that provides the minimum AMR. The minimum error is denoted by Δ_2 and the angle at which it is attained by α_2. This is visualized in Figure 2. The best pair of D-features is replaced by a new D-feature z^(2), whose values are calculated as in (4).

The last step is repeated with z^(2) in place of z^(1), etc. The procedure stops after some step s if either no additional discriminating power is gained, that is the AMR does not decrease any further, or all basic D-features have been replaced. Then the angle α_s defines a linear rule for discriminating between two polynomials of order up to p in d_1 and d_2, which correspond to the two finally coupled D-features, according to their sign. This yields a polynomial separation of the classes in the depth space.

For example, suppose that in step 1 two of the basic features are selected and that two further basic features are included in steps 2 and 3. If the procedure terminates after step 3, the result is a polynomial P(d_1, d_2) of degree at most p in the two depths, namely a linear combination of the selected basic D-features with coefficients built from the sines and cosines of the angles α_1, α_2, α_3. A given point of the object space then is classified according to the sign of this polynomial.
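Putting the steps together, the following sketch (built on the best_angle helper above, with the step-1 pairing restriction omitted for brevity) iterates the coupling-and-projection scheme until the AMR stops decreasing; it is a schematic reading of the text, not the authors' implementation.

```python
import numpy as np

def alpha_procedure(Z, y):
    """Z: (N, K) matrix of basic D-features; y: labels in {+1, -1}.
    Returns the final convoluted feature on the training data and its AMR;
    applying the same sequence of projections to a new point's basic
    D-features and taking the sign yields its class."""
    remaining = list(range(Z.shape[1]))
    # step 1: best pair of basic D-features
    best = None
    for i in remaining:
        for j in remaining:
            if i < j:
                alpha, amr = best_angle(Z[:, i], Z[:, j], y)
                if best is None or amr < best[2]:
                    best = (i, j, amr, alpha)
    i, j, amr_cur, alpha = best
    z_cur = Z[:, j] * np.cos(alpha) - Z[:, i] * np.sin(alpha)
    remaining = [k for k in remaining if k not in (i, j)]
    # further steps: couple the current feature with each remaining basic one
    while remaining:
        k, alpha, amr = min(((k,) + best_angle(z_cur, Z[:, k], y)
                             for k in remaining), key=lambda t: t[2])
        if amr >= amr_cur:          # no additional discriminating power
            break
        z_cur = Z[:, k] * np.cos(alpha) - z_cur * np.sin(alpha)
        amr_cur = amr
        remaining.remove(k)
    return z_cur, amr_cur
```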

If there are more than two classes, say q, each data point is represented by the vector of its q depth values in [0,1]^q. Again a depth feature space of some order p is considered, whose basic D-features are the products of depth values up to degree p. With q classes every two training classes are separated by the α-procedure in the same way as above: in each step a pair of D-features is replaced by a new D-feature as long as the AMR decreases and basic D-features are left to be replaced. For each pair of classes the procedure results in a hypersurface that separates the q-dimensional depth space into two sets of attraction. A given point is finally assigned to that class to which it has been most often attracted; a sketch of this majority vote follows below.
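The majority vote over the pairwise separators can be sketched as follows; binary_rules is assumed to map a pair (i, j) of class indices to a trained separator that returns a positive value for class i, which is not part of the paper's notation.

```python
import numpy as np
from itertools import combinations

def majority_vote(depth_vec, binary_rules, q):
    """Assign a q-variate depth vector to the class winning most of the
    q(q-1)/2 pairwise comparisons."""
    votes = np.zeros(q, dtype=int)
    for i, j in combinations(range(q), 2):
        winner = i if binary_rules[(i, j)](depth_vec) > 0 else j
        votes[winner] += 1
    return int(np.argmax(votes))
```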

Figure 2: α-procedure; step 2.

4 Some theoretical aspects

In order to investigate some properties of the DDα-approach we transfer it to a more general probabilistic setting and define a depth function as the population version of a data depth. Let a properly chosen set of probability distributions on R^d be given that includes the empirical distributions. A depth function is a function that assigns a value D(x|P) in [0,1] to every x in R^d and every distribution P in this set in an affine invariant way (i.e. D(Ax + b | P_{A,b}) = D(x|P) for any nonsingular matrix A and any vector b, where P_{A,b} denotes the push-forward measure of P under x -> Ax + b), and has convex compact upper level sets. Obviously, the restriction of a depth function to the class of empirical distributions is an affine invariant quasiconcave data depth. For details on general depth functions, see e.g. the above cited surveys [2, 21, 24, 30].

While data depth is an intrinsically nonparametric notion, the behavior of depth functions and depth-based procedures on parametric classes is of special interest, as it indicates how the nonparametric approach relates to the more classical parametric one. As a generalization of multivariate Gaussian distributions, spherical and elliptical distributions play an important role in parametric multivariate analysis. A random vector Y in R^d has a spherical distribution if Y = R U, where U is a random vector uniformly distributed on the unit sphere in R^d and R is a nonnegative random variable independent of U. A random vector X has an elliptical distribution if it is an affine transform of a spherically distributed Y, X = AY + μ. If X has a density, we notate X ~ Ell(μ, Σ; f), where Σ = AA' and f is the density generator; the distribution is unimodal if f is nonincreasing. As, by definition, a depth function is affine invariant, it operates on elliptical distributions in a rather simple way. The following propositions give some insight into the behavior of depth functions and the DDα-procedure if the data generating processes are elliptical.

Proposition 4.1

If D is an affine invariant depth function and P an elliptical distribution, then for every α the upper level set

D_α(P) = { x in R^d : D(x|P) >= α }

is an ellipsoid.

Proof. Let P = Ell(μ, Σ; f) with Σ = AA', and let P_0 denote the corresponding spherical distribution, so that P is the push-forward of P_0 under x -> Ax + μ. Then, for all α, the boundary of D_α(P_0) is a sphere, since D is, in particular, orthogonal invariant. Hence D_α(P_0) is a ball and, by the affine transformation with A and μ, D_α(P) is an ellipsoid.

Proposition 4.2
(i)

Let D be the zonoid depth and P = Ell(μ, Σ; f) a unimodal elliptical distribution, that is, one with nonincreasing density generator f. Then, for every non-empty density level set { x : density(x) >= c }, some α exists such that this level set coincides with the zonoid central region D_α(P).

(ii)

If, in addition, P has an interval support, then the correspondence between density levels and depth levels in (i) is a continuous, strictly increasing function. It holds that the zonoid depth at a point is a strictly increasing transform of the density at that point,

D(x|P) = φ( density(x) )  for some continuous, strictly increasing φ.     (5)

Proof. (i): The zonoid depth attains its maximum 1 only at the expectation, which for a unimodal elliptical distribution is also the point of maximal density. Thus, if the density level set is this single point, the claim holds with α = 1. Now let the level set have interior points and assume w.l.o.g. that P is spherical. Then the level set is a ball with center at the origin; let x_0 be a point on its surface. Also the central regions D_α(P) are balls around the origin. By Theorems 3.9 and 3.14 in [21], the regions D_α(P) are continuous and strictly decreasing in α on the convex hull of the support of P. We conclude that some α exists with x_0 on the boundary of D_α(P), and hence the two balls coincide.
(ii): Under the additional premise, the density level sets are continuously and strictly decreasing in the level, which yields the result.

Corollary 4.1

Consider a mixture of unimodal elliptical distributions P_1, ..., P_q with mixing probabilities π_1, ..., π_q, and assume that all components have an interval support. Let D be the zonoid depth.

Then, for each i and j there exists a strictly increasing function φ such that the comparison of the weighted densities can be expressed through the depths: π_i f_i(x) >= π_j f_j(x) holds exactly if D(x|P_i) >= φ( D(x|P_j) ).

Proof. From Proposition 4.2, continuous and strictly increasing functions φ_i and φ_j are obtained with D(x|P_i) = φ_i(f_i(x)) and D(x|P_j) = φ_j(f_j(x)). Consequently,

π_i f_i(x) >= π_j f_j(x)  exactly if  D(x|P_i) >= φ_i( (π_j/π_i) φ_j^{-1}( D(x|P_j) ) ),

which proves the claim by use of the function φ(t) = φ_i( (π_j/π_i) φ_j^{-1}(t) ).

A similar result holds for other data depths including the halfspace, simplicial, projection and Mahalanobis depths; see Prop. 1 in [17]. In the rest of the section we consider the limit behavior of the DDα-procedure under independent sampling. For this, we assume that the empirical depth is a consistent estimator of its population version. This is particularly true for the zonoid, halfspace, simplicial, projection and Mahalanobis depths.

Theorem 4.1 (Bayes rule)

Let P_1 and P_2 be probability distributions in R^d having densities f_1 and f_2, and let H be a hyperplane such that P_2 is the mirror image of P_1 with respect to H and f_1 >= f_2 holds in one of the half-spaces generated by H. Then, based on a 50:50 independent sample from P_1 and P_2, the DDα-procedure will asymptotically yield the linear separator that corresponds to the bisecting line of the DD-plot.

Note that the rule given in the theorem corresponds to the Bayes rule, see [12]. In particular, the requirements of the theorem are satisfied if P_1 and P_2 are mirror symmetric to each other and unimodal.

Proof. Due to the mirror symmetry of the distributions, the DD-plot is symmetric as well. Its symmetry axis is the bisector, which is obviously the result of the α-procedure when the sample is large enough.

Theorem 4.2

Let P_1 = Ell(μ_1, Σ; f) and P_2 = Ell(μ_2, Σ; f) be unimodal elliptical distributions that differ in location only. Then, based on a 50:50 independent sample from P_1 and P_2, the DDα-procedure will asymptotically yield the linear separator that corresponds to the bisecting line of the DD-plot.

Proof. If P_1 and P_2 are spherically symmetric, they satisfy the premise of the previous theorem. A common affine transformation of P_1 and P_2 does not change the DD-plot.

5 Simulation study

The DDα-procedure has been implemented on a standard PC in an R-environment. To explore its specific capabilities we apply it to simulated as well as to real data. The same data have been analyzed with several classifiers in the literature. In this section results on simulated data are presented regarding the average misclassification rate of nine procedures besides the DDα-classifier (Section 5.1). Then the speed of the DDα-procedure is quantified (Section 5.2). The following Section 6 covers the relative performance of the DDα- and other classifiers on several benchmark data sets.

5.1 Comparison of performance

To simplify the comparison with known classifiers, we use the same simulation settings as in [17]. These are supervised classification tasks with two equally sized training classes. Data are generated by ten pairs of distributions according to Table 1. Here N and Exp denote the Gaussian and exponential distributions, respectively, and MixN a Gaussian mixture.

No. | Alternative                        | 1st class                                                           | 2nd class
 1  | Normal location                    | N                                                                   | N
 2  | Normal location-scale              | N                                                                   | N
 3  | Cauchy location                    | Cauchy                                                              | Cauchy
 4  | Cauchy location-scale              | Cauchy                                                              | Cauchy
 5  | Normal contaminated location       | Learning sample: 90% as No. 1, 10% from N. Testing sample: as No. 1 | as No. 1
 6  | Normal contaminated location-scale | Learning sample: 90% as No. 2, 10% from N. Testing sample: as No. 2 | as No. 2
 7  | Exponential location               |                                                                     |
 8  | Exponential location-scale         |                                                                     |
 9  | Asymmetric location                |                                                                     |
10  | Normal-exponential                 | N                                                                   |
Table 1: Distributional settings used in the simulation study.

The DDα-classifier is contrasted with the following nine classifiers: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors classification (k-NN), maximum depth classification based on the Mahalanobis (MM), simplicial (MS), and halfspace (MH) depths, and DD-classification with the same depths (DM, DS and DH, correspondingly). For more details about the data and the procedures, as well as for some motivation, the reader is referred to [17].

All simulations of [17] are recalculated following their paper as closely as possible. The LDA, QDA and k-NN classifiers are computed with the R-packages "MASS" and "class", where the parameter k of the k-NN classifier is selected by leave-one-out cross-validation over a relatively wide range. The simplicial and halfspace depths have been determined by exact calculations with the R-package "depth". The zonoid depth has been exactly computed by the algorithm in [9]. Recall that, in dimension two, calculations of all these depths can be efficiently done by a circular sequence, and note that the problem of prior probabilities is avoided by choosing test samples of equal size from both classes.

For the DD-classifiers a polynomial line (up to degree three) is determined to discriminate in the two-dimensional DD-plot, a tenfold cross-validation is employed to choose the optimal degree of the polynomial, a smoothing constant is selected in the logistic function, and the DD-plot is never rotated. Each experiment includes a training phase and an evaluation phase: from the given pair of distributions 400 observations (200 of each class) are generated to train the classifier, and 1000 observations (500 of each class) to evaluate its AMR. For each distribution pair and each classifier 100 experiments are performed, and the resulting sample of AMRs is visualized as a box-plot; see Figures 3 to 7.
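The experimental protocol just described can be summarized schematically as below; sample_class1, sample_class2, train and classify are placeholder hooks, not part of the authors' code.

```python
import numpy as np

def run_experiments(sample_class1, sample_class2, train, classify, reps=100):
    """One simulation setting: 100 repetitions of train-on-400 / test-on-1000."""
    amrs = []
    for _ in range(reps):
        X1, X2 = sample_class1(200), sample_class2(200)     # training phase
        rule = train(X1, X2)
        T1, T2 = sample_class1(500), sample_class2(500)     # evaluation phase
        errors = sum(classify(rule, x) != 0 for x in T1) \
               + sum(classify(rule, x) != 1 for x in T2)
        amrs.append(errors / 1000)
    return np.array(amrs)   # the sample of AMRs summarized by one box-plot
```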

Figure 3: Normal location (left) and location-scale (right) alternatives.
Figure 4: Cauchy location (left) and location-scale (right) alternatives.
Figure 5: Normal contaminated location (left) and location-scale (right) alternatives.
Figure 6: Exponential location (left) and location-scale (right) alternatives.
Figure 7: Asymmetric location (left) and normal-exponential (right) alternatives.

As we have discussed at the end of Section 2, with depths like the simplicial, halfspace and zonoid depth the problem of outsiders arises. An outsider is, in the DD-plot, represented by the origin. A simple approach is to assign the outsiders randomly to the two classes. Throughout our simulation study we have chosen the random assignment rule, which results in a kind of worst-case AMR. Observe that this choice of assignment rule discriminates against the procedures that yield outsiders and advantages those that do not, in particular LDA, QDA, MM, DM and k-NN, for all distribution settings.

The principal results of the simulation study are collected in Figures 3 to 7. Under the normal location-shift model (Figure 3, left) all classifiers behave satisfactorily, and the DDα-classifier performs well among them. However, LDA, QDA, MM and DM show slightly better results, since they do not have to cope with outsiders like the other depth-based procedures.

Also under the normal location-scale alternative (Figure 3, right) the DDα-classifier performs rather well, like all DD-classifiers. A slightly worse performance of the DDα-classifier is observed when discriminating the Cauchy location alternative (Figure 4, left), but it is still close to the DD-classifiers. This can be attributed to the lower robustness of the zonoid depth. However, when scaling enters the game (Cauchy location-scale alternative, Figure 4, right), the DDα-classifier again performs quite satisfactorily. The same picture arises when considering the contaminated normal settings (Figure 5, left and right). Under the location alternative, the DDα-classifier is a bit worse than the DD-classifiers, while it slightly outperforms them in the location-scale setting.

The relative robustness of the DDα-classifier may be explained by two of its features: First, it maps the original data points to a compact set, the q-dimensional unit hypercube. Second, for classification in the unit hypercube, it employs the α-procedure, which, by choosing a median angle in each step, is rather insensitive to outliers.

Under the exponential alternatives (Figure 6, left and right) the DDα-classifier shows excellent performance, which is even similar to that of the k-NN for both the location and location-scale alternatives. Its results for the asymmetric location alternative (Figure 7, left) are somewhat ambiguous, though still close to those of the DD-classifiers. Concerning the normal-exponential alternative (Figure 7, right), the DDα-classifier performs distinctly better than the others considered here.

On the basis of the simulation study we conclude: the DDα-classifier (1) performs quite well under various settings of elliptically distributed alternatives, (2) is rather robust to outlier-prone data, and (3) shows distinctly good behavior under the asymmetrically distributed alternatives considered and when the two classes originate from different families of distributions.

5.2 Speed of the DDα-procedure

To estimate the speed of DDα-classification we have quantified the total time of training and classification under two simulation settings, a location and a location-scale alternative of d-variate normal distributions (see Table 2, header), with various values of the dimension d and of the total size n of the training classes. An experiment consists of a training phase based on two equally sized samples and an evaluation phase, where 2500 points (1250 from each distribution) are classified. Each experiment is performed 100 times, then the average computation time is determined. All these computations have been conducted on a single core of a Core i7-2600 processor (3.4 GHz) having enough physical memory.

Table 2 exhibits the average computation times (in seconds, with the standard deviations in parentheses) under the two distributional settings and for different d and n. As is seen from the table, the DDα-classifier is very fast, in the learning phase as well as in classifying large amounts of data. However, computation times increase considerably with the number of training points, which is due to the many calculations of the zonoid depth needed. With the dimension d, computation time grows more slowly, which may be explained as follows. With increasing dimension of the data space, more points come to lie on the convex hull of a training set (thus having the minimal positive depth) or outside it (thus having depth zero). The algorithm from [9] computes the depth of such points much faster than that of points having larger depths.
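A hedged sketch of the timing protocol described above, with n the total training size split equally between the classes and placeholder train/classify hooks:

```python
import time
import numpy as np

def timed_run(sample1, sample2, train, classify, n, reps=100):
    """Average wall-clock time of training on n points and classifying
    2500 points (1250 per class), over `reps` repetitions."""
    times = []
    for _ in range(reps):
        X1, X2 = sample1(n // 2), sample2(n // 2)
        test = np.vstack([sample1(1250), sample2(1250)])
        t0 = time.perf_counter()
        rule = train(X1, X2)
        for x in test:
            classify(rule, x)
        times.append(time.perf_counter() - t0)
    return float(np.mean(times)), float(np.std(times))
```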

Normal location alternative:
0.14 (0.00014)   1.55 (0.00014)    1.89 (-)         2.24 (-)
1.04 (0.00046)   10.37 (0.00052)   12.58 (0.00062)  14.14 (-)
5.33 (0.0012)    42.54 (0.0014)    53.66 (0.0017)   59.18 (-)

Normal location-scale alternative:
0.15 (0.00014)   1.62 (0.00016)    1.94 (0.00021)   2.2 (0.00027)
1.09 (0.00044)   11.33 (0.00059)   14.44 (0.00079)  15.18 (0.0010)
5.24 (0.0011)    47.63 (0.0016)    67.22 (0.0022)   74.15 (0.0026)
Table 2: Computing times of DDα-classification, in seconds (standard deviations in parentheses), under the two distributional settings and for different d and n.

6 Benchmark studies

Concerning real data, we take benchmark examples from [17, 7, 6] to compare the performance of the DDα-classifier with respect to the AMR (Section 6.1). In addition, we use four real data sets from the UCI Machine Learning Repository [1] to contrast the DDα-classifier with the support vector machine (SVM) of [26] regarding both performance and time (Section 6.2).

6.1 Benchmark comparisons with nonparametric classifiers

As our benchmark examples are well known, we refer to the literature for their detailed description and restrict ourselves to mentioning the dimension (d), the number of classes (q), the number of points used for training (# train), the number of testing points (# test) and the total number of points (# total); see Table 3.

 
No. | Dataset            | Results     | q | d  | # train | # test | # total
 1  | Biomedical         | Tables 4, 5 | 2 | 4  | 150     | 44     | 194
    |                    | Table 6     | 2 | 4  | 100     | 94     | 194
 2  | Blood Transfusion  | Table 6     | 2 | 3  | 374     | 374    | 748
    |                    | Table 4     | 2 | 3  | 500     | 248    | 748
 3  | Diabetes (1)       | Table 6     | 3 | 5  | 100     | 45     | 145
 4  | Diabetes (2)       | Table 7     | 2 | 8  | 767     | 1      | 768
 5  | Ecoli              | Table 7     | 3 | 7  | 271     | 1      | 272
 6  | Glass              | Tables 5, 6 | 2 | 5  | 100     | 46     | 146
    |                    | Table 7     | 2 | 9  | 145     | 1      | 146
 7  | Hemophilia         | Table 6     | 2 | 2  | 50      | 25     | 75
 8  | Image Segmentation | Table 4     | 2 | 10 | 500     | 160    | 660
 9  | Iris               | Table 7     | 3 | 4  | 149     | 1      | 150
10  | Synthetic          | Tables 5, 6 | 2 | 2  | 250     | 1000   | 1250
Table 3: Overview of benchmark examples; number of classes (q), dimension (d), number of training points (# train), number of testing points (# test), total number of points (# total).

Tables 4, 5 and 6 exhibit the performance (in terms of AMR, with standard errors in parentheses) of the DDα-classifier together with the performance of the different classifiers investigated in [17], [7] and [6], based on the respective benchmark data. When applying the DDα-classifier, an auxiliary procedure has to be chosen by which outsiders are treated. In our benchmark study we employ several such procedures.

Dataset LDA QDA k-NN MM MH DM DH DDα
Biomedical 17.05 13.05 14.32 27.14 18.00 12.25 17.48 24.59
(0.49) (0.38) (0.45) (0.6) (0.49) (0.4) (0.51) (0.63)
Blood 29.49 29.11 29.74 32.56 30.47 26.82 28.26 32.27
Transfusion (0.08) (0.13) (0.13) (0.29) (0.3) (0.19) (0.19) (0.25)
Image 8.17 9.44 5.59 9.12 11.87 9.54 13.98 43.58
Segmentation (0.2) (0.19) (0.19) (0.23) (0.25) (0.2) (0.29) (0.34)
Table 4: Benchmark performance with the DDα- and other classifiers.

In Table 4 the DDα-procedure is contrasted with the real-data results in [17]. Here we use the same settings as in Section 5.1 and classify the outsiders on a random basis. All results in Table 4 have been recalculated.

As we see from the table, the performance of our new classifier is mostly worse than that of the classifiers considered in [17]. Only in the Blood Transfusion case is the AMR of comparable size. However, in this comparison the eventual presence and treatment of outsiders plays a decisive role. Observe that [17], in their procedures MH and DH, use the random Tukey depth [5] to approximate the halfspace depth of a data point in dimension three and more. But the random Tukey depth generally overestimates the halfspace depth, so that some of the outsiders remain undetected. This implies that, in the procedures MH and DH, considerably fewer points (we observed around 16%, 4% and 11%, correspondingly) are treated as outsiders and assigned on a random basis.

In fact, as exactly determined by calculating the zonoid depth, the rate of outsiders in the Biomedical Data (with d = 4) totals some 35%, in the Blood Transfusion Data (d = 3) about 11%, and in the Image Segmentation Data (d = 10) about 86%. This is in line with our expectation: the higher the dimension of the data, the higher the outsider rate. In contrast to the MH and DH procedures, the DDα-procedure detects all outsiders and, in the comparison of Table 4, assigns them randomly. Obviously the performance of the latter can be improved with a proper non-random procedure of outsider assignment. In the subsequent benchmark comparisons several such procedures of non-random outsider assignment are included.

Dutta and Ghosh [7] introduce classification based on projection depth and compare it with several variants of the maximum-Mahalanobis-depth (MD) classifier. The same authors [6] propose an L_p-depth classifier (with optimized p) and contrast it with two types of MD. To compare the DDα-classifier on a par with [7, 6], we implement the following rules for handling outsiders: First, k-nearest-neighbor rules are used with various k and either Euclidean or Mahalanobis distance, the latter with moment or, alternatively, MCD estimates. Second, maximum Mahalanobis depth is employed, again based on moment or MCD estimation. As the k-NN results of the benchmark examples do not vary much with k, we restrict to k = 1. (However, the performance of the classifiers can be improved by an additional cross-validation over k.) Consequently, five different rules for treating outsiders remain for comparison. Tables 5 and 6 exhibit the performance of the DDα-classifier vs. the projection-depth classifiers of [7] and the L_p-depth classifiers of [6], respectively, regarding the benchmark examples investigated in these papers. The last five columns of Table 5 and the bottom part of Table 6 report the AMR (standard deviations in parentheses) of the DDα-classifier when one of the five outsider treatments is chosen. The remaining columns are adopted as they stand in [7] and [6].

Dataset MD MD MD MD PD PD
(SS) (MS) (SS) (MS) (SS) (MS)
Synthetic 13.00 11.60 10.30 10.40 10.00 10.50
Glass 26.59 26.14 24.92 24.43 25.70 25.24
(0.25) (0.25) (0.25) (0.25) (0.34) (0.33)
Biomedical 12.44 12.04 14.25 14.03 12.37 12.18
(0.13) (0.12) (0.13) (0.14) (0.14) (0.13)
Dataset  DDα: 1-NN Eucl.  DDα: 1-NN Mah. (Mom.)  DDα: 1-NN Mah. (MCD)  DDα: Mah. depth (Mom.)  DDα: Mah. depth (MCD)
Synthetic 12.10 11.90 12.00 11.90 12.00
Glass 29.45 25.79 24.73 30.09 35.06
(0.20) (0.17) (0.18) (0.18) (0.22)
Biomedical 13.51 19.59 17.90 12.91 15.23
(0.14) (0.18) (0.17) (0.14) (0.16)
Table 5: Benchmark comparison with projection depth classifiers.

Regarding the Biomedical Data, [7] do not specify the sample sizes they use in training and testing. For the DDα-classifier, we select 100 observations of the larger class and 50 of the smaller class to form the training sample; the remaining observations constitute the testing sample. As is seen from Table 5, the DDα-classifier shows results similar to the projection-depth classifier (except with the Synthetic Data), while the performance of the outsider-handling methods varies depending on the type of the data. Specifically, with the Glass Data, 1-NN based on the Mahalanobis distance (both with the moment and the robust estimate) performs best in handling outsiders. On the other hand, with the Biomedical Data the same approach performs quite poorly, while treating outsiders with moment-estimated Mahalanobis depth or Euclidean 1-NN yields the best results.

Dataset  MD (Mom.)  MD (MCD)  LD (Mom.)  LD (MCD)  DDα: 1-NN Eucl.  DDα: 1-NN Mah. (Mom.)  DDα: 1-NN Mah. (MCD)  DDα: Mah. depth (Mom.)  DDα: Mah. depth (MCD)
Syn. 10.20 10.60 9.60 10.70 12.10 11.90 12.00 11.90 12.00
Hem. 15.84 17.13 15.39 16.43 16.63 17.98 18.36 18.65 19.39
(0.30) (0.32) (0.32) (0.32) (0.20) (0.20) (0.19) (0.22) (0.22)
Gla. 26.80 24.80 27.64 24.75 30.13 28.37 26.63 32.88 36.82
(0.26) (0.29) (0.29) (0.26) (0.19) (0.22) (0.20) (0.22) (0.23)
Biom. 12.35 14.48 12.68 15.11 13.74 22.09 20.89 14.34 17.28
(0.14) (0.15) (0.15) (0.15) (0.09) (0.16) (0.14) (0.12) (0.14)
Diab. 8.22 11.49 9.39 11.92 10.77 18.36 18.33 12.70 15.90
(0.18) (0.22) (0.21) (0.27) (0.12) (0.18) (0.20) (0.18) (0.19)
B.Tr. 22.75 22.17 22.30 22.06 23.11 22.73 22.92 22.59 22.17
(0.07) (0.08) (0.07) (0.07) (0.06) (0.06) (0.06) (0.06) (0.06)
Table 6: Benchmark comparison with L_p-depth classifiers.

Table 6 presents a similar comparison of the DDα-classifier with the L_p-depth classifier of [6]. The same approaches are included to treat outsiders. In all six benchmark examples the DDα-classifier generally performs worse than the best L_p-depth classifier. However, its performance substantially depends on the chosen treatment of outsiders. In all examples the AMR of the DDα-classifier comes close to that of the L_p-depth classifier, provided the outsider treatment is properly selected. On the Hemophilia Data, e.g., Euclidean 1-NN should be chosen. On the Glass Data a 1-NN outsider treatment with robust Mahalanobis distance performs relatively best, etc. On the Blood Transfusion Data all outsider-handling approaches show equally good performance, which appears to be typical when the number of training points is relatively large compared to the dimension.

6.2 Benchmark comparisons with SVM

The support vector machine (SVM) is a powerful solver of the classification problem and has been widely used in applications. However, different from the DDα-classifier, the SVM is a parametric approach, as in applying it certain parameters have to be adjusted: the box constraint and the kernel parameters. The AMR performance of the SVM depends heavily on the choice of these parameters. In applications, optimal parameters are selected by some cross-validation, which requires extensive calculations. Once these parameters have been optimized, SVM classification is usually very fast and precise.

In comparing the SVM with the DDα-procedure, this step of parameter optimization has to be somehow accounted for. Here we introduce a two-fold view on the comparison problem: two values of the AMR are calculated, first the best AMR when the parameters have been optimally selected, and second the expected AMR when the parameters are systematically varied over specified ranges. Corresponding training times are also clocked. As ranges we choose the intervals between the smallest and the largest number that arise as an optimal value in one of our benchmark data examples. This seems to us a fair and, regarding the parameter ranges, rather conservative approach.
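A sketch of this two-fold comparison for the SVM side is given below; scikit-learn's SVC with an RBF kernel stands in for the LIBSVM/e1071 setup of the paper, and the (C, gamma) grid values are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def svm_best_and_average_amr(X, y, Cs=(0.1, 1, 10, 100), gammas=(0.01, 0.1, 1)):
    """Best and average AMR of an RBF-SVM over a (C, gamma) grid,
    estimated by leave-one-out cross-validation."""
    amrs = []
    for C in Cs:
        for gamma in gammas:
            acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"),
                                  X, y, cv=LeaveOneOut()).mean()
            amrs.append(1.0 - acc)
    return min(amrs), float(np.mean(amrs))   # best AMR, expected AMR over the grid
```

The spread between the two returned numbers indicates how much of the SVM's accuracy is owed to parameter tuning.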

As benchmarks, four well-known data sets are employed in the sequel: the Diabetes, Ecoli, Glass, and Iris Data, taken from [1]. Following [7], the two biggest classes of the Glass Data have been selected, and similarly to [6] we have chosen three of the bigger classes from the Ecoli Data. The DDα-classifier is calculated with the same outsider treatments as above. For the SVM classifier we use radial basis function kernels as implemented in LIBSVM, with the R-package "e1071" as an R-interface. Leave-one-out cross-validation is employed for performance estimation of all classifiers. The computation has been done on the same PC as in Section 5.2.

The results on the best AMR together with time quantities and portions of outsiders are collected in Table 7. The Iris Data appears twice in the table: first the original data are used, and second the same data after a preprocessing step. The preprocessing consists in the exclusion of an obvious outlier in the DD-plot that was identified by visual inspection of the plot.

Dataset  Legend  DDα: 1-NN Eucl.  DDα: 1-NN Mah. (Mom.)  DDα: 1-NN Mah. (MCD)  DDα: Mah. depth (Mom.)  DDα: Mah. depth (MCD)  SVM: Opt. (CV)
Diab. Error 28.26 30.6 34.51 24.35 31.77 23.18
Time:train 16.63 16.62 16.59 16.58 17.39 0.05 (875)
Time:test 0.033 0.009 0.0092 0.0035 0.0037 0.0023
γ/C 0.056/1
% outsiders 62.24 62.24 62.24 62.24 62.24
Ecoli Error 10.29 11.4 12.13 12.13 16.18 3.68
Time:train 0.26 0.26 0.26 0.26 0.26 0.0077 (105)
Time:test 0.014 0.0026 0.0032 0.001 0.00044 0.0019
γ/C 5.62/1.78
% outsiders 75 75 75 75 75
Glass Error 18.49 26.03 31.51 34.93 34.93 21.23
Time:train 0.31 0.32 0.31 0.32 0.32 0.0082 (36)
Time:test 0.0083 0.0019 0.0016 0.00014 0.00055 0.0024
γ/C 0.56/1
% outsiders 95.89 95.89 95.89 95.89 95.89
Iris Error 37.33 37.33 37.33 36 46.67 4.67
Time:train 0.07 0.07 0.07 0.07 0.07 0.0051 (30)
Time:test 0.0046 0.0018 0.0013 0.00033 0.00047 0.0017
γ/C 0.056/10
% outsiders 50 50 50 50 50
Iris Error 3.36 3.36 4.03 2.68 13.42 2.68
(Pre.) Time:train 0.07 0.07 0.07 0.07 0.07 0.0052 (30)
Time:test 0.0046 0.0011 0.0013 0.0006 0.00027 0.0017
γ/C 0.1/3.16
% outsiders 51.68 51.68 51.68 51.68 51.68
Table 7: Benchmark comparison with the support vector machine; γ: kernel parameter, C: box constraint. For the SVM, Time:train shows the training time with the cross-validation time for parameter selection in parentheses.

The overall analysis of Table 7 shows that, even when using an arbitrary technique for handling outsiders, the DDα-classifier mostly performs not much worse than an SVM whose parameters have been optimally chosen. In contrast, if the SVM is employed with some non-optimized parameters, its AMR can be considerably larger than that of the DDα-classifier. For the regarded data sets, average errors of the SVM over the relevant parameter intervals varied from 44.99% to 66.67% (not reported in the table).

The times needed to classify a new object (also given in Table 7) are quite comparable. But as the parameters of the SVM have to be adjusted first by running it many times for cross-validation, the computational burden of its training phase is much higher than that of the DDα-classifier, which has to be run only once. Recall that the latter is nonparametric regarding tuning parameters. For example, in our implementation it took 875 seconds to determine approximately optimal values of the SVM parameters for the Diabetes Data, and similarly substantial times for the others (see Table 7, in parentheses).

7 Discussion and conclusions

A new classification procedure has been proposed that is completely nonparametric. The DDα-classifier transforms the d-variate data to a q-variate depth plot and performs linear classification in an extended depth space. The depth transformation is done by the zonoid depth, and the final classification by the α-procedure. The procedure has attractive properties: First, it proves to be very fast and efficient in the training as well as in the testing phase; in this it highly outperforms existing alternative nonparametric classifiers, and also, regarding the training phase, the support vector machine. Second, in many settings of elliptically distributed alternatives, its AMR is of similar size to that of the competing classifiers. Moreover, it is rather robust to outlier-prone data. As a nonparametric approach, the new procedure shows a particularly good behavior under asymmetrically distributed alternatives and, in certain cases, when the two classes originate from different families of distributions. Other than many competitors, it considers all classes in the multi-class classification problem even when performing binary classifications. Different from k-NN, SVM and other kernel-based procedures, our method does not need to be parametrically tuned. Also, several theoretical properties of the DDα-procedure have been derived: It operates in a rather simple way if the data generating processes are elliptical, and a Bayes rule holds if q = 2 and the two classes are mirror symmetric.

The zonoid depth has many theoretical and computational advantages: most important here, it is efficiently computed also in higher dimensions. However, as it takes its maximum at the mean of the data, the zonoid depth lacks robustness. Nevertheless, the DDα-classifier shows a rather robust behavior. Its relative robustness can be explained as follows: The original data points are mapped to a compact set, the q-dimensional unit hypercube, and then classified by the α-procedure. The latter, by choosing a median angle in each step, is rather insensitive to outliers.

Points that lie outside the convex hull of every training set must be specially treated, as their depth representation is zero. To classify these so-called outsiders, several approaches have been used and compared. Instead of assigning them randomly, which disadvantages the DDα-procedure like other procedures based on the halfspace or simplicial depth, one should classify outsiders by k-NN with a suitable distance or by a properly chosen maximum depth rule.

To contrast the DDα-procedure with an SVM approach, a novel way of comparison has been taken: an optimal performance of the SVM has been evaluated, which arises under an optimal choice of the parameters, as well as an average performance, where the parameters vary over specified conservative intervals. It turned out that, even with an arbitrary handling of outsiders, the DDα-classifier mostly performs not much worse than an SVM whose parameters have been optimally chosen. However, if the SVM is employed with some non-optimized parameters, its AMR can be considerably larger than that of the DDα-classifier.

More investigations are needed on the consistency of the DDα-classifier, its behavior on skewed or fat-tailed data, the (possibly adaptive) choice of outsider treatments, and the use of alternative notions of data depth. These are intended for future research.

Acknowledgements

Thanks are due to Rainer Dyckerhoff for his constructive remarks on the paper, as well as to the other participants of the Witten Workshop on "Robust methods for dependent data" for discussions. The helpful suggestions of two referees are gratefully acknowledged.

References

  • [1] A. Asuncion and D. Newman. UCI machine learning repository. URL http://archive.ics.uci.edu/ml/ (2007).
  • [2] I. Cascos. Data depth: Multivariate statistics and geometry. New Perspectives in Stochastic Geometry, (W. Kendall and I. Molchanov, eds.) Oxford University Press, Oxford (2009).
  • [3] A. Christmann and P.J. Rousseeuw. Measuring overlap in binary regression. Computational Statistics and Data Analysis, 37, 65-75 (2001).
  • [4] A. Christmann, P. Fischer and T. Joachims. Comparison between various regression depth methods and the support vector machine to approximate the minimum number of misclassifications. Computational Statistics, 17, 273-287 (2002).
  • [5] J.A. Cuesta-Albertos and A. Nieto-Reyes. The random Tukey depth. Computational Statistics and Data Analysis, 52, 4979-4988 (2008).
  • [6] S. Dutta and A.K. Ghosh. On classification based on L_p depth with an adaptive choice of p. Preprint (2011).
  • [7] S. Dutta and A.K. Ghosh. On robust classification using projection depth. Annals of the Institute of Statistical Mathematics, 64, 657–676 (2012).
  • [8] R. Dyckerhoff. Data depths satisfying the projection property. AStA - Advances in Statistical Analysis, 88, 163-190 (2004).
  • [9] R. Dyckerhoff, G. Koshevoy and K. Mosler. Zonoid data depth: Theory and computation. In A. Prat, ed., COMPSTAT 1996. Proceedings in Computational Statistics, 235-240, Heidelberg. Physica-Verlag. (1996).
  • [10] A.K. Ghosh and P. Chaudhuri. On data depth and distribution free discriminant analysis using separating surfaces. Bernoulli, 11, 1-27 (2005).
  • [11] A.K. Ghosh and P. Chaudhuri. On maximum depth and related classifiers. Scandinavian Journal of Statistics, 32, 327-350 (2005).
  • [12] T. Hastie, R. Tibshirani and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Verlag. New York (2009).
  • [13] M. Hubert and K. Van Driessen. Fast and robust discriminant analysis. Computational Statistics and Data Analysis, 45, 301-320 (2004).
  • [14] R. Jornsten. Clustering and classification based on the L1 data depth. Journal of Multivariate Analysis 90, 67-89 (2004).
  • [15] G. Koshevoy and K. Mosler. Zonoid trimming for multivariate distributions. Annals of Statistics 25, 1998-2017 (1997).
  • [16] T. Lange, P. Mozharovskyi and G. Barath. Two approaches for solving tasks of pattern recognition and reconstruction of functional dependencies. XIV International Conference on Applied Stochastic Models and Data Analysis, Rome (2011).
  • [17] J. Li, J.A. Cuesta-Albertos and R.Y. Liu. DD-classifier: Nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association 107, 737-753 (2012).
  • [18] R.Y. Liu. On a notion of data depth based on random simplices. Annals of Statistics, 18, 405-414 (1990).
  • [19] R.Y. Liu, J. Parelius and K. Singh. Multivariate analysis by data depth: Descriptive statistics, graphics and inference. Annals of Statistics 27, 783-858 (1999).
  • [20] P. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2, 49-55 (1936).
  • [21] K. Mosler. Multivariate Dispersion, Central Regions and Depth: The Lift Zonoid Approach. Springer Verlag. New York (2002).
  • [22] K. Mosler and R. Hoberg. Data analysis and classification with the zonoid depth. Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications, (R. Liu, R. Serfling and D. Souvaine, eds.), 49-59 (2006).
  • [23] P.J. Rousseeuw and M. Hubert. Regression depth. Journal of the American Statistical Association 94, 388-433 (1999).
  • [24] R. Serfling. Depth functions in nonparametric multivariate inference. Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications, (R. Liu, R. Serfling and D. Souvaine, eds.), 1-16 (2006).
  • [25] J.W. Tukey. Mathematics and the picturing of data. Proceedings of the International Congress of Mathematicians, Vancouver, 523-531 (1974).
  • [26] V.N. Vapnik. Statistical learning theory. Wiley. New York (1998).
  • [27] V.I. Vasil’ev. The reduction principle in pattern recognition learning (PRL) problem. Pattern Recognition and Image Analysis 1, 1 (1991).
  • [28] V.I. Vasil’ev. The reduction principle in problems of revealing regularities I. Cybernetics and Systems Analysis 39, 686-694 (2003).
  • [29] V.I. Vasil’ev and T. Lange. The duality principle in learning for pattern recognition (in Russian). Kibernetika i Vytschislit’elnaya Technika 121, 7-16 (1998).
  • [30] Y.J. Zuo and R. Serfling. General notions of statistical depth function. Annals of Statistics 28, 461-482 (2000).