# A nonlinear aggregation type classifier

We introduce a nonlinear aggregation type classifier for functional data defined on a separable and complete metric space. The new rule is built up from a collection of M arbitrary training classifiers. If the classifiers are consistent, then so is the aggregation rule. Moreover, asymptotically the aggregation rule behaves as well as the best of the M classifiers. The results of a small simulation are reported both for high dimensional and functional data, and a real data example is analyzed.


## 1 Introduction

Supervised classification remains an active topic for high dimensional and functional data, owing to the importance of its applications and its intrinsic difficulty in a general setup. In this context, there is a vast literature on classification methods, including linear classification, $k$-nearest neighbors and kernel rules, classification based on partial least squares, reproducing kernels, and depth measures. Complete surveys of the literature are the works by Baíllo et al. [1], Cuevas [13] and Delaigle and Hall [16]. The book Contributions in infinite-dimensional statistics and related topics [7] also contains several recent advances in supervised and unsupervised classification; see, for instance, Chapters 2, 5, 22 or 48, or directly, Chapter 1 of this issue (Bongiorno et al. [6]). Very recently, there has been great interest in developing aggregation methods. In particular, there is a long list of linear aggregation methods, such as boosting (Breiman [8], Breiman [9]) and random forests (Breiman [10], Biau et al. [3], Biau [5]), among others. All these methods exhibit an important improvement when combining a subset of classifiers to produce a new one. Most contributions to the aggregation literature have been proposed for nonparametric regression, a problem closely related to classification, since a classification rule can be obtained by plugging the estimate of the regression function into the Bayes rule (see, for instance, Yang [19] and Bunea et al. [11]). Model selection (selecting the optimal single model from a list of models), convex aggregation (searching for the optimal convex combination of a given set of estimators), and linear aggregation (selecting the optimal linear combination of estimators) are important contributions among a long list.

In the finite dimensional setup, Mojirsheibani [17] and [18] introduced a combined classifier and showed strong consistency under somewhat hard to verify assumptions involving the Vapnik-Chervonenkis dimension of the random partitions of the set of classifiers, which are not valid in the functional setup. Very recently, Biau et al. [4] introduced a new nonlinear aggregation strategy for the regression problem called COBRA, extending the ideas in Mojirsheibani [17] to the more general setup of nonparametric regression in $\mathbb{R}^d$. In the same direction, but for the classification problem in the infinite dimensional setup, we extend the ideas in Mojirsheibani [17] to construct a classification rule which combines, in a nonlinear way, several classifiers to construct an optimal one. We point out that our rule allows combining methods of very different nature, taking advantage of the abilities of each expert and adapting the method to different classes of datasets. Even though our classifier allows aggregating experts of the same nature, the possibility of combining classifiers of different character improves on existing rules such as the bagged nearest neighbors classifier (see, for instance, Hall and Samworth [15]). As in Biau et al. [4], we also introduce a more flexible form of the rule which discards a small percentage of those preliminary experts that behave differently from the rest. Under very mild assumptions, we prove consistency, obtain rates of convergence and show some optimality properties of the aggregated rule. To build this classifier, we use the inverse function (see also Fraiman et al. [14]) of each preliminary expert, which makes the proposal particularly well suited to high dimensional data, avoiding the curse of dimensionality. It also performs well in functional data settings.

In Section 2 we introduce the new classifier in the general context of a separable and complete metric space; it combines, in a nonlinear way, the decisions of $M$ experts (classifiers). A more flexible rule is also considered. In Section 3 we state our two main results regarding consistency, rates of convergence and asymptotic optimality of the classifier. Asymptotically, the new rule performs as well as the best of the $M$ classifiers used to build it up. Section 4 shows, through some simulations, the performance of the new classifier on high dimensional and functional data for moderate sample sizes. A real data example is considered in Section 5. All proofs are given in the Appendix.

## 2 The setup

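Throughout the manuscript, $\mathcal{F}$ denotes a separable and complete metric space and $(X, Y)$ a random pair taking values in $\mathcal{F} \times \{0,1\}$, together with the probability measure of $X$. The elements of the training sample $D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ are iid random elements with the same distribution as the pair $(X,Y)$. The regression function is $E(Y \mid X = x) = P(Y = 1 \mid X = x)$, the Bayes rule is denoted by $g^*$, and the optimal Bayes risk by $L^* = P(g^*(X) \neq Y)$.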

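In order to define our classifier, we split the sample $D_n$ into two subsamples: $D_k = \{(X_1,Y_1),\dots,(X_k,Y_k)\}$ and the remaining $l = n-k$ pairs $(X_{k+1},Y_{k+1}),\dots,(X_n,Y_n)$. With $D_k$ we build $M$ classifiers $g_{mk}: \mathcal{F} \to \{0,1\}$, $m = 1,\dots,M$, which we place in the vector $\mathbf{g}_k(x) = (g_{1k}(x),\dots,g_{Mk}(x))$ and, following some ideas in [17], with the second subsample we construct our aggregate classifier as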

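$$g_T(x) = I_{\{T_n(\mathbf{g}_k(x)) > 1/2\}}, \tag{1}$$

where

$$T_n(\mathbf{g}_k(x)) = \sum_{j=k+1}^{n} W_{n,j}(x)\, Y_j, \qquad x \in \mathcal{F}, \tag{2}$$

with weights given by

$$W_{n,j}(x) = \frac{I_{\{\mathbf{g}_k(x) = \mathbf{g}_k(X_j)\}}}{\sum_{i=k+1}^{n} I_{\{\mathbf{g}_k(x) = \mathbf{g}_k(X_i)\}}}. \tag{3}$$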

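Here, $0/0$ is assumed to be $0$. As in [4], a more flexible version of the classifier, called $g_T(x,\alpha)$, can be defined by replacing the weights in (3) with

$$W_{n,j}(x) = \frac{I_{\left\{\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(x) = g_{mk}(X_j)\}} \ge 1-\alpha\right\}}}{\sum_{i=k+1}^{n} I_{\left\{\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(x) = g_{mk}(X_i)\}} \ge 1-\alpha\right\}}}. \tag{4}$$

More precisely, the more flexible version of the classifier (1) is given by

$$g_T(x,\alpha) = I_{\{T_n(\mathbf{g}_k(x),\alpha) > 1/2\}}, \tag{5}$$

where $T_n(\mathbf{g}_k(x),\alpha)$ is defined as in (2) but with the weights given by (4). Observe that if we choose $\alpha = 0$ in (4) and (5) we obtain the weights given in (3) and the classifier (1), respectively.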
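The rule defined by (2), (4) and (5) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `aggregate_predict` and its argument names are hypothetical, the expert predictions are assumed to be precomputed, and a small numerical tolerance is added (our choice) so that the comparison with $1-\alpha$ is not broken by floating point rounding.

```python
import numpy as np

def aggregate_predict(x_votes, sample_votes, sample_labels, alpha=0.0):
    """Sketch of the aggregated rule g_T(x, alpha) from (2), (4) and (5).

    x_votes       : shape (M,), expert predictions g_mk(x) in {0, 1}
    sample_votes  : shape (l, M), expert predictions g_mk(X_j) on the second subsample
    sample_labels : shape (l,), labels Y_j in {0, 1} of the second subsample
    alpha         : tolerated fraction of disagreeing experts (alpha = 0 gives rule (1))
    """
    x_votes = np.asarray(x_votes)
    sample_votes = np.asarray(sample_votes)
    sample_labels = np.asarray(sample_labels)
    # Fraction of experts that give X_j the same label they give x.
    agreement = (sample_votes == x_votes).mean(axis=1)
    # X_j gets a nonzero weight when at least a fraction 1 - alpha of experts agree.
    # The 1e-12 slack is an implementation assumption to absorb rounding error.
    voters = agreement >= 1.0 - alpha - 1e-12
    if voters.sum() == 0:
        return 0  # convention 0/0 = 0, hence T_n = 0 <= 1/2
    # With equal weights over the voters, T_n is the mean label among them.
    t_n = sample_labels[voters].mean()
    return int(t_n > 0.5)
```

With $\alpha = 0$ only the observations on which all $M$ experts reproduce the votes given to $x$ take part in the decision; increasing $\alpha$ enlarges the set of voters.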

###### Remark 1.
• The type of nonlinear aggregation used to define our classifiers turns out to be quite natural. Indeed, we give a weight different from zero to those $X_j$ which are classified in the same group as $x$ by the whole set of classifiers (or by at least a fraction $1-\alpha$ of them).

• Since we are using the inverse functions of the classifiers $g_{mk}$, observations which are far from $x$ but for which the condition mentioned in the first item is fulfilled are involved in the definition of the classification rule. This may be very important in the case of high dimensional data to avoid the curse of dimensionality. This is illustrated in Figure 1, where we show two samples of points: one uniformly distributed in a square (filled black points) and another uniformly distributed in a ring (empty black points). We also show two points to classify, the empty red and the filled magenta triangles, together with their corresponding voters, empty green squares and filled blue squares, respectively. As we can see, observations that are far from the triangles are also involved in the classification.

## 3 Asymptotic results

In this section we show two asymptotic results for the nonlinear aggregation classifier. The first one shows that the classifier is consistent if at least $R$ of the $M$ experts are consistent. Moreover, rates of convergence for $g_T(x)$ (and $g_T(x,\alpha)$) are obtained assuming we know the rates of convergence of the consistent experts. The second result shows that $g_T$ behaves asymptotically as the best of the $M$ classifiers used to build it up. Both results are proved under mild conditions. Throughout this section we will use the notation $P_{D_k}(\cdot) = P(\cdot \mid D_k)$.

###### Theorem 1.

Assume that, for every , the classifier converges in probability to as , with and . Let us assume that and , then

• Let as , for and . If , then, for large enough,

$$P_{D_k}(g_T(X,\alpha) \neq Y) - L^* = O\big(\max\{\exp(-Cl),\, \beta_{Rk}\}\big), \tag{6}$$

for some constant $C > 0$.

###### Remark 2.
• The assumption

$$1)\ P(Y=1 \mid g^*(X)=1) > 1/2, \qquad 2)\ P(Y=0 \mid g^*(X)=0) > 1/2, \tag{7}$$

is really mild. It just requires that if the Bayes rule takes the value 1 (or 0), the probability that $Y=1$ is greater than the probability that $Y=0$ (respectively, the probability that $Y=0$ is greater than the probability that $Y=1$). Moreover, since the Bayes risk satisfies $L^* < 1/2$, one of the conditions in (7) is always fulfilled.

• It is well known that in the finite dimensional case, if the regression function satisfies a Lipschitz condition and $X$ has bounded support, the accuracy of classical classification rules is $O(k^{-2/(d+2)})$. Therefore the right hand side of (6) is

$$O\big(\max\{\exp(-Cl),\, k^{-2/(d+2)}\}\big),$$

and the optimal rate for is attained for .

• The choice of the tuning parameters is an important issue. From a practical point of view, we suggest performing a cross validation procedure to select their values. See Section 5 for an implementation in a real data example.

In order to state the optimality result we introduce some additional notation. Let and let us call . Calling the -th entry of the vector , we define the following subsets

and $A_\nu = A^0_\nu \cup A^1_\nu$.

For each , we consider the assumption:

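$$H(D_k) := P_{D_k}\big((X,Y) \in A^1_\nu\big) - P_{D_k}\big((X,Y) \in A^0_\nu\big) \neq 0 \quad \text{a.s.} \tag{H}$$

###### Theorem 2.
• For each $m \in \{1,\dots,M\}$,

$$P_{D_k}(g_T(X) \neq Y) - P_{D_k}(g_{mk}(X) \neq Y) \le O_k(l^{-1/2}),$$

which implies that

$$\lim_{l \to \infty} P_{D_k}(g_T(X) \neq Y) \le \min_{1 \le m \le M} P_{D_k}(g_{mk}(X) \neq Y).$$

• Under assumption (H) we obtain a better approximation rate,

$$P_{D_k}(g_T(X) \neq Y) - P_{D_k}(g_{mk}(X) \neq Y) \le O_k(\exp(-K_1 l)).$$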
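The passage from the rate bound to the limit statement in the first item of Theorem 2 can be spelled out as follows. This is a sketch under an assumption about the notation: we read $O_k(l^{-1/2})$ as a term bounded by $c_k\, l^{-1/2}$, where the constant $c_k$ (a symbol introduced here for illustration) may depend on $D_k$ but not on $l$, and is taken as the largest such constant over the $M$ experts. Since the experts are built from $D_k$ only, their errors do not change with $l$.

```latex
\begin{aligned}
P_{D_k}(g_T(X)\neq Y) &\le P_{D_k}(g_{mk}(X)\neq Y) + c_k\, l^{-1/2}
  \quad \text{for each } m = 1,\dots,M,\\
\Longrightarrow\quad P_{D_k}(g_T(X)\neq Y) &\le \min_{1\le m\le M} P_{D_k}(g_{mk}(X)\neq Y) + c_k\, l^{-1/2},\\
\Longrightarrow\quad \limsup_{l\to\infty} P_{D_k}(g_T(X)\neq Y) &\le \min_{1\le m\le M} P_{D_k}(g_{mk}(X)\neq Y).
\end{aligned}
```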

## 4 A small simulation study

In this section we present the performance of the aggregated classifier in two different scenarios. The first one corresponds to high dimensional data while, in the second one, we consider two simulated models for functional data analyzed in Delaigle and Hall [16].

### High dimensional setting

In this setting we show the performance of our method by analyzing data generated in in the following way: we generate iid uniform random variables in , say . For each , if , we generate a random variable with uniform distribution in and set . If , we generate a random variable with uniform distribution in where is the translation along the direction for and set . Then we split the sample into two subsamples: with the first pairs , we build the training sample, and with the remaining we build the testing sample. We consider two cases: the homogeneous case, where we aggregate classifiers of the same nature, and the heterogeneous case, where we aggregate experts of different nature.

• Homogeneous case: $k$-nearest neighbor classifiers with the number of neighbors taken as follows:

1. we fix consecutive odd numbers;

2. we choose at random different odd integers between and .

In Table 1, we report the mean and standard deviation (in brackets) of the misclassification error rate for case 1, when compared with the nearest neighbor rules built with a sample size taking nearest neighbors (these classifiers are denoted by for ). In Table 2 we report the median and MAD (in brackets) of the misclassification error rate for this case.

In Table 3 we report the mean of the misclassification error rate and standard deviation for case 2, with the original aggregated classifier and the two more flexible versions: and . In this table we compare the performance of our rules with the (optimal) cross validated nearest neighbor classifier computed with and also with . In Table 4 we report the median and MAD of the misclassification error rate for this case.

• Heterogeneous case: classifiers: 3 $k$-nearest neighbor rules with fixed values of $k$, the Fisher classifier and the random forest classifier.

Here we take nearest neighbors (denoted by for ), the Fisher classifier (denoted by ) and the random forest classifier (denoted by ). In Table 5 we report the averaged misclassification error rates and standard deviation and in Table 6 we report the median and MAD for this case.

### Functional data setting

In this setting we show the performance of our method by analyzing the following two models considered in Delaigle and Hall [16]:

• Model I: We generate two samples of size $n/2$ from different populations following the model

$$X_{pi}(t) = \sum_{j=1}^{6} \mu_{p,j}\, \phi_j(t) + e_{pi}(t), \qquad p = 1,2, \quad i = 1,\dots,n/2,$$

where , and are, respectively, the j-th coordinate of the mean vectors , and while the errors are given by

with and .

• Model II: We generate two samples of size $n/2$ from different populations following the model

$$X_{pi}(t) = \sum_{j=1}^{3} \mu_{p,j}\, \phi_j(t) + e_{pi}(t), \qquad p = 1,2, \quad i = 1,\dots,n/2,$$

where and the j-th coordinate of , and the errors are given by

with and .

This second model looks more challenging since, although the means of the two populations are quite different, the error process is very wiggly and concentrated in high frequencies (as shown in Figure 2, left and right panel, respectively). So in this case, in order to apply our classification method, we first applied the Nadaraya-Watson kernel smoother (with a normal kernel) to the training sample, with different values of the bandwidths for each of the two populations. The values for the bandwidths were chosen via cross-validation with our classifier, varying the bandwidths between and (in intervals of length ). The optimal values, over 200 replicates, were for the first population (with mean ) and for the second one. Finally, we apply the classification method to the raw (non-smoothed) curves of the testing sample.
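The smoothing step described above can be illustrated with a minimal Nadaraya-Watson smoother using a normal kernel. This is a sketch only: the function name `nadaraya_watson` and its arguments are hypothetical, the curves are assumed to be observed on a known grid, and the bandwidth is left as a free parameter since the values used in the study are not reproduced here.

```python
import numpy as np

def nadaraya_watson(t_obs, y_obs, t_eval, bandwidth):
    """Nadaraya-Watson estimate of a curve with a Gaussian (normal) kernel.

    t_obs, y_obs : observation points and observed values of one curve
    t_eval       : points at which the smoothed curve is evaluated
    bandwidth    : kernel bandwidth (to be chosen, e.g., by cross-validation)
    """
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    t_eval = np.asarray(t_eval, dtype=float)
    # Scaled distances between evaluation points (rows) and observation points (columns).
    u = (t_eval[:, None] - t_obs[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)
    # Kernel-weighted average of the observed values at each evaluation point.
    return (weights @ y_obs) / weights.sum(axis=1)
```

A larger bandwidth averages over more neighboring grid points and therefore removes more of the high-frequency error, at the cost of flattening the mean curve.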

In Table 7 we report the averaged misclassification error rate and the standard deviation over replications for models I and II, taking , , , and . In the whole training sample (of functions) the labels for every population were chosen at random. The test sample consists of data, taking of every population. Here, -nearest neighbor rule for . In Table 8 we report the median of the misclassification error rate and the MAD. For Model I we get a better performance than the PLS-Centroid Classifier proposed by Delaigle and Hall [16]. For Model II, the PLS-Centroid Classifier clearly outperforms our classifier, although we still obtain a quite small misclassification error using only a combination of five nearest neighbor estimates.

## 5 A real data example: Analysis of spectrograms

The data analyzed in this section consist of the mass spectra from blood samples of 216 women, of whom 121 suffer from an ovarian cancer condition and the remaining 95 are healthy women taken as a control group. We refer to [2] for a previous analysis of these data with a detailed discussion of their medical aspects; see also [12] for further statistical analysis of these data.

A spectrogram is a curve showing the number of molecules (or fragments) found for every mass/charge ratio. The idea behind spectrograms is to monitor the amount of proteins produced in cells since, when cancer starts to grow, its cells produce different kinds of proteins than those produced by healthy cells. Moreover, the amount of commonly produced proteins may be different. Proteomics, broadly speaking, consists of a family of procedures allowing researchers to analyze proteins. In particular, here we are interested in techniques which separate mixtures of complex molecules according to the mass/charge ratio (observe that molecules with the same mass/charge ratio are indistinguishable in a spectrogram).

We have processed the data as follows: we have restricted ourselves to the mass/charge interval (horizontal axis) . Then, in order to have all the spectra defined on a common equispaced grid, we have smoothed them via a Nadaraya-Watson smoother. Finally, every function has been divided by its maximum, in order to have all the values scaled to the common interval $[0,1]$. Observe that our interest is in the locations of the maxima of the amount of molecules rather than in the corresponding heights.
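The final scaling step of this preprocessing can be sketched as follows. This is only an illustration: the function name `scale_by_maximum` is hypothetical, the spectra are assumed to be already evaluated on a common grid and stored one per row, and each curve is assumed to have a strictly positive maximum.

```python
import numpy as np

def scale_by_maximum(curves):
    """Divide each curve (one per row) by its own maximum value.

    Assumes every row has a strictly positive maximum, so that nonnegative
    curves end up with values in [0, 1] and peak height exactly 1.
    """
    curves = np.asarray(curves, dtype=float)
    return curves / curves.max(axis=1, keepdims=True)
```

After this scaling, two spectra with peaks at the same mass/charge locations but different overall intensities become comparable, which matches the stated interest in peak locations rather than heights.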

To build the classifier introduced in (5) we have taken nearest neighbor classifiers, with neighbors. We have implemented the cross validation method in a grid for , with taking the values and taking values . The minimum of the misclassification error was attained for and , in which case the accuracy obtained was 95%.

## 6 Concluding remarks

• We introduce a new nonlinear aggregation method for supervised classification in a general setup, built up from a family of $M$ classifiers. It combines the decisions of the $M$ experts according to a “coincidence opinion” with respect to the new data point we want to classify.

• The new method, besides being easy to implement, is particularly well suited to high dimensional and functional data. The method is not local, and the use of the inverse functions protects it from the curse of dimensionality that all local methods suffer from.

• We obtain consistency and rates of convergence under very mild conditions on a general metric space setup.

• An optimality result is obtained in the sense that the nonlinear aggregation rule behaves asymptotically as well as the best one among the classifiers (experts) .

• A small simulation study confirms the asymptotic results for moderate sample sizes. In particular it is very well behaved for high–dimensional and functional data.

• On a well known dataset of spectrogram curves, we obtain a very good performance, with a classification accuracy of 95%, very close to the best known results for these data.

• Although we have implemented cross validation to choose the parameters in Section 5, conditions for the validity of this procedure remain an open problem.

## 7 Appendix: Proof of results

To prove Theorem 1 we will need the following Lemma.

###### Lemma 3.

Let $f$ be a classifier built up from the training sample $D_k$ such that $P_{D_k}(f(X) \neq g^*(X)) \to 0$ when $k \to \infty$. Then, $P_{D_k}(f(X) \neq Y) - L^* \to 0$.

###### Proof of Lemma 3.

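First we write,

$$
\begin{aligned}
P_{D_k}(f(X) \neq Y) - L^* &= P_{D_k}(f(X) \neq Y) - P(g^*(X) \neq Y) \\
&= P_{D_k}(f(X) \neq Y,\, Y = g^*(X)) + P_{D_k}(f(X) \neq Y,\, Y \neq g^*(X)) - P(g^*(X) \neq Y) \\
&= P_{D_k}(f(X) \neq g^*(X)) + P_{D_k}(f(X) \neq Y,\, Y \neq g^*(X)) - P(g^*(X) \neq Y) \\
&= P_{D_k}(f(X) \neq g^*(X)) - P_{D_k}(g^*(X) \neq Y,\, f(X) = Y),
\end{aligned} \tag{8}
$$

where in the last equality we have used that

$$P(g^*(X) \neq Y) = P_{D_k}(g^*(X) \neq Y,\, f(X) \neq Y) + P_{D_k}(g^*(X) \neq Y,\, f(X) = Y)$$

implies

$$P_{D_k}(g^*(X) \neq Y,\, f(X) = Y) = P(g^*(X) \neq Y) - P_{D_k}(g^*(X) \neq Y,\, f(X) \neq Y).$$

Therefore, replacing in (8) we get that

$$
\begin{aligned}
P_{D_k}(f(X) \neq Y) - L^* &= P_{D_k}(f(X) \neq g^*(X)) - P_{D_k}(g^*(X) \neq Y,\, f(X) = Y) \\
&\le P_{D_k}(f(X) \neq g^*(X)),
\end{aligned} \tag{9}
$$

which by hypothesis converges to zero as $k \to \infty$, and the Lemma is proved. ∎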
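The bound that closes the proof of Lemma 3, namely that the excess risk of a classifier is at most its probability of disagreeing with the Bayes rule, can be checked exactly on a small discrete example. The sketch below is an illustration with made-up numbers: the distribution `p_x` of $X$ on four points and the values `eta` of $P(Y=1 \mid X=x)$ are hypothetical, and the classifier is treated as fixed (that is, we work conditionally on the training sample).

```python
import itertools
import numpy as np

p_x = np.array([0.1, 0.2, 0.3, 0.4])   # P(X = x) on four points (hypothetical)
eta = np.array([0.9, 0.6, 0.3, 0.5])   # P(Y = 1 | X = x) (hypothetical)

def risk(rule):
    """Misclassification probability P(rule(X) != Y) for a fixed rule."""
    rule = np.asarray(rule)
    # Predicting 1 errs with probability 1 - eta, predicting 0 with probability eta.
    return float(np.sum(p_x * np.where(rule == 1, 1.0 - eta, eta)))

bayes_rule = (eta > 0.5).astype(int)
bayes_risk = risk(bayes_rule)

def excess_and_disagreement(rule):
    """Return (P(f != Y) - L*, P(f != g*)) for a fixed rule f."""
    rule = np.asarray(rule)
    excess = risk(rule) - bayes_risk
    disagreement = float(np.sum(p_x * (rule != bayes_rule)))
    return excess, disagreement

# Enumerate all 2^4 classifiers on the four points.
results = [excess_and_disagreement(rule)
           for rule in itertools.product([0, 1], repeat=4)]
```

For every one of the sixteen rules, the first entry of the pair is nonnegative and does not exceed the second, which is the inequality used in the Lemma.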

###### Proof of Theorem 1.

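We will prove part b) of the Theorem since part a) is a direct consequence of it. By (9), it suffices to prove that, for $k$ large enough:

$$P_{D_k}(g_T(X,\alpha) \neq g^*(X)) = O\big(\max\{\exp(-C(n-k)),\, \beta_{Rk}\}\big).$$

We first split $P_{D_k}(g_T(X,\alpha) \neq g^*(X))$ into two terms,

$$
\begin{aligned}
P_{D_k}(g_T(X,\alpha) \neq g^*(X)) &= P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 1) \\
&\quad + P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 0) \doteq I + II.
\end{aligned}
$$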

Then we will prove that, for large enough,

for some arbitrary constant . The proof that

for some arbitrary constant is completely analogous and we omit it. Finally, taking , the proof will be completed. In order to deal with term , let us define the vectors

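$$
\begin{aligned}
\mathbf{g}_{Rk}(X) &= (g_{1k}(X),\dots,g_{Rk}(X)) \in \{0,1\}^R, \\
\nu(X) &= (1,\dots,1,\, g_{(R+1)k}(X),\dots,g_{Mk}(X)) \in \{0,1\}^M.
\end{aligned}
$$

Then, writing $\mathbf{1}$ for the vector of ones,

$$
\begin{aligned}
I &= P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 1) \\
&\le P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}) + \sum_{m=1}^{R} P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 1,\, g_{mk}(X) = 0) \\
&\le P_{D_k}(g_T(X,\alpha) \neq g^*(X),\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}) + \sum_{m=1}^{R} P_{D_k}(g^*(X) \neq g_{mk}(X)) \\
&\le P_{D_k}\big(T_n(\mathbf{g}_k(X),\alpha) \le 1/2 \,\big|\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}\big) + \sum_{m=1}^{R} P_{D_k}(g^*(X) \neq g_{mk}(X)) \\
&\doteq I_A + I_B.
\end{aligned}
$$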

Observe that, conditioning to and defining

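$$Z_j \doteq I_{\left\{\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(X_j) = \nu(m)\}} \ge 1-\alpha\right\}},$$

we can rewrite $T_n(\mathbf{g}_k(X),\alpha)$ as

$$T_n(\mathbf{g}_k(X),\alpha) = \frac{\sum_{j=k+1}^{n} Z_j Y_j}{\sum_{i=k+1}^{n} Z_i}.$$

Therefore,

$$
\begin{aligned}
I_A &= P_{D_k}\left(\frac{\frac{1}{n-k}\sum_{j=k+1}^{n} Z_j Y_j}{\frac{1}{n-k}\sum_{i=k+1}^{n} Z_i} \le \frac{1}{2} \,\Bigg|\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}\right) \\
&= P_{D_k}\left(\frac{1}{n-k}\sum_{j=k+1}^{n} Z_j (Y_j - 1/2) \le 0 \,\Bigg|\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}\right).
\end{aligned} \tag{10}
$$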

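In order to use a concentration inequality to bound this probability, we need to compute the expectation of $Z_j (Y_j - 1/2)$. To do this, observe that

$$E(Z_j Y_j) = P_{D_k}\left(\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha,\, Y = 1\right),$$

and

$$E(Z_j) = P_{D_k}\left(\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha\right).$$

Since

$$\{\mathbf{g}_{Rk}(X) = \mathbf{1}\} \subset \left\{\frac{1}{M}\sum_{m=1}^{M} I_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha\right\} \doteq A_\alpha,$$

we have

$$E(Z_j Y_j) - E(Z_j)/2 = P_{D_k}(A_\alpha)\big[P_{D_k}(Y = 1 \mid A_\alpha) - 1/2\big]. \tag{11}$$

Now, since for $m = 1,\dots,R$, $g_{mk} \to g^*$ in probability as $k \to \infty$,

$$P_{D_k}(\mathbf{g}_{Rk}(X) = \mathbf{1}) \to P(g^*(X) = 1) \doteq p^* > 0. \tag{12}$$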

On the other hand, we have that, for large enough, . Indeed, for , let us consider the events which, by hypothesis, for large enough verify

$$P\big(\cap_{m=1}^{R} B_{mk}\big) > 1 - \varepsilon,$$

for all . In particular, we can take such that . This implies that

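$$
\begin{aligned}
P_{D_k}(Y = 1 \mid A_\alpha) &= \frac{P_{D_k}\big(Y = 1,\, A_\alpha,\, \cap_{m=1}^{R} B_{mk}\big)}{P_{D_k}(A_\alpha)} + \frac{P_{D_k}\big(Y = 1,\, A_\alpha,\, (\cap_{m=1}^{R} B_{mk})^c\big)}{P_{D_k}(A_\alpha)} \\
&\ge \frac{P_{D_k}\big(Y = 1,\, A_\alpha,\, \cap_{m=1}^{R} B_{mk}\big)}{P_{D_k}(A_\alpha)} \\
&> \frac{P_{D_k}\big(Y = 1,\, A_\alpha \mid \cap_{m=1}^{R} B_{mk}\big)}{P_{D_k}(A_\alpha)}\,(1 - \varepsilon).
\end{aligned} \tag{13}
$$

Conditioning on the event $\cap_{m=1}^{R} B_{mk}$, $A_\alpha$ equals $C_\alpha$ given by

$$\left\{R\, I_{\{g^*(X) = 1\}} + \sum_{m=R+1}^{M} I_{\{g_{mk}(X) = \nu(m)\}} \ge M(1-\alpha)\right\} \doteq C_\alpha. \tag{14}$$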

However, imply that . Indeed, from the inequality , it is clear that . On the other hand, and imply that , and so the sum in the second term of (14) is at most and consequently, . Then, combining this fact with (7) we have that, for large enough

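$$
\begin{aligned}
P_{D_k}(Y = 1 \mid A_\alpha) &\ge \frac{P_{D_k}\big(Y = 1,\, g^*(X) = 1 \mid \cap_{m=1}^{R} B_{mk}\big)}{P_{D_k}(g^*(X) = 1)}\,(1 - \varepsilon) \\
&= \frac{P_{D_k}(Y = 1,\, g^*(X) = 1)}{P_{D_k}(g^*(X) = 1)}\,(1 - \varepsilon) \\
&= P(Y = 1 \mid g^*(X) = 1)\,(1 - \varepsilon) > 1/2.
\end{aligned} \tag{15}
$$

Therefore, from (12) and (15) in (11) we get

$$E(Z_j Y_j) - E(Z_j)/2 > c > 0.$$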

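Going back to (10), conditioning to and using the Hoeffding inequality for $Z_j (Y_j - 1/2)$, for $k$ large enough we have

$$
\begin{aligned}
I_A &= P_{D_k}\left(\frac{1}{n-k}\sum_{j=k+1}^{n} -\Big(Z_j (Y_j - 1/2) - E\big(Z_j (Y_j - 1/2)\big)\Big) \ge c \,\Bigg|\, g^*(X) = 1,\, \mathbf{g}_{Rk}(X) = \mathbf{1}\right) \\
&\le \exp\{-C_1 (n-k)\},
\end{aligned}
$$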

with . On the other hand, by hypothesis we have

$$I_B = \sum_{m=1}^{R} P_{D_k}(g^*(X) \neq g_{mk}(X)) = O(\beta_{Rk}),$$

which concludes the proof. ∎
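The Hoeffding step used to bound $I_A$ in the proof above can be made explicit. The following is a sketch under stated assumptions: the shorthand $V_j$ is introduced here for illustration, and the variables $V_j$, $j = k+1,\dots,n$, are assumed to be independent given the conditioning used in the proof. Since each $V_j$ takes values in an interval of length one, the one-sided Hoeffding inequality gives an explicit admissible value for the constant, namely $C_1 = 2c^2$.

```latex
\begin{aligned}
V_j &\doteq Z_j\,(Y_j - 1/2) \in [-1/2,\, 1/2], \qquad j = k+1,\dots,n,\\[4pt]
P\Big(\frac{1}{n-k}\sum_{j=k+1}^{n}\big(V_j - E(V_j)\big) \le -c\Big)
 &\le \exp\Big\{-\frac{2\,(n-k)\,c^{2}}{\big(1/2-(-1/2)\big)^{2}}\Big\}
 = \exp\{-2c^{2}\,(n-k)\}.
\end{aligned}
```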

###### Proof of Theorem 2.

First we write,

 PDk(gT(X)≠Y) =PDk(Tn(gk(X))>1/2,Y=0) +PDk(Tn(gk(X))≤1/2,Y=1) =∑ν∈CPDk(Tn(gk(X))>1/2