 # Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models

We consider classifiers for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We first show that high-dimensional data often have the SSE model. We consider a distance-based classifier using eigenstructures for the SSE model. We apply the noise reduction methodology to estimation of the eigenvalues and eigenvectors in the SSE model. We create a new distance-based classifier by transforming data from the SSE model to the non-SSE model. We give simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier by using microarray data sets.

## Authors


## 1 Introduction

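A common feature of high-dimensional data is that the data dimension is high while the sample size is relatively low. This is the so-called “HDLSS” or “large $p$, small $n$” data situation where $p/n\to\infty$; here $p$ is the data dimension and $n$ is the sample size. Suppose we have two independent $p$-variate populations, $\pi_i$, $i=1,2$, having an unknown mean vector $\boldsymbol{\mu}_i$ and unknown covariance matrix $\boldsymbol{\Sigma}_i$ for each $i$. We do not assume $\boldsymbol{\Sigma}_1=\boldsymbol{\Sigma}_2$. The eigen-decomposition of $\boldsymbol{\Sigma}_i$ is given by $\boldsymbol{\Sigma}_i=\boldsymbol{H}_i\boldsymbol{\Lambda}_i\boldsymbol{H}_i^T$, where $\boldsymbol{\Lambda}_i=\mathrm{diag}(\lambda_{i(1)},\dots,\lambda_{i(p)})$ is a diagonal matrix of eigenvalues, $\lambda_{i(1)}\ge\cdots\ge\lambda_{i(p)}\ge 0$, and $\boldsymbol{H}_i=[\boldsymbol{h}_{i(1)},\dots,\boldsymbol{h}_{i(p)}]$ is an orthogonal matrix of the corresponding eigenvectors. We have independent and identically distributed (i.i.d.) observations, $\boldsymbol{x}_{i1},\dots,\boldsymbol{x}_{in_i}$, from each $\pi_i$. We assume $n_i\ge 2$, $i=1,2$. We estimate $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ by $\bar{\boldsymbol{x}}_i=\sum_{j=1}^{n_i}\boldsymbol{x}_{ij}/n_i$ and $\boldsymbol{S}_i=\sum_{j=1}^{n_i}(\boldsymbol{x}_{ij}-\bar{\boldsymbol{x}}_i)(\boldsymbol{x}_{ij}-\bar{\boldsymbol{x}}_i)^T/(n_i-1)$. Let $\boldsymbol{x}_0$ be an observation vector of an individual belonging to one of the two populations. We assume $\boldsymbol{x}_0$ and $\boldsymbol{x}_{ij}$s are independent. When the $\pi_i$s are Gaussian, a typical classification rule is that one classifies an individual into $\pi_1$ if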

$$(\boldsymbol{x}_0-\bar{\boldsymbol{x}}_1)^T\boldsymbol{S}_1^{-1}(\boldsymbol{x}_0-\bar{\boldsymbol{x}}_1)-\log\{\det(\boldsymbol{S}_2\boldsymbol{S}_1^{-1})\}<(\boldsymbol{x}_0-\bar{\boldsymbol{x}}_2)^T\boldsymbol{S}_2^{-1}(\boldsymbol{x}_0-\bar{\boldsymbol{x}}_2),$$

and into $\pi_2$ otherwise. However, the inverse matrix of $\boldsymbol{S}_i$ does not exist in the HDLSS context ($p>n_i$). Also, we emphasize that the Gaussian assumption is strict in real high-dimensional data analyses. Bickel and Levina (2004) considered a naive Bayes classifier for high-dimensional data. Fan and Fan (2008) considered classification after feature selection.

Cai and Liu (2011), Shao et al. (2011) and Li and Shao (2015) gave sparse linear or quadratic classification rules for high-dimensional data. The above references all assumed the following eigenvalue condition: There is a constant $c_0>0$ (not depending on $p$) such that

$$c_0^{-1}<\lambda_{i(p)}\quad\text{and}\quad\lambda_{i(1)}<c_0\quad\text{for } i=1,2. \qquad (1)$$

Dudoit et al. (2002) considered using the inverse matrix defined by only the diagonal elements of $\boldsymbol{S}_i$. Aoshima and Yata (2011, 2015a) considered substituting $\{\mathrm{tr}(\boldsymbol{S}_i)/p\}\boldsymbol{I}_p$ for $\boldsymbol{S}_i$ by using the difference of a geometric representation of HDLSS data from each $\pi_i$. Here, $\boldsymbol{I}_p$ denotes the identity matrix of dimension $p$. On the other hand, Hall et al. (2005, 2008) and Marron et al. (2007) considered distance weighted classifiers. Ahn and Marron (2010) considered an HDLSS classifier based on the maximal data piling. Hall et al. (2005), Chan and Hall (2009), Aoshima and Yata (2014) and Watanabe et al. (2015) considered distance-based classifiers. Aoshima and Yata (2014) gave the misclassification rate adjusted classifier for multiclass, high-dimensional data whose misclassification rates are no more than specified thresholds under the following eigenvalue condition:

$$\frac{\lambda_{i(1)}^2}{\mathrm{tr}(\boldsymbol{\Sigma}_i^2)}\to 0\quad\text{as } p\to\infty \text{ for } i=1,2. \qquad (2)$$

We emphasize that (2) is much milder than (1) because (2) includes the case that $\lambda_{i(1)}\to\infty$ as $p\to\infty$. See Remark 1 for the details. Aoshima and Yata (2014) considered the distance-based classifier as follows: Let

$$W(\boldsymbol{x}_0)=\Big(\boldsymbol{x}_0-\frac{\bar{\boldsymbol{x}}_1+\bar{\boldsymbol{x}}_2}{2}\Big)^T(\bar{\boldsymbol{x}}_2-\bar{\boldsymbol{x}}_1)-\frac{\mathrm{tr}(\boldsymbol{S}_1)}{2n_1}+\frac{\mathrm{tr}(\boldsymbol{S}_2)}{2n_2}. \qquad (3)$$

Then, one classifies $\boldsymbol{x}_0$ into $\pi_1$ if $W(\boldsymbol{x}_0)<0$ and into $\pi_2$ otherwise. Here, $-\mathrm{tr}(\boldsymbol{S}_1)/(2n_1)+\mathrm{tr}(\boldsymbol{S}_2)/(2n_2)$ is a bias-correction term. Note that the classifier (3) is equivalent to the scale adjusted distance-based classifier given by Chan and Hall (2009). Aoshima and Yata (2015b) called the classification rule (3) the “distance-based discriminant analysis (DBDA)”.
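The rule (3) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names `dbda_statistic` and `dbda_classify`, the Gaussian toy data and the mean shift are all assumptions made for the example. The only fact used beyond (3) is that the trace of the unbiased sample covariance equals the sum of the unbiased per-coordinate sample variances.

```python
import numpy as np

def dbda_statistic(x0, X1, X2):
    """W(x0) as in (3); X1 and X2 are (n_i, p) arrays of training samples."""
    n1, n2 = X1.shape[0], X2.shape[0]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # tr(S_i) is the sum of the unbiased per-coordinate sample variances,
    # so the p x p matrix S_i never has to be formed.
    tr_s1 = X1.var(axis=0, ddof=1).sum()
    tr_s2 = X2.var(axis=0, ddof=1).sum()
    w = (x0 - (xbar1 + xbar2) / 2) @ (xbar2 - xbar1)
    return w - tr_s1 / (2 * n1) + tr_s2 / (2 * n2)

def dbda_classify(x0, X1, X2):
    """Return 1 if W(x0) < 0 and 2 otherwise."""
    return 1 if dbda_statistic(x0, X1, X2) < 0 else 2

# Hypothetical toy data: identity covariances and a mean shift in 400 coordinates.
rng = np.random.default_rng(0)
p, n1, n2 = 2000, 10, 15
mu2 = np.zeros(p)
mu2[:400] = 1.0
X1 = rng.standard_normal((n1, p))
X2 = rng.standard_normal((n2, p)) + mu2
label_a = dbda_classify(rng.standard_normal(p), X1, X2)        # drawn from population 1
label_b = dbda_classify(rng.standard_normal(p) + mu2, X1, X2)  # drawn from population 2
print(label_a, label_b)
```

Because only sample means and coordinate-wise variances are needed, the cost is linear in $p$, which is convenient in the HDLSS setting.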

Recently, Aoshima and Yata (2018) considered the “strongly spiked eigenvalue (SSE) model” as follows:

$$\liminf_{p\to\infty}\Big\{\frac{\lambda_{i(1)}^2}{\mathrm{tr}(\boldsymbol{\Sigma}_i^2)}\Big\}>0\quad\text{for } i=1 \text{ or } 2. \qquad (4)$$

On the other hand, Aoshima and Yata (2018) called (2) the “non-strongly spiked eigenvalue (NSSE) model”. Note that (4) holds under the condition:

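$$\liminf_{p\to\infty}\Big\{\frac{\lambda_{i(1)}}{\mathrm{tr}(\boldsymbol{\Sigma}_i)}\Big\}>0\quad\text{for } i=1 \text{ or } 2, \qquad (5)$$

from the fact that $\lambda_{i(1)}^2/\mathrm{tr}(\boldsymbol{\Sigma}_i^2)\ge\{\lambda_{i(1)}/\mathrm{tr}(\boldsymbol{\Sigma}_i)\}^2$. Here, $\lambda_{i(1)}/\mathrm{tr}(\boldsymbol{\Sigma}_i)$ is the first contribution ratio. We call (5) the “super strongly spiked eigenvalue (SSSE) model”.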

###### Remark 1.

Let us consider a spiked model such as

$$\lambda_{i(r)}=a_{i(r)}p^{\alpha_{i(r)}}\ (r=1,\dots,t_i)\quad\text{and}\quad\lambda_{i(r)}=c_{i(r)}\ (r=t_i+1,\dots,p) \qquad (6)$$

with positive and fixed constants, $a_{i(r)}$s, $c_{i(r)}$s and $\alpha_{i(r)}$s, and a positive and fixed integer $t_i$. Note that the NSSE condition (2) holds when $\alpha_{i(1)}<1/2$ for $i=1,2$. On the other hand, the SSE condition (4) holds when $\alpha_{i(1)}\ge 1/2$, and further the SSSE condition (5) holds when $\alpha_{i(1)}\ge 1$. See Yata and Aoshima (2012) for the details of the spiked model.
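The thresholds in Remark 1 can be checked numerically. The sketch below assumes the simplest instance of the spiked model (6): a single spike $a\,p^{\alpha}$ and $p-1$ remaining eigenvalues all equal to $c$; the function name `first_ratios` and the chosen values of $\alpha$ and $p$ are illustrative assumptions. It evaluates the first contribution ratio $\lambda_{(1)}/\mathrm{tr}(\boldsymbol{\Sigma})$ and the first quadratic contribution ratio $\lambda_{(1)}^2/\mathrm{tr}(\boldsymbol{\Sigma}^2)$ in closed form.

```python
def first_ratios(p, alpha, a=1.0, c=1.0):
    """Return (lambda_1 / tr(Sigma), lambda_1^2 / tr(Sigma^2)) for one spike
    a * p**alpha and p - 1 remaining eigenvalues equal to c."""
    spike = a * float(p) ** alpha
    eps = spike / (spike + c * (p - 1))
    eta = spike ** 2 / (spike ** 2 + c ** 2 * (p - 1))
    return eps, eta

# alpha < 1/2: NSSE; 1/2 <= alpha < 1: SSE but not SSSE; alpha >= 1: SSSE.
table = {
    alpha: [first_ratios(p, alpha) for p in (10**3, 10**5, 10**7)]
    for alpha in (0.3, 0.5, 0.75, 1.0)
}
for alpha, rows in table.items():
    print(alpha, [(round(e, 4), round(h, 4)) for e, h in rows])
```

As $p$ grows, the quadratic ratio vanishes only for $\alpha<1/2$, while the plain contribution ratio stays away from zero only for $\alpha\ge 1$, which matches the three regimes in the remark.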

We observed

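$$\frac{\lambda_{i(r)}}{\mathrm{tr}(\boldsymbol{\Sigma}_i)}\ (=\varepsilon_{i(r)},\ \text{say})\quad\text{and}\quad\frac{\lambda_{i(r)}^2}{\mathrm{tr}(\boldsymbol{\Sigma}_i^2)}\ (=\eta_{i(r)},\ \text{say}),\quad i=1,2;\ r=1,2,\dots,$$

for six well-known microarray data sets by using the noise-reduction methodology and the cross-data-matrix methodology. For those methods, see Yata and Aoshima (2010, 2012). Note that $\varepsilon_{i(r)}$ is the contribution ratio and $\eta_{i(r)}$ is a quadratic contribution ratio of the $r$-th eigenvalue. We estimated $\varepsilon_{i(r)}$ by $\hat{\varepsilon}_{i(r)}$ and $\eta_{i(r)}$ by $\hat{\eta}_{i(r)}$; the quantities entering these estimators are defined in (15) and in Section 4.3. We note that $\hat{\varepsilon}_{i(r)}$ and $\hat{\eta}_{i(r)}$ are consistent estimators of $\varepsilon_{i(r)}$ and $\eta_{i(r)}$ under suitable conditions. See (17) and (22) for the details. The six microarray data sets are as follows: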

(D-i)

Non-pathologic tissues data with genes, consisting of : placenta or blood ( samples) and other solid tissue ( samples) given by Christensen et al. (2009);

(D-ii)

Colon cancer data with genes, consisting of : colon tumor ( samples) and normal colon ( samples) given by Alon et al. (1999);

(D-iii)

Breast cancer data with genes, consisting of good ( samples) and poor ( samples) given by Gravier et al. (2010);

(D-iv)

Lymphoma data with genes, consisting of DLBCL (58 samples) and follicular lymphoma (19 samples) given by Shipp et al. (2002);

(D-v)

Myeloma data with genes, consisting of patients without bone lesions (36 samples) and patients with bone lesions (137 samples) given by Tian et al. (2003);

(D-vi)

Breast cancer data with genes, consisting of luminal group (84 samples) and non-luminal group (44 samples) given by Naderi et al. (2007).

The data sets (D-ii), (D-iv) and (D-v) are given in Jeffery et al. (2006), (D-i) and (D-iii) are given in Ramey (2016), and (D-vi) is given in Glaab et al. (2012). We summarized the results for $\hat{\varepsilon}_{i(1)}$, $\hat{\eta}_{i(1)}$ and $\hat{k}_i$ in Table 1, where $\hat{k}_i$ is an estimate of $k_i$, given in Section 4.3. We will discuss $k_i$ and $\hat{k}_i$ in Sections 3 and 4.3. We also visualized the first ten contribution ratios given by $\hat{\varepsilon}_{i(r)}$ in Fig. 1 and the first ten quadratic contribution ratios given by $\hat{\eta}_{i(r)}$ in Fig. 2. See (17) and (22) for the details.

Figure 1: Estimates of the first ten contribution ratios by $\hat{\varepsilon}_{i(r)}$s for the six well-known microarray data sets.

Figure 2: Estimates of the first ten quadratic contribution ratios by $\hat{\eta}_{i(r)}$s for the six well-known microarray data sets.

We observed from Fig. 1 that the first several eigenvalues are much larger than the rest for the microarray data sets (except (D-v)). In particular, the first eigenvalues for (D-i) and (D-iv) are extremely large. These data appear to be consistent with the SSSE asymptotic domain given in (5). On the other hand, the first several eigenvalues for (D-v) are relatively small. However, from Table 1 and Fig. 2, the $\hat{\eta}_{i(1)}$s for (D-v) are not sufficiently small. Also, the $\hat{\eta}_{i(1)}$s for (D-ii), (D-iii) and (D-vi) are relatively large in Table 1 and Fig. 2. Hence, the six microarray data sets appear to be consistent with the SSE asymptotic domain given in (4). See Section 4.3. In this paper, we consider classifiers under the SSE model. We do not assume the normality of the population distributions. We propose an effective distance-based classifier for such high-dimensional data sets.

The organization of this paper is as follows. In Section 2, we introduce asymptotic properties of the distance-based classifier for high-dimensional data. We discuss the distance-based classifier in the SSE model. In Section 3, we consider a distance-based classifier using eigenstructures for the SSE model. In Section 4, we discuss estimation of the eigenvalues and eigenvectors for the SSE model. We create a new distance-based classifier by estimating the eigenstructures. In Section 5, we give simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier by using microarray data sets.

## 2 Distance-based classifier for high-dimensional data

In this section, we introduce asymptotic properties of the distance-based classifier for high-dimensional data. For any positive-semidefinite matrix $\boldsymbol{M}$, we write the square root of $\boldsymbol{M}$ as $\boldsymbol{M}^{1/2}$. Let

$$\boldsymbol{x}_{ij}=\boldsymbol{H}_i\boldsymbol{\Lambda}_i^{1/2}\boldsymbol{z}_{ij}+\boldsymbol{\mu}_i,$$

where $\boldsymbol{z}_{ij}$ is considered as a sphered data vector having the zero mean vector and identity covariance matrix. Similar to Bai and Saranadasa (1996) and Chen and Qin (2010), we assume the following assumption for $\pi_i$, $i=1,2$, as necessary:

(A-i)

for all ,  ,    and   for all .

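When the $\pi_i$s are Gaussian, (A-i) naturally holds. Let

$$\boldsymbol{\mu}=\boldsymbol{\mu}_1-\boldsymbol{\mu}_2,\quad\Delta=\|\boldsymbol{\mu}\|^2,\quad n_{\min}=\min\{n_1,n_2\}\quad\text{and}\quad m=\min\{p,n_{\min}\},$$

where $\|\cdot\|$ denotes the Euclidean norm. Note that $E\{W(\boldsymbol{x}_0)\}=(-1)^i\Delta/2$ when $\boldsymbol{x}_0\in\pi_i$ for $i=1,2$. Also, note that the divergence condition “$p\to\infty$, $n_1\to\infty$ and $n_2\to\infty$” is equivalent to “$m\to\infty$”. Let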

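$$\delta_{oi}=\Big\{\frac{\mathrm{tr}(\boldsymbol{\Sigma}_i^2)}{n_i}+\frac{\mathrm{tr}(\boldsymbol{\Sigma}_1\boldsymbol{\Sigma}_2)}{n_{i'}}+\sum_{l=1}^{2}\frac{\mathrm{tr}(\boldsymbol{\Sigma}_l^2)}{2n_l(n_l-1)}\Big\}^{1/2}$$

and $\delta_i=\{\delta_{oi}^2+\boldsymbol{\mu}^T(\boldsymbol{\Sigma}_i+\boldsymbol{\Sigma}_{i'}/n_{i'})\boldsymbol{\mu}\}^{1/2}$ for $i=1,2;\ i'\neq i$. Note that $\mathrm{Var}\{W(\boldsymbol{x}_0)\}=\delta_i^2$ when $\boldsymbol{x}_0\in\pi_i$ for $i=1,2$.

Let $e(i)$ denote the error rate of misclassifying an individual from $\pi_i$ into the other class for $i=1,2$. Then, for DBDA, the classification rule (3), Aoshima and Yata (2014) gave the following result.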

###### Theorem 1 (Aoshima and Yata, 2014).

Assume the following conditions:

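(AY-i)

$\boldsymbol{\mu}^T\boldsymbol{\Sigma}_i\boldsymbol{\mu}/\Delta^2\to 0$ as $p\to\infty$ for $i=1,2$;

(AY-ii)

$\max_{i=1,2}\mathrm{tr}(\boldsymbol{\Sigma}_i^2)/(n_{\min}\Delta^2)\to 0$ as $m\to\infty$.

Then, for DBDA, we have that as $m\to\infty$

$$e(i)\to 0\quad\text{for } i=1,2. \qquad (7)$$

###### Remark 2.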

For DBDA, under (AY-i) and (AY-ii), one may write (7) as

$$e(i)=O(\delta_i^2/\Delta^2)\quad\text{for } i=1,2.$$

Next, we consider the asymptotic normality of the classifier. Hereafter, for a function, $f(\cdot)$, “$f(p)\in(0,\infty)$ as $p\to\infty$” implies $\liminf_{p\to\infty}f(p)>0$ and $\limsup_{p\to\infty}f(p)<\infty$. Let “$\Rightarrow$” denote the convergence in distribution, $N(0,1)$ denote a random variable distributed as the standard normal distribution and $\Phi(\cdot)$ denote the cumulative distribution function of the standard normal distribution.

Aoshima and Yata (2014) gave the following result.

###### Theorem 2 (Aoshima and Yata, 2014).

Assume the following conditions:

(AY-iii)

as ,    for ,  and   as .

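Assume also the NSSE condition (2). Under a certain assumption milder than (A-i), it holds that as $m\to\infty$

$$\frac{W(\boldsymbol{x}_0)-(-1)^i\Delta/2}{\delta_{oi}}\Rightarrow N(0,1)\quad\text{when } \boldsymbol{x}_0\in\pi_i \text{ for } i=1,2.$$

Furthermore, for DBDA, it holds that as $m\to\infty$

$$e(i)-\Phi\Big(-\frac{\Delta}{2\delta_{oi}}\Big)=o(1)\quad\text{when } \boldsymbol{x}_0\in\pi_i \text{ for } i=1,2. \qquad (8)$$

###### Remark 3.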

Aoshima and Yata (2015b) gave a different asymptotic normality from Theorem 2 under different conditions. From the facts that as under (AY-iii) and when , one may write (8) as

$$e(i)-\Phi\{-\Delta/(2\delta_i)\}=o(1)\quad\text{when } \boldsymbol{x}_0\in\pi_i \text{ for } i=1,2.$$

By using the asymptotic normality, Aoshima and Yata (2014) proposed the misclassification rate adjusted classifier (MRAC) in high-dimensional settings.
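To see what the normal approximation (8) gives in practice, one can plug population quantities into $\Phi\{-\Delta/(2\delta_{oi})\}$. The sketch below is illustrative only: the helper names `delta_oi` and `approx_error` and the toy setting $\boldsymbol{\Sigma}_1=\boldsymbol{\Sigma}_2=\boldsymbol{I}_p$ (which satisfies the NSSE condition (2)) are assumptions, and in applications the traces would have to be estimated.

```python
import math

def delta_oi(i, tr_s1s1, tr_s2s2, tr_s1s2, n1, n2):
    """delta_{oi} from Section 2, given tr(Sigma_1^2), tr(Sigma_2^2), tr(Sigma_1 Sigma_2)."""
    tr_ii, n_i, n_other = (tr_s1s1, n1, n2) if i == 1 else (tr_s2s2, n2, n1)
    pooled = tr_s1s1 / (2 * n1 * (n1 - 1)) + tr_s2s2 / (2 * n2 * (n2 - 1))
    return math.sqrt(tr_ii / n_i + tr_s1s2 / n_other + pooled)

def approx_error(Delta, delta):
    """Phi(-Delta / (2 * delta)), using Phi(-x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(Delta / (2.0 * delta) / math.sqrt(2.0))

# Hypothetical setting: Sigma_1 = Sigma_2 = I_p, so every trace equals p.
p, n1, n2, Delta = 1000, 20, 20, 40.0
d1 = delta_oi(1, p, p, p, n1, n2)
err1 = approx_error(Delta, d1)
print(d1, err1)
```

Note that $\Delta$ denotes the squared norm $\|\boldsymbol{\mu}\|^2$, so the argument of $\Phi$ is $\Delta/(2\delta_{oi})$ and not a ratio of a norm to a standard deviation.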

In this paper, we consider the distance-based classifier from a different point of view. We consider the classifier under the SSE model. We emphasize that high-dimensional data often have the SSE model. See Table 1, Figs. 1 and 2. If the SSE condition (4) is met, one cannot claim the asymptotic normality in Theorem 2. In addition, if the SSE condition (4) is met, (AY-ii) in Theorem 1 is equivalent to

$$\lambda_{i(1)}^2/(n_{\min}\Delta^2)=o(1)\quad\text{for } i=1,2. \qquad (9)$$

Thus (AY-ii) in the SSE model is stricter than that in the NSSE model. For example, for the NSSE model as the spiked model in (6) with $\alpha_{i(1)}<1/2$, $i=1,2$, (AY-ii) is equivalent to $p/(n_{\min}\Delta^2)=o(1)$. On the other hand, for the SSE model as (6) with $\alpha_{i(1)}\ge 1/2$ for $i=1,2$, (AY-ii) is equivalent to $p^{2\alpha_{i(1)}}/(n_{\min}\Delta^2)=o(1)$ for $i=1,2$, by (9). That means $n_{\min}$ or $\Delta$ should be quite large for the SSE model compared to the NSSE model. Thus if the SSE condition (4) is met, DBDA has the classification consistency (7) only under strict conditions compared to the NSSE condition (2). In order to overcome the difficulties, we propose a new distance-based classifier by estimating eigenstructures for the SSE model.

## 3 Distance-based classifier using eigenstructures

Let

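$$\Psi_{i(r)}=\mathrm{tr}(\boldsymbol{\Sigma}_i^2)-\sum_{s=1}^{r-1}\lambda_{i(s)}^2=\sum_{s=r}^{p}\lambda_{i(s)}^2\quad\text{for } i=1,2;\ r=1,\dots,p.$$

In this section, similar to Aoshima and Yata (2018), we assume the following model for $i=1,2$:

(M-i)

There exists a fixed integer $k_i\ (\ge 1)$ such that $\lambda_{i(1)},\dots,\lambda_{i(k_i)}$ are distinct in the sense that $\liminf_{p\to\infty}(\lambda_{i(r)}/\lambda_{i(s)}-1)>0$ when $1\le r<s\le k_i$, and $\lambda_{i(k_i)}$ and $\lambda_{i(k_i+1)}$ satisfy

$$\liminf_{p\to\infty}\frac{\lambda_{i(k_i)}^2}{\Psi_{i(k_i)}}>0\quad\text{and}\quad\frac{\lambda_{i(k_i+1)}^2}{\Psi_{i(k_i+1)}}\to 0\quad\text{as } p\to\infty.$$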

Note that (M-i) implies the SSE condition (4), that is (M-i) is one of the SSE models. For example, (M-i) holds in the spiked model in (6) with

$$\alpha_{i(1)}\ge\cdots\ge\alpha_{i(k_i)}\ge 1/2>\alpha_{i(k_i+1)}\ge\cdots\ge\alpha_{i(t_i)}\quad\text{and}\quad a_{i(r)}\neq a_{i(s)}$$

for $1\le r<s\le k_i$. We emphasize that (M-i) is a natural model under the SSE condition (4). See Fig. 2. The six microarray data sets appear to be consistent with (M-i). Similar to (9), we note that the sufficient condition (AY-ii) in Theorem 1 is equivalent to

$$\sum_{r=1}^{k_i}\lambda_{i(r)}^2/(n_{\min}\Delta^2)=o(1)\quad\text{for } i=1,2$$

under (M-i). According to the arguments in the last paragraph of Section 2, if (M-i) is met, DBDA has the classification consistency (7) under strict conditions compared to the NSSE condition (2). Also, one cannot claim the asymptotic normality in Theorem 2 under (M-i). In order to overcome the difficulties, similar to Aoshima and Yata (2018), we consider a data transformation from the SSE model to the NSSE model.

### 3.1 Data transformation

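Recall that $\boldsymbol{h}_{i(r)}$ is the $r$-th eigenvector of $\boldsymbol{\Sigma}_i$. Let

$$\boldsymbol{A}_i=\boldsymbol{I}_p-\sum_{r=1}^{k_i}\boldsymbol{h}_{i(r)}\boldsymbol{h}_{i(r)}^T=\sum_{r=k_i+1}^{p}\boldsymbol{h}_{i(r)}\boldsymbol{h}_{i(r)}^T\quad\text{and}\quad\boldsymbol{x}_{ij,A}=\boldsymbol{A}_i\boldsymbol{x}_{ij}$$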

for . Note that for . Let us write that , , , and . Note that and for all . Thus the transformed data, , has the NSSE model in the sense that

$$\frac{\{\lambda_{\max}(\boldsymbol{\Sigma}_{i,A})\}^2}{\mathrm{tr}(\boldsymbol{\Sigma}_{i,A}^2)}=\frac{\lambda_{i(k_i+1)}^2}{\Psi_{i(k_i+1)}}\to 0\quad\text{as } p\to\infty,$$

where $\lambda_{\max}(\boldsymbol{M})$ denotes the largest eigenvalue of any positive-semidefinite matrix, $\boldsymbol{M}$. Hence, we can say that a classifier by using the transformed data has the classification consistency (7) under mild conditions compared to DBDA when (M-i) is met. In addition, one can claim the asymptotic normality of the classifier even when the SSE condition (4) is met.
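The effect of the transformation can be illustrated at the population level. The sketch below is an assumption-laden toy example, not the estimation procedure of Section 4: it uses the true eigenvectors of a synthetic covariance matrix with two spikes, takes the covariance of the transformed vector $\boldsymbol{A}\boldsymbol{x}$ to be $\boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}$, and compares the first quadratic contribution ratio before and after the projection. All names (`eta_first`, `Sigma_A`, the spike exponents) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 500, 2

# Synthetic spiked spectrum: two large eigenvalues, the rest equal to 1.
lam = np.ones(p)
lam[:k] = [p ** 0.9, p ** 0.7]
H, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random orthogonal eigenvector matrix
Sigma = (H * lam) @ H.T                           # H diag(lam) H^T

# Projection that removes the first k eigenspaces.
A = np.eye(p) - H[:, :k] @ H[:, :k].T
Sigma_A = A @ Sigma @ A                           # covariance of A x

def eta_first(S):
    """Squared largest eigenvalue divided by tr(S^2)."""
    ev = np.linalg.eigvalsh(S)
    return ev[-1] ** 2 / np.sum(ev ** 2)

eta_before = eta_first(Sigma)
eta_after = eta_first(Sigma_A)
print(eta_before, eta_after)
```

In this toy setting the ratio drops from a value close to one to roughly $1/(p-k)$, which is the kind of change from the SSE regime to the NSSE regime that the transformation is designed to achieve.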

Now, we propose the classifier by using the transformed data. Let us write that , and for . We consider the following classifier:

$$\begin{aligned}
W_A(\boldsymbol{x}_0)&=\Big(\boldsymbol{x}_{0,A*}-\frac{\bar{\boldsymbol{x}}_{1,A}+\bar{\boldsymbol{x}}_{2,A}}{2}\Big)^T(\bar{\boldsymbol{x}}_{2,A}-\bar{\boldsymbol{x}}_{1,A})-\frac{\mathrm{tr}(\boldsymbol{A}_1\boldsymbol{S}_1)}{2n_1}+\frac{\mathrm{tr}(\boldsymbol{A}_2\boldsymbol{S}_2)}{2n_2}\\
&=\boldsymbol{x}_{0,A*}^T(\bar{\boldsymbol{x}}_{2,A}-\bar{\boldsymbol{x}}_{1,A})+\sum_{j<j'}^{n_1}\frac{\boldsymbol{x}_{1j,A}^T\boldsymbol{x}_{1j',A}}{n_1(n_1-1)}-\sum_{j<j'}^{n_2}\frac{\boldsymbol{x}_{2j,A}^T\boldsymbol{x}_{2j',A}}{n_2(n_2-1)}. \qquad (10)
\end{aligned}$$

Then, one classifies $\boldsymbol{x}_0$ into $\pi_1$ if $W_A(\boldsymbol{x}_0)<0$ and into $\pi_2$ otherwise. Let . Here, let us write that ,
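The role of the correction terms $\mathrm{tr}(\boldsymbol{A}_i\boldsymbol{S}_i)/(2n_i)$ in the classifier above can be checked numerically: for a symmetric idempotent $\boldsymbol{A}$, the quantity $\|\boldsymbol{A}\bar{\boldsymbol{x}}\|^2/2-\mathrm{tr}(\boldsymbol{A}\boldsymbol{S})/(2n)$ equals the sum of cross products $\sum_{j<j'}(\boldsymbol{A}\boldsymbol{x}_j)^T(\boldsymbol{A}\boldsymbol{x}_{j'})/\{n(n-1)\}$, which involves no squared terms. The sketch below verifies this algebraic identity on random data; the projection, the sample and all variable names are hypothetical, and the sample mean of the transformed data is assumed to be $\boldsymbol{A}\bar{\boldsymbol{x}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 8, 50, 3

# Hypothetical sample with unequal coordinate scales and a nonzero mean.
X = rng.standard_normal((n, p)) * np.linspace(1.0, 3.0, p) + 0.5

# A symmetric idempotent matrix A = I - Q Q^T with orthonormal columns Q.
Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
A = np.eye(p) - Q @ Q.T

XA = X @ A                      # rows are A x_j because A is symmetric
S = np.cov(X, rowvar=False)     # unbiased sample covariance of the raw data
xbar_A = XA.mean(axis=0)

# Left-hand side: squared norm of the transformed mean with the trace correction.
lhs = xbar_A @ xbar_A / 2 - np.trace(A @ S) / (2 * n)

# Right-hand side: sum over pairs j < j' of inner products of transformed samples.
G = XA @ XA.T
cross = (G.sum() - np.trace(G)) / 2
rhs = cross / (n * (n - 1))
print(lhs, rhs)
```

Since the cross products of independent observations have expectation $\|\boldsymbol{A}\boldsymbol{\mu}\|^2$, the right-hand side is unbiased for $\|\boldsymbol{A}\boldsymbol{\mu}\|^2/2$, which is why the trace terms act as a bias correction.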

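$$\begin{aligned}
\delta_{oi,A}=&\ \Big\{\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{i,A*}\boldsymbol{\Sigma}_{i,A})}{n_i}+\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{i,A*}\boldsymbol{\Sigma}_{i',A})}{n_{i'}}+\sum_{l=1}^{2}\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{l,A}^2)}{2n_l(n_l-1)}\Big\}^{1/2};\ \text{and}\\
\delta_{i,A}=&\ \Big\{\delta_{oi,A}^2+\boldsymbol{\mu}_A^T\boldsymbol{\Sigma}_{i,A*}\boldsymbol{\mu}_A+\boldsymbol{\mu}_i^T\boldsymbol{A}_{1,2}\boldsymbol{\Sigma}_{i,A}\boldsymbol{A}_{1,2}\boldsymbol{\mu}_i/(4n_i)\\
&\ +(\boldsymbol{\mu}_A-\boldsymbol{A}_{1,2}\boldsymbol{\mu}_i/2)^T\boldsymbol{\Sigma}_{i',A}(\boldsymbol{\mu}_A-\boldsymbol{A}_{1,2}\boldsymbol{\mu}_i/2)/n_{i'}\Big\}^{1/2}
\end{aligned}$$

for $i=1,2;\ i'\neq i$. Then, we claim that when $\boldsymbol{x}_0\in\pi_i$ for $i=1,2$,

$$E\{W_A(\boldsymbol{x}_0)\}=(-1)^i\frac{\Delta_A}{2}-(-1)^i\frac{\boldsymbol{\mu}_i^T\boldsymbol{A}_{1,2}\boldsymbol{\mu}_A}{2}\quad\text{and}\quad\mathrm{Var}\{W_A(\boldsymbol{x}_0)\}=\delta_{i,A}^2. \qquad (11)$$

###### Remark 4.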

In general, in (11) is not sufficiently large because of rank. If , it holds that and

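$$\begin{aligned}
\mathrm{Var}\{W_A(\boldsymbol{x}_0)\}=&\ \frac{\mathrm{tr}(\boldsymbol{\Sigma}_{i,A}^2)}{n_i}+\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{1,A}\boldsymbol{\Sigma}_{2,A})}{n_{i'}}+\sum_{l=1}^{2}\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{l,A}^2)}{2n_l(n_l-1)}\\
&\ +\boldsymbol{\mu}_A^T(\boldsymbol{\Sigma}_{i,A}+\boldsymbol{\Sigma}_{i',A}/n_{i'})\boldsymbol{\mu}_A
\end{aligned}$$

when $\boldsymbol{x}_0\in\pi_i$ for $i=1,2$.

In Sections 3.2 and 3.3, we give consistency properties and an asymptotic normality of $W_A(\boldsymbol{x}_0)$. We assume the following conditions as necessary: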

(C-i)

as  for ;

(C-ii)

as  for ;

(C-iii)

as  and    for ;

(C-iv)

as ,    for ,  and    as ;

(C-v)

as ,   ,
and    as  for .

### 3.2 Consistency of the classifier (10)

We consider consistency properties of $W_A(\boldsymbol{x}_0)$. We note that as under (C-i) to (C-iii). See Section 6.1. Then, we have the following results.

###### Theorem 3.

Assume (M-i). Assume also (C-i) to (C-iii). Then, it holds that as $m\to\infty$

$$\frac{W_A(\boldsymbol{x}_0)}{\Delta_A}=\frac{(-1)^i}{2}+o_P(1)\quad\text{when } \boldsymbol{x}_0\in\pi_i \text{ for } i=1,2.$$

For the classification rule (10), we have the classification consistency (7) as $m\to\infty$.

###### Corollary 1.

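If $\boldsymbol{A}_1=\boldsymbol{A}_2$, for the classification rule (10), we have the classification consistency (7) as $m\to\infty$ under (M-i) and the following conditions:

$$\frac{\boldsymbol{\mu}_A^T\boldsymbol{\Sigma}_{i,A}\boldsymbol{\mu}_A}{\Delta_A^2}\to 0\ \text{ as } p\to\infty\quad\text{and}\quad\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{i,A}^2)}{n_{\min}\Delta_A^2}\to 0\ \text{ as } m\to\infty\quad\text{for } i=1,2.$$

###### Remark 5.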

For the classification rule (10), under (M-i) and (C-i) to (C-iii), one may write (7) as

$$e(i)=O(\delta_{i,A}^2/\Delta_A^2)\quad\text{for } i=1,2.$$

Now, we consider the sufficient condition (C-ii) in Theorem 3. When as for