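A common feature of high-dimensional data is that the data dimension is high while the sample size is relatively low. This is the so-called “HDLSS” or “large $p$, small $n$” data situation where $p/n\to\infty$; here $p$ is the data dimension and $n$ is the sample size. Suppose we have two independent $p$-variate populations,
$\pi_i$, $i=1,2$, having an unknown mean vector $\boldsymbol{\mu}_i$ and unknown covariance matrix $\boldsymbol{\Sigma}_i$ for each $i$. We do not assume $\boldsymbol{\Sigma}_1=\boldsymbol{\Sigma}_2$. The eigen-decomposition of $\boldsymbol{\Sigma}_i$ is given by $\boldsymbol{\Sigma}_i=\boldsymbol{H}_i\boldsymbol{\Lambda}_i\boldsymbol{H}_i^T$, where $\boldsymbol{\Lambda}_i=\mathrm{diag}(\lambda_{i(1)},\ldots,\lambda_{i(p)})$ is a diagonal matrix of eigenvalues, $\lambda_{i(1)}\ge\cdots\ge\lambda_{i(p)}\ge 0$, and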
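$\boldsymbol{H}_i=[\boldsymbol{h}_{i(1)},\ldots,\boldsymbol{h}_{i(p)}]$ is an orthogonal matrix of the corresponding eigenvectors. We have independent and identically distributed (i.i.d.) observations, $\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{in_i}$, from each $\pi_i$. We assume . We estimate $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ by $\overline{\boldsymbol{x}}_{in_i}=\sum_{j=1}^{n_i}\boldsymbol{x}_{ij}/n_i$ and $\boldsymbol{S}_{in_i}=\sum_{j=1}^{n_i}(\boldsymbol{x}_{ij}-\overline{\boldsymbol{x}}_{in_i})(\boldsymbol{x}_{ij}-\overline{\boldsymbol{x}}_{in_i})^T/(n_i-1)$. Let $\boldsymbol{x}_0$ be an observation vector of an individual belonging to one of the two populations. We assume $\boldsymbol{x}_0$ and the $\boldsymbol{x}_{ij}$s are independent. When the $\pi_i$s are Gaussian, a typical classification rule is that one classifies an individual into $\pi_1$ if
$$(\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{1n_1})^T\boldsymbol{S}_{1n_1}^{-1}(\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{1n_1})-\log\left|\boldsymbol{S}_{2n_2}\boldsymbol{S}_{1n_1}^{-1}\right|<(\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{2n_2})^T\boldsymbol{S}_{2n_2}^{-1}(\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{2n_2}) \tag{1}$$
and into $\pi_2$ otherwise.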
However, the inverse matrix of the sample covariance matrix $\boldsymbol{S}_{in_i}$ does not exist in the HDLSS context ($p>n_i$).
Also, we emphasize that the Gaussian assumption is restrictive in real high-dimensional data analyses.
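The singularity of the sample covariance matrix when the dimension exceeds the sample size can be checked numerically. The following is a minimal sketch, assuming standard Gaussian data and hypothetical sizes `p = 200` and `n = 20`; it only illustrates that the rank of the sample covariance matrix cannot exceed the sample size minus one.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 20                       # hypothetical HDLSS sizes with p > n
X = rng.standard_normal((n, p))      # rows are observations, columns are variables

# p x p sample covariance matrix with divisor n - 1
S = np.cov(X, rowvar=False)

# Centering removes one degree of freedom, so rank(S) <= n - 1 < p
rank_S = np.linalg.matrix_rank(S)
print(f"dimension p = {p}, rank of S = {rank_S}")
```

Because the rank is below `p`, the matrix is singular and the quadratic rule based on its inverse cannot be evaluated as written.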
Bickel and Levina (2004) considered a naive Bayes classifier for high-dimensional data. Fan and Fan (2008) considered classification after feature selection. Cai and Liu (2011), Shao et al. (2011) and Li and Shao (2015) gave sparse linear or quadratic classification rules for high-dimensional data. The above references all assumed the following eigenvalues condition: There is a constant (not depending on $p$) such that
Dudoit et al. (2002) considered using the inverse matrix defined by only diagonal elements of $\boldsymbol{S}_{in_i}$. Aoshima and Yata (2011, 2015a) considered substituting $\{\mathrm{tr}(\boldsymbol{S}_{in_i})/p\}\boldsymbol{I}_p$ for $\boldsymbol{S}_{in_i}$ by using the difference of a geometric representation of HDLSS data from each $\pi_i$. Here,
$\boldsymbol{I}_p$ denotes the identity matrix of dimension $p$. On the other hand, Hall et al. (2005, 2008) and Marron et al. (2007) considered distance weighted classifiers. Ahn and Marron (2010) considered an HDLSS classifier based on the maximal data piling. Hall et al. (2005), Chan and Hall (2009), Aoshima and Yata (2014) and Watanabe et al. (2015) considered distance-based classifiers. Aoshima and Yata (2014) gave the misclassification rate adjusted classifier for multiclass, high-dimensional data whose misclassification rates are no more than specified thresholds under the following eigenvalues condition:
Then, one classifies into if and into otherwise. Here, is a bias-correction term. Note that the classifier (3) is equivalent to the scale adjusted distance-based classifier given by Chan and Hall (2009). Aoshima and Yata (2015b) called the classification rule (3) the “distance-based discriminant analysis (DBDA)”.
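The displayed form of the classifier (3) is not reproduced in this text, so the following is only a sketch under a stated assumption. It assumes the statistic $W(\boldsymbol{x}_0)=\{\boldsymbol{x}_0-(\overline{\boldsymbol{x}}_{1n_1}+\overline{\boldsymbol{x}}_{2n_2})/2\}^T(\overline{\boldsymbol{x}}_{2n_2}-\overline{\boldsymbol{x}}_{1n_1})-\mathrm{tr}(\boldsymbol{S}_{1n_1})/(2n_1)+\mathrm{tr}(\boldsymbol{S}_{2n_2})/(2n_2)$, with an individual classified into $\pi_1$ when $W(\boldsymbol{x}_0)<0$. The function names, sample sizes and mean shift are hypothetical.

```python
import numpy as np

def dbda_score(x0, X1, X2):
    """Assumed DBDA statistic; negative values favour population 1."""
    n1, n2 = X1.shape[0], X2.shape[0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # tr(S_i) equals the sum of per-coordinate sample variances (divisor n_i - 1)
    tr_s1 = X1.var(axis=0, ddof=1).sum()
    tr_s2 = X2.var(axis=0, ddof=1).sum()
    bias_correction = -tr_s1 / (2 * n1) + tr_s2 / (2 * n2)
    return float((x0 - (m1 + m2) / 2) @ (m2 - m1) + bias_correction)

def dbda_classify(x0, X1, X2):
    """Return 1 or 2 according to the sign of the assumed statistic."""
    return 1 if dbda_score(x0, X1, X2) < 0 else 2

# Hypothetical non-Gaussian-free toy data: unequal covariances and a mean shift
rng = np.random.default_rng(1)
p, n1, n2 = 500, 10, 15
mu2 = np.full(p, 1.0)
X1 = rng.standard_normal((n1, p))
X2 = 2.0 * rng.standard_normal((n2, p)) + mu2
x0_from_1 = rng.standard_normal(p)
x0_from_2 = 2.0 * rng.standard_normal(p) + mu2

print(dbda_classify(x0_from_1, X1, X2), dbda_classify(x0_from_2, X1, X2))
```

Under this assumed form, twice the statistic equals the difference of the two scale-adjusted squared distances, $\|\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{1n_1}\|^2-\mathrm{tr}(\boldsymbol{S}_{1n_1})/n_1$ and $\|\boldsymbol{x}_0-\overline{\boldsymbol{x}}_{2n_2}\|^2-\mathrm{tr}(\boldsymbol{S}_{2n_2})/n_2$, which is consistent with the stated equivalence to a scale adjusted distance-based rule.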
Recently, Aoshima and Yata (2018) considered the “strongly spiked eigenvalue (SSE) model” as follows:
from the fact that . Here, is the first contribution ratio. We call (5) the “super strongly spiked eigenvalue (SSSE) model”.
Let us consider a spiked model such as
with positive and fixed constants, s, s and s, and a positive and fixed integer . Note that the NSSE condition (2) holds when for . On the other hand, the SSE condition (4) holds when , and further the SSSE condition (5) holds when . See Yata and Aoshima (2012) for the details of the spiked model.
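How the exponent of a spiked eigenvalue governs the contribution ratios can be illustrated numerically. The sketch below rests on assumptions that are not taken from the displayed equations, which are missing here: a single spiked eigenvalue of the form $a\,p^{\alpha}$ with all remaining eigenvalues equal to a constant $c$, the contribution ratio taken as $\lambda_{(1)}/\mathrm{tr}(\boldsymbol{\Sigma})$, and the quadratic contribution ratio taken as $\lambda_{(1)}^2/\mathrm{tr}(\boldsymbol{\Sigma}^2)$. The names `a`, `c` and `alpha` are hypothetical.

```python
import numpy as np

def spiked_eigenvalues(p, alpha, a=1.0, c=1.0):
    """One spiked eigenvalue a * p**alpha; the other p - 1 eigenvalues equal c."""
    lam = np.full(p, c, dtype=float)
    lam[0] = a * p ** alpha
    return lam

def contribution_ratios(lam):
    """Return (first contribution ratio, first quadratic contribution ratio)."""
    first = lam.max()
    return first / lam.sum(), first ** 2 / np.sum(lam ** 2)

for alpha in (0.25, 0.5, 1.0):
    for p in (10 ** 3, 10 ** 5):
        ratio, quad_ratio = contribution_ratios(spiked_eigenvalues(p, alpha))
        print(f"alpha={alpha}, p={p}: ratio={ratio:.4f}, quadratic ratio={quad_ratio:.4f}")
```

In this toy setting the quadratic ratio shrinks with $p$ for the small exponent, stays near one half at exponent $1/2$, and the plain contribution ratio itself stays bounded away from zero at exponent $1$.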
for six well-known microarray data sets by using the noise-reduction methodology and the cross-data-matrix methodology. For those methods, see Yata and Aoshima (2010, 2012). Note that is the contribution ratio and is a quadratic contribution ratio of the -th eigenvalue. We estimated by and by , where is defined by (15), and and are defined in Section 4.3. We note that and are consistent estimators of and when . See (17) and (22) for the details. The six microarray data sets are as follows:
(D-i) Non-pathologic tissues data with genes, consisting of : placenta or blood ( samples) and other solid tissue ( samples) given by Christensen et al. (2009);
(D-ii) Colon cancer data with genes, consisting of : colon tumor ( samples) and normal colon ( samples) given by Alon et al. (1999);
(D-iii) Breast cancer data with genes, consisting of good ( samples) and poor ( samples) given by Gravier et al. (2010);
(D-iv) Lymphoma data with genes, consisting of DLBCL (58 samples) and follicular lymphoma (19 samples) given by Shipp et al. (2002);
(D-v) Myeloma data with genes, consisting of patients without bone lesions (36 samples) and patients with bone lesions (137 samples) given by Tian et al. (2003);
(D-vi) Breast cancer data with genes, consisting of luminal group (84 samples) and non-luminal group (44 samples) given by Naderi et al. (2007).
The data sets (D-ii), (D-iv) and (D-v) are given in Jeffery et al. (2006), (D-i) and (D-iii) are given in Ramey (2016), and (D-vi) is given in Glaab et al. (2012). We summarized the results for , and in Table 1, where is an estimate of , given in Section 4.3. We will discuss and in Sections 3 and 4.3. We also visualized the first ten contribution ratios given by in Fig. 1 and the first ten quadratic contribution ratios given by in Fig. 2. See (17) and (22) for the details.
We observed from Fig. 1 that the first several eigenvalues are much larger than the rest for the microarray data sets (except (D-v)). In particular, the first eigenvalues for (D-i) and (D-iv) are extremely large. These data appear to be consistent with the SSSE asymptotic domain given in (5). On the other hand, the first several eigenvalues for (D-v) are relatively small. However, from Table 1 and Fig. 2, s for (D-v) are not sufficiently small. Also, s for (D-ii), (D-iii) and (D-vi) are relatively large in Table 1 and Fig. 2. Hence, the six microarray data appear to be consistent with the SSE asymptotic domain given in (4). See Section 4.3. In this paper, we consider classifiers under the SSE model. We do not assume the normality of the population distributions. We propose an effective distance-based classifier for such high-dimensional data sets.
The organization of this paper is as follows. In Section 2, we introduce asymptotic properties of the distance-based classifier for high-dimensional data. We discuss the distance-based classifier in the SSE model. In Section 3, we consider a distance-based classifier using eigenstructures for the SSE model. In Section 4, we discuss estimation of the eigenvalues and eigenvectors for the SSE model. We create a new distance-based classifier by estimating the eigenstructures. In Section 5, we give simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier by using microarray data sets.
2 Distance-based classifier for high-dimensional data
In this section, we introduce asymptotic properties of the distance-based classifier for high-dimensional data. For any positive-semidefinite matrix $\boldsymbol{M}$, we write the square root of $\boldsymbol{M}$ as $\boldsymbol{M}^{1/2}$. Let
where is considered as a sphered data vector having the zero mean vector and identity covariance matrix. Similar to Bai and Saranadasa (1996) and Chen and Qin (2010), we assume the following assumption for , , as necessary:
for all , , and for all .
When the s are Gaussian, (A-i) naturally holds. Let
where $\|\cdot\|$ denotes the Euclidean norm. Note that when for . Also, note that the divergence condition “, and ” is equivalent to “”. Let
and for . Note that when for .
Theorem 1 (Aoshima and Yata, 2014).
Assume the following conditions:
as for ;
Then, for DBDA, we have that as
For DBDA, under (AY-i) and (AY-ii), one may write (7) as
Next, we consider the asymptotic normality of the classifier.
Hereafter, for a function, , “ as ” implies and .
Let “$\Rightarrow$” denote the convergence in distribution, and let $\Phi(\cdot)$ denote the cumulative distribution function of the standard normal distribution. Aoshima and Yata (2014) gave the following result.
Theorem 2 (Aoshima and Yata, 2014).
Assume the following conditions:
as , for , and as .
Assume also the NSSE condition (2). Under a certain assumption milder than (A-i), it holds that as
Furthermore, for DBDA, it holds that as
By using the asymptotic normality, Aoshima and Yata (2014) proposed the misclassification rate adjusted classifier (MRAC) in high-dimensional settings.
In this paper, we consider the distance-based classifier from a different point of view. We consider the classifier under the SSE model. We emphasize that high-dimensional data often have the SSE model. See Table 1, Figs. 1 and 2. If the SSE condition (4) is met, one cannot claim the asymptotic normality in Theorem 2. In addition, if the SSE condition (4) is met, (AY-ii) in Theorem 1 is equivalent to
Thus (AY-ii) in the SSE model is stricter than that in the NSSE model. For example, for the NSSE model as the spiked model in (6) with , , (AY-ii) is equivalent to . On the other hand, for the SSE model as (6) with (and for ), (AY-ii) is equivalent to . That means or should be quite large for the SSE model compared to the NSSE model. Thus, if the SSE condition (4) is met, DBDA has the classification consistency (7) only under stricter conditions than under the NSSE condition (2). In order to overcome these difficulties, we propose a new distance-based classifier by estimating eigenstructures for the SSE model.
3 Distance-based classifier using eigenstructures
In this section, similar to Aoshima and Yata (2018), we assume the following model for :
There exists a fixed integer such that are distinct in the sense that when , and and satisfy
for . We emphasize that (M-i) is a natural model under the SSE condition (4). See Fig. 2. The six microarray data appear to be consistent with (M-i). Similar to (9), we note that the sufficient condition (AY-ii) in Theorem 1 is equivalent to
under (M-i). According to the arguments in the last paragraph of Section 2, if (M-i) is met, DBDA has the classification consistency (7) under strict conditions compared to the NSSE condition (2). Also, one cannot claim the asymptotic normality in Theorem 2 under (M-i). In order to overcome the difficulties, similar to Aoshima and Yata (2018), we consider a data transformation from the SSE model to the NSSE model.
3.1 Data transformation
Recall that is the -th eigenvector of . Let
for . Note that for . Let us write that , , , and . Note that and for all . Thus the transformed data, , has the NSSE model in the sense that
where $\lambda_{\max}(\boldsymbol{M})$ denotes the largest eigenvalue of any positive-semidefinite matrix, $\boldsymbol{M}$. Hence, we can say that a classifier using the transformed data has the classification consistency (7) under milder conditions than DBDA when (M-i) is met. In addition, one can claim the asymptotic normality of the classifier even when the SSE condition (4) is met.
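The effect of such a transformation on the eigenstructure can be sketched numerically. The example below is an illustration under assumptions, since the displayed definition is not reproduced in this text: it assumes the transformation is the projection $\boldsymbol{A}=\boldsymbol{I}_p-\sum_{j=1}^{k}\boldsymbol{h}_{(j)}\boldsymbol{h}_{(j)}^T$ onto the orthogonal complement of the first $k$ eigenvectors, uses the true eigenvectors rather than estimates, and takes a hypothetical covariance matrix with two strongly spiked eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 300, 2                                     # hypothetical dimension and number of spikes

# Hypothetical covariance: k strongly spiked eigenvalues, unit eigenvalues elsewhere
H, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random orthogonal matrix of eigenvectors
lam = np.ones(p)
lam[:k] = [p, p ** 0.75]
Sigma = (H * lam) @ H.T                           # H diag(lam) H^T

# Assumed transformation: remove the span of the first k eigenvectors
A = np.eye(p) - H[:, :k] @ H[:, :k].T
Sigma_star = A @ Sigma @ A.T                      # covariance of the transformed data A x

def quadratic_ratio(M):
    """Largest eigenvalue squared divided by tr(M^2)."""
    ev = np.linalg.eigvalsh(M)
    return ev.max() ** 2 / np.sum(ev ** 2)

before = quadratic_ratio(Sigma)
after = quadratic_ratio(Sigma_star)
print(f"quadratic ratio before: {before:.4f}, after: {after:.4f}")
```

In this toy setting the ratio is close to one before the transformation and close to zero afterwards, which is the sense in which the transformed data behave like the NSSE model. In practice the eigenvectors are unknown and must be estimated.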
Now, we propose the classifier by using the transformed data. Let us write that , and for . We consider the following classifier:
Then, one classifies into if and into otherwise. Let . Here, let us write that ,
for . Then, we claim that when for ,
In general, in (11) is not sufficiently large because of rank. If , it holds that and
when for .
In Sections 3.2 and 3.3, we give consistency properties and an asymptotic normality of . We assume the following conditions as necessary:
as for ;
as for ;
as and for ;
as , for , and as ;
as , ,
and as for .
3.2 Consistency of the classifier (10)
We consider consistency properties of . We note that as under (C-i) to (C-iii). See Section 6.1. Then, we have the following results.
Now, we consider the sufficient condition (C-ii) in Theorem 3. When as for , it holds that