DeepAI

# Classification of Functional Data by Detecting the Discrepancy of Second Moment Structure of Scaled functions

This article presents a new classification method for functional data. We consider the case where different groups of functions have similar means so that it is difficult to classify them using only the mean function. To overcome this limitation, we propose the second moment-based functional classifier (SMFC). Here, we demonstrate that the new method is sensitive to divergence in the second moment structure and thus produces lower rate of misclassification compared to competitor methods. Our method uses the Hilbert-Schmidt norm to measure the divergence of second moment structure. One important innovation of our classification procedure lies in the dimension reduction step, where the SMFC method data-adaptively determines the basis functions that account for the difference of second moment structure rather than the functional principal component of each individual group, and good performance can be achieved as unnecessary variability is removed so that the classification accuracy is improved. Consistency properties of the classification procedure and the relevant estimators are established. Simulation study and real data analysis on phoneme and rat brain activity trajectories empirically validates the superiority of the proposed method.

• 7 publications
• 6 publications
• 52 publications
03/09/2020

### Principal Moment Analysis

Principal Moment Analysis is a method designed for dimension reduction, ...
05/09/2022

### Enhanced Change-Point Detection in Functional Means

A new dimension reduction methodology for change-point detection in func...
10/21/2017

### Functional data analysis in the Banach space of continuous functions

Functional data analysis is typically conducted within the L^2-Hilbert s...
10/25/2021

### Minimax estimation of Functional Principal Components from noisy discretized functional data

Functional Principal Component Analysis is a reference method for dimens...
06/03/2019

### Copula-based functional Bayes classification with principal components and partial least squares

We present a new functional Bayes classifier that uses principal compone...
10/15/2020

### Estimation of Discrete Choice Models: A Machine Learning Approach

In this paper we propose a new method of estimation for discrete choice ...
10/24/2019

## 1 Introduction

Classification is a popular tool for functional data analysis. The standard problem setting is, two groups of functions with some kinds of difference are available, and we need to classify new objects with unknown group index to the correct group. The existing classification methods for functional data usually assume the two groups have different mean functions. However, functional data of different groups sometimes present similar mean functions. A common nontrivial example is electroencephalogram (EEG) recordings, where the records of each epoch always oscillate around zero. The motivation of this work comes from discriminating pre-stroke and post-stroke local field potential trajectories of rats brain. Figure

1 represents 20 sample trajectories of the first channel before and after the stroke. In such case, the classification methods based on the mean function will not be helpful. Motivated by this challenge, we propose a new classification method for functional data with similar mean functions.

The main contribution of this work is that we develop a second moment based functional classifier for the classification of different groups of functional data with similar mean function. Compared with the existing methods, we do not make any distributional assumption and thus the classification procedure can be widely applied. We would like to stress that, the classification accuracy is influenced by two factors, discrepancy between groups and the variability of the object to be classified. As we use more basis functions to compare different groups, the discrepancy can become more pronounced, but meanwhile, the variability will also increase. Essentially this is a trade off between bias and variance. Therefore, it is not always good to incorporate more basis functions. We propose to use the basis function that can account for the most discrepancy between groups. This is important as we can remove unnecessary variability to improve the classification accuracy. We also would like to stress that, the performance of proposed classifiers can tend to be perfect as the samples size increase. In other words, the misclassification rate can go to zero as the sample size goes to infinity. The asymptotically perfect classification can be achieved in the case that the functional trajectories present pronounced discrepancy in the second moment over a wide range of frequency and comparatively small variability. This is a similar point discussed in

delaigle2012achieving, and we state that this property still holds for our second-moment based functional classifier. More discussion can be found in Section 3.2.

In the past two decades, a variety of classification and clustering methods for functional data have been proposed. james2001functional

extended linear discriminate analysis to functional data and used a parametric model to reduce the rank.

preda2007pls applied partial least squares in functional linear discriminate analysis. james2003clustering developed a flexible model-based procedure. biau2005functional and fromont2006functional applied nearest neighbor rule in functional data classification, and their methods are based on the first moment. muller2005generalized studied generalized functional linear model, which was used for classification in leng2006classification. li2018bayesian used multinomial logistic model for multi-class functional data classification. chiou2007functional proposed a novel functional principal component (fPC) subspace-projected -centers functional discrimination approach. chiou2008correlation proposed a correlation-based -centers functional clustering method. wang2007bayesian and fryzlewicz2009consistent employed wavelet methods. tian2013interpretable proposed a interpretable dimension reduction technique for functional data classification. delaigle2012achieving studied a novel functional linear classifier, which is optimal under normality and can be perfect as sample size diverges. delaigle2013classification also studied the functional Bayesian quadratic classifier and applied it to censored functional data. ieva2016covariance proposed a new algorithm to perform clustering of functional data based on covariance, where the true group index is assumed unknown. wang2016functional gave an overall overview of the existing classification method for functional data. Most of these works highly depend on the discrepancy between first moments, which motivates us to develop a classification methodology for groups with similar means. Some of these methods also incorporate the covariance difference, (e.g. chiou2007functional and delaigle2013classification), however, in the methods, the discrepancy between second moments is accounted for by group-wise fPCs, which can not capture the discrepancy efficiently. pigoli2014distances discussed different distances for covariance operators.

Meanwhile, there are some research works on discriminating multivariate data by covariance. anderson1962classification and hoffbeck1996covariance

studied classification procedures for observations coming from one of two multivariate normal distributions in the case that the two distributions differ both in mean vectors and covariance matrices.

madiraju1994rotation proposed a simple and powerful approach for texture classification using the eigen-features of local covariance measures. kluckner2009semantic, fehr2012compact and fehr2014rgb used covariance descriptor in the classification of multivariate data. In EEG-based classification, farquhar2009linear and barachant2013classification) used spatial covariance matrix as a feature. sundararajan2019modeling proposed a FS-ratio statistics and used it as a feature to discriminate different states. fontaine2019modeling proposed a copula-based algorithm to detect changes in brain signals. These methods use different features to discriminate epochs under different states. However, since the intra-curve information is not incorporated, the methods will not perform well if the discrepancy is mainly present in the intra-curve structure.

Compared with the existing methods, our method has the following advantages:

• The proposed classification procedure is entirely data-driven and non-parametric, making it a suitable method for a broad range of data.

• We use the sequence of orthonormal basis, that account for most of the discrepancy of the second moments, in comparison to improve the classification accuracy.

• Our method takes account of the intra-curve information, which is important in functional trajectories.

• We extend the procedure to correlated and multivariate functional data.

The rest of the paper is organized as follows. In Section 2, we introduce some preliminaries of functional data. In Section 3, we present the classification procedure in the cases of independent, multivariate and correlated functional data, and shows the consistency of the classification procedure and the relevant estimators. In Section 4, we study the finite sample properties of the procedure by simulations, and compare our method with some competitor methods. In Section 5, we implement the proposed method to classify different phonemes and different states of rats brain activity. Conclusion is made in Section 6. Technical proofs, some relevant algorithms and additional figures can be found in the appendix.

## 2 Preliminaries

• The notation indicates that, for some , . Let be a set of functional trajectories such that each function is an element of the Hilbert space , where the inner product is defined as , and the norm is defined as .

• Assume , we define the mean function by

 μ(t)=E[X(t)],

and the second moment operator by

 C(x)=E[⟨X,x⟩X].
• By Mercer’s theorem, we have the following expression of the symmetric positive-definite compact operator ,

 C(x)=∞∑j=1λj⟨vj,x⟩vj,

where

are the positive eigenvalues (in strictly descending order) and

are the corresponding normalized eigenfunctions, so that

and .

• The Hilbert–Schmidt norm of a Hilbert-Schmidt operator is defined as:

 ∥Φ∥2S=∑i,j|Φi,j|2,

where , and . is a sequence of orthonormal basis functions. This norm does not depend on the choice of .

## 3 Model, consistency, and algorithm

### 3.1 General setting

Suppose we have a sequence of functions in for each group,

 X(g)1(t),X(g)2(t),…,X(g)ng(t), g=0,1, and n0+n1=N,

where is the group index. We define group mean function and second moment function at lag of the scaled functions as

 E[X(g)k(t)]=μg(t),
 C(h)g(x)=∫E⎛⎝X(g)k(t)X(g)k+h(s)∥X(g)k(t)∥∥X(g)k+h(s)∥⎞⎠x(s)ds,
 C(−h)g(x)=∫E⎛⎝X(g)k+h(t)X(g)k(s)∥X(g)k+h(t)∥∥X(g)k(s)∥⎞⎠x(s)ds,

where . In practice, are unknown, and we can estimate them by the following empirical estimator

 ^C(h)g(x)=∫1ng−hng−h∑k=1(X(g)k(t)/∥X(g)k∥)(X(g)k+h(s)/∥X(g)k+h∥)x(s)ds,
 ^C(−h)g(x)=∫1ng−hng−h∑k=1(X(g)k+h(t)/∥X(g)k+h∥)(X(g)k(s)/∥X(g)k∥)x(s)ds.

Here we assume that is close to zero, so that it is very hard to classify two groups by mean function difference, then we need to check the second moment structure of two groups. An important step is scaling. Without it, objects in the group with higher variability would be more likely to be misclassified into the group with lower variability. The scaling step prevents the magnitude of variability to interfere with detection of the difference of second moment structure. In the following of the article, we assume all functions are already scaled to norm one. If the variability of two groups are significantly different, say, the variability of one group is significantly greater than the other one, we can set a threshold for the norm of the functions, and pre-classify the functional objects by thresholding. More specifically, if the norm of a function exceeds the threshold, then we can classify it into the group with higher variability.

Remark: We compare second moment functions instead of covariance functions, as second moment function incorporates mean function. If the mean functions of the two groups are slightly different, that discrepancy will contribute to the group discrimination.

### 3.2 Independent functions

We assume are independent functions, and suppose is a new object to be classified. Our centroid classifier assigns to if

 D(Y⟨Y,x⟩,C(0)g(x))

 D(Y⟨Y,x⟩,C(0)g(x))=∞∑i,j⟨(Y⟨Y,νi⟩−C(0)g(νi)),νj⟩2,

where are orthonormal basis functions.

Remark: The log-Euclidean metric and the affine invariant Riemannian metric are popular for finite-dimensional covariance matrix. However, the matrix logarithm is not extended to infinite-dimensional trace-class functional operators. The eigenvalue of second moment operator typically converge to zero, making it difficult to extend those distance to functional case. Comparatively, the distance induced by Hilbert-Schmidt norm is well defined for second moment functional operators and can also produce reasonable between-group comparison. This point is also discussed in pigoli2014distances.

In practice, the sample size is limited, so we need to do dimension reduction to extract the most important features that discriminate the two groups, otherwise, the bases which cannot well discriminate the two groups will reduce the classification accurary. In other words, we propose the truncated distance for comparison, defined as

 Dd(Y⟨Y,x⟩,C(0)g(x))=d∑i,j⟨(Y⟨Y,νi⟩−C(0)g(νi)),νj⟩2.

If is specified, we will omit subscript in the notation. We aim to find a series of orthonormal basis , such that the difference between groups is maximized. We propose to use the eigenfunctions of the compact symmetric operator , where . In Theorem 1

, we have shown that, the misclassification probability is partially determined by

evaluated in a -dimensional space, and a larger value of this norm can result in a smaller misclassification probability. Since is a compact symmetric positive-definite operator, and allows the spectral decomposition

 C(t,s)=∞∑i=1λCiνi(t)νi(s),

and by nature of Hilbert-Schmidt norm,

 ∥C(0)0−C(1)0∥S=∞∑i=1λCi.

It is noted that has the same eigenfunction of , and the Hilbert-Schmidt norm of is equal to the summation of eigenvalues of . is positive definite, so we propose to use the eigenfunctions associated with largest eigenvalues for the computation of . In practice, the estimator of is .

The classification algorithm proceeds as follows:

Remark: It is not always good to incorporate more dimensions, as more dimensions bring more variability. We introduce two methods to select : 1). We choose the dimension , such that the approximation accuracy of by its first eigenfunctions, which can be measured by , exceeds a threshold. 2) We can apply cross validation. To be more specific, we try different values of to do classification in the training set, and choose the with the lowest classification rate. The same procedure can also be applied to the other two cases, say, multivariate functional data and correlated functional data, which will be discussed later.

Theorem 1 tells us the classification can be near perfect as the discrepancy between groups are large enough or sample size diverges. In the theorem, we give a upper bound of the misclassification rate. More specifically, by applying Chebyshev and Cauchy-Schwarz inequality, we find that the upper bound can be expressed as a ratio of score variability to the second moment discrepancy.

###### Theorem 1.

Assume and , the misclassification probability satisfies

 P(Π1−g|Πg)≤4∑di,j=1σ(g)ij∑di,j=1⟨(C(0)0−C(0)1)νi,νj⟩2,

where .

In practice, as the sample size increases, we can incorporate a larger dimension , and the term can increase with the sample size. If the variance does not increase very fast with , the misclassification rate will converge to zero. Consider EEG recordings, assume the trajectories admit the following Fourier expansion,

 X(g)k(t)=L∑j=1(a(g)kjcos(jt)+b(g)kjsin(jt)),

where is the length of each epoch. If the number of epochs increase, we can incorporate more basis functions to discriminate and over a wider range of frequency , as typically depicts higher frequency component as increases. If the data of different groups presents discrepancy over a wide range of frequency band, the classification can be near perfect as the sample sizes go to infinity.

### 3.3 Multivariate functional data

Assume we simultaneously observe functions , for each , and the second moment structure of two groups are different for each set, we aim to jointly classify -set multivariate functions . For each set , the mean functions of different groups are assumed to be similar. In EEG recordings, can be considered as the number of channels, is the time argument of epochs, and is the index of epoch.

Define the second moment and cross second moment as

 cg,m1m2(t,s)=E(X(g)km1(t)X(g)km2(s)), m1,m2=1,…,M,

and the estimator of is

 ^cg,m1m2(t,s)=1ngng∑k=1X(g)km1(t)X(g)km2(s), m1,m2=1,…,M.

We propose to compare the weighted concatenated second moment functions, defined as

 Sg(t,s)=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝ω2(1)c(g)11ω(1)ω(2)c(g)12⋯ω(1)ω(M)c(g)1Mω(1)ω(2)c(g)21ω2(2)c(g)22⋯ω(2)ω(M)c(g)2M⋮⋮⋱⋮ω(1)ω(M)c(g)M1ω(2)ω(M)c(g)M2⋯ω2(M)c(g)MM⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,

where the weight should depend on the discrepancy of the th set of functions. Then by the same argument, the basis function used for the comparison of the concatenated second moment should capture the main discrepancy

 ∥C0(x)−C1(x))∥S,

where . Therefore, we propose to use the eigenfunctions of to compute the distance. The classification procedure is summarized in Algorithm 2.

In this case, the discrepancy and variability come from multiple sources. We stress that if at least for one set of functions, the discrepancy between the two groups goes to infinity, and the variability of the scores do not go to infinity very fast, then we can get perfect classification. The result is illustrated in Theorem 2.

###### Theorem 2.

Assume and , the misclassification probability satisfies

 P(Π1−g|Πg)≤4M2∑di,j=1∑m1,m2σ(g)m1m2,ij∑di,j=1⟨(C0−C1)ψi,ψj⟩2,

where .

Remark: The value of weight function is large if we can discriminate the th set of functions well. For examples, assume is a increasing function, then we can set or , where is the probability of misclassifying an object in group into group based on the th set only, which can be estimated by a cross-validation procedure.

### 3.4 Correlated functional data

When functions are not independent, we may further compare with . Assume are correlated across , and we collect a sequence of sample consecutively , where is number of samples to be jointly classified. One way to do joint classification is to apply the technique proposed for multivariate functional data. Since correlation across curves usually decay very fast, the correlation between concatenated functions should be less pronounced. However, if the second moment structure of two groups are only different in the auto second moment at some specific, rather than all, lags, it is not helpful to consider the second moments at lags where no discrepancy is present. Here we propose another method, which check the auto second moment functions separately.

The estimators of and are respectively

 ^C(h)g(x)=∫1ng−hng−h∑k=1X(g)k(t)X(g)k+h(s)x(s)ds,
 ^C(−h)g(x)=∫1ng−hng−h∑k=1X(g)k+h(t)X(g)k(s)x(s)ds.

We propose to compare the operators , , . The basis function used for comparison of second moments should capture most of the discrepancy

 ∥C(h)0+C(−h)0−(C(h)1+C(−h)1)∥S.

Similar with the independent case, we note that the first few eigenfunctions of account for most of the above discrepancy, and in order to find the most important eigenfunctions, we apply fPCA to the operator

 Rh=(C(h)0+C(−h)0−(C(h)1+C(−h)1))2,

which is symmetric and positive definite, and employ the first eigenfunctions for comparison.

In practice, we can only consider finite lags. Suppose we consider the comparison up to lag , we need to consecutively collect functions for each group, i.e. , and classify them jointly. Let

 κ(h)g=C(h)g+C(−h)g

and

 ^κy,h(x)=1p+1−hp+1−h∑k=1Yk⟨Yk+h,x⟩ +1p+1−hp+1−h∑k=1Yk+h⟨Yk,x⟩,

The second moments at different lags usually have different contributions to classification, so we consider the weighted classifier, the procedure is summarized in Algorithm 3.

When the functions are correlated, we need to select the the best lags so that the second moment functions are significantly different. When sample size is limited, we propose to apply Monte-Carlo cross-validation procedure (see e.g. xu2001monte). In each step, we randomly separate the sample into two sets, say, training set and testing set. We classify the objects in the testing set based on only, which is estimated from the training set, and the obtain the misclassification rate. We can repeat the same procedure for multiple times and if the average misclassification rate does not exceeds a pre-specified threshold, then we incorporate the corresponding into classification.

The major contribution of Theorem 3 is showing that the classification is perfect if, at least for one lag , the second moment discrepancy goes to infinity, and the variability, which comes from at multiple lags, is not very large.

###### Theorem 3.

Assume and , and the weight function satisfies , where , are two positive constants, the misclassification probability satisfies

 P(Π1−g|Πg)≤4TpE{maxh(∑dhi,j=1⟨(^κy,h−κ(h)g)(νh,i),νh,j⟩2)}∑ph=0∑dhi,j=1⟨(κ(h)0−κ(h)1)(νh,i),νh,j⟩2,

where is a constant determined by and .

Remark: We should set a large value to if we can discriminate and well. We can apply cross-validation to obtain the classification rate based on only, and set , where is an increasing function. In simulation, as the discrepancy at one lag is more pronounced than other lags, we find that the following weight function works well

 Wo(h)={1,if p(h)=maxh′p(h′)0,otherwise

When the functions can be discriminated equally well at multiple lags, we need to put equal weights for those lags.

In Theorem 4, we show the consistency of the estimators. The consistency property also holds in the previous two cases.

###### Theorem 4.

Suppose is an -approximable sequence (see e.g. hormann2010weakly), we have

 E{∥(^κ(h)0−^κ(h)1)2−(κ(h)0−κ(h)1)2∥S}→0

as .

Remark: According to lemma 2.2 and 2.3 in horvath2012inference, we can conclude from Theorem 4 that the estimated eigenvalues are consistent, and the estimated eigenfunctions are consistent up to a constant sign. Under independence, the sequence will be naturally -approximable.

### 3.5 Classification among multiple groups

Assume we have groups , where , the procedure can be naturally extended to this case. In such case, we can do pairwise classification for different pairs of groups. More specifically, We first compare the first two groups and , if the new object is classified into , then we further do pairwise comparison between and . We can repeat the comparison until we find the group whose centroid is the closest to . The classification algorithm for multi-group independent functions is summarized in Algorithm 4.

Remark: We only discuss the extension in independent case, however, the algorithm can be extended in the cases of multivariate and correlated functions in a similar way.

## 4 Simulations

### 4.1 General setting

To study the finite sample property of the method, we simulated two groups () of independent or correlated functions in a -dimensional space spanned by the first Fourier basis or B-spline basis functions (), The two groups have the same mean function, which were set to be zero, and different covariance functions. The functions in each group have the following expansion,

 X(g)i(tj)=D∑k=1ξ(g)ikϕk(tj)+e(tj),

where are th basis function, and , , where is the number of equally-spaced discrete grids, and .

### 4.2 Independent functions

For independent functions, we tried two different classes of basis functions for simulation. In the first setup, the functions were generated by 21 -spline functions. The scores of two groups follow normal distribution,

 (ξ(g)k:k=1,…,21) ∼N(0,σbg).

and were generated in two steps: First set

 σb1=I3⊗diag(a,a,b,b,b,a,a),σb2=I3⊗diag(b,b,a,a,a,b,b),

and is a identity matrix, then replaced the first and last element of and with zero to avoid boundary effect. In the other setup, the functions were generated by 21 Fourier functions. The scores of two groups follow the same normal distribution. For different pairs of , we compared four methods: 1) our new method (SMFC); 2) projection method (denoted by PJ, chiou2007functional); 3) functional linear classifier (denoted by FLC, delaigle2012achieving); 4) functional quadratic classifier (denoted by FQC, delaigle2013classification).

We simulated 200 curves for each group, and classified 100 curves by the other 100 curves. We repeated the classification procedure 100 times and calculated the average classification rates (ACR) for each group, which are presented in Table 1,2.

In the simulation, we can see that, even though the projection method and the functional quadratic classifier also check the second moment structure, sometimes they still cannot distinguish groups with different second moments. That is because, in those methods, we compare groups using the group-wise principal components, which are not guaranteed to capture the discrepancy. Comparing with those methods, we use the basis functions that account for most of the discrepancy between groups, and that makes our method being able to detect the discrepancy more efficiently, especially when the difference in covariance structure is not pronounced.

### 4.3 Correlated functions

To analyze the finite sample properties of the new method in the case of correlated functional data, we discriminated a FAR() process and a sequence of i.i.d. random functions by the proposed method. Both sequences were generated in a 21-dimensional sub-space spanned by the first 21 Fourier basis, , where each function were generated in the same expansion as described in the previous section. The scores of the first sequence follows a VAR() process, i.e.

 ξ(1)i=Aξ(1)i−1+ϵi,

where is i.i.d. 21-dimensional random error vectors following , where is a diagonal matrix with diagonal elements . Two types of were chosen, namely,

 σ1=(1.2−D:D=1,…,21),σ2=(1/D:D=1,…,21).

The scores of the second sequence identically follow the normal distribution .

To show the usefulness of the new method for correlated data, we simulated the two sequences such that they have similar covariance functions but different auto-covariance functions. To be specific, we set , where and is the 21-dimensional identity matrix. Then the covariance function of the FAR() process should be . Then set , consequently the covariance function of the second sequence is the same as that of the first sequence. The auto-covariance function of the two sequences at lag () are

 c(0)h(t,s)=ah(1−a2)−1F(t)Σ1FT(t),c(1)h(t,s)=0.

We tried three weight functions , , and , and compared the corresponding classification rates. 500 functions were simulated as training set, and 100 more functions were simulated as testing set. We repeated the procedure 100 times and the average classification rates are shown in Table 5–7.