be a vector space, typically , but it can also be an infinite dimensional Polish space (i.e. a separable complete metric space). In Section 8 it is a separable Banach space. In the binary classification problem, the aim is to recover the unknown class associated with an observation
. In other words, we seek a classification rule (also called a classifier), i.e. a measurable map. This rule gives an incorrect classification for the observation if . The underlying probabilistic model, which makes a performance measure of possible, is set by distributions () on . For , the distribution is the distribution of the data having label equal to
. In this framework, the weighted sum of the probabilities of misclassification is defined by
In a Bayesian framework, the weight reflects the marginal distribution of the label . In our approach, we do not want this marginal distribution to set the importance of the different errors. In the many applications we have in mind, such as tumour detection from an MRI signal, the class that appears most frequently is not necessarily the one for which a classification error has the most serious medical consequences. This is the reason why we seek a procedure that minimises and not its Bayesian counterpart: .
Here, we do not want to study the influence of the weight in the problem. The main reason is that our results, to be given later, are simpler to formulate and to understand when , and that the problem we are interested in is the one that arises from the high dimension of the space , not the one related to the use of . Therefore, in the rest of the present paper we will make the assumption that . In the sequel, we will set . This is a usual assumption (see for example Bickel and Levina bickel:2004fk ).
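To make the equal-weight criterion concrete, here is a minimal sketch (the function name and the classifier are ours, purely illustrative) of the balanced empirical risk that corresponds to taking equal weights: the two class-wise error rates are averaged, so class frequencies play no role.

```python
import numpy as np

def balanced_risk(rule, X0, X1):
    """Equal-weight misclassification risk: the average of the two class-wise
    error rates, so that class frequencies do not set the importance of errors."""
    err0 = np.mean([rule(x) != 0 for x in X0])  # error rate on class-0 sample
    err1 = np.mean([rule(x) != 1 for x in X1])  # error rate on class-1 sample
    return 0.5 * err0 + 0.5 * err1
```

Note that, unlike the Bayesian risk, this quantity is unchanged if one class sample is much larger than the other.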
In the case where it is known that, if and are equivalent, then the rule that minimises is given by
is the logarithm of the likelihood ratio between and (i.e. of the Radon–Nikodym derivative).
In real life problems, is unknown, and the only thing we have is a substitute for it. It is then natural to plug it into (2) and to use the classifier
The natural question that we will investigate in this article is the following:
Is there a simple way to relate the excess risk to a measure of the log-likelihood "perturbation": ?
In other words, we seek an upper bound and a lower bound on by a simple-to-study real valued function of . In this article we focus on the gaussian case and, unless the contrary is explicitly stated, and will be equivalent gaussian probabilities on . We investigate Problem 1, and the answer we obtain in the general case leads to the bound
while for a gaussian measure , where is a constant only depending on . In some particular cases (when and are affine) we are able to give an explicit constant and an exponent higher than (exponent ).
If we suppose that and have equal covariance, then it is known that is affine and it is natural to take an affine . The corresponding procedure is usually called Linear Discriminant Analysis (LDA) (even if the underlying procedure is affine). If we suppose that and have different covariance, then is quadratic and it is natural to take a quadratic . The corresponding classification procedure will be called Quadratic Discriminant Analysis (QDA).
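The two plug-in rules just described can be sketched as follows (a minimal illustration with names of our own; both rules threshold the plugged-in log-likelihood ratio of two gaussians at zero, assuming equal weights):

```python
import numpy as np

def lda_rule(x, mu0, mu1, sigma_inv):
    """Affine (LDA) plug-in rule for two gaussians with a common covariance:
    classify x as class 1 when the plugged-in log-likelihood ratio is positive."""
    w = sigma_inv @ (mu1 - mu0)          # linear part of the affine rule
    b = -0.5 * (mu1 + mu0) @ w           # constant part of the affine rule
    return int(x @ w + b > 0)

def qda_rule(x, mu0, mu1, s0_inv, s1_inv, logdet0, logdet1):
    """Quadratic (QDA) plug-in rule for gaussians with different covariances."""
    q0 = -0.5 * (x - mu0) @ s0_inv @ (x - mu0) - 0.5 * logdet0
    q1 = -0.5 * (x - mu1) @ s1_inv @ (x - mu1) - 0.5 * logdet1
    return int(q1 - q0 > 0)
```

When the two covariance estimates coincide, the quadratic terms cancel and the QDA rule reduces to the affine one.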
The corresponding procedures are also known as plug-in procedures: is plugged into (2) in order to obtain . Plug-in procedures have been studied in a different context (see for example Audibert:2006fk and the references therein), but our approach differs from those works.
The interest of Problem 1 in the gaussian setting is understood by addressing the problem of finding a good substitute for . For example, in many applications, we are given a learning set consisting of random variables drawn independently from and drawn from . The problem of finding a good substitute for then becomes an estimation problem whose error measure is given by the answer to Problem 1. Moreover, our answer to Problem 1 given below gives rise to a natural way to estimate in high dimension, which is the answer to what we call Problem 2:
Given a learning set, construct in order to get a satisfactory classification procedure in high dimension: a procedure that can be justified theoretically and by numerical experiments.
Classical methods of classification break down when the dimensionality is extremely large. For example, Bickel and Levina bickel:2004fk have studied the poor performance of Fisher discriminant analysis. The number of parameters to learn in order to build a classification rule seems to be responsible for this poor performance. In the sequel we shall give theoretical non-asymptotic results that emphasise this poor performance. To overcome it, Bickel and Levina bickel:2004fk propose to use a rule which relies on feature independence, while Fan and Fan Fans:2007vn propose to select the interesting features with a multiple testing procedure. Bickel and Levina give a theoretical study of a particular LDA procedure (i.e. an LDA procedure based on a particular estimator ); they do not study the QDA procedure.
The selection of interesting features constitutes a reduction of the dimension of the space on which the classification rule acts. Feature selection is widely used in high dimensional classification, and the procedures used to select interesting features are often motivated by theoretical results (see Fans:2007vn ). Unfortunately, these theoretical results are based on the following two postulates. On the one hand, features can be divided a priori into two parts, an interesting one and a non-interesting one. On the other hand, selecting the interesting features is necessary and sufficient to get a good classification rule. Since these postulates reflect nothing but a relatively clear intuition, we would like to give an analysis of the classification risk in order to justify a feature selection method based on multiple hypothesis testing.
Thresholding techniques are widely used in the non-parametric regression framework (see Candes:2006lk for an introduction to thresholding techniques), and as we shall see, these techniques can be used to give an answer to Problem 2. Moreover, we believe that our answer to Problem 1 sheds light on the simple link that exists between nonparametric regression and the classification problem.
Functional data analysis is the study of data that live in an infinite dimensional functional space. Curve classification is thus one of the problems it deals with. Since Grenander:1950fk , functional data analysis has undergone further developments, especially in the context of classification (see for example Berlinet:2005at and the references therein). In the gaussian setting, it is rather natural to expect results that are dimension-free and that can be applied to any abstract Polish space. Hence, our answer to the problem will be given in terms of norms, with a gaussian measure, and since the constants involved in our theoretical results do not depend on the dimension, the extension from to more abstract spaces is straightforward.
Let us introduce some notation. In the whole article, is a gaussian measure on with mean and covariance , is the zero mean gaussian measure with covariance , and is the gaussian measure on with mean zero and covariance ; is a probability measure on ; will be the norm of the orthogonal projection in of the vector onto the hyperplane orthogonal to ; if , will be the norm of the linear map . We shall use both the fact that if and is a gaussian measure with mean zero and covariance , then ; and the fact that is a natural measure that can be extended to an infinite dimensional framework. The symmetric difference between two subsets and of is denoted by ; it is the set of all elements that belong to exactly one of the two sets. If is a matrix of , will be the Hilbert–Schmidt norm of the matrix , the trace of , and will be given by for all .
This article is organized as follows. We give the main theoretical results -leading to a solution to Problem 1- for the LDA procedure in Section 2, and for the QDA procedure in Section 3. In Section 4 we give our algorithm for high dimensional data classification and the theoretical result related to it. This constitutes our contribution to Problem 2 in the light of our solution to Problem 1. In Section 5 we apply this algorithm to curve classification. In Section 6 we introduce a geometric measure of error and derive its link with the excess risk. Section 7 is devoted to the proofs of the results given in Section 2, and Section 8 to the proofs of the results given in Section 3 and to possible generalisations.
2 Affine perturbation of affine rules
2.1 A solution to Problem 1
2.1.1 Main result
In this section, , is a symmetric positive definite matrix and . Under these hypotheses, is affine on :
and . In this section, we restrict ourselves to an affine substitute ; we denote by and the corresponding substitutes of and . We then decide that comes from if it is in
One can define the angle in between and by
This angle will play a very important role in the sequel. We obtain the following solution to Problem 1.
Let and be two vectors, and be defined by substituting and for and in (3). Let and be two gaussian measures on with the same covariance and with respective means and .
If is the subset defined by (4), we have:
If and ( is defined by (5)), then
is the measure of an area between two hyperplanes obtained by a rotation of angle . The proof also uses the inequality
which defines . We call the learning error; it is the probability of making a wrong classification with while making a good classification with the optimal rule . We will use this measure of error, and motivate it more deeply, in Section 6. Let us now comment on Theorem 2.1.
2.1.2 General comments
If we note
Hence, in the sequel, we will speak of affine perturbations of the optimal rule. The preceding theorem results from the study of affine perturbations of affine rules.
The case where will be studied later, but we can already note that in this case, Theorem 2.1 yields
The quantity measures the theoretical separation of the data. Indeed, it is the distance between and , defined by , that measures this separation: it is known that , which implies
Thus, , and the data cannot be distinguished by any rule. The data tend to be perfectly separated as . In this case, and
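The classical closed form behind this discussion can be sketched numerically. For two gaussians with a common covariance and equal weights, the risk of the optimal rule is the standard normal tail evaluated at minus half the Mahalanobis distance between the means (a standard fact; the names below are ours):

```python
import math
import numpy as np

def bayes_risk_equal_cov(mu0, mu1, sigma):
    """Risk of the optimal rule for N(mu0, sigma) vs N(mu1, sigma) with equal
    weights: Phi(-d/2), where d is the Mahalanobis distance between the means."""
    diff = np.asarray(mu1, float) - np.asarray(mu0, float)
    d = math.sqrt(diff @ np.linalg.solve(sigma, diff))      # Mahalanobis distance
    return 0.5 * (1.0 + math.erf(-d / (2.0 * math.sqrt(2.0))))  # Phi(-d/2)
```

The risk tends to 1/2 as the distance tends to zero (indistinguishable data) and to 0 as the distance diverges (perfectly separated data).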
Also note that in the infinite dimensional setting, two gaussian measures and are either orthogonal (there exists a Borel set such that ) or equivalent (i.e. mutually absolutely continuous), and the latter case occurs if and only if is finite.
However, while measures the estimation error, the terms in the upper bounds (6) and (7) are linked with the proximity of the measures and . When is large, the data are well separated and the terms in (10) measure the impact of this separation on the excess risk. We believe that when tends to , is linked to the error measure used in the proof (defined by (8)). Indeed, it is not correct to think that the classification problem is harder (in the sense of the excess risk) when the data are not well separated: a straightforward computation leads to
As we shall see in the sequel (see Theorem 6.1), behaves almost like the excess risk if and only if does not tend to .
The learning set has to be used to construct estimators and of and . The preceding theorem allows us to quantify what intuition clearly indicates: a good estimation of the parameters and (or, more indirectly, of and ) leads to a good classification rule. These estimators must lead to a small excess risk, and by the preceding theorem,
where is the learning set distribution.
It seems that little is known about the theoretical behaviour of the LDA procedure (a plug-in procedure) with respect to the optimal rule (the Bayes rule). The result that is classically used (see for example Anderson and Bahadur Anderson:1962fk ) to show the consistency of an LDA rule using estimators and is that the probability that an observation (in that case coming from class ) falls into (and is assigned to class ) is
where is the -field generated by the learning set, and is a centered gaussian random vector of with covariance . Note that the proof of (12) follows from a straightforward calculation. We believe that a direct analysis of this error term misses the geometrical aspect of the problem. In addition, this error has to be compared with the lowest possible error . Note that for the LDA procedure in a high dimensional framework, an analysis of the worst case excess risk has been carried out with (12) by Bickel and Levina bickel:2004fk for a particular choice of and . Our theorem, because it is intrinsic to the classification procedure, is markedly different from the type of result that they obtain. In particular, it will allow us to establish a revealing link between dimensionality reduction and thresholding estimation.
2.1.3 The constant part of the perturbation
The error due to the constant part of the perturbation ( in equation (9)) is measured by
In order to give a first simple analysis of this term, we are going to suppose that and are independent. This independence can be obtained by keeping one part of the learning set for the estimation of and another part for the estimation of . In that case, if observations of the learning set were used to construct , and if ( is the empirical mean of the observations of group ), then a straightforward calculation leads to
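The sample-splitting device described above can be sketched as follows (a minimal illustration with names of our own: each class sample is split into two disjoint halves, so that the estimator used for the constant part of the rule is independent of the one used for the linear part):

```python
import numpy as np

def split_estimates(X0, X1):
    """Estimate each class mean twice, from disjoint halves of the sample,
    so that the two resulting pairs of estimators are independent."""
    n0, n1 = len(X0) // 2, len(X1) // 2
    mu0_const, mu0_lin = X0[:n0].mean(axis=0), X0[n0:].mean(axis=0)
    mu1_const, mu1_lin = X1[:n1].mean(axis=0), X1[n1:].mean(axis=0)
    return (mu0_const, mu1_const), (mu0_lin, mu1_lin)
```

The price of the split is that each estimator uses only half of the observations, but the independence makes the analysis of the constant-part error straightforward.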
Ultimately, the difficulty of the problem does not come from the constant part of the perturbation, but from the linear part.
The conditions under which the second inequality (7) of the theorem holds are easily satisfied. The second condition is that . It is not difficult to satisfy if and are close enough to each other. The first one holds if the second does and if we have:
If for example and the learning set is composed of observations used solely for the estimation of , then, given the rest of the learning set, and the preceding condition is satisfied with probability
2.1.4 The linear part of the perturbation
As we shall explain in the proof of Theorem 2.1, the angle defined by (5) measures quite well the error due to the linear part of the perturbation. However, the upper bound given in the preceding theorem is not sharp everywhere. Indeed, if , and , the error is zero while the bound (6) can be arbitrarily large. We believe that the study of methods designed to estimate a direction (a parameter on the sphere ) in a high dimensional setting is required. We only want to give the link between the problem of estimating as a vector of and the problem of estimating in order to get a small . In addition, this invariance of the error under dilation only exists in the direction , which is unknown, and it seems quite tricky to make direct use of it.
Let us give a simple example to illustrate the interest of the link between estimation and learning.
Let , suppose , and that is known. In the estimation problem of for classification, we wish to recover from the observation , and the error is measured by
In Example 2.1 the problem is exactly the one we encounter in the regression framework when estimating from noisy observations of with an error measured by a norm. Suppose now that we want to let grow to infinity. If the coefficients of decrease sufficiently fast, for example if with , then (see for example Candes:2006lk ) it is possible to obtain a good statistical estimation of by setting to zero the coefficients that are, in absolute value, under a threshold. This is a thresholding estimation, and we shall use this type of procedure in Section 4. In the case where we observe from the distribution (or equivalently from the distribution ) and is known, the problem can be reduced to the preceding particular case thanks to the transformation . When is unknown, the parallel with the estimation framework is more delicate because the error depends on .
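A minimal sketch of the hard-thresholding estimator alluded to above (the function name and the threshold are ours; when the true coefficients decay fast, most of the small noisy coefficients are discarded):

```python
import numpy as np

def hard_threshold(coeffs, t):
    """Hard thresholding: keep each coefficient whose absolute value exceeds t,
    set the others to zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(coeffs) > t, coeffs, 0.0)
```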
Replacing coefficients by zero in the regression framework of Example 2.1 is equivalent to reducing the dimension of the space on which the chosen classification rule acts. Selecting the significant coefficients of is equivalent to finding the directions for which is large. This is almost equivalent to finding the direction in which a theoretical version of the ratio between the inter-class variance and the intra-class variance is large. This type of heuristic with empirical quantities was used by Fisher Fisher:1936fk , whose strategy is to maximise the Rayleigh quotient (see for example (Friedman:2001wq, )). The point is that the use of empirical quantities in high dimension can be catastrophic (see the next subsection).
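The Rayleigh-quotient heuristic can be sketched with theoretical (non-empirical) quantities as follows (names are ours; the maximising direction is proportional to the inverse covariance applied to the mean difference):

```python
import numpy as np

def rayleigh_quotient(w, mu0, mu1, sigma):
    """Theoretical inter-class / intra-class variance ratio in direction w."""
    between = (w @ (mu1 - mu0)) ** 2     # squared mean separation along w
    within = w @ sigma @ w               # variance of the data along w
    return between / within

def fisher_direction(mu0, mu1, sigma):
    """Maximiser of the Rayleigh quotient: proportional to sigma^{-1}(mu1-mu0)."""
    return np.linalg.solve(sigma, mu1 - mu0)
```

Fisher's procedure replaces the theoretical means and covariance by their empirical counterparts, which is precisely what becomes catastrophic in high dimension.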
2.2 Procedures to avoid in high dimension
We are going to give two results that lead to the following precepts for the problem of estimating . When giving a solution to Problem 2:
one should not try to estimate the full covariance matrix from the data,
one should restrict the possible values of to a (sufficiently small) subset of .
These precepts have been known for some time, but we give precise non-asymptotic results emphasising them. The first one is a consequence of Proposition 2.1 below, while the second results from Proposition 2.2.
These two propositions arise from the use of a more geometric error measure, the learning error , which has already been defined by (8) and which shall be studied in more detail in Section 6. In fact, for one who knows a little about gaussian measures, it is an easy geometric exercise to obtain the following lower bound
for all measurable . Thus, it suffices to lower bound the learning error and to use (13) to get a (good) lower bound on the excess risk when cannot be arbitrarily close to zero. This is what we shall do. For the case where the distributions and are almost indistinguishable (), we refer to the discussion in Section 6.
2.2.1 One should not try to identify the correlation structure
Let us recall that if is a positive semi-definite matrix, one can define its generalised inverse, also called the Moore–Penrose pseudo-inverse: . This generalised inverse arises from the decomposition . On , is null, and on , equals the inverse of (i.e. is the restriction of to ).
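A sketch of this generalised inverse via the spectral decomposition (assuming a symmetric positive semi-definite matrix; the numerical tolerance is ours):

```python
import numpy as np

def generalised_inverse(S, tol=1e-12):
    """Moore-Penrose inverse of a symmetric positive semi-definite matrix:
    zero on the kernel of S, the usual inverse on its range."""
    vals, vecs = np.linalg.eigh(S)
    inv_vals = np.array([1.0 / v if v > tol else 0.0 for v in vals])
    return vecs @ np.diag(inv_vals) @ vecs.T
```

On a singular matrix the ordinary inverse does not exist, but the generalised inverse still satisfies the defining identity S G S = S.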
Before we prove this proposition, let us comment on it in a few words.
Comment. As a particular application of this proposition, we see that the Fisher rule performs badly when , a fact already given in bickel:2004fk , but in a different form (asymptotic, and not a direct comparison of the risk with the Bayes risk). Many alternatives to the estimation of the correlation structure can be used, based for example on approximation theory for covariance operators, together with a model selection procedure or a more sophisticated aggregation procedure. Much work has already been done in this direction; see for example Bickel:2007fk and the references therein. The approximation procedure has to be linked with a statistical hypothesis, as is the case when stationarity assumptions are made that lead to a Toeplitz covariance matrix (i.e. with a -periodic sequence). These matrices are circular convolution operators and are diagonal in the discrete Fourier basis, where
This is roughly the type of harmonic analysis that is used in Bickel and Levina bickel:2004fk and combined with an approximation in bestbasis . Under assumptions such as commutation (or quasi-commutation) of the covariance with a given family of projections, the covariance matrix can be searched for in the set of operators given by a spectral density. This leads to a huge reduction of the number of parameters to estimate. Let us finally note that the use of the harmonic analysis of stationarity in curve classification becomes very interesting when one considers the larger class of group-stationary processes (see Yazici:2004mz ) or semi-group stationary processes (see Girardin:2003rt ).
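The fact that circular convolution operators are diagonal in the discrete Fourier basis can be checked numerically; a sketch (the symbol `c` below is an arbitrary symmetric sequence of our own, chosen so that all eigenvalues are positive and the matrix is a valid covariance):

```python
import numpy as np

def circulant(c):
    """Circular-convolution (circulant) matrix built from its first column."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

c = np.array([4.0, 1.0, 0.0, 1.0])    # symmetric symbol: a valid covariance
C = circulant(c)
F = np.fft.fft(np.eye(len(c)))        # discrete Fourier basis as a matrix
# Conjugating by the DFT diagonalises C; the eigenvalues are fft(c).
D = F @ C @ np.linalg.inv(F)
```

Estimating such a covariance therefore amounts to estimating the scalar spectral values on the diagonal, instead of the full matrix.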
The proof is based on ideas from Bickel and Levina bickel:2004fk used in their Theorem : if is the identity, there exist -valued random variables forming an orthonormal basis of , and a random vector of , whose properties are the following.
The are independent of each other, independent of , and follows a distribution with degrees of freedom.
For every , is drawn independently and uniformly on the intersection of the unit sphere of with the orthogonal complement of .
The empirical estimator of satisfies:
where if , is the linear operator of that associates to the vector .
When does not necessarily equal , we get, almost surely:
Then, if we define , we have the following equations
For reasons of symmetry (the are drawn uniformly on the sphere), we have, for all subsets of of size :
and we obtain
(by the Cauchy–Schwarz inequality and because the function arccos is decreasing)
This and inequality (13) lead to the desired result. ∎
2.2.2 One should not use a simple linear estimate to get .
Suppose that is a positive definite matrix, and that we are given drawn independently from a gaussian probability distribution with mean and covariance on . Let be the associated empirical mean. Let us take and . Then the classification rule defined by (4) leads to
Before we give a proof, we comment briefly on this result.
Comment. Suppose there exists such that . From the preceding proposition, uniformly over all the possible values of and , the learning error and the excess risk can converge to zero only if tends to . Recall that if no a priori assumption is made on , is the best estimator (according to the mean square error) of . Thus, as in high dimensional vector estimation problems (such as those described in Candes:2006lk ), one should make a more restrictive hypothesis on . We will suppose, in Section 5, that if are the coefficients of in a well chosen basis, then for .
As in the preceding proposition, we will use inequality (13). Thus it is sufficient to show the following:
where is defined by (5). Because the function is decreasing and concave on , it suffices to obtain
On the other hand,
where this last inequality results from the Cauchy–Schwarz inequality. Recall that
where is a standard gaussian random vector of . Thus, we easily obtain
The rest of the proof follows from the following simple fact, which is a consequence of Cochran's theorem and a classical calculation on random variables:
Let , , a gaussian random vector of with mean and covariance . Then
2.3 Case where diverges: well separated data.
We shall now briefly consider the case where the data are well separated: the case where diverges. In the next theorem, we assume that tends to infinity.
Suppose that ( is defined by (5)), and that
when tends to infinity. We then have