Robust multiple-set linear canonical analysis based on minimum covariance determinant estimator

11/06/2018
by   Ulrich Djemby Bivigou, et al.

By deriving influence functions related to multiple-set linear canonical analysis (MSLCA), we show that the classical version of this analysis, based on empirical covariance operators, is not robust. We then introduce a robust version of MSLCA by using the MCD estimator of the covariance operator of the involved random vector. The related influence functions are derived and shown to be bounded. Asymptotic properties of the introduced robust MSLCA are obtained and are used to propose a robust test for mutual non-correlation. This test is shown to be robust by studying the related second order influence function under the null hypothesis.

1 Introduction

Many multivariate statistical methods are based on empirical covariance operators. That is the case for multiple regression, principal components analysis, factor analysis, linear discriminant analysis, linear canonical analysis, multiple-set linear canonical analysis, and so on. However, these empirical covariance operators are known to be extremely sensitive to outliers, an undesirable property that makes the preceding methods themselves sensitive to outliers. To overcome this problem, robust alternatives to these methods have been proposed in the literature, mainly by replacing the aforementioned empirical covariance operators with robust estimators. In this vein, robust versions of multivariate statistical methods have been introduced, especially for multiple regression ([21]), principal components analysis ([8],[10],[16],[22]), factor analysis ([19]), linear discriminant analysis ([6],[9],[14]) and linear canonical analysis ([5],[24]). Multiple-set linear canonical analysis (MSLCA) is an important multivariate statistical method that analyzes the relationship between more than two random vectors, thus generalizing linear canonical analysis. It was introduced many years ago (e.g., [12]) and has since been studied from different angles (e.g., [15],[23],[25]). A formulation of MSLCA within the context of Euclidean random variables was given recently ([18]) and made it possible to obtain an asymptotic theory for this analysis when it is estimated by using empirical covariance operators. To the best of our knowledge, this is the only estimation of MSLCA that has been tackled in the literature, despite the fact that it is known to be nonrobust, being sensitive to outliers. So, there is a real interest in introducing a robust estimation of MSLCA, as was done for the other multivariate statistical methods. This can be done by using robust estimators of the covariance operators of the involved random vectors instead of the empirical covariance operators. Among such robust estimators, the minimum covariance determinant (MCD) estimator has been extensively studied ([1],[2],[3],[7]), and it is known to have good robustness properties. Its asymptotic properties have also been obtained ([1],[2],[3]), mainly under elliptical distributions.

In this paper, we propose a robust version of MSLCA based on the MCD estimator of the covariance operator. We start by recalling, in Section 2, the notion of MSLCA for Euclidean random variables, and we study its robustness properties by deriving the influence functions of the functionals that lead to its estimator from the empirical covariance operators. It is proved that the influence function of the operator that determines MSLCA is not bounded. In Section 3, we introduce a robust estimation of MSLCA (denoted by RMSLCA) by using the MCD estimator of the covariance operator on which this analysis is defined. Then we derive the influence function of the operator that determines RMSLCA, which is proved to be bounded, and those of the canonical coefficients and the canonical directions. Section 4 is devoted to asymptotic properties of RMSLCA. We obtain limiting distributions that are then used in Section 5, where a robust test for mutual non-correlation is introduced. The robustness properties of this test are studied through the derivation of the second order influence function of the test statistic under the null hypothesis. The proofs of all theorems and propositions are deferred to Section 6.

2 Influence in multiple-set canonical analysis

In this section we recall the notion of multiple-set linear canonical analysis (MSLCA) of Euclidean random variables as introduced by Nkiet [18], and also its estimation based on empirical covariance operators. Then, the robustness properties of this analysis are studied through the derivation of the influence functions that correspond to the functionals related to it.

2.1 Multiple-set linear canonical analysis

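Letting $(\Omega,\mathcal{A},P)$ be a probability space, and $K$ be an integer such that $K\geq 2$, we consider random variables $X_1,\dots,X_K$ defined on this probability space and with values in Euclidean vector spaces $\mathcal{X}_1,\dots,\mathcal{X}_K$ respectively. We then consider the space $\mathcal{X}:=\mathcal{X}_1\times\cdots\times\mathcal{X}_K$, which is also a Euclidean vector space equipped with the inner product $\langle\cdot,\cdot\rangle_{\mathcal{X}}$ defined by:

$$\forall\,\alpha,\beta\in\mathcal{X},\qquad \langle\alpha,\beta\rangle_{\mathcal{X}}=\sum_{k=1}^{K}\langle\alpha_k,\beta_k\rangle_k,$$

where $\langle\cdot,\cdot\rangle_k$ is the inner product of $\mathcal{X}_k$ and $\alpha=(\alpha_1,\dots,\alpha_K)$, $\beta=(\beta_1,\dots,\beta_K)$. From now on, we assume that the following assumption holds:


$(\mathscr{A}_1)$: for $k\in\{1,\dots,K\}$, we have $\mathbb{E}(X_k)=0$ and $\mathbb{E}(\|X_k\|_k^2)<+\infty$, where $\|\cdot\|_k$ denotes the norm induced by $\langle\cdot,\cdot\rangle_k$.


Then, we consider the random vector $X=(X_1,\dots,X_K)$ with values in $\mathcal{X}$, and we can give the following definition of multiple-set linear canonical analysis (see [18]):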


Definition 2.1

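The multiple-set linear canonical analysis (MSLCA) of $X$ is the search of a sequence $(\alpha^{(j)})_{1\leq j\leq q}$ of vectors of $\mathcal{X}$, where $q=\dim(\mathcal{X})$, satisfying:

$$\alpha^{(j)}=\arg\max_{\alpha\in C_j}\mathbb{E}\left(\langle\alpha,X\rangle_{\mathcal{X}}^{2}\right), \tag{1}$$

where $C_1=\left\{\alpha\in\mathcal{X}\,:\,\sum_{k=1}^{K}\mathrm{var}\left(\langle\alpha_k,X_k\rangle_k\right)=1\right\}$ and for $j\geq 2$:

$$C_j=\left\{\alpha\in C_1\,:\,\sum_{k=1}^{K}\mathrm{cov}\left(\langle\alpha^{(r)}_k,X_k\rangle_k,\langle\alpha_k,X_k\rangle_k\right)=0,\ \forall r\in\{1,\dots,j-1\}\right\}.$$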


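A solution of the above maximization problem is obtained from the spectral analysis of an operator that will now be specified. For $(k,\ell)\in\{1,\dots,K\}^2$, let us consider the covariance operators

$$V_{k\ell}=\mathbb{E}\left(X_\ell\otimes X_k\right)=V_{\ell k}^{*}\quad\text{and}\quad V_k:=V_{kk},$$

where $\otimes$ denotes the tensor product such that $x\otimes y$ is the linear map $h\mapsto\langle x,h\rangle y$, and $A^{*}$ denotes the adjoint of $A$. Letting $\tau_k$ be the canonical projection

$$\tau_k:\ \alpha=(\alpha_1,\dots,\alpha_K)\in\mathcal{X}\ \mapsto\ \alpha_k\in\mathcal{X}_k,$$

whose adjoint operator $\tau_k^{*}$ is the map

$$\tau_k^{*}:\ t\in\mathcal{X}_k\ \mapsto\ (0,\dots,0,t,0,\dots,0)\in\mathcal{X}\quad(\text{with } t \text{ in the } k\text{-th position}),$$

we consider the operators defined as

$$\Phi=\sum_{k=1}^{K}\tau_k^{*}V_k\tau_k\quad\text{and}\quad\Psi=\sum_{k=1}^{K}\sum_{\ell\neq k}\tau_k^{*}V_{k\ell}\tau_\ell. \tag{2}$$

The covariance operator $V_k$ is a self-adjoint and positive operator; we assume throughout this paper that it is invertible. Then, it is easy to check that $\Phi$ is also a self-adjoint, positive and invertible operator, and we consider

$$T=\Phi^{-1/2}\,\Psi\,\Phi^{-1/2}.$$

The spectral analysis of this last operator gives a solution of the maximization problem specified in Definition 2.1. Indeed, if $\{\beta^{(1)},\dots,\beta^{(q)}\}$ is an orthonormal basis of $\mathcal{X}$ such that $\beta^{(j)}$ is an eigenvector of $T$ associated with the $j$-th largest eigenvalue $\rho_j$, then we obtain a solution of (1) by taking $\alpha^{(j)}=\Phi^{-1/2}\beta^{(j)}$, and we have $\rho_j=\langle\beta^{(j)},T\beta^{(j)}\rangle_{\mathcal{X}}$. Finally, the MSLCA of $X$ is the family $(\rho_j,\alpha^{(j)})_{1\leq j\leq q}$ obtained as indicated above. The $\rho_j$'s are termed the canonical coefficients and the $\alpha^{(j)}$'s are termed the canonical directions.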


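Note that $T$ can be expressed as a function of the covariance operator $V=\mathbb{E}(X\otimes X)$ of $X$. Indeed, denoting by $\mathcal{L}(\mathcal{X})$ the space of linear maps from $\mathcal{X}$ to itself, and considering the linear maps $f$ and $g$ from $\mathcal{L}(\mathcal{X})$ to itself defined as

$$f(A)=\sum_{k=1}^{K}\tau_k^{*}\tau_kA\tau_k^{*}\tau_k\quad\text{and}\quad g(A)=\sum_{k=1}^{K}\sum_{\ell\neq k}\tau_k^{*}\tau_kA\tau_\ell^{*}\tau_\ell, \tag{3}$$

it is easy to check, by using properties of tensor products (see [11]), that

$$V_k=\tau_kV\tau_k^{*}\quad\text{and}\quad V_{k\ell}=\tau_kV\tau_\ell^{*}, \tag{4}$$

and, therefore, from (2), (3) and (4), it follows that

$$T=f(V)^{-1/2}\,g(V)\,f(V)^{-1/2}.$$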

2.2 Estimation based on empirical covariance operator

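Now, we recall the classical way of estimating MSLCA by using empirical covariance operators (see, e.g., [18]). For $k\in\{1,\dots,K\}$, let $\{X_k^{(i)}\}_{1\leq i\leq n}$ be an i.i.d. sample of $X_k$. We then consider the sample means and empirical covariance operators defined for $(k,\ell)\in\{1,\dots,K\}^2$ as

$$\overline{X}_{k,n}=\frac{1}{n}\sum_{i=1}^{n}X_k^{(i)},\qquad \widehat{V}_{k\ell,n}=\frac{1}{n}\sum_{i=1}^{n}\left(X_\ell^{(i)}-\overline{X}_{\ell,n}\right)\otimes\left(X_k^{(i)}-\overline{X}_{k,n}\right),$$

and $\widehat{V}_{k,n}:=\widehat{V}_{kk,n}$. These allow us to define random operators, with values in $\mathcal{L}(\mathcal{X})$, as

$$\widehat{\Phi}_n=\sum_{k=1}^{K}\tau_k^{*}\widehat{V}_{k,n}\tau_k\quad\text{and}\quad\widehat{\Psi}_n=\sum_{k=1}^{K}\sum_{\ell\neq k}\tau_k^{*}\widehat{V}_{k\ell,n}\tau_\ell, \tag{5}$$

and to estimate $T$ by

$$\widehat{T}_n=\widehat{\Phi}_n^{-1/2}\,\widehat{\Psi}_n\,\widehat{\Phi}_n^{-1/2}. \tag{6}$$

Considering the eigenvalues $\widehat{\rho}_{1,n}\geq\cdots\geq\widehat{\rho}_{q,n}$ of $\widehat{T}_n$, and an orthonormal basis $\{\widehat{\beta}^{(1)}_n,\dots,\widehat{\beta}^{(q)}_n\}$ of $\mathcal{X}$ such that $\widehat{\beta}^{(j)}_n$ is an eigenvector of $\widehat{T}_n$ associated with $\widehat{\rho}_{j,n}$, we can estimate $\rho_j$ by $\widehat{\rho}_{j,n}$, $\beta^{(j)}$ by $\widehat{\beta}^{(j)}_n$ and $\alpha^{(j)}$ by $\widehat{\alpha}^{(j)}_n=\widehat{\Phi}_n^{-1/2}\widehat{\beta}^{(j)}_n$.


The random operator $\widehat{T}_n$ can also be expressed as a function of the empirical covariance operator of the $X^{(i)}$'s that are defined as $X^{(i)}=(X_1^{(i)},\dots,X_K^{(i)})$; this empirical covariance operator is

$$\widehat{V}_n=\frac{1}{n}\sum_{i=1}^{n}\left(X^{(i)}-\overline{X}_n\right)\otimes\left(X^{(i)}-\overline{X}_n\right),$$

where $\overline{X}_n=\frac{1}{n}\sum_{i=1}^{n}X^{(i)}$. Since $\widehat{V}_{k,n}=\tau_k\widehat{V}_n\tau_k^{*}$ and $\widehat{V}_{k\ell,n}=\tau_k\widehat{V}_n\tau_\ell^{*}$, we straightforwardly obtain from (3), (5) and (6):

$$\widehat{T}_n=f(\widehat{V}_n)^{-1/2}\,g(\widehat{V}_n)\,f(\widehat{V}_n)^{-1/2}. \tag{7}$$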
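
The estimation procedure of this subsection can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes each random vector is stored as an `n x p_k` NumPy array, identifies operators with matrices in the canonical basis, and uses hypothetical function names (`inv_sqrt_psd`, `mslca_from_cov`, `empirical_mslca`). The block-diagonal part of the joint covariance matrix plays the role of the operator built from the within-set covariances, and the off-diagonal part plays the role of the operator built from the between-set covariances.

```python
import numpy as np


def inv_sqrt_psd(a):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, u = np.linalg.eigh(a)
    return (u / np.sqrt(w)) @ u.T


def mslca_from_cov(v, dims):
    """MSLCA from a joint covariance matrix v whose blocks have sizes dims."""
    edges = np.concatenate(([0], np.cumsum(dims)))
    phi = np.zeros_like(v)
    for start, stop in zip(edges[:-1], edges[1:]):
        # keep only the within-set covariance blocks
        phi[start:stop, start:stop] = v[start:stop, start:stop]
    psi = v - phi  # between-set covariance blocks
    phi_inv_sqrt = inv_sqrt_psd(phi)
    t = phi_inv_sqrt @ psi @ phi_inv_sqrt
    t = (t + t.T) / 2.0  # remove round-off asymmetry
    rho, beta = np.linalg.eigh(t)
    order = np.argsort(rho)[::-1]  # eigenvalues in nonincreasing order
    rho, beta = rho[order], beta[:, order]
    alpha = phi_inv_sqrt @ beta  # canonical directions, one per column
    return rho, alpha, t


def empirical_mslca(blocks):
    """Classical (non-robust) MSLCA from a list of (n, p_k) sample arrays."""
    x = np.hstack(blocks)
    dims = [b.shape[1] for b in blocks]
    xc = x - x.mean(axis=0)
    v = xc.T @ xc / x.shape[0]  # empirical covariance with 1/n normalisation
    return mslca_from_cov(v, dims)


# Toy usage: three sets, the first two sharing a common latent factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
blocks = [z + rng.normal(size=(500, 2)),
          z + rng.normal(size=(500, 2)),
          rng.normal(size=(500, 3))]
rho, alpha, t_hat = empirical_mslca(blocks)
```

Because the diagonal blocks of the resulting matrix vanish, its eigenvalues sum to zero; with only two sets they come in pairs of opposite sign whose positive members are the usual canonical correlations.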

2.3 Influence functions

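For studying the effect of a small amount of contamination at a given point on MSLCA it is important, as usual in the robustness literature (see [13]), to use the influence function. More precisely, we have to derive expressions of the influence functions related to the functionals that give $T$, $\rho_j$ and $\alpha^{(j)}$ (for $1\leq j\leq q$) at the distribution $\mathbb{P}_X$ of $X$. Recall that the influence function of a functional $S$ at $\mathbb{P}$ is defined as

$$\mathrm{IF}(x;S,\mathbb{P})=\lim_{\varepsilon\downarrow 0}\frac{S\big((1-\varepsilon)\mathbb{P}+\varepsilon\delta_x\big)-S(\mathbb{P})}{\varepsilon},$$

where $\delta_x$ is the Dirac measure putting all its mass in $x$.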


First, we have to specify the functionals related to , and (for ) and their empirical counterparts. Let us consider the functional given by

where is the functional defined as

Applying this functional to the distribution of gives and, therefore, . For , denoting by (resp. ; resp. ) the functional such that is the -th largest eigenvalue of (resp. the associated eigenvector; resp. ), we have , and .


Furthermore, denoting by the empirical measure corresponding to the sample , we have

These functionals are to be taken into account in order to derive the influence functions related to MSLCA. We make the following assumption:


: For all , we have , where   denotes the identity operator of .


Then, we have the following theorem that gives the influence function of .

Theorem 1

We suppose that the assumptions and hold. Then, for any vector we have:

(8)

As determines MSLCA, it is important to ask whether its influence function is bounded. If so, we say that MSLCA is robust because it would mean that a contamination at the point has a limited effect on . The following proposition shows that is not bounded. We denote by the operator norm defined as .
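
The boundedness question raised here can be explored numerically with a finite-difference version of the influence function. The sketch below is only an illustration under stated assumptions: the uncontaminated covariance matrix, the block sizes and the contamination direction are made up, the mean is taken as known and equal to zero (so that contaminating with a point mass at `x` changes the covariance to `(1 - eps) * v + eps * outer(x, x)`), and the helper names `t_operator` and `difference_quotient` are hypothetical.

```python
import numpy as np


def t_operator(v, dims):
    """Matrix of the MSLCA operator built from a joint covariance matrix v."""
    edges = np.concatenate(([0], np.cumsum(dims)))
    phi = np.zeros_like(v)
    for a, b in zip(edges[:-1], edges[1:]):
        phi[a:b, a:b] = v[a:b, a:b]  # within-set blocks
    w, u = np.linalg.eigh(phi)
    s = (u / np.sqrt(w)) @ u.T  # inverse square root of the block-diagonal part
    return s @ (v - phi) @ s


def difference_quotient(v, dims, x, eps=1e-6):
    """Finite-difference approximation of the influence of a point mass at x."""
    v_eps = (1.0 - eps) * v + eps * np.outer(x, x)
    return (t_operator(v_eps, dims) - t_operator(v, dims)) / eps


# Assumed toy model: two sets of dimension 2 with one correlated pair.
dims = [2, 2]
v = np.eye(4)
v[0, 2] = v[2, 0] = 0.5
direction = np.array([1.0, 0.0, 1.0, 0.0]) / np.sqrt(2.0)

# Spectral norm of the quotient for contamination points farther and farther away.
norms = [np.linalg.norm(difference_quotient(v, dims, c * direction), 2)
         for c in (1.0, 10.0, 100.0)]
```

In this toy setting the norm keeps growing as the contamination point moves away from the origin, which is the behaviour one expects from an unbounded influence function.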


Proposition 1

We suppose that the assumptions and hold. Then, there exists such that:

We now give, in the following theorem, the influence functions related to the canonical coefficients and the canonical directions.

Theorem 2

We suppose that the assumptions and hold. Then, for any and any , we have:

(ii)    We suppose, in addition, that . Then :

where denotes the identity operator of .


Remark 1

Romanazzi [20] derived influence functions for the squared canonical coefficients and the canonical directions obtained from linear canonical analysis (LCA) of two random vectors. LCA is in fact a particular case of MSLCA obtained when (see [18]). With Theorem 2 we recover the results of [20] when we take . We will only show it below for the canonical coefficients. For , by applying Theorem 2 with , we obtain

(10)

The linear canonical analysis (LCA) of and is obtained from the spectral analysis of (since and ). If we denote by , , the related squared canonical coefficients and canonical vectors, it is known (see Remark 2.2 in [18]) that

(11)

Then, putting and , we deduce from (10), (11) and the equality that:

(12)

which is the result obtained in [20].

3 Robust multiple-set linear canonical analysis (RMSLCA)

It has been seen that MSLCA based on the empirical covariance operator is not robust since is not bounded. There is therefore an interest in proposing a robust version of MSLCA. In this section, we introduce such a version by replacing in (7) the empirical covariance operator by a robust estimator of . More precisely, we use the minimum covariance determinant (MCD) estimator of . We consider the following assumption:


() : the distribution of is an elliptically contoured distribution with density

where is a function having a strictly negative derivative .

We first define the estimator of MSLCA based on MCD estimator of , then we derive the related influence functions.

3.1 Estimation of MSLCA based on MCD estimator

Letting be a fixed real such that , we consider a subsample of size , where , and we define the empirical mean and covariance operator based on this subsample by:

and

We denote by the subsample of which minimizes the determinant of over all subsamples of size . Then, the MCD estimators of the mean and the covariance operator of are and , respectively. It is well known that these estimators are robust and have high breakdown points (see, e.g., [21]). From them, we can introduce an estimator of MSLCA which is also expected to be robust. Indeed, putting

we consider the random operators with values in defined as

where and , and we estimate by

(13)

Considering the eigenvalues of , and an orthonormal basis of such that is an eigenvector of associated with , we estimate by , by and by . This gives a robust MSLCA that we denote by RMSLCA.
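
A rough numerical sketch of this construction is given below. It is not the authors' procedure: the exact MCD requires a search over all subsamples, so the code swaps in a simple heuristic (random starts followed by concentration steps, in the spirit of FAST-MCD) that only approximates the minimizing subsample. The names `c_step_mcd`, `t_operator` and `robust_mslca`, the parameter `gamma` for the retained fraction, and the subsample size `floor(n * gamma)` are assumptions made for the illustration. No consistency factor is applied to the covariance estimate, since rescaling the covariance by a positive constant leaves the MSLCA operator unchanged.

```python
import numpy as np


def t_operator(v, dims):
    """Matrix of the MSLCA operator built from a joint covariance matrix v."""
    edges = np.concatenate(([0], np.cumsum(dims)))
    phi = np.zeros_like(v)
    for a, b in zip(edges[:-1], edges[1:]):
        phi[a:b, a:b] = v[a:b, a:b]
    w, u = np.linalg.eigh(phi)
    s = (u / np.sqrt(w)) @ u.T
    return s @ (v - phi) @ s


def c_step_mcd(x, gamma=0.75, n_starts=20, n_iter=50, seed=0):
    """Approximate raw MCD mean and covariance (heuristic, not an exact search)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    h = int(np.floor(n * gamma))  # assumed subsample size
    best = (np.inf, None, None)
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)
        mean = x[idx].mean(axis=0)
        # small ridge keeps the tiny initial covariance invertible
        cov = np.cov(x[idx], rowvar=False, bias=True) + 1e-8 * np.eye(p)
        for _ in range(n_iter):
            diff = x - mean
            d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
            new_idx = np.sort(np.argsort(d2)[:h])  # h points closest to the centre
            if np.array_equal(new_idx, idx):
                break
            idx = new_idx
            mean = x[idx].mean(axis=0)
            cov = np.cov(x[idx], rowvar=False, bias=True)
        sign, logdet = np.linalg.slogdet(cov)
        if sign > 0 and logdet < best[0]:
            best = (logdet, mean, cov)
    return best[1], best[2]


def robust_mslca(blocks, gamma=0.75):
    """Canonical coefficients and basis vectors from the approximate MCD covariance."""
    x = np.hstack(blocks)
    dims = [b.shape[1] for b in blocks]
    _, cov = c_step_mcd(x, gamma=gamma)
    t = t_operator(cov, dims)
    rho, beta = np.linalg.eigh((t + t.T) / 2.0)
    return rho[::-1], beta[:, ::-1]  # nonincreasing eigenvalue order
```

Calling `robust_mslca` on a list of `n x p_k` arrays returns the estimated canonical coefficients in nonincreasing order. On data made of mutually uncorrelated sets plus a small cluster of distant outliers, the largest coefficient should stay close to zero, whereas the version based on the empirical covariance is pulled away from zero by the outliers.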

3.2 Influence functions

In order to derive the influence functions related to the above estimator of MSLCA, we have to specify the functional that corresponds to it. To do so, we first recall the functional associated with the above MCD estimator of the covariance operator. Let

where is determined by the equation

being the usual gamma function. The functional related to the aforementioned MCD estimator of is defined in [2] (see also [1], [7]) by

where

It is known that where

Therefore, the functional related to is defined as

where and are defined in (3). Now, we can give the influence functions related to RMSLCA of . First, putting

and , we have:


Theorem 3

We suppose that the assumptions to hold. Then

where is given in (8).


From this theorem we obtain the following proposition, which proves that RMSLCA is robust since the preceding influence function is bounded. We denote by the usual operator norm defined by .


Proposition 2

We suppose that the assumptions to hold. Then,

We now give, in the following theorem, the influence functions related to the canonical coefficients and the canonical directions obtained from RMSLCA. For , denoting by (resp. ; resp. ) the functional such that is the -th largest eigenvalue of (resp. the associated eigenvector; resp. ), we put , and . Considering

(14)

we have:


Theorem 4

We suppose that the assumptions to hold. Then, for any and any , we have:

(ii)   We suppose, in addition, that . Then :

where is given in (2).

Remark 2

From this theorem, we recover the results of [5], which give the influence function of the MCD estimator of LCA of two random vectors. Indeed, using the notation of Remark 1 and (12), we deduce from the previous theorem that, when , we have

which is the result obtained in [5].

4 Asymptotics for RMSLCA

In this section we deal with asymptotic expansion for RMSLCA. We first establish asymptotic normality for and then we derive the asymptotic distribution of the canonical coefficients.


Theorem 5

Under the assumptions to , converges in distribution, as , to a random variable having a normal distribution in , with mean 0 and covariance operator equal to that of the random operator

where is the function defined by

(15)

and , , and are given in (14).


This theorem allows us to obtain asymptotic distributions for the canonical coefficients. Let (with ) be the decreasing sequence of distinct eigenvalues of , and the multiplicity of . Putting with , we clearly have for any . We denote by the orthogonal projector from onto the eigenspace associated with , and by the continuous map which associates to each self-adjoint operator the vector of its eigenvalues in nonincreasing order. For , we consider the -dimensional vectors