Estimation of large block covariance matrices: Application to the analysis of gene expression data

06/26/2018
by Marie Perrot-Dockès, et al.

Motivated by an application in molecular biology, we propose a novel, efficient and fully data-driven approach for estimating large block structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. Our approach consists in approximating such a covariance matrix by the sum of a low-rank sparse matrix and a diagonal matrix. Our methodology can also deal with matrices for which the block structure only appears if the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network and from GitHub. In order to illustrate the statistical and numerical performance of our package, some numerical experiments are provided, as well as a thorough comparison with alternative methods. Finally, our approach is applied to gene expression data in order to better understand the toxicity of acetaminophen on the liver of rats.


1. Introduction

Plant functional genomics refers to the description of the biological function of a single gene or of a group of genes, and to both the dynamics and the plasticity of genome expression in shaping the phenotype. Combining multi-omics approaches such as transcriptomics, proteomics or metabolomics allows us to address in a new light the dimension and complexity of the different levels of gene expression control and the delicacy of the metabolic regulation of plants under fluctuating environments. Thus, our era marks a real conceptual shift in plant biology, where the individual is no longer considered as a simple sum of components but rather as a system of interacting components that together maximize its growth, its reproduction and its adaptation. Plant systems biology is therefore defined by multidisciplinary and multi-scale approaches based on the acquisition of a wide range of data, as exhaustive as possible.

In this context, it is crucial to propose new methodologies for integrating heterogeneous data explaining the co-regulations/co-accumulations of products of gene expression (mRNA, proteins) and metabolites. In order to better understand these phenomena, our goal will thus be to propose a new approach for estimating block structured covariance matrices in a high-dimensional framework where the dimension of the covariance matrix is much larger than the sample size. In this setting, it is well known that the commonly used sample covariance matrix performs poorly. In recent years, researchers have proposed various regularization techniques to consistently estimate large covariance matrices or the inverses of such matrices, namely precision matrices. To estimate such matrices, one of the key assumptions made in the literature is that the matrix of interest is sparse, namely that many of its entries are equal to zero. A number of regularization approaches, including banding, tapering, thresholding and $\ell_1$ minimization, have been developed to estimate large covariance matrices or their inverses, see for instance [27], [5], [3], [6] and [34] among many others. For further references, we refer the reader to [7] and to the review of [13].

In this paper, we shall consider the following framework. Let $E_1, \ldots, E_n$ be zero-mean i.i.d. $p$-dimensional random vectors having a covariance matrix $\Sigma$ such that $p$, the number of rows and columns of $\Sigma$, is much larger than $n$. The goal of the paper is to propose a new estimator of $\Sigma$ and of the square root of its inverse, $\Sigma^{-1/2}$, in the particular case where $\Sigma$ is assumed to have a block structure, without limiting ourselves to diagonal blocks. An accurate estimator of $\Sigma$ can indeed be very useful to better understand the links between the columns of the observation matrix and may highlight some biological processes. Moreover, an estimator of $\Sigma^{-1/2}$ can be very useful in the general linear model in order to remove the dependence that may exist between the columns of the observation matrix. For further details on this point, we refer the reader to [29], [31] and to the R package MultiVarSel, in which such an approach is proposed and implemented for performing variable selection in the multivariate linear model in the presence of dependence between the columns of the observation matrix.

More precisely, in this paper, we shall assume that

$$\Sigma = Z Z' + D, \qquad (1)$$

where $Z$ is a $p \times q$ sparse matrix with $q \ll p$, $Z'$ denotes the transpose of the matrix $Z$ and $D$ is a diagonal matrix such that the diagonal terms of $\Sigma$ are equal to one. Two examples of such matrices $\Sigma$ and $Z$ are given in Figure 1, in cases where the columns of $\Sigma$ do not need to be permuted in order to see the block structure. Based on (1), our model could seem to be close to the factor models described in [24] and [13]. However, in [24] the high-dimensional aspects are not considered, and in [13] the sparsity constraint is not studied. Note also that the block diagonal assumption has already been recently considered by [9] for estimating the inverse of large covariance matrices in high-dimensional Gaussian Graphical Models (GGM).
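To fix ideas, the following R sketch builds a matrix $\Sigma$ satisfying (1); the number of non null values per column of $Z$ and their values are illustrative choices, not the settings used in the experiments of Section 3.

```r
## Minimal sketch of Model (1): Sigma = Z %*% t(Z) + D with unit diagonal.
## Block positions and values below are illustrative only.
p <- 100; q <- 5
Z <- matrix(0, p, q)
Z[1:30, 1]   <- 0.7
Z[31:55, 2]  <- 0.5
Z[56:70, 3]  <- 0.6
Z[71:90, 4]  <- 0.5
Z[91:100, 5] <- 0.8
Sigma <- Z %*% t(Z)
diag(Sigma) <- 1  # amounts to adding D = I_p - diag(Z %*% t(Z))
```

Each column of $Z$ generates one diagonal block of $\Sigma$; columns whose supports overlap would create the extra-diagonal blocks visible in the right part of Figure 1.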

Figure 1. Examples of matrices $\Sigma$ generated from different matrices $Z$, leading to a block diagonal structure or to a more general block structure (extra-diagonal blocks).

We also propose a methodology to estimate $\Sigma$ in the case where the block structure is latent; that is, the columns and rows of $\Sigma$ have to be permuted to make its block structure visible. An example of such a matrix is given in Figure 2.

Figure 2. Examples of the matrices $\Sigma$ of Figure 1 in which the columns and rows have been randomly permuted.

Our approach is fully data-driven. It consists in providing a low rank matrix approximation of the $ZZ'$ part of $\Sigma$ and then in using an $\ell_1$ regularization to obtain a sparse estimator of $\Sigma$. When the block structure is latent, a hierarchical clustering step must be applied first. With this estimator of $\Sigma$, we explain how to obtain an estimator of $\Sigma^{-1/2}$.

Our methodology is described in Section 2. Some numerical experiments on synthetic data are provided in Section 3. An application to the analysis of “-omic” data to study seed quality is performed in Section 4.

2. Statistical inference

The strategy that we propose for estimating $\Sigma$ and $\Sigma^{-1/2}$ can be summarized as follows.

  • First step: Low rank approximation. In this step, we propose to approximate the $ZZ'$ part of $\Sigma$ by a low rank matrix using a Singular Value Decomposition (SVD).

  • Second step: Detecting the position of the non null values. In this step, we use a Lasso criterion to yield a sparse estimator $\widetilde{\Sigma}$ of $\Sigma$.

  • Third step: Positive definiteness. We apply the methodology of [19] to $\widetilde{\Sigma}$ to ensure that the final estimator $\widehat{\Sigma}$ of $\Sigma$ is positive definite.

  • Fourth step: Estimation of $\Sigma^{-1/2}$. In this step, $\Sigma^{-1/2}$ is estimated from the spectral decomposition of the estimator $\widehat{\Sigma}$ obtained in the previous step.

2.1. Low rank approximation

By definition of $\Sigma$ in (1), $\Sigma - D = ZZ'$ is a low rank matrix having its rank smaller than or equal to $q$. In the first step, our goal is thus to propose a low rank approximation of an estimator of $ZZ'$.

Let $S$ be the sample covariance matrix defined by

$$S = \frac{1}{n} \sum_{i=1}^{n} (E_i - \bar{E})(E_i - \bar{E})', \quad \textrm{where } \bar{E} = \frac{1}{n} \sum_{i=1}^{n} E_i.$$

The corresponding sample correlation matrix $R$ is defined by:

$$R_{i,j} = \frac{S_{i,j}}{\sqrt{S_{i,i}\,S_{j,j}}}, \quad 1 \leq i, j \leq p. \qquad (2)$$

Let us also consider the matrix $\Gamma$ defined by:

$$\Gamma = R - I_p, \qquad (3)$$

where $I_p$ denotes the $p \times p$ identity matrix. If $R$ was the real matrix $\Sigma$, the corresponding matrix $\Gamma$ would have a rank less than or equal to $q$. Since $R$ is an estimator of $\Sigma$, we shall use a rank $q$ approximation of $\Gamma$, denoted $\widetilde{\Gamma}$ in the sequel. This will be performed by keeping in the singular value decomposition of $\Gamma$ only the $q$ largest singular values and by replacing the other ones by 0. By [12], this corresponds to the best rank $q$ approximation of $\Gamma$. The choice of $q$ will be discussed in Section 2.5.
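Continuing the sketch started in Section 1, the low rank step can be summarized as follows, where Gamma_q plays the role of $\widetilde{\Gamma}$:

```r
## Sample correlation (Eq. (2)), diagonal removal (Eq. (3)) and best rank-q
## approximation of Gamma obtained by truncating its SVD (Eckart-Young).
set.seed(1)
n <- 30
E <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows have covariance Sigma
R <- cor(E)                                      # sample correlation matrix
Gamma <- R - diag(p)                             # Eq. (3)
sv <- svd(Gamma)
Gamma_q <- sv$u[, 1:q] %*% diag(sv$d[1:q]) %*% t(sv$v[, 1:q])
```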

2.2. Detecting the position of the non null values

Let us first explain the usual framework in which the Lasso approach is used. We consider a linear model of the following form:

$$Y = X\beta + \epsilon, \qquad (4)$$

where $Y$ and $\epsilon$ are vectors and $\beta$ is a sparse vector, meaning that it has many null components.

In such models, a very popular approach, initially proposed by [35], is the Least Absolute Shrinkage and Selection Operator (Lasso), which is defined as follows for a positive $\lambda$:

$$\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad (5)$$

where, for $u = (u_1, \ldots, u_m)$, $\|u\|_2^2 = \sum_{i=1}^{m} u_i^2$ and $\|u\|_1 = \sum_{i=1}^{m} |u_i|$, i.e. the $\ell_1$-norm of the vector $u$. Observe that the first term of (5) is the classical least-squares criterion and that $\lambda \|\beta\|_1$ can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the $\ell_1$-norm, ensuring that the number of non-zero components of the estimator of $\beta$ is small for large enough values of $\lambda$. Let

$$Y = \mathrm{vech}(\widetilde{\Gamma}), \qquad (6)$$

where $\mathrm{vech}$, defined in Section 16.4 of [17], is such that, for a $p \times p$ matrix $A$ with columns $a_1, \ldots, a_p$,

$$\mathrm{vech}(A) = \left( a_1', a_2^{(1)'}, \ldots, a_p^{(p-1)'} \right)',$$

where $a_j^{(i)}$ is the sub-vector of the $j$-th column $a_j$ of $A$ obtained by striking out the first $i$ elements. In order to estimate the sparse matrix $ZZ'$, we need to propose a sparse estimator of $Y$. To do this, we apply the Lasso criterion described in (5), where $X$ is the identity matrix. In the case where $X$ is an orthogonal matrix, it has been shown in [16] that the solution of (5) is:

$$\hat{\beta}_j(\lambda) = X_j'Y \left( 1 - \frac{\lambda}{2|X_j'Y|} \right)_+,$$

where $X_j$ denotes the $j$-th column of $X$ and $x_+ = \max(x, 0)$. Using the fact that $X$ is the identity matrix, we get

$$\hat{\beta}_j(\lambda) = Y_j \left( 1 - \frac{\lambda}{2|Y_j|} \right)_+. \qquad (7)$$

We then re-estimate the non null coefficients using the least-squares criterion and get:

$$\tilde{\beta}_j(\lambda) = Y_j \, \mathbf{1}_{\{\hat{\beta}_j(\lambda) \neq 0\}}, \qquad (8)$$

where $Y$ is defined in (6).

It has to be noticed that the symmetric matrix $\widehat{B}(\lambda)$ associated with $\hat{\beta}(\lambda)$ obtained in (7) satisfies the following criterion:

$$\widehat{B}(\lambda) = \operatorname*{argmin}_{B} \left\{ \|\bar{\Gamma} - B\|_F^2 + \lambda \|B\|_1 \right\},$$

where $\|A\|_F$ denotes the Frobenius norm, defined for a matrix $A$ by $\|A\|_F^2 = \sum_{i,j} a_{i,j}^2$, and $\|B\|_1$ denotes the $\ell_1$-norm of the vector formed by stacking the columns of $B$. It is thus closely related to the generalized thresholding estimator defined in [38] and to the one defined in [34], except that in our case $R$ is replaced by $\bar{\Gamma}$, where $\bar{\Gamma}$ corresponds to the matrix $\widetilde{\Gamma}$ in which the diagonal terms are replaced by 0; the diagonal terms of $R$ were indeed already removed in $\Gamma$. Hence, $\hat{\beta}(\lambda)$ is obtained by elementwise soft-thresholding, that is, by setting to zero the values of $\bar{\Gamma}$ that are under a given threshold and by multiplying the non null values by a coefficient involving this threshold.

Here, we choose instead to estimate the non null entries by $\tilde{\beta}(\lambda)$ defined in (8), which corresponds to a hard-thresholding, and we set the upper triangular part of the estimator $\widetilde{\Sigma}$ of $\Sigma$ accordingly. Since the diagonal terms of $\Sigma$ are assumed to be equal to 1, we take the diagonal terms of $\widetilde{\Sigma}$ equal to 1. The lower triangular part of $\widetilde{\Sigma}$ is then obtained by symmetry.
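In the running sketch, this step reads as follows, where y plays the role of the vector $Y$ of (6) restricted to the strict upper triangular part of $\widetilde{\Gamma}$:

```r
## Soft-thresholding of the upper-triangular entries (Eq. (7)), re-estimation
## of the selected entries (Eq. (8), a hard-thresholding), and construction of
## the symmetric estimator Sigma_tilde with unit diagonal.
lambda <- 0.4                                        # sparsity parameter
y <- Gamma_q[upper.tri(Gamma_q)]
beta_soft <- sign(y) * pmax(abs(y) - lambda / 2, 0)  # Eq. (7)
beta_hard <- y * (beta_soft != 0)                    # Eq. (8)
Sigma_tilde <- diag(p)
Sigma_tilde[upper.tri(Sigma_tilde)] <- beta_hard
Sigma_tilde <- Sigma_tilde + t(Sigma_tilde) - diag(p)  # lower part by symmetry
```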

The choice of the best parameter $\lambda$ will be discussed in Section 3.2.

2.3. Positive definiteness

To ensure the positive definiteness of our estimator $\widetilde{\Sigma}$ of $\Sigma$, we consider the nearest correlation matrix to $\widetilde{\Sigma}$, denoted $\widehat{\Sigma}$, which is computed by using the methodology proposed by [19] and which is implemented in the function nearPD of the R package Matrix, see [4].
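In the running sketch, this step reads:

```r
## Nearest correlation matrix in the sense of Higham (2002).
library(Matrix)
Sigma_hat <- as.matrix(nearPD(Sigma_tilde, corr = TRUE)$mat)
```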

2.4. Estimation of $\Sigma^{-1/2}$

Even if providing an estimator of a large covariance matrix can be very useful in practice, it may also be interesting to efficiently estimate $\Sigma^{-1/2}$. As explained in the Introduction, such an estimator can indeed be used in the general linear model in order to remove the dependence that may exist between the columns of the observation matrix; see [29], [31] and the R package MultiVarSel.

Since $\widehat{\Sigma}$ is a symmetric matrix, it can be rewritten as $\widehat{\Sigma} = P \Lambda P'$, where $\Lambda$ is a diagonal matrix and $P$ is an orthogonal matrix. The matrix $\Sigma^{-1/2}$ can thus be estimated by $P \Lambda^{-1/2} P'$, where $\Lambda^{-1/2}$ is a diagonal matrix having its diagonal terms equal to the square root of the inverse of the eigenvalues of $\widehat{\Sigma}$. However, inverting the square root of too small eigenvalues may lead to poor estimators of $\Sigma^{-1/2}$. This is the reason why we propose to estimate $\Sigma^{-1/2}$ by

$$\widehat{\Sigma}^{-1/2} = P \widetilde{\Lambda}^{-1/2} P', \qquad (9)$$

where $\widetilde{\Lambda}^{-1/2}$ is a diagonal matrix such that its diagonal entries are equal to the square root of the inverse of the diagonal entries of $\Lambda$, except for those which are smaller than a given threshold $t$, which are replaced by 0 in $\widetilde{\Lambda}^{-1/2}$. The choice of $t$ will be further discussed in Section 3.6.
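In the running sketch, with an arbitrary threshold of 0.1 (the choice of $t$ is discussed in Section 3.6):

```r
## Eq. (9): spectral decomposition of Sigma_hat; eigenvalues below the
## threshold t are discarded before inverting the square roots.
t_thresh <- 0.1
eig <- eigen(Sigma_hat, symmetric = TRUE)
d_inv_sqrt <- ifelse(eig$values > t_thresh, 1 / sqrt(eig$values), 0)
Sigma_hat_inv_sqrt <- eig$vectors %*% diag(d_inv_sqrt) %*% t(eig$vectors)
```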

2.5. Choice of the parameters

Our methodology for estimating $\Sigma$ depends on two parameters: the number $q$ of singular values kept in the low rank approximation $\widetilde{\Gamma}$, and the parameter $\lambda$ which controls the sparsity level, namely the number of null values in $\tilde{\beta}(\lambda)$ defined in (8).

For choosing $q$, we shall compare two strategies in Section 3.1:

  • the Cattell criterion, based on Cattell's scree plot described in [8], and

  • the PA permutation method, proposed by [20] and recently studied from a theoretical point of view by [10].

To choose the parameter $\lambda$ in (8), we shall compare two strategies in Section 3.2:

  • the BL approach proposed in [5], based on cross-validation, and

  • the Elbow method, which consists in computing, for different values of $\lambda$, the Frobenius norm of $R - \widetilde{\Sigma}(\lambda)$, where $R$ and $\widetilde{\Sigma}(\lambda)$ are defined in (2) and at the end of Section 2.2, respectively. It then fits two simple linear regressions to this error curve and chooses the value of $\lambda$ achieving the best fit; a schematic implementation is sketched below.
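The following sketch illustrates the spirit of the Elbow strategy on the running example; the grid of sparsity levels and the exhaustive breakpoint search are illustrative choices, and the BlockCov implementation may differ in its details.

```r
## For each candidate number nb of non null values, hard-threshold Gamma_q,
## compute the Frobenius error to R, then locate the elbow of the error curve
## by fitting two linear regressions on either side of a breakpoint.
nb_grid <- seq(100, 2000, by = 100)
err <- sapply(nb_grid, function(nb) {
  thr <- sort(abs(y), decreasing = TRUE)[nb]   # keep the nb largest |entries|
  b <- y * (abs(y) >= thr)
  S_tmp <- diag(p)
  S_tmp[upper.tri(S_tmp)] <- b
  S_tmp <- S_tmp + t(S_tmp) - diag(p)
  norm(R - S_tmp, type = "F")
})
rss <- sapply(2:(length(nb_grid) - 2), function(i) {
  left  <- lm(err[1:i] ~ nb_grid[1:i])
  right <- lm(err[(i + 1):length(err)] ~ nb_grid[(i + 1):length(nb_grid)])
  sum(residuals(left)^2) + sum(residuals(right)^2)  # fit of the two lines
})
nb_best <- nb_grid[which.min(rss) + 1]  # sparsity level achieving the best fit
```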

3. Numerical experiments

Our methodology described in the previous section is implemented in the R package BlockCov, which is available from CRAN (the Comprehensive R Archive Network) and from GitHub.

We propose hereafter to investigate the performance of our approach for different types of matrices $\Sigma$ defined in (1) and for different values of $n$ and $p$. The four following cases correspond to different types of matrices $Z$, the matrices $D$ being chosen accordingly to ensure that the matrix $\Sigma$ has its diagonal terms equal to 1.

  • Diagonal-Equal case. In this situation, $Z$ has the structure displayed in the left part of Figure 1, namely it has 5 columns, each containing a fixed number of non null values, all the non null values of a given column being equal to the same constant.

  • Diagonal-Unequal case. In this scenario, $Z$ has the same structure as in the Diagonal-Equal case except that the non null values in the five columns are not fixed but randomly chosen in a given interval, except for the third column, for which the values are randomly chosen in a different interval.

  • Extra-Diagonal-Equal case. Here, $Z$ has the structure displayed in the right part of Figure 1. The values of the columns of $Z$ are the same as those of the Diagonal-Equal case except for the fourth column, which is assumed to contain additional non null values equal to -0.5 in a fixed range of rows, thus creating extra-diagonal blocks.

  • Extra-Diagonal-Unequal case. $Z$ has the same structure as in the Extra-Diagonal-Equal case except that the values are randomly chosen as in the Diagonal-Unequal case, the additional non null values of the fourth column still being equal to -0.5.

For each value of $n$ and $p$ considered, 100 matrices $E$ were generated such that their rows are i.i.d. $p$-dimensional zero-mean Gaussian vectors having a covariance matrix $\Sigma$ chosen according to the four previous cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal or Extra-Diagonal-Unequal.

3.1. Low rank approximation

The approaches for choosing $q$ described in Section 2.5 are illustrated in Figure 3 in the Extra-Diagonal-Unequal case. We can see from this figure that both methodologies find the right value of $q$, which is here equal to 5.

Figure 3. Illustration of the PA and Cattell criteria for choosing $q$ in the Extra-Diagonal-Unequal case. The value of $q$ found by both methodologies is displayed with a dotted line; the straight lines obtained for the Cattell criterion and the eigenvalues of the permuted matrices in the PA methodology are displayed in grey.

To go further, we investigate the behavior of our methodologies from 100 replications of the matrix $E$ for the four different types of $\Sigma$. Figure 4 displays the barplots associated to the estimation of $q$ made in the different replications by the two approaches for the different scenarii. We can see from this figure that the PA criterion seems to be slightly more stable than the Cattell criterion for the largest sample size considered. However, for the smallest sample size, the PA criterion underestimates the value of $q$. Moreover, in terms of computational time, the performance of Cattell is much better, see Figure 5.

Figure 4. Barplots corresponding to the number of times each value of $q$ is chosen in the low-rank approximation from 100 replications, for the two methodologies, in the different scenarii and for the different values of $n$ and $p$.
Figure 5. Computational times of the PA and Cattell criteria.

3.2. Positions of the non null values

For the four scenarios, the performance of the two approaches described in Section 2.5 for choosing $\lambda$, BL and Elbow, and hence the number of non null values in $\widetilde{\Sigma}$, is illustrated in Figure 6. This figure displays the True Positive Rate (TPR) and the False Positive Rate (FPR) of the methodologies from 100 replications of the matrix $E$ for the four different types of $\Sigma$ and for different values of $n$ and $p$.

Figure 6. Boxplots comparing the TPR (True Positive Rate) and the FPR (False positive Rate) of the two methodologies proposed to select the parameter from 100 replications in the different scenarii.

We can see from this figure that the performance of Elbow is on a par with the one of BL, except in one of the settings considered, for which the performance of Elbow is slightly better in terms of True Positive Rate. Moreover, in terms of computational time, the performance of Elbow is much better, see Figure 7.

Figure 7. Computational times of Elbow and BL criteria.

3.3. Comparison with other methodologies

The goal of this section is to compare the statistical performance of our approach with other methodologies.

Since our goal is to estimate a covariance matrix containing blocks, we shall compare our approach with clustering techniques. Once the groups or blocks have been obtained, $\Sigma$ is estimated by assuming that the corresponding estimator is block-wise constant, except for the diagonal blocks, for which the diagonal entries are equal to 1 and the extra-diagonal terms are assumed to be equal. This gives a great advantage to these methodologies in the Diagonal-Equal and Extra-Diagonal-Equal scenarii. More precisely, let $c_{k,\ell}$ denote the value of the entries of the block having its rows corresponding to Group (or Cluster) $k$ and its columns to Group (or Cluster) $\ell$. Then, for a given clustering:

$$c_{k,\ell} = \frac{1}{|C_k|\,|C_\ell|} \sum_{i \in C_k} \sum_{j \in C_\ell} R_{i,j}, \qquad (10)$$

where $C_k$ denotes the cluster $k$, $|C_k|$ denotes the number of elements in the cluster $k$ and $R_{i,j}$ is the $(i,j)$ entry of the matrix $R$ defined in Equation (2).
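A sketch of this block-wise constant estimator is given below; the clustering cl is obtained here by cutting a hierarchical clustering dendrogram, the dissimilarity 1 - |R| being an illustrative choice.

```r
## Eq. (10): fill each block with the mean of the corresponding entries of R,
## the diagonal entries being set to one afterwards.
block_estimate <- function(R, cl) {
  Sigma_b <- matrix(0, nrow(R), ncol(R))
  for (k in unique(cl)) for (l in unique(cl)) {
    Rkl <- R[cl == k, cl == l, drop = FALSE]
    vals <- if (k == l) Rkl[row(Rkl) != col(Rkl)] else Rkl  # skip the diagonal
    Sigma_b[cl == k, cl == l] <- if (length(vals)) mean(vals) else 0
  }
  diag(Sigma_b) <- 1
  Sigma_b
}
cl <- cutree(hclust(as.dist(1 - abs(R)), method = "complete"), k = 5)
Sigma_block <- block_estimate(R, cl)
```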

For the matrices $\Sigma$ corresponding to the four scenarios previously described, we shall compare the statistical performance of the following methods:

  • empirical, which estimates $\Sigma$ by the matrix $R$ defined in (2),

  • blocks, which estimates $\Sigma$ using the methodology described in this article with the criteria PA and BL for choosing $q$ and $\lambda$, respectively,

  • blocks_fast, which estimates $\Sigma$ using the methodology described in this article with the criteria Cattell and Elbow for choosing $q$ and $\lambda$, respectively,

  • blocks_real, which estimates $\Sigma$ using the methodology described in this article when $q$ and the number of non null values are assumed to be known, which gives access to the best value of $\lambda$,

  • hclust, which estimates $\Sigma$ by determining clusters using a hierarchical clustering with the “complete” agglomeration method described in [18] and then uses Equation (10),

  • Specc, which estimates $\Sigma$ by determining clusters using the spectral clustering described in [37] and then uses Equation (10),

  • kmeans, which estimates $\Sigma$ by determining clusters with a $k$-means clustering approach described in [18] and then uses Equation (10).

In order to improve the performance of the clustering approaches hclust, Specc and kmeans, the real number of clusters has been provided to these methods. The performance of the different approaches is assessed using the Frobenius norm of the difference between $\Sigma$ and its estimator.

Figure 8 displays the means and standard deviations of the Frobenius norm of the difference between $\Sigma$ and its estimator for different values of $n$ and $p$ in the four different cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal and Extra-Diagonal-Unequal. We can see from this figure that for the smallest sample size, the performance of blocks_fast is on a par with the one of blocks_real and is better than the one of blocks. For the largest sample size, the performance of blocks is slightly better than the one of blocks_fast and is similar to the one of blocks_real. Moreover, in all cases, either blocks_fast or blocks outperforms the other approaches.

Figure 8. Comparison of the Frobenius norm of $\Sigma - \widehat{\Sigma}$ for different estimators $\widehat{\Sigma}$ of $\Sigma$ and for different values of $n$ and $p$.

Then, the estimators of $\Sigma$ derived from blocks, blocks_fast and blocks_real were compared to the PDSCE estimator proposed by [34] and implemented in the R package PDSCE. Since the computational burden of PDSCE is high for large values of $p$, we limit ourselves to the Extra-Diagonal-Equal case with $n = 30$ and $p = 100$ for the comparison. Figure 9 displays the results. We can see from this figure that blocks, blocks_fast and blocks_real provide better results than PDSCE. However, it has to be noticed that the latter approach is not designed for dealing with block structured covariance matrices but only for providing sparse estimators of large covariance matrices.

Figure 9. Comparison of the Frobenius norm of $\Sigma - \widehat{\Sigma}$ in the Extra-Diagonal-Equal case for $n = 30$ and $p = 100$.

3.4. Column permutation

In practice, it may occur that the columns of the observation matrix $E$ are not ordered in a way which makes blocks appear in the matrix $\Sigma$. To address this issue, we propose to perform a hierarchical clustering on the columns of $E$ beforehand and to use the obtained ordering of the variables, which guarantees that a cluster plot using this ordering will not have crossings of the branches. Let $E^{\mathrm{ord}}$ denote the matrix in which the columns have been permuted according to this ordering and $\Sigma^{\mathrm{ord}}$ the covariance matrix of each row of $E^{\mathrm{ord}}$. Then, we apply our methodology to $E^{\mathrm{ord}}$, which should provide an efficient estimator of $\Sigma^{\mathrm{ord}}$. In order to get an estimator of $\Sigma$, its columns and rows are permuted back according to the inverse of the ordering coming from the hierarchical clustering.
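A sketch of this preliminary ordering step, continuing the running example (the dissimilarity 1 - |R| used for the hierarchical clustering is an illustrative choice):

```r
## Order the columns by hierarchical clustering, estimate on the reordered
## data, and permute the estimator back to the original ordering.
ord <- hclust(as.dist(1 - abs(R)), method = "complete")$order
E_ord <- E[, ord]                    # columns reordered to reveal the blocks
## ... run the four estimation steps of Section 2 on E_ord, which would
## yield an estimator Sigma_hat_ord; here we fake it for illustration:
Sigma_hat_ord <- Sigma_hat[ord, ord]
inv_ord <- order(ord)                # inverse permutation
Sigma_hat_back <- Sigma_hat_ord[inv_ord, inv_ord]  # estimator of Sigma itself
```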

To assess the corresponding loss of performance, we generated, for each matrix $E$ used for making Figure 8, a matrix $E^{\mathrm{perm}}$ in which the columns of $E$ were randomly permuted. The associated covariance matrix is denoted $\Sigma^{\mathrm{perm}}$. Then, we applied the methodology described in the previous paragraph (denoted blocks_samp and blocks_fast_samp in Figure 10), thus providing an estimator of $\Sigma^{\mathrm{perm}}$. The performance of this new methodology was compared to that of the methodology proposed in the previous sections (denoted blocks and blocks_fast in Figure 10) when the columns of $E$ were not permuted. The results are displayed in Figure 10. We can see from this figure that the performance of our approach does not seem to be altered by the permutation of the columns.

Figure 10. Comparison of the Frobenius norms of the errors made when estimating $\Sigma$ and $\Sigma^{\mathrm{perm}}$.

3.5. Numerical performance

Figure 11 displays the computational times for estimating $\Sigma$ with the methods blocks and blocks_fast for different values of $p$ ranging from 100 to 3000, the sample size $n$ being fixed. The timings were obtained on a workstation with 16 GB of RAM and an Intel Core i7 (3.66 GHz) CPU. Our methodology is implemented in the R package BlockCov, which uses the R language (R Core Team, 2017) and relies on the R package Matrix. We can see from this figure that it takes around 3 minutes to estimate a correlation matrix for the largest value of $p$.

Figure 11. Times in seconds to perform our methodology in the Extra-Diagonal-Unequal case.

3.6. Choice of the threshold $t$ for estimating $\Sigma^{-1/2}$

Since we are interested in assessing the ability of $\widehat{\Sigma}^{-1/2}$ defined in (9) to remove the dependence that may exist between the columns of $E$, we shall consider the Frobenius norm of $\widehat{\Sigma}^{-1/2} \Sigma \widehat{\Sigma}^{-1/2} - I_p$, which should be close to zero, where $I_p$ denotes the identity matrix of size $p$. Figure 12 displays this Frobenius norm for different thresholds $t$. A threshold of 0.1 seems to provide a small error in terms of Frobenius norm. Hence, in the following, $t$ will be equal to 0.1 and the corresponding estimator will be referred to as $\widehat{\Sigma}^{-1/2}$.

Figure 12. Frobenius norm of $\widehat{\Sigma}^{-1/2} \Sigma \widehat{\Sigma}^{-1/2} - I_p$, where $\widehat{\Sigma}^{-1/2}$ is computed for different thresholds $t$.
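In the running sketch, this criterion can be computed as follows, the true matrix $\Sigma$ being known in the simulations:

```r
## If Sigma_hat_inv_sqrt is close to Sigma^{-1/2}, the covariance of the
## whitened data is close to the identity and this norm is close to zero.
whitening_error <- norm(Sigma_hat_inv_sqrt %*% Sigma %*% Sigma_hat_inv_sqrt
                        - diag(p), type = "F")
```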

This technique was applied to all of the estimators of $\Sigma$ discussed in Section 3.3 to get different estimators of $\Sigma^{-1/2}$. The Frobenius norm of the error $\widehat{\Sigma}^{-1/2} \Sigma \widehat{\Sigma}^{-1/2} - I_p$ is used to compare the estimators of $\Sigma^{-1/2}$ obtained by considering the different estimators of $\Sigma$. The results are displayed in Figure 13. We observe from this figure that, for the largest sample size, the estimators of $\Sigma^{-1/2}$ derived from the empirical, blocks_fast and blocks_real estimators of $\Sigma$ perform similarly and seem to be more adapted than the others to remove the dependence among the columns of $E$. However, for the smallest sample size, the behavior is completely different. Firstly, in the Diagonal-Equal case, the estimator of $\Sigma^{-1/2}$ derived from the hclust estimator of $\Sigma$ seems to perform better than the others. Secondly, in the Diagonal-Unequal case, the estimators derived from blocks, blocks_fast and blocks_real perform similarly to the one obtained from hclust. Thirdly, in the Extra-Diagonal cases, the estimators derived from the blocks, blocks_fast and blocks_real methodologies perform better than the other estimators.

Figure 13. Comparison of the Frobenius norm of the error $\widehat{\Sigma}^{-1/2} \Sigma \widehat{\Sigma}^{-1/2} - I_p$ for different estimators of $\Sigma^{-1/2}$.

Then, the estimators of $\Sigma^{-1/2}$ derived from blocks, blocks_fast and blocks_real were compared to the GRAB estimator proposed by [22]. Since the computational burden of GRAB is high for large values of $p$, we limit ourselves to the Extra-Diagonal-Equal case for the comparison. Figure 14 displays the results. We can see that blocks and blocks_real provide better results than GRAB. However, it has to be noticed that the latter approach depends on a lot of parameters that were difficult to choose; we thus used the default ones.

Figure 14. Comparison of the Frobenius norm of the error $\widehat{\Sigma}^{-1/2} \Sigma \widehat{\Sigma}^{-1/2} - I_p$ in the Extra-Diagonal-Equal case.

4. Application to “multi-omic” approaches to study seed quality

Climate change could lead to major crop failures in the world. In the present study, we addressed the impact of the mother plant environment on seed composition. Indeed, seed quality is of paramount ecological and agronomical importance: seeds are the most efficient form of dispersal of flowering plants in the environment. Seeds are remarkably adapted to harsh environmental conditions as long as they are in a quiescent state. Dry mature seeds (so-called “orthodox seeds”) are an appropriate resource for the preservation of plant genetic diversity in seedbanks. It has been reported that the temperature regime during seed production affects agronomical traits such as the seed germination potential, see [23], [28] and [26]. In order to highlight biomarkers of seed quality according to the thermal environment of the mother plant, Arabidopsis seeds were produced under three temperature regimes (14-16 °C, 18-22 °C or 25-28 °C under a long-day photoperiod). Dry mature seeds were analysed by shotgun proteomics and GC/MS-based metabolomics [11]. The choice of the model plant Arabidopsis was motivated by the colossal effort of the international scientific community for its genome annotation. This plant remains at the forefront of modern genetics, genomics, plant modelling and systems biology, see [32]. Arabidopsis provides a very useful basis to study gene regulatory networks and to develop modelling and systems biology approaches for translational research towards agricultural applications.

In this section, we apply our R packages BlockCov and MultiVarSel [30] to the metabolomic and proteomic data to better understand the impact of the temperature on seed quality. More precisely, we use the following modeling for our observations:

$$Y = XB + E, \qquad (11)$$

where $Y$ is an $n \times p$ matrix containing the responses of the $p$ metabolites (resp. proteins) for the $n$ samples of the metabolomic (resp. proteomic) dataset, and $X$ is the design matrix of a one-way ANOVA model, such that its first (resp. second, resp. third) column is equal to 1 if the corresponding sample grew under low (resp. medium, resp. elevated) temperature and 0 otherwise. $B$ is a coefficient matrix and $E$ is such that its rows are zero-mean i.i.d. $p$-dimensional random vectors having a covariance matrix $\Sigma$. We used our R package BlockCov to estimate $\Sigma$ and $\Sigma^{-1/2}$, assuming that there exists a latent block structure in the covariance matrix of the rows of $E$. More precisely, we assume that there exist some groups of metabolites (resp. proteins) having the same behavior since they belong to the same biological process. Then, we plugged this estimator into our R package MultiVarSel to obtain a sparse estimator of $B$. Thanks to this estimator of $B$, we could identify the metabolites (resp. proteins) having a higher (resp. lower) concentration when the temperature is high or low.
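A schematic version of Model (11) on simulated data is given below; the number of replicates per temperature regime and the number of responses are assumptions made for the illustration only, the real analysis relying on the BlockCov and MultiVarSel packages applied to the metabolomic and proteomic responses.

```r
## One-way ANOVA design of Model (11): one indicator column per regime.
temperature <- factor(rep(c("low", "medium", "elevated"), each = 3))  # assumed replicates
X <- model.matrix(~ temperature - 1)
Y <- matrix(rnorm(nrow(X) * 50), nrow(X), 50)  # stand-in for the n x p responses
B_hat <- solve(crossprod(X), crossprod(X, Y))  # least-squares estimate of B
E_hat <- Y - X %*% B_hat                       # residual matrix, used to estimate Sigma
```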

4.1. Results obtained for the metabolomic data

We first estimated the matrices $\Sigma$ and $\Sigma^{-1/2}$ associated to the matrix $E$ defined in Equation (11) by using the methodology developed in this paper, namely the BlockCov package. By the results of Section 3, we know that the PA and BL approaches performed poorly when the sample size is small, which is the case here. We thus used the Cattell and Elbow criteria to choose $q$ and $\lambda$, respectively. The results are displayed in Figure 15. With the rank chosen by the Cattell criterion and the value of $\lambda$ chosen by the Elbow criterion, among the 19701 coefficients of the correlation matrix, only 6696 values are considered as non null.

Figure 15. Illustration of the Cattell and Elbow criteria.

The estimation of $\Sigma$ obtained with our methodology is displayed in Figure 16, once the rows and the columns have been permuted according to the ordering provided by the hierarchical clustering, to make the latent block structure visible.

Figure 16. Estimator of the correlation matrix of the rows of $E$ once the rows and the columns have been permuted according to the ordering provided by the hierarchical clustering.

Using the estimator of $\Sigma^{-1/2}$ provided by the BlockCov package in the R package MultiVarSel yields the sparse estimator of the matrix $B$ defined in Model (11) which is displayed in Figure 17. We can see from this figure that, for the metabolite X5MTP, the coefficient of the matrix $\widehat{B}$ is positive when the temperature is high, which means that the production of the metabolite X5MTP is larger in high temperature conditions than in low temperature conditions.

Figure 17. Sparse estimator of the coefficient matrix $B$ obtained thanks to the package MultiVarSel with a threshold of 0.95.

In order to go further in the biological interpretation, we wanted to better understand the underlying block structure of $\widehat{\Sigma}$, the estimator of the correlation matrix of the residuals based on metabolite abundances. Thus, we applied a hierarchical clustering with 8 groups to this matrix in order to split it into blocks. The corresponding dendrogram is on the left part of Figure 16. The matrix containing the correlation means within and between the blocks or groups of metabolites is displayed in Figure 18. The composition of the metabolite groups is available in Appendix 6.1.

Figure 18. Means of the correlations between the groups of metabolites.

Interestingly, we could observe that X5MTP belongs to Group 6, which displays a high mean correlation, equal to 0.8, between the 14 metabolites that make it up. At least 6 metabolites of this group belong to the same family, namely glucosinolates (i.e. X4MTB, 4-methylthiobutyl glucosinolate; X5MTP, 5-methylthiopentyl glucosinolate; X6MTH, 6-methylthiohexyl glucosinolate; X7MTH, 7-methylthioheptyl glucosinolate; X8MTO, 8-methylthiooctyl glucosinolate; UGlucosinolate140.1, unidentified glucosinolate). Glucosinolates (GLS) are specialized metabolites found in Brassicaceae and related families (e.g. Capparaceae), containing a β-thioglucose moiety, a sulfonated oxime moiety, and a variable aglycone side chain derived from an amino acid. These compounds contribute to the plant's overall defense mechanism, see [39]. Methylthio-GLS are derived from methionine. Methionine is elongated through condensation with acetyl-CoA and the elongated derivatives are then converted to aldoximes through the action of individual members of the cytochrome P450 enzymes belonging to the CYP79 family, see [14]. The aldoxime undergoes condensation with a sulfur donor and is stepwise converted to GLS, followed by the side chain modification. The present results suggest that the accumulation of the methionine-derived glucosinolate family is strongly coordinated in Arabidopsis seeds. Moreover, we can see that it is influenced by the thermal environment of the mother plant.

4.2. Results obtained for the proteomic data

The same study was conducted on the proteomic data. The estimator of the correlation matrix of the residuals based on protein abundances obtained with our methodology is displayed in Figure 19, once the rows and the columns have been permuted according to the ordering provided by the hierarchical clustering, to make the latent block structure visible. To better understand the underlying block structure of this estimator, we applied a hierarchical clustering with 9 groups to this matrix in order to split it into blocks. The corresponding dendrogram is on the left part of Figure 19.

Figure 19. Estimator of the correlation matrix of the residuals of the protein accumulation measures once the rows and the columns of the residual matrix have been permuted according to the ordering provided by the hierarchical clustering.

The matrix containing the correlation means within and between the blocks or groups of proteins is displayed in Figure 20. We can see from this figure that Group 8 has the highest correlation mean, equal to 0.47. It consists of 34 proteins, which are given in Appendix 6.2.

Figure 20. Means of the correlations between the groups of proteins.

A basic gene ontology analysis (http://geneontology.org/) showed that proteins involved in the response to stress (biotic and abiotic), in nitrogen and phosphorus metabolic processes, in photosynthesis and carbohydrate metabolic processes and in oxidation-reduction processes are overrepresented in this group, see Figure 21. Thus, the correlation estimated within Group 8 seems to reflect a functional coherence of the proteins of this group.

Figure 21. Gene ontology (GO) term enrichment analysis of the 34 proteins belonging to Group 8. Data from PANTHER overrepresentation test (http://www.geneontology.org); One uploaded id (i.e. AT5G50370) mapped to two genes. Thus, GO term enrichment was performed on 35 elements. Blue bars: observed proteins in Group 8; Orange bars: expected result from the reference Arabidopsis genome.

The variable selection in the multivariate linear model using the R package MultiVarSel provided 31 proteins differentially accumulated in seeds produced under low, medium or elevated temperature. An aspartyl protease (AT3G54400) belongs both to Group 8 and to the proteins selected by MultiVarSel. This cell-wall-associated protein was up-accumulated in dry seeds produced under low temperature. The gene encoding this protease was described as a cold-responsive gene assigned to the C-repeat binding factor (CBF) regulatory pathway, see [36]. This pathway is required for the regulation of dormancy induced by low temperatures, see [25]. Consistently, in Figure 22, two other proteins related to cell wall organization, a beta-glucosidase (BGLC1, AT5G20950) and a translation elongation factor (eEF-1B1, AT1G30230), were differentially accumulated in seeds produced under contrasted temperatures. eEF-1B1 is associated with plant development and is involved in cell wall formation, see [21]. These results suggest that cell wall rearrangements occur under the temperature effect during seed maturation.

As displayed in Figure 22, 6 other proteins involved in mRNA translation (AT1G02780, AT1G04170, AT1G18070, AT1G72370, AT2G04390 and AT3G04840) were selected. The absolute failure of seed germination in the presence of protein synthesis inhibitors underlines the essential role of translation in achieving this developmental process, see [33]. Previous studies highlighted the importance of selective and sequential mRNA translation during seed germination and seed dormancy, see [15], [2] and [1]. Thus, exploring translational regulation during seed maturation and germination, through the dynamics of mRNA recruitment on polysomes or through the neosynthesized proteome, is an emerging field in seed research.

Figure 22. Values of the coefficients of $\widehat{B}$ obtained using the package MultiVarSel with a threshold of 0.95 on the proteomic dataset.

5. Conclusion

In this paper, we propose a fully data-driven methodology for estimating large block structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. Our methodology can also deal with matrices for which the block structure only appears if the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network and from GitHub. In the course of this study, we have shown that BlockCov is a very efficient approach from both the statistical and the numerical points of view. Moreover, its very low computational load makes its use possible even for very large covariance matrices having several thousands of rows and columns.

Acknowledgments

We thank the members of the EcoSeed European project (FP7 Environment, Grant/Award Number: 311840 EcoSeed, Coord. I. Kranner). IJPB was supported by the Saclay Plant Sciences LABEX (ANR-10-LABX-0040-SPS). We also thank the people who produced the biological material and performed the proteomic and metabolomic analyses. In particular, we would like to thank Warwick University (UWAR, Finch-Savage WE and Awan S) for the production of seeds, the Plant Observatory-Biochemistry platform (IJPB, Versailles; Bailly M, Cueff G) for having prepared the samples for proteomics and metabolomics, the PAPPSO Proteomic Platform (GQE-Moulon; Balliau T, Zivy M) for the mass spectrometry-based proteome analysis and the Plant Observatory-Chemistry/Metabolism platform (IJPB, Versailles; Clement G) for the GC/MS-based metabolome analyses.

References

  • [1] B. Bai, O. Novák, K. Ljung, J. Hanson, and L. Bentsink. Combined transcriptome and translatome analyses reveal a role for tryptophan-dependent auxin biosynthesis in the control of DOG1-dependent seed dormancy. New Phytologist, 217(3):1077–1085, 2018.
  • [2] B. Bai, A. Peviani, S. Horst, M. Gamm, L. Bentsink, and J. Hanson. Extensive translational regulation during seed germination revealed by polysomal profiling. New Phytologist, 214(1):233–244, 2017.
  • [3] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
  • [4] D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix Classes and Methods, 2018. R package version 1.2-13.
  • [5] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 12 2008.
  • [6] J. Bien and R. J. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.
  • [7] T. T. Cai and M. Yuan. Adaptive covariance matrix estimation through block thresholding. Ann. Statist., 40(4):2014–2042, 08 2012.
  • [8] R. B. Cattell. The scree test for the number of factors. Multivariate behavioral research, 1(2):245–276, 1966.
  • [9] E. Devijver and M. Gallopin. Block-diagonal covariance selection for high-dimensional gaussian graphical models. Journal of the American Statistical Association, 113(521):306–314, 2018.
  • [10] E. Dobriban. Permutation methods for factor analysis and PCA. arXiv:1710.00479, 2018.
  • [11] T. C. Durand, G. Cueff, B. Godin, B. Valot, G. Clément, T. Gaude, and L. Rajjou. Combined proteomic and metabolomic profiling of the arabidopsis thaliana vps29 mutant reveals pleiotropic functions of the retromer in seed development. International journal of molecular sciences, 20(2):362, 2019.
  • [12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
  • [13] J. Fan, L. Yuan, and L. Han. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
  • [14] B. Field, G. Cardon, M. Traka, J. Botterman, G. Vancanneyt, and R. Mithen. Glucosinolate and amino acid biosynthesis in arabidopsis. Plant Physiology, 135(2):828–839, 2004.
  • [15] M. Galland, R. Huguet, E. Arc, G. Cueff, D. Job, and L. Rajjou. Dynamic proteomics emphasizes the importance of selective mrna translation and protein turnover during arabidopsis seed germination. Molecular & Cellular Proteomics, 13(1):252–268, 2014.
  • [16] C. Giraud. Introduction to High-Dimensional Statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 2014.

  • [17] D. Harville. Matrix Algebra: Exercises and Solutions. Springer New York, 2001.
  • [18] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
  • [19] N. J. Higham. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.
  • [20] J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185, Jun 1965.
  • [21] Z. Hossain, L. Amyot, B. McGarvey, M. Gruber, J. Jung, and A. Hannoufa. The translation elongation factor eef-1b1 is involved in cell wall biosynthesis and plant development in arabidopsis thaliana. PLoS One, 2012. e30425.
  • [22] M. J. Hosseini and S.-I. Lee. Learning sparse gaussian graphical models with overlapping blocks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3808–3816. Curran Associates, Inc., 2016.
  • [23] Z. Huang, S. Footitt, and W. E. Finch-Savage. The effect of temperature on reproduction in the summer and winter annual arabidopsis thaliana ecotypes bur and cvi. Annals of botany, 113(6):921–929, 2014.
  • [24] R. A. Johnson and D. W. Wichern, editors. Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
  • [25] S. L. Kendall, A. Hellwege, P. Marriot, C. Whalley, I. A. Graham, and S. Penfield. Induction of dormancy in arabidopsis summer annuals requires parallel regulation of dog1 and hormone metabolism by low temperature and cbf transcription factors. The Plant Cell, 23(7):2568–2580, 2011.
  • [26] E. Kerdaffrec and M. Nordborg. The maternal environment interacts with genetic variation in regulating seed dormancy in swedish arabidopsis thaliana. PloS one, 12(12), 2017. e0190242.
  • [27] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
  • [28] D. R. MacGregor, S. L. Kendall, H. Florance, F. Fedi, K. Moore, K. Paszkiewicz, N. Smirnoff, and S. Penfield. Seed production temperature regulation of primary dormancy occurs through control of seed coat phenylpropanoid metabolism. New Phytologist, 205(2):642–652, 2015.
  • [29] M. Perrot-Dockès, C. Lévy-Leduc, L. Sansonnet, and J. Chiquet. Variable selection in multivariate linear models with high-dimensional covariance matrix estimation. Journal of Multivariate Analysis, 166:78 – 97, 2018.
  • [30] M. Perrot-Dockès, C. Lévy-Leduc, and J. Chiquet. MultiVarSel: Variable Selection in a Multivariate Linear Model, 2019. R package version 1.1.3.
  • [31] M. Perrot-Dockès, C. Lévy-Leduc, J. Chiquet, and L. Sansonnet. A variable selection approach in the multivariate linear model: an application to lc-ms metabolomics data. Statistical Applications in Genetics and Molecular Biology, 17(5), 2018.
  • [32] N. J. Provart, J. Alonso, S. M. Assmann, D. Bergmann, S. M. Brady, J. Brkljacic, …, and J. Dangl. 50 years of arabidopsis research: highlights and future directions. New Phytologist, 209(3):921–944, 2016.
  • [33] L. Rajjou, K. Gallardo, I. Debeaujon, J. Vandekerckhove, C. Job, and D. Job. The effect of α-amanitin on the Arabidopsis seed proteome highlights the distinct roles of stored and neosynthesized mRNAs during germination. Plant Physiology, 134(4):1598–1613, 2004.
  • [34] A. J. Rothman. Positive definite estimators of large covariance matrices. Biometrika, 99(3):733–740, 2012.
  • [35] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal. Statist. Soc B., 58(1):267–288, 1996.
  • [36] J. T. Vogel, D. Cook, S. G. Fowler, and M. F. Thomashow. The cbf cold response pathways of arabidopsis and tomato. Cold Hardiness in Plants: Molecular Genetics, Cell Biology and Physiology, pages 11–29, 2006.
  • [37] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, Dec 2007.
  • [38] F. Wen, Y. Yang, P. Liu, and R. C. Qiu. Positive definite estimation of large covariance matrix using generalized nonconvex penalties. IEEE Access, 4:4168–4182, 2016.
  • [39] U. Wittstock and B. A. Halkier. Glucosinolate research in the arabidopsis era. Trends in plant science, 7(6):263–270, 2002.

6. Appendix

6.1. Groups of metabolites

Group 1 Group 2 Group 3 Group 4
Alanine Arginine Glutamate beta.Sitosterol
Asparagine Cystein alpha.Tocopherol Campesterol
Aspartate Gaba gamma.Tocopherol Eicosanoate
Glycine Glutamine Linolenic.acid Heptadecanoate
Isoleucine Tryptophan H2SO4 Stearic.acid
Leucine Linoleic.acid X2.Oxoglutarate Tetracosanoate
Lysine Quercetin Mannitol BenzoylX.Glucosinolate.3
Phenylalanine BenzoylGlucosinolate.3Breakdown Urea Sulfite
Proline Nonanenitrile.9.methylthio Fructose.6.P U2609.4.361
Serine UGlucosinolatebreakdown140.5 Digalactosylglycerol U3122.4.202.I3M.
Threonine X2.Hydroxyglutarate Galactinol dihydroxybenzoate
Tyrosine Citrate Galactosylglycerol beta.indole.3.acetonitrile
Valine Erythronate Rhamnose U1837.6.368
X5..methylthio.pentanenitrile Galactonate Stachyose U1841.9.394
Octanenitrile.8.methylthio Gluconate Sucrose U2003.8.293
UGlucosinolatebreakdown140.4 Glycerate U1093.6.147 U2371.1.361
Succinate Malate U1124.3.140 U2375.6.191
Threonate Allantoin U1530.2.314 U2513.2.296
Arabitol Erythritol U2053.6.321.1 U2513.2.296.1
myo.Inositol Ethanolamine U2109.3.305 U2692.9.361
Glycerol.3.P Sorbitol U2197.2.494 U2798.377
myo.Inositol.1.P Threitol U2315.2.245 U2942.2.556
Phosphate Xylitol U3898.1.204 U3063.0.361
U2206.2.299 Ethylphosphate U3415.9.498
Fructose Glucose
Glucopyranose..H2O. Mannose
U1154.3.156 Raffinose
U1393.172 Ribose
U1541.8.263 U1127.5.140
U1647.2.403 U1172.9.281
U1705.2.319.pentitol. U1559.4.217
U1729.0.273 U1628.9.233
U1816.2.228 U1849.2.285
U1859.2.246 U1927.0.204
U2076.9.204 U1939.1.210
U2170.6.361 U1983.0.217
U2184.1.299 U1983.0.217.1
U2251.5.361 U2012.7.361
U2278.6.361 U2282.4.349
U2550.7.149 U2400.1.179
U2731.2.160 U2779.9.361
U2857.8.342
U2929.1.297
U3041.1.361
U3080.7.361
U3100.8.361
Group 5 Group 6 Group 7 Group 8
Quercitrin X4MTB BenzoylGlucosinolate.2Breakdown Maleate
Dehydroascorbate X5MTP Hexanenitrile.6methylthio Pentonate.4
Fumarate X6MTH Sinapinate.trans U1408.4.298
Sinapinate.cis X7MTH Anhydroglucose U1617.8.146
Arabinose X8MTO U1125.1.140 U1767.3.243
Galactose UGlucosinolate140.1 U1290.198 U1904.9.204
U1127.4.169 U1129.9.184 U1371.5.151 U2828.8.361
U1718.0.157 U1270.1.240 U1549.7.130 U2839.3.312
U1931.5.202 U1897.2.327 U1568.5.313 U2882.5.297
U2261.0.218 U2473.361 U1592.8.217 U3008.3.457
U2412.1.157 U2529.8.361 U1700.6.288 U3168.2.290
U2588.9.535 U2756.4.271 U1759.4.331 U3218.5.297
U2688.5.333 U2924.3.361 U1852.0.217 U3910.6.597.Trigalactosylglycerol.
U3213.1.400 U3279.7.361 U1872.1.204.methyl.hexopyranoside. U2443.7.217
U1380.5.184 U1958.217
U2053.6.321
U2087.6.321
U2150.9.279
U2271.6.249
U3188.1.361
U3701.368
U4132.5.575

6.2. Groups of proteins

Proteins Group
AT1G14170.1 8
AT1G20260.1 8
AT1G42970.1 8
AT1G47980.1 8
AT1G55210.1 8
AT1G75280.1 8
AT2G19900.1 8
AT2G22240.1 8
AT2G28900.1 8
AT2G32920.1 8
AT2G37970.1 8
AT3G12580.1 8
AT3G13930.1 8
AT3G26650.1 8
AT3G26720.1 8
AT3G44300.1 8
AT3G47930.1 8
AT3G54400.1 8
AT3G55800.1 8
AT4G16760.1 8
AT4G20830.1 8
AT4G25740.1 8
AT4G34870.1 8
AT4G35790.1 8
AT5G11880.1 8
AT5G12040.1 8
AT5G14030.1 8
AT5G17380.1 8
AT5G22810.1 8
AT5G26000.1 8
AT5G50370.1 8
AT5G66190.1 8
AT5G67360.1 8
ATCG00480.1 8