1. Introduction
Plant functional genomics refers to the description of the biological function of a single gene or of a group of genes, and to both the dynamics and the plasticity of genome expression in shaping the phenotype. Combining multi-omics approaches, such as transcriptomics, proteomics or metabolomics, allows us to address in a new light the dimension and the complexity of the different levels of gene expression control and the delicacy of the metabolic regulation of plants under fluctuating environments. Our era thus marks a real conceptual shift in plant biology, where the individual is no longer considered as a simple sum of components but rather as a system whose interacting components together maximize its growth, its reproduction and its adaptation. Plant systems biology is therefore defined by multidisciplinary and multiscale approaches based on the acquisition of a range of data as wide and exhaustive as possible.
In this context, it is crucial to propose new methodologies for integrating heterogeneous data explaining the co-regulations/co-accumulations of products of gene expression (mRNA, proteins) and metabolites. In order to better understand these phenomena, our goal will thus be to propose a new approach for estimating a block-structured covariance matrix in a high-dimensional framework where the dimension of the covariance matrix is much larger than the sample size. In this setting, it is well known that the commonly used sample covariance matrix performs poorly. In recent years, researchers have proposed various regularization techniques to consistently estimate large covariance matrices or the inverses of such matrices, namely precision matrices. To estimate such matrices, one of the key assumptions made in the literature is that the matrix of interest is sparse, namely that many of its entries are equal to zero. A number of regularization approaches, including banding, tapering, thresholding and ℓ1-penalized minimization, have been developed to estimate large covariance matrices or their inverses; see, for instance, [27], [5], [3], [6] and [34] among many others. For further references, we refer the reader to [7] and to the review of [13].
In this paper, we shall consider the following framework. Let E_1, ..., E_n be zero-mean i.i.d. q-dimensional random vectors having a covariance matrix Σ such that q, the number of its rows and columns, is much larger than the sample size n. The goal of the paper is to propose a new estimator of Σ and of the square root of its inverse, Σ^(-1/2), in the particular case where Σ is assumed to have a block structure, without limiting ourselves to diagonal blocks. An accurate estimator of Σ can indeed be very useful to better understand the links between the columns of the observation matrix and may highlight some biological processes. Moreover, an estimator of Σ^(-1/2) can be very useful in the general linear model in order to remove the dependence that may exist between the columns of the observation matrix. For further details on this point, we refer the reader to [29], [31] and to the R package MultiVarSel, in which such an approach is proposed and implemented for performing variable selection in the multivariate linear model in the presence of dependence between the columns of the observation matrix. More precisely, in this paper, we shall assume that

(1) Σ = Z Z' + D,

where Z is a q×k sparse matrix with k ≪ q, Z' denotes the transpose of the matrix Z and D is a diagonal matrix such that the diagonal terms of Σ are equal to one. Two examples of such matrices Z and Σ are given in Figure 1, in a case where the columns of Σ do not need to be permuted in order to see the block structure. Based on (1), our model could seem close to the factor models described in [24] and [13]. However, in [24] the high-dimensional aspects are not considered, and in [13] the sparsity constraint is not studied. Note also that the block diagonal assumption has already been considered by [9] for estimating the inverse of large covariance matrices in high-dimensional Gaussian Graphical Models (GGM).
We also propose a methodology to estimate Σ in the case where the block structure is latent, that is, when the columns and rows of Σ have to be permuted in order to make its block structure visible. An example of such a matrix is given in Figure 2.
Our approach is fully data-driven and consists in first providing a low rank approximation of the Z Z' part of Σ and then in using an ℓ1 regularization to obtain a sparse estimator of Z Z'. When the block structure is latent, a hierarchical clustering step must be applied first. With this estimator of Z Z', we explain how to obtain an estimator of Σ^(-1/2).

2. Statistical inference
The strategy that we propose for estimating Σ and Σ^(-1/2) can be summarized as follows.

First step: Low rank approximation. In this step, we propose to approximate the Z Z' part of Σ by a low rank matrix using a Singular Value Decomposition (SVD).

Second step: Detecting the position of the non-null values. In this step, we use a Lasso criterion to yield a sparse estimator of Z Z'.

Third step: Positive definiteness. We apply the methodology of [19] to the sparse estimator obtained in the second step to ensure that the final estimator of Σ is positive definite.

Fourth step: Estimation of Σ^(-1/2). In this step, Σ^(-1/2) is estimated from the spectral decomposition of the estimator of Σ obtained in the previous step.
2.1. Low rank approximation
By definition of Σ in (1), Z Z' is a low rank matrix having its rank smaller than or equal to k. In this first step, our goal is thus to propose a low rank approximation of an estimator of Z Z'.
Let S be the sample covariance matrix defined by

S = (1/n) ∑_{i=1}^{n} (E_i − Ē)(E_i − Ē)',

where Ē = (1/n) ∑_{i=1}^{n} E_i. The corresponding sample correlation matrix R is defined by:

(2) R_{i,j} = S_{i,j} / (S_{i,i} S_{j,j})^{1/2}, 1 ≤ i, j ≤ q.

Let us also consider the matrix R̃ defined by:

(3) R̃ = R − Id_q,

where Id_q denotes the q×q identity matrix; R̃ is thus the matrix R in which the diagonal terms are replaced by 0. If R were the true correlation matrix of the E_i's, the corresponding matrix R̃ would have a rank less than or equal to k. Since R is only an estimator of this matrix, we shall use a rank k approximation of R̃, denoted R̃_k in the following. This approximation is obtained by keeping, in the singular value decomposition of R̃, only the k largest singular values and by replacing the other ones by 0. By [12], this corresponds to the best rank k approximation of R̃. The choice of k will be discussed in Section 2.5.
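The low rank approximation step can be sketched as follows. This is an illustrative Python/NumPy version (the actual implementation is in the R package BlockCov, in R); the function name `low_rank_approx` is ours:

```python
import numpy as np

def low_rank_approx(R_tilde, k):
    """Best rank-k approximation (Eckart-Young): keep the k largest
    singular values of R_tilde and replace the other ones by zero."""
    U, s, Vt = np.linalg.svd(R_tilde, full_matrices=False)
    s[k:] = 0.0  # zero out all but the k largest singular values
    return U @ np.diag(s) @ Vt

# toy example: a rank-2 matrix plus small noise is well approximated at k = 2
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))
R_tilde = Z @ Z.T + 0.01 * rng.normal(size=(50, 50))
approx = low_rank_approx(R_tilde, 2)
print(np.linalg.matrix_rank(approx))  # 2
```

By the Eckart-Young theorem recalled above, no other rank-2 matrix is closer to `R_tilde` in Frobenius norm.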
2.2. Detecting the position of the non null values
Let us first explain the usual framework in which the Lasso approach is used. We consider a linear model of the following form:

(4) Y = X β + ε,

where Y and ε are n-dimensional vectors, X is an n×p design matrix and β is a sparse p-dimensional vector, meaning that it has many null components.

In such models, a very popular approach initially proposed by [35] is the Least Absolute Shrinkage and Selection Operator (Lasso), which is defined as follows for a positive λ:

(5) β̂(λ) = Argmin_β { ‖Y − X β‖_2^2 + λ ‖β‖_1 },

where, for u = (u_1, ..., u_p), ‖u‖_1 = ∑_{i=1}^{p} |u_i|, i.e. the ℓ1 norm of the vector u. Observe that the first term of (5) is the classical least-squares criterion and that λ ‖β‖_1 can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the ℓ1 norm, ensuring that the number of non-zero components of the estimator of β is small for large enough values of λ.
Let

(6) Y = vech(R̃_k),

where R̃_k denotes the rank k approximation obtained in Section 2.1 and vech, defined in Section 16.4 of [17], is such that, for a q×q matrix A, vech(A) is the vector obtained by stacking the columns A^(1), ..., A^(q−1), where A^(j) is the subvector of the j-th column of A obtained by striking out the first j elements. In order to estimate the sparse matrix Z Z', we need to propose a sparse estimator of Y. To do this, we apply the Lasso criterion described in (5), where X is the identity matrix. In the case where X is an orthogonal matrix, it has been shown in [16] that the solution of (5) is:

β̂_j(λ) = sign(X_j' Y) ( |X_j' Y| − λ/2 )_+,

where X_j denotes the j-th column of X and (x)_+ = max(x, 0). Using the fact that X is the identity matrix, we get

(7) β̂_j(λ) = sign(Y_j) ( |Y_j| − λ/2 )_+.
We then re-estimate the non-null coefficients using the least-squares criterion and get:

(8) β̃_j(λ) = Y_j 1_{β̂_j(λ) ≠ 0},

where Y is defined in (6).
It has to be noticed that β̂(λ) obtained in (7) corresponds to the strict lower triangular part of the matrix B̂(λ) satisfying the following criterion:

B̂(λ) = Argmin_B { ‖R̃_k − B‖_F^2 + λ ‖vec(B)‖_1 },

where ‖A‖_F denotes the Frobenius norm defined for a matrix A by ‖A‖_F^2 = ∑_{i,j} A_{i,j}^2, vec(A) denotes the vector formed by stacking the columns of A, and R̃_k denotes the rank k approximation of the matrix R̃ defined in (3). It is thus closely related to the generalized thresholding estimator defined in [38] and to the one defined in [34], except that in our case the matrix R is replaced by R̃, which corresponds to the matrix R in which the diagonal terms are replaced by 0; the diagonal terms of R were indeed already removed in R̃. Hence, we get β̂(λ) by elementwise soft-thresholding, that is, by putting to zero the values of Y that are under a given threshold and by shrinking the non-null values by an amount involving this threshold.
Here, we choose to estimate Z Z' through β̃(λ) defined in (8), which corresponds to a hard-thresholding, and we set the upper triangular part of the estimator Σ̂ of Σ accordingly. Since the diagonal terms of Σ are assumed to be equal to 1, we take the diagonal terms of Σ̂ equal to 1. The lower triangular part of Σ̂ is then obtained by symmetry.
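The two thresholding rules of this section can be illustrated by the following Python/NumPy sketch of Equations (7) and (8) (the actual implementation is in the R package BlockCov; the function names are ours):

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form Lasso solution when the design is the identity
    (Eq. (7)): entrywise soft-thresholding at level lam / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

def hard_threshold(y, lam):
    """Re-estimation of the non-null coefficients by least squares
    (Eq. (8)): keep y_j unchanged wherever soft-thresholding is non-zero."""
    return np.where(soft_threshold(y, lam) != 0.0, y, 0.0)

y = np.array([0.9, -0.05, 0.3, -0.6, 0.02])
print(soft_threshold(y, 0.5))
print(hard_threshold(y, 0.5))
```

With λ = 0.5, entries of `y` smaller than 0.25 in absolute value are set to zero by both rules; soft-thresholding additionally shrinks the surviving entries by 0.25, whereas hard-thresholding keeps them unchanged.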
The choice of the best parameter λ, denoted λ̂ in the following, will be discussed in Section 3.2.
2.3. Positive definiteness
2.4. Estimation of Σ^(-1/2)
Even if providing an estimator of a large covariance matrix can be very useful in practice, it may also be interesting to efficiently estimate Σ^(-1/2). Such an estimator can indeed be used in the general linear model in order to remove the dependence that may exist between the columns of the observation matrix. For further details on this point, we refer the reader to [29], [31] and to the R package MultiVarSel, in which such an approach is proposed and implemented for performing variable selection in the multivariate linear model in the presence of dependence between the columns of the observation matrix.
Since Σ̂ is a symmetric matrix, it can be rewritten as Σ̂ = U Λ U', where Λ is a diagonal matrix and U is an orthogonal matrix. The matrix Σ^(-1/2) can thus be estimated by U Λ^(-1/2) U', where Λ^(-1/2) is a diagonal matrix having its diagonal terms equal to the square root of the inverse of the eigenvalues of Σ̂. However, inverting the square root of too small eigenvalues may lead to poor estimators of Σ^(-1/2). This is the reason why we propose to estimate Σ^(-1/2) by

(9) Σ̂^(-1/2) = U Λ_t^(-1/2) U',

where Λ_t^(-1/2) is a diagonal matrix such that its diagonal entries are equal to the square root of the inverse of the diagonal entries of Λ, except for those which are smaller than a given threshold t, which are replaced by 0 in Λ_t^(-1/2). The choice of t will be further discussed in Section 3.6.
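Equation (9) can be illustrated by the following Python/NumPy sketch (the package itself is in R; the function name and the value of the threshold below are ours):

```python
import numpy as np

def inv_sqrt(sigma_hat, t=0.1):
    """Estimator of Sigma^{-1/2} from the spectral decomposition
    sigma_hat = U diag(d) U': invert the square roots of the
    eigenvalues, setting to 0 those below the threshold t (Eq. (9))."""
    d, U = np.linalg.eigh(sigma_hat)
    # clip avoids taking sqrt of non-positive values in the unused branch
    inv_sqrt_d = np.where(d > t, 1.0 / np.sqrt(np.clip(d, 1e-12, None)), 0.0)
    return U @ np.diag(inv_sqrt_d) @ U.T

# sanity check on a well-conditioned matrix: S^{-1/2} S S^{-1/2} = Id
S = np.array([[1.0, 0.5], [0.5, 1.0]])
M = inv_sqrt(S)
print(np.round(M @ S @ M, 6))
```

When all eigenvalues exceed the threshold, the product above recovers the identity matrix, which is the decorrelation property used in Section 3.6.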
2.5. Choice of the parameters
Our methodology for estimating Σ depends on two parameters: the number k of singular values kept for defining the low rank approximation R̃_k and the parameter λ which controls the sparsity level, namely the number of zero values in β̃(λ) defined in (8).
For choosing k, we shall compare two strategies in Section 3.1:

The Cattell criterion based on the Cattell's scree plot described in [8] and

the PA criterion based on the permutation approach described in [10].
To choose the parameter λ in (8), we shall compare two strategies in Section 3.2:

The BL approach proposed in [5] based on cross-validation and

the Elbow method, which consists in computing, for different values of λ, the Frobenius norm of the difference between R and the estimator Σ̂(λ), where R and Σ̂(λ) are defined in (2) and at the end of Section 2.2, respectively. Then, it fits two simple linear regressions and chooses the value of λ achieving the best fit.
3. Numerical experiments
Our methodology described in the previous section is implemented in the R package BlockCov, which is available from the CRAN (Comprehensive R Archive Network) and from GitHub.
We propose hereafter to investigate the performance of our approach for different types of matrices Σ defined in (1) and for different values of n and q. The four cases considered below correspond to different types of matrices Z, the matrices D being chosen accordingly to ensure that the matrix Σ has its diagonal terms equal to 1.

DiagonalEqual case. In this situation, Z has the structure displayed in the left part of Figure 1, namely it has 5 columns, each containing a fixed number of non-null entries, all the non-null entries of a given column being equal to a common fixed value.

DiagonalUnequal case. In this scenario, Z has the same structure as in the DiagonalEqual case except that the non-null values in the five columns are no longer fixed but randomly chosen, the third column being drawn from a different range than the other ones.

ExtraDiagonalEqual case. Here, Z has the structure displayed in the right part of Figure 1. The values of the columns of Z are the same as those of the DiagonalEqual case except for the fourth column, which is assumed to contain additional non-null values equal to 0.5.

ExtraDiagonalUnequal case. Z has the same structure as in the ExtraDiagonalEqual case except that the values are randomly chosen as in the DiagonalUnequal case, the additional non-null values of the fourth column still being equal to 0.5.
For each value of n and q considered, 100 matrices E were generated such that their rows are i.i.d. q-dimensional zero-mean Gaussian vectors having a covariance matrix Σ chosen according to one of the four previous cases: DiagonalEqual, DiagonalUnequal, ExtraDiagonalEqual or ExtraDiagonalUnequal.
3.1. Low rank approximation
The approaches for choosing k described in Section 2.5 are illustrated in Figure 3 in the ExtraDiagonalUnequal case. We can see from this figure that both methodologies find the right value of k, which is here equal to 5.
To go further, we investigate the behavior of our methodologies from 100 replications of the matrix E for the four different types of Σ. Figure 4 displays the barplots associated to the estimation of k made in the different replications by the two approaches for the different scenarios. We can see from this figure that the PA criterion seems to be slightly more stable than the Cattell criterion in some settings but underestimates the value of k in others. Moreover, in terms of computational time, the performance of Cattell is much better, see Figure 5.
3.2. Positions of the non null values
For the four scenarios, the performance of the two approaches, BL and Elbow, described in Section 2.5 for choosing λ, and hence the number of non-null values in Σ̂, is illustrated in Figure 6. This figure displays the True Positive Rate (TPR) and the False Positive Rate (FPR) of the methodologies from 100 replications of the matrix E for the four different types of Σ and for different values of n and q.
We can see from this figure that the performance of Elbow is on a par with the one of BL except in one setting, for which the performance of Elbow is slightly better in terms of True Positive Rate. Moreover, in terms of computational time, the performance of Elbow is much better, see Figure 7.
3.3. Comparison with other methodologies
The goal of this section is to compare the statistical performance of our approach with other methodologies.
Since our goal is to estimate a covariance matrix containing blocks, we shall compare our approach with clustering techniques. Once the groups (or blocks) have been obtained, Σ is estimated by assuming that the corresponding estimator is blockwise constant, except for the diagonal blocks, for which the diagonal entries are equal to 1 and the extra-diagonal terms are assumed to be equal. This gives a great advantage to these methodologies in the DiagonalEqual and ExtraDiagonalEqual scenarios. More precisely, let ĉ_{i,j} denote the value of the entries in the block having its rows corresponding to Group (or Cluster) i and its columns to Group (or Cluster) j. Then, for a given clustering:
(10) ĉ_{i,j} = (1 / (|C_i| |C_j|)) ∑_{k ∈ C_i} ∑_{l ∈ C_j} R_{k,l},

where C_i denotes the cluster i, |C_i| denotes the number of elements in the cluster C_i and R_{k,l} is the (k,l) entry of the matrix R defined in Equation (2).
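Equation (10) can be illustrated by the following Python/NumPy sketch (hypothetical function name); for simplicity, the block means are computed including the diagonal entries, which are then reset to 1 as described in the text:

```python
import numpy as np

def blockwise_means(R, labels):
    """Given a correlation matrix R and a clustering of its columns,
    compute the mean of the entries of R within each pair of clusters
    (Eq. (10)) and return the blockwise-constant estimator."""
    labels = np.asarray(labels)
    est = np.empty_like(R)
    for i in np.unique(labels):
        for j in np.unique(labels):
            block = R[np.ix_(labels == i, labels == j)]
            est[np.ix_(labels == i, labels == j)] = block.mean()
    np.fill_diagonal(est, 1.0)  # diagonal terms are assumed equal to 1
    return est

R = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.6],
              [0.2, 0.1, 0.6, 1.0]])
print(blockwise_means(R, [0, 0, 1, 1]))
```

With the clustering {0, 1} / {2, 3}, each block of the output is filled with the mean of the corresponding block of `R`.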
For the matrices corresponding to the four scenarios previously described, we shall compare the statistical performance of the following methods:

empirical, which estimates Σ by the sample correlation matrix R defined in (2),

blocks, which estimates Σ using the methodology described in this article with the criteria PA and BL for choosing k and λ, respectively,

blocks_fast, which estimates Σ using the methodology described in this article with the criteria Cattell and Elbow for choosing k and λ, respectively,

blocks_real, which estimates Σ using the methodology described in this article when k and the number of non-null values are assumed to be known, which gives access to the best value of λ,

Specc, which estimates the clusters using the spectral clustering described in [37] and estimates Σ with Equation (10),

hclust, which estimates the clusters using hierarchical clustering and estimates Σ with Equation (10),

kmeans, which estimates the clusters using the k-means algorithm and estimates Σ with Equation (10).
In order to improve the performance of the clustering approaches hclust, Specc and kmeans, the real number of clusters has been provided to these methods. The performance of the different approaches is assessed using the Frobenius norm of the difference between Σ and its estimator.
Figure 8 displays the means and standard deviations of the Frobenius norm of the difference between Σ and its estimator for different values of n and q in the four different cases: DiagonalEqual, DiagonalUnequal, ExtraDiagonalEqual and ExtraDiagonalUnequal. We can see from this figure that in some settings the performance of blocks_fast is on a par with the one of blocks_real and is better than the one of blocks, whereas in other settings the performance of blocks is slightly better than the one of blocks_fast and is similar to the one of blocks_real. Moreover, in all cases, either blocks_fast or blocks outperforms the other approaches. Then, the estimators of Σ derived from blocks, blocks_fast and blocks_real were compared to the PDSCE estimator proposed by [34] and implemented in the R package PDSCE. Since the computational burden of PDSCE is high for large values of q, we limited ourselves to the ExtraDiagonalEqual case for the comparison. Figure 9 displays the results. We can see from this figure that blocks, blocks_fast and blocks_real provide better results than PDSCE. However, it has to be noticed that the latter approach is not designed for dealing with block-structured covariance matrices but only for providing sparse estimators of large covariance matrices.
3.4. Columns permutation
In practice, it may occur that the columns of the observation matrix E are not ordered in a way which makes blocks appear in the matrix Σ. To address this issue, we propose to first perform a hierarchical clustering on the columns of E and to use the obtained ordering, which guarantees that a cluster plot using this ordering will not have crossings of the branches. Our methodology is then applied to the matrix in which the columns have been permuted accordingly, which should provide an efficient estimator of the covariance matrix of the permuted observations. In order to get an estimator of Σ, its columns and rows are finally permuted back according to the ordering coming from the hierarchical clustering.
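The reordering step can be sketched as follows with SciPy's hierarchical clustering; the dissimilarity choice (one minus the absolute correlation) and the function name are our own illustrative assumptions, not necessarily those of BlockCov:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def block_revealing_order(E):
    """Order the columns of an observation matrix E with a hierarchical
    clustering of the sample correlations, so that a latent block
    structure becomes visible once the correlation matrix is reordered."""
    R = np.corrcoef(E, rowvar=False)
    # dissimilarity between columns: 1 - |correlation|
    d = np.clip(1.0 - np.abs(R), 0.0, None)
    # condensed (row-major upper triangular) form expected by linkage
    condensed = d[np.triu_indices_from(d, k=1)]
    return leaves_list(linkage(condensed, method="average"))

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
# six columns, two latent blocks {0, 2, 4} and {1, 3, 5}, interleaved
E = np.column_stack([base[:, i % 2] + 0.1 * rng.normal(size=200)
                     for i in range(6)])
order = block_revealing_order(E)
print(order)  # columns of the same block end up contiguous
```

Reordering the columns of `E` (and the rows and columns of its correlation matrix) by `order` makes the two blocks appear as contiguous diagonal blocks.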
To assess the corresponding loss of performance, we generated, for each matrix E used for making Figure 8, a matrix in which the columns of E were randomly permuted. Then, we applied the methodology described in the previous paragraph (denoted blocks_samp and blocks_fast_samp in Figure 10). The performance of this new methodology was compared to the methodology proposed in the previous sections (denoted blocks and blocks_fast in Figure 10), in which the columns of E were not permuted. The results are displayed in Figure 10. We can see from this figure that the performance of our approach does not seem to be altered by the permutation of the columns.
3.5. Numerical performance
Figure 11 displays the computational times for estimating Σ with the methods blocks and blocks_fast for different values of q ranging from 100 to 3000. The timings were obtained on a workstation with 16 GB of RAM and an Intel Core i7 (3.66 GHz) CPU. Our methodology is implemented in the R package BlockCov, which uses the R language (R Core Team, 2017) and relies on the R package Matrix. We can see from this figure that it takes around 3 minutes to estimate the largest correlation matrices considered.
3.6. Choice of the threshold for estimating Σ^(-1/2)
Since we are interested in assessing the ability of the estimator Σ̂^(-1/2) defined in (9) to remove the dependence that may exist between the columns of E, we shall consider the Frobenius norm of the difference between the sample correlation matrix of the decorrelated observations E Σ̂^(-1/2) and Id_q, which should be close to zero, where Id_q denotes the identity matrix of size q. Figure 12 displays this Frobenius norm for different thresholds t. A threshold of 0.1 seems to provide a small error in terms of Frobenius norm. Hence, in the following, t will be equal to 0.1 and the corresponding estimator will be referred to as Σ̂^(-1/2).
This technique was applied to all of the estimators of Σ discussed in Section 3.3 to get different estimators of Σ^(-1/2). The Frobenius norm of the error is used to compare the estimators of Σ^(-1/2) obtained by considering the different estimators of Σ. The results are displayed in Figure 13. We observe from this figure that in some settings the estimators of Σ^(-1/2) derived from the empirical, blocks_fast and blocks_real estimators of Σ perform similarly and seem to be more adapted than the others to remove the dependence among the columns of E. However, in other settings, the behavior is completely different. Firstly, in the DiagonalEqual case, the estimator of Σ^(-1/2) derived from the hclust estimator of Σ seems to perform better than the others. Secondly, in the DiagonalUnequal case, the estimators derived from blocks, blocks_fast and blocks_real perform similarly to the one obtained from hclust. Thirdly, in the ExtraDiagonal cases, the estimators derived from the blocks, blocks_fast and blocks_real methodologies perform better than the other estimators.
Then, the estimators of Σ^(-1/2) derived from blocks, blocks_fast and blocks_real were compared to the GRAB estimator proposed by [22]. Since the computational burden of GRAB is high for large values of q, we limited ourselves to the ExtraDiagonalEqual case for the comparison. Figure 14 displays the results. We can see that blocks and blocks_real provide better results than GRAB. However, it has to be noticed that the latter approach depends on many parameters that are difficult to choose; we therefore used the default ones.
4. Application to “multiomic” approaches to study seed quality
Climate change could lead to major crop failures in the world. In the present study, we addressed the impact of the mother plant environment on seed composition. Seed quality is indeed of paramount ecological and agronomical importance: seeds are the most efficient form of dispersal of flowering plants in the environment, and they are remarkably adapted to harsh environmental conditions as long as they are in a quiescent state. Dry mature seeds (so-called "orthodox seeds") are an appropriate resource for the preservation of plant genetic diversity in seed banks. It has been reported that the temperature regime during seed production affects agronomical traits such as the seed germination potential, see [23], [28] and [26]. In order to highlight biomarkers of seed quality according to the thermal environment of the mother plant, Arabidopsis seeds were produced under three temperature regimes (14–16 °C, 18–22 °C or 25–28 °C under a long-day photoperiod). Dry mature seeds were analysed by shotgun proteomics and GC/MS-based metabolomics [11]. The choice of the model plant Arabidopsis was motivated by the colossal effort of the international scientific community for its genome annotation. This plant remains at the forefront of modern genetics, genomics, plant modelling and systems biology, see [32]. Arabidopsis provides a very useful basis to study gene regulatory networks and to develop modelling and systems biology approaches for translational research towards agricultural applications.
In this section, we apply our R packages BlockCov and MultiVarSel [30] to metabolomic and proteomic data to better understand the impact of the temperature on seed quality. More precisely, we use the following modeling for our observations:
(11) Y = X B + E,

where Y is an n×q matrix containing the responses of the q metabolites (resp. proteins) for the n samples, X is the design matrix of a one-way ANOVA model, such that its first (resp. second, resp. third) column is equal to 1 if the corresponding sample grew under low (resp. medium, resp. elevated) temperature and 0 otherwise, B is a coefficient matrix and E is such that its rows are zero-mean i.i.d. q-dimensional random vectors having a covariance matrix Σ. We used our R package BlockCov to estimate Σ and Σ^(-1/2), assuming that there exists a latent block structure in the covariance matrix of the rows of E. More precisely, we assume that there exist some groups of metabolites (resp. proteins) having the same behavior because they belong to the same biological process. Then, we plugged this estimator into our R package MultiVarSel to obtain a sparse estimator of B. Thanks to this estimator of B, we could identify the metabolites (resp. proteins) having a higher (resp. lower) concentration when the temperature is high or low.
4.1. Results obtained for the metabolomic data
We first estimated the matrices Σ and Σ^(-1/2) associated to the error matrix E defined in Equation (11) by using the methodology developed in this paper, namely the BlockCov package. By the results of Section 3, we know that the PA and BL approaches can perform poorly in settings such as this one, so we used the Cattell and Elbow criteria to choose k and λ, respectively. The results are displayed in Figure 15. With the values of k and λ chosen by the Cattell and Elbow criteria, among the 19701 coefficients of the correlation matrix, only 6696 values are considered as non-null.
The estimator of Σ obtained with our methodology is displayed in Figure 16, once the rows and the columns have been permuted according to the ordering provided by the hierarchical clustering to make the latent block structure visible.
Using the estimator of Σ^(-1/2) provided by the BlockCov package in the R package MultiVarSel provides the sparse estimator of the matrix B defined in Model (11) displayed in Figure 17. We can see from this figure that, for the metabolite X5MTP, the coefficient of the matrix B is positive when the temperature is high, which means that the production of the metabolite X5MTP is larger in high temperature conditions than in low temperature conditions.
In order to go further in the biological interpretation, we wanted to better understand the underlying block structure of the estimator of the correlation matrix of the residuals based on metabolite abundances. Thus, we applied a hierarchical clustering with 8 groups to this matrix in order to split it into blocks. The corresponding dendrogram is on the left part of Figure 16. The matrix containing the correlation means within and between the blocks (or groups) of metabolites is displayed in Figure 18. The composition of the metabolite groups is available in Appendix 6.1.
Interestingly, we could observe that X5MTP belongs to Group 6, which displays a high correlation mean equal to 0.8 between the 14 metabolites that make it up. At least 6 metabolites of this group belong to the same family, namely glucosinolates (i.e. X4MTB, 4-methylthiobutyl glucosinolate; X5MTP, 5-methylthiopentyl glucosinolate; X6MTH, 6-methylthiohexyl glucosinolate; X7MTH, 7-methylthioheptyl glucosinolate; X8MTO, 8-methylthiooctyl glucosinolate; UGlucosinolate140.1, unidentified glucosinolate). Glucosinolates (GLS) are specialized metabolites found in Brassicaceae and related families (e.g. Capparaceae), containing a thioglucose moiety, a sulfonated oxime moiety, and a variable aglycone side chain derived from an amino acid. These compounds contribute to the plant's overall defense mechanism, see [39]. Methylthio-GLS are derived from methionine. Methionine is elongated through condensation with acetyl-CoA and then converted to aldoximes through the action of individual members of the cytochrome P450 enzymes belonging to the CYP79 family, see [14]. The aldoxime undergoes condensation with a sulfur donor and is stepwise converted to GLS, followed by the side chain modification. The present results suggest that the accumulation of the methionine-derived glucosinolate family is strongly coordinated in Arabidopsis seeds. Moreover, we can see that they are influenced by the effect of the mother plant thermal environment.
4.2. Results obtained for the proteomic data
The same study was conducted on the proteomic data. The estimator of the correlation matrix of the residuals based on protein abundances obtained with our methodology is displayed in Figure 19, once the rows and the columns have been permuted according to the ordering provided by the hierarchical clustering to make the latent block structure visible. To better understand this underlying block structure, we applied a hierarchical clustering with 9 groups to this matrix in order to split it into blocks. The corresponding dendrogram is on the left part of Figure 19.
The matrix containing the correlation means within and between the blocks (or groups) of proteins is displayed in Figure 20. We can see from this figure that Group 8 has the highest correlation mean, equal to 0.47. It consists of 34 proteins, which are listed in Appendix 6.2.
A basic gene ontology analysis (http://geneontology.org/) showed that proteins involved in response to stress (biotic and abiotic), in nitrogen and phosphorus metabolic processes, in photosynthesis and carbohydrate metabolic processes and in oxidation-reduction processes are over-represented in this group, see Figure 21. Thus, the correlation estimated within Group 8 seems to reflect a functional coherence of the proteins of this group.
The variable selection in the multivariate linear model using the R package MultiVarSel provided 31 proteins differentially accumulated in seeds produced under low, medium or elevated temperature. An aspartyl protease (AT3G54400) belongs both to Group 8 and to the proteins selected by MultiVarSel. This cell-wall-associated protein was up-accumulated in dry seeds produced under low temperature. The gene encoding this protease was described as a cold-responsive gene assigned to the C-repeat binding factor (CBF) regulatory pathway, see [36]. This pathway is required for the regulation of dormancy induced by low temperatures, see [25]. Consistently, in Figure 22, two other proteins related to cell wall organization, a beta-glucosidase (BGLC1, AT5G20950) and a translation elongation factor (eEF1B1, AT1G30230), were differentially accumulated in seeds produced under contrasted temperatures. eEF1B1 is associated with plant development and is involved in cell wall formation, see [21]. These results suggest that cell wall rearrangements occur under the temperature effect during seed maturation.
As displayed in Figure 22, 6 other proteins involved in mRNA translation (AT1G02780, AT1G04170, AT1G18070, AT1G72370, AT2G04390 and AT3G04840) were selected. The absolute failure of seed germination in the presence of protein synthesis inhibitors underlines the essential role of translation in achieving this developmental process, see [33]. Previous studies highlighted the importance of selective and sequential mRNA translation during seed germination and seed dormancy, see [15], [2] and [1]. Thus, exploring translational regulation during seed maturation and germination, through the dynamics of mRNA recruitment on polysomes or of the neosynthesized proteome, is an emerging field in seed research.
5. Conclusion
In this paper, we propose a fully data-driven methodology for estimating large block-structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. Our methodology can also deal with matrices for which the block structure only appears once the columns and rows have been permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network (CRAN) and from GitHub. In the course of this study, we have shown that BlockCov is a very efficient approach both from the statistical and the numerical point of view. Moreover, its very low computational load makes its use possible even for very large covariance matrices having several thousands of rows and columns.
Acknowledgments
We thank the members of the EcoSeed European project (FP7 Environment, Grant/Award Number: 311840 EcoSeed, Coord. I. Kranner). IJPB was supported by the Saclay Plant Sciences LabEx (ANR-10-LABX-0040-SPS). We also thank the people who produced the biological material and performed the proteomic and metabolomic analyses. In particular, we would like to thank the University of Warwick (UWAR, Finch-Savage WE and Awan S) for the production of seeds, the Plant Observatory-Biochemistry platform (IJPB, Versailles; Bailly M, Cueff G) for having prepared the samples for proteomics and metabolomics, the PAPPSO Proteomic Platform (GQE-Moulon; Balliau T, Zivy M) for the mass spectrometry-based proteome analysis and the Plant Observatory-Chemistry/Metabolism platform (IJPB, Versailles; Clement G) for the GC/MS-based metabolome analyses.
References
 [1] B. Bai, O. Novák, K. Ljung, J. Hanson, and L. Bentsink. Combined transcriptome and translatome analyses reveal a role for tryptophan-dependent auxin biosynthesis in the control of DOG1-dependent seed dormancy. New Phytologist, 217(3):1077–1085, 2018.
 [2] B. Bai, A. Peviani, S. Horst, M. Gamm, L. Bentsink, and J. Hanson. Extensive translational regulation during seed germination revealed by polysomal profiling. New Phytologist, 214(1):233–244, 2017.
 [3] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
 [4] D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix Classes and Methods, 2018. R package version 1.2-13.
 [5] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 12 2008.
 [6] J. Bien and R. J. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.
 [7] T. T. Cai and M. Yuan. Adaptive covariance matrix estimation through block thresholding. Ann. Statist., 40(4):2014–2042, 08 2012.
 [8] R. B. Cattell. The scree test for the number of factors. Multivariate behavioral research, 1(2):245–276, 1966.
 [9] E. Devijver and M. Gallopin. Block-diagonal covariance selection for high-dimensional Gaussian graphical models. Journal of the American Statistical Association, 113(521):306–314, 2018.
 [10] E. Dobriban. Permutation methods for factor analysis and PCA. arXiv:1710.00479, 2018.
 [11] T. C. Durand, G. Cueff, B. Godin, B. Valot, G. Clément, T. Gaude, and L. Rajjou. Combined proteomic and metabolomic profiling of the Arabidopsis thaliana vps29 mutant reveals pleiotropic functions of the retromer in seed development. International Journal of Molecular Sciences, 20(2):362, 2019.
 [12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
 [13] J. Fan, L. Yuan, and L. Han. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
 [14] B. Field, G. Cardon, M. Traka, J. Botterman, G. Vancanneyt, and R. Mithen. Glucosinolate and amino acid biosynthesis in Arabidopsis. Plant Physiology, 135(2):828–839, 2004.
 [15] M. Galland, R. Huguet, E. Arc, G. Cueff, D. Job, and L. Rajjou. Dynamic proteomics emphasizes the importance of selective mRNA translation and protein turnover during Arabidopsis seed germination. Molecular & Cellular Proteomics, 13(1):252–268, 2014.
 [16] C. Giraud. Introduction to High-Dimensional Statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 2014.
 [17] D. Harville. Matrix Algebra: Exercises and Solutions. Springer New York, 2001.
 [18] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
 [19] N. J. Higham. Computing the nearest correlation matrix – a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.
 [20] J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185, Jun 1965.
 [21] Z. Hossain, L. Amyot, B. McGarvey, M. Gruber, J. Jung, and A. Hannoufa. The translation elongation factor eef1b1 is involved in cell wall biosynthesis and plant development in Arabidopsis thaliana. PLoS One, 2012. e30425.
 [22] M. J. Hosseini and S.-I. Lee. Learning sparse Gaussian graphical models with overlapping blocks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3808–3816. Curran Associates, Inc., 2016.
 [23] Z. Huang, S. Footitt, and W. E. Finch-Savage. The effect of temperature on reproduction in the summer and winter annual Arabidopsis thaliana ecotypes Bur and Cvi. Annals of Botany, 113(6):921–929, 2014.
 [24] R. A. Johnson and D. W. Wichern, editors. Applied Multivariate Statistical Analysis. PrenticeHall, Inc., Upper Saddle River, NJ, USA, 1988.
 [25] S. L. Kendall, A. Hellwege, P. Marriot, C. Whalley, I. A. Graham, and S. Penfield. Induction of dormancy in Arabidopsis summer annuals requires parallel regulation of DOG1 and hormone metabolism by low temperature and CBF transcription factors. The Plant Cell, 23(7):2568–2580, 2011.
 [26] E. Kerdaffrec and M. Nordborg. The maternal environment interacts with genetic variation in regulating seed dormancy in Swedish Arabidopsis thaliana. PLoS One, 12(12):e0190242, 2017.
 [27] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
 [28] D. R. MacGregor, S. L. Kendall, H. Florance, F. Fedi, K. Moore, K. Paszkiewicz, N. Smirnoff, and S. Penfield. Seed production temperature regulation of primary dormancy occurs through control of seed coat phenylpropanoid metabolism. New Phytologist, 205(2):642–652, 2015.
 [29] M. Perrot-Dockès, C. Lévy-Leduc, L. Sansonnet, and J. Chiquet. Variable selection in multivariate linear models with high-dimensional covariance matrix estimation. Journal of Multivariate Analysis, 166:78–97, 2018.
 [30] M. Perrot-Dockès, C. Lévy-Leduc, and J. Chiquet. MultiVarSel: Variable Selection in a Multivariate Linear Model, 2019. R package version 1.1.3.
 [31] M. Perrot-Dockès, C. Lévy-Leduc, J. Chiquet, and L. Sansonnet. A variable selection approach in the multivariate linear model: an application to LC-MS metabolomics data. Statistical Applications in Genetics and Molecular Biology, 17(5), 2018.
 [32] N. J. Provart, J. Alonso, S. M. Assmann, D. Bergmann, S. M. Brady, J. Brkljacic, …, and J. Dangl. 50 years of Arabidopsis research: highlights and future directions. New Phytologist, 209(3):921–944, 2016.
 [33] L. Rajjou, K. Gallardo, I. Debeaujon, J. Vandekerckhove, C. Job, and D. Job. The effect of α-amanitin on the Arabidopsis seed proteome highlights the distinct roles of stored and neosynthesized mRNAs during germination. Plant Physiology, 134(4):1598–1613, 2004.
 [34] A. J. Rothman. Positive definite estimators of large covariance matrices. Biometrika, 99(3):733–740, 2012.
 [35] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal. Statist. Soc B., 58(1):267–288, 1996.
 [36] J. T. Vogel, D. Cook, S. G. Fowler, and M. F. Thomashow. The CBF cold response pathways of Arabidopsis and tomato. Cold Hardiness in Plants: Molecular Genetics, Cell Biology and Physiology, pages 11–29, 2006.
 [37] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, Dec 2007.
 [38] F. Wen, Y. Yang, P. Liu, and R. C. Qiu. Positive definite estimation of large covariance matrix using generalized nonconvex penalties. IEEE Access, 4:4168–4182, 2016.
 [39] U. Wittstock and B. A. Halkier. Glucosinolate research in the Arabidopsis era. Trends in Plant Science, 7(6):263–270, 2002.
6. Appendix
6.1. Groups of metabolites
Group 1  Group 2  Group 3  Group 4 

Alanine  Arginine  Glutamate  beta.Sitosterol 
Asparagine  Cystein  alpha.Tocopherol  Campesterol 
Aspartate  Gaba  gamma.Tocopherol  Eicosanoate 
Glycine  Glutamine  Linolenic.acid  Heptadecanoate 
Isoleucine  Tryptophan  H2SO4  Stearic.acid 
Leucine  Linoleic.acid  X2.Oxoglutarate  Tetracosanoate 
Lysine  Quercetin  Mannitol  BenzoylX.Glucosinolate.3 
Phenylalanine  BenzoylGlucosinolate.3Breakdown  Urea  Sulfite 
Proline  Nonanenitrile.9.methylthio  Fructose.6.P  U2609.4.361 
Serine  UGlucosinolatebreakdown140.5  Digalactosylglycerol  U3122.4.202.I3M. 
Threonine  X2.Hydroxyglutarate  Galactinol  dihydroxybenzoate 
Tyrosine  Citrate  Galactosylglycerol  beta.indole.3.acetonitrile 
Valine  Erythronate  Rhamnose  U1837.6.368 
X5..methylthio.pentanenitrile  Galactonate  Stachyose  U1841.9.394 
Octanenitrile.8.methylthio  Gluconate  Sucrose  U2003.8.293 
UGlucosinolatebreakdown140.4  Glycerate  U1093.6.147  U2371.1.361 
Succinate  Malate  U1124.3.140  U2375.6.191 
Threonate  Allantoin  U1530.2.314  U2513.2.296 
Arabitol  Erythritol  U2053.6.321.1  U2513.2.296.1 
myo.Inositol  Ethanolamine  U2109.3.305  U2692.9.361 
Glycerol.3.P  Sorbitol  U2197.2.494  U2798.377 
myo.Inositol.1.P  Threitol  U2315.2.245  U2942.2.556 
Phosphate  Xylitol  U3898.1.204  U3063.0.361 
U2206.2.299  Ethylphosphate  U3415.9.498  
Fructose  Glucose  
Glucopyranose..H2O.  Mannose  
U1154.3.156  Raffinose  
U1393.172  Ribose  
U1541.8.263  U1127.5.140  
U1647.2.403  U1172.9.281  
U1705.2.319.pentitol.  U1559.4.217  
U1729.0.273  U1628.9.233  
U1816.2.228  U1849.2.285  
U1859.2.246  U1927.0.204  
U2076.9.204  U1939.1.210  
U2170.6.361  U1983.0.217  
U2184.1.299  U1983.0.217.1  
U2251.5.361  U2012.7.361  
U2278.6.361  U2282.4.349  
U2550.7.149  U2400.1.179  
U2731.2.160  U2779.9.361  
U2857.8.342  
U2929.1.297  
U3041.1.361  
U3080.7.361  
U3100.8.361 
Group 5  Group 6  Group 7  Group 8 

Quercitrin  X4MTB  BenzoylGlucosinolate.2Breakdown  Maleate 
Dehydroascorbate  X5MTP  Hexanenitrile.6methylthio  Pentonate.4 
Fumarate  X6MTH  Sinapinate.trans  U1408.4.298 
Sinapinate.cis  X7MTH  Anhydroglucose  U1617.8.146 
Arabinose  X8MTO  U1125.1.140  U1767.3.243 
Galactose  UGlucosinolate140.1  U1290.198  U1904.9.204 
U1127.4.169  U1129.9.184  U1371.5.151  U2828.8.361 
U1718.0.157  U1270.1.240  U1549.7.130  U2839.3.312 
U1931.5.202  U1897.2.327  U1568.5.313  U2882.5.297 
U2261.0.218  U2473.361  U1592.8.217  U3008.3.457 
U2412.1.157  U2529.8.361  U1700.6.288  U3168.2.290 
U2588.9.535  U2756.4.271  U1759.4.331  U3218.5.297 
U2688.5.333  U2924.3.361  U1852.0.217  U3910.6.597.Trigalactosylglycerol. 
U3213.1.400  U3279.7.361  U1872.1.204.methyl.hexopyranoside.  U2443.7.217 
U1380.5.184  U1958.217  
U2053.6.321  
U2087.6.321  
U2150.9.279  
U2271.6.249  
U3188.1.361  
U3701.368  
U4132.5.575 
6.2. Groups of proteins
Proteins  Group 

AT1G14170.1  8 
AT1G20260.1  8 
AT1G42970.1  8 
AT1G47980.1  8 
AT1G55210.1  8 
AT1G75280.1  8 
AT2G19900.1  8 
AT2G22240.1  8 
AT2G28900.1  8 
AT2G32920.1  8 
AT2G37970.1  8 
AT3G12580.1  8 
AT3G13930.1  8 
AT3G26650.1  8 
AT3G26720.1  8 
AT3G44300.1  8 
AT3G47930.1  8 
AT3G54400.1  8 
AT3G55800.1  8 
AT4G16760.1  8 
AT4G20830.1  8 
AT4G25740.1  8 
AT4G34870.1  8 
AT4G35790.1  8 
AT5G11880.1  8 
AT5G12040.1  8 
AT5G14030.1  8 
AT5G17380.1  8 
AT5G22810.1  8 
AT5G26000.1  8 
AT5G50370.1  8 
AT5G66190.1  8 
AT5G67360.1  8 
ATCG00480.1  8 